Introduction

With the development of the imaging-based “AT(N)” framework [1] for neurodegenerative research, simultaneous amyloid positron emission tomography/magnetic resonance imaging (PET/MRI) provides a potential “one-stop shop” imaging exam for dementia research, diagnosis, and clinical trials [2,3,4]. PET allows the acquisition of the amyloid (A) and tau (T) biomarkers, hallmarks of Alzheimer’s disease neuropathology [3, 5,6,7,8], while MRI, with its exquisite soft tissue contrast, allows for imaging cortical atrophy, representative of neurodegeneration (N) [9].

However, multiple factors affect the utility of PET: scan time, tracer cost, and the radiation dose delivered to the subject limit, respectively, the logistics, economics, and safety of PET imaging. Since PET image quality is highly dependent on the number of detected events (counts), a reduced dose or a reduced scan time typically results in lower signal-to-noise images. Previous work tackling this issue has relied on direct interpretation of the low-count images [10,11,12] or on machine learning-based methods [13, 14], but used few subjects collected at a single site. To increase the utility of this hybrid modality in ultra-low-dose imaging, we previously trained deep learning (DL) networks using a U-net structure with residual learning [15, 16] to generate diagnostic amyloid PET images from PET/MRI scans with a simulated ultra-low injected radiotracer dose [17].

To move from single-site to multi-center studies, a DL network would traditionally be trained centrally on data collected from multiple sites using a harmonized protocol. However, privacy issues such as sharing patient information and data ownership often limit the ability to collect a large number of medical images from multiple institutions [18,19,20]. Moreover, when a pre-trained network is applied to data acquired at other sites, its performance may decrease [21]. To overcome this data bias, a sequential training approach, sometimes known as “transfer learning” [22], may be considered for network generalization. Under this approach, the network may be applied to data acquired on different scanner models, with different scan protocols, and reconstructed with different methods or parameters. Most previous machine learning work using data from different sites assumes that the same image types exist as inputs to the network [18,19,20, 23]. It is more realistic, however, that different sites with different scanners also employ different scan protocols that may not include all the inputs required to directly apply an algorithm trained elsewhere. Local populations with different disease prevalence might also affect the results, and there is evidence that, for optimal performance, networks should be trained for the target study population [24].

In this project, we investigated various approaches for applying a pre-trained convolutional neural network (CNN), originally designed to denoise ultra-low-dose amyloid PET/MRI, to new cases from a separate institution, acquired on a different PET/MRI scanner with different reconstruction parameters and MR sequences. Moreover, we focused on whether these approaches can improve ultra-low-count PET data obtained from a severely reduced imaging duration (1 min, far fewer counts than in previous work [10, 14]). A better understanding of how best to apply a pre-trained network to a new population should enable optimal performance when generalizing DL-based image synthesis tasks.

Methods

This study was approved by the local institutional review boards. Written informed consent for imaging was obtained from all participants or an authorized surrogate decision-maker.

PET/MRI data acquisition: Site 1

Forty datasets from 39 participants (23 female, 67 ± 8 years; one female participant was scanned twice, 9 months apart) with simultaneously acquired MRI and PET data were obtained on scanner 1, an integrated PET/MRI scanner with time-of-flight capabilities (SIGNA PET/MR, GE Healthcare). T1-weighted, T2-weighted, and T2 FLAIR morphological MR images were acquired with the parameters listed in Chen et al. [17].

330 ± 30 MBq of the amyloid radiotracer [18F]florbetaben (Life Molecular Imaging, Berlin, Germany) was injected intravenously, with PET acquired 90–110 min post-injection. The list-mode PET data were reconstructed to produce the ground-truth image (from the full 20-min, full-dose acquisition) as well as a low-dose PET image from a random subset containing 1/100th of the events (also taking the different randoms rate into account) [25]. All PET images were reconstructed with time-of-flight ordered-subsets expectation-maximization (OSEM; two iterations, 28 subsets), accounting for randoms, scatter, dead-time, and attenuation, followed by a 4 mm full-width-at-half-maximum post-reconstruction Gaussian filter. MR attenuation correction was performed using the vendor’s atlas-based method [26].
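As an illustration of the count-reduction step, the sketch below thins an event stream by keeping each event independently with 1% probability (Bernoulli thinning of a Poisson process, which preserves Poisson counting statistics). This is a minimal sketch only: the actual subsampling and randoms handling were performed on vendor list-mode data before reconstruction [25], and the event representation here is a hypothetical stand-in.

```python
# Illustrative sketch only: the study subsampled vendor list-mode data prior
# to reconstruction; the event array here is a hypothetical stand-in.
import numpy as np

def subsample_listmode(events: np.ndarray, fraction: float = 0.01,
                       seed: int = 0) -> np.ndarray:
    """Keep each detected event independently with probability `fraction`.

    Independent (Bernoulli) thinning of a Poisson process yields another
    Poisson process, so the thinned stream emulates a 1/100th-dose scan.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(events.shape[0]) < fraction
    return events[keep]

# Example: thin 10 million stand-in event records to ~1%.
full = np.arange(10_000_000)
low_dose = subsample_listmode(full)  # ~100,000 events remain
```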

PET/MRI data acquisition: Site 2

Analysis was performed on 40 participants (23 female, 64 ± 11 years) scanned on scanner 2 (mMR, Siemens Healthineers). Only T1-weighted and T2-weighted (no T2 FLAIR-weighted) MR images were acquired (parameters in Table S1). [18F]florbetaben (283 ± 10 MBq) was injected, with PET and MRI acquired 90–110 min after injection. The 20-min list-mode PET data were reconstructed to produce the ground-truth image, and the first minute of the PET acquisition was reconstructed to produce low-count, short-time (5% of the original duration) PET images. Following the standard protocol at site 2, all PET images were reconstructed with OSEM (8 iterations, 21 subsets), accounting for randoms, scatter, dead-time, and attenuation, followed by a 3 mm full-width-at-half-maximum post-reconstruction Gaussian filter. MR attenuation correction was performed using RESOLUTE [27].

Image preprocessing

The site 2 ground-truth PET images were resliced to match the site 1 PET volumes: 89 slices of 2.78 mm thickness with a 256 × 256 matrix size (1.17 × 1.17 mm² in-plane voxel size). To compensate for any residual motion between the modalities and sequences, all other images from site 2 were co-registered to the resliced ground-truth PET image following the pipeline outlined in Chen et al. [17].
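A minimal sketch of the reslicing and rigid co-registration steps is shown below using SimpleITK. The actual pipeline follows Chen et al. [17]; the file names, interpolator, and optimizer settings here are illustrative assumptions.

```python
# Sketch of reslicing + rigid co-registration with SimpleITK; file names and
# registration settings are assumptions, not the pipeline of Chen et al. [17].
import SimpleITK as sitk

ref = sitk.ReadImage("site1_grid_reference.nii.gz")      # 89 x 256 x 256 grid
gt_pet = sitk.ReadImage("site2_ground_truth_pet.nii.gz")
t1 = sitk.ReadImage("site2_t1.nii.gz")

# Reslice the ground-truth PET onto the site 1-like voxel grid.
gt_resliced = sitk.Resample(gt_pet, ref, sitk.Transform(),
                            sitk.sitkLinear, 0.0, gt_pet.GetPixelID())

# Rigidly register the T1 image to the resliced PET (mutual information).
fixed = sitk.Cast(gt_resliced, sitk.sitkFloat32)
moving = sitk.Cast(t1, sitk.sitkFloat32)
reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=32)
reg.SetOptimizerAsRegularStepGradientDescent(learningRate=1.0, minStep=1e-4,
                                             numberOfIterations=200)
reg.SetInitialTransform(sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY))
reg.SetInterpolator(sitk.sitkLinear)
tx = reg.Execute(fixed, moving)
t1_coreg = sitk.Resample(t1, gt_resliced, tx, sitk.sitkLinear, 0.0,
                         t1.GetPixelID())
```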

CNN training and testing

We trained a CNN with a “U-net” structure [16], using the hyperparameters and training algorithm described in Chen et al. [17], on site 1 data (32 training datasets, chosen randomly). The network inputs are the multi-contrast MR (T1-, T2-, and T2 FLAIR-weighted) images and the ultra-low-dose PET image. The network was trained using residual learning against the ground-truth PET image [15] (Fig. 1). The remaining datasets were used as the test set. In the previous work, 5-fold cross-validation was used, resulting in five trained networks; one was randomly selected for this study. Network training details and network selection can be found in the Supplementary Materials.
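The residual-learning setup can be summarized with the PyTorch sketch below: the U-Net predicts only the difference between the ground-truth and low-count PET, which is added back to the low-count input. The U-Net factory and the L1 loss are placeholder assumptions; the exact architecture, loss, and hyperparameters are those of Chen et al. [17].

```python
# Schematic PyTorch sketch of residual learning [15]: the U-Net predicts the
# residual between ground-truth and low-count PET. The U-Net itself is a
# placeholder for the architecture of Chen et al. [17] (4 input channels:
# low-count PET + T1, T2, and T2 FLAIR; for site 2, T1 replaces T2 FLAIR).
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    def __init__(self, unet: nn.Module):
        super().__init__()
        self.unet = unet                      # 4 channels in, 1 channel out

    def forward(self, low_count, t1, t2, flair):
        x = torch.cat([low_count, t1, t2, flair], dim=1)  # (B, 4, H, W)
        return low_count + self.unet(x)       # input + learned residual

def train_step(model, batch, optimizer, loss_fn=nn.L1Loss()):
    optimizer.zero_grad()
    pred = model(batch["low_count"], batch["t1"], batch["t2"], batch["flair"])
    loss = loss_fn(pred, batch["ground_truth"])
    loss.backward()
    optimizer.step()
    return loss.item()
```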

Fig. 1

Schematic of the U-Net used in this work and its inputs and outputs. The arrows denote computational operations, and the tensors are denoted by boxes with the number of channels indicated above each box. Note that for the site 2 data in which T2-FLAIR was not available, this input was replaced with the T1-weighted image. BN batch normalization, Conv convolution, ReLU rectified linear unit activation, tanh hyperbolic tangent

To apply this trained network to site 2 data, two preliminary studies were conducted: for the missing T2 FLAIR channel, T1-weighted images were used as inputs, and the site 2 1-min images were chosen as the low-count PET inputs. The choices for time reduction and contrast replacement were made based on the peak signal-to-noise ratio (PSNR) between the low-count images and their corresponding ground-truth images (details outlined in the Supplementary Materials [28]). Four approaches were investigated (shown schematically in Fig. 2). In the first (method A), the site 2 data were input directly into the network trained at site 1, with no attempt to account for site differences. In the second (method B), the network weights were initialized with the final weights from site 1 and then tuned for a further 100 epochs with a learning rate of 0.0001; 5-fold cross-validation (i.e., 32 datasets for training and 8 for testing per trained network) was used during this transfer learning. For method C, a new network was trained from random initialization on site 2 data only. Finally, for method D, a network was trained from random initialization on data from both scanners (32 cases from site 1 and 32 cases from site 2, with testing on 8 cases from site 2). The network inputs for methods C and D are the multi-contrast MR images (T1- and T2-weighted only) and the low-count PET image. The hyperparameters and training follow those implemented in Chen et al. [17].
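Continuing the sketch above, method B amounts to loading the site 1 weights and tuning all layers on site 2 training data for 100 epochs at a learning rate of 0.0001. The checkpoint path, `make_unet`, and the data loader below are hypothetical placeholders.

```python
# Sketch of method B (transfer learning), continuing the code above;
# `make_unet()` and `site2_train_loader` are hypothetical placeholders.
import torch

model = ResidualDenoiser(make_unet())                      # same architecture
model.load_state_dict(torch.load("site1_pretrained.pt"))   # site 1 weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr = 0.0001

for epoch in range(100):                  # 100 further epochs, all layers
    for batch in site2_train_loader:      # 32 site 2 training datasets/fold
        train_step(model, batch, optimizer)

# Method A: load the site 1 weights and apply the model with no tuning.
# Methods C and D: skip load_state_dict (random initialization) and train on
# site 2 data only (C) or pooled site 1 + site 2 data (D).
```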

Fig. 2

The various methods used in this work for network generalization. Five-fold cross-validation was used to utilize all datasets (when coming from the same site) for testing and training. The training and testing data for method A were from different sites; therefore, all site 2 data could be applied to one network

Assessment of image quality

Dataset-specific FreeSurfer-based [29, 30] T1-derived brain masks were used for voxel-based analyses. For each axial slice, the image quality of the synthesized PET images and of the original low-count PET images within the brain mask was assessed using the peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [31], and root mean square error (RMSE), where:

$$ \mathrm{PSNR} = 20\log_{10}\!\left(\mathrm{MAX}\!\left(I_{GT}\right)\right) - 10\log_{10}\!\left(\mathrm{MSE}\!\left(I_{GT}, I_x\right)\right) $$

$$ \mathrm{SSIM} = \frac{\left(2\mu_{GT}\mu_x + C_1\right)\left(2\sigma_{GT,x} + C_2\right)}{\left(\mu_{GT}^2 + \mu_x^2 + C_1\right)\left(\sigma_{GT}^2 + \sigma_x^2 + C_2\right)} $$

$$ \mathrm{RMSE} = \frac{\left\Vert I_{GT} - I_x \right\Vert}{\left\Vert I_{GT} \right\Vert} $$

where $I_{GT}$ denotes the ground-truth image (mean $\mu_{GT}$, variance $\sigma_{GT}^2$, maximum pixel value $\mathrm{MAX}(I_{GT})$), $I_x$ denotes the image to be tested, $\sigma_{GT,x}$ denotes the covariance of $I_{GT}$ and $I_x$, $C_1$ and $C_2$ are the squares of 0.01 and 0.03 times the pixel value range of $I_{GT}$, respectively, $\mathrm{MSE}$ denotes the mean squared error, and $\Vert \cdot \Vert$ denotes the Frobenius norm.

The respective metrics for each slice were then averaged, weighted by the number of brain-mask voxels in the slice. A repeated-measures analysis of variance (ANOVA) followed by pair-wise paired t-tests, Bonferroni-corrected for the three comparisons (significance level p = 0.05/3), was conducted to compare the image quality metrics across the different image processing methods. The improvement in each metric from the low-count image to the synthesized image was also calculated for data from both sites; this comparison used the one-tailed two-sample unequal-variance t-test (also at the p = 0.05/3 level).
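A minimal sketch of these metric computations, implementing the equations above directly on brain-masked arrays (assuming NumPy volumes with axial slices along the first axis), including the voxel-count-weighted slice averaging:

```python
# Direct implementation of the slice metrics defined above within a brain
# mask; a sketch assuming NumPy volumes with axial slices along axis 0.
import numpy as np

def slice_metrics(gt, test, mask):
    g, t = gt[mask].astype(float), test[mask].astype(float)
    mse = np.mean((g - t) ** 2)
    psnr = 20 * np.log10(g.max()) - 10 * np.log10(mse)
    rng = g.max() - g.min()                  # pixel value range of I_GT
    c1, c2 = (0.01 * rng) ** 2, (0.03 * rng) ** 2
    cov = np.mean((g - g.mean()) * (t - t.mean()))
    ssim = ((2 * g.mean() * t.mean() + c1) * (2 * cov + c2)) / (
        (g.mean() ** 2 + t.mean() ** 2 + c1) * (g.var() + t.var() + c2))
    rmse = np.linalg.norm(g - t) / np.linalg.norm(g)
    return psnr, ssim, rmse

def volume_metrics(gt_vol, test_vol, mask_vol):
    """Average slice metrics weighted by the brain-mask voxels per slice."""
    vals, wts = [], []
    for z in range(gt_vol.shape[0]):
        m = mask_vol[z].astype(bool)
        if m.any():
            vals.append(slice_metrics(gt_vol[z], test_vol[z], m))
            wts.append(m.sum())
    return np.average(np.array(vals), axis=0, weights=np.array(wts))
```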

Region-based analyses

Region-based analyses were carried out to assess the agreement of tracer uptake between images and to differentiate between amyloid-positive and amyloid-negative images. FreeSurfer-based cortical parcellations and cerebral segmentations based on the Desikan-Killiany atlas [32] were created, yielding a maximum of 111 regions per dataset. Mean standardized uptake value ratios (SUVRs, normalized to the cerebellar cortex) in 4054 total regions from 37 successful segmentations were calculated, compared between methods, and evaluated with Bland-Altman plots. Next, a composite ROI was derived from the frontal, parietal, lateral temporal, occipital, and anterior and posterior cingulate cortices, and the mean composite SUVR was calculated with Hermes BRASS software for all datasets, again with the cerebellar cortex as reference. Using the clinical readers’ majority ground-truth reads, receiver operating characteristic (ROC) analysis was carried out using different SUVRs as cutoff values for amyloid positivity, and the area under the ROC curve (AUC) was calculated for each image type; the AUCs were tested for significance according to DeLong et al. [33], and a non-inferiority threshold of 5% was set to compare the DL-based AUCs with the ground-truth AUC. Cohen’s d [34] was also calculated for the composite SUVRs between the amyloid-positive and amyloid-negative groups of each image type.
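The composite-SUVR and effect-size computations can be sketched as follows. The toy arrays and variable names are hypothetical, and scikit-learn’s `roc_auc_score` is equivalent to sweeping SUVR cutoffs and integrating the resulting ROC curve.

```python
# Sketch of the region-based evaluation: composite SUVR (cerebellar cortex
# reference), ROC/AUC against the majority clinical read, and Cohen's d [34].
import numpy as np
from sklearn.metrics import roc_auc_score

def composite_suvr(pet, composite_mask, cerebellum_mask):
    """Mean uptake in the composite ROI normalized to cerebellar cortex."""
    return pet[composite_mask].mean() / pet[cerebellum_mask].mean()

def cohens_d(pos, neg):
    """Pooled-standard-deviation effect size between two SUVR groups."""
    n1, n2 = len(pos), len(neg)
    sp = np.sqrt(((n1 - 1) * np.var(pos, ddof=1) +
                  (n2 - 1) * np.var(neg, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(pos) - np.mean(neg)) / sp

suvrs = np.array([1.62, 1.05, 1.78, 0.98, 1.43])  # toy composite SUVRs
labels = np.array([1, 0, 1, 0, 1])                # toy majority reads
auc = roc_auc_score(labels, suvrs)                # area under the ROC curve
d = cohens_d(suvrs[labels == 1], suvrs[labels == 0])
```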

Clinical readings

All PET images of each dataset were anonymized and their series numbers randomized before being presented to four readers (H.B., O.S., G.Z.: board-certified physicians with 10+ years’ experience reading amyloid images; M.E.I.K.: resident with 4 years’ experience) for independent reading. The amyloid uptake status (positive, negative, or uninterpretable) of each image was determined; the ground-truth amyloid status was based on the majority read of the ground-truth images. A fifth reader (G.D.: board-certified physician with 10 years’ experience) served as a tiebreaker for a single case with a 2–2 positive-negative reading. Reader agreement was assessed using Krippendorff’s alpha. The accuracy, sensitivity, and specificity were calculated for the readings of the short-time and synthesized images. Symmetry tests were also carried out to examine whether the readings produced an equal number of false positives and false negatives.
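A sketch of the agreement and diagnostic-performance computations, using the `krippendorff` Python package (an assumption; any implementation of Krippendorff’s alpha for nominal data would do) on a toy readers-by-cases matrix, with NaN marking uninterpretable reads:

```python
# Sketch of reader agreement (Krippendorff's alpha, nominal level) and of
# accuracy/sensitivity/specificity; the reads matrix is toy data
# (1 = positive, 0 = negative, NaN = uninterpretable).
import numpy as np
import krippendorff

reads = np.array([[1, 0, 1, np.nan],   # reader 1
                  [1, 0, 1, 1],        # reader 2
                  [1, 0, 0, 1],        # reader 3
                  [1, 0, 1, 1]],       # reader 4
                 dtype=float)
alpha = krippendorff.alpha(reliability_data=reads,
                           level_of_measurement="nominal")

def acc_sens_spec(pred, truth):
    """Pooled accuracy, sensitivity, and specificity for binary reads."""
    tp = np.sum((pred == 1) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    return ((tp + tn) / len(pred), tp / (tp + fn), tn / (tn + fp))
```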

For each PET image, the four physicians also assigned an image quality score on a five-point scale: 1 = uninterpretable, 2 = poor, 3 = adequate, 4 = good, 5 = excellent. These scores were also dichotomized into 1–3 vs. 4–5 to analyze the percentage of images with high scores.

Results

Assessment of image quality

Visually, all synthesized images showed marked noise reduction (Fig. 3). For the site 2 data, the ANOVA showed that the four methods yielded results different from the low-count images and from each other (Table 1), indicating that image quality improved relative to the low-count images but improved the least with method A (i.e., simply applying the site 1 model to site 2 data). Pair-wise t-tests showed that image quality improved the most with method B (Fig. 4; p < 0.05/3 for all metric comparisons). Comparing the metric improvement (from the low-dose/short-time images to the images output by the different trained/tuned networks) across sites, all methods showed more improvement in SSIM (p < 0.05/3) than the site 1 data. For RMSE, methods B, C, and D showed more improvement (p < 0.05/3) than the site 1 results, while for PSNR, methods C and D showed similar improvement (p > 0.05/3) and method B more improvement (p < 0.05/3) than the site 1 results. Method A showed less improvement (p < 0.05/3) than the site 1 results in both PSNR and RMSE.

Fig. 3

Representative amyloid positive (top)/negative (bottom) images, with T1-weighted MRI and the corresponding PET images. Difference images between the ground-truth and the other images are also shown. All synthesized images show marked noise reduction. However, method A images are blurrier than the other synthesized images. Network training methods: A, direct application of pre-trained network; B, transfer learning starting with pre-trained network; C, training new network from scratch; D, training new network with combined datasets

Table 1 Analysis of variance (ANOVA) results comparing the images generated by the four deep learning (DL)-based methods and the low-count images, and among the images generated by the four DL-based methods. The critical F value corresponds to a 5% significance level. df degrees of freedom; PSNR peak signal-to-noise ratio; RMSE root mean square error; SSIM structural similarity
Fig. 4

The peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and root mean square error (RMSE) of the synthesized and low-count images compared to the ground-truth image. Site 1 data (used to train the original network in Chen et al. [17]) are also shown for comparison. Network training methods: A, direct application of pre-trained network; B, transfer learning starting with pre-trained network; C, training new network from scratch; D, training new network with combined datasets

Region-based analyses

SUVRs derived from method B showed the least variability from the ground-truth SUVRs (Fig. 5). Of all image types, images generated by method B also yielded the highest AUC (Fig. 6) and the largest Cohen’s d for distinguishing positive from negative amyloid status (Table 2). Comparing the four DL-based AUCs with the low-count AUC and the ground-truth AUC yielded p values of 0.46 and 0.70, respectively, and the 95% confidence intervals of the DL-based AUCs fell within the non-inferiority threshold of the ground-truth AUC.

Fig. 5

Bland-Altman 2-D histograms of regional standardized uptake value ratios (SUVRs) compared between methods (ground-truth to low-count and methods A, B, C, and D) across all datasets with FreeSurfer segmentations (n = 37). The scale bar denotes the number of data points in each pixel; the solid and dashed lines denote the mean and 95% confidence interval of the SUVR differences respectively. GT ground-truth. Network training methods: A, direct application of pre-trained network; B, transfer learning starting with pre-trained network; C, training new network from scratch; D, training new network with combined datasets

Fig. 6

The receiver operating characteristic (ROC) curves of the standardized uptake value ratios (SUVRs) from the various image types used to differentiate between amyloid positive and negative readings. Network training methods: A, direct application of pre-trained network; B, transfer learning starting with pre-trained network; C, training new network from scratch; D, training new network with combined datasets

Table 2 Region-based analyses: the mean, standard deviation (SD), and the 95% confidence interval (CI) of the regional standardized uptake value ratio (SUVR) differences between various image types and the ground-truth images; Cohen’s d effect sizes; area under the curves (AUC); 95% CI of the AUC difference between the DL-based methods and the ground truth

Clinical readings

Inter-reader agreement on amyloid uptake status was high (Krippendorff’s alpha > 0.7) for all methods except method A (Krippendorff’s alpha = 0.5), and the readings from all four readers were pooled. Seventy-six of the 160 (47.5%) total reads of the ground-truth images were amyloid positive.

When comparing the accuracy, sensitivity, and specificity of the readings of the synthesized images against the ground-truth images, methods B, C, and D produced higher values than method A (Table 3). The accuracy of the readings from images synthesized using methods B, C, and D was high, though method B produced more false positives than false negatives (p = 0.031, symmetry test). For the short-time images that were interpretable (only 56% of reads), the accuracy, sensitivity, and specificity of the clinical assessments were also high (confusion matrices in Table S2).

Table 3 Accuracy, sensitivity, and specificity of the amyloid status readings (since a significant fraction of the low-dose images were uninterpretable [71/160 reads], they are not included in the analysis)

The mean image quality scores assigned by each reader to all PET volumes are shown in Table 4 and Table S3. The results showed considerable inter-reader variability and limited agreement, so no statistics were computed on these scores. However, for all readers, methods B, C, and D had similar (no more than 10% lower) or even higher proportions of high-scoring (i.e., 4 or 5) readings compared with the ground-truth images. In contrast, images from method A were rated worse than both the other deep-learning-based methods and the ground-truth images.

Table 4 Mean and standard deviation (SD) image quality scores (1 = uninterpretable; 2 = poor; 3 = adequate; 4 = good; 5 = excellent) and the proportion of high-quality images (scores 4–5) from the four readers

Discussion

When conducting retrospective multi-center imaging studies or applying models trained at one site to another, differences in scanner hardware, acquisition protocol, and reconstruction parameters pose challenges to the generalization of trained models. In this work, we applied a pre-trained network to ultra-short-duration amyloid PET/MRI data from another institution, overcoming the differences in acquisition protocol. Through further training iterations, the pre-trained network adjusted for the data bias stemming from the differences in acquisition and reconstruction between institutions. Furthermore, we showed that the network can still be used when input data are missing: providing another structurally similar MRI contrast (the approach used in this work) as input for the missing channel preserved the functionality of the network [28]. Together with the previous study [17], we have shown that DL-assisted extreme time-shortening and dose-reduction methods for amyloid PET/MRI can potentially increase the utility of PET imaging.

Certainly, the hyperparameter space for network tuning is vast, and there are many ways to share data or networks across institutions. However, we believe the methods investigated in this study represent four main DL-based approaches in multi-site studies. The first (method A) naïvely applies a network trained on data from one site to data from another; its weak performance shows that network tuning is needed to account for the data bias of each site. The second (method B) requires passing the network between sites, using the site 1 pre-training as the initialization of a model that is further trained on site 2 data (i.e., transfer learning). Method B represents the most extreme case of optimizing the test set results: the network is first trained on site 1 data, then tuned with site 2 training data (tuning all layers of the network), and applied directly to the site 2 test set. The final two methods are extreme examples of how institutions approach data sharing: in method C, each site keeps its own data and trains its own network for its own use, while method D is the traditional “data-sharing” approach for multi-center studies, in which all data are collected and stored in a central repository for training. To simulate harmonized acquisition protocols across sites for method D, we did not include the T2 FLAIR channel during training.

Based on the evaluation metrics, the optimal network training/tuning method is application-specific. In this study, training an institution’s own network (method C) produced good quantitative and qualitative results, as expected, since the network is trained specifically on that institution’s image quality. Among the network generalization and data sharing methods, however, tuning the pre-trained network (method B), an efficient way to allow each institution to keep its own data, provided better quantification results overall, possibly because of the image quality differences arising from the scanner and image processing protocols of the two sites. The U-Net architecture, which emphasizes low spatial frequencies in the input and produces a blurrier output [35], also contributes to slightly blurrier images with methods C and D. On the other hand, for applications involving expert readers, methods B, C, and D performed similarly in the clinicians’ image quality preference and amyloid status readings, so any of these methods would be sufficient as long as the data bias of each site is accounted for. This is not surprising, since previous studies have shown that clinicians can confidently read PET images with lower counts than routinely acquired, though not as extremely reduced as in the current study [11, 12].

There are several limitations to this study. First, the network training and tuning methods evaluated are not exhaustive. Second, with the approach of method B, the problem of “forgetting” [19, 36, 37] how to produce an optimal image for site 1 data is unavoidable, since the pre-trained network is tuned for site 2 data instead. However, our focus in this project was on sharing the pre-trained network with other sites; in actual practice, the newer networks would not be applied back to site 1 data. Finally, there are potential sources of bias, such as the site 1 training dataset (it contains two datasets from one participant, though the two scans are independent, with the head positioned differently, and took place 9 months apart); the readers’ experience or institution may also introduce bias. For example, we enlisted multiple readers so that a consensus reading by majority vote could be used as the ground truth, but in terms of image quality, readers 3 and 4 (from the same institution) showed a clear preference for the synthesized images, while readers 1 and 2 (from the same institution) preferred the site 2 ground-truth images. This preference may have many causes, such as the PET image quality at the readers’ own sites or their experience reading lower-count images [11]. The variability in the image quality scores also prevented pooling them for further statistical analyses. This reader bias demonstrates the need for multiple readers from different sites when conducting reader studies.

Conclusion

To perform deep-learning-based ultra-short-time amyloid PET/MRI using transfer learning, further tuning of pre-trained networks, or training new networks that include data acquired on the new scanner, is required to overcome data bias. Sharing network parameters between sites, rather than the images themselves, is a potential way to collaborate across multiple amyloid PET/MRI sites.