Introduction

In oncologic positron emission tomography (PET), the standardized uptake value (SUV) is commonly used for various interpretive tasks, including differentiation of benign from malignant lesions and assessment of treatment response. That said, the SUV, which reflects tracer concentrations at a single time point, has inherent limitations [1]. As many tumors accumulate tracer progressively, longer uptake times often result in higher SUVs, complicating longitudinal assessments of tumor activity [2]. Furthermore, the SUV reflects both bound tracer within cells and free tracer in the blood pool, which can obscure tracer-avid lesions in highly vascular organs (e.g., liver) [3].

The Patlak methodology, which offers a practical graphical approach to kinetic modeling, addresses these limitations by separating the signal derived from bound versus free tracer [4]. The Patlak slope (PS), a surrogate of Ki derived from compartmental kinetic modeling, reflects the rate of irreversible tracer trapping within a tissue. The PS has the potential to reduce the dependence of tracer-avidity assessments on uptake times and to improve the conspicuity of mildly tracer-avid lesions, particularly within organs with relatively high extracellular tracer concentrations [5]. However, PS values, which are derived from dynamic PET data and require an estimation of the blood input function (IF), may be more prone to motion/noise artifacts and computational errors than SUVs [6].

Knowledge of the intrinsic variability of quantitative imaging biomarkers is essential to the accurate interpretation of changes in these metrics during therapy. Prior 2-deoxy-2-[18F]fluoro-D-glucose ([18F]FDG) PET/CT studies have shown that the maximum SUV exhibits relatively high repeatability, with within-subject coefficients of variation (wCVs) of 10–12% for solid tumors [7]. In contrast, the repeatability of PS values on oncologic PET/CT examinations has not been thoroughly studied. This information is critical for the clinical implementation of PS as a marker of tumor response, especially given the greater complexity inherent to the kinetic modeling required by the Patlak method. Furthermore, as PET interpretation guidelines often require quantitative assessments of internal reference structures, such as the liver, the repeatability of PS values relative to SUVs among normal organs is also of interest [8].

Thus, the aim of our study was to determine the test–retest repeatability of PS-based versus SUV-based quantitative imaging metrics for tracer-avid lesions and normal organs on oncologic [18F]FDG-PET/CT.

Materials and Methods

Study Design

This prospective, IRB-approved, HIPAA-compliant study (NCT04283552) enrolled subjects scheduled to undergo standard-of-care (SOC) oncologic [18F]FDG-PET/CT for various indications. All imaging occurred at a single tertiary care center from June 2020 through April 2023. Inclusion criteria were as follows: ≥ 18 years of age; ability to provide written informed consent; and self-reported ability to tolerate ~ 90 min of near-motionless supine positioning. A subset of these patients undergoing [18F]FDG-PET/CT were eligible to enroll in a test–retest substudy. Study imaging was performed prior to and following the SOC PET/CT, utilizing the same [18F]FDG dose.

Imaging Protocol

The study imaging protocol is summarized in Suppl. Figure 1 (see ESM). Each imaging session was repeated (i.e., retest session) utilizing identical or near-identical preparation and acquisition parameters within seven days of initial imaging (i.e., test session). Subjects were instructed to fast for ≥ 4 h and had blood glucose (BG) levels ≤ 200 mg/dL at [18F]FDG injection. The administered [18F]FDG dose was weight-based, as follows: < 54 kg, 370 MBq; 54–113 kg, 555 MBq; > 113 kg, 740 MBq. Imaging occurred on a Biograph Vision 600 PET/CT scanner (Siemens Healthineers; Knoxville, USA) with FDA-approved, commercially available software for reconstruction of multiparametric PET images (FlowMotion Multiparametric PET Suite; Siemens Healthineers; Knoxville, USA). Subjects were placed on the scanner table in the supine position, and care was taken to maximize patient comfort due to the prolonged acquisition period. The [18F]FDG was administered intravenously as a bolus upon initiating a 6-min dynamic PET acquisition centered on the mediastinal blood pool. Subsequent ‘whole-body’ (WB) PET passes extended from the skull base to proximal thighs for most subjects, with modification of the craniocaudal extent to include other structures as indicated by the SOC protocol. Each WB pass was performed in list mode, utilizing continuous bed motion. Ten WB passes of two durations (five × 2 min, five × 5 min) were performed before the SOC PET. Subjects were instructed to leave the scanner table to empty their bladders immediately before SOC imaging. Three additional WB passes (5 min each) were performed after the SOC PET. Low-dose CT scans were acquired immediately before each PET portion utilizing CARE Dose4D (reference: 111 mAs), CARE kV (reference: 120 kV), and ADMIRE (strength: 2).
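The weight-based dosing tiers described above can be written as a simple lookup. The following is a minimal sketch only; the function name is hypothetical, and the boundary handling (54 kg and 113 kg both falling in the middle tier) is taken directly from the ranges as stated.

```python
def fdg_dose_mbq(weight_kg):
    """Return the administered [18F]FDG dose (MBq) for a given body weight.

    Tiers as stated in the protocol: < 54 kg, 370 MBq; 54-113 kg, 555 MBq;
    > 113 kg, 740 MBq. The function name is illustrative only.
    """
    if weight_kg <= 0:
        raise ValueError("weight_kg must be positive")
    if weight_kg < 54:
        return 370
    if weight_kg <= 113:
        return 555
    return 740
```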

Input Function Analysis

Volumes of interest (VOIs) were placed in the descending thoracic aorta on the 6-min dynamic thoracic acquisition and subsequent WB passes via fully automated scanner tools [9]. Time-activity curves were generated to estimate the input function (IF) for Patlak modeling. The area under the time-activity curve (AUTAC) was extracted for comparison with the injected [18F]FDG dose to screen for IF errors. Inter-session percent changes (%∆) in the AUTAC and [18F]FDG dose were calculated. Substantial differences (i.e., [%∆ AUTAC] – [%∆ [18F]FDG dose] ≥ 15% or ≤ -15%) prompted further investigation. Identifiable technical errors were corrected prior to image reconstruction. Furthermore, all WB PET passes were reviewed to confirm the absence of reconstruction errors related to patient motion.
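The AUTAC screening rule can be illustrated with a short sketch. It assumes that each inter-session percent change is computed relative to the test session, which the text does not specify; the function names and the numbers in any call are hypothetical.

```python
def percent_change(test, retest):
    """Inter-session percent change, relative to the test session (assumed)."""
    return 100.0 * (retest - test) / test


def screen_input_function(autac_test, autac_retest, dose_test, dose_retest,
                          threshold_pct=15.0):
    """Compare the change in AUTAC against the change in injected dose.

    Returns (difference, flagged), where difference is
    [%change AUTAC] - [%change dose] and flagged is True when the
    difference is >= +threshold_pct or <= -threshold_pct.
    """
    difference = (percent_change(autac_test, autac_retest)
                  - percent_change(dose_test, dose_retest))
    return difference, abs(difference) >= threshold_pct
```

A flagged result does not by itself identify the cause; in the study it prompted manual review of the aortic VOI placement.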

PET Image Reconstruction

Using data from three 5 min WB passes at early (35–50 min post-injection; pre-SOC) and late (75–90 min; post-SOC) post-injection time points, SUV and PS images were reconstructed according to manufacturer-recommended parameters: SUV (PSF + TOF, 4 iterations, 5 subsets, 440 × 440 matrix, all-pass filter); Patlak slope (Patlak, 8 iterations, 5 subsets, 220 × 220 matrix, 2 mm Gaussian filter). The multiparametric software requires at least three WB passes for Patlak analysis. The three latest pre-SOC WB passes were selected to ensure adequate time for steady state conditions to be established. SUVs (g/ml) were adjusted by actual body weight. In generating PS images, the scanner software automatically scaled the PS by the provided blood glucose level, resulting in units of mg/min/ml. This PS is equivalent to the [18F]FDG metabolic rate (MRFDG). To facilitate comparison with other studies, the PS units were scaled by 100 to mg/min/100 ml.
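For readers unfamiliar with the Patlak method, the slope estimation can be sketched as an ordinary least-squares fit of the tissue-to-plasma ratio against "normalized time" (the running integral of the input function divided by its current value). This is a generic illustration, not the vendor's implementation: the function names, the trapezoidal integration, and the simple slope × blood glucose scaling are assumptions, and the text does not give the exact formula used by the scanner software.

```python
def cumulative_trapezoid(times, values):
    """Running trapezoidal integral of values over times, starting at 0."""
    integral = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        integral.append(integral[-1] + 0.5 * dt * (values[i] + values[i - 1]))
    return integral


def patlak_fit(times, plasma, tissue, t_star):
    """Fit y = slope * x + intercept over frames with time >= t_star.

    x = integral of plasma activity up to t, divided by plasma(t)
    y = tissue(t) / plasma(t)
    times[0] is assumed to be the injection time.
    """
    areas = cumulative_trapezoid(times, plasma)
    xs, ys = [], []
    for t, cp, ct, area in zip(times, plasma, tissue, areas):
        if t >= t_star and cp > 0:
            xs.append(area / cp)
            ys.append(ct / cp)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two frames at or after t_star")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def glucose_scaled_slope(slope_per_min, glucose_mg_per_dl):
    """Scale the slope by blood glucose (assumed simple product).

    Returns (mg/min/ml, mg/min/100 ml); 1 dL = 100 ml.
    """
    per_ml = slope_per_min * glucose_mg_per_dl / 100.0
    return per_ml, per_ml * 100.0
```

With three WB passes, as required by the software described above, the fit uses three points per voxel, so noise in any single pass has a large influence on the slope.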

Quantitative Analysis

[18F]FDG-avid lesions interpreted as sites of probable malignancy on the SOC PET/CT study were analyzed by one author (S.I., 10 years of post-training experience in PET imaging). For cases with numerous lesions, the largest and/or most [18F]FDG-avid were selected (maximum of 5 per subject). Each lesion was manually segmented in MIM version 7.1.5 (MIM Software; Cleveland, USA), using the co-registered CT for guidance, on the PS-early, SUV-early, PS-late, and SUV-late images to generate a VOI for each reconstruction. Maximum (max) and peak values were extracted for each lesion [8]. Additionally, for each PET reconstruction, spherical VOIs of various sizes were placed in the liver, lung, blood pool, bone marrow, skeletal muscle, grey matter, and spleen (Suppl. Table 1; see ESM) to extract mean values. The blood glucose (BG) levels at the time of [18F]FDG injection were also analyzed to assess for potential biases in PS related to large BG differences between test and retest sessions. This procedure was repeated for the same lesions and normal anatomic structures on all four reconstructions derived from the retest session. Care was taken to ensure visually similar lesion segmentations and normal anatomic structure VOI placements for both the test and retest sessions. Notably, as the lesion analysis focused on max and peak values (rather than mean), slight differences in lesion segmentation between test and retest sessions should not affect the extracted values.

Statistical Analysis

All statistical analysis was conducted by one author (T.J.F.; 5 years of post-training experience), with statistician guidance, in Prism 9 (GraphPad Software; San Diego, USA) or Excel 2016 (Microsoft, Inc.; Redmond, WA). Subject and scan characteristics were summarized descriptively. For all lesions and each normal anatomic structure, repeatability was assessed for four reconstructions (SUV-early, PS-early, SUV-late, PS-late) according to Bland–Altman methods [10]. Test–retest percent differences (T-RT %Δ) were calculated (instead of absolute differences) to facilitate comparison of results across metrics of different magnitudes. The T-RT %Δ was defined as the retest value minus the test value, divided by the mean of the test and retest values. The mean T-RT %Δ reflects the mean of T-RT %Δ values across all lesions (lesion analysis) or subjects (normal organ analysis). The standard deviation (SD) of the T-RT %Δ values was calculated for each metric. The wCV was defined as \(\mathrm{SD}/\sqrt{2}\). The 95% confidence interval (95% CI) of the mean T-RT %Δ was calculated based on the t statistic due to the small sample size.
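The repeatability statistics defined above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' Prism or Excel workflow: the function names are hypothetical, the SD is taken as the sample SD (n − 1 denominator), and the critical t value is supplied by the caller from a t table with n − 1 degrees of freedom (for example, approximately 2.120 for 17 values). The ± 1.96 SD limits correspond to the solid lines drawn in the Bland–Altman plots.

```python
import math
import statistics


def percent_difference(test, retest):
    """T-RT percent difference: (retest - test) / mean(test, retest) * 100."""
    return 100.0 * (retest - test) / ((test + retest) / 2.0)


def repeatability_summary(tests, retests, t_crit):
    """Summarize test-retest repeatability for paired measurements.

    t_crit: two-sided 95% critical t value for len(tests) - 1 degrees of
    freedom, looked up by the caller (assumption: sample SD is used).
    """
    diffs = [percent_difference(t, r) for t, r in zip(tests, retests)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    half_width = t_crit * sd / math.sqrt(n)
    return {
        "mean_pct_diff": mean_diff,
        "sd": sd,
        "wcv": sd / math.sqrt(2),
        "ci95": (mean_diff - half_width, mean_diff + half_width),
        "limits_of_agreement": (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd),
    }
```

Because the wCV is computed from the SD of the differences, it describes the spread around the mean T-RT %Δ and is insensitive to a uniform shift between sessions; the mean T-RT %Δ and its 95% CI capture that shift separately.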

Due to an apparent bias toward higher values on the retest study for multiple lesion metrics (i.e., mean T-RT %Δ substantially > 0.0%), the intraclass correlation coefficient (ICC) of each metric was also calculated for lesions and normal organs (ICC: range, 0–1; 1 = perfect repeatability), as an alternative means of assessing repeatability [11]. Note that the ICC is a metric of agreement rather than simple correlation. The apparent bias in the lesion repeatability analysis was also investigated by correlating subjects’ clinical features (e.g., treatment history, disease trends on imaging performed before and after study imaging) with the T-RT %Δ values of their lesions and normal organs (specifically, liver and grey matter). The liver is a commonly used reference organ for clinical PET interpretation, though it violates the assumptions of Patlak analysis due to physiologic dephosphorylation and efflux of [18F]FDG [12]. In contrast, [18F]FDG dephosphorylation has minimal effects on metabolic rate measurements in grey matter during the time points assessed in our study [13], making the grey matter an ideal reference organ for tracking inter-scan PS stability.
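The distinction between agreement and correlation can be made concrete with a small sketch. The ICC form used in the study is not stated in the text; the example below assumes a one-way random-effects, single-measurement ICC for two sessions, and the function name is hypothetical.

```python
def icc_one_way(tests, retests):
    """One-way random-effects ICC for paired test/retest values (assumed form).

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), with k = 2 sessions.
    """
    pairs = list(zip(tests, retests))
    n = len(pairs)
    k = 2
    if n < 2:
        raise ValueError("need at least two paired measurements")
    grand_mean = sum(t + r for t, r in pairs) / (n * k)
    pair_means = [(t + r) / k for t, r in pairs]
    # Between-target mean square
    ms_between = k * sum((m - grand_mean) ** 2 for m in pair_means) / (n - 1)
    # Within-target mean square
    ms_within = sum((t - m) ** 2 + (r - m) ** 2
                    for (t, r), m in zip(pairs, pair_means)) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

For identical test and retest values this form returns 1. For retest values shifted by a constant (e.g., tests 1, 2, 3 and retests 2, 3, 4), the Pearson correlation is still 1 but this ICC falls to 0.6, showing that a systematic offset lowers agreement even when the ranking of values is preserved.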

Results

Study Cohort

Among 81 study subjects, 10 opted to participate in the test–retest substudy; 9/10 were eligible for the normal organ repeatability analysis, and 4/10 (with 17 lesions) were eligible for the lesion repeatability analysis (Fig. 1). The normal organ cohort was 66.7% female (6/9), with a mean age of 61.7 years. Additional patient and scan characteristics are summarized in Table 1.

Fig. 1 Study flowchart. Abbreviations: SOC = standard-of-care. * 1 of 9 subjects did not complete post-SOC dynamic whole-body passes, so this subject was included in the normal organ repeatability analysis for the early time point only.

Table 1 Subject and scan characteristics

AUTAC Analysis

Results of the AUTAC analysis are summarized in Suppl. Table 2 (see ESM). A substantial difference between the %∆ AUTAC and the %∆ [18F]FDG dose existed for Subject 2 only (-44.1%). After correcting an error in the scanner’s automated placement of the aortic VOI (Suppl. Figures 2 and 3; see ESM), the difference between the %∆ AUTAC and the %∆ [18F]FDG dose for Subject 2 fell within the expected range (7.5%). The adjusted Subject 2 IF was utilized for Patlak reconstruction. There were no large differences in blood glucose levels between sessions for any subjects, including the four subjects in the lesion repeatability analysis (subjects 1–4; %∆ BG range: -13.4% to 11.2%).

Lesion Repeatability Analysis

Results of the lesion repeatability analysis are summarized in Table 2; source data for this analysis are provided in Suppl. Tables 3 and 4 (see ESM). The wCV was higher for PS-early-max (16.2%) than PS-late-max (8.5%) and higher for PS-early-peak (15.6%) than PS-late-peak (6.4%), whereas the wCVs were similar for SUV metrics at the early and late time points. The mean T-RT %Δ was > 0.0% for all metrics (i.e., bias toward higher retest values), though generally of greater magnitude for PS metrics (e.g., PS-late-max: 20.4%) than the corresponding SUV metrics (SUV-late-max: 10.0%). The mean T-RT %Δ was significantly > 0.0% for all PS-based metrics, such as PS-late-max (95% CI: 14.3%, 26.4%), as well as for SUV-late-max (95% CI: 2.6%, 17.0%). This bias is also captured in the Bland–Altman plots for representative metrics (Fig. 2), with a preponderance of data points above the 0.0% T-RT %Δ line (though less pronounced for SUV-early-max). Scatterplots of lesion test–retest values for select metrics (Fig. 3) similarly show a bias toward higher retest values. However, the data points generally remain near or slightly above the unity line, resulting in ICCs near 1.0. In general, the ICCs were similar for SUV (0.93–0.98) and PS (0.94–0.98) metrics across the early and late time points.

Table 2 Repeatability of SUV and PS metrics among lesions
Fig. 2 Bland–Altman lesion repeatability analysis for select metrics. Bland–Altman plots are shown for PS-early-max (A), PS-late-max (B), SUV-early-max (C), and SUV-late-max (D). Horizontal orange lines represent the mean T-RT %∆. Horizontal dashed and solid black lines represent ± 1 standard deviation and ± 1.96 standard deviations, respectively, relative to the mean of the T-RT %∆ distribution. Abbreviations: PS = Patlak slope; SUV = standardized uptake value; T-RT = test–retest; wCV = within-subject coefficient of variation.

Fig. 3 Scatterplots from lesion repeatability analysis for select metrics. Lesion retest versus test values are plotted for PS-early-max (A), PS-late-max (B), SUV-early-max (C), and SUV-late-max (D). Individual subjects are color-coded. Dashed grey unity line (x = y) represents perfect repeatability. Dots above the unity line indicate higher values on retest. Abbreviations: ICC = intraclass correlation coefficient.

Normal Organ Repeatability Analysis

Results of the normal organ repeatability analysis are summarized in Table 3, with descriptive statistics for each organ in Suppl. Table 5 (see ESM). In contrast to the lesion analysis, there was no bias toward higher retest values for the normal organs, as the mean T-RT %Δ values were fairly evenly distributed above and below 0.0% (all within ± 10%, except for PS-early-mean in the liver). None of the mean T-RT %Δ values were significantly different from 0.0%, with the exception of SUV-early-mean (95% CI: -11.5%, -2.0%) in the liver. As in the lesion analysis, the normal organ wCVs were mostly lower at the late time point for PS metrics but similar between early and late time points for SUV metrics. For example, the wCV for grey matter was substantially higher for PS-early-mean (13.0%) than for PS-late-mean (6.4%), with a relatively small difference in wCVs between SUV-early-mean (12.4%) and SUV-late-mean (9.6%). In general, the wCVs were lower (i.e., better repeatability) for SUV metrics than for PS metrics at equivalent time points, with the exception of grey matter at the late time point. In terms of ICCs, repeatability was also generally better for SUV metrics than PS metrics, including the liver (Fig. 4) and grey matter (Fig. 5), though with similar values for the grey matter at the late time point. These scatterplots confirm a lack of bias toward higher retest values, with data points fairly evenly distributed above and below the unity lines for most metrics.

Table 3 Repeatability of SUV and PS metrics for normal organs
Fig. 4 Scatterplots from normal liver repeatability analysis for select metrics. Normal liver retest versus test values are plotted for PS-early-mean (A), PS-late-mean (B), SUV-early-mean (C), and SUV-late-mean (D). Individual subjects from the lesion analysis are color-coded. The other subjects correspond to grey dots. Dashed grey unity line (x = y) represents perfect repeatability. Dots above the unity line indicate higher values on retest. Abbreviations: ICC = intraclass correlation coefficient.

Fig. 5 Scatterplots from normal grey matter repeatability analysis for select metrics. Normal grey matter retest versus test values are plotted for PS-early-mean (A), PS-late-mean (B), SUV-early-mean (C), and SUV-late-mean (D). Individual subjects from the lesion analysis are color-coded. The other subjects correspond to grey dots. Dashed grey unity line (x = y) represents perfect repeatability. Dots above the unity line indicate higher values on retest. Abbreviations: ICC = intraclass correlation coefficient.

Analysis of Repeatability Results by Subject

Table 4 summarizes the oncologic history and post-study follow-up for the four subjects included in the lesion analysis, along with T-RT %∆ values for the analyzed lesions and several normal organs. Colored dots in Fig. 3 represent individual lesions for each subject, as numbered in Table 4. These subjects are also color-coded in Figs. 4 and 5. Fairly large positive T-RT %∆ values were seen for multiple lesions from subjects 1, 3, and 4 at the late time point for both SUV and PS (e.g., subject 1, lesion 4: SUV-late-max, 14.6%; PS-late-max, 20.5%; see Fig. 6). Despite this bias toward higher lesion retest values, the grey matter and liver did not show the same bias for these subjects (e.g., subject 1, grey matter: SUV-late-mean, -0.2%; PS-late-mean, 3.0%; see Fig. 6).

Table 4 Detailed clinical features of subjects in lesion analysis
Fig. 6 Stability of normal organs despite higher lesion retest activity. Late test and retest SUV and Patlak slope (PS) images are shown for subject 1, highlighting a substantial test-to-retest increase in the SUV-max and PS-max for lesion 4. Other subject 1 lesions (not shown) also exhibited substantial test-to-retest increases. In contrast, the SUV-mean and PS-mean of the grey matter and liver were very similar for test and retest sessions. Abbreviations: T-RT = test–retest.

Discussion

In this study, we assessed the test–retest repeatability of PS versus SUV among tracer-avid lesions on oncologic [18F]FDG-PET/CT and found that the wCVs of PS-based metrics were generally worse than those of equivalent SUV-based metrics at the early post-injection time point (e.g., SUV-early-max: 8.9%; PS-early-max: 16.2%) but similar at the late post-injection time point (e.g., SUV-late-max: 9.7%; PS-late-max: 8.5%). As lesion retest values were significantly higher than test values for many metrics (likely a combination of true progression and stochastic variation in the setting of a small sample size), we alternatively utilized the ICC as a marker of repeatability and found high degrees of agreement (ICC: 0.93–0.98) between test and retest values for all PS and SUV metrics. Among normal organs, as for tracer-avid lesions, we found wCVs were generally lower (i.e., better repeatability) for SUV-based metrics than equivalent PS-based metrics and also lower for later time points than early time points (particularly for PS). For nearly all metrics across nearly all normal organs evaluated, the test and retest values were statistically similar at the early and late time points.

Robust repeatability metrics are required for PS-based metrics to be incorporated into clinical trials and treatment response algorithms. To our knowledge, no prior studies have assessed the test–retest repeatability of PS-based metrics on whole-body [18F]FDG-PET/CT. Several studies from the pre-PET/CT era compared the repeatabilities of SUV and PS for malignant lesions on [18F]FDG-PET [14, 15]. However, the analytic methodologies utilized in these studies preclude direct comparison to our results. For example, Weber et al. utilized mean lesion values rather than maximum/peak values, which were assessed in our study and are more commonly utilized clinically. Furthermore, in that study, the uptake times for utilized PET data were not identical for SUV and PS reconstructions. Importantly, our study showed that repeatability (as captured by the wCV) can change during the uptake period. That said, the prior study reported wCVs of 6.4% and 7.1% for SUV-mean and PS-mean, respectively. For comparison, we observed wCVs of 7.2% and 6.4% for SUV-late-peak and PS-late-peak, respectively. Our wCVs for maximum SUVs and maximum PS values were slightly higher, as maximum values are intrinsically noisier given their dependence on single voxels. Furthermore, our wCVs of 8.9% and 9.7% for SUV-early-max and SUV-late-max align with expected values, based on prior [18F]FDG-PET/CT studies [7].

Surprisingly, we observed a statistically significant bias toward higher values for lesions in the retest study than in the test study. This effect was apparent for all PS-based metrics (e.g., PS-late-max: mean T-RT %Δ, 20.4% [14.3%, 26.4%]), as well as for SUV-late-max (mean T-RT %Δ, 10.0% [2.6%, 17.0%]), suggesting that this finding was not simply related to an error in acquisition, reconstruction, or analysis of the PS images. This notion is further corroborated by the lack of bias toward higher retest values among normal organs, indicating that the observed bias was likely derived from true biological variation in the lesions assessed. Furthermore, the small %Δ BG values, particularly for the subjects with hypermetabolic lesions, confirm that the higher retest values were not simply a consequence of including BG values in the PS calculation (i.e., MRFDG). On assessment of the clinical features for the four subjects included in the lesion analysis, subject 4 had enlargement of the study lesions between MRI examinations performed before and after study imaging (2-month interval). Thus, the substantial test-to-retest increases in lesion SUV and PS for this subject (particularly lesions 2–5), without corresponding increases in liver and grey matter, likely represented true metabolic progression during the six-day test–retest period. In contrast, subject 1 had anatomic stability of the study lesions between CT examinations performed before and after study imaging (3-month interval). Therefore, the substantial test-to-retest increases in SUV and PS for this subject (particularly lesions 1, 2, 4), again without corresponding increases in liver and grey matter, were more likely due to stochastic temporal variations in the [18F]FDG-avidity of the study lesions. Because the 17 lesions analyzed in our study were derived from only four subjects, the results of our analysis depend heavily on the particular behaviors of their individual lesions. Hence, if a larger percentage of enrolled subjects had had tracer-avid lesions available for analysis, the bias toward higher retest values might have been negated by directionally opposite stochastic variations in lesion activity.

Our study has several limitations, including its single-center design and utilization of a single make/model of PET scanner. Our findings will need to be confirmed on different Patlak-enabled scanners at other institutions. As discussed in detail above, the greatest limitation of our study was the small number of patients with tracer-avid lesions, which likely contributed to the observed bias. A shorter test–retest interval may have mitigated this bias by reducing the potential for true biological changes in tumor activity. Thus, studies with larger numbers of subjects and more diverse pathologies are needed to confirm our lesion repeatability data. Furthermore, as only patients who reported an ability to tolerate prolonged supine positioning were enrolled (with an even smaller subset agreeing to the test–retest substudy), the subjects in our analysis were particularly well-suited to the demands of dynamic PET imaging (i.e., prolonged motionlessness). Consequently, the true repeatability of PS-based metrics may be worse in a general oncology population. However, PS-based metrics may prove more repeatable than SUV-based metrics in actual clinical practice, as scan-to-scan variations in uptake times often occur and would be expected to affect the repeatability of SUV-based metrics more than PS-based metrics. Finally, future studies might assess the repeatability of PS-based metrics when a population-based input function (PBIF) is employed, as the PBIF has proven to be a useful, relatively unbiased surrogate for the image-derived input function (IDIF), such as when the IDIF is unavailable or unreliable [16, 17].

Conclusions

In this study, we found that PS-based metrics of tumor activity exhibited similar repeatability to equivalent SUV-based metrics, provided imaging was performed at late post-injection time points. However, our study was limited by the small number of subjects with analyzable lesions, resulting in a significant bias toward higher retest values in the lesion analysis. As such, the specific values of repeatability metrics derived from our data require validation in larger, more diverse patient cohorts. For example, the true value of the wCV for PS-max among [18F]FDG-avid lesions might, on average, be higher than what we have reported. Consequently, we caution against utilizing our results in isolation to inform PS-based treatment response assessments. Despite this limitation, our study points to several issues that are important to consider for future, more robust PS repeatability studies, as well as for utilization of Patlak images in routine clinical PET interpretation. More specifically, confirmation of accurate VOI placement is essential when image-based input functions are utilized. Time-activity curve analysis can also be helpful to screen for VOI placement errors or to identify issues related to suboptimal bolus injection (e.g., extravasation, venous stenosis), particularly when the injection site is not within the PET field-of-view. Finally, reviewing the whole-body passes dynamically to screen for patient motion is a critical element of quality control, as patient motion can introduce artifacts in images derived from Patlak analysis.