Introduction

Whole-body 18F-fluorodeoxyglucose (FDG) PET imaging has been widely used in the management of various malignant cancers [1,2,3]. Not only lesion detection, staging, and characterization, but also therapy response assessment are key roles for FDG PET in oncology [4]. With the advent of molecular targeted therapy and immunotherapy, metabolic activity of tumors is frequently assessed by quantitative FDG PET imaging. FDG PET has become a quantitative imaging biomarker, moving beyond a qualitative functional imaging tool [5, 6].

For measuring responses to therapy by FDG PET, major methodologies such as the EORTC criteria and PERCIST have been proposed [7, 8]. In these methodologies, tumor response is assessed by visual interpretation as well as percentage change in standardized uptake values (SUVs), and then classified into one of four categories: complete metabolic response (CMR), partial metabolic response (PMR), stable metabolic disease (SMD), and progressive metabolic disease (PMD). Accordingly, maximum and peak SUVs (SUVmax, SUVpeak) and SUVs normalized by lean body mass (SULs) have been used as quantitative markers for primary and secondary endpoints in FDG PET studies and trials in oncology [9,10,11].

However, PET image quality and quantitative accuracy are considerably affected by numerous factors such as injection activity, uptake duration, subject body size, scanner specifications, and image reconstruction parameters [12, 13]. Figure 1 overviews the factors affecting diagnostic accuracy in FDG PET. Small lesion detectability and tumor SUVs vary readily with these factors. This variability may not have a significant impact on results in a single-scanner study. In multicenter studies using multiple scanners, however, inter-scanner variability might seriously degrade the reliability of the study outcomes [14]. Therefore, in multicenter oncology FDG PET studies, imaging protocols and image characteristics should be verified and standardized using an appropriate phantom before starting the study. As stated by Boellaard [12], the required level of standardization depends on the intended use of FDG PET. When PET is used for visual interpretation such as lesion detection and characterization, image quality should be verified and standardized to ensure detectability of small lesions. On the other hand, stricter standards are required for quantitative PET. When using lesion SUVs to measure responses to certain therapies [8], harmonization of SUVs is essential to minimize the inter-scanner variability in SUVs [15]. Groups led by Kinahan have reported that reducing variability in the measurement of true metabolic change can greatly reduce the required sample size and study costs [16, 17]. Therefore, image quality standardization and SUV harmonization are essential to improve the reliability of multicenter oncology PET studies.

Fig. 1 Factors affecting the diagnostic accuracy of FDG PET in oncology

Motivated by this issue, several organizations such as EANM/EARL, RSNA/QIBA, ACR/ACRIN, and SNMMI/CTN have provided their own criteria for optimizing image quality as well as reducing SUV variability [18,19,20,21,22,23,24,25,26]. In Japan, the Japanese Society of Nuclear Medicine (JSNM) provides the standard PET imaging protocol and phantom test procedures with the NEMA NU2 image quality phantom (NEMA body phantom) [27, 28]. The JSNM presents image quality reference levels and an SUV harmonization range for each sphere of the phantom (10–37 mm diameters). However, the reference levels and specified range were determined from phantom data that had been acquired in the early 2010s with the PET scanners available at that time [29]. In the meantime, clinical PET scanner performance has been improved by novel technologies such as point-spread function (PSF) modeling [30, 31], time-of-flight (TOF) measurements [32, 33], and the penalized likelihood reconstruction algorithm [34]. In particular, TOF coincidence timing resolution has been greatly improved by replacing the conventional photomultiplier tube (PMT) with a newer silicon photomultiplier (SiPM) [35,36,37,38]. With such new technologies, recent PET scanners can visualize small spheres with higher SUVs (a smaller partial volume effect). Because their SUVmax recovery curves often exceed the upper limit of the current range, downsmoothing is required to satisfy that range. Although downsmoothing of the images is a simple way to harmonize, it spoils the image contrast and may degrade the visual detectability of small lesions. To adapt to advanced PET scanners with better performance, image quality reference levels and the range for SUVmax should be updated accordingly [12]. Also, a harmonization range for SUVpeak should be established, because this metric has been widely used in many clinical studies [12, 39,40,41,42].

In addition to SUV harmonization (minimizing the inter-scanner variability), image noise levels should be lowered to reduce the intra-scanner variability. Increasing image noise levels (e.g., short scan duration) would provide a positive bias for SUVs [43]. A sufficient scan duration is needed to reduce uncertainties in SUV measurements as much as possible [44]. The relationship between SUV variability and image noise levels should be investigated in detail to establish reasonable criteria for image noise levels. The combination of SUV harmonization and image noise management can lead to significant improvement in the value and reliability of quantitative FDG PET studies (Fig. 2).

Fig. 2 Significance of SUV harmonization and image noise management in multicenter quantitative PET studies

Against this background, we investigated image quality and SUV variability in hot spheres of almost all recent PET/CT scanner models using an image quality phantom. The first aim of this study was to propose new image quality reference levels with a focus on 10 mm sphere detectability. The second aim was to propose a new SUV harmonization range and an image noise criterion for minimizing the inter-scanner and intra-scanner SUV variabilities.

Materials and methods

PET/CT scanners

Table 1 lists the PET/CT scanner models and image reconstruction parameters used in this study. Detailed scanner specifications and correction methods are summarized in Supplemental Table 1 [45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61]. We evaluated the 23 scanner models (16 PMT-based scanners and 7 SiPM-based scanners) used at 19 clinical sites. Phantom data were acquired from November 2018 to May 2020. This study did not include human data or any personal information.

Table 1 PET/CT scanners and image reconstruction settings

Phantom experiments

Phantom measurements were performed according to the JSNM phantom test procedures [27]. The NEMA NU2 image quality phantom (NEMA body phantom) was used for all evaluations. We provided the phantom test procedure manual to all sites, and we visited several sites and supported the phantom test, if necessary. The phantom contains six spheres, having diameters of 10, 13, 17, 22, 28, and 37 mm. All spheres were filled with 18F-FDG solutions, so that the sphere-to-background activity ratio was 4. The activity concentration in the background area was 2.53 ± 0.13 (± 5%) kBq/mL, which was determined by the following equation:

$$A_{x} = \frac{a}{60} \times \exp \left( {\frac{ - 60}{{109.8}} \times \ln \left( 2 \right)} \right) \times S {\text{ [kBq/mL]}} ,$$
(1)

where Ax (kBq/mL) is the activity concentration in the background area, a (MBq) is the assumed injection activity for 60-kg subjects, and S is the assumed specific gravity of the human body, that is, 1.0 g/mL. Since the assumed injection dose was 3.7 MBq/kg in this study, a was 222 MBq (3.7 × 60). The Patient's Weight attribute (0010,1030) of the DICOM header was filled with the phantom background volume, so that the true SUV was 1.00 in the background area.
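As a quick arithmetic check of Eq. 1, the following minimal sketch reproduces the target background concentration of about 2.53 kBq/mL. The function and parameter names are illustrative and not part of the JSNM procedure; the 109.8-min half-life and 60-min uptake time are taken directly from Eq. 1.

```python
import math

F18_HALF_LIFE_MIN = 109.8  # half-life value used in Eq. 1 (minutes)


def background_activity_kbq_per_ml(dose_mbq_per_kg=3.7, weight_kg=60.0,
                                   uptake_min=60.0, specific_gravity=1.0):
    """Eq. 1: decay-corrected background activity concentration (kBq/mL)."""
    injected_mbq = dose_mbq_per_kg * weight_kg  # a in Eq. 1 (222 MBq here)
    decay = math.exp(-uptake_min / F18_HALF_LIFE_MIN * math.log(2))
    # MBq/kg is numerically equal to kBq/g; multiplying by g/mL gives kBq/mL.
    return injected_mbq / weight_kg * decay * specific_gravity


print(round(background_activity_kbq_per_ml(), 2))  # 2.53
```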

Data acquisition and image reconstruction

Emission data were acquired for 1800 s in list mode. PET images were reconstructed with acquisition durations of 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, and 1800 s. For each acquisition duration except 1800 s, three image datasets were reconstructed with data start times of 0, 60, and 120 s. Table 1 shows the image reconstruction parameters, which are the settings for clinical whole-body FDG PET imaging used at each site. For the scanner models with PSF reconstruction, PET images were reconstructed both with and without PSF modeling. A total of 37 patterns of images were obtained. In the data analyses described below, the data were classified into four groups: overall (n = 37), TOF + PSF (n = 17), TOF (n = 15), and PSF (n = 5).

Average SUV in the background area (SUVB,ave)

To confirm the quantitative accuracy of data, we examined the average SUV in the background area (SUVB,ave) on PET images with 1800-s acquisition. Image analysis was performed with the PETquactIE Ver. 3 software (Nihon Medi-Physics Co., Ltd) [62]. On the axial slice of the sphere center, 12 circular regions-of-interest (ROIs) with a 37-mm diameter were placed over the background area [63]. The ROIs were also placed on the slices ± 1 and ± 2 cm away from the central slice (60 ROIs in total). The SUVB,ave was calculated by the following equation:

$${\text{SUV}}_{{\text{B,ave}}} = \frac{{\mathop \sum \nolimits_{k = 1}^{K} {\text{SUV}}_{{{\text{B,37}}\,{\text{mm,k}}}} }}{K} ,$$
(2)

where SUVB,37 mm is the average SUV for the 37-mm ROIs and K is the number of ROIs, that is 60. An acceptable range of the SUVB,ave was defined as 0.95–1.05. When the SUVB,ave did not meet this acceptable range, re-testing was done after cross calibration and, if necessary, scanner maintenance.

Part I: image quality with a focus on 10 mm sphere detectability

Visual detectability score

Detectability of the 10-mm-diameter hot sphere was visually assessed by five nuclear medicine technologists on a 3-point scale (0, not visualized; 1, visualized, but similar hot spots are observed; and 2, identifiable). VOX-BASE/MANAGER (J-MAC SYSTEM, INC., Japan) was used to display PET images in an inverted gray scale with an upper level of 4 and a lower level of 0 (SUV-scaled). The score was averaged across the three image sets (data start times of 0, 60, and 120 s) and then averaged across the five raters. A score of 1.5 was defined as an acceptable level (i.e., the 10 mm hot sphere can be detected by half or more of the raters) [29].
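The two-stage averaging can be sketched as follows. This is only an illustration: the function name and the example scores are hypothetical, and the comparison against 1.5 is assumed to be "greater than or equal to".

```python
import statistics


def visual_detectability_score(scores_by_rater):
    """Average over the three image sets per rater, then over the raters.

    scores_by_rater: one list per rater, each holding the 0/1/2 scores
    given to the three image sets of one acquisition duration.
    """
    per_rater = [statistics.mean(scores) for scores in scores_by_rater]
    return statistics.mean(per_rater)


# Hypothetical scores from five raters for one acquisition duration.
example = [[2.0, 2.0, 2.0], [2.0, 2.0, 1.0], [1.0, 1.0, 1.0],
           [2.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
score = visual_detectability_score(example)
acceptable = score >= 1.5  # acceptance level defined in the text
```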

NECphantom

To examine coincidence count data quality, the noise-equivalent count for phantom (NECphantom) was calculated by the following equations [29, 64, 65]:

$${\text{NEC}}_{{{\text{phantom}}}} = \left( {1 - {\text{SF}}} \right)^{2} \frac{{\left( {T + S} \right)^{2} }}{{\left( {T + S} \right) + \left( {1 + k} \right)fR}} {{\text{ [Mcounts]}}}$$
(3)
$$f = \frac{{S_{a} }}{{\pi r^{2} }},$$
(4)

where SF represents scatter fraction, and T, S, and R are true, scatter and random coincidence counts. T + S was calculated by subtracting estimated random coincidence counts (R) from prompt coincidence counts (T + S + R). k is a random scaling factor, depending on the random correction method used [66]. We simply set k = 1 for a delayed coincidence-based method, and k = 0 for a singles-based method. f is the ratio of object size to the transaxial field-of-view, Sa is the cross-sectional area of the phantom, and r is the radius of the detector ring. The scatter fraction (SF) for each scanner, according to NEMA NU2 standards, is shown in Supplemental Table 1. The SF values were obtained from previous publications or scanner specification sheets or measured at the clinical site.
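Equations 3 and 4 can be evaluated with a few lines of code. The sketch below is a hedged illustration, not the software used in this study: the function name, argument names, and units (counts in Mcounts, lengths in cm) are assumptions, and T + S is obtained as prompts minus estimated randoms, as described above.

```python
import math


def nec_phantom(prompts, randoms, scatter_fraction, k,
                phantom_area_cm2, ring_radius_cm):
    """Eq. 3 with f from Eq. 4; prompts and randoms are in Mcounts.

    k = 1 for delayed coincidence-based random correction,
    k = 0 for singles-based random correction.
    """
    trues_plus_scatter = prompts - randoms                   # T + S
    f = phantom_area_cm2 / (math.pi * ring_radius_cm ** 2)   # Eq. 4
    numerator = (1.0 - scatter_fraction) ** 2 * trues_plus_scatter ** 2
    denominator = trues_plus_scatter + (1 + k) * f * randoms
    return numerator / denominator
```

With identical counts, the singles-based setting (k = 0) always yields an NECphantom at least as high as the delayed-coincidence setting (k = 1), because the random term in the denominator is halved.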

Image quality [10-mm-sphere contrast (QH,10 mm), background variability (N10 mm), and image noise level (CVBG)]

For image quality assessment, we evaluated the contrast for the 10 mm hot sphere, background variability and image noise level in the background area using the PETquactIE Ver.3 software [62]. On the axial slice of the sphere center, we placed a circular ROI on the 10 mm sphere. In addition, we placed twelve 10-mm-diameter circular ROIs on the background area on the slice of the sphere center and on slices ± 1 cm and ± 2 cm away from the central slice (60 ROIs in total). The percent contrast for the 10 mm hot sphere (QH,10 mm) was calculated as follows:

$$Q_{{{\text{H}},10\,{\text{mm}}}} = \frac{{C_{{{\text{H}},10\,{\text{mm}}}} /C_{{{\text{B}},10\,{\text{mm}}}} - 1}}{{a_{{\text{H}}} /a_{{\text{B}}} - 1}} \times 100\text{ (\%)},$$
(5)

where CH,10 mm and CB,10 mm are the average activity in the ROI for the 10 mm sphere and the average activity in all the background 10-mm-diameter ROIs, respectively. aH/aB is the activity concentration ratio between the hot spheres and the background. The percent background variability (N10 mm) for the 10 mm circular ROIs was calculated as follows:

$$N_{{10\,{\text{mm}}}} = \frac{{{\text{SD}}_{{10\,{\text{mm}}}} }}{{C_{{{\text{B}},10\,{\text{mm}}}} }} \times 100\text{ (\%)}$$
(6)
$${\text{SD}}_{{10\;{\text{mm}}}} = \sqrt {\frac{{\mathop \sum \nolimits_{k = 1}^{K} \left( {C_{{{\text{b}},10\;{\text{mm,k}}}} - C_{{{\text{B}},10\;{\text{mm}}}} } \right)^{2} }}{K - 1}} , K = 60,$$
(7)

where SD10 mm is the standard deviation of the mean activities across the 60 background ROIs and Cb,10 mm,k is the mean activity in the k-th background ROI. For image noise assessment, we placed 37-mm-diameter circular ROIs on the background area in the same manner as for the background variability assessment (60 ROIs). The coefficient of variation in the background area (CVBG), which represents the image noise level, was calculated by the following equation:

$${\text{CV}}_{{{\text{BG}}}} = {\text{mean of}} \left( {\frac{{{\text{SD}}_{{37\;{\text{mm}}}} }}{{C_{{{\text{B,37}}\;{\text{mm}}}} }} \times 100} \right)\left[ \% \right],\left[ {n = 60} \right],$$
(8)

where SD37 mm and CB,37 mm are the standard deviation and average of the activity in each 37-mm-diameter ROI, respectively. The QH,10 mm, N10 mm and CVBG were measured and averaged by five nuclear medicine technologists.
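The three metrics in Eqs. 5–8 differ mainly in what the standard deviation is taken over: N10 mm uses the spread of ROI means across the 60 ROIs, whereas CVBG uses the voxel-level spread inside each 37-mm ROI and then averages the 60 ratios. A minimal sketch, assuming ROI values have already been extracted as plain lists (function names are illustrative, and the use of the sample standard deviation inside each ROI for Eq. 8 is an assumption):

```python
import statistics


def percent_contrast(c_hot, c_bg, activity_ratio=4.0):
    """Eq. 5: percent contrast of the 10 mm hot sphere."""
    return (c_hot / c_bg - 1.0) / (activity_ratio - 1.0) * 100.0


def background_variability(bg_roi_means):
    """Eqs. 6-7: SD of the ROI means (K - 1 denominator) over their mean."""
    return statistics.stdev(bg_roi_means) / statistics.mean(bg_roi_means) * 100.0


def cv_bg(voxels_per_roi):
    """Eq. 8: mean over ROIs of the within-ROI SD / mean (in percent).

    voxels_per_roi: one list of voxel values per 37-mm ROI.
    """
    ratios = [statistics.stdev(v) / statistics.mean(v) * 100.0
              for v in voxels_per_roi]
    return statistics.mean(ratios)
```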

Investigation of image quality reference levels allowing the 10 mm sphere to be visible

The relationships between each image quality metric and visual detectability score were examined to explore an appropriate image quality level for 10 mm sphere detection. The NECphantom, QH,10 mm, N10 mm, QH,10 mm/N10 mm, CVBG, and visual detectability score are shown as a function of acquisition duration (30–300 s). As mentioned earlier, a visual detectability score of 1.5 was defined as an acceptable level. Figure 3 shows the workflow to determine a reference level for each image quality metric. For each image quality metric and each dataset, we measured a 10-mm-sphere-detectable value so as to achieve the visual detectability score of 1.5 (Fig. 3, step 1). For all data, the acquisition duration corresponding to the visual detectability score of 1.5 was calculated by linear interpolation between the nearest data. If the visual detectability score was higher than 1.5 at the minimum acquisition duration of 30 s, the data with the acquisition duration of 30 s were used as the 10-mm-sphere-detectable value. Subsequently, the reference level for each image quality metric (NECphantom, N10 mm, QH,10 mm/N10 mm and CVBG) was calculated (Fig. 3, step 2). The reference level was defined as the median for all 10-mm-sphere-detectable values.
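The two steps can be sketched as below. This is one plausible reading of the procedure, under stated assumptions: the first crossing of the 1.5 threshold is used, the metric is interpolated linearly at the interpolated duration, and datasets that never reach the threshold return `None`. All names are illustrative.

```python
import statistics


def detectable_value(durations, scores, metric, threshold=1.5):
    """Step 1: metric value at the duration where the score reaches 1.5."""
    # Threshold already met at the shortest duration: use that data point.
    if scores[0] >= threshold:
        return metric[0]
    for i in range(1, len(durations)):
        if scores[i - 1] < threshold <= scores[i]:
            # Linear interpolation of the duration between the nearest data.
            frac = (threshold - scores[i - 1]) / (scores[i] - scores[i - 1])
            duration = durations[i - 1] + frac * (durations[i] - durations[i - 1])
            # Metric value at that duration, again by linear interpolation.
            slope = (metric[i] - metric[i - 1]) / (durations[i] - durations[i - 1])
            return metric[i - 1] + slope * (duration - durations[i - 1])
    return None  # the score never reached the threshold


def reference_level(detectable_values):
    """Step 2: median of all 10-mm-sphere-detectable values."""
    return statistics.median(detectable_values)
```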

Fig. 3 The two-step workflow to determine a reference level for each image quality metric

Inter-rater variability in each image quality metric

To evaluate the inter-rater variability in QH,10 mm, N10 mm and CVBG, we calculated the respective coefficient of variation across five raters (inter-rater variability) as follows:

$${\text{Inter-rater variability}} = \frac{\sigma }{\mu } \times 100{ }\text{ (\%)},$$
(9)

where σ and μ are the standard deviation and mean of the measurement values, respectively. To remove the effect of statistical noise, the PET images with 300 s acquisition were used for this evaluation.

Part II: SUV variability

SUVs of hot spheres

On PET images with 1800-s acquisition, SUVmax and SUVpeak for the hot spheres were measured using PETquactIE Ver. 3 and RAVAT, respectively (Nihon Medi-Physics Co., Ltd.) [15, 62]. To measure SUVmax for each sphere, a circular ROI was placed with a diameter equal to the inner diameter of the sphere. To measure SUVpeak for each sphere, a volume-of-interest (VOI) was placed, so that the VOI covered the whole uptake. The SUVpeak was defined as the average value within a 1 mL spherical VOI (12-mm-diameter) that was placed so as to maximize the average SUV [18]. Considering this definition, we did not measure the SUVpeak of the 10-mm sphere. When showing recovery coefficient curves, the SUVs were normalized by the true value of 4.

SUV harmonization range

SUVs of the hot spheres in all images with 1800-s acquisition (n = 37) were investigated for spheres of all sizes. To investigate feasible lower and upper limits, the 0–30th and 70th–100th percentiles were calculated in steps of five percentile points. On PET images with PSF reconstruction, the SUVs of 13–22 mm spheres were often overestimated owing to the edge artifact [67, 68]. To quantify this overestimation, the maximum overshoot rate in SUVs (MOR) was calculated by the following equation:

$${\text{MOR}} = \frac{{{\text{SUV}}_{i} - {\text{SUV}}_{{37\;{\text{mm}}}} }}{{{\text{SUV}}_{{37\;{\text{mm}}}} }} \times 100 \left( \% \right),$$
(10)

where SUVi is the SUV of the i-mm diameter sphere that shows the highest SUV among 13–22 mm spheres, and SUV37 mm is the SUV of the 37-mm-diameter sphere. Based on these data, we investigated a feasible SUV harmonization range. The upper limit was determined, so that the MOR was lower than 5%. For the lower limit, we considered that it should be lower than the true SUV of 4 for all spheres.
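The MOR criterion can be illustrated with a short sketch. The data layout (a dictionary of sphere diameter to SUV for each candidate percentile curve) and the function names are assumptions for illustration; the numbers in the usage example are hypothetical, not study results.

```python
def max_overshoot_rate(suv_by_diameter):
    """Eq. 10: overshoot of the highest 13-22 mm SUV relative to 37 mm (%)."""
    highest = max(suv_by_diameter[d] for d in (13, 17, 22))
    reference = suv_by_diameter[37]
    return (highest - reference) / reference * 100.0


def highest_percentile_within(curves_by_percentile, max_mor=5.0):
    """Largest candidate percentile whose curve keeps the MOR below max_mor."""
    ok = [p for p, curve in curves_by_percentile.items()
          if max_overshoot_rate(curve) < max_mor]
    return max(ok) if ok else None


# Hypothetical upper-limit candidates (sphere diameter in mm -> SUV).
candidates = {
    100: {13: 3.0, 17: 4.4, 22: 4.2, 37: 4.0},
    90: {13: 2.8, 17: 4.1, 22: 4.1, 37: 4.0},
}
upper_percentile = highest_percentile_within(candidates)
```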

Relationships between SUVs of hot spheres and image noise levels (CVBG)

On PET images with 30–300 s acquisition, we investigated relationships between SUVs of the hot spheres and image noise levels. In this evaluation, SUVmax of the hot spheres was measured using spherical VOIs that sufficiently covered the whole uptake, assuming realistic tumor uptake measurements. Each SUV of the hot spheres on PET images with 1800-s acquisition was defined as a reference, because the images were in low noise conditions. Then, on PET images with 30–300 s acquisition, relative differences of SUVs were plotted as a function of CVBG. The measurement procedure of the CVBG was described above (Eq. 8). The relative differences of SUVs (RDSUV) were calculated by the following equation:

$${\text{RD}}_{{{\text{SUV}}}} = \frac{{{\text{SUV}}_{i} - {\text{SUV}}_{{i, {\text{ref}}}} }}{{{\text{SUV}}_{{i, {\text{ref}}}} }} \times 100 \left( \% \right) ,$$
(11)

where SUVi is the SUV of the i-mm diameter sphere on each PET image and SUVi,ref is the SUV of the i-mm-diameter sphere on PET images with 1800-s acquisition. The RDSUV was calculated for SUVmax and SUVpeak. To investigate the effect of the uptake volume, the RDSUV values were classified into two groups based on the sphere diameter (diameter: < 20 mm and ≥ 20 mm). This was based on the recommendation by the QIBA and PERCIST that the minimum lesion size was 2 cm in diameter for the target lesion at the baseline [8, 18].
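A minimal sketch of Eq. 11 and the diameter-based grouping follows. The function names and the dictionary layout are illustrative assumptions; the 20-mm cutoff is the one stated above.

```python
def relative_difference(suv, suv_ref):
    """Eq. 11: relative difference from the 1800-s reference SUV (%)."""
    return (suv - suv_ref) / suv_ref * 100.0


def split_by_diameter(rd_by_diameter, cutoff_mm=20):
    """Group RD values into spheres < cutoff and >= cutoff (diameters in mm)."""
    small = {d: v for d, v in rd_by_diameter.items() if d < cutoff_mm}
    large = {d: v for d, v in rd_by_diameter.items() if d >= cutoff_mm}
    return small, large
```

With the six sphere sizes of the NEMA body phantom, the first group contains the 10, 13, and 17 mm spheres and the second the 22, 28, and 37 mm spheres.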

Statistical analysis

All statistical analyses were performed with EZR (Saitama Medical Center, Jichi Medical University, Saitama, Japan) [69], which is a graphical user interface for R (The R Foundation for Statistical Computing, Vienna, Austria). Comparisons of values between two groups were performed with the Mann–Whitney U test. Comparisons of values among three or more groups were performed using the Kruskal–Wallis test, followed by the Steel–Dwass pair-wise multiple comparison test. Spearman’s correlation test was used to investigate the correlation of each image quality metric with the visual detectability score. Correlations between RDSUV and CVBG were examined with Pearson’s correlation test. In all analyses, P < 0.05 was defined as statistically significant.

Results

Average SUV in the background area (SUVB,ave)

The mean ± SD of the SUVB,ave was 1.00 ± 0.03 and all values were within 0.95–1.05. Supplemental Fig. 1 shows SUVB,ave for all scanner models. There was no significant difference among reconstruction algorithms (P = 0.56).

Part I: image quality

Figure 4 shows PET images with 120-s acquisition, which were reconstructed with clinical settings. There were no artifacts in any images, but large differences were found in visual contrasts of the smallest 10 mm sphere among scanners. Figure 5 shows NECphantom, QH,10 mm, N10 mm, QH,10 mm/N10 mm, CVBG and visual detectability score as a function of scan duration. The NECphantom, QH,10 mm/N10 mm, and visual detectability score increased with acquisition duration, while N10 mm and CVBG decreased with it. The QH,10 mm did not correlate with acquisition duration.

Fig. 4 PET images obtained with 120-s acquisition, which were reconstructed with the clinical settings at each site. For the scanners with PSF reconstruction, the PET images reconstructed with PSF modeling are shown. They are displayed with an upper level of SUV = 4, which equals the activity concentration of the hot spheres, and a lower level of SUV = 0

Fig. 5 A NECphantom, B QH,10 mm, C N10 mm, D QH,10 mm/N10 mm, E CVBG, and F visual detectability score as a function of scan duration

Figure 6 shows distributions of 10-mm-sphere-detectable values (i.e., corresponding to visual detectability score = 1.5) for NECphantom, N10 mm, QH,10 mm/N10 mm and CVBG. The data were classified into four groups by image reconstruction methods as follows: Overall (n = 37), TOF + PSF (n = 17), TOF (n = 15), and PSF (n = 5). The medians [min, max] of the 10-mm-sphere-detectable values were 3.2 [0.5, 6.8] for NECphantom, 10.6 [7.3, 19.6] for N10 mm, 2.5 [0.3, 3.5] for QH,10 mm/N10 mm, and 14.1% [8.8, 33.5] for CVBG. For NECphantom and N10 mm, significant differences were observed in the 10-mm-sphere-detectable values among the three groups. For more detailed information, the relationships between each image quality metric and visual detectability score are shown in the supplemental data (Supplemental Figs. 2–5). Each image quality metric was significantly correlated with the visual detectability score (P < 0.001) (Supplemental Table 2).

Fig. 6 Box plots of 10-mm-sphere-detectable values (i.e., corresponding to visual detectability score = 1.5) for A NECphantom, B N10 mm, C QH,10 mm/N10 mm and D CVBG. The data were classified into four groups by image reconstruction algorithms. The midline indicates the median, the box indicates the first and third quartiles of the distribution, whiskers indicate the 10% and 90% values, and circles represent outliers. * Indicates P < 0.05 and ** indicates P < 0.01

Medians [min, max] of the inter-rater variability in QH,10 mm, N10 mm and CVBG were 4.0 [1.0, 9.4], 5.6 [2.1, 13.3], and 0.8 [0.3, 5.6], respectively (Fig. 7). Inter-rater variability was significantly lower for CVBG compared to QH,10 mm and N10 mm (P < 0.001).

Fig. 7 Box plots of inter-rater variability for QH,10 mm, N10 mm and CVBG. The midline indicates the median, the box indicates the first and third quartiles of the distribution, whiskers indicate the 10% and 90% values, and circles represent outliers. ** Indicates P < 0.01

Part II: SUV variability

Figure 8 shows recovery coefficients for SUVmax and SUVpeak on PET images with 1800-s acquisition. A large variability was observed especially for the 13 mm sphere. Table 2 summarizes median, minimum, and maximum values of SUVmax and SUVpeak on PET images with 1800-s acquisition. For the small spheres (10–17 mm diameter spheres), the inter-scanner variability in SUVpeak was smaller than that in SUVmax.

Fig. 8 Recovery coefficients of SUVmax and SUVpeak for four groups: Overall, TOF + PSF, TOF, and PSF

Table 2 Median, minimum, and maximum values of SUVmax, SUVpeak, and CVBG in 1800-s PET images

The mean ± SD and various (0–30th and 70th–100th) percentile values for SUVmax and SUVpeak of all spheres are shown in Table 3 for PET images with 1800-s acquisition. The MOR for each upper range of 70th–100th percentiles is also given in that table. Using the 100th percentile, we obtained MORs for SUVmax and SUVpeak of 11.0% and 2.3%, respectively.

Table 3 Mean ± SD and various percentile values for SUVmax and SUVpeak for all spheres

The MOR for SUVmax was lower than 5% when using ≤ 90th percentile values as the upper limit (Table 3). Therefore, the 90th percentile values were defined as the upper limit for the SUV harmonization range (Fig. 9). Then, the 10th percentile values were defined as the lower limit. This percentile was selected because the resulting lower limit was below the true SUV of 4 for all spheres and because it excludes the same proportion of data as the upper limit.

Fig. 9 Representative percentiles in SUVmax (left) and SUVpeak (right). The range of the 10th-to-90th percentiles was proposed as the SUV harmonization range

For SUVmax and SUVpeak for the hot spheres on PET images with 30–300 s acquisition, RDSUV in relation to CVBG are shown in Fig. 10. In SUVmax for the small spheres (10–17 mm diameter), a positive bias was observed in RDSUV. Table 4 shows median, minimum, and maximum values for the RDSUV. The median [min, max] of the RDSUV for SUVmax and SUVpeak in all spheres were 5.3% [− 30.6%, 340.7%] and 1.1% [− 17.8%, 49.8%], respectively. There was a significant difference in the RDSUV between SUVmax and SUVpeak (P < 0.001). The RDSUV for both the SUVmax and SUVpeak significantly depended on sphere diameter (< 20 mm and ≥ 20 mm) and CVBG (≤ 10% and > 10%) (P < 0.001).

Fig. 10 Scatter plots of relative differences for SUVmax (upper) and SUVpeak (lower). Each reference was a corresponding SUV for an 1800-s PET image. Five different categorizations (TOF + PSF, TOF, PSF, sphere diameter < 20 mm and ≥ 20 mm) are shown on the right

Table 4 Median, minimum, and maximum values for the RDSUV with various categorizations

Discussion

We investigated image quality and SUV variability in hot spheres using 23 recent PET scanner models. Since almost all recent PET/CT scanner models were included in this study, the data precisely reflect current PET image characteristics available at clinical sites. Based on the data, we have proposed a reference level for each image quality metric (NECphantom, N10 mm, QH,10 mm/N10 mm and CVBG) with a focus on 10 mm sphere detectability. In addition, we have proposed a new SUV harmonization range and image noise criterion with a focus on the inter-scanner and intra-scanner SUV variabilities. Our proposed new standards will be useful for image quality standardization and SUV harmonization of PET studies in oncology.

Part I: image quality

Figures 4 and 5 show PET images and image quality metrics under clinical image reconstruction conditions. Because standardization of PET image quality was not performed, there was a large difference in 10-mm-sphere contrasts among scanners. As theoretically expected, longer scan durations provided lower image noise levels and better visual detectability scores. The results indicate that a simple way to obtain better image quality is to extend scan duration [70]. At 180 s, which is the standard scan duration recommended by the JSNM [27], almost all scanners achieved the visual detectability score of 2.0 (Fig. 5). Therefore, a 180-s scan for each bed position would be reasonable as a reference standard.

For each image quality metric, we have proposed a reference level that makes the 10 mm sphere visible. The calculation procedure for the reference level (Fig. 3) was the same as that of the previous work in 2014 [29], in which the reference levels were proposed as follows: NECphantom > 10.8 Mcounts, N10 mm < 5.6%, QH,10 mm/N10 mm > 2.8. On the other hand, we have provided reference levels as follows: NECphantom ≥ 3.2 Mcounts, N10 mm ≤ 10.6%, QH,10 mm/N10 mm ≥ 2.5, CVBG ≤ 14.1%. The CVBG has been newly added to the image quality metrics.

The proposed new reference level for the NECphantom was lower than that in the 2014 study [29]. This result suggests that recent PET scanners can visualize the 10 mm sphere even with a low NECphantom value. This is mainly because significant progress has been made in developing image reconstruction algorithms. The NEC is a count-based metric and is independent of image reconstruction algorithms [65]. Because PET image quality is determined not only by detected coincidence count quality (e.g., NEC) but also by image reconstruction algorithms and other factors (Fig. 1), the NECphantom would not be suitable for use in image quality standardization [71, 72].

The N10 mm, which is a metric of background variability, showed a trend similar to that of NECphantom. The proposed reference level for the N10 mm was higher than that in the previous study. This is also probably due to advances in image reconstruction algorithms. Specifically, PSF and TOF would contribute mainly to improving contrast for the 10 mm sphere [73]. These new techniques allow recent PET scanners to visualize the 10 mm sphere even with higher background variability. In addition, smaller voxel sizes were used in this study (1.3–4.1 mm) compared with those in the previous study (3.1–5.3 mm) [29]. The higher background variability might partly result from the smaller voxel sizes.

On the other hand, the reference level for QH,10 mm/N10 mm (contrast-to-noise ratio) was almost the same as that in the previous study. In addition, there was no significant difference in the 10-mm-sphere-detectable values for QH,10 mm/N10 mm among the image reconstruction algorithms (Fig. 6). These results suggest that the QH,10 mm/N10 mm would be a useful metric for assuring 10 mm sphere visibility, irrespective of PET scanner models and image reconstruction algorithms. The QH,10 mm/N10 mm includes information on both the 10 mm-sphere-contrast and background variability, and the balance of contrast and noise might be a key component for visual detectability of small hot lesions.

As for the CVBG (image noise levels), there was no significant difference in the 10-mm-sphere-detectable values among image reconstruction algorithms (Fig. 6). Additionally, the CVBG has some advantages compared with other metrics. The CVBG showed the lowest inter-rater variability among all image quality metrics (Fig. 7). The reason for its low variability is that large 37 mm ROIs were used to measure the CVBG (10 mm ROIs were used for QH,10 mm and N10 mm measurements). The CVBG is therefore more reproducible than QH,10 mm and N10 mm. Furthermore, the CVBG has been widely used for standardization of FDG PET in oncology. RSNA/QIBA and EANM/EARL specify that image noise levels are assessed by measuring the CV in the uniform background area as part of their standardization strategies [18, 22]. They have provided an acceptable level of 15% that is close to our proposed reference level (14.1%), although the phantom and ROI conditions are somewhat different. The CVBG and its reference level are compatible with other international standards. The use of CVBG may facilitate international standardization and global PET studies. A limitation of the CVBG is that it does not account for image contrast. Therefore, not only the CVBG but also contrast-related metrics such as QH,10 mm/N10 mm and recovery coefficients [29] should be evaluated to assure small lesion detectability.

Part II: SUV variability

As shown in Supplemental Fig. 1, the SUVB,ave of all scanner models were within 0.95–1.05. This result indicated that all scanners were well calibrated, and their quantitative accuracy was within ± 5% error. Therefore, our phantom data are sufficiently reliable to establish an SUV harmonization range. In a previous report in 2013, the SUVB,ave of 16 scanners ranged from 0.87 to 1.14 [74]. The quantitative accuracy of PET scanners has probably improved with advances in scanner performance. As described in the Materials and methods section, we visited several sites and supported the phantom test when requested. Such support might be effective in minimizing any technical errors in the process of phantom preparation.

Subsequently, we investigated inter-scanner SUV variability in each sphere on PET images with 1800-s acquisition (in noise-less conditions). Most scanner models showed SUVmax recovery coefficients higher than the upper limit provided by JSNM (Supplemental Fig. 6). This result suggested that the SUV harmonization range should be regularly updated according to the performance improvement of commercial scanners [12]. Compared with the large spheres (28–37 mm diameters), the small spheres (10–22 mm diameters) had larger SUV variability (Fig. 8). Many studies have reported that TOF PET scanners provide higher SUVs for small lesions than those without TOF [26, 75, 76]. Since this study used both TOF and non-TOF scanner models (19 TOF PET scanner models and 4 non-TOF PET scanner models), this difference would have contributed to the large SUV variability in the small spheres.

Comparing the TOF + PSF and TOF groups (Fig. 8), higher SUVs were obtained for the 17-mm sphere when using PSF reconstruction. Furthermore, in most cases, the SUVmax of the 17-mm sphere was higher than that of the 37-mm sphere. This overshoot is probably attributable to the edge artifact [31, 67, 68]. If the SUVmax of a small lesion on PSF-based PET images is used for monitoring treatment response, this overshoot must be suppressed by SUV harmonization [77]. For SUVpeak, on the other hand, the overshoot was suppressed even in PSF-based PET images, and the inter-scanner variability was lower than that for SUVmax.

Based on various percentile values for SUVmax and SUVpeak of all spheres, we proposed a new SUV harmonization range (Fig. 9, 10th–90th percentile). To address the overshoot due to PSF reconstruction [77], we determined the upper limit so that the MOR was lower than 5% (Table 3). PET images that satisfy our proposed harmonization range can be used for both lesion detection and quantification even if PSF reconstruction is applied, which makes SUV harmonization feasible and practical. Compared with the SUV recovery coefficients for EANM/EARL standards 2 [22, 78], our proposed SUVmax harmonization range is lower (Supplemental Table 3). This is probably due to differences in the phantom test conditions. Because of the low activity concentration, the short scan duration, and the high sphere-to-background contrast, the EANM/EARL standards 2 provided a higher bandwidth for SUVmax recovery coefficients. Taking the difference in phantom test conditions into consideration, there appear to be no major differences between the SUV recovery coefficient harmonization ranges. Interestingly, the differences in SUVpeak recovery coefficient ranges were exceedingly small despite the different phantom test conditions. International harmonization may be possible, although further investigations are required.
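To illustrate the percentile-based idea, the sketch below derives a 10th–90th percentile band from the recovery coefficients of one sphere across scanners and tests whether a given scanner falls inside it. The linear-interpolation percentile, the function names, and the sample values are assumptions made for illustration; the additional adjustment of the upper limit based on the MOR (Table 3) is not reproduced here.

```python
import math


def percentile(values, p):
    """p-th percentile (0-100) using linear interpolation between sorted values."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values given")
    pos = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def percentile_band(rc_values, lower_p=10, upper_p=90):
    """Return (lower, upper) limits of the band for one sphere size."""
    return percentile(rc_values, lower_p), percentile(rc_values, upper_p)


def within_band(rc, band):
    """True if a recovery coefficient lies inside the band (inclusive)."""
    lower, upper = band
    return lower <= rc <= upper


# Hypothetical SUVmax recovery coefficients of one sphere from 11 scanner models
rc_one_sphere = [0.60, 0.65, 0.70, 0.72, 0.75, 0.78, 0.80, 0.85, 0.90, 0.95, 1.10]
band = percentile_band(rc_one_sphere)
print(band, within_band(1.10, band))
```

In practice one band would be computed per sphere diameter and per metric (SUVmax, SUVpeak), and a scanner would be considered harmonized only if it falls inside every band.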

Then, we investigated intra-scanner SUV variability in relation to image noise levels. For all data (n = 37), three images were reconstructed for each acquisition time (30–300 s). The number of images (n = 1110) was considered adequate to investigate the relationships. For SUVmax, the variability increased as the CVBG increased. Because SUVmax is derived from a single maximum voxel value, its variability depends considerably on image noise levels [44]. For the large spheres (≥ 20 mm diameter), a positive bias was clearly observed (ρ = 0.82). This noise-dependent bias was also reported by Lodge et al. [43]. On the other hand, for the small spheres (< 20 mm diameter), the positive bias was weaker (ρ = 0.60) and the number of negative values increased (Fig. 10). When measuring a percentage change in SUVs between two time points, the variability may therefore be large for small lesions. Low image noise is essential for accurate quantitative evaluation, especially for small lesions.

As shown in Table 4, the RDSUV values for SUVmax ranged from − 30.6 to 340.7% on the PET images with a CVBG higher than 10%. Meanwhile, on the PET images with a CVBG of 10% or lower, the RDSUV values ranged from − 22.3 to 35.3%. In the QIBA/UPICT, the CV in the uniform area should be lower than 15% as a target level and, ideally, lower than 10% [18, 79]. The SNMMI/CTN also uses the CV in the uniform area as an image noise metric, and it is recommended that the CV be 10% or lower [80, 81]. Akamatsu et al. [44] examined the relationships between image noise levels and SUVs using a phantom and a single PET scanner, and suggested that the CV in the uniform area should be below 10% to minimize the SUVmax fluctuation. Considering the results of this study and the standards set by the major nuclear medicine societies, CVBG ≤ 10% would be reasonable and feasible as the image noise criterion.
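The comparison above can be sketched as a grouping of relative differences by noise level. The sketch assumes that RDSUV is the percentage difference between an SUV measured on a short-acquisition image and a reference SUV (for example, from the 1800-s image); this definition, the record layout, and the sample numbers are assumptions for illustration only.

```python
def relative_difference_percent(suv, suv_reference):
    """Percentage difference of an SUV from its reference value (assumed definition)."""
    if suv_reference == 0:
        raise ValueError("reference SUV is zero")
    return 100.0 * (suv - suv_reference) / suv_reference


def rd_range_by_noise(records, cv_limit=10.0):
    """Group relative differences by background CV and return (min, max) per group.

    records: iterable of (cv_bg_percent, suv, suv_reference) tuples.
    A group with no records maps to None.
    """
    groups = {"low_noise": [], "high_noise": []}
    for cv_bg, suv, suv_reference in records:
        key = "low_noise" if cv_bg <= cv_limit else "high_noise"
        groups[key].append(relative_difference_percent(suv, suv_reference))
    return {k: (min(v), max(v)) if v else None for k, v in groups.items()}


# Hypothetical records: (CV_BG in %, measured SUVmax, reference SUVmax)
records = [(8.0, 11.0, 10.0), (9.0, 9.0, 10.0), (20.0, 15.0, 10.0)]
print(rd_range_by_noise(records))
```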

Comparison of SUVmax and SUVpeak showed that each has its own advantages and disadvantages. SUVmax has been most commonly used to measure lesion uptake in FDG PET, because its measurement is easy and observer-independent [8, 13]. The partial volume effect is relatively small even in small lesions [82]. Furthermore, SUVmax reflects the most metabolically active area inside potentially heterogeneous tumors. This is important, because the highest metabolic activity might be critical clinical information. The most challenging issue is the variability in SUVmax (Figs. 8 and 9). Because the inter-scanner and intra-scanner variabilities in SUVmax are problematic, SUV harmonization and image noise management are essential in multicenter studies. In contrast to SUVmax, SUVpeak has lower intra-scanner variability (Fig. 10). SUVpeak was less sensitive to image noise levels than SUVmax. On the PET images with a CVBG of 10% or lower, the RDSUV values for SUVpeak ranged from − 10.8 to 15.4%. Makris et al. [25] also reported that SUVpeak was less sensitive to variability in image characteristics and might be less affected by noise-dependent bias than SUVmax. Since SUVpeak may provide lower inter-scanner and intra-scanner variabilities than SUVmax, it is more suitable for use in multicenter studies. However, there are some considerations if SUVpeak is to be used. Because SUVpeak is derived from a 12-mm-diameter spherical VOI, lesion uptake might be underestimated due to the partial volume effect, particularly in lesions smaller than 20 mm, and SUVpeak is not applicable to lesions smaller than 12 mm. In addition, there are various definitions of SUVpeak itself [83], and variability will be introduced depending on the image analysis software. To compare values derived from multiple software packages, VOI definitions should be verified and standardized among them. The appropriate quantitative measure (SUVmax, SUVpeak, etc.) should be selected according to each study’s purpose and the characteristics of the target lesion.
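The difference between the two measures can be made concrete with a small sketch. It assumes one particular SUVpeak definition: the mean of all voxels whose centers lie within a 12-mm-diameter sphere centered on the hottest voxel, with no fractional voxel weighting. Other definitions exist (for example, placing the VOI where its mean is highest), which is exactly why results can differ between software packages. The function names and the toy volume are hypothetical.

```python
def suv_max(volume):
    """Return (value, (z, y, x)) of the hottest voxel in a 3D nested list."""
    best = None
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if best is None or value > best[0]:
                    best = (value, (z, y, x))
    return best


def suv_peak(volume, voxel_mm, diameter_mm=12.0):
    """Mean of voxels whose centers lie within a sphere centered on the hottest voxel."""
    _, (cz, cy, cx) = suv_max(volume)
    dz, dy, dx = voxel_mm
    radius_sq = (diameter_mm / 2.0) ** 2
    total, count = 0.0, 0
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                dist_sq = ((z - cz) * dz) ** 2 + ((y - cy) * dy) ** 2 + ((x - cx) * dx) ** 2
                if dist_sq <= radius_sq:
                    total += value
                    count += 1
    return total / count


# Toy volume: uniform background of 1.0 with a single hot voxel of 5.0, 4-mm voxels
volume = [[[1.0] * 5 for _ in range(5)] for _ in range(5)]
volume[2][2][2] = 5.0
print(suv_max(volume)[0], round(suv_peak(volume, (4.0, 4.0, 4.0)), 3))
```

In this toy case the single hot voxel fully determines SUVmax, whereas SUVpeak averages it with its neighbors, which illustrates both the lower noise sensitivity and the partial volume underestimation discussed above.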

Limitations and future issues

The image quality reference levels that we proposed are not appropriate for all FDG PET studies. We focused on 10-mm-sphere detectability; however, if sub-centimeter lesions are the study target, smaller spheres should be evaluated for more effective standardization. In addition, the NEMA image quality phantom mimics an average human body size. In some cases, such as pediatric studies or studies on overweight patients, phantoms of corresponding size would be more suitable. Fukukita et al. [29] evaluated larger body phantoms and demonstrated that a longer scan time was required for larger phantoms to maintain visual detectability of the 10-mm sphere. Appropriate evaluations and quality controls should be performed according to the purposes of the individual FDG PET studies [12].

Regarding FDG distribution, intra-tumoral FDG uptake is heterogeneous in some types of tumor [84,85,86]. SUVmax and SUVpeak reflect only the amount of FDG uptake in specified regions. Recently, other quantitative measures have been used to characterize lesion FDG uptake, such as metabolic tumor volume, total lesion glycolysis, and textural features [86,87,88]. If these quantitative metrics are used in multicenter studies, their inter-scanner and intra-scanner variabilities should be verified using an appropriate phantom to move toward harmonization.

Conclusions

We experimentally investigated image quality and SUV variability in hot spheres using 23 recent PET scanner models and the NEMA image quality phantom. We then determined appropriate image quality reference levels at which a 10-mm sphere is visible. The newly proposed reference levels are QH,10 mm/N10 mm ≥ 2.5 and CVBG ≤ 14.1%. Of these, the CVBG is the most reliable and useful, because it has the lowest inter-rater variability (Fig. 7) and is compatible with other international standards such as RSNA/QIBA and EANM/EARL. In addition, we investigated the inter-scanner and intra-scanner SUV variabilities. The new SUV harmonization range (in which PSF reconstruction is applicable) and the image noise criterion (CVBG ≤ 10%) were proposed based on these data. Our results also indicated that SUVpeak is a useful quantitative metric, because it provided lower inter-scanner and intra-scanner variabilities than SUVmax. International SUV harmonization may be facilitated using SUVpeak, although further investigations are needed.

Our proposed new standards are useful for image quality standardization and SUV harmonization of whole-body FDG PET studies in oncology. The reliability of multicenter PET studies will be improved by satisfying the standards before starting the study. We believe that the new standards will help facilitate research and development of new treatments for cancers.