Introduction

In quantitative 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography (PET) imaging, several functional parameters of 18F-FDG-positive lesions can be extracted from images, thus supporting prognosis, staging and assessment of treatment response in oncology [1]. The most widely used parameter is the maximal standardised uptake value (SUVmax; g.ml⁻¹), which is obtained from the voxel with the highest activity in an 18F-FDG-positive lesion. It is equal to the activity concentration (kBq.ml⁻¹) normalised to the body activity concentration at injection (ID/W: injected dose to patient’s weight; MBq/kg) [2]. Any SUV outcome should be considered together with its measurement uncertainty. As an example, SUV repeatability (R; in %) represents the minimal relative change between two SUVs assessed at pre- and post-treatment examinations that is required to consider a response to treatment significant [3, 4]. The lower the R value, the more accurate the assessment, which is why lowering the R of the SUV is of major importance. It is well known that averaging SUV over several voxels lowers R, and several averaging methods are available. A mean SUV can be obtained by outlining the 18F-FDG-positive tissue; numerous segmentation methods have been proposed but, to date, none has been recognised as a “gold standard” [5, 6]. SUVpeak represents the average SUV obtained over a fixed-size region of interest (ROI), which is usually defined as a 1-ml sphere (1.2-cm diameter) centred on a high-uptake part of the tumour [4]. However, various ROI shapes, sizes and locations have been proposed, and these choices can significantly affect the use of SUVpeak for assessing treatment response [7]. In a recent paper, Burger et al. [8] showed that the average SUV obtained by pooling several (5–10) hottest voxels improved R in comparison with SUVmax, i.e. the hottest voxel only.
Unlike SUVpeak, which involves several contiguous voxels that belong to a single volume of interest (VOI) and are therefore not necessarily the hottest ones of the whole 18F-FDG-positive lesion, this average SUV involves voxels that may be located in separate parts of the 18F-FDG-positive tissue, as is typically the case when the uptake is heterogeneous (Fig. 1).
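The SUV normalisation described above reduces to a simple ratio once the units are written out. The following is a minimal sketch, assuming that the voxel activity concentration has already been decay-corrected to the injection time; the function name `suv` and the 50 kBq/ml voxel value are hypothetical, while 344 MBq and 72 kg are the mean injected dose and mean patient weight reported in this study, used here for illustration only.

```python
def suv(activity_kbq_per_ml: float, injected_dose_mbq: float, weight_kg: float) -> float:
    """Return the SUV (g/ml) of a voxel from its activity concentration.

    Assumes the activity concentration is decay-corrected to injection time.
    """
    if injected_dose_mbq <= 0 or weight_kg <= 0:
        raise ValueError("injected dose and weight must be positive")
    # Units: (kBq/ml) / (MBq/kg) = (1e3 Bq/ml) / (1e6 Bq / 1e3 g) = g/ml,
    # so no additional scaling factor is needed.
    return activity_kbq_per_ml / (injected_dose_mbq / weight_kg)


# Hypothetical voxel of 50 kBq/ml in a 72-kg patient injected with 344 MBq.
example_suv = suv(50.0, 344.0, 72.0)
print(f"SUV = {example_suv:.2f} g/ml")
```

Applying the function to the hottest voxel of a lesion gives SUVmax; applying it to every voxel of a VOI gives the list of SUV values from which averaged metrics are derived.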

Fig. 1

Left side: PET maximal intensity projection of a patient presenting a lung squamous cell carcinoma of the left lower lobe, with a second malignant nodule of the left upper lobe and a subcarinal lymph node. Right side: Zoom of the main lesion showing a heterogeneous uptake: the set of green areas (as displayed by the AW workstation, GEMS) corresponds to the N = 30 hottest voxels of the lesion (SUVmax-30 = 15.5 g/ml), while the arrow indicates the location of the hottest voxel (SUVmax = 18.0 g/ml). The largest axis of the green area involving SUVmax is 6.5 mm. SUVpeak was centred on SUVmax and involved its 26 neighbouring voxels (0.66-ml volume; in-plane 1.16-cm diagonal line)

The aim of this work was to systematically assess the variability of the average SUV obtained from the N = 5, 10, 15, 20, 25 or 30 hottest voxels (SUVmax–N; N ≠ 1) in 18F-FDG-positive lesions of lung cancer patients: the limitation of SUVmax has been shown before [8], but the direct impact of SUVmax–N on variability has not been assessed. SUVmax–N variability was compared with that of SUVmax (N = 1) and of SUVpeak, computed using the VOI encompassing the voxel corresponding to SUVmax and its 26 neighbouring voxels [9].

Materials and methods

Patients

This study received the approval of the Ethics Committee of the Teaching Hospital and all patients gave their informed consent before inclusion. Twelve lung cancer patients were included and a total of 20 lesions were investigated. Minimal lesion size, assessed with CT either in-plane or axially, was larger than 15 mm in order to minimise partial volume effects [10]. All patients fasted for at least 6 h before the 18F-FDG injection, and the pre-injection average plasma glucose concentration was 1.00 g.l⁻¹ (range, 0.90–1.17). 18F-FDG was administered intravenously over less than 1 min with a mean injected dose of 344 MBq (range, 229–460; assessed with a dose calibrator).

PET imaging and data processing

Patients underwent a static whole-body PET/CT for diagnostic purposes and, for the purposes of the study, a dynamic acquisition over the chest (one step, ten consecutive frames of 2.5 min each) performed within 60–110 min post-injection (Discovery ST PET/CT camera; General Electric Medical Systems, Milwaukee, USA; three-dimensional [3D] mode without septa; decay correction on). All PET images were reconstructed iteratively (FORE + OSEM; subsets, 32; iterations, 5; 3D postfilter of Hann, 0.9, 10.0); the voxel size was 2.73 mm in-plane and 3.27 mm axially (FOV, 700 × 700 mm; matrix, 256 × 256 pixels), leading to a voxel volume of 0.0244 ml. For each examination, a low-dose unenhanced CT was performed for attenuation correction and anatomical localisation (pitch, 1.675; slice thickness, 3.75 mm; FOV, 500 × 500 mm; matrix, 512 × 512 pixels), leading to a voxel volume of 0.0036 ml.
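The quoted voxel volumes follow directly from the field of view, the matrix size and the axial voxel dimension. The short check below is a sketch only: the helper name `voxel_volume_ml` is hypothetical, and it assumes square in-plane pixels and that the 3.27-mm (PET) and 3.75-mm (CT) values are the axial voxel dimensions.

```python
def voxel_volume_ml(fov_mm: float, matrix: int, axial_mm: float) -> float:
    """Voxel volume in ml, assuming square in-plane pixels of size FOV / matrix."""
    pixel_mm = fov_mm / matrix
    # 1 ml = 1 cm^3 = 1000 mm^3
    return pixel_mm * pixel_mm * axial_mm / 1000.0


pet_voxel_ml = voxel_volume_ml(700.0, 256, 3.27)  # PET: about 0.0244 ml
ct_voxel_ml = voxel_volume_ml(500.0, 512, 3.75)   # CT: about 0.0036 ml
print(round(pet_voxel_ml, 4), round(ct_voxel_ml, 4))
```

The PET voxel volume matters later in the analysis because the total hottest volume associated with SUVmax–N is simply N times this value.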

An Advantage 4.6 workstation (General Electric Medical Systems) was used to draw, in each dynamic frame, a separate VOI encompassing each 18F-FDG-positive lesion. A histogram representing the percentage of all voxels included in a given VOI versus SUV was used (Fig. 2; histogramming bin size = 0.032 g.ml⁻¹), allowing us to assess SUVmax (N = 1) and SUVmax–N for the N = 5, 10, 15, 20, 25 or 30 hottest voxels. On the histogram, two cursors (vertical green dotted lines in Fig. 2) were used: the first one was set at SUVmax and the second one was moved along the SUV axis. For each position of the second cursor, i.e. for each histogramming bin, the total hottest volume and the associated mean SUV value were calculated between the two cursors (Fig. 2). Hottest volumes of 0.0244 × N ml, with N = 5, 10, 15, 20, 25 or 30 voxels, provided the associated mean SUV values corresponding to SUVmax–N. When several voxels fell within the same histogramming bin, i.e. several voxels whose SUV values fall into the same SUV range, the exact total hottest volume for a given N value, and hence SUVmax–N, was not always directly given by the software. A rule of three was therefore used to deduce the mean SUV value corresponding to SUVmax–N. SUVpeak, calculated as the mean SUV value over the voxel corresponding to SUVmax and its 26 neighbouring voxels, was assessed using the PMOD software version 3.3 (PMOD Technologies, Zurich, Switzerland).
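When the voxel SUV values of a VOI are available as a plain list rather than as a workstation histogram, the two averaged metrics can be sketched in a few lines. This is an illustration under stated assumptions, not the workstation or PMOD implementation: the function names `suvmax_n` and `suvpeak_27` are hypothetical, sorting the voxel list replaces the histogram cursors (so no rule of three is needed), voxels outside the array are simply ignored at the borders, and the toy volume is invented.

```python
from itertools import product


def suvmax_n(voxel_suvs, n):
    """Mean SUV (g/ml) of the n hottest voxels in a flat list of voxel SUVs."""
    if not 1 <= n <= len(voxel_suvs):
        raise ValueError("n must lie between 1 and the number of voxels in the VOI")
    hottest = sorted(voxel_suvs, reverse=True)[:n]
    return sum(hottest) / n


def suvpeak_27(volume):
    """Mean SUV over the hottest voxel and its in-bounds 3 x 3 x 3 neighbourhood.

    `volume` is a nested list indexed as volume[z][y][x].
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    # Locate the hottest voxel (ties are resolved by the largest index, an arbitrary choice).
    _, z0, y0, x0 = max(
        (volume[z][y][x], z, y, x)
        for z, y, x in product(range(nz), range(ny), range(nx))
    )
    block = [
        volume[z0 + dz][y0 + dy][x0 + dx]
        for dz, dy, dx in product((-1, 0, 1), repeat=3)
        if 0 <= z0 + dz < nz and 0 <= y0 + dy < ny and 0 <= x0 + dx < nx
    ]
    return sum(block) / len(block)


# Toy 3 x 3 x 3 VOI: uniform uptake of 1.0 g/ml with one hot voxel of 10.0 g/ml at the centre.
toy = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
toy[1][1][1] = 10.0
flat = [value for plane in toy for row in plane for value in row]

print(suvmax_n(flat, 1))   # SUVmax (N = 1)
print(suvmax_n(flat, 5))   # SUVmax-5
print(suvpeak_27(toy))     # SUVpeak over the 27 voxels
```

The key design difference is visible in the code: `suvmax_n` ignores voxel positions altogether, whereas `suvpeak_27` is tied to the neighbourhood of a single location.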

Fig. 2

Histogram representing the percentage of all voxels included in a selected VOI (encompassing an arbitrary 18F-FDG-positive lesion in an arbitrary dynamic frame) versus SUV (g/ml). For each lesion and in each of the ten dynamic frames, this histogram was used to obtain SUVmax (N = 1) and SUVmax–N for the N = 5, 10, 15, 20, 25 or 30 hottest voxels (see text for details of the procedure). As an example, for a 0.61-ml volume, i.e. N = 25, SUVmax–25 is 20.4 g/ml (Mean displayed on the screen), SUVmax is 21.6 g/ml (maximal bin) and the minimal bin is 19.5 g/ml (right and left vertical dotted green cursors respectively); note that the SUV standard deviation between these limits, SD = 0.6 g/ml, is also given

Statistical analysis

Statistical analysis by Bland and Altman was implemented over the lesion series to assess the variability of SUVmax–N, SUVmax and SUVpeak, involving the following steps [11, 12]: (1) for each 18F-FDG-positive lesion, the mean and standard deviation (SD) of each SUV metric were assessed from the ten dynamic frames; (2) for each SUV metric, we found that SD was significantly related to magnitude (i.e. SUV mean) over the lesion series (Pearson coefficient), i.e. SD did not follow a normal law; we therefore transformed the data by using the relative standard deviation (SDr), i.e. SD expressed as a percentage of the mean, and for each SUV metric it was verified that SDr was not significantly related to magnitude (i.e. SUV mean) over the lesion series (Pearson coefficient); this feature then allowed us to calculate a mean SDr over the series (SDrN, SDrmax and SDrpeak respectively); (3) for each SUV metric, the relative measurement error (MEr), i.e. the relative difference between a single estimate of a parameter and its average “true” value, was computed as 1.96 × mean SDr and 2.58 × mean SDr for the 95 % and 99 % confidence levels (CL) respectively. Repeatability (R), i.e. the minimal relative change between two SUVs assessed from two successive examinations that is required to consider a difference significant, was computed as √2 × MEr, i.e. √2 × 1.96 × mean SDr and √2 × 2.58 × mean SDr for the 95 % and 99 % CL respectively [11, 12].
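The three steps above can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator, as returned by `statistics.stdev`) is an assumption, since the text does not state which estimator was used. The value 7.07 % is the mean SDr of SUVmax quoted in the Discussion of this study; the three frame values are invented.

```python
from math import sqrt
from statistics import mean, stdev

# Normal-distribution multipliers for the two confidence levels used in the text.
Z_FACTOR = {95: 1.96, 99: 2.58}


def relative_sd_percent(frame_values):
    """SDr of one lesion: SD of the per-frame SUVs as a percentage of their mean.

    Assumption: sample SD (n - 1 denominator).
    """
    return 100.0 * stdev(frame_values) / mean(frame_values)


def measurement_error_and_repeatability(mean_sdr_percent, confidence=95):
    """Return (MEr, R) in percent from the mean SDr over the lesion series."""
    mer = Z_FACTOR[confidence] * mean_sdr_percent
    return mer, sqrt(2.0) * mer


# Invented per-frame SUVs for a single lesion (three frames only, for brevity).
print(relative_sd_percent([9.0, 10.0, 11.0]))

# Mean SDr of SUVmax quoted in the Discussion (7.07 %).
mer_max, r_max = measurement_error_and_repeatability(7.07, confidence=95)
print(f"MEr = {mer_max:.2f} %, R = {r_max:.2f} %")
```

In a full analysis, `relative_sd_percent` would be applied to the ten frame values of each lesion and the results averaged over the 20 lesions before calling the second function.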

Comparison between SDrN, SDrmax and SDrpeak was performed using a two-tailed paired t-test. A two-tailed sign test was performed to assess whether SUVmax–30 and SUVmax increased with time. P values of less than 0.05 were considered statistically significant.
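For readers unfamiliar with the sign test, a minimal exact two-tailed version is sketched below. The function name is hypothetical, ties are assumed to have been discarded beforehand, and the counts are illustrative rather than taken from the study data; the text does not specify which per-lesion quantity was signed, so the sketch assumes one sign per lesion (for example, the sign of the SUV-versus-time slope).

```python
from math import comb


def sign_test_two_tailed(n_positive, n_negative):
    """Exact two-tailed sign test P value under the null hypothesis p = 0.5.

    Ties (zero differences) are assumed to have been removed by the caller.
    """
    n = n_positive + n_negative
    if n == 0:
        raise ValueError("at least one non-tied observation is required")
    k = min(n_positive, n_negative)
    # Probability of observing k or fewer of the less frequent sign.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Illustrative counts only: 16 increases versus 4 decreases among 20 lesions.
p_value = sign_test_two_tailed(16, 4)
print(f"P = {p_value:.3f}")
```

With these illustrative counts the test is significant at the 0.05 level, whereas an even split (for example 10 versus 10) returns a P value of 1.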

Results

Ten men and two women were included in the study (mean age, 63 years; range, 43–78) with 20 lesions in the thorax. Lesions were localised in the lung tissue (n = 13), or in mediastinal lymph nodes (n = 7; two patients presented two lymph nodes) and consisted of squamous cell carcinoma (n = 3), small cell lung cancer (n = 3), adenocarcinoma (n = 4), large cell carcinoma (n = 2). Minimal lesion size (assessed either in-plane or axial) ranged between 15.7 and 61.0 mm (mean, 29.4 mm). Patients’ mean weight and height were: 72 kg (range, 44–95 kg) and 169 cm (range, 157–179 cm), respectively.

For each SUV metric, Table 1 summarises the corresponding volume, the mean value (and range) over the lesion series, the correlation coefficient of the fit of SUV SDr versus mean SUV over the lesion series, the mean SDr (and range) over the lesion series, and MEr and R for the 95 % and 99 % CL. For each SUV metric, this correlation coefficient shows that SUV SDr was not significantly related to SUV magnitude (r0 = 0.44; 95 % reliability; Table 1), therefore allowing calculation of the mean SDr. Figure 3 illustrates this lack of correlation for SUVmax–30 SDr versus mean SUVmax–30 over the lesion series. A significant relationship was found between N and the correlation coefficient of the fit of the SDr of SUVmax–N versus mean SUVmax–N over the lesion series (Table 1; N = 5, 10, 15, 20, 25 or 30; y = -0.008x + 0.400; r = 0.94; P < 0.01; graph not shown).

Table 1 SUV characteristics
Fig. 3

SUVmax–30 SDr versus mean SUVmax–30 over the lesion series, which were obtained from ten values in each lesion. No significant correlation was found (y = -0.0665x + 5.41; r = 0.17), thus justifying calculation of SDr30. For each SUV metric, such a lack of correlation was found over the lesion series

Figure 4 shows a significant and a non-significant linear correlation with time for SUVmax-30 and SUVmax in a typical lesion. Four and 11 lesions of a total of 20 showed a significant increase in SUVmax-30 and SUVmax with time respectively (linear correlation; 95 % reliability). This result indicates that SUVmax-30 significantly increases with time over the lesion series (two-tailed sign test, P = 0.012), whereas no significant increase was found for SUVmax (two-tailed sign test, P > 0.75).

Fig. 4

SUVmax (diamonds) and SUVmax–30 (squares) versus time in a typical lesion, showing a non-significant and a significant linear correlation respectively (r = 0.024 and 0.844 respectively; 95 % reliability)

For all considered numbers of hottest voxels (N ranging from 5 to 30), SDrN was found to be significantly lower than SDrmax and SDrpeak (two-tailed paired t-test, P ≤ 0.004). Figure 5 shows a strong correlation between SDrN and N: SDrN = 6.471 × N^(−0.103) (r = 0.994; P < 0.01). No significant difference was observed between SDrmax and SDrpeak (two-tailed paired t-test, P = 0.64).

Fig. 5

SDrN versus N showing a strong correlation: SDrN = 6.471 × N^(−0.103) (r = 0.994; P < 0.01). For N = 5, 10, 15, 20, 25 or 30 (diamonds), SDrN was found to be significantly lower than SDrmax (square; N = 1) and SDrpeak (triangle; N = 27) (two-tailed paired t-test, P ≤ 0.004; bars represent 95 % confidence limits). No significant difference was found between SDrmax and SDrpeak (two-tailed paired t-test, P = 0.64). Note that when SUVmax is added to the fit, i.e. N = 1, the equation of the resulting fit is: SDrN = 6.932 × N^(−0.127) (r = 0.993; P < 0.01)

Discussion

According to the guidelines (version 1.0) of the European Association of Nuclear Medicine (EANM), assessment of either SUVmax or SUVpeak, the latter being defined as the average SUV value of the 1-ml sphere (1.2-cm diameter) within the tumour that yields the highest possible value, is required for each clinically relevant 18F-FDG-positive lesion [4, 13]. The present study showed that averaging SUV from several hottest voxels (i.e. SUVmax–N; N ≠ 1), instead of using a single voxel (i.e. SUVmax; N = 1), lowers variability significantly, by about 1/3 for N = 30. This result is in very good agreement with the previous study by Burger et al. [8], whose results were obtained with N = 5–10 in patients with malignant chest tumours: the ratios SDr5/SDrmax and SDr10/SDrmax were found to be 0.78 (=5.52/7.07) and 0.71 (=5.03/7.07) in the current study versus 0.81 (=6.60/8.17) and 0.77 (=6.30/8.17) in the study by Burger et al. (from ratios of mean percentage changes in reference [8]). Furthermore, irrespective of the N value, the current study showed that SUVmax–N also has a better variability performance than SUVpeak. Various ROI shapes, sizes and locations have been proposed in the literature; in this study SUVpeak was calculated from SUVmax and its 26 neighbouring voxels, corresponding to a 0.66-ml volume (in-plane 1.16-cm diagonal line) that is slightly smaller than a 1-ml sphere (1.2-cm diameter) [4, 7, 9]. The better variability performance of SUVmax–N in comparison with SUVpeak is very likely related to the fact that SUVmax–N involves the hottest voxels, which are not necessarily close to each other in an 18F-FDG-positive lesion, whereas SUVpeak is obtained over 27 adjacent voxels. As a consequence, SUVpeak includes voxels that may not be the hottest ones, as is typically encountered in a lesion showing heterogeneous uptake (Fig. 1) or in an excavated lesion.
In contrast, SUVmax–N takes into consideration the heterogeneity of the lesion uptake, a feature whose relevance has been recently shown [14, 15].

We suggest that the improvement in SUV accuracy and repeatability obtained by pooling several hottest voxels (SUVmax–N) could be useful for predicting outcome or for assessing treatment response, alongside other functional parameters such as the tumour volume (TV) and the total lesion glycolysis (TLG) [16–18]. Besides those volume-based parameters, the baseline SUVmax value (i.e. the hottest SUV value) remains relevant in PET quantification, as recently demonstrated for prognostic stratification of lung cancer treated with erlotinib, which supports the interest of SUVmax–N (i.e. the average of the N hottest SUV values) [19]. This 3D pooling better represents the most aggressive parts of the tumour than does SUVmax, which relates to one voxel only (N = 1; even if SUVmax is influenced by its neighbouring voxels owing to the limited spatial resolution of PET systems), or SUVpeak, which is also fixed to one specific location within the VOI (Fig. 1). Moreover, it should be noted that SUVmax–N variability is slightly lower than that of a mean SUV obtained with a fixed threshold of 40 % of SUVmax, which does not only involve the hottest voxels [20]: MEr and R of 8.94 % and 12.63 % versus 9.72 % and 13.74 % (95 % confidence level) for SUVmax–30 and the latter respectively. Furthermore, SUVmax–N calculation can be easily implemented in current clinical practice and no intra- or inter-observer variability is expected. Indeed, when a selected 18F-FDG-positive lesion is separated from other 18F-FDG-positive tissues, the histogram representing the percentage of all voxels included in that VOI versus SUV (or any list of the voxel SUV values sorted in descending order) is unambiguously established for the highest SUV values. This is particularly true when the total hottest volume that is averaged (in our study N = 30 corresponds to a total hottest volume of 0.732 ml) is much lower than the whole lesion volume, as illustrated in Fig. 1.
For the lowest SUV values of the histogram, not involved in the SUVmax–N calculation, there is a potential variability that depends on the number of included background voxels, in other words, that depends on the delineation method used.

As our results demonstrate that the best variability performance is achieved for N = 30 in comparison with the other SUV metrics, the use of SUVmax–30 might be recommended in current clinical practice. Moreover, it can be noted that N = 30 is close to the total number of 27 adjacent voxels used for the computation of SUVpeak [9]. However, this study was conducted with specific reconstruction parameters (voxel volume of 0.0244 ml), for which N = 30 corresponds to an 18F-FDG-positive volume of 0.732 ml (Table 1). It is worth noting that N = 40 corresponds to a total 18F-FDG-positive volume of about 1 ml in our study, i.e. the volume of the sphere recommended by Wahl et al. [4], and N = 40 could be recommended as well, with predicted MEr and R for SUVmax–40 of 8.67 % and 12.26 % respectively (95 % CL; Fig. 5; SDrN = 6.471 × N^(−0.103); r = 0.994). Increasing the number N of hottest voxels in the SUVmax–N calculation may further lower its variability, but to the detriment of its inherent ability to represent the most “aggressive” parts of the tumour (Fig. 1). In this respect, the present study may be considered as belonging to the knowledge phase of the Rapid Learning methodology [21], and further studies are required to decide on the optimal total hottest volume to be used (N × voxel volume) rather than on the optimal number of voxels (N), as the specific reconstruction parameters of each PET system may be different. Moreover, it should be noted that the malignant pathology could also play a role in the determination of the optimal total hottest volume to be used [22].
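The extrapolation to N = 40 can be reproduced from the power-law fit of Fig. 5. The sketch below is illustrative: the function names are hypothetical, the coefficients 6.471 and −0.103 are those of the reported fit (obtained for N = 5 to 30, so N = 40 lies outside the fitted range), and 0.0244 ml is the PET voxel volume of this study.

```python
from math import sqrt

PET_VOXEL_ML = 0.0244  # voxel volume reported for this study's reconstruction


def predicted_sdr(n, a=6.471, b=-0.103):
    """Mean relative SD (%) predicted by the reported fit SDrN = a * N**b."""
    return a * n ** b


def predicted_mer_and_r(n, z=1.96):
    """Predicted (MEr, R) in percent at the confidence level encoded by z."""
    mer = z * predicted_sdr(n)
    return mer, sqrt(2.0) * mer


hottest_volume_ml = 40 * PET_VOXEL_ML
mer_40, r_40 = predicted_mer_and_r(40)
print(f"N = 40: volume = {hottest_volume_ml:.3f} ml, MEr = {mer_40:.2f} %, R = {r_40:.2f} %")
```

The computed values agree with the 8.67 % and 12.26 % quoted above to within rounding of the last digit. For a PET system with a different voxel volume, the same total hottest volume would be reached with a different N, which is why the optimum is better expressed as a volume than as a voxel count.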

The improvement in SUV accuracy and repeatability obtained by pooling several hottest voxels could also be applied to the assessment of 18F-FDG-positive TV obtained with fixed percentage of the SUVmax thresholds [23]. For different percentage thresholds, it has been shown that TV variability correlated with SUVmax variability. Because SUVmax-N variability is lower than that of SUVmax, assessing 18F-FDG-positive TV created by using a fixed percentage of the SUVmax-N threshold, instead of an SUVmax threshold, might result in a lower TV variability. This suggestion also appears relevant for other functional parameters requiring tumour outlining such as the TLG, which is the product of TV with associated mean SUV [16, 20].

The current study presents some limitations. First, SUV variability was assessed by using a single 25-min dynamic PET acquisition, therefore providing data over a ±12.5-min (= 25/2) time window around a mean injection-acquisition time delay. As a result, temporal changes in SUV occurring over 12.5 min may lead to a slight overestimation of the SUV variability. Indeed, SUVmax–30 was found to increase significantly with time over the lesion series (two-tailed sign test, P = 0.012), a feature that is well established in 18F-FDG-positive tumours, whereas no significant increase was found for SUVmax, which may be related to its greater variability compared with SUVmax–30 (Fig. 4). We suggest that this slight overestimation is more clinically acceptable than an underestimation [20]. Furthermore, we also suggest that test-retest studies, involving two baseline examinations repeated on two different days with the same injection-acquisition time delay, are warranted for comparison with the present one. Indeed, although the patient radiation dose should be taken into consideration, test-retest studies allow assessment of origins of SUV variability other than those investigated with the present study design, such as changes in plasma glucose level, in injected dose and in positioning [24]. Second, the SUV values investigated in the present study were higher than those of Burger et al.: 6.61–23.18 g.ml⁻¹ versus 3–16 g.ml⁻¹ (Fig. 2 in reference [8]) respectively. In a recent meta-analysis, de Langen et al. [25] showed that the lower the SUVmax, the greater the repeatability percentage (and hence the greater the SDr). Moreover, minimal lesion size was larger than 15 mm in order to minimise partial volume effects [10].
Therefore, it is suggested that a full separate analysis is required to more precisely assess SUVmax-N variability in small-size lesions and in lesions showing faint 18F-FDG uptake, the former requiring correction for partial volume effect and respiratory gating.

To conclude, this study shows that averaging SUV from several hottest voxels (SUVmax–N) significantly improves the SUV variability performance in comparison with SUVmax and SUVpeak, by about 1/3 for N = 30. This averaging can be very easily implemented clinically, and we therefore recommend the use of SUVmax–N in current clinical practice to improve the accuracy of SUV values for predicting outcome or for assessing treatment response. However, further studies are warranted to determine the optimal total hottest volume to be used.