Introduction

Imaging biomarkers must be objective, quantitative, accurate and reproducible to be useful in the era of molecular and personalized medicine [1]. The 18F-fluorodeoxyglucose positron emission tomography (FDG PET) maximum standardized uptake value (SUVmax) has been used as a prognostic, predictive and sometimes diagnostic imaging parameter in oncology [2–7]. However, SUVmax, representing single-pixel metabolic information about the tumor, may not accurately represent the tumor volume or biology. Imaging parameters that measure FDG metabolic tumor volume (MTV) and total glycolytic activity (TGA) of the tumor are emerging as exploratory imaging biomarkers for prediction of outcome and therapy response in patients with solid tumors [8–14], adding value to current staging prognoses [15].

Studies have demonstrated that tumor volume segmentation depends on the tumor-to-background ratio, image resolution, image noise, the size of the tumor and the segmentation method [16–18]. Manual contouring for tumor volume delineation depends on the experience of the reader and leads to high variability [19]. Semiautomatic segmentation methods have less variability than manual methods [11]. However, previous studies of intra-reader and inter-reader reliability of MTV using semiautomatic segmentation methods are mostly limited to phantoms [16]. The intra-reader and inter-reader reliability of TGA, to our knowledge, has not been studied.

The objective of this exploratory study is to establish the effects of primary tumor size and segmentation methods on intra-reader reliability of primary tumor MTV and TGA in head and neck, lung, and breast cancers.

Methods

Patients and study design

This study is a retrospective evaluation of positron emission tomography/computed tomography (PET/CT) images. Approval from our institutional review board was granted with a waiver of informed consent. All patients with lung, head and neck, and breast cancers, who had a baseline PET/CT at our institution in 2009 were included in the study. These cancer types were selected since the tumor and background tissue FDG uptakes provide a mixture of signal-to-noise characteristics representative of day-to-day clinical practice. Patients who had any local or systemic therapy were excluded. The study population included 121 patients (mean age of 60.9 ± 12.7 years).

PET/CT

All PET/CT studies were performed on a GE Discovery STE 16 (GE Healthcare, Milwaukee, WI, USA) PET/CT scanner according to the institutional standard clinical protocol. A dedicated head and neck protocol was instituted for all head and neck cancer patients. For patients with oral and oropharyngeal cancers, the scans were done from skull base to aortic arch with the arms down and clavicle to mid-thigh with the arms up. For patients with breast and lung cancers, the scans were done from skull base to mid-thigh with the arms up. The average patient blood glucose level was 102.5 mg/dl (SD 25.0). Patients were injected with a mean dose of 13.4 mCi (SD 3.1) of FDG and incubated for an average period of 92.9 min (SD 25.3).

When intravenous contrast was used (n = 66, 54.5 %), torso (mid-thigh to clavicle) and then neck (skull base to clavicle) were imaged regardless of the location of the primary tumor. About 100 ml of Optiray IV (Tyco Healthcare/Mallinckrodt, Hazelwood, MO, USA) with a 30 ml saline chaser was injected using a power injector (GE Healthcare, Milwaukee, WI, USA) at 3 ml/s for torso and a second bolus of 60 ml of Optiray IV (Tyco Healthcare/Mallinckrodt, Hazelwood, MO, USA) was used for head and neck images.

Image analysis

All PET/CT studies were retrieved from the electronic archival system and reviewed on a MIMvista workstation (software version 4.1) (MIM Software Inc., Cleveland, OH, USA). To establish the intra-reader reliability, 1 author (RS), with board certification in radiology and nuclear medicine fellowship and 3 years of experience as faculty, was chosen. There was a 12-week interval between the first and second image analyses. The images were read in random order to minimize bias.

PET, CT, and fused PET/CT images were reviewed in axial, coronal, and sagittal planes. For the purposes of this study, the relevant imaging parameter measurements were the primary tumor MTV and TGA segmented from PET. MTV was defined as the tumor volume with FDG uptake segmented by a gradient-based (PET edge) method or a fixed threshold method at 38 and 50 % of SUVmax. The TGA was defined as (MTV) × (SUVmean). The commercially available MIMvista software analysis suite (MIM Software Inc., Cleveland, OH, USA) includes a contouring suite for radiation therapy planning and a PET/CT fusion suite. Once the primary tumor (target) was segmented, MTV and TGA were automatically calculated by the MIMvista software.

Segmentation methods

There are many methods of segmentation proposed for PET tumor volumetric measurements [20–23]. The gradient and threshold segmentation methods of volume measurement available in MIMvista software have been previously described [15, 16]. The gradient segmentation method in the MIMvista software relies on an operator-defined starting point near the center of the lesion. As the operator drags the cursor out from the center of the lesion, six axes extend out, providing visual feedback for the starting point of gradient segmentation. Spatial gradients are calculated along each axis interactively, and the length of an axis is restricted when a large gradient is detected along that axis. The six axes define an ellipsoid that is then used as an initial bounding region for gradient detection.
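The idea of restricting an axis where a large gradient is detected can be illustrated in one dimension. The following is a minimal sketch, not the MIMvista algorithm: the function name `edge_index` and the sample SUV profile are hypothetical, and the rule used here (stop at the single steepest drop along the ray) is an assumption chosen for simplicity.

```python
def edge_index(profile):
    """Return the index i where the drop profile[i] - profile[i + 1] is largest.

    `profile` holds uptake values sampled from the lesion centre outward
    along one axis. The boundary is taken to lie between samples i and i + 1.
    """
    if len(profile) < 2:
        raise ValueError("need at least two samples along the axis")
    drops = [profile[i] - profile[i + 1] for i in range(len(profile) - 1)]
    return max(range(len(drops)), key=drops.__getitem__)


# Hypothetical SUV samples along one axis, centre first.
ray = [9.8, 9.5, 9.0, 7.0, 3.0, 1.4, 1.2, 1.1]
boundary = edge_index(ray)  # steepest fall lies between samples 3 and 4

# Adding a constant background to every sample leaves the edge location unchanged,
# because only the change between neighbouring samples is used.
shifted = [v + 2.0 for v in ray]
assert edge_index(shifted) == boundary
```

Because the rule uses the location of the count change and not its absolute level, a uniform shift in background uptake does not move the detected edge in this sketch.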

The fixed SUVmax threshold contouring method includes all voxels that are greater than a defined percentage of the maximum voxel within an operator-defined sphere (in this study 38 and 50 %). We used 38 and 50 % of SUVmax because these thresholds have been previously reported. Cross-sectional circles are displayed in all 3 projections (axial, sagittal, and coronal) to ensure three-dimensional coverage of the primary tumor. The edges of the primary tumor are automatically calculated and outlined in both segmentation methods.
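The fixed threshold rule and the definition TGA = MTV × SUVmean can be sketched as follows. This is an illustration under stated assumptions, not the vendor implementation: the function name `threshold_mtv_tga`, the flat list of voxel values, and the voxel volume of 0.5 ml are all hypothetical.

```python
def threshold_mtv_tga(suv_voxels, voxel_volume_ml, fraction):
    """Segment by a fixed fraction of SUVmax and return (MTV, SUVmean, TGA).

    suv_voxels      -- SUV values of the voxels inside the operator-defined sphere
    voxel_volume_ml -- volume of one voxel in ml (assumed uniform)
    fraction        -- threshold as a fraction of SUVmax, e.g. 0.38 or 0.50
    """
    if not suv_voxels:
        raise ValueError("no voxels supplied")
    suv_max = max(suv_voxels)
    if suv_max <= 0:
        raise ValueError("SUVmax must be positive")
    cutoff = fraction * suv_max
    # Keep every voxel strictly greater than the cutoff.
    selected = [v for v in suv_voxels if v > cutoff]
    mtv = len(selected) * voxel_volume_ml
    suv_mean = sum(selected) / len(selected)
    tga = mtv * suv_mean  # TGA = MTV x SUVmean
    return mtv, suv_mean, tga


# Hypothetical voxel values; each voxel is assumed to be 0.5 ml.
voxels = [10.0, 8.0, 6.0, 4.5, 3.0, 1.0]
mtv38, mean38, tga38 = threshold_mtv_tga(voxels, 0.5, 0.38)
mtv50, mean50, tga50 = threshold_mtv_tga(voxels, 0.5, 0.50)
```

In this toy example the 38 % threshold keeps four voxels and the 50 % threshold keeps three, so the lower threshold yields the larger MTV and the lower SUVmean, which is the direction seen in the read volumes reported in the figure legends.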

Statistical methods

We present our summary statistics as the mean ± standard deviation for continuous variables, or frequency and percentage for categorical variables. Reproducibility of MTV and TGA was measured using the intraclass correlation coefficient (ICC) as generated by a two-way mixed effects model and reported as a point estimate with a 95 % confidence interval (95 % CI). The ICC ranges between 0 and 1.00, with values closer to 1.00 representing better intra-reader reliability. ICC is interpreted as follows: 0.00–0.20 indicates slight reproducibility, 0.21–0.40 indicates fair reproducibility, 0.41–0.60 indicates moderate reproducibility, 0.61–0.80 indicates substantial reproducibility, and >0.80 indicates almost perfect reproducibility [24]. We also investigated the reliability with Bland–Altman analyses [25].
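For readers who wish to reproduce this type of analysis, the two reliability measures can be sketched as below. Several choices are assumptions, since the text does not specify them: the Bland–Altman difference is expressed as first read minus second read, as a percentage of the mean of the two reads; limits of agreement use bias ± 1.96 SD; and the ICC is the single-measure consistency form of the two-way mixed model. The function names and the `reads_a` values are hypothetical. The two worked pairs are the gradient read volumes quoted in the legends of Figs. 1 and 2.

```python
from statistics import mean, stdev


def percent_difference(first, second):
    """Difference between two reads as a percentage of their mean."""
    return 100.0 * (first - second) / ((first + second) / 2.0)


def bland_altman_percent(read1, read2):
    """Return (bias, SD, (lower, upper) limits of agreement) in percent."""
    diffs = [percent_difference(a, b) for a, b in zip(read1, read2)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, sd, (bias - 1.96 * sd, bias + 1.96 * sd)


def icc_consistency(read1, read2):
    """Single-measure consistency ICC from a two-way ANOVA on two reads."""
    n, k = len(read1), 2
    rows = list(zip(read1, read2))
    grand = mean(v for row in rows for v in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(read1), mean(read2)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between tumors
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between reads
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


# Gradient read volumes (ml) from the legends of Figs. 1 and 2.
large_pct = percent_difference(328, 311)   # about 5.3 %
small_pct = percent_difference(6.7, 13)    # about -64 %

# Hypothetical reads: a constant offset between reads gives a consistency ICC of 1.
reads_a = [1.0, 2.0, 3.0, 4.0]
icc_offset = icc_consistency(reads_a, [v + 1.0 for v in reads_a])
icc_noisy = icc_consistency(reads_a, [1.0, 2.0, 3.0, 5.0])
```

The two figure examples show why a percentage scale is informative: an absolute difference of 17 ml in a large tumor is a small relative difference, whereas 6.3 ml in a small tumor is a large one.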

To investigate the impact of partial volume effect on intra-reader reliability, we divided the cohort into small versus large volume tumors with a mean MTV cut point of 10 ml (n = 69, 10 ml or greater) (Figs. 1, 2). The cut point of 10 ml was chosen because the partial volume effect is greater for tumors <2 cm. Between-group analyses were performed using a t test and analysis of variance (ANOVA) with Friedman test and post hoc multiple pair comparisons, as appropriate. We used the Prism 5 (GraphPad Software Inc, San Diego, CA, USA) and SPSS 19 (SPSS Inc, Chicago, IL, USA) statistical packages for all analyses, and all hypothesis tests were two-sided with a significance level of 0.05.
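The grouping step can be sketched as follows, assuming that "mean MTV" refers to the mean of the first and second read for each tumor. The function name `split_by_mean_mtv` is hypothetical; the two example pairs are the gradient read volumes from the legends of Figs. 1 and 2.

```python
def split_by_mean_mtv(pairs, cut_ml=10.0):
    """Split (first_read, second_read) MTV pairs into small and large groups.

    A tumor is 'large' when the mean of its two reads is cut_ml or greater.
    """
    small, large = [], []
    for first, second in pairs:
        group = large if (first + second) / 2.0 >= cut_ml else small
        group.append((first, second))
    return small, large


# Gradient read volumes (ml) from the legends of Figs. 1 and 2.
small, large = split_by_mean_mtv([(328, 311), (6.7, 13)])
```

Under this assumption the tumor in Fig. 2 falls in the small group, since the mean of its two gradient reads is below 10 ml even though the second read alone exceeds the cut point.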

Fig. 1

Large tumor volumetric segmentation and reader reliability: a 60-year-old male with a large stage IV tumor in the upper lobe of right lung. a Axial, b sagittal, c coronal PET and corresponding d, e and f fused PET/CT images. The metabolic volumetric contours demonstrated in this figure were drawn using gradient segmentation. The first and second read volumes were 328 and 311 ml. For 38 % SUVmax segmentation, the corresponding first and second read volumes were 259 and 247 ml and for 50 % SUVmax segmentation, volumes were 168 and 155 ml

Fig. 2

Small tumor volumetric segmentation and reader reliability: a 63-year-old female with a small, stage IB tumor in the upper lobe of left lung. a Axial, b sagittal, c coronal PET and corresponding d, e and f PET/CT. The metabolic volumetric contours demonstrated in this figure were drawn using gradient segmentation. The first and second read volumes were 6.7 and 13 ml. For 38 % SUVmax segmentation, the corresponding first and second read volumes were 7.9 and 8.3 ml and for 50 % SUVmax segmentation, volumes were 6.2 and 4.7 ml

Results

Patients

There were 67 women (55.4 %) and 54 men (44.6 %) in the study. The American Joint Committee on Cancer (AJCC 6th edition) stage of the cancers included 15 stage I patients (12.4 %), 24 stage II patients (19.8 %), 40 stage III patients (33.1 %), and 42 stage IV patients (34.7 %). Patient demographics are summarized in Table 1.

Table 1 Characteristics of the patients included in the study

Intra-reader MTV reliability

For MTV, the ICCs were 0.97 for the gradient method, 0.98 for the 38 % threshold, and 0.99 for the 50 % SUVmax threshold method (p value <0.001 for all point estimates). The Bland–Altman analysis showed a bias of 2.95 % (SD 32.5 %) for gradient, 14.76 % (SD 34.5 %) for the 38 % SUVmax, and 11.13 % (SD 31 %) for the 50 % SUVmax threshold MTV segmentations (p < 0.0001).

For small lesions (<10 ml of mean MTV), the biases were −1.53 % (SD 40 %) for gradient, 22.12 % (SD 46.5 %) for the 38 %, and 16.4 % (SD 42.2 %) for the 50 % SUVmax threshold MTV segmentations (p < 0.0006).

For large lesions (>10 ml of mean MTV), biases were 3.9 % (SD 21.9 %) for gradient, 9.32 % (SD 20.4 %) for the 38 %, and 6.55 % (SD 17.9 %) for the 50 % SUVmax threshold MTV segmentations (p < 0.01) (Table 2; Fig. 3). There were no statistically significant differences in the biases between the first and second readings for MTV segmented into small and large volume tumors by the gradient method (p < 0.34) or 50 % SUVmax method (p < 0.08). However, there were statistically significant differences in the corresponding biases for the 38 % SUVmax threshold segmentation (p < 0.04) (Fig. 4).

Table 2 MTV and TGA—Bland–Altman analysis: effect of metabolic volume of primary tumor and segmentation methods
Fig. 3

Intra-reader reliability of segmentation of MTV: patients were divided into those with a mean MTV < 10 ml and 10 ml or greater. The biases were significantly smaller for larger tumors (3.9–9.32 %) than for smaller tumors (−1.53 to 22.12 %)

Fig. 4

Intra-reader reliability of segmentations of TGA: patients were divided into those with a mean MTV < 10 ml and 10 ml or greater. The bias was smaller for larger tumors (2.42–6.32 %) than smaller tumors (−4.26 to 16 %), though not statistically significant

Intra-reader TGA reliability

The ICC for gradient, 38 %, and 50 % SUVmax threshold TGA was 0.99 for all methods (p < 0.001 for all point estimates). The Bland–Altman analysis showed a bias of 0.76 % (SD 30.3 %) for gradient, 10.36 % (SD 37.9 %) for the 38 % SUVmax, and 7.46 % (SD 34.6 %) for the 50 % SUVmax threshold TGA segmentations (p < 0.03).

For lesions <10 ml of mean MTV, the biases were −4.26 % (SD 39.8 %) for gradient, 16 % (SD 46.4 %) for the 38 % SUVmax, and 8 % (SD 49.6 %) for the 50 % SUVmax threshold segmentations (p < 0.0005).

For lesions >10 ml of mean MTV, the biases were 2.42 % (SD 17.8 %) for gradient, 6 % (SD 29.8 %) for the 38 % SUVmax, and 6.32 % (SD 17 %) for the 50 % SUVmax threshold TGA segmentations (p < 0.06). There were no statistically significant differences in the biases between the first and second reads for TGA segmented for small and large volume tumors by the gradient method (p < 0.22), the 38 % SUVmax (p < 0.18) or the 50 % SUVmax threshold segmentation (p < 0.98).

Discussion

In this study, we sought to assess the intra-reader variability of FDG metabolic parameters (primary tumor metabolic tumor volume and total glycolytic activity) using two different semiautomatic segmentation methods. Our results show that overall bias was smallest for gradient segmentation of MTV and TGA, followed by the 50 % SUVmax and then the 38 % SUVmax threshold segmentations. We further demonstrate that the biases for MTV for smaller primary tumors are significantly greater for the 38 and 50 % fixed SUVmax threshold segmentation methods than for the gradient segmentation method. No significant difference in bias was noted for segmentation of TGA between small and large volume tumors using the gradient and fixed threshold methods, though the bias for gradient segmentation is less than for threshold segmentation.

New imaging parameters need to progress as quantitative biomarkers for therapy response assessment and for patient outcome. The reproducibility, reader reliability and variability must be established for emerging parameters to become clinically useful biomarkers. The small intra-reader variability demonstrated in this study for the gradient method is likely due to the semiautomated feature of the software algorithm. For gradient segmentation, the reader needs to point at the center of the tumor and drag towards the periphery as the software program creates six radii to identify the edge of the tumor. This reader intervention is likely the cause of the variability between the reads. There was no statistically significant difference in the reader bias for TGA between the small and large tumors, whereas there was a difference for MTV. Since TGA is calculated by multiplication of MTV and SUVmean, bias effects may be reduced in TGA. It is also important for the results to be generalizable, as the FDG uptake in the background tissue can vary and may affect the segmentation. Our study included head and neck, lung and breast tumors because the background uptake in normal lung, breast tissue and neck tissue can be variable.

The gradient method is a function of local relative change in image count levels at the tumor-normal tissue boundary. It is not sensitive to varying background counts as it depends on the location rather than the magnitude of the count change [16]. Previous phantom studies have demonstrated, for segmentation of MTV, that gradient segmentation is more accurate than threshold segmentation (using SUVmax of 45 %) for spheres <20 mm (p < 0.015) [16]. These authors further demonstrated that inter-observer variability is significantly reduced when gradient segmentation is used compared to manual or a 25 % SUVmax threshold segmentation.

Tumor size has been shown to have an effect on the reliability of metabolic volume segmentation [16, 26]. The fixed SUVmax percentage threshold segmentation depends on the SUVmax and the set threshold. There is no best single threshold for all volumes. As smaller volumes require a larger threshold because of partial volume effects, the optimum threshold is a function of volume [16]. Furthermore, fixed segmentation suffers from its binary deterministic nature and lack of robustness for varying tumor to background contrast [18]. Our study confirms that the lower intra-reader reliability seen with smaller volumes in phantom studies is also seen in patients, especially for segmenting MTV.

Our study results need to be interpreted in the context of the study design. As an exploratory study, we used a single reader with 3 years of faculty experience reading PET/CT and 18 months of experience in volumetric segmentations. We have previously demonstrated that reader training and experience have an impact on reader reliability [27]. Further studies are needed to assess the applicability of our results to different malignancies using multiple readers, with varied experience, working with semiautomated segmentation programs, and to establish the intra- and inter-reader reliability of MTV and TGA. The impact of intravenous contrast was not studied, as the segmentation methods were based on the PET images and IV contrast has no significant effect on PET images.

In conclusion, the intra-reader reliability of MTV and TGA depends on the segmentation methods and size of the tumor. Gradient segmentation has less bias than fixed threshold segmentation for MTV, especially for smaller volumetric tumors. Intra-reader variability for TGA is lower than for MTV for both segmentation methods in small and large volume tumors. Further studies are needed to validate these exploratory findings and conclusions.