Introduction

Nasopharyngeal carcinoma is a rare malignancy with highly varying geographic and racial distribution worldwide [1]. It occurs significantly more commonly in northern Africa and southeastern Asia, especially in southern China, with a peak annual incidence approaching 30 per 100,000 persons [2]. 2-Deoxy-2-[18F]fluoro-d-glucose ([18F]FDG) positron emission tomography/computed tomography (PET/CT) imaging has been established as a powerful technique for diagnosis and staging in oncology [3] and is also used in nasopharyngeal carcinoma [4–7]. Due to the intense physiologic uptake of [18F]FDG in the brain, [18F]FDG PET/CT lacks sensitivity for T staging, and [11C]methylcholine ([11C]choline) PET/CT has been developed as a complementary modality for T staging of nasopharyngeal carcinoma [7]. For PET/CT assessment using both [18F]FDG and [11C]choline, standardized uptake values (SUVs) have been the most widely adopted index in routine clinical oncology to support diagnosis, prognosis, and therapy response assessment.

Meanwhile, due to the high intratumor heterogeneity of malignant tumors, radiomic features have been increasingly applied to delineating tumors, stratifying risk, and assessing tumor response to therapy in different malignancies [8]. Several groups have studied metrics that quantify intratumor PET uptake heterogeneity, with special focus on [18F]FDG imaging [9–22]. Radiomic analysis of tumors can be achieved by using statistical, shape-based, and/or textural feature analysis, including first-, second-, and higher-order methods of increasing complexity.

At the same time, radiomic features are affected by many factors, such as PET image acquisition, reconstruction, post-smoothing, tumor delineation, and gray-scale resampling. Thus, some studies have investigated the robustness of radiomic features with respect to different image processing methods. Doumou et al. tested the effects of image smoothing, segmentation, and quantization on the precision of heterogeneity measurements in esophageal cancer [23]. Hatt et al. assessed the robustness of PET heterogeneity in textural features for delineation of functional volumes and partial volume correction (PVC), also in esophageal cancer [24]. Tixier et al. evaluated the reproducibility of heterogeneity measurements on double-baseline [18F]FDG PET scans, again in esophageal cancer [19]. Galavis et al. studied the variability of the textural features in PET images due to different acquisition modes and reconstruction parameters for a cohort containing a wide range of cancers [25]. Van Velden et al. assessed the impact of reconstruction methods and delineation on the repeatability of texture features in non-small cell lung cancer (NSCLC) FDG PET/CT studies [26]. Vallieres et al. analyzed the influence of discretization on the predictive value of radiomic features in FDG PET and MRI scans of soft tissue sarcomas [27]. In addition, Willaime et al. assessed the repeatability of texture descriptors in test-retest 3′-deoxy-3′-[18F]fluorothymidine ([18F]FLT) PET scans of breast cancer prior to therapy [28].

In the present work, we focus on nasopharyngeal carcinoma, including both [18F]FDG and [11C]choline imaging. In routine clinical practice, reconstruction and post-smoothing parameters are commonly fixed. There is considerably more room for varying parameters in image analysis and quantification, including image delineation/segmentation and discretization, so we have focused on these latter aspects. Specifically, there is increasing interest in the use of automatic segmentation for routine clinical assessment [29]. Furthermore, the calculation of texture features is also affected by discretization (the bin size of the SUV range). A smaller bin size yields more accurate results but is more computationally intensive. To the best of our knowledge, the effects of delineation/segmentation as well as discretization on the robustness of high-throughput radiomic features from imaging using different PET tracers have not been evaluated together. Robustness analysis for nasopharyngeal carcinoma also appears completely absent.

The objective of the present work was therefore to evaluate the impact of segmentation and gray-scale discretization on radiomic features in both [18F]FDG and [11C]choline PET images of nasopharyngeal carcinoma. The consensus of three expert physicians for delineation was used as reference. The following were also adopted: 41, 50, and 70 % SUVmax threshold, with background correction, Nestle’s, gradient-based watershed, and region growing methods. Subsequently, the segmented tumors were discretized with a fixed bin size in units of SUV (i.e., 0.05, 0.1, 0.2, 0.5, 1). A total of 88 radiomic features were evaluated, including 31 features based on SUV (intensity), shape, and histogram, as well as 57 second- and higher-order textural features.

Materials and Methods

Patients and [18F]FDG and [11C]Choline PET/CT Protocols

Forty patients (including 34 men and 6 women, with a mean age of 46 years [range, 17–75 years]) with nasopharyngeal carcinoma were enrolled in this study. All of the patients were diagnosed by histopathology with nonkeratinizing undifferentiated carcinoma.

All examinations were performed on the GE Discovery LS PET/CT scanner complying with the Society of Nuclear Medicine and Molecular Imaging (SNMMI) procedure guidelines [30] for tumor PET imaging at the Nanfang Hospital of Southern Medical University at Guangzhou, Guangdong. All patients underwent fasting for at least 6 h prior to tracer injection.

Among the 40 patients, 9 patients (including seven men and two women, with a mean age of 50.13 years [range, 40–75 years]) underwent additional [11C]choline scan on a different day. The maximum time interval between the two tracer studies of each patient was 3 days (more detailed time intervals were given in supplemental material Table 2), and no treatment was performed between the two scans. About 60 min (59 ± 3 min, range [53–62 min]) after the intravenous injection of 315–511 MBq (8.49–13.81 mCi) of [18F]FDG (∼150 μCi/kg of body weight), whole-body PET/CT was performed. The patients also underwent regional PET/CT of the head and neck at approximately 10 min (11 ± 2 min, range [7–13 min]) after the intravenous injection of 370–740 MBq (10.0–20.0 mCi) of [11C]choline, as described in [7].

PET images were reconstructed using standard ordered-subset expectation maximization (OSEM) with four iterations and 16 subsets, with a PET voxel size of 4 × 4 × 4.25 mm3 and a matrix size of 128 × 128. They were then interpolated to the CT resolution (voxel size of 0.98 × 0.98 × 2 mm3, matrix size of 512 × 512). The CT scans (80 mA, 140 kVp) were used for attenuation correction [31]. The body weight SUVs were calculated according to the following equation:

$$ \mathrm{SUV}\left(\mathrm{g}/\mathrm{mL}\right)=\frac{\mathrm{tissue}\;\mathrm{activity}\;\left(\mathrm{Bq}/\mathrm{mL}\right)}{\mathrm{injected}\;\mathrm{dose}\left(\mathrm{Bq}\right)/\mathrm{body}\;\mathrm{weight}\left(\mathrm{g}\right)} $$
(1)

where the tissue activity was decay-corrected to account for the time elapsed between injection and acquisition.
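As an illustration of Eq. 1, the following minimal Python sketch computes a body weight SUV. It assumes that the tissue activity passed in is the value measured at acquisition, so that decay correction back to injection time is applied inside the function; the function name, argument names, and the approximate half-life constants are illustrative and are not taken from the software used in this study.

```python
# Approximate physical half-lives in minutes (rounded literature values).
F18_HALF_LIFE_MIN = 109.77
C11_HALF_LIFE_MIN = 20.4


def body_weight_suv(tissue_bq_per_ml, injected_bq, weight_g,
                    minutes_since_injection, half_life_min):
    """Body weight SUV (g/mL) following Eq. 1.

    Assumption: tissue_bq_per_ml is the activity measured at acquisition,
    so it is scaled back to injection time before normalization.
    """
    decay_factor = 2.0 ** (minutes_since_injection / half_life_min)
    corrected_activity = tissue_bq_per_ml * decay_factor
    return corrected_activity / (injected_bq / weight_g)


# Hypothetical example: 370 MBq injected into a 74 kg patient,
# scanned exactly one half-life after injection.
suv = body_weight_suv(5000.0, 370e6, 74000.0,
                      F18_HALF_LIFE_MIN, F18_HALF_LIFE_MIN)
print(suv)
```

If a scanner already exports decay-corrected activity, the decay factor should be omitted (for example by passing 0 for the elapsed time).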

Tumor Segmentation

Seven tumor segmentation methods were considered in the present study. Each lesion was first delineated using manual segmentation by three expert physicians on the [18F]FDG and [11C]choline-PET/CT images, respectively, and the consensus of three observers was used for subsequent analysis.

Three further methods applied thresholds of 41, 50, and 70 % of SUVmax with background correction, as follows [32]:

$$ \mathrm{Threshold} = T\times \left({\mathrm{SUV}}_{\max }+{\mathrm{SUV}}_{\mathrm{background}}\right) $$
(2)

where T is the relative threshold, and SUVbackground was obtained from normal brain region with a box size of 64 × 64.
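A minimal sketch of Eq. 2 in Python is given below, using a flat list of voxel SUVs for simplicity. The function names are hypothetical, and the choice to include voxels equal to the threshold (>= rather than >) is an assumption, since the text does not specify it.

```python
def background_corrected_threshold(suv_max, suv_background, relative_threshold):
    """Absolute SUV threshold following Eq. 2 as written in the text."""
    return relative_threshold * (suv_max + suv_background)


def threshold_mask(suv_values, threshold):
    """Boolean mask of voxels kept by the threshold (>= is an assumption)."""
    return [value >= threshold for value in suv_values]


# Hypothetical values: SUVmax = 10, brain background SUV = 2, T = 50 %.
threshold = background_corrected_threshold(10.0, 2.0, 0.5)
mask = threshold_mask([1.0, 6.0, 7.0, 10.0], threshold)
print(threshold, mask)
```

The same two functions cover the 41, 50, and 70 % variants by changing only the relative threshold.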

Nestle’s method [33] was also implemented using an adaptive threshold as described in Eq. 3:

$$ \mathrm{Threshold}=\beta \times {\mathrm{SUV}}_{\mathrm{mean}\_70\%{\mathrm{SUV}}_{\max }}+{\mathrm{SUV}}_{\mathrm{background}} $$
(3)

where β = 0.15, \( {\mathrm{SUV}}_{\mathrm{mean}\_70\%{\mathrm{SUV}}_{\max }} \) represents the mean SUV of all voxels whose value is greater than or equal to 70 % of the maximum SUV, and SUVbackground was again obtained from the normal brain region with a box size of 64 × 64.
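The adaptive threshold of Eq. 3 can be sketched in a few lines of Python. This is an illustrative reading of the equation on a flat list of voxel SUVs, not the implementation used in this study; the function and argument names are assumptions.

```python
def nestle_threshold(suv_values, suv_background, beta=0.15):
    """Adaptive threshold following Eq. 3.

    The mean is taken over voxels at or above 70 % of the maximum SUV.
    """
    suv_max = max(suv_values)
    hot_voxels = [v for v in suv_values if v >= 0.7 * suv_max]
    mean_hot = sum(hot_voxels) / len(hot_voxels)
    return beta * mean_hot + suv_background


# Hypothetical values: only 8 and 10 exceed 70 % of SUVmax = 10,
# so the mean is 9 and the threshold is 0.15 * 9 + 1.
print(nestle_threshold([2.0, 6.0, 8.0, 10.0], suv_background=1.0))
```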

In addition, a gradient-based watershed method was applied using the MITK 2.4.0.0 win64 software [34]. The last method was region growing [35], and the result was manually modified as necessary.

Feature Extraction

For each volume of interest (VOI), a total of 88 radiomic features were extracted in Matlab R2012a (The MathWorks Inc.) using an available radiomic analysis package (https://github.com/mvallieres/radiomics/) and software developed in-house, including 22 first-order intensity features, 9 shape features, and 57 second- and higher-order textural features, describing the intensity and spatial distribution of radiotracer uptake. The detailed mathematical definitions are provided in supplementary appendix A.

First-Order Statistical Features

First-order statistical features were based on the histogram of global-scale radiotracer uptake intensity distribution [36]. Among these, SUVmax and SUVmean are routinely used clinically, and SUVpeak was defined as the maximum average SUV within a small fixed-size (3 × 3 × 3) volume of interest centered on the SUVmax of the tumor [37]. Total lesion glycolysis (TLG), defined as the product of tumor volume and SUVmean, the variance of all voxel SUV values (SUVvar), the sum of all voxel SUV values squared (SUVenergy), and the area under the curve of the cumulative SUV-volume histogram (AUC-CSH) were also considered. In addition, the maximum, mean, minimum, median, range, mean absolute deviation (MAD), standard deviation (STD), and root mean square (RMS) of the intensity values were used. Furthermore, the mean, variance, skewness, kurtosis, energy, and entropy of the histogram (with 100 bins in this study) were extracted.
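Several of the first-order features above reduce to simple statistics over the voxel SUVs of a VOI, as the following Python sketch illustrates. It is not the package used in this study: the function name and dictionary keys are hypothetical, the population variance (division by N) is an assumption, and the authoritative definitions are those in supplementary appendix A.

```python
import math


def first_order_features(suv_values, voxel_volume_ml):
    """A subset of first-order features from a flat list of voxel SUVs."""
    n = len(suv_values)
    mean = sum(suv_values) / n
    sum_of_squares = sum(v * v for v in suv_values)
    return {
        "SUVmax": max(suv_values),
        "SUVmean": mean,
        # Population variance; the normalization is an assumption.
        "SUVvar": sum((v - mean) ** 2 for v in suv_values) / n,
        "SUVenergy": sum_of_squares,
        "RMS": math.sqrt(sum_of_squares / n),
        "MAD": sum(abs(v - mean) for v in suv_values) / n,
        # TLG = tumor volume (mL) x SUVmean.
        "TLG": n * voxel_volume_ml * mean,
    }


# Hypothetical four-voxel VOI with 0.5 mL voxels.
print(first_order_features([1.0, 2.0, 3.0, 4.0], voxel_volume_ml=0.5))
```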

Shape Geometric Features

Shape geometric features [27, 36] described the shape and size of the volume of interest. These included the following: metabolically active tumor volume (MATV), the ellipsoid that best fits the tumor region (eccentricity), ratio of the number of voxels in the tumor region to the number of voxels in the 3D convex hull of the tumor region (solidity), percent inactive (PI) tumor, surface area of tumor volume (SurfaceA), surface to volume ratio (SVratio), compactness, and sphericity (details in supplement).

Textural Features

To calculate the remaining 57 texture features, all volumes with voxel size of 0.98 × 0.98 × 2 mm3 were first isotropically resampled to initial in-plane resolution with a voxel size of 0.98 × 0.98 × 0.98 mm3 using cubic interpolation [27] and the VOI SUV range was then divided into a fixed bin size B (i.e., 0.05, 0.1, 0.2, 0.5, 1) in units of SUV as follows [38]:

$$ {\mathrm{SUV}}_{\mathrm{Dis}}(x)=\left\lceil \frac{\mathrm{SUV}(x)}{B}\right\rceil - \min \left(\left\lceil \frac{\mathrm{SUV}(x)}{B}\right\rceil \right)+1 $$
(4)

where SUV(x) is the original SUV of voxel x and SUVDis(x) is the resampled value of voxel x. The discretization step is necessary to generate the occurrence/probability matrices, whose sizes (defined by the maximum SUVDis(x)) strongly affect computation time. It also reduces image noise and yields a constant intensity resolution, so that textural features from different patients are comparable [18].
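Eq. 4 can be illustrated with a short Python sketch operating on a flat list of voxel SUVs. The function name is hypothetical; the arithmetic follows the equation directly.

```python
import math


def discretize_fixed_bin(suv_values, bin_size):
    """Fixed-bin-size discretization following Eq. 4.

    Returns integer gray levels starting at 1 for the lowest occupied bin.
    """
    binned = [math.ceil(value / bin_size) for value in suv_values]
    lowest = min(binned)
    return [level - lowest + 1 for level in binned]


voi = [2.0, 2.4, 3.1, 5.0]  # hypothetical voxel SUVs
print(discretize_fixed_bin(voi, 0.5))  # finer bins, more gray levels
print(discretize_fixed_bin(voi, 1.0))  # coarser bins, fewer gray levels
```

With a bin size of 0.5 the example VOI spans seven gray levels, while a bin size of 1 reduces it to four, which is why the size of the texture matrices and the resulting feature values depend on B.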

Four types of matrices revealing the spatial distribution of radiotracer uptake were computed from each VOI: the second-order gray-level co-occurrence matrix (GLCM) [39], counting the number of times each pairwise arrangement of voxel gray levels occurs in 13 directions of a 3D space; the higher-order gray-level run length matrix (GLRLM) [40], in which each element represents the number of occurrences of runs with a certain gray level and a certain run length in 13 directions of a 3D space; the gray-level size zone matrix (GLSZM) [39], describing the number of zones of a certain size having the same intensity within 26 connected neighbors in a 3D space; and the neighborhood gray-tone difference matrix (NGTDM) [41], characterizing the difference between a center voxel and its 26 connected neighbors. Subsequently, 57 textural features were extracted from these four matrices; thus, in total, 88 features were adopted in this study. Table 1 lists the 88 extracted features. Figure 1 illustrates, for an [18F]FDG PET/CT image, the steps of tumor segmentation, discretization, feature extraction, and robustness analysis. The same procedure was also performed on [11C]choline PET/CT images. Robustness analysis is discussed next.
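To make the GLCM construction concrete, the sketch below counts gray-level pairs over the 13 unique 3D directions and derives one feature (entropy) from the normalized matrix. It is a simplified illustration under stated assumptions, not the package used in this study: all 13 directions are pooled into one symmetric matrix, a gray level of 0 marks voxels outside the VOI, the entropy uses a base-2 logarithm, and the function names are hypothetical.

```python
import math
from itertools import product

# 13 unique directions: half of the 26 neighbor offsets in 3D.
DIRECTIONS_13 = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]


def glcm_3d(volume, n_levels):
    """Symmetric, normalized GLCM pooled over 13 directions.

    volume is a nested list indexed [z][y][x] holding gray levels
    1..n_levels; 0 marks voxels outside the VOI (an assumption).
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    counts = [[0] * n_levels for _ in range(n_levels)]
    for z, y, x in product(range(nz), range(ny), range(nx)):
        a = volume[z][y][x]
        if a == 0:
            continue
        for dz, dy, dx in DIRECTIONS_13:
            z2, y2, x2 = z + dz, y + dy, x + dx
            if not (0 <= z2 < nz and 0 <= y2 < ny and 0 <= x2 < nx):
                continue
            b = volume[z2][y2][x2]
            if b == 0:
                continue
            counts[a - 1][b - 1] += 1
            counts[b - 1][a - 1] += 1  # count each pair in both orders
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("VOI contains no neighboring voxel pairs")
    return [[c / total for c in row] for row in counts]


def glcm_entropy(p):
    """Entropy of a normalized GLCM (base-2 logarithm is an assumption)."""
    return -sum(v * math.log2(v) for row in p for v in row if v > 0)


# Hypothetical two-voxel VOI with gray levels 1 and 2.
print(glcm_entropy(glcm_3d([[[1, 2]]], n_levels=2)))
```

The matrix has one row and column per gray level, so its size grows as the bin size B shrinks, which is the link between discretization and computation noted above.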

Table 1 Features extracted from SUV, shape, histogram, and textural matrices
Fig. 1 General framework applied in this study. a Original PET image (dashed box represents rough initial region around the tumor). b Slicewise tumor segmentation and volume rendering. c Resampling of the segmented tumor. d Feature extraction. e Robustness analysis via the intraclass correlation (ICC).

Statistical Analysis

In order to evaluate the robustness of features due to different segmentation and discretization methods, the intraclass correlation coefficient (ICC) [42] was adopted and defined as follows:

$$ \mathrm{ICC}=\frac{\mathrm{BMS}-\mathrm{RMS}}{\mathrm{BMS}+\left(n-1\right)\times \mathrm{RMS}} $$
(5)

where BMS and RMS represent the between-subjects and residual mean squares and n is the number of segmentation methods or discretization bin size types.
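The following Python sketch shows one way Eq. 5 could be evaluated for a single feature. It assumes that BMS and RMS come from a two-way analysis of variance without replication, with patients as rows and segmentation methods (or bin sizes) as columns; this reading of the residual term, as well as the function name, is an assumption rather than a statement of the exact procedure in [42].

```python
def icc_from_table(table):
    """ICC following Eq. 5 for one feature.

    table[i][j] is the feature value for patient i under method j.
    BMS and RMS are taken from a two-way ANOVA (an assumption).
    """
    k = len(table)     # number of patients (subjects)
    n = len(table[0])  # number of methods or bin sizes
    grand = sum(map(sum, table)) / (k * n)
    row_means = [sum(row) / n for row in table]
    col_means = [sum(row[j] for row in table) / k for j in range(n)]

    # Between-subjects mean square.
    bms = n * sum((m - grand) ** 2 for m in row_means) / (k - 1)
    # Residual mean square after removing subject and method effects.
    ss_res = sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(k) for j in range(n)
    )
    rms = ss_res / ((k - 1) * (n - 1))
    return (bms - rms) / (bms + (n - 1) * rms)


# Hypothetical feature values: three patients, three methods that
# differ only by a constant offset, giving an ICC of about 1.
print(icc_from_table([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]))
```

Under this reading, a constant offset between methods does not lower the ICC, whereas changes in the ranking of patients across methods do.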

For each feature, intersegmentation (expert physician, 41, 50, and 70 % SUVmax thresholding with background correction, as well as Nestle’s method, gradient-based watershed method, and region growing method) ICC was computed with a fixed bin size of 0.1, because there was no obvious difference between features extracted with 0.1 and 0.05 bin sizes. Correspondingly, interdiscretization (0.05, 0.1, 0.2, 0.5, 1 bin sizes) ICC of the 57 second- and higher-order features was computed for the case of manual segmentation by an expert physician.

Results

Impact of Segmentation

Figure 2a depicts the ICC scatter plot for all features extracted from [18F]FDG and [11C]choline images with seven different segmentation methods. Forty-four (50 %) features extracted from [18F]FDG have ICC higher than 0.8, while 55 (62 %) features obtained from [11C]choline have ICC higher than 0.8. Table 2 lists 31 features obtained from both [18F]FDG and [11C]choline having ICC higher than 0.8 for segmentation.

Fig. 2 Scatter plot illustrating the ICC for a all 88 features extracted from [18F]FDG (red circles) and [11C]choline (blue blocks) images involving different segmentations. b Fifty-seven texture features affected by different discretization bin sizes (i.e., 0.05, 0.1, 0.2, 0.5, and 1) on manually segmented [18F]FDG (red circles) and [11C]choline (blue blocks) images (features on x-axis were placed according to the feature index of Table 1) (Color figure online).

Table 2 Thirty-one features (of total 88) having segmentation ICC ≥0.8 for both [18F]FDG and [11C]choline images

To evaluate the robustness of different types of features with respect to segmentation, we show the ICC box plot of six types of features for [18F]FDG (Fig. 3a) and [11C]choline (Fig. 3b). GLRLM features have the best performance on [18F]FDG (a narrow range of ICC close to 1), while GLCM and NGTDM perform best on [11C]choline. The first-order and shape features perform worst (a wide range of ICC from 0.1 to 1) on both [18F]FDG and [11C]choline.

Fig. 3 Box plot comparing segmentation ICC of all types of features on a [18F]FDG and b [11C]choline images (using 0.1 discretization bin size for second- and higher-order texture features).

Impact of Discretization

Discretization is necessary for calculating the 57 textural features derived from GLCM, GLRLM, NGTDM, and GLSZM. We chose manually segmented tumors to analyze the impact of discretization bin size (i.e., 0.05, 0.1, 0.2, 0.5, and 1) on these features. Figure 2b depicts the ICC scatter plot for the 57 features extracted from [18F]FDG and [11C]choline images with all different discretization bin sizes. Thirteen (23 %) features extracted from [18F]FDG had ICC higher than 0.8, and for [11C]choline, this was the case with 12 (21 %) features. Of these, ten features were shared between [18F]FDG and [11C]choline with ICC ≥0.8, as listed in Table 3.

Table 3 Ten features (of total 57) having discretization ICC ≥0.8 for both [18F]FDG and [11C]choline images

Figure 4 depicts the ICC box plots for the four types of features (GLCM, GLRLM, GLSZM, and NGTDM) on [18F]FDG and [11C]choline images. Most features show a wide range of ICC with a median value smaller than 0.5, especially for GLSZM features, which results in the smallest ICC with a median value smaller than 0.2 and a maximum value smaller than 0.6.

Fig. 4 Box plot comparing discretization ICC values of four types of features on manually segmented a [18F]FDG and b [11C]choline images.

Simultaneous Feature Robustness for Segmentation and Discretization

For the 57 second- and higher-order textural features, Fig. 5 plots ICC of segmentation versus discretization in the cases of [18F]FDG (Fig. 5a) and [11C]choline (Fig. 5b), with each dot representing a specific feature. In the case of [18F]FDG, eight features exhibited robustness wherein ICC of segmentation and discretization both exceeded 0.8. In the case of [11C]choline, this was the case for 11 features. Segmentation ICC for most features showed a narrower distribution in the range of 0.6 to 1 for both tracers, while the discretization ICC distribution spanned the entire range of 0 to 1. A two-sample Student’s t test applied to the ICC of these 57 features showed that segmentation ICC values were significantly higher than discretization ICC values in both [18F]FDG and [11C]choline images (p < 0.0001).

Fig. 5 Scatter plot of ICC of segmentation (Seg) versus discretization (Dis) on a [18F]FDG and b [11C]choline images. The yellow region defines features depicting both ICC (Seg) and ICC (Dis) equal or higher than 0.8.

Figure 6 lists the ICC values of the individual 57 features for both tracers. It is seen that for GLRLM, GLSZM, and NGTDM features, ICC values for discretization are most commonly smaller than those of segmentation. However, in the case of GLCM features, this is not the case, and some exhibit high ICC values for both segmentation and discretization. In fact, except for those features of GLCM, in the case of [18F]FDG, only one feature of NGTDM (Coarseness) and one feature of GLRLM (RP) showed both ICC values greater than 0.8, and in [11C]choline, only one NGTDM feature (Coarseness) exhibited a similar pattern. Finally, Table 4 lists the six features obtained from both [18F]FDG and [11C]choline having ICC higher than 0.8 for both segmentation and discretization, five of which were GLCM features (SumEntropy, Entropy, DifEntropy, Homogeneity1, and Homogeneity2) and one of which was an NGTDM feature (Coarseness).

Fig. 6 Bar chart illustrating 57 features for segmentation and discretization on a [18F]FDG and b [11C]choline images.

Table 4 Six features having ICC ≥0.8 for both segmentation and discretization in both [18F]FDG and [11C]choline images

Discussion

An increasing number of reports have investigated the use of radiomic features in oncologic [18F]FDG PET imaging. Radiomic features can reflect the heterogeneity of FDG uptake and have the potential to enhance prediction of response to chemoradiotherapy [43], patient outcome [13], and histopathology [44]. At the same time, radiomic features are influenced by different imaging processes, and there have been some efforts, as also mentioned in the introduction, to investigate sensitivity of radiomic features to different parameters/modes, data acquisition, reconstruction, and analysis [19, 23–26, 45]. So far, to our knowledge, there have been no studies of robustness of radiomic features in PET imaging of nasopharyngeal carcinoma or, in fact, any radiomic analysis for [11C]choline PET.

In the present study, the impacts of different segmentation and discretization methods on radiomic features for both [18F]FDG and [11C]choline PET images of nasopharyngeal carcinoma were assessed. Our results show that segmentation methods have a smaller effect on radiomic features than discretization methods for both [18F]FDG and [11C]choline PET imaging. A similar finding was reported in a different context, namely [18F]FDG PET scans of esophageal cancer [23]. It is also worth noting that the effects of modifying resampling bin widths on features for [18F]FDG and [11C]choline imaging are quite similar (as shown in Fig. 4). However, for different segmentation methods, more features extracted from [11C]choline PET exhibited an ICC equal to or greater than 0.8 than features derived from [18F]FDG PET, as shown in Fig. 3.

Overall, features derived from GLCM and NGTDM show good robustness with respect to both segmentation and discretization, and these can be pursued for translation to clinical applications: namely, SumEntropy, Entropy, DifEntropy, Homogeneity1, Homogeneity2, and Coarseness (as listed in Table 4). The Pearson correlation coefficient was used to assess the correlation of these features, as shown in supplemental material Table 1. Strong correlations were found among Entropy_GLCM, DifEntropy_GLCM, SumEntropy_GLCM, Homogeneity1_GLCM, and Homogeneity2_GLCM. However, Coarseness displayed poor correlation (−0.17–0.23) with the other features, indicating that it contained different information from the other features. Some of these features have also been reported as robust to image processing and associated with therapy response on [18F]FDG PET images. For instance, Homogeneity_GLCM (Homogeneity2 in the present study) also presented small variability between different segmentation methods [23], was very robust with respect to delineation and PVC [24], and depicted good reproducibility on test-retest scans [19]. This can also explain the enhanced performance of such metrics in correlation with clinical measures [46]. We should also note that the same texture feature name does not always correspond to the same definition in different publications [47]. Therefore, we considered a wider range of texture feature definitions in the present work. Specifically, variance, homogeneity, correlation, and SumAverage each included two different definitions (as listed in Table 1 and described in supplemental material). Both Homogeneity1 and Homogeneity2 depicted good robustness to segmentation and discretization (as listed in Table 4).

Though many features were not robust with respect to discretization in our study, some of them were reported as potentially useful for clinical applications. For instance, a study [13] reported that energy was significantly associated with treatment failure in cervical cancer (AUC = 0.72), while contrast was significantly associated with overall survival in head and neck cancer (AUC = 0.80), and yet, energy and contrast were highly influenced by discretization in our study (ICC = 0.18 and 0.16, respectively). As such, texture features may work very well using a certain bin size, and nonrobustness with respect to bin size does not necessarily render them useless for clinical applications if appropriate settings/parameters are consistently observed. However, this is less likely for segmentation, as it is more likely to vary in the hands of different users and institutions.

As a point of caution, it must be emphasized [13, 48, 49] that aside from image processing parameters, performance and robustness of radiomic features may also depend on specific clinical context and variables, which can have a confounding effect. This includes the impact of volume [50], the partial volume effect [51], etc. As a result, closer attention to standardization of PET acquisition protocols and pre-processing steps, while accounting for possible clinical confounding effects, is being sought in efforts by the community to properly account for these related issues.

Since some of the radiomic features extracted from [18F]FDG have shown the ability to assess response to therapy in other works [43, 52], it is plausible that features extracted from [11C]choline images may also associate better with response to therapy, which we plan to pursue in future work for different types of cancers. Another potential area for future investigation is to include other, more advanced segmentation methods [53–55] that are better designed for and applicable to heterogeneous tumors.

Conclusion

This study analyzed the robustness of a wide range of radiomic features in [18F]FDG and [11C]choline PET images of nasopharyngeal carcinoma with respect to different segmentation and discretization methods. Discretization has larger effects on feature variability than segmentation in both [18F]FDG and [11C]choline PET, and features extracted from [11C]choline PET are more robust than [18F]FDG PET with respect to segmentation. Discretization has very similar effects on features computed from [18F]FDG versus [11C]choline PET. Further investigations, including robustness and therapy response prediction in more types of cancers and radiotracers, are needed to enable radiomic standardization and promote successful clinical use of radiomic features.