Introduction

As an oncological imaging modality, PET/CT plays a vital role in the evaluation and management of cancer [1, 2]. PET/CT image assessment has been primarily constrained to qualitative assessment, with some limited quantification, such as the use of SUVmax, to quantify tumour burden [2]. At the same time, there is growing interest in using mineable extracted image features in the emerging field of radiomics [3,4,5,6,7,8]. PET/CT image radiomics is a new quantitative imaging approach to non-invasively assess different aspects of tumours such as intra-tumoral heterogeneity [9,10,11,12,13,14]. In this method, features extracted from images can be used for diagnosis, prognosis and prediction of response to therapy [15, 16]. In addition, recent studies have shown that radiomic features correlate strongly with biological and clinical findings and can therefore be used as biomarkers [17]. Specifically, texture features have been shown to predict outcome in patients with non-small-cell lung cancer (NSCLC) treated by stereotactic body radiation therapy. In a recent study, joint PET/MRI textural features of soft tissue sarcoma were used as imaging biomarkers to predict lung metastases [18]. In addition, multiple attempts have been made to correlate PET/CT image radiomic features with genomic biomarkers. Moreover, studies have used a number of radiomic features to improve prognosis, classification and prediction of therapy response for different cancers [3, 4, 14, 19, 20]. Commonly used standardized uptake value (SUV) features, including SUVmax, SUVpeak and SUVmean, do not thoroughly characterize tumour uptake, and some studies have shown that their value can be surpassed or complemented by new radiomic features [20].

When radiomic features are used as imaging biomarkers, it is important that they accurately quantify tumour heterogeneity and that changes in feature values are not due to image generation parameters, e.g., those arising from different protocols or scanners. Although the diagnostic, prognostic and predictive values of many radiomic image features have been evaluated, there is evidence that the accuracy and variability of these features vary across imaging protocols.

Previous studies have shown that conditions such as image acquisition [20], reconstruction [21, 22], pre-processing [23], segmentation [24] and respiratory motion [5, 25] could affect radiomic features. In a few studies, the effect of different imaging parameters, including reconstruction algorithm, matrix size, iteration number, number of subsets and post-filtering, has been tested on radiomic image features. In these studies, the reproducibility, repeatability and variability of features extracted from patient and phantom images were tested over different reconstruction settings using different statistical parameters.

Given the continuing advances in PET scanners and image reconstruction, the development of new algorithms, and the fact that radiomic features are useful when their values are reliable, there is a need to test radiomic feature robustness against such changes. Therefore, the aim of this study was to assess the variability and robustness of different radiomic features extracted from phantom and patient PET/CT images over a wide range of reconstruction settings in a multi-center context.

Materials and methods

Figure 1 shows the overall framework of this study in different phases. Below we outline the various aspects and steps.

Fig. 1 Framework of the present study

Data acquisition (phantom)

In this study, an in-house developed NEMA body phantom was used for all measurements. This phantom has the following characteristics: 9.6 liter volume, 180 mm interior height, six inserts with internal diameters of 10, 13, 17, 22, 28 and 37 mm, and a cylindrical insert filled with low atomic-number material (density = 0.3 ± 0.1 g/ml). The phantom and the spheres were filled with a solution of water and 18F-FDG. Activity concentrations of 5.3 and 2.65 kBq/ml, equivalent to 370 and 185 MBq injected in a 70-kg patient, were chosen to simulate liver and lung lesions, respectively. For each background level, two lesion-to-background ratios (LBR) of 4:1 and 2:1 were used, giving four acquisition modes.
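The stated equivalence between the phantom concentrations and the injected activities can be checked with a back-of-the-envelope calculation. The sketch below assumes that the activity is spread uniformly through the body, that tissue density is 1 g/ml, and that decay and excretion are ignored; these assumptions and the function name are illustrative and are not stated in the study protocol.

```python
def uniform_activity_concentration_kbq_per_ml(injected_mbq, body_mass_kg,
                                              density_g_per_ml=1.0):
    """Mean activity concentration if the injected activity were spread
    uniformly through the body (assumption: no decay, no excretion)."""
    body_volume_ml = body_mass_kg * 1000.0 / density_g_per_ml  # kg -> g -> ml
    return injected_mbq * 1000.0 / body_volume_ml              # MBq -> kBq


for injected in (370.0, 185.0):
    conc = uniform_activity_concentration_kbq_per_ml(injected, 70.0)
    print(f"{injected:.0f} MBq in 70 kg: {conc:.2f} kBq/ml")
```

Under these assumptions the two cases come out at roughly 5.29 and 2.64 kBq/ml, close to the 5.3 and 2.65 kBq/ml quoted above.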

Data acquisition (patients)

Clinical data were obtained using two PET/CT scanners: GE Discovery 690 and Siemens Biograph 6 TruePoint. All image data were acquired with the following protocol: 25 patients fasted for at least 6 h prior to the scan, were injected with 333.0 ± 62.9 MBq of 18F-FDG, and were scanned after a 60-min uptake period. The PET data were acquired from the mid-thigh to the base of the skull for 3 min per bed position. Blood glucose levels were under 150 mg/dL (8.3 mmol/L). Furthermore, low-dose CT images without contrast were obtained for attenuation correction and anatomical localization. The study included patients with lung, head and neck, and liver cancers (mean age: 60 ± 6 y; age range: 39-70 y; 15 men, 10 women).

Reconstruction

To study the impact of reconstruction settings on image features, four different reconstruction algorithms were evaluated in each of the four acquisition modes: ordered subset expectation maximization (OSEM), with or without point spread function (PSF) modeling [26,27,28] and/or time of flight (TOF) [29,30,31,32]. Furthermore, the number of sub-iterations, number of subsets, full width at half maximum (FWHM) of the Gaussian filter, scan time per bed position, and matrix size were studied. All these parameters are listed in Tables 1 and 2, and resulted in 654 and 60 reconstructed images for the phantom and patient studies, respectively.

Table 1 Image reconstruction settings in the phantom study. OSEM: ordered subset expectation maximization, PSF: point spread function, TOF: time of flight
Table 2 Image reconstruction settings in the patient study. OSEM: ordered subset expectation maximization, PSF: point spread function, TOF: time of flight

Segmentation

All segmentations were performed using the OsiriX software. Specifically, the lesion volume of interest (VOI) was delineated using a threshold of 42% of the maximum SUV. Necrotic regions of tumours were also added into the segmentation. To minimize the impact of segmentation on our results, the VOI was delineated once on a reference image (OSEM + PSF with two iterations, 21 subsets, 5 mm FWHM and a matrix size of 256 × 256) and then copied to all the other images.
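The fixed-threshold delineation and the reuse of one reference VOI can be illustrated with a short NumPy sketch. It is not the OsiriX procedure itself: it assumes the input array is already cropped to a region around a single lesion, it omits the manual addition of necrotic regions, and the function and variable names are hypothetical.

```python
import numpy as np


def threshold_voi(suv_reference, fraction=0.42):
    """Boolean mask of voxels at or above `fraction` of the maximum SUV.

    Assumption: `suv_reference` is cropped to one lesion, so the global
    maximum of the array is the lesion SUVmax.
    """
    suv_reference = np.asarray(suv_reference, dtype=float)
    return suv_reference >= fraction * suv_reference.max()


# Toy 2D "reference image" (values are placeholders, not study data).
reference = np.array([[1.0, 2.0, 3.0],
                      [4.0, 10.0, 5.0],
                      [1.0, 6.0, 4.1]])
mask = threshold_voi(reference)

# The same mask is then applied to every other reconstruction of the
# same acquisition, so that only the voxel values change, not the VOI.
other_reconstruction = reference * 1.1
voi_values = other_reconstruction[mask]
print(mask.sum(), voi_values.mean())
```

Keeping the mask fixed means that any change in a feature value between reconstructions comes from the voxel intensities rather than from a redrawn contour.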

18F-FDG PET/CT image radiomic features

One hundred radiomic features from three main categories (texture-based, geometry-based and intensity-based features) were extracted using in-house MATLAB code. All extracted image features are shown in Table 3. In brief, the extracted features comprised SUV and intensity histogram (n = 37), shape (n = 4), gray-level co-occurrence matrix (GLCM, n = 7), gray-level run-length matrix (GLRLM, n = 11), neighborhood gray-tone difference matrix (NGTDM, n = 5), gray-level size zone matrix (GLSZM, n = 11), normalized GLCM (NGLCM, n = 6), neighboring gray level dependence (NGLD, n = 5), texture feature coding (TFC, n = 4), TFC GLCM (n = 8) and texture spectrum (TS, n = 2) features.
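To make the texture families above more concrete, the following sketch computes two GLCM features, entropy and homogeneity, for a small 2D image that has already been quantized to integer gray levels. It is only an illustration and not the MATLAB code used in the study: the single offset, the symmetric matrix, the base-2 logarithm and the homogeneity definition with 1 + |i - j| in the denominator are all assumptions, and published definitions differ on these points.

```python
import numpy as np


def glcm(image, levels, offset=(0, 1)):
    """Normalized, symmetric gray-level co-occurrence matrix for one offset.

    Assumption: `image` holds integer gray levels in the range 0..levels-1.
    """
    img = np.asarray(image, dtype=int)
    d_row, d_col = offset
    rows, cols = img.shape
    counts = np.zeros((levels, levels), dtype=float)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[img[r, c], img[r2, c2]] += 1.0
    counts += counts.T            # count each pair in both directions
    return counts / counts.sum()  # convert counts to probabilities


def glcm_entropy(p):
    """Shannon entropy (bits) of the co-occurrence probabilities."""
    nonzero = p[p > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())


def glcm_homogeneity(p):
    """Homogeneity with the 1 / (1 + |i - j|) weighting (one common variant)."""
    i, j = np.indices(p.shape)
    return float((p / (1.0 + np.abs(i - j))).sum())


checkerboard = np.array([[0, 1],
                         [1, 0]])
p = glcm(checkerboard, levels=2)
print(glcm_entropy(p), glcm_homogeneity(p))
```

A perfectly uniform image gives zero entropy and a homogeneity of 1, whereas the checkerboard above, in which horizontal neighbours always differ, gives higher entropy and lower homogeneity.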

Table 3 Radiomics features

Data analysis

For analysis of phantom images, the five largest lesions together with three background spheres of 15, 20 and 22 mm in diameter were included in the study. The background VOIs were placed at a distance of 15 mm from the phantom edges and spheres. Lesions smaller than 5 cm3 were not analyzed due to the partial volume effect (PVE), and the effect of PVE itself was not analyzed in this study.

Inter-setting coefficient of variation (COV) was calculated for each image feature over the different reconstruction settings, via the following equation:

$$ \mathrm{COV}=\frac{\mathrm{SD}}{\mathrm{Mean}}\times 100 $$

where SD is the standard deviation of the feature value and Mean is its mean over the different reconstruction settings. To categorize variations, four groups were defined: very small (COV ≤ 5%), small (5% < COV ≤ 10%), intermediate (10% < COV ≤ 20%) and large (COV > 20%). A hierarchical cluster tree of the radiomic features, based on their COV across image reconstruction settings, was created and displayed as a variability heat map. All data were analyzed using the R program (r-project.org).
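The COV calculation and the four variability groups can be sketched in a few lines of Python. The study itself used R; the sketch below is an independent illustration, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say whether the sample or population form was used. The input values are placeholders, not study data.

```python
import numpy as np


def cov_percent(values, ddof=1):
    """Inter-setting coefficient of variation, SD / Mean x 100.

    Assumption: sample standard deviation (ddof=1). Features whose mean
    is close to zero (e.g. skewness) would need special handling.
    """
    values = np.asarray(values, dtype=float)
    return float(values.std(ddof=ddof) / values.mean() * 100.0)


def cov_category(cov):
    """Map a COV (in percent) to the four groups defined in the text."""
    if cov <= 5.0:
        return "very small"
    if cov <= 10.0:
        return "small"
    if cov <= 20.0:
        return "intermediate"
    return "large"


# One feature measured on the same VOI under four reconstruction settings.
feature_values = [0.81, 0.79, 0.83, 0.80]
cov = cov_percent(feature_values)
print(f"COV = {cov:.1f}% ({cov_category(cov)})")
```

Applying `cov_percent` to every feature for each varied parameter yields the feature-by-parameter matrix of categories that a heat map such as Fig. 2 visualizes.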

Results

Impact of reconstruction, number of sub-iterations, number of subsets, and post-smoothing

As shown in much of the literature, reconstruction settings affect both the qualitative and the quantitative assessment of PET/CT images. The results describing the impact of reconstruction algorithm, number of sub-iterations, number of subsets and FWHM of the Gaussian filter are presented in Fig. 2 and in supplementary Tables 1 to 12. In the radiomics heat map of Fig. 2, the effects of different parameter settings on variability are depicted for both patient and phantom studies, as quantified using the COV defined above. The effects of matrix size and scan time per bed were mapped only for the phantom and patient studies, respectively (Fig. 2).

Fig. 2 Heat map of variability of features against different settings. 1 = very small variability, 2 = small variability, 3 = intermediate variability, 4 = high variability

In supplementary Table 1, we show the most robust features (COV ≤ 5%) across the different reconstruction settings. For example, one NGLD feature (Entropy), two GLCM features (Homogeneity, Correlation), two GLRLM features (SRE, LRE) and 12 Intensity and SUV features (e.g., SUVmean, Entropy) were found to be robust against the reconstruction algorithm.

Table 4 also lists the most robust (COV ≤ 5%) features over all reconstruction settings. Results of all reconstruction settings, including phantom and patient data, were ranked based on the median COV over all reconstruction settings. Such robust features included GLCM (Entropy, Homogeneity, Dissimilarity, Correlation), GLRLM (SRE, LRE, RLV, RP), GLSZM (SZE, IV, ZP), NGLCM (Entropy, Homogeneity, Dissimilarity), Intensity and SUV (SUVmean, Entropy, SULpeak and 16 other features), NGLD (SNE, NNU, SM, Entropy), TFC (Homogeneity) and GLCM-coding (Entropy, Homogeneity, Intensity, IDMcglcm, CE). No NGTDM or TS texture features were robust by this criterion.
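The median-based ranking described above can be sketched as follows. The feature names, parameter keys and COV values are invented placeholders used only to show the mechanics, and the helper name is hypothetical.

```python
from statistics import median


def rank_by_median_cov(cov_by_feature):
    """Return (feature, median COV) pairs sorted from most to least robust.

    `cov_by_feature` maps each feature name to a dict of COV values (%),
    one per varied reconstruction parameter.
    """
    medians = {name: median(covs.values())
               for name, covs in cov_by_feature.items()}
    return sorted(medians.items(), key=lambda item: item[1])


# Placeholder values, not results from the study.
example = {
    "feature_A": {"algorithm": 2.0, "fwhm": 4.0, "matrix": 30.0},
    "feature_B": {"algorithm": 12.0, "fwhm": 25.0, "matrix": 40.0},
}
ranked = rank_by_median_cov(example)
robust = [name for name, med in ranked if med <= 5.0]
print(ranked, robust)
```

Because the median is insensitive to a single extreme value, a feature such as `feature_A` can be ranked as robust overall even though its COV against one parameter is large, which is why per-parameter results are worth inspecting alongside the overall ranking.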

Table 4 Variability of features over median value of COV in all reconstruction settings

Our results showed that features including LIRE, LISRE, LILRE (GLRLM), LISZE, LILZE (GLSZM), CS (GLCM Coding), Coarseness (TFC), Intensity and SUVvar, SSV1, SsuvV2 (SUV) had the greatest variability (COV > 20%).

Features including Homogeneity (GLCM), SRE (GLRLM), SZE/ZP (GLSZM), Entropy (NGLCM), Entropy (NGLD), Homogeneity (TFC), Entropy/CE/Intensity/Homogeneity (GLCM coding), and TLG/TV/Surface area/Asphericity (Shape) were found to be robust against changes in all reconstruction settings in both phantom and patient studies (COV ≤ 5% for all reconstruction settings except matrix size).

Impact of matrix size

The impact of matrix size on radiomic features was tested with four different matrix sizes. As shown in the heat map and supplementary Table 13, matrix size has the greatest impact on image features. Figure 2 shows that 56% of all features were very sensitive (COV > 20%) to matrix size changes and only six (6%) features (NGLCM (Entropy), Intensity and SUV (SUVmax, Entropy, SULpeak), GLCM coding (Entropy, CE)) had very small variability (COV ≤ 5%). All features from NGTDM, GLRLM and GLCM (except Correlation) showed a large variation against matrix size change. SZE, HISZE and HIZE textures from GLSZM had intermediate COV, and the remaining eight GLSZM textures had COV > 20%.

Impact of time per bed position

Of all features, 52% showed very small (COV ≤ 5%), 27% small (5% < COV ≤ 10%), 10% intermediate (10% < COV ≤ 20%) and 11% large (COV > 20%) variability against time per bed position (supplementary Table 14). GLRLM (LIRE, LISRE, LILRE), GLSZM (LIZE, LISZE, LILZE), Intensity and SUV (SUVskew, SUVbck), Coarseness, CS and BWS were the most redundant features.

Differences in phantom and patient studies

To assess how reconstruction settings may affect phantom and patient image features differently, we calculated the difference between the phantom and patient COV of each feature and considered differences of <10% as most consistent. Results showed that 95%, 92%, 88% and 87% of all features had <10% differences between phantom and patient studies when COV was computed across reconstruction algorithm, FWHM, sub-iteration and subset changes, respectively.
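The phantom-versus-patient comparison reduces to a per-feature difference of two COV values followed by a count. A minimal sketch is given below; it assumes the difference is taken in absolute percentage points, which the text does not state explicitly, and it uses placeholder feature names and values rather than study data.

```python
def fraction_consistent(cov_phantom, cov_patient, tolerance=10.0):
    """Fraction of shared features whose phantom and patient COV differ by
    less than `tolerance` percentage points, plus the list of those features.

    Assumption: the difference is an absolute difference of COV values.
    """
    shared = sorted(set(cov_phantom) & set(cov_patient))
    if not shared:
        raise ValueError("no features in common")
    consistent = [name for name in shared
                  if abs(cov_phantom[name] - cov_patient[name]) < tolerance]
    return len(consistent) / len(shared), consistent


# Placeholder COV values (%) for one varied parameter, not study results.
phantom = {"A": 3.0, "B": 12.0, "C": 40.0, "D": 8.0}
patient = {"A": 6.0, "B": 30.0, "C": 45.0, "D": 8.5}
fraction, names = fraction_consistent(phantom, patient)
print(f"{fraction:.0%} consistent: {names}")
```

Repeating this for each varied parameter gives one percentage per parameter, in the same form as the figures reported above.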

Discussion

PET/CT image quantification using radiomic features has a wide range of applications including tumour diagnosis, characterization, prognosis and prediction of response to treatment [33]. For years, SUV metrics have been used most commonly, but their accuracy and capabilities have limitations [34,35,36]. At the same time, recent scientific evidence points to certain radiomic features as being susceptible to variability across different imaging protocols, particularly reconstruction settings [8]. In this study, we aimed to investigate the impact of reconstruction settings available in clinical practice on PET/CT image features in a multi-scanner study involving both phantom and patient data.

Based on the radiomics literature, the accuracy of features and of analysis procedures are the main issues that determine the success of radiomics in clinical research, and radiomic feature accuracy depends on factors such as imaging protocol, scanner type, and equipment accessories [37]. In this light, we considered these factors and performed our studies in two participating PET/CT centers with two different scanner models.

Our results showed that the robustness of PET/CT image radiomic features to advanced reconstruction settings is feature-dependent, and that different settings have different effects on radiomic features. For example, Entropy from GLCM-Coding was robust and LISZE from GLSZM was non-robust against all reconstruction settings, whilst Coarseness from NGTDM had very small variability against time per bed, small variability against subset/reconstruction algorithm, intermediate variability against FWHM/sub-iteration and large variability against matrix size.

We also assessed feature robustness in both phantom and patient studies. Our results demonstrated that most features had similar variability between the two kinds of studies, but there were some differences. These may be due to biological and physiological parameters such as proliferation, tumour vasculature, metabolism, hypoxia and necrosis, which contribute to intra-tumoral heterogeneity. Also, our phantom was filled with homogeneous activity and so contained no heterogeneity. Although the values of radiomic features will obviously change depending on whether the tumour being quantified is homogeneous or heterogeneous, the COV is expected to remain nearly the same. The other main parameter is motion (e.g., respiratory), which is absent in the phantom study. Some studies have suggested that feature variability is due to respiratory motion [25, 38].

Based on our results and in comparison with some other studies (Fig. 3), the robustness of different radiomic features varies across reconstruction settings. Although these studies also addressed the robustness of PET/CT image radiomic features, they differed in segmentation, quantization and feature naming, so their findings differ in some respects from ours. It should also be remembered that quality assurance (QA) has an impact on image quality and quantification. In our work, before any measurement, we performed QA and verified the validity of both scanners.

Fig. 3 Robustness of features, a comparison with previous studies (References: 39, 41-43, 45). R = reconstruction, F = FWHM, I = iteration, O = overall, M = matrix size, 0 = not calculated in that study, 1 = most robust, 2 = intermediate robustness, 3 = low robustness

For example, Doumou et al. investigated the effects of image smoothing, segmentation and quantization on heterogeneity features from GLCM, GLRL, NGTDM and GLSZM [39]. Their results demonstrated that smoothing and quantization had small and large effects on the precision of features, respectively. In comparison with Doumou et al., about nine of the 29 features common to both studies showed good agreement, such as SRE, Entropy, Homogeneity and SZE; in addition, ZP had the smallest variability and LIZE was found to be very variable against the filter in both studies.

In a recent study, Yan et al. studied the effect of reconstruction settings on 55 texture and six first-order features and reported different COVs of features over changes of reconstruction settings [40]. Of the 40 features in common with our own study, 60%, 52%, 65% and 70% showed the same COVs against reconstruction algorithm, FWHM, iterations and matrix size, respectively. The remaining discrepancies may be due to differences in data analysis: the analysis by Yan et al. ranked features by the highest value of COV, whilst our results were based on the mean of COV.

Bailly et al. studied the robustness of 15 textural features over the number of iterations, post-filtering level, noise, reconstruction algorithm and matrix size [41]. Thirteen of the texture features of Bailly et al. were in common with ours, and 61%, 61%, 53%, 69%, 38% and 61% of these features had the same COVs against reconstruction algorithm, matrix size, FWHM, iteration, time per bed and overall, respectively. RP (GLRLM), Entropy and Homogeneity (GLCM) and ZP (GLSZM) had high robustness, LILZE (GLSZM) had low robustness, and HISZE and HIZE from GLSZM and SAM-GLCM had intermediate robustness in both studies.

In another similar study, Rodicioa et al. investigated the sensitivity of 72 textural features to technical and biological factors [42]. Their results showed that only eight texture features had the highest robustness, and entropy exhibited good correlation with all patient parameters. These findings have 68% agreement with our results, and all of the eight features that they reported were consistent with our results.

Van Velden et al. assessed the impact of two reconstruction settings and segmentation on the repeatability of 105 radiomic features in non-small-cell lung cancer (NSCLC) [24]. Their results showed that 63 features had a high level of repeatability, but 25 and three features were sensitive to change in segmentation and a change in reconstruction, respectively.

Forgacs et al. introduced a predefined strategy to identify the most robust texture features, including volume independency, reproducibility and accuracy over different reconstruction settings [43]. They found that entropy, homogeneity and correlation features had the highest reproducibility, in good agreement with our results. However, some features, such as SZE, had small variations (COV ≤ 5%) in our results but were found to be non-robust by Forgacs et al. This may be due to different sources of variability and to different statistical assessments of robustness, such as the intraclass correlation coefficient (ICC).

In the present work, we investigated the effect of new reconstruction algorithms, and did not study the effect of more conventional (analytic) algorithms. However, Galavis et al. showed the variations of different features over changes between two reconstruction algorithms, filtered back projection (FBP) and OSEM, and indicated that features with large variations could not be selected for tumour segmentation [44].

The present work has some limitations. First, we did not take into account the effect of quantization or segmentation, which may have considerable effects on radiomic features. The effect of these parameters has been studied by Leijenaar et al. [9, 23] and Lu et al. Second, we did not study the effect of respiratory motion, which can change the feature values. In addition, further clinical studies are needed to test the biological mechanisms of these parameters, and new studies may need to consider the effect of PVE on radiomic features.

In the present study, in comparison to previous studies, we tested a wider range of radiomic features, and new features were found to be robust. Intensity and SUV features including SUVmpa, STsuv3, STsuv2, STS1, EPA, SMsuv3, SULpeakPA, SsuvE3, SSE1, SsuvE2, Epsa, SUVmpsa, SMsuv2 and SMV1; GLCM-Coding features including Entropy, Homogeneity, Intensity, IDMcglcm and CE; and TFC features including Homogeneity and Mean convergence were new robust radiomic features.

In the present study, one of the main aims was to investigate how recently developed advanced reconstruction methods, such as PSF and TOF, would change the radiomic feature values. In this regard, we tested four different image reconstruction algorithms (OSEM, OSEM + PSF, OSEM + TOF and OSEM + TOF + PSF), while other reconstruction parameters such as iteration, number of subsets, FWHM and matrix size were kept fixed. Using these reconstruction settings, we evaluated radiomic feature robustness by COV. Our results therefore show the net effect of the different reconstruction algorithms on radiomic feature robustness. These results are shown separately in supplementary Tables 1, 5 and 9.

Finally, we note that recent developments in PET/CT image radiomics have opened a new potential horizon towards improved treatment response assessment in comparison to existing criteria including Response Evaluation Criteria in Solid Tumours (RECIST) [45] and PET Response Criteria in Solid Tumours (PERCIST) [46]. In this new era of imaging biomarker discovery, the identification of robust features is of particular importance. In this light, the present work presents new data that can be used to screen potential radiomic features, which can subsequently be evaluated in therapy response assessment tasks of interest and ultimately established in multi-center studies.

Conclusion

We investigated the effect of different reconstruction settings, including reconstruction algorithm, iterations, post-smoothing, time per bed, and image matrix size, on a wide range of PET/CT image radiomic features. The variability and robustness of PET/CT image radiomic features under advanced reconstruction settings are feature-dependent, and different settings have different effects on different features. Radiomic features with low COV can be considered good candidates for reproducible tumour quantification in multi-center studies. Features with intermediate COV should be used with caution, and features with high COV should most likely be omitted (to reduce the number of potential biomarkers for statistical purposes). In the present study we also identified some new radiomic features, from the Intensity and SUV, GLCM-Coding and TFC families, as robust features.