Introduction

Radiomics is a promising tool with potential diagnostic, prognostic and predictive power. The extraction and analysis of quantitative radiological features provides valuable information before, during and after radiation therapy (RT)1. Previous studies have linked several radiomic features directly to patient survival2. Research has shown the power of radiomics for many disease sites; however, these studies also show variability with respect to imaging modality, reconstruction algorithms, feature selection, and volume of interest (VOI)3,4,5,6,7,8,9. Several groups have studied the robustness of radiomic features with respect to contouring variability3,4,5,10,11,12. Contours are typically created by a trained radiation oncologist; however, inter- and intra-observer contouring variation can still be significant when considering radiomics10,13.

A recent study by Yang et al. investigated the impact of contouring variability on PET-based radiomic features for lung cancer14. The study demonstrated that the impact of contour uncertainty on PET-based radiomic features varied widely, and it cautioned against predictive use of models involving PET-based radiomic features when contours are uncertain. A study by Pavic et al. examined intra-observer variation effects on radiomic features extracted from CT images12. This study extracted a total of 137 radiomic features from planning CT images of head and neck cancer patients and warned that variation in delineation can significantly affect some radiomic features.

On-board imaging (OBI) utilizing megavoltage (MV) and kilovoltage (kV) cone beam CT (CBCT) is a widely used imaging technique for daily patient bony alignment and prostate marker alignment15. The prostate can deform and rotate daily due to differential bladder and rectal filling resulting in suboptimal dosimetry over the course of treatment16. By utilizing CBCT setup images, changes in anatomy can be accounted for at the time of treatment delivery17. Deformable image registration (DIR) can automatically propagate the contours drawn on the planning CT (pCT) to daily CBCT images, accounting for the anatomical changes and allowing adaptive radiotherapy (ART)18,19. The American Association of Physicists in Medicine (AAPM) Task Group 132 (TG-132) provides recommendations on the use of image registration and fusion algorithms and provides quantitative methods of evaluating DIR accuracy20.

Various studies have investigated the accuracy of DIR algorithms for the automatic creation of contours (auto contours) for prostate cancer; however, these studies often used very small sample sizes18,21,22,23. A study by Woerner et al.18 evaluated DIR performance from pCT to a CBCT acquired near the end of treatment for 6 prostate, 5 head and neck and 5 pancreas patients. The small sample size of their study limited their analysis to organs at risk (OAR). The authors cautioned against the use of automatic DIR workflows for contour deformation when assessing changes during the course of treatment. Another study by Thor et al. evaluated DIR performance from pCT to CBCT through the course of prostate cancer treatment for 5 patients24. The study concluded that more advanced imaging and/or DIR algorithms should be developed before DIR workflows can be used confidently for contour deformation.

Unlike CT, PET and MR images, which are used for diagnostic and treatment planning purposes, CBCT images are used for daily patient setup prior to radiation delivery and are collected daily as the current standard of care. Most radiomic studies performed thus far have been conducted in a pre-treatment setting and lack knowledge of early tumor response to therapy, hindering timely treatment adaptation. Hence, the day-to-day radiomic feature changes of the tumor obtained from CBCT may offer a possibility of treatment adaptation during the early course of treatment, distinct from radiomic predictions derived in the pre-treatment setting. These features can be examined for their use in early response assessment.

The automatic propagation of pCT contours to daily CBCT can also be used in the context of radiomics. However, to our knowledge, the robustness of radiomic features to varying prostate contours on daily CBCTs has not been previously examined. Updating radiomic feature data more frequently, as daily CBCT makes possible, could help radiomic features inform clinical decision making. The goal of this study is to utilize a commercially available DIR algorithm to deform manual prostate contours to the daily CBCTs and determine the robustness of radiomic features to DIR-based contour propagation.

Methods and materials

Patient selection

Twenty-nine prostate cancer patients who were treated with volumetric modulated arc therapy (VMAT) and had daily CBCT images were considered. Ethical approval for this study was obtained from the University of Miami Institutional Review Board (IRB). Written informed consent was obtained from all patients in this study. The data were retrospectively collected and analyzed. All methods undertaken in this work were carried out in accordance with the relevant guidelines and regulations. One patient was excluded due to prosthetic implants causing extremely poor image quality, leaving 28 patients for analysis. Each patient was treated using conventional fractionation, consisting of 28–40 fractions, for a total of 1,010 fractions. Prostate volumes ranged from 15 cm³ to 92 cm³.

Image acquisition and manual contouring

Planning CT (pCT) images were acquired using Somatom Definition AS and Sensation Open (Siemens Healthineers AG, Germany), and/or Gemini TF TOF 64 (Philips, Netherlands) for each patient. The mean field-of-view (FOV) was 670 mm with a range of 492–800 mm, reconstructed with dimensions of 512 × 512 pixels, a thickness of 2 mm, and an average in-slice pixel size of 1.3 mm with a range of 0.9–1.6 mm. On the day of treatment, each patient was imaged with the same FOV using kV CBCT with 465 slices, voxel sizes of (0.9 mm, 0.9 mm, 2.0 mm) and reconstructed with dimensions of (512, 512, 232). Each patient was imaged in supine position. All 28 patients had 4 gold fiducial markers implanted in the prostate prior to imaging. Prostate volumes were manually contoured on pCT and on daily CBCT setup images by the same expert radiation oncologist who has 4 years of experience in contouring prostate cases to eliminate interobserver variation. The pCT images and daily CBCT setup images, including manually drawn contours, were uploaded to a commercial image registration software (Velocity Advanced Imaging, ver. 4.1, Varian Medical Systems, Palo Alto, CA).

Deformable image registration and delineation propagation

The DIR of pCT to daily CBCT images was performed using the Adaptive Monitoring tool available in the commercial image registration software. Figure 1 shows the DIR and delineation propagation workflow. DIR creation utilized the ‘Adaptive Monitoring’ navigator, which comprises three steps: (1) manual alignment, (2) rigid registration, (3) deformable registration. During the manual alignment step, a manual rigid alignment between the CBCT image and the pCT image was made using bony anatomy and the implanted gold fiducials as a guide. These alignments were done independently of clinical set-up shifts for consistency. In step 2, the region of interest (ROI) was adjusted to include the prostate and, using the manual alignment created in step 1 as a starting point, a rigid registration was created by the software. In step 3, the rigid registration of step 2 was used as the reference from which the DIR algorithm deformably registered the pCT to the daily CBCT. The DIR algorithm uses an intensity-based B-spline algorithm based on the Mattes formulation; the details have been described elsewhere14,25. For poor quality deformations, the deformable image registration workflow was repeated with smaller ROIs (67 fractions). Poor quality deformations were defined as a DSC < 0.75 and MDA > 3.5 mm, just beyond the TG-132 tolerance recommendations. For the three fractions that continued to produce poor quality deformations, a structure-guided deformation (hybrid DIR) was employed.
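The re-evaluation rule described above can be sketched in a few lines. This is only an illustration of the decision logic, not the vendor workflow: the stage names and the `first_acceptable_stage` helper are hypothetical, and the thresholds are combined with "and" exactly as the text words them.

```python
# Thresholds quoted in the text for a poor quality deformation.
DSC_MIN = 0.75
MDA_MAX_MM = 3.5

# Hypothetical labels for the three escalation stages described in the text.
STAGES = ("default_roi", "smaller_roi", "structure_guided")


def is_poor_quality(dsc, mda_mm):
    """Return True when a deformation meets the stated poor quality rule."""
    return dsc < DSC_MIN and mda_mm > MDA_MAX_MM


def first_acceptable_stage(metrics_by_stage):
    """Return the first stage whose (DSC, MDA in mm) pair is not poor.

    metrics_by_stage maps a stage label to a (dsc, mda_mm) tuple.
    Returns None when every evaluated stage is poor.
    """
    for stage in STAGES:
        if stage in metrics_by_stage:
            dsc, mda_mm = metrics_by_stage[stage]
            if not is_poor_quality(dsc, mda_mm):
                return stage
    return None


# Example: the default ROI fails, the smaller ROI passes.
example = {"default_roi": (0.70, 3.8), "smaller_roi": (0.85, 2.1)}
print(first_acceptable_stage(example))
```

The metrics passed in are assumed to have been computed beforehand against the manual contour; the sketch does not perform any registration itself.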

Figure 1

Deformable image registration and delineation propagation workflow. All steps included in the Adaptive Monitoring Navigator on Velocity are inside the dotted-line box. The re-evaluation of poor-quality fractions (DSC < 0.75, MDA > 3.5 mm) is shown in the light-red box. Exportation of data to MATLAB for data extraction and analysis is done in the final green box.

Assessment of deformable image registration

The DIRs from pCT to daily CBCTs were assessed using both qualitative and quantitative metrics. First, a visual assessment of deformation vector fields (DVFs) was carried out to ensure that the DIR transformation was physically and anatomically reasonable. The locations of anatomical landmarks (e.g., bones) and fiducial markers were visually inspected on fused images of the pCT and CBCT to verify that they matched. DIRs were refined by adjusting the ROI around the prostate to improve DIR accuracy26, or by applying a structure-guided deformation.

The quantitative assessment of the registered prostate contours was done using several metrics. Dice similarity coefficient (DSC) is a statistical measure of contour overlap with 0 being no overlap and 1 being a perfect match27. Distance to agreement (DTA) between two contours, sometimes referred to as distance to conformity, is the shortest distance from a given point on the surface of one contour to the surface of the other contour. Mean distance to agreement (MDA) is the mean of all DTA distances28. The geometric centers of both the auto and manual contours on the CBCTs were calculated and used to determine the difference in center of mass position (ΔCM). The difference in volume between the auto and manual contour (∆Vol) was also evaluated. DSC, MDA, ΔCM, and ∆Vol were calculated between the auto and manual prostate contours for all 1,010 fractions. For a smaller sub-sample of 10 patients, on a bi-weekly basis (totaling 47 fractions), the Jacobian determinant (JD) was computed. JD values > 1, equal to 1, and between 0 and 1 correspond to volume expansion, no volume change and volume reduction, respectively. JD values equal to or less than zero correspond to non-physical transformations, which indicate a poor DIR20. TG-132 defines a clear method for assessing DIR algorithms20. In this work, the tolerances defined in TG-132 were used for quantitative assessment of the DIRs.
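The contour metrics defined above can be illustrated with a minimal sketch. It assumes each contour is represented as a set of integer voxel indices, takes the voxel spacing in mm as an argument, and averages DTA over the surfaces of both contours; these representation choices, and all function names, are assumptions made for illustration and are not taken from the study's MATLAB implementation.

```python
import math

# Six face-connected neighbour offsets used to identify surface voxels.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def dice(a, b):
    """Dice similarity coefficient of two voxel sets (1 = perfect overlap)."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def centroid_mm(voxels, spacing):
    """Geometric center of a voxel set, scaled to mm by the voxel spacing."""
    n = len(voxels)
    return tuple(spacing[i] * sum(v[i] for v in voxels) / n for i in range(3))


def delta_cm(a, b, spacing):
    """Distance in mm between the geometric centers of two contours."""
    return math.dist(centroid_mm(a, spacing), centroid_mm(b, spacing))


def percent_volume_difference(auto, manual):
    """Signed volume difference in percent, relative to the manual contour."""
    return 100.0 * (len(auto) - len(manual)) / len(manual)


def surface(voxels):
    """Voxels that have at least one face neighbour outside the set."""
    return {v for v in voxels
            if any((v[0] + dx, v[1] + dy, v[2] + dz) not in voxels
                   for dx, dy, dz in NEIGHBORS)}


def mean_distance_to_agreement(a, b, spacing):
    """Mean of nearest-surface distances in mm, taken in both directions."""
    pa = [tuple(v[i] * spacing[i] for i in range(3)) for v in surface(a)]
    pb = [tuple(v[i] * spacing[i] for i in range(3)) for v in surface(b)]
    distances = [min(math.dist(p, q) for q in pb) for p in pa]
    distances += [min(math.dist(q, p) for p in pa) for q in pb]
    return sum(distances) / len(distances)


def cube(origin, size):
    """Toy contour: a solid cube of voxels, used only for the example."""
    ox, oy, oz = origin
    return {(ox + i, oy + j, oz + k)
            for i in range(size) for j in range(size) for k in range(size)}


manual = cube((0, 0, 0), 4)
auto = cube((1, 0, 0), 4)      # same shape, shifted by one voxel in x
spacing = (0.9, 0.9, 2.0)      # CBCT voxel size quoted in the text, in mm

print(dice(manual, auto))      # 0.75
print(delta_cm(manual, auto, spacing))
print(percent_volume_difference(auto, manual))
print(mean_distance_to_agreement(manual, auto, spacing))
```

The brute-force nearest-surface search is quadratic in the number of surface voxels, so it is suitable only for small examples; a distance transform would be the usual choice for clinical-size volumes.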

Radiomic features

To study the impact of contouring variability due to the DIR on CBCT radiomic features in the prostate, a total of 46 radiomic features derived from 6 different classes were analyzed for all 1,010 fractions. Additionally, a subpopulation of 149 fractions with |∆Vol| > 10% was identified to study the impact of larger contour variability on CBCT radiomic features. Radiomic features were extracted using the procedure described in Delgadillo et al.11 and included Gray-Level Co-occurrence Matrices (GLCM), Neighborhood Gray-Tone Difference Matrix (NGTDM), Gray-Level Run-Length Matrices (GLRLM), Gray-Level Size Zone Matrices (GLSZM), morphological and statistical features29,30,31. Each radiomic feature is identified by its image biomarker standardization initiative (IBSI) code32.

Percent difference in radiomic feature derived data (%∆RF) was compared to DSC to assess radiomic feature dependency on contouring variability using Spearman’s rank correlation coefficient (ρ)33,34. These correlations were classified as weak if |ρ|< 0.4, moderate if 0.4 ≤|ρ|< 0.6, relatively strong if 0.6 ≤|ρ|< 0.8, and strong if 0.8 ≤|ρ|33.
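As a minimal sketch of this step, Spearman's ρ can be computed as the Pearson correlation of average ranks and then mapped onto the four strength classes above. The function names are illustrative, tied values receive their average rank, and the code is not the study's MATLAB implementation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient of two equal-length lists."""
    return pearson(average_ranks(x), average_ranks(y))


def classify_correlation(rho):
    """Map |rho| onto the strength classes stated in the text."""
    r = abs(rho)
    if r < 0.4:
        return "weak"
    if r < 0.6:
        return "moderate"
    if r < 0.8:
        return "relatively strong"
    return "strong"


# Made-up values for illustration only: DSC per fraction and %dRF per fraction.
dsc = [0.92, 0.88, 0.95, 0.81, 0.90]
pct_diff = [1.2, -0.4, 0.3, 2.5, -1.1]
rho = spearman_rho(pct_diff, dsc)
print(rho, classify_correlation(rho))
```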

Lin’s concordance correlation coefficient (CCC) was computed for all 46 radiomic features between the auto and manual contours to quantify their agreement35, with 1 indicating a perfectly linear relationship and 0 indicating no relation34,36. Adapting the classification scheme proposed by McGraw et al.37, radiomic features were classified as robust with CCC > 0.90, acceptable with 0.75 < CCC < 0.90, and uncertain with CCC < 0.75.
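A minimal sketch of Lin's CCC, using the standard definition 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²) with population (1/n) variances, is given below. The stated ranges leave the exact boundary values 0.90 and 0.75 unassigned; assigning them to the lower class here is an assumption of this sketch.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n          # variance of x
    syy = sum((b - my) ** 2 for b in y) / n          # variance of y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n  # covariance
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)


def classify_ccc(ccc):
    """Robustness class from CCC, following the thresholds in the text.

    Assumption: a value exactly equal to 0.90 or 0.75 falls in the lower class.
    """
    if ccc > 0.90:
        return "robust"
    if ccc > 0.75:
        return "acceptable"
    return "uncertain"


# Identical series agree perfectly; a constant offset lowers CCC even though
# the Pearson correlation of the two series is still 1.
print(lins_ccc([1, 2, 3], [1, 2, 3]))
print(lins_ccc([1, 2, 3], [2, 3, 4]))
```

The second call illustrates why CCC is used instead of a plain correlation coefficient: it penalizes systematic shifts between the auto and manual feature values.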

Additionally, mean absolute percent difference in radiomic feature derived data (mean |%∆|RF) was used to evaluate the stability of radiomic features to differences in prostate contours. Radiomic features were independently classified as robust with mean |%∆|RF < 5%, acceptable with 5% < mean |%∆|RF < 15%, and uncertain with 15% < mean |%∆|RF < 50%. Processing and analysis were performed using scientific computation software (MATLAB, ver. 2018b, MathWorks Inc., Natick, MA).
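The stability measure above can be sketched as follows. The text does not state the reference value for the percent difference, so taking the manual contour value as the denominator is an assumption, as is the handling of exact boundary values; values of 50% or more are returned unlabeled because the stated ranges do not assign them a class.

```python
def percent_difference(auto_value, manual_value):
    """Percent difference of a feature value, relative to the manual contour.

    Assumption: the manual contour value is the reference (denominator).
    """
    return 100.0 * (auto_value - manual_value) / manual_value


def mean_abs_percent_difference(auto_values, manual_values):
    """Mean |%d| over paired per-fraction feature values."""
    diffs = [abs(percent_difference(a, m))
             for a, m in zip(auto_values, manual_values)]
    return sum(diffs) / len(diffs)


def classify_stability(mean_abs_pct):
    """Stability class from mean |%d|, following the ranges in the text.

    Returns None for values of 50% or more, which the stated ranges
    do not label.
    """
    if mean_abs_pct < 5.0:
        return "robust"
    if mean_abs_pct < 15.0:
        return "acceptable"
    if mean_abs_pct < 50.0:
        return "uncertain"
    return None


# Made-up feature values for two fractions, for illustration only.
value = mean_abs_percent_difference([102.0, 95.0], [100.0, 100.0])
print(value, classify_stability(value))
```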

Results

Assessment of prostate contour accuracy

Table 1 summarizes the mean, standard deviation and range for DSC, MDA, ΔCM, ∆Vol, JD minimum and JD maximum between auto and manual prostate contours, and Fig. 2 shows the distribution of DSC, MDA, ΔCM and ΔVol over 1,010 fractions. The mean DSC of all 1,010 fractions, 0.90 ± 0.04, is within the TG-132 recommended DSC value of ~ 0.8 to 0.9. A total of 42 fractions had DSC < 0.8, below the lower tolerance recommendation of TG-132.

Table 1 Similarity metrics between auto and manual contours.
Figure 2

Histograms of (A) DSC, (B) MDA, (C) ΔCM and (D) percent difference in volume (∆Vol) between the auto and manual contours for all 1,010 fractions.

The mean MDA of 1.81 ± 0.47 mm is well within the TG-132 recommended ~ 2 to 3 mm. A total of 33 fractions had MDA > 3 mm, with a maximum MDA of 4 mm. The mean ΔCM was 2.17 ± 1.26 mm and the mean ∆Vol was 5.1 ± 4.1%. The mean minimum and mean maximum values of the JD were found to be 0.77 ± 0.18 and 1.31 ± 0.23, respectively, with no JD values ≤ 0 (Table 1).

Impact of contouring variability on radiomic features

Spearman rank correlation coefficient for %∆RF versus DSC for each radiomic feature is shown in Fig. 3. As previously mentioned, a subpopulation of fractions with |∆Vol| > 10% was also considered; results from these fractions are plotted in red (Fig. 3). All 46 radiomic features were classified as having a weak correlation between %∆RF and DSC, with |ρ| < 0.4 for both populations under consideration (all fractions and the subpopulation of fractions with |∆Vol| > 10%).

Figure 3

Spearman’s rank correlation coefficient between the percent difference in radiomic feature derived data (%∆RF) of auto and manual contours and the Dice similarity coefficient (DSC), stratified by class. Results are shown for two populations: all fractions (blue) and fractions with |ΔVol| > 10% (red).

Table 2 displays Lin’s concordance correlation coefficient (CCC) for all 46 radiomic features. Using the classification scheme mentioned earlier, 30 of 46 radiomic features were classified as robust, 8 radiomic features were classified as acceptable, and 8 radiomic features were classified as uncertain (Table 2). Neighborhood Gray-Tone Difference Matrix (NGTDM) and Gray-Level Co-occurrence Matrices (GLCM) had the highest mean CCC with values of 0.963 and 0.943, respectively.

Table 2 Lin’s concordance correlation coefficient (CCC) and mean absolute percent difference in radiomic feature derived data (|%∆|RF) between auto and manual contours for all radiomic features.

Independent classification according to mean |%∆|RF can also be seen in Table 2, separated by class. In total, 21 of 46 radiomic features were classified as robust, including 7 of the 13 GLRLM radiomic features. Fourteen radiomic features were classified as robust according to both CCC and mean |%∆|RF. For 24 radiomic features, the stability classifications from CCC and mean |%∆|RF did not match.

Discussion

DIR and contour accuracy

Mean DSC and mean MDA of the current study (Table 1) are both well within the TG-132 recommended tolerances, indicating that the DIR workflow followed here can produce accurate prostate contours. A study by Forde et al. found that 37% of contours created through DIR had a DSC < 0.75. The DIR algorithm of our study, which produced no contours with DSC < 0.7, outperformed the one used by Forde et al., an earlier version of Varian’s auto contouring (SmartSegmentation version 15.5) based on an atlas-based algorithm. However, 4.1% of contours created in this study still had DSC below the TG-132 recommendation of 0.8. Further inspection of these fractions found acceptable displacement vector fields and acceptable contours, as shown in Fig. 4 for a poor performing fraction (DSC = 0.79, MDA = 3.02 mm).

Figure 4

Automatically generated prostate contour (red) and manually drawn contour (green) for a sample patient with poor match statistics (DSC < 0.8 and MDA > 3 mm). Shown are the axial (A), coronal (B), and sagittal (C) slices.

While ΔCM and ∆Vol are not included in the TG-132 recommendation, both are useful indications of auto and manual contour matching. A previous study by Studenski et al. considered 16 patients on hypo-fractionated schemes and found the percent difference in prostate volume from contours created through a DIR workflow to be < 10%38, and a previous study by Forde et al., which looked at inter-observer delineation variability and its impact on radiomic feature robustness, found that the interobserver variation in contour size was as high as 80%5. In the study presented here, 149 of the 1,010 fractions had a difference in volume between the auto and manual contours (∆Vol) > 10% (Table 1). Large discrepancies in volume can indicate a poor DIR; however, the mean DSC of these 149 fractions was 0.88, and visual inspection showed acceptable contours.

The radiomic features belonging to the morphological feature class can also be used as a measure of contour accuracy. As can be seen in Table 2, the mean |%∆|RF for all 4 morphological features is less than 0.1%. This low mean absolute percent difference for the morphological feature class is another indication that the automatically generated contour and the manual contour are in good agreement.

Radiomic feature robustness

The Spearman rank correlation coefficients for %∆RF versus DSC (Fig. 3) show that all 46 radiomic features had a weak correlation for both populations under consideration. A low Spearman rank correlation coefficient between %∆RF and DSC highlights the complex interplay between radiomic feature derived data of differing contours. That is, a large difference in contour does not necessarily translate to a large difference in radiomic feature derived data, and a small difference in contour does not necessarily translate to a small difference in radiomic feature derived data.

A higher Lin’s concordance correlation coefficient (CCC) indicates a more robust radiomic feature. As seen from Table 2, the NGTDM class radiomic features had the highest mean CCC of 0.963, while GLRLM class radiomic features had the lowest mean of 0.820. Based on this, NGTDM was the most robust class of radiomic features, while GLRLM was the least robust class of radiomic features when considering differences in prostate contours. In contrast to our study, Rizzetto et al. found that GLRLM was the most robust to differing contours, considering colorectal liver metastases contours10. These incongruous results may be due to the differing locations within the body (prostate versus liver) and/or the differing contour sizes, and they reinforce the idea that the robustness of radiomic features should be evaluated per site, as it can vary by location. Future radiomic studies should consider location-specific radiomic feature robustness, as the radiomic feature derived data has varying dependence on contour, as seen in this study and others3,4,5,10,11,12.

Similar to the results of this study, a study by Yang et al. evaluating the robustness of radiomic features taken from PET images of lung cancer patients found only weak or moderate correlations between %∆RF and DSC14. A weak correlation implies a more complex interplay between the delineation of the volume and the contents of the volume. Even when two contours of a volume have a high DSC, the percent difference in radiomic feature derived data between the two contours may be large.

Mean |%∆|RF alone was also used to evaluate the stability of radiomic features to differences in prostate contours. Forde et al. found that GLRLM had the smallest mean |%∆|RF considering parotid gland contours5. The results of Forde et al. agree with the work presented here, in which 7 of the 13 GLRLM radiomic features were classified as robust, making GLRLM the highest performing class.

Severe discrepancies between CCC and mean |%∆|RF do occur. For example, the statistical features ‘Skewness’ and ‘Kurtosis’ had CCC values of 0.879 and 0.946, respectively, and mean ± SD in |%∆|RF of 143.2 ± 694.1 and 85.7 ± 4786, respectively. If the auto contour radiomic feature derived data and manual contour radiomic feature derived data have similar means and standard deviations over all fractions, CCC will be high and indicate high similarity. However, in the same situation, |%∆|RF per auto and manual contour pair can be large and varying. This leads to situations where a radiomic feature has both a desirably high CCC and an undesirably high mean |%∆|RF.
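One way such a discrepancy can arise is sketched below with made-up numbers. It assumes the percent difference is taken relative to the manual contour value, so a feature that can take values near zero (as skewness can) produces a very large percent difference for a small absolute change. This is offered as a plausible mechanism for illustration only; the source does not establish it as the cause of the reported values.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)


def mean_abs_percent_difference(auto_values, manual_values):
    """Mean |%d| with the manual value as the assumed reference."""
    diffs = [abs(100.0 * (a - m) / m)
             for a, m in zip(auto_values, manual_values)]
    return sum(diffs) / len(diffs)


# Hypothetical per-fraction values of a signed feature; one manual value
# sits close to zero.
manual = [-2.0, -1.0, 0.01, 1.0, 2.0]
auto = [-1.9, -1.1, 0.11, 1.1, 1.9]

ccc = lins_ccc(auto, manual)
mapd = mean_abs_percent_difference(auto, manual)

# Every pair differs by only about 0.1 in absolute terms, so CCC is high,
# but the pair near zero dominates the mean absolute percent difference.
print(ccc)
print(mapd)
```

In this toy case the feature would be labeled robust by CCC and fall outside the stated ranges for mean |%∆|RF, which mirrors the kind of mismatch described above.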

Conclusions

This study demonstrated that an intensity-based DIR algorithm applied to daily CBCTs is sufficiently robust and accurate to meet the recommendations of TG-132 for prostate cancer. The radiomic features derived from DIR-generated auto contours and manually drawn contours were acceptably similar for 22 of 46 radiomic features. However, the dependence on contours varies from one radiomics class to another and from one radiomic feature to another. Weak correlations between %∆RF and DSC imply a complex interplay of volume and contents when extracting radiomic feature data from prostate contours, which requires further investigation.