Introduction

Glioblastoma is the most common primary malignant brain tumor and still carries a very poor prognosis despite modern therapeutic strategies such as image-guided resection, chemotherapy, and radiotherapy [1, 2]. Since a definitive curative therapy is lacking, prolonging overall survival remains the main goal in most cases [3]. Promising novel therapies such as bevacizumab, which targets tumor angiogenesis, have unfortunately failed to show a benefit in overall survival, although recent studies observed prolonged progression-free survival [4]. So far, early detection and repeated resections, even of small recurrences, seem to prolong overall survival, although the available studies remain controversial [3, 5, 6].

Postoperative and therapy-related MR tumor monitoring plays a crucial role in determining response, stable disease, or progressive disease and in distinguishing progression from therapy-induced pseudo-progression [7]. In 1990, Macdonald et al. introduced response criteria based on a two-dimensional measure of contrast enhancement on computed tomography scans; these were later revised by the RANO (response assessment in neuro-oncology) working group and are now widely applied to magnetic resonance imaging (MRI) [7, 8]. Although the RANO criteria provide a quantitative and objective approach to radiographic changes in the course of glioblastoma, the two-dimensional evaluation still has several limitations, for example, with irregularly shaped tumors, multifocal lesions, or cystic components [7, 9]. A three-dimensional volumetric approach could overcome these limitations. With modern software solutions available, tumor volume, as a quantitative measure, is likely to play an increasingly important role in the decision-making process of neuroradiologists in postoperative tumor monitoring [9].

Different approaches have been used to assess tumor volume, including manual, automated, and semi-automated segmentation methods [10–12]. However, to interpret volumetric results, it is important to benchmark the segmentation tool in terms of intra- and inter-rater reliability and precision error. Only changes in volume exceeding certain cutoffs, such as the least significant change (LSC), should be regarded as significant. Thus, the purpose of this study was to evaluate the reliability of a commercially available semi-automatic segmentation tool in glioblastoma patients. Secondary objectives were to assess the user-dependence for different experience levels and to distinguish between the segmentation of fluid-attenuated inversion recovery (FLAIR) volume (FV) and contrast-enhancing volume (CEV).

Methods

Patients

A total of 320 segmentations of FLAIR and magnetization prepared rapid gradient echo (MPRage) sequences were done in five patients with glioblastoma (four male individuals, mean age at imaging 58.8 ± 12.0 years). All patients underwent tumor resection at the Department of Neurosurgery at our institution and received preoperative MRI and postoperative follow-up MRI at the Department of Neuroradiology between November 2013 and August 2014. The study was approved by the local ethics committee in accordance with the ethical standards of the 1964 Declaration of Helsinki and its later amendments [13]. Histopathological analysis was done at the Department of Neuropathology and confirmed the diagnosis of glioblastoma in all cases.

MR-Imaging

All patients underwent high-resolution MRI on a 3 T scanner (Achieva 3 T, Philips Medical Systems, The Netherlands) using an 8-channel or 16-channel phased array head coil. A 3D T2-weighted FLAIR sequence (1.11 × 1.11 × 1.12 mm³, TR/TE of 4800/306 ms) and a 3D T1-weighted MPRage sequence (isotropic resolution 1 mm³, TR/TE 9/4 ms) with and without contrast agent were acquired, both aligned parallel to the anterior/posterior commissure line. A contrast medium injection system (Spectris Solaris EP, Siemens Medical, Erlangen, Germany) was used to administer Magnograf® (MaRoTrast, Jena, Germany; 0.2 ml/kg body weight) as contrast agent.

Raters

Intra-rater reliability was assessed by a single experienced rater. Inter-rater reliability was assessed separately in three different groups of raters. The first group consisted of four nonexperienced raters without any medical background (volunteers) who received a brief 30-min introduction to the software, the principles of brain tumor segmentation, and anatomy. The second group consisted of four medical students (medical students), who received the same introduction as the first group. Volunteers and medical students had stand-by supervision by an experienced rater while the segmentations were done. The third group consisted of four experienced physicians working at the Department of Neuroradiology (neuroradiologists), who received a short software introduction only. The criteria of segmentation (see below) were explained to all raters prior to the segmentation.

Semi-automated Volumetry

All 320 segmentations of FLAIR and MPRage sequences were done with the novel tool “smartbrush”/BrainLab Elements (BrainLab, Feldkirchen, Germany). Smartbrush is a semi-automated segmentation tool based on a region-growing algorithm, a standard technique in medical image processing [12, 14, 15]. First, a 2D-segmentation aided by the region-growing algorithm is drawn manually in the central part of the tumor; it is then 3D-interpolated once the algorithm is given an additional 2D-segmentation in a perpendicular slice. The segmentation can be changed manually by adding or erasing regions of interest, either with the help of the region-growing algorithm or completely manually. In each tumor, both FV and CEV were segmented separately. Segmentations of FV included all perifocal and tumor-associated FLAIR hyperintensities. Surrounding nontumor-related hyperintense spots (e.g., microangiopathy), the resection cavity, and cysts were not included in FV (Fig. 1). For CEV, only contrast enhancement was to be segmented, whereas larger cysts, ventricular plexus, vessels, and T1-weighted hyperintense blood residuals were to be excluded. For this purpose, the MPRage sequence without contrast agent or the subtraction image of the pre- and postcontrast MPRage was displayed on a different screen of the same workstation that was used for segmentation. To reduce variability due to volume averaging, only segmentations with a total volume of > 0.5 cm³ were considered measurable [7]. For assessment of intra-rater reliability, the images of each patient were segmented four times by the same rater at baseline and follow-up MRI, with an interval of 1 week between segmentation sessions. For determining inter-rater reliability, the images of each patient were segmented by all raters of a group at baseline and follow-up MRI.
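The region-growing principle mentioned above can be illustrated with a minimal sketch. This is not the smartbrush implementation, whose internals are not described here; it is a generic textbook variant that assumes a 2D slice stored as nested lists, 4-connected neighbours, and a fixed intensity tolerance around the seed value. The names `region_grow`, `seed`, and `tolerance` are hypothetical.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to `seed` whose
    intensity differs from the seed intensity by at most `tolerance`.

    Assumption: generic intensity-based region growing, not the
    vendor's algorithm.
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Toy slice: a bright 2 x 2 "lesion" plus one bright but unconnected pixel.
slice_2d = [
    [10, 10, 10, 10],
    [10, 90, 95, 10],
    [10, 92, 88, 10],
    [10, 10, 10, 90],
]
lesion = region_grow(slice_2d, seed=(1, 1), tolerance=10)
```

In this toy slice the pixel at (3, 3) is as bright as the seed but is not connected to it, so it stays outside the region. That connectivity constraint is what distinguishes region growing from simple thresholding.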
The relative change in volume between baseline and follow-up MRI was calculated as the quotient of follow-up volume and initial volume of segmentations and is displayed as a factor (f); f > 1 indicates an increase in volume, f < 1 shows a decrease in volume between the two different time points.
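As a small illustration of the factor f, the following sketch computes the quotient of follow-up and baseline volume and applies the measurability cutoff of 0.5 cm³ stated above. Returning `None` for non-measurable volumes is an assumption made for this example; the handling of such cases is not specified here. The function name is hypothetical.

```python
MIN_MEASURABLE_CM3 = 0.5  # measurability cutoff stated in the text


def relative_change(baseline_cm3, follow_up_cm3):
    """Return f = follow-up volume / baseline volume.

    Assumption: if either volume is not measurable (<= 0.5 cm^3),
    no factor is reported and None is returned.
    """
    if min(baseline_cm3, follow_up_cm3) <= MIN_MEASURABLE_CM3:
        return None
    return follow_up_cm3 / baseline_cm3


f_growth = relative_change(20.0, 25.0)     # f > 1: increase in volume
f_shrinkage = relative_change(40.0, 30.0)  # f < 1: decrease in volume
```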

Fig. 1

Exemplary data set of patient 2 showing different delineations for fluid-attenuated inversion recovery volume (FV) (outer rims) and contrast-enhancing volume (CEV) (inner rims) of a central cystic/necrotic glioblastoma in the left parietal lobe at baseline magnetic resonance imaging. FLAIR images (left) and postcontrast MPRage images (middle and magnified section on the right) in axial (a) and coronal (b) views. Colors for FV: yellow (neuroradiologist), green (medical student), blue (volunteer); colors for CEV: orange (neuroradiologist), red (medical student), and purple (volunteer)

Statistical Analysis and Illustrations

Consistency among the segmentations of a single rater is referred to as intra-rater reliability. Consistency among the segmentations of different raters is termed inter-rater reliability. Both intra- and inter-rater reliability were assessed by intra-class correlation (ICC) in a two-way mixed, consistency, average-measure approach [16, 17]. ICC estimates range between 1, indicating perfect agreement, and 0, indicating only random agreement. Cutoffs for a qualitative rating of ICC are poor (< .40), fair (.40–.59), good (.60–.74), and excellent (.75–1.0) [18].
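For readers who want to reproduce this type of ICC outside SPSS, the following is a minimal sketch of the two-way, consistency, average-measure form, commonly written ICC(3,k) = (MS_rows − MS_error) / MS_rows. It is an illustration under stated assumptions, not the SPSS routine: `ratings` is a hypothetical complete table with one row per segmented data set and one column per rater, and no confidence interval is computed.

```python
def icc_consistency_average(ratings):
    """ICC(3,k): two-way model, consistency, average of k raters.

    ratings: list of n subjects, each a list of k rater values
    (assumed complete, with n >= 2 and k >= 2).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Hypothetical volumes (cm^3): rater 2 is always 2 cm^3 above rater 1.
offset_only = [[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]]
icc_offset = icc_consistency_average(offset_only)
```

Because the consistency definition ignores systematic offsets between raters, the example with a constant 2 cm³ difference yields an ICC of 1 even though the raters never report identical volumes. An absolute-agreement ICC would penalize that offset.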

The coefficient of variation (CoV) was calculated as the quotient of the standard deviation (SD) and the arithmetic mean (x̄) of the different segmentations of a data set n (1). Significance between CoV in FV and CEV was calculated by the Wilcoxon signed-rank test for dependent samples. For precision error, the root-mean-square error (RMSE) was calculated as the root mean square of the CoV over all m data sets (2) [19]. Knowing the RMSE, the LSC was calculated by multiplying the RMSE by the factor 2.77 [20], with changes exceeding the LSC considered statistically significant at a 95 % confidence level. As a measure of the overlap between two segmentations (A, B), the Dice score was applied (3) [21]. A Dice score of 1 shows perfect agreement between two segmentations; a Dice score of 0 indicates no overlapping regions.

$$ CoV_{n}=\frac{SD_{n}}{\bar{x}_{n}}\times 100\% $$
(1)
$$ RMSE=\sqrt{\frac{\sum_{n=1}^{m}CoV_{n}^{2}}{m}} $$
(2)
$$ Dice\left(A,B\right)=\frac{2\left|A\cap B\right|}{\left|A\right|+\left|B\right|} $$
(3)
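Equations (1) to (3) and the LSC can be sketched in a few lines. This is a minimal illustration rather than the spreadsheet used in the study. It assumes the sample standard deviation (n − 1 denominator), which is not specified here, and represents a segmentation as a set of voxel indices. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cov_percent(volumes):
    """Equation (1): SD / mean of repeated segmentations, in percent.
    Assumption: sample SD (n - 1 denominator)."""
    return stdev(volumes) / mean(volumes) * 100.0


def rmse_percent(covs):
    """Equation (2): root mean square of the CoV values of m data sets."""
    return sqrt(sum(c ** 2 for c in covs) / len(covs))


def lsc_percent(rmse):
    """Least significant change: RMSE multiplied by the factor 2.77."""
    return 2.77 * rmse


def dice(a, b):
    """Equation (3): overlap of two segmentations given as voxel sets.
    Assumption: two empty segmentations count as perfect agreement."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical repeated volumes (cm^3) of one data set.
cov_example = cov_percent([9.0, 10.0, 11.0])   # SD 1.0, mean 10.0
lsc_example = lsc_percent(8.2)                 # RMSE reported for CEV
dice_example = dice({1, 2, 3}, {2, 3, 4})      # 2 shared of 3 + 3 voxels
```

With the single-rater RMSE of 8.2 % reported for CEV, `lsc_percent` gives about 22.7 %, which matches the LSC quoted in the discussion.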

Calculations of the Wilcoxon signed-rank test and ICC were done with IBM SPSS Statistics, release 23.0 (IBM, Armonk, NY, USA). All other calculations were done with Excel (Microsoft, Redmond, USA). Illustrations were done with BrainLab Elements (BrainLab, Feldkirchen, Germany) and PowerPoint 2010 (Microsoft, Redmond, USA). Tables were drawn with Word 2010 (Microsoft, Redmond, USA) and IBM SPSS Statistics, release 23.0 (IBM, Armonk, NY, USA).

Results

Total tumor volumes are shown in Table 1, with the lowest volume for FV of 29.0 cm³ (± 3.3 cm³) and a maximum FV of 144.4 cm³ (± 4.8 cm³). Minimal CEV was below 0.5 cm³, with a maximum of 26.2 cm³ (± 2.8 cm³).

Table 1 Mean volumes for FLAIR volume and contrast-enhancing volume as well as the relative change of volume between baseline and follow-up segmentations for all raters. The relative change (rel. change) of volume is displayed as a factor (f), with f > 1 indicating an increase in volume whereas f < 1 shows a decrease in volume

Intra-rater Reliability

Intra-rater reliability was excellent with an ICC of 0.998 [Confidence interval (CI) 0.996–1.000] for single rater segmentations of FLAIR images and with an ICC of 0.990 [CI 0.974–0.998] for post contrast MPRage images [18, 22]. The precision error for segmentations of FV trended towards lower values than for segmentations of CEV (p = .086). Overall, RMSE for FV was 3.3 % whereas RMSE for CEV was 8.2 % (Table 2). The median Dice score, as a measure for the overlap between the segmentations, was .92 for FLAIR and .88 for contrast enhancement showing higher agreement for FLAIR segmentations (Fig. 2).

Fig. 2

Boxplots for Dice scores as a measure of the overlap between the segmentations for fluid-attenuated inversion recovery volume (FV) and contrast-enhancing volume (CEV) for the different groups of raters. A Dice score of 1 indicates perfect agreement. Median values are indicated by black horizontal bars

Table 2 Root-mean-square error for all groups of raters for FLAIR segmentations (FV) and segmentations of contrast-enhancing tumor tissue (CEV) in baseline and follow-up magnetic resonance imaging. The number of segmentations in each category (n) is displayed as well as the p value for Wilcoxon signed-rank test between FV and CEV computed for each group independently

Inter-rater Reliability

Among the different groups of raters, ICC for FV was 0.996 [CI 0.989–0.999] for neuroradiologists, 0.994 [CI 0.985–0.998] for medical students, and 0.996 [CI 0.990–0.999] for volunteers. ICC for CEV was 0.985 [CI 0.956–0.996] for neuroradiologists, 0.988 [CI 0.965–0.997] for medical students, and 0.991 [CI 0.975–0.998] for volunteers, each indicating excellent inter-rater reliability [18, 22]. Excellent inter-rater reliability shows that there is high agreement of segmented volume within each group and only minimal measurement error, which does not substantially decrease statistical power [17]. Again, the precision error was lower for segmentations of FV than for segmentations of CEV: the difference was a trend for medical students (p = .066) and was significant for neuroradiologists (p = .011) and volunteers (p = .011). RMSE for FV was 9.2 % for the group of neuroradiologists, whereas RMSE for CEV was 16.7 %. RMSE in the group of medical students was 8.5 % for FV versus 13.6 % for CEV. Volunteers showed the lowest RMSE among all groups of raters for single time point segmentations: RMSE was 7.3 % for FV, while RMSE for CEV was 10.8 %. However, precision error was significantly higher among all inter-rater groups compared with single rater segmentations (p = .009 for FV and p = .036 for CEV, data not shown in Table 2). Again, the overlap between segmentations of FLAIR was better in every group than the overlap of CEV segmentations, with a median Dice score for neuroradiologists of .90 for FV and .83 for CEV. Median Dice scores were .88 for FV and .84 for CEV for medical students, and likewise .88 for FV and .84 for CEV for volunteers. Visual comparison of segmentations between the groups revealed differences in ambiguous cases (Fig. 3).

Fig. 3

Rendered 3D-volumes of different raters in patient 2 in fluid-attenuated inversion recovery (FLAIR) images (a) and postcontrast MPRage images (b). Wallerian degeneration (white arrow) was inconsistently segmented in the FLAIR sequence among the three different raters (colors for FLAIR volume: yellow = neuroradiologist, green = medical student, blue = volunteer; colors for contrast-enhancing volume: orange = neuroradiologist, red = medical student, purple = volunteer). Objects are not to scale

Longitudinal Evaluation (Table 3)

The relative change in volume between baseline and follow-up MRI was calculated for each rater in each group. Here, single rater segmentations showed the lowest RMSE for FV of 5.2 %. Remarkably, RMSE of neuroradiologists was only 7.5 % for FLAIR segmentations, an improvement compared with RMSE for single time point segmentations in this group (9.2 %). For medical students and volunteers, RMSE for FV was 10.1 % each. RMSE for segmentations of contrast enhancement was higher for every group of raters compared with single time point segmentations: 12.7 % for the single rater, 16.9 % for neuroradiologists, 15.3 % for medical students, and 12.9 % for volunteers.

Table 3 Root-mean-square error of relative change of volume between baseline and follow-up magnetic resonance imaging for FLAIR volume and contrast-enhancing volume computed for each group of raters. The number of data sets in each category is displayed (n)

Discussion

Quantitative volumetric reports of contrast-enhancing or FLAIR hyperintense tumor compartments are needed for an objective evaluation of stable or progressive disease in patients with glioblastoma [23]. This study showed that semi-automated segmentations of glioblastoma can be done reliably by different groups of raters with a commercial software solution based on a region-growing algorithm. Single rater segmentations showed the lowest precision error. In all groups, segmentations of FV showed lower precision errors compared with segmentations of CEV. In the longitudinal evaluation, the relative change in volume between baseline and follow-up MRI showed the lowest precision error for single rater segmentations and for the group of neuroradiologists in FLAIR images.

RANO response criteria replaced the Macdonald criteria for assessing changes in postoperative MR imaging of glioblastoma and are currently considered the state of the art [7, 8]. RANO response criteria define progressive disease as an increase of more than 25 % in the perpendicular diameter product on postcontrast T1w sequences [7]. Assuming isotropic tumor growth, this two-dimensional increase of 25 % in diameter product would result in a 39.8 % increase in tumor volume (Fig. 4). In our volumetric approach, the lowest RMSE for segmentations of contrast enhancement was 8.2 % for single raters. To interpret the RMSE, the LSC should be considered: it defines the minimal change that can be regarded as significant. An RMSE of 8.2 % translates into an LSC of 22.7 %, meaning that only changes in volume exceeding this value should be considered significant. Assuming that the two-dimensional RANO criteria can be validly transferred to our three-dimensional volumetric approach, our results show that the RANO cutoff value for contrast-enhancing lesions (39.8 %) can easily be met by single-rater segmentations of CEV. For different raters, the LSC for CEV ranged between 30.0 and 46.3 %, indicating that the cutoff value of 39.8 % for detection of disease progression cannot always be met. Our results further show that FLAIR segmentations can be done more precisely than segmentations of CEV. For single rater segmentations of FLAIR sequences, an RMSE of 3.3 % (LSC = 9.1 %) shows that changes in FLAIR volume can be detected more than 4 times as precisely as recommended by RANO for CEV (39.8 %). Even in the longitudinal evaluation, FLAIR segmentations showed the lowest precision error, with an RMSE of only 5.2 % (LSC = 14.4 %). This LSC corresponds to an increase in the perpendicular diameter product of only 9.4 %, indicating that 3D-FLAIR segmentations can reliably detect changes less than half the size of the 25 % threshold currently recommended by the 2D-RANO approach for CEV.
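The conversion between the two-dimensional and the three-dimensional thresholds used above follows from simple scaling: under isotropic growth, area scales with the square and volume with the cube of the linear dimension, so the volume factor is the area factor raised to the power 3/2. The sketch below reproduces the two conversions; the function names are hypothetical, and the isotropy assumption is the one stated in the text.

```python
def volume_increase_from_area_increase(area_increase):
    """Fractional volume increase implied by a fractional increase in
    the perpendicular diameter product, assuming isotropic growth."""
    return (1.0 + area_increase) ** 1.5 - 1.0


def area_increase_from_volume_increase(volume_increase):
    """Inverse conversion: fractional increase in diameter product
    implied by a fractional volume increase, assuming isotropic growth."""
    return (1.0 + volume_increase) ** (2.0 / 3.0) - 1.0


# RANO cutoff of 25 % in diameter product, expressed as volume change.
rano_volume_cutoff = volume_increase_from_area_increase(0.25)

# Longitudinal FLAIR LSC of 14.4 %, expressed as diameter-product change.
flair_area_equivalent = area_increase_from_volume_increase(0.144)
```

Rounded to one decimal place, these evaluate to 39.8 % and 9.4 %, the values quoted in the paragraph above.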

Fig. 4

Volume and area of a spherical tumor configuration. An increase of 25 % in tumor area (defined as the perpendicular diameter product in the response assessment in neuro-oncology criteria) leads to an increase in volume from 100 to 139.8 % for a spherical tumor configuration

The configuration of contrast enhancement varies widely among glioblastomas, which leads to certain limitations of the 2D-RANO criteria, for example, in multifocal lesions, enhancing parts of resection cavities, or cystic tumors [9]. Our results show that 3D-segmentation can be done reliably and with a precision error below the RANO threshold, supporting the hypothesis that volumetric approaches could be superior to area-based measurements in determining tumor size in those cases [9]. There are growing doubts as to whether contrast enhancement is a valid surrogate marker for the detection of progressive disease in glioblastoma, since any disturbance of the blood–brain barrier can lead to gadolinium enhancement (e.g., postoperative scarring, radiation, ischemia) [7]. Considering this and the high precision error for CEV segmentations, tumor-associated FLAIR changes might be better suited for semi-automated detection of progressive disease. Obviously, most of the above-mentioned disturbances of the blood–brain barrier can also affect FLAIR changes, but with a precision error less than half as large for FLAIR segmentations of single raters, tumor growth may be detected earlier.

Among the three different groups of raters, volunteers showed the lowest precision error for single time point segmentations. This may be due to the user-friendly interface, the 3D-interpolation of the smartbrush, as recently described by another group [11], and the fact that volunteers had a permanent stand-by supervisor. As this experienced supervisor helped in doubtful and ambiguous cases, these results may partly reflect a mixture of intra- and inter-rater reliability. Interestingly, when looking at the relative change in volume between baseline and follow-up MRI for each rater, neuroradiologists showed a lower RMSE than for single time point segmentations, suggesting that neuroradiologists kept their individual ways of segmenting, whereas the other groups did not. However, for an objective and quantitative evaluation of volumetric data, brain tumor segmentations should not be influenced by individual ways of image interpretation.

Previous studies showed that semi-automated segmentation techniques allow a reliable and fast volumetric assessment of CEV in glioblastoma [23, 24]. In these studies, only one rater did the semi-automated segmentations and compared the results with other segmentation techniques. In contrast, our approach aimed to assess the reliability of semi-automated volumetry itself for different groups of users. In the above-mentioned studies, only CEV was delineated, and segmentations of FLAIR hyperintense tumor parts were not addressed. We showed that FLAIR changes in particular can be delineated with excellent reliability using a semi-automated approach. Porz et al. compared an automated segmentation method with manual segmentations by two experts in terms of Dice scores [25]. Interestingly, this study did include FLAIR images and showed the highest agreement between the experts' manual segmentations when FLAIR hyperintense edema was included, supporting our results. Automatic segmentation, which is considered more objective but less accurate than semi-automated volumetry, was inferior [25]. Additionally, Porz et al. reported the highest variation among segmentations for CEV, again supporting our findings.

Radbruch et al. [24] showed earlier that about 10 % of MRI scans in glioblastoma show exclusive progression of FLAIR changes, which was often followed later by progression of CEV. They proposed a threshold for progressive FLAIR changes of 15 %. The results of our study showed that this FLAIR threshold could be reliably detected by our volumetric approach for single raters (LSC = 14.4 %). Using semi-automated segmentations of FLAIR changes, progressive disease in glioblastoma might be detected earlier in a reliable and quantitative way that could easily be implemented in clinical routine.

Conclusions

Semi-automated delineation of tumor volume with a commercial region-growing algorithm can be done easily and reliably by all groups of raters in patients with glioblastoma, even without neuroradiologic expertise. Segmentations of tumor-associated FLAIR changes were consistently more precise than segmentations of contrast enhancement, with the best results in the case of a single rater. Precision of experienced neuroradiologists outperformed the nonexperienced groups only in the longitudinal evaluation of FLAIR changes. Here, a single experienced rater could reliably and quantitatively detect progressive FLAIR changes of less than 15 %, which could help to detect progressive disease earlier and more precisely than currently recommended by RANO for contrast enhancement.