Introduction

Diffuse low-grade gliomas (LGGs) are World Health Organization (WHO) grade II tumors with an infiltrative growth pattern and constitute 10–15% of all primary brain tumors in adults [1–4]. They are initially slow growing and predominantly affect otherwise healthy young adults. Eventually, a malignant transformation into higher grade gliomas will occur. In recent years, a survival benefit from earlier surgical and oncological treatment in the low-grade stage of the disease has been demonstrated, compared to a wait-and-see approach [2, 5, 6].

It is known that tumor size at diagnosis, extent of surgical resection and volume of the residual tumor are strong prognostic factors [7–17]. Several studies have also shown that the growth rate of the tumor is associated with risk of malignant transformation and overall prognosis [7, 18–21]. After first-line therapy, a significant re-growth seen on repeated Magnetic Resonance Imaging (MRI) examinations during follow-up often forms the basis for clinical decision making concerning reoperations or adjuvant treatment. Thus, size and growth rate are key radiological factors in LGG care, both for prognostication and for clinical decision making. Due to the infiltrative growth pattern and often subtle changes between MRI assessments, volumetric assessment is presumed to be the most sensitive method [7, 22, 23].

Manual segmentation by an experienced operator is considered the gold standard for volumetric segmentation of brain tumors; however, this is a very time-consuming procedure [24, 25]. Most previous studies have investigated inter-rater variability and compared manual to semi-automatic methods [21, 26–30], while few have explored the intra-rater variability in low-grade glioma segmentation. As only one radiologist is involved in most clinical studies and many clinical settings, knowledge of the intra-rater variability in LGG assessment is highly relevant.

In this study, we sought to address the intra-rater variability in manually verified semi-automatic segmentation of low-grade gliomas by performing serial segmentations of the same tumors, all done by one radiologist. We also sought to explore possible factors associated with variability.

Methods

Study population

Patients were included from an ongoing study on LGGs. Tumor borders were radiologically evaluated and classified as: (1) well-defined, (2) partially absent, or (3) absent. Well-defined indicates a discrete border between tumor and normal appearing brain, partially absent is a more diffuse border zone, but still possible to separate tumor and normal brain, and absent is an ill-defined border and infiltrative finger-like growth pattern.

Image acquisition

MRI images used for tumor segmentation were all pre-operative 3D Fluid Attenuated Inversion Recovery (FLAIR) images. Acquisitions were done on three different MRI systems, and thus with slightly different echo time, repetition time and inversion time (TE/TR/TI). Seventeen patients were examined using a Siemens Skyra 3.0T scanner (389/5000/1800 ms) with both slice thickness and in-plane resolution of 1 mm. Four patients had their scan on a Siemens Avanto 1.5T scanner (333/6000/2200 ms or 474/6500/1800 ms) with both slice thickness and in-plane resolution of 1 mm. Two patients were examined with a Philips Intera 3.0T scanner (350/8000/2400 ms) with a slice thickness of 1.2 mm, 0.6 mm overlap and in-plane resolution of 0.43 mm.

Segmentation procedure

For the segmentation procedure we used the open source software 3D Slicer 4.4.0 (http://www.slicer.org), which is a software platform for quantitative imaging, designed for use in cancer care [31]. 3D Slicer consists of a core platform with several standard modules and a graphical user interface.

For the segmentations in this study, we used the “GrowCut” region based segmentation algorithm in the built-in “Editor” module. First, the border of the tumor was manually marked on at least one slice in each of the three planes (transversal, coronal and sagittal). Then the area outside the tumor was marked with a different color. The “GrowCut” algorithm was then run, resulting in an image label superimposed on the MRI image [32]. This label was further edited to fit with the tumor borders, first with the “dilate” and “erode” functions and then manually. Tumor volume in mL was then obtained using the “Label statistics” extension.
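As an illustration of the final step, the conversion from a label map to a volume in mL can be sketched as a voxel count multiplied by the voxel volume. This is a minimal sketch and not the implementation of the 3D Slicer extension: the function name `label_volume_ml`, the nested-list label map and the toy data are assumptions made for the example.

```python
def label_volume_ml(label, spacing_mm):
    """Volume in mL of the nonzero voxels in a 3D label map.

    label      -- nested lists indexed [slice][row][column] (hypothetical layout)
    spacing_mm -- voxel size along the three axes, in mm
    """
    sx, sy, sz = spacing_mm
    voxel_mm3 = sx * sy * sz
    n_voxels = sum(1 for plane in label for row in plane for v in row if v)
    return n_voxels * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Toy label map: a 10 x 10 x 10 block of labelled voxels, 1 mm isotropic.
toy = [[[1] * 10 for _ in range(10)] for _ in range(10)]
print(label_volume_ml(toy, (1.0, 1.0, 1.0)))  # 1.0
```

With 1 mm isotropic voxels, as on the two Siemens systems, one voxel corresponds to 0.001 mL, so a single-voxel shift of the border over a large tumor surface can already amount to several mL.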

All segmentations were done by one radiologist (H. K. B.) with 7 years of radiology experience, but without any prior experience in segmentation or 3D Slicer. When in doubt, tumor borders were discussed with an experienced neuroradiologist (K. A. K.) with 20 years of experience. All tumors were segmented three times; once before any tumors were segmented for the second time, and twice before any tumors were segmented for the third time. To minimize recall bias when segmenting for the second and third time, we made sure that at least 40 days passed between repeated segmentations. We also attempted to have the same time interval between the second and third segmentation as between the first and second.

Measures of agreement

The Jaccard coefficient and Hausdorff distance (HD) are widely used and validated measures of agreement in evaluation of segmentation procedures [33–36]. The Jaccard coefficient is an overlap index used to compare segmentations. If S_i represents a segmentation in a series of n segmentations, each with volume V(S_i), then the Jaccard coefficient is defined as:

$$J_{n} = \frac{n\,V\left(\bigcap_{i=1}^{n} S_{i}\right)}{\sum_{i=1}^{n} V\left(S_{i}\right)}$$

Jaccard coefficient with two segmentations, J2, is the same as Dice Similarity Coefficient (DSC), which can be shown to be a special case of the Kappa-statistic used for intra-rater agreement [34]. The Jaccard coefficient takes on values from 0 to 1, with 0 when there is no overlap and 1 when there is a perfect match between the segmented volumes. Interpretation is similar to the Kappa-statistic with a strong agreement with values 0.80–0.90 and almost perfect agreement with values above 0.90 [37]. We have included both J2, comparing each pair of two segmentations, and J3, comparing all three segmentations. HD is a measure of distance between two segmentations, defined as the greatest distance measured from each point on the surface of one segmented volume to the closest point on the surface of the other [36]. HD is especially sensitive to local surface variations.
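The two measures defined above can be sketched in a few lines. This is an illustrative sketch only, not the code used in the study: the function names `overlap_jn` and `hausdorff`, the representation of a segmentation as a set of integer voxel coordinates, and the toy data are all assumptions. For simplicity the toy example passes whole voxel sets to `hausdorff`; to follow the surface-based definition in the text, only surface points would be passed.

```python
import math


def overlap_jn(segmentations):
    """J_n = n * V(intersection of all S_i) / sum of V(S_i).

    Each segmentation is a set of voxel coordinates; V is the voxel count.
    """
    n = len(segmentations)
    common = set.intersection(*segmentations)
    total = sum(len(s) for s in segmentations)
    return n * len(common) / total if total else 1.0


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        # Largest distance from a point in p to its nearest point in q.
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


# Toy data: three 4 x 4 single-slice squares, each shifted one voxel in x.
s1 = {(x, y, 0) for x in range(0, 4) for y in range(4)}
s2 = {(x, y, 0) for x in range(1, 5) for y in range(4)}
s3 = {(x, y, 0) for x in range(2, 6) for y in range(4)}

print(overlap_jn([s1, s2]))      # J2 = 2 * 12 / 32 = 0.75
print(overlap_jn([s1, s2, s3]))  # J3 = 3 * 8 / 48 = 0.5
print(hausdorff(s1, s2))         # 1.0 (in voxel units)
```

The toy example shows why J3 tends to be lower than any pairwise J2: the intersection of three segmentations can only shrink relative to the intersection of two. The brute-force distance search is quadratic in the number of points and is meant for small examples only.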

Exploring factors associated with agreement

In an attempt to explore possible features associated with agreement we compared agreement in small vs. large tumors (dichotomized from median tumor volume), in various histopathological subtypes, in relation to mean ADC levels of the tumor (the smallest of the three tumor volumes was used) and in relation to border conspicuity.

Statistics

IBM SPSS Statistics, Version 23.0 (IBM Corp., Armonk, NY) was used for statistical analysis. Central tendencies are presented as mean (standard deviation [SD]) or, when skewed, as median (interquartile range [IQR]). Normality was assessed with histograms and tested with the Shapiro–Wilk test. Differences in means were tested with the two-tailed Student’s paired t test when normally distributed and with the two-tailed non-parametric Related-Samples Sign test when skewed. Furthermore, the Jaccard coefficient and HD were calculated as measures of segmentation variability. Differences in agreement between sub-groups were tested using the Mann–Whitney U test for two groups and the Kruskal–Wallis test for three groups. P values below 0.05 were considered significant, while p values between 0.05 and 0.10 were considered as trending towards significance [38].
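To make the paired non-parametric test concrete, the following is a minimal sketch of a two-sided exact sign test for paired measurements. It is not the SPSS procedure used in the study, which may differ in details such as the handling of ties or the use of a normal approximation; the function name `sign_test_p` and the volumes in the example are hypothetical.

```python
from math import comb


def sign_test_p(first, second):
    """Two-sided exact sign test for paired samples; tied pairs are dropped."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Count of the less frequent sign among the non-zero differences.
    k = min(sum(d > 0 for d in diffs), sum(d < 0 for d in diffs))
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired volumes (mL) from two segmentation cycles.
cycle_a = [10, 12, 9, 20, 15, 7, 30, 11]
cycle_b = [9, 11, 8, 21, 14, 6, 28, 10]
print(sign_test_p(cycle_a, cycle_b))  # 0.0703125
```

In this made-up example seven of eight tumors are smaller in the second cycle, which gives p ≈ 0.07 and would fall in the range described above as trending towards significance.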

Results

We included preoperative MRIs from 23 untreated patients (median age 41 years (range 18–49), 13 males), with histopathologically verified supratentorial WHO grade II gliomas, operated between 2011 and 2014 at our hospital. There were 10 (43%) oligodendrogliomas, 8 (35%) astrocytomas, 3 (13%) unspecified LGGs and 2 (9%) mixed astrocytomas. Localization was frontal in 12 (52%), insular in 6 (26%) and temporal in 5 (22%) patients. Border margins were well-defined in 12 (52%), partially absent in 8 (35%) and absent in 3 (13%) tumors. Four tumors (17%) had an eloquent localization (after Chang et al. [8]).

Mean time between segmentation cycle 1 and 2 was 144 days (range 43–201), and between cycle 2 and 3 it was 148 days (range 115–202) (p = 0.71). Median tumor volume from segmentation cycle 1 was 26.4 mL (range 1.4–165.9), 27.6 mL (range 1.7–166.0) for the second cycle and 19.7 mL (range 1.4–163.0) for the third cycle. Comparisons between segmentations 1 vs. 2, 2 vs. 3 and 1 vs. 3 are shown in Table 1. There was a median difference in tumor volume of −1.3 mL between the first and second segmentation, corresponding to a median relative difference of 14% (IQR 5–28), a median HD of 9.8 mm (IQR 5.9–14.9) and a median J2 of 0.87 (IQR 0.79–0.91). There was a median difference in tumor volume of −1.3 mL also between the second and third segmentation cycle, corresponding to a median relative difference of 13% (IQR 2–19), a median HD of 7.4 mm (IQR 4.7–9.9) and a median J2 of 0.90 (IQR 0.83–0.93). When comparing the first and last segmentation cycle, there was a median difference in tumor volume of −4.1 mL, corresponding to a median relative difference of 14% (IQR −11.6 to −0.2), a median HD of 9.4 mm (IQR 5.2–19.1) and a median J2 of 0.87 (IQR 0.71–0.93). The difference in median tumor volume between subsequent segmentation cycles was not significant (p = 0.68), but between the first and last segmentation cycle it was significant (p = 0.01). Median Jaccard coefficient for all three segmentations (J3) was 0.82 (IQR 0.70–0.89). The absolute volume, HD, J2 and J3 for each tumor from each segmentation cycle are shown in Table 2, while a bar chart with the absolute tumor volume for all tumors from all three segmentations is shown in Fig. 1.

Table 1 Comparison of segmentation cycle 1 vs. 2 with segmentation cycle 2 vs. 3
Table 2 Volumes in milliliter (mL) for each tumor and each segmentation cycle
Fig. 1 Bar chart with volume in mL for all tumors and all segmentations, ordered by volume in segmentation 1. Segmentation 1 in blue, segmentation 2 in green and segmentation 3 in yellow

Subgroup analyses were performed (Table 3). There was a tendency towards smaller relative volume variability (p = 0.07) and a significantly higher J2 (p < 0.01) in larger tumors in segmentation cycle 2 vs. 3. Tumors with well-defined border showed less variability compared to tumors with partially absent and absent border in segmentation cycle 2 vs. 3, with significantly smaller difference in median absolute volume (p = 0.04), smaller HD (p < 0.01) and higher J2 (p = 0.04). Comparing histopathological subtypes, astrocytomas were significantly smaller than oligodendrogliomas in all segmentation cycles (p ≤ 0.04), but there was no difference in HD, J2 or J3 between the histopathological subgroups. There was no significant difference in tumor volume, HD, J2 or J3, between tumors with low or high ADC-values.

Table 3 Median tumor volume, absolute and relative volume differences and Hausdorff Distance (HD) for segmentation cycles 1 vs. 2 and cycles 2 vs. 3, Jaccard coefficient for segmentation cycles 1 vs. 2, cycles 2 vs. 3 and cycles 1 vs. 2 vs. 3 in sub-groups, grouped by size (dichotomized by median tumor volume), histopathology, ADC-values (dichotomized by median ADC-value) and border conspicuity

Discussion

In this study we found a better overlap agreement when the same LGGs were repeatedly segmented, with significantly increased J2 between the two last segmentations compared to the two first. We interpret this as a decreased intra-rater variability with increasing experience, which again could indicate more confidence in tumor border interpretation. There was a non-significant decrease of median difference in tumor volume between these cycles of 1.3 mL, corresponding to a median relative difference of 13–14%. The variability is however better demonstrated when comparing segmentation cycles 1 vs. 3 to segmentation cycles 1 vs. 2, where there was a significant decrease in median tumor volume of 4.1 mL (p < 0.01). Although only trending towards significance between the second and third segmentation cycle, the variability measured as relative difference in percent seems larger in smaller lesions (p = 0.07), while tumors with well-defined borders showed a significantly smaller variability measured in absolute volume (p = 0.04), HD (p < 0.01) and J2 (p = 0.04). Intra-rater agreement was not associated with histopathological subgroups, and we did not find a clear association between ADC-values and variability of volume measures. This study demonstrates that intra-rater variability of the gold standard volume assessment can be substantial and should be accounted for. Thus, some criteria defining tumor borders and progression are needed if moving from 2D to 3D volume assessment of LGGs.

In clinical situations, assessment of growth is often based on so-called “eye-balling” or unsystematic measures of tumor diameters. In clinical trials, tumor size has classically been measured as the product of two orthogonal diameters, measured on the axial slice with the largest diameter [39]. In the follow-up criteria from the Response Assessment in Neuro-Oncology (RANO) group, bi-diametric measurements are set as the standard method for response evaluation, mostly due to limited availability of volumetric measurements [40]. As LGGs have an irregular slow growth, it is commonly accepted that 3D volumetric measurements will more easily catch subtle changes between examinations, although there is a lack of studies comparing 2D and 3D measurements [7, 22, 41, 42]. However, the accuracy of volume or growth measurement is presumably not only dependent on the choice of method, but may also be limited by the operator who has to draw the line between tumor and normal brain in diffusely infiltrating tumors. LGGs usually show no contrast enhancement and segmentation has to rely on the inherent contrast properties of the tissue, which can be very close to those of normal brain tissue.

Much work has been put into characterization of LGGs in order to determine prognostic predictors, emphasizing the importance of volumetric assessment of the tumors. Pallud et al. have in several studies shown that the radiological growth rate of the tumor can predict malignant transformation [7, 18, 19]. Their work is supported by others, using growth rates to predict transformation and patient outcome within 6 and 12 months [20, 21]. Two studies describe a semi-automatic strategy for quantifying tumor growth using grey level recognition in T2/FLAIR images, and both methods are based on prior manual expert segmentation [27, 30]. In the first, Angelini et al. found a high volume segmentation variability, both in the baseline segmentation and in the follow-up segmentations, because they are all based on manual tracing. In the other, Weizman et al. looked at optic pathway and thalamic gliomas and showed quite good correlation between manual and semi-automatic segmentation volumes, but did not calculate intra- or inter-rater variability.

Several prior studies have measured inter-rater variability, but few focused on intra-rater variability. Both Kaus et al. and Akkus et al. found comparable levels of intra-rater variability between manual and semi-automatic segmentation of LGGs, but Jaccard coefficient was not calculated [26, 28]. In both studies intra-rater variability was lower than inter-rater variability. Zou et al. found a highly variable inter-rater DSC from 0.49 to 0.97 in LGG segmentation, comparable with our J2 values ranging from 0.40 to 0.96 [35]. With such high variability, experience in tumor border evaluation and general brain MRI interpretation will be highly important to minimize this factor, aiming for more consistent measurements, especially in a follow-up setting with growth evaluation.

As seen in the present study with a median difference in absolute volume between the first and third segmentation of 4.1 mL, intra-rater variability of manual volume segmentation should not be underestimated. This may be an argument for automatic methods of volume assessments. However, algorithms for automatic segmentation of LGGs have so far been disappointing, and in validation of automatic methods the substantial intra-rater variability of the current gold standard based on manual methods should be kept in mind [33]. For detecting progression or treatment responses in individual patients, a low inter- and intra-rater variability (i.e. reliability) might be more important than agreement with manual methods (i.e. validity). After all, the true volume of any glioma is always larger than depicted with any current imaging modality. For example, Pallud et al. found tumor cells 20 mm from the margins of such FLAIR abnormalities, while Zetterling et al. found IDH1-positive tumor cells up to 14 mm from FLAIR abnormalities [4, 43].

Part of the variation in tumor segmentation presented in this paper could be due to variation in manual initialization of the “GrowCut” algorithm. Manual initialization of semi-automatic segmentation algorithms is an important yet ill-addressed topic in the literature. As the tumors in this study have been segmented three times each with different manual initializations, this study implicitly addresses the question of initialization for this particular algorithm.

As mentioned, there seems to be an association between relative size and variability. We did not find this association within histopathological subtypes, where we could expect that variability would be larger in percent in the often smaller astrocytomas and larger in volume in the often larger oligodendrogliomas. Furthermore, since ADC values are associated with tumor cell density [44], we hypothesized higher relative variability in astrocytomas (that usually are less cell-dense than oligodendrogliomas [1]) with repeated volume assessment. Although we did not find an association between variability and ADC values, this could be a result of the fairly small sample size. Also, with the often high intratumoral heterogeneity, an ADC value from the tumor core instead of the entire tumor might be more representative.

Image acquisition forms the basis for further tumor evaluation and segmentation, and technical parameters need to be optimized. A 2D FLAIR acquisition gives higher image contrast than 3D FLAIR, both between grey and white matter and between lesion and white matter [45]. On the other hand, standard 2D FLAIR sequences typically have thicker slices of 3–5 mm to get sufficient signal-to-noise ratio, as well as interslice gaps. A significant difference between segmentations on 1 mm slices vs. 5 mm slices has been shown [46]. Thus, when interpolation is used to calculate segmentation volumes, volume estimates are less accurate, especially in small tumors [41, 46]. Therefore, a 3D sequence with isotropic voxels and no interslice gap will usually give a more accurate volume estimation and should possibly be part of standard tumor evaluation. Also, magnetic field strength influences image contrast, with higher contrast-to-noise ratio in FLAIR images acquired on 3.0T compared to 1.5T. This leads to small differences in lesion volume and should be taken into consideration [47–50].

When assessing relative volume differences, there are some extreme values, namely 73% (case 20) in segmentation cycles 1 vs. 2 and 55% (case 2) and 45% (case 4) in segmentation cycles 2 vs. 3 (Fig. 2). Case 20 is a left insular glioma considered to have well-defined borders. It was the second tumor segmented in the very beginning, and with more experience the tumor borders were interpreted quite differently and more consistently the second and third time, showing a smaller difference of 19%. Case 2 resides in the right insula and case 4 in the left medial temporal lobe. Both were classified as having “absent” tumor borders, making them difficult to delineate consistently. In addition, case 4 is a rather small tumor, with a mean volume of only 2 mL.

Fig. 2 Three tumors with extreme differences. Case 20 without (a) and with (b) segmentation label, segmentation 1 outlined in red and segmentation 2 in yellow. Case 2 without (c) and with (d) segmentation label, segmentation 2 outlined in red and segmentation 3 in yellow. Case 4 without (e) and with (f) segmentation label, segmentation 2 outlined in red and segmentation 3 in yellow

Our study has limitations. We do not have a gold standard to compare our segmentation results with, which makes it difficult to evaluate whether we are closer to the biological truth or just in more agreement with ourselves. In general, with a small number of participants, it is difficult to draw conclusions from such sub-group analysis.

Conclusion

Our study demonstrates that intra-rater variability can be considerable in LGG volume segmentation, with significant volume difference between segmentation cycles. We did, however, find a decreased intra-rater variability with repeated segmentations measured with DSC, suggestive of an effect of experience. With no exact gold standard for comparison, it is difficult to determine what drives this effect: whether the operator is actually getting closer to the true volume, is only reaching a better internal agreement with him-/herself, or a combination of the two. Furthermore, we found that there was a significantly lower variability in segmentation of LGGs with well-defined tumor borders. This study shows that some criteria defining tumor borders and progression in 3D volumetric segmentation are needed if moving from 2D to 3D volume assessment of LGGs.