Introduction

Pituitary adenomas (PAs) are common intracranial neoplasms, arising from epithelial cells in the anterior pituitary [1]. Approximately half of the tumours are non-functioning pituitary adenomas (NFPAs), not presenting symptoms of hormone overproduction [2]. The main treatment for NFPAs is surgery, with decompression of the optic pathways as the primary indication [3].

A substantial portion of NFPAs regrow after surgery, in particular when residual tumour is present [4]. In most cases, the postoperative follow-up is tailored according to signs of growth, or regrowth, in radiological imaging series [5].

Adenoma size may simply be characterised by its largest diameter in one or more imaging planes [6]. Moreover, tumour volume may be calculated based on diameter measurements or by stereological methods [7, 8]. Only a few series report volumetric measurements of NFPAs [4]. The manual summation of slices (SOS) method of measuring tumour volume, also known as Cavalieri’s method, has been considered the gold standard of tumour volume measurement, but is more time-consuming than diameter-based methods [9,10,11].

The issue of tumour growth is frequently encountered in everyday clinical practice. In this study, we aimed to determine the intra- and inter-rater reliability of volumetric tumour measurements based on a diametric and the SOS approach both before and after resection of NFPAs.

Methods

Twenty-two patients surgically treated for non-functioning pituitary macroadenomas were retrospectively and randomly selected from a pool of patients investigated with serial postoperative tumour volume measurements [12]. Owing to the retrospective study design, the MRI data were collected from scans obtained both at a tertiary hospital and at other referring hospitals. Forty-three scans were performed at 1.5 T and 16 at 1.0 T; one scan was performed at 0.5 T and two at 3 T. Typical acquisition parameters for the T1-weighted scans were: repetition time (TR) = 512 ms, echo time (TE) = 12 ms, field of view (FoV) = 178 mm × 220 mm, pixel size = 0.47 and flip angle = 150°. The majority of scans containing residual tumour tissue (N = 42) had a slice thickness of 3.3 mm, with a distance between slices of 0.3 mm. For the remaining scans (N = 16), the slice thickness varied between 2.0 and 3.85 mm and the distance between slices ranged from 0.0 to 1.65 mm. One scan had a slice thickness of 5 mm; however, this scan was not included in any of the reliability calculations.

All post-scan analyses were done directly in the radiological picture archiving and communication system (PACS) (IDS7, Sectra, Sweden). Volumes were primarily calculated by the SOS method. The tumours were defined and delineated manually by a region of interest (ROI) in all MRI slices where tumour tissue was visible. All ROI areas were summed and multiplied by the distance between the slices. Diametric measurements comprised the largest diameter (width) in the coronal plane, and the two largest perpendicular diameters (height and length) in the sagittal plane. Volume was calculated by the formula width × height × length × 0.5 [8]. Cystic and haemorrhagic tumour components were included in the volume. Tumour fragments were summed in cases where the residual tumour was discontinuous.
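
The two volume calculations described above can be sketched as follows. This is a minimal illustration, not the PACS implementation used in the study: the function names and all numeric values are hypothetical, and the sketch assumes that "distance between the slices" means the centre-to-centre slice spacing (slice thickness plus gap), which the text does not state explicitly.

```python
def sos_volume(roi_areas_mm2, slice_spacing_mm):
    """Summation of slices: sum of the ROI areas times the slice spacing.

    Assumption: slice_spacing_mm is the centre-to-centre distance
    (slice thickness + inter-slice gap).
    """
    return sum(roi_areas_mm2) * slice_spacing_mm


def diametric_volume(width_mm, height_mm, length_mm):
    """Diameter-based volume: width x height x length x 0.5."""
    return width_mm * height_mm * length_mm * 0.5


def total_volume(fragment_volumes_mm3):
    """Discontinuous residual tumour: fragment volumes are summed."""
    return sum(fragment_volumes_mm3)


# Hypothetical ROI areas (mm^2) for five consecutive slices.
areas = [50.0, 120.0, 150.0, 110.0, 40.0]
spacing = 3.3 + 0.3  # hypothetical: 3.3 mm thickness + 0.3 mm gap

v_sos = sos_volume(areas, spacing)            # about 1692 mm^3
v_dia = diametric_volume(15.0, 14.0, 16.0)    # 1680 mm^3
print(f"SOS: {v_sos / 1000:.2f} cm^3, diametric: {v_dia / 1000:.2f} cm^3")
```

The factor 0.5 in the diametric formula is close to the ellipsoid factor π/6 ≈ 0.52, so the two estimates are expected to be similar for roughly ellipsoidal tumours.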

Tumour volumes retrieved from MRI at the preoperative exam (MRI0), the first postoperative exam after submission from hospital (MRI1) and the third postoperative exam (MRI3) were calculated for all patients. A total of 62 MRI scans were investigated, of which 58 were evaluated by one of the readers to have tumour tissue available for volume assessment. Four patients lacked preoperative MRI scans. The median (range) time intervals between MRI and surgery (defined as time point zero) were −3.6 (−0.1 to −11.9), 3.1 (2.1–10.0) and 35.5 (15.3–70.0) months for MRI0, MRI1 and MRI3, respectively. All tumour volumes were investigated twice by the same reader (KAØ), and once by the second reader (SH).

The intraclass correlation coefficient (ICC) is a measure of reliability that takes into account both the degree of correlation and the agreement between measurements. It depends on the research model used, the type of measurement protocol and the definition of the relationship (consistency or absolute agreement) [13, 14]. A single measurement mixed-effect model was used to calculate the ICC for the intra-rater comparison, and a single measurement random-effect model was used to calculate the ICC for the inter-rater comparison and the comparison between methods. The ICC was given for the absolute agreement of the log-transformed volumes. Based on the lower bound of the 95% CI of the ICC, values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9 and above 0.9 were defined as poor, moderate, good and excellent reliability, respectively [15]. The Bland–Altman plot illustrates the bias and the degree of agreement between the two measures compared, together with the agreement interval (limits of agreement) within which 95% of the differences between the first and second measurement fall [16]. All analyses were performed in SPSS software version 24.
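
The statistics described above can be illustrated with a short sketch. It is not the SPSS procedure used in the study: the function names and the example data are hypothetical, the ICC shown is only the point estimate of the single-measurement, absolute-agreement form (for which the two-way random and two-way mixed models share the same formula), and the 95% CI needed for the reliability classification is assumed to come from the statistics package.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Bias and 95% limits of agreement for paired (log-transformed) volumes."""
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def icc_absolute_single(ratings):
    """Two-way ICC, absolute agreement, single measurement (point estimate).

    ratings: one row per tumour, one column per measurement (k columns).
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-measurement mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def classify(lower_ci_bound):
    """Reliability category from the lower bound of the 95% CI of the ICC."""
    if lower_ci_bound < 0.5:
        return "poor"
    if lower_ci_bound < 0.75:
        return "moderate"
    if lower_ci_bound < 0.9:
        return "good"
    return "excellent"


# Hypothetical log-volumes from two measurement sessions.
first = [1.0, 2.0, 3.0, 2.5]
second = [0.9, 2.1, 2.8, 2.5]
print(bland_altman(first, second))
print(icc_absolute_single([list(pair) for pair in zip(first, second)]))
print(classify(0.978))
```

Because the ICC is defined for absolute agreement, a constant offset between two measurement sessions lowers the coefficient even when the measurements are perfectly correlated.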

Results

Of 22 patients, four and five patients were considered by reader 1 to not have residual tumour at MRI1 in the first and second measurement, respectively. Four patients were considered not to have residual tumour at MRI1 by reader 2. Readers 1 and 2 agreed on the complete absence of residual tumour in three of these patients. These three patients were not included in the reliability analyses.

Preoperative MRI (MRI0)

The intra-rater reliability for both the SOS and the diametric method for the preoperative investigation was excellent (ICC: 0.996 (95% CI: 0.978–0.999) and ICC: 0.990 (95% CI: 0.973–0.996), respectively). The inter-rater reliability gave similar results for the two methods (ICC: 0.982 (95% CI: 0.824–0.995) and ICC: 0.967 (95% CI: 0.820–0.990), respectively), though with slightly wider 95% limits of agreement (Fig. 1, Column 1).

Fig. 1

Bland–Altman plots showing intra-rater, inter-rater and inter-method variation. The left column shows the preoperative MRI scans (MRI0), the middle column shows the first postoperative MRI scan (MRI1) and the right column shows the third postoperative MRI scan (MRI3). • Row A: Intra-rater variability for summation of slices (SOS) volume measurements. • Row B: Intra-rater variability for diametric volume measurements. • Row C: Inter-rater variability for SOS volume measurements. • Row D: Inter-rater variability for diametric volume measurements. • Row E: Variability between SOS and diametric volume measurements performed by the same reader at the same time point. In all Bland–Altman plots, the X-axis shows the mean of the two log-transformed volumes, while the Y-axis shows the difference between the two measurements. Upper and lower 95% limits of agreement (LoA) are given for all Bland–Altman plots. The dashed lines show a log-transformed volume difference of 0, while the solid lines show the mean difference of the measured volumes for each comparison. The intraclass correlation coefficient (ICC) with 95% confidence interval is given for all comparisons. A single measurement mixed-effect model was used to calculate the ICC for the intra-rater comparison, while a single measurement random-effect model was used to calculate the ICC for the inter-rater and inter-method comparisons [13, 14]

First postoperative MRI (MRI1)

For the intra-rater comparison, MRI1 was the investigation with the lowest reliability for both the SOS and the diametric method (ICC: 0.872 (95% CI: 0.694–0.950) and ICC: 0.791 (95% CI: 0.343–0.929), respectively). The inter-rater reliability was also lower for both methods at this time point (ICC: 0.792 (95% CI: 0.512–0.921) and ICC: 0.810 (95% CI: 0.540–0.929), respectively) than for the preoperative MRI, with wide 95% limits of agreement (Fig. 1, Column 2).

Third postoperative MRI (MRI3)

The intra-rater reliability was good for the third postoperative MRI, for both measurement methods (ICC: 0.961 (95% CI: 0.897–0.985) and ICC: 0.962 (95% CI: 0.897–0.985), respectively), though with wider 95% limits of agreement than at MRI0 (Fig. 1, Column 3).

For the inter-rater comparison, this was the least reliable investigation for both methods (ICC: 0.759 (95% CI: 0.434–0.909) and ICC: 0.703 (95% CI: 0.348–0.884), respectively), with wide 95% limits of agreement (Fig. 1, Column 3).

Reliability according to method

The SOS method showed approximately equal ICC for most comparisons, with slightly narrower 95% limits of agreement (Fig. 1, Row A vs B, and Row C vs D). An exception was the inter-rater comparison for MRI1, which showed a slightly lower ICC for the SOS method than for the diametric method (ICC: 0.792 (95% CI: 0.512–0.921) and ICC: 0.810 (95% CI: 0.540–0.929), respectively). The reliability between the methods when performed during the same investigation by the same reader was excellent for all three time points (ICC: 0.988 (95% CI: 0.969–0.996), ICC: 0.945 (95% CI: 0.856–0.980) and ICC: 0.962 (95% CI: 0.901–0.986) for MRI0, MRI1 and MRI3, respectively) (Fig. 1, Row E). For the log-transformed volumes at MRI0, MRI1 and MRI3, the mean differences between the diametric and the SOS method were −0.021, −0.01 and −0.003, respectively. This suggests that the SOS method provides slightly larger volume estimates than the diametric method.
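
To make the size of these mean differences more tangible, a difference of log-transformed volumes can be back-transformed into a ratio of volumes. The sketch below assumes that the natural logarithm was used, which the text does not state; with a base-10 logarithm the ratios would be 10 raised to the difference instead. The function name is hypothetical.

```python
import math


def log_difference_to_ratio(mean_log_diff):
    """Back-transform a mean difference of natural-log volumes to a ratio.

    Assumption: volumes were transformed with the natural logarithm, so
    log(diametric) - log(SOS) = log(diametric / SOS).
    """
    return math.exp(mean_log_diff)


# Reported mean differences (diametric minus SOS) at MRI0, MRI1 and MRI3.
for label, diff in [("MRI0", -0.021), ("MRI1", -0.01), ("MRI3", -0.003)]:
    ratio = log_difference_to_ratio(diff)
    print(f"{label}: diametric / SOS = {ratio:.3f}")
```

Under this assumption, all three ratios are slightly below 1, which corresponds to diametric estimates that are on average up to about 2% smaller than the SOS estimates.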

Discussion

Preoperatively, the volume assessments showed satisfactory intra- and inter-rater reliability for both SOS and diametric volume measurements. The early postoperative volume measurements (3 months) had the lowest reliability for the intra-rater comparisons, while the third postoperative volume measurements demonstrated the poorest reliability for the inter-rater comparisons. The SOS method provided approximately equal or slightly higher intra- and inter-rater reliability than the diameter-based method for most volume comparisons. The agreement between the SOS and the diametric method was high.

Most studies reporting intra- or inter-rater comparisons of volume measurements in NFPAs have investigated preoperative MRI scans, or do not distinguish between pre- and postoperative investigations [17,18,19]. Monsalves et al. reported the pre- and postoperative inter-observer consistency for an SOS-related method in pituitary adenomas [20]. They found the preoperative investigations to be more consistent than the postoperative, though both showed high consistency [20]. However, their consistency values are not directly comparable to the absolute agreement values used in the present ICC calculation, as they compared average scores in a group and not the scores of each subject [14]. Our results thus add to these previous reports by showing that both the SOS and diametric methods are reliable for volumetric tumour analysis at preoperative MRI investigations.

The intra-rater reliability of volumetric measurements was poorer for both measurement methods at the first postoperative MRI compared to the later postoperative MRI in our study. The blood and secretions from surgery are resorbed during the first 2–3 months after surgery; hence, it is typically advised that the first postoperative MRI scan be carried out after such early postoperative changes have subsided [21]. However, some investigators have found postoperative volume assessments in other intracranial tumours to have robust intra- and inter-rater reliability, though with semi-automated methods [22]. The literature on the precision of volume measurements for postoperative investigations in NFPAs is sparse. The fact that there was disagreement about the presence of residual tumour in some cases, both between the readers and within the same reader, indicates that interpretation of this early postoperative MRI scan is challenging. The ICC for the diametric method at MRI1 had a wider CI than the other intra-rater comparisons. This variation might have been caused by the limited number of tumours in the analysis (Fig. 1, Row B, Column 2).

A substantial portion of NFPAs regrow postoperatively [4, 12]. We therefore assumed that the tumours would be easier to delineate and the reliability better at MRI3 than at MRI1. This was the case for the intra-rater comparison for both the SOS and the diametric method. However, this was not the case for the inter-rater comparison (Fig. 1, Column 3). Monsalves et al. found the same pattern when reporting inter-rater reliability in PAs measured before and after surgery [20]. In our study, each reader investigated MRI scans from the three time points (MRI0, MRI1 and MRI3) serially, and the tumour delineation at MRI1 therefore probably affected the delineation at MRI3.

The SOS method has been shown to have less retest error than other measures of size [10, 23], and in our study, its repeatability (ICC) was slightly higher than that of the diametric method. For the intra-rater comparison, the SOS method demonstrated narrower 95% limits of agreement than the diametric method. However, the ICC scores were quite similar for most analyses. The SOS method was superior, or approximately equal, to the diametric approach for the inter-rater comparisons, though the postoperative analyses appeared to be challenging. Our results demonstrated good agreement between the methods within the same reader during the same investigation session. The diametric method could therefore be used for serial investigations performed by the same reader when detection of tumour volume change is the main issue; however, the optimal method with regard to reliability seems to be the SOS.

Limitations

The study design was retrospective and hence lacked standardisation of the modalities and timing of the MRI scans, which might have introduced bias in the results. The lack of tumour tissue in four to five of the first postoperative MRI scans (depending on the reader and the measurement session) reduced the cohort size and thus the accuracy of the reliability estimation. The variation of the measurements was greater on the postoperative MRI scans than on the preoperative MRI scans, and hence a larger number of subjects would have improved the precision of the estimates. All measurements by reader 1 were done within a time span of 2 months. There was, however, a shorter period between the two measurements of MRI0 than between the two measurements of MRI1 and MRI3, which could possibly have affected the intra-rater comparisons. Nevertheless, all image annotations from the first measurements were erased before the second measurements were started.

Conclusion

Non-functioning pituitary adenoma volume measurements were highly reliable in the preoperative setting when assessed by both the SOS and the diametric approach. The reliability of both methods was poorer for the postoperative measurements, particularly at the first postoperative scan, suggesting that these investigations should be interpreted with caution. The SOS method showed equal, or slightly better, intra- and inter-rater reliability than the diametric volume measurements for most comparisons. Nevertheless, the agreement between the methods, when performed by the same reader, was good.