Introduction

Somatostatin analogues (SAs) are widely used in the medical treatment of acromegaly patients with growth hormone (GH)-secreting pituitary adenoma [1, 2]. Nonetheless, they cannot ensure biochemical control of the disease, improvement of clinical symptoms, and tumour shrinkage in all patients [1]. Several predictors of SA response, including clinical, biochemical, radiological, and histopathological features, and somatostatin receptors, have been described [1]. Prediction of response to SAs is essential because of their high cost. Magnetic resonance imaging (MRI) is the imaging method of choice for pituitary adenomas [3], and hypointensity on T2-weighted images may signify a better response to SA treatment [4,5,6,7].

Quantitative texture analysis (qTA) evaluates lesion patterns that may not be visually perceptible [8]. Machine learning (ML) covers a broad range of advanced statistical algorithms used to build autonomous predictive models from training data. One study has reported the qTA of GH-secreting pituitary macroadenomas for predicting response to SAs [9]; however, it used only first-order histogram analysis and included no validation.

The purpose of this study was to investigate the potential value of ML-based high-dimensional qTA on T2-weighted MRI in predicting the response of GH-secreting pituitary macroadenomas to SAs, and to compare the qTA with relative signal intensity (rSI) and immunohistochemical granulation pattern evaluation, which may be related to response to SAs.

Materials and methods

Ethics

This retrospective study was approved by our institutional review board. The requirement for informed consent was waived.

Patients

We reviewed our institutional databases for acromegaly patients between January 2009 and December 2017. Our inclusion criteria were as follows: (i) patients with biochemical acromegaly diagnosis based on age-adjusted serum insulin-like growth factor-1 (IGF-1) level and GH level (GH nadir > 1 μg/L) following oral glucose tolerance test; (ii) patients with no biochemical remission (GH level > 1 μg/L or elevated age-adjusted IGF-1) 3 months after surgery; (iii) patients with histopathologically confirmed GH-secreting macroadenoma (≥ 10 mm); and (iv) patients with preoperative and pretreatment (with SA) pituitary MRI including coronal T2-weighted sequences performed in our institution. Our exclusion criteria were as follows: (i) semi-solid macroadenomas with a solid component having a maximum diameter less than 10 mm (some texture features require a sufficient volume or area) and (ii) patients with pituitary apoplexy (to avoid possible distortion in texture feature parameters).

MRI technique

MRI was performed using a 1.5-T unit (Siemens, MAGNETOM Avanto). We used only turbo spin-echo T2-weighted coronal images, which are standard [5, 6, 9]. The settings were as follows: TR, 2090 ms; TE, 104 ms; echo train length or turbo factor, 24; slice thickness, 2.5 mm; slice spacing, 2.8 mm; field of view, 180 × 180 mm; and matrix size, 224 × 320, allowing a pixel size from 0.5 to 0.8 mm.

Image processing

The most important steps of the ML-based qTA are summarised in Fig. 1.

Fig. 1
Simplified flowchart showing the machine learning-based quantitative texture analysis pipeline. LoG, Laplacian of Gaussian; GLDM, grey-level dependence matrix; GLCM, grey-level co-occurrence matrix; GLRLM, grey-level run-length matrix; GLSZM, grey-level size zone matrix; NGTDM, neighbouring grey-tone difference matrix; CV, cross-validation

T2-weighted images underwent N4 bias field correction to remove low-frequency intensity non-uniformity [10].

To minimise differences, all data sets were normalised by centring the voxel intensity values at the mean and scaling them by the standard deviation (SD), known as the ± 3 sigma technique [11]. Image normalisation was done for all grey-level values in the image, not just for the segmentation. Normalisation was based on the formula:

$$ \mathrm{f}(x)=\frac{x-\mu (x)}{\sigma (x)} $$

where f(x) is normalised image intensity, x is original image intensity, μ(x) is mean image intensity value, and σ(x) is the SD of the image intensity.
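The normalisation formula above can be illustrated with a minimal sketch. The function name `zscore_normalise`, the sample intensities, and the use of the population SD are assumptions made for illustration only; the study applied this step to whole 3D images inside the software.

```python
from statistics import fmean, pstdev


def zscore_normalise(intensities):
    """Apply f(x) = (x - mu) / sigma to a flat list of voxel intensities."""
    mu = fmean(intensities)
    # Assumption: population SD; the source does not say which estimator was used.
    sigma = pstdev(intensities, mu)
    if sigma == 0:
        raise ValueError("constant image: standard deviation is zero")
    return [(x - mu) / sigma for x in intensities]


# Hypothetical voxel intensities, not taken from the study.
voxels = [100.0, 120.0, 140.0, 160.0, 180.0]
normalised = zscore_normalise(voxels)
```

After this step the intensities have zero mean and unit SD, so nearly all values of a roughly bell-shaped intensity distribution fall between about -3 and +3.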

Pixel spacing in all image slices was rescaled to an in-plane resolution of 1 × 1 mm² because the comparison of texture features necessitates identical spatial resolution [12]. Slice thickness was not rescaled because it was homogeneous.

The grey-level discretisation was done in the matrix representation of the grey levels in the segmentation, leaving the voxels outside segmentation unchanged. The discretisation was based on the following mathematical formula:

$$ X_{\mathrm{b},i}=\left[\frac{X_{\mathrm{gl},i}}{W}\right]-\left[\frac{\min \left(X_{\mathrm{gl},i}\right)}{W}\right]+1 $$

where Xb,i is grey-level intensity after discretisation, Xgl,i is grey-level intensity before discretisation, and W is the bin-width value, which was 0.06 in this study, corresponding to a maximum of 100 discrete grey levels.
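A minimal sketch of the fixed bin-width discretisation is given below. It assumes that the square brackets in the formula denote the floor operation and that the minimum is taken over the segmented voxels; the function name `discretise` and the sample intensities are hypothetical.

```python
import math


def discretise(values, bin_width):
    """Map intensities to bin indices starting at 1, using a fixed bin width."""
    # Assumption: the brackets in the formula are read as floor().
    lowest_bin = math.floor(min(values) / bin_width)
    return [math.floor(v / bin_width) - lowest_bin + 1 for v in values]


# Hypothetical normalised intensities inside a segmentation.
roi = [-1.0, -0.95, 0.0, 1.0]
bins = discretise(roi, bin_width=0.06)
print(bins)  # [1, 2, 18, 34]
```

With a bin width of 0.06, an intensity range of about six SDs (roughly -3 to +3 after normalisation) would span about 6 / 0.06 = 100 bins, which is one way to read the stated maximum of 100 grey levels.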

The N4 bias field correction, pixel resampling, normalisation, and discretisation were done with 3D data.

Except for the N4 bias field correction, all other processing steps (normalisation, resampling, and discretisation) were performed immediately before texture feature extraction within the same software module, so they did not affect the segmentation process through changes such as blurring of the image.

Texture feature extraction

Texture features were extracted using the ‘SlicerRadiomics’ extension (Revision 8e5f1e8) of the 3D-Slicer software (version 4.8.1), which is based on the Python package ‘PyRadiomics’ [13]. The macroadenomas were independently segmented slice-by-slice (3D whole tumour segmentation) by two radiologists. To avoid partial volume effects, segmentation excluded the peripheral tumour tissue within 1 mm of the visible lesion contour, as well as the most anterior and posterior slices that included the lesion (Fig. 2). Although the lesions were segmented slice-by-slice, we forced the software package to perform the analysis only in the coronal plane because of the anisotropy of the coronal T2-weighted images (voxel size = 1 × 1 × 2.5 mm³; slice spacing = 2.8 mm). The reason behind the slice-by-slice (3D) segmentation was to provide enough two-dimensional (2D) texture data by increasing the 2D segmentation area. 2D texture features using the entire tumour volume were extracted from the original, filtered, and wavelet-transformed images. A Laplacian of Gaussian (LoG) filter was used for image filtration with values of 2 mm, 4 mm, and 6 mm (representing fine, medium, and coarse patterns). Of note, the LoG filtering and wavelet transformation were applied to the 3D volumetric data. The total number of features extracted was 828 per lesion. Detailed texture feature groups are presented in Online Supplement Part E1.

Fig. 2
Three-dimensional (3D) whole tumour segmentation for quantitative texture analysis and 3D segmentation-based quantitative relative signal intensity evaluation. (a) A hyperintense macroadenoma with small patchy hypointense foci in the coronal T2-weighted image. (b, c) Slice-by-slice segmentation and 3D modelling of the macroadenoma

Dimension reduction

Two radiologists, blinded to the response status, independently segmented tumours slice-by-slice (3D whole tumour segmentation). Intra-class correlation coefficient (ICC) values were calculated for each texture feature using SPSS version 20. The features with an ICC value of ≥ 0.8 indicating ‘excellent’ reproducibility were included in the further analysis.
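The reproducibility filter can be sketched as follows. The source does not state which ICC form was computed in SPSS, so the two-way random-effects, absolute-agreement, single-measure form, often written ICC(2,1), is an assumption here; the function name `icc_2_1`, the feature names, and all values are hypothetical.

```python
from statistics import fmean


def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per lesion and one column per reader."""
    n, k = len(ratings), len(ratings[0])
    grand = fmean(v for row in ratings for v in row)
    row_means = [fmean(row) for row in ratings]
    col_means = [fmean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-lesion mean square
    msc = ss_cols / (k - 1)                 # between-reader mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical feature values: each row is [reader 1, reader 2] for one lesion.
features = {
    "feat_a": [[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2]],
    "feat_b": [[1.0, 3.0], [2.0, 1.0], [3.0, 2.5], [4.0, 1.5]],
}
kept = [name for name, table in features.items() if icc_2_1(table) >= 0.8]
```

In this toy example only `feat_a` passes the ICC ≥ 0.8 cutoff, mirroring how features with poor agreement between the two segmentations would be dropped before feature selection.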

The wrapper-based classifier-specific feature selection and model optimisation were performed using WEKA toolkit version 3.8.2 (University of Waikato) [14, 15]. A nested cross-validation method with 10-fold inner and 10-fold outer loops was adopted (Fig. 3) [16, 17]. Details regarding the feature selection are presented in Online Supplement Part E2.

Fig. 3
Nested cross-validation with a 10-fold inner loop and a 10-fold outer loop. For each outer fold, the inner loop runs a 10-fold cross-validation. The texture features selected in at least two cross-validations of the inner loop move to the outer fold. The 10 folds of the outer loop correspond to the regular 10-fold cross-validation used in model development and validation, whereas the 10 folds of the inner loop correspond to the actual feature selection process. Hence, this process creates ten different combinations of training and validation splits. T, training; V, validation; CV, cross-validation
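The structure of the nested cross-validation can be sketched with plain index splitting. This is not the WEKA procedure used in the study: the helper names, the random seeds, the unstratified shuffling, and the reading of the two-vote rule as "selected in at least two inner folds" are all assumptions made for illustration.

```python
import random
from collections import Counter


def k_fold_indices(indices, k, seed):
    """Return k (training, validation) index pairs after a seeded shuffle."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [(sorted(set(shuffled) - set(fold)), sorted(fold)) for fold in folds]


def nested_splits(n_samples, outer_k=10, inner_k=10, seed=0):
    """Yield (outer_train, outer_val, inner_splits) for every outer fold."""
    outer = k_fold_indices(range(n_samples), outer_k, seed)
    for fold_no, (outer_train, outer_val) in enumerate(outer):
        # Inner folds are drawn only from the outer training part, so the
        # outer validation cases never take part in feature selection.
        inner = k_fold_indices(outer_train, inner_k, seed + 1 + fold_no)
        yield outer_train, outer_val, inner


def vote_features(selected_per_inner_fold, min_votes=2):
    """Keep features chosen in at least min_votes inner folds (assumed rule)."""
    counts = Counter(f for fold in selected_per_inner_fold for f in set(fold))
    return sorted(f for f, c in counts.items() if c >= min_votes)


splits = list(nested_splits(47))  # 47 patients, as in this study
```

The key property is that each outer validation fold stays untouched by the inner feature-selection loop, which is what limits the optimistic bias discussed later in the text.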

Relative signal intensity evaluation

The rSI was evaluated qualitatively (visual) and quantitatively (with 2D region of interest (ROI) and 3D whole tumour segmentation).

The rSI of the adenoma was classified as follows: (i) hypointense (equal to or lower than the white matter of the temporal lobe); (ii) hyperintense (equal to or higher than the grey matter); and (iii) isointense (between white and grey matter) [5]. For statistical analysis, the adenomas were further grouped as (i) T2-hypointense versus (ii) others.
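The three-way rule and the subsequent binary grouping can be written as a small sketch. The function names and the intensity values are hypothetical, and the sketch assumes that temporal white matter has a lower T2 signal than grey matter.

```python
def classify_rsi(adenoma, white_matter, grey_matter):
    """Classify adenoma signal relative to temporal white and grey matter."""
    if adenoma <= white_matter:
        return "hypointense"
    if adenoma >= grey_matter:
        return "hyperintense"
    return "isointense"


def binary_group(label):
    """Collapse the three classes into the two groups used for statistics."""
    return "T2-hypointense" if label == "hypointense" else "other"


# Hypothetical mean signal intensities (arbitrary units).
label = classify_rsi(adenoma=150.0, white_matter=180.0, grey_matter=260.0)
group = binary_group(label)
```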

For ROI-based quantitative rSI, the mean signal intensity was measured on two consecutive coronal T2-weighted images from the largest solid portion of the adenoma and from the white and grey matter of the temporal lobe (Fig. 4) [5]. We also used the same 3D segmentation data used in qTA for rSI evaluation to allow comparison.

Fig. 4
Regions of interest (ROIs) used in ROI-based quantitative relative signal intensity evaluation. The ROIs are placed on the largest solid portion of the adenoma (yellow), temporal white matter (green), and temporal grey matter (blue). Please note that ROIs are drawn for two consecutive slices

The qualitative (visual) rSI evaluation was done by two radiologists. In case of disagreement, the final decision was reached by consensus.

Immunohistochemical evaluation

Based on the staining characteristics using monoclonal cytokeratin antibody, the macroadenomas were divided into three groups as follows: (i) densely; (ii) transitionally; and (iii) sparsely granulated [18]. Because sparsely granulated adenomas are considered to have a poor SA response [1, 18], the final groups were (i) sparsely granulated and (ii) the others.

Response and resistance criteria

The reference standard was biochemical response to SA treatment. Three months after surgery, SA treatment was initiated for patients with a GH level > 1 μg/L or elevated age-adjusted IGF-1. Patients were considered resistant if GH or age-adjusted IGF-1 levels were still elevated after 6 months of therapy with octreotide (40 mg per 28 days) or lanreotide (120 mg per 28 days).

Statistical analysis

The ML-based classifications were performed using WEKA toolkit version 3.8.2. The k-nearest neighbours (k-NN) classifier (IBk in WEKA toolkit) was utilised in qTA-based classifications [19]. The search algorithm for k-NN was linear with a Euclidean distance function. To minimise potential over-fitting, we created models with five-nearest neighbours (5-NN). The C4.5 decision tree classifier was utilised in rSI-based classifications. In the WEKA toolkit, the C4.5 algorithm is represented with J48 [20]. The C4.5 (or the J48 in WEKA) is a simple ML scheme. We used this classifier for a binary classification problem (presence or absence of the response to SAs). The primary purpose of using this algorithm was to obtain performance metrics from the same software that were comparable with those of the k-NN.
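For readers unfamiliar with the classifier, a minimal sketch of a five-nearest-neighbour vote with a linear search and Euclidean distance follows. It is not the WEKA IBk implementation; the function name `knn_predict`, the two-feature toy data, and the tie handling are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training cases closest to the query."""
    # Linear search: compute the distance to every training case and sort.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical normalised texture feature vectors (two features per lesion).
train_x = [(0.1, 0.2), (0.2, 0.1), (0.2, 0.3), (0.8, 0.9), (0.9, 0.8), (0.7, 0.8)]
train_y = ["responsive"] * 3 + ["resistant"] * 3

prediction = knn_predict(train_x, train_y, query=(0.25, 0.2))
```

Using five neighbours rather than one makes a single atypical training case less likely to decide the prediction, which is the over-fitting concern mentioned above.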

For qTA, a 10-fold cross-validation procedure was adopted for the validation of the model, calculating performance metrics by averaging these ten different validation performances. On the other hand, the models for the other methods were created on the whole data, resulting in a single performance metric.

The main performance evaluation metric was the area under the receiver operating characteristic curve (AUC-ROC) [21]. In addition, sensitivity, specificity, precision (positive predictive value), recall, F-measure, the Matthews correlation coefficient, and the area under the precision-recall curve were calculated as well. Comparisons of the AUC-ROCs derived from the qTA (10-fold cross-validated), qualitative and quantitative rSI, and immunohistochemical evaluations (single AUC-ROC value for each method) were performed using the one-sample Wilcoxon signed-ranks test [22].

The Shapiro-Wilk test was used to assess normality of distribution. The difference in mean signal intensity between the 2D ROI and 3D segmentation data was analysed using the paired t test.

Cohen’s kappa (k) was calculated to determine the strength of agreement between the two observers’ judgments on qualitative rSI evaluation. Interobserver agreement was judged according to the following rating: 0.00–0.20 = slight; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = substantial; and 0.81–1.00 = excellent.
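The kappa statistic and the rating scale above can be sketched as follows. The function names and the two observers' labels are hypothetical; only the rating thresholds come from the text.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    # Agreement expected by chance from the raters' marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Translate kappa into the verbal rating used in this study."""
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("rating scale is defined for 0.00 to 1.00 only")
    scale = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "excellent")]
    return next(label for upper, label in scale if kappa <= upper)


# Hypothetical visual rSI calls by two observers for ten adenomas.
obs_1 = ["hypo", "hypo", "other", "other", "other", "hypo", "other", "other", "hypo", "other"]
obs_2 = ["hypo", "other", "other", "other", "other", "hypo", "other", "hypo", "hypo", "other"]
kappa = cohens_kappa(obs_1, obs_2)
rating = agreement_label(kappa)
```

Here the observers agree on 8 of 10 cases, chance agreement is 0.52, and kappa is about 0.58, which falls in the moderate band.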

Results

Patient demographics and characteristics

Forty-seven patients with acromegaly and histopathologically proven GH-secreting macroadenoma were included in the analysis. The patient demographics and characteristics are presented in Table 1.

Table 1 Patient characteristics and demographics

3D segmentation and ROI characteristics

The mean (SD) 3D segmentation volume was 4734.1 (10880) mm³, the mean (SD) maximum segmentation diameter was 24.3 (12.2) mm, and the mean (SD) ROI area was 95.23 (114.02) mm².

The mean (SD) signal intensity was 271.85 (78.05) for the ROI-based measurement and 191.02 (57.49) for the 3D segmentation-based measurement. The difference was statistically significant (p < 0.05).

Reproducibility analysis

Following reproducibility analysis by two radiologists, 293 out of 828 features were excluded based on the predefined ICC cutoff value (ICC < 0.8). The remaining 535 were included in further analysis.

Wrapper-based feature selection

In the initial run, the wrapper-based classifier-specific feature selection algorithm yielded 12 texture features. In the following runs, the number of texture features decreased to four (Online Supplement Part E3). Distributions of the selected feature values between responsive and resistant groups are presented in Figs. 5 and 6.

Fig. 5
Smoothed heat map created using the selected subset of texture features from all patients in the study. The map shows the distribution of normalised (0 to 1) texture feature values between responsive and resistant groups. Changes in colours and their shades indicate a difference in texture feature values in and between groups. TexF1, grey-level co-occurrence matrix (GLCM) Idn (inverse difference normalised) in the image with a LoG filter of 2 mm; TexF2, the first-order maximum in the image with a LoG filter of 6 mm; TexF3, the first-order median in the image with wavelet energy in low/high-frequency bands; TexF4, neighbouring grey-tone difference matrix (NGTDM) coarseness in the original image

Fig. 6
(a) Deviation plot created with normalised values (0 to 1) of texture features showing the degree of overlap between responsive and resistant groups. Significant overlap is visually apparent in TexF2. Please note that although TexF2 has significant overlap, it still makes a positive contribution to the model’s predictive accuracy. (b) A three-dimensional (3D) scatter plot created using least overlapping features with their normalised values (0 to 1) shows the individual place of the features in 3D space. Blue circles, responsive group; black circles, resistant group; TexF1, grey-level co-occurrence matrix (GLCM) Idn (inverse difference normalised) in the image with a LoG filter of 2 mm; TexF2, the first-order maximum in the image with a LoG filter of 6 mm; TexF3, the first-order median in the image with wavelet energy in low/high-frequency bands; TexF4, neighbouring grey-tone difference matrix (NGTDM) coarseness in the original image

qTA-based classification

Using the selected features, the k-NN algorithm correctly classified 85.1% (40 out of 47) of the patients regarding response status to SAs with an AUC-ROC value of 0.847. Each AUC-ROC value in the 10-fold cross-validation is presented in Table 2. For detecting the responsive group, the sensitivity, specificity, and precision (or positive predictive value) were 87.5%, 82.6%, and 84%, respectively. For detecting the resistant group, the sensitivity, specificity, and precision were 82.6%, 87.5%, and 86.4%, respectively.
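The relation between these percentages can be checked with a short sketch. The confusion-matrix counts below (21 and 3 for the responsive group, 19 and 4 for the resistant group) are not stated in the text; they are inferred as the only whole-number split of 47 patients that we could find matching the reported percentages, and should be treated as an assumption. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard metrics for one class treated as the positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": accuracy, "sensitivity": sensitivity,
        "specificity": specificity, "precision": precision,
        "f_measure": f_measure, "mcc": mcc,
    }


# Inferred counts (assumption): responsive group as the positive class.
responsive = binary_metrics(tp=21, fn=3, tn=19, fp=4)
# The same table read with the resistant group as the positive class.
resistant = binary_metrics(tp=19, fn=4, tn=21, fp=3)
```

Under these counts, the sensitivity for one group equals the specificity for the other, which is why the two sets of figures in the text mirror each other.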

Table 2 Each area under the receiver operating characteristic curve (AUC-ROC) value in the 10-fold cross-validation of the machine learning-based quantitative texture analysis

Quantitative rSI-based classification

In the 2D ROI-based rSI evaluation of the macroadenomas (10 macroadenomas T2-hypointense; 37 T2-isointense or hyperintense), the C4.5 correctly classified 57.4% (27 out of 47) of the macroadenomas regarding response status with an AUC-ROC of 0.581.

In the 3D segmentation-based rSI evaluation of the macroadenomas (22 macroadenomas T2-hypointense; 25 T2-isointense or hyperintense), the C4.5 correctly classified 57.4% (27 out of 47) of the macroadenomas regarding response status with an AUC-ROC of 0.575.

Qualitative (visual) rSI-based classification

Interobserver agreement between two observers was substantial (kappa (k) coefficient = 0.651).

Using visual rSI method and consensus data (17 macroadenomas T2-hypointense; 30 T2-isointense or hyperintense), the C4.5 correctly classified 59.6% (28 out of 47) of the macroadenomas regarding response status with an AUC-ROC of 0.599.

Granulation pattern-based classification

Based on the immunohistochemical granulation pattern (27 macroadenomas sparsely granulated; 20 densely or transitionally granulated), the C4.5 correctly classified 70.2% (33 out of 47) of the macroadenomas regarding response status with an AUC-ROC of 0.704.

qTA versus other methods

Considering the AUC-ROC performance metric (10-fold cross-validation values for qTA; single value for the other methods), there were significant differences between (i) qTA and 2D ROI-based quantitative rSI evaluation (z = 2.8; p < 0.05); (ii) qTA and 3D segmentation-based quantitative rSI evaluation (z = 2.8; p < 0.05); (iii) qTA and qualitative (visual) rSI evaluation (z = 2.8; p < 0.05); and (iv) qTA and granulation pattern-based evaluation (z = 2.8; p < 0.05).
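A one-sample Wilcoxon signed-rank comparison of ten fold-wise values against a single reference value can be sketched as below. The fold AUC-ROC values are hypothetical (the actual values are those of Table 2), and the normal approximation without continuity or tie correction is an assumption about the computation, not a description of the software used.

```python
import math


def wilcoxon_signed_rank_z(values, reference):
    """Normal-approximation z for a one-sample Wilcoxon signed-rank test."""
    diffs = [v - reference for v in values if v != reference]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Group tied absolute differences and give them their average rank.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


# Hypothetical fold-wise AUC-ROC values, all above the reference value.
fold_aucs = [0.75, 0.80, 0.85, 0.90, 0.95, 0.78, 0.83, 0.88, 0.92, 1.00]
z = wilcoxon_signed_rank_z(fold_aucs, reference=0.704)
```

When all ten differences share the same sign, the positive rank sum is 55 and this approximation gives z of about 2.8 regardless of the magnitudes, which may explain why the same z value appears in all four comparisons.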

Table 3 presents the performance metrics of all the methods.

Table 3 Performance of quantitative texture analysis, quantitative relative signal intensity evaluation (ROI-based and 3D segmentation-based), qualitative (visual) relative signal intensity evaluation, and immunohistochemical granulation pattern-based evaluation in predicting response to somatostatin analogues

Discussion

The most important finding was that the k-NN classifier correctly classified more than four fifths of the macroadenomas. The predictive performance of the ML-based qTA was better than that of quantitative and qualitative rSI, and immunohistochemical granulation pattern evaluation.

The literature suggests that preoperative SA treatment improves surgical outcomes [23,24,25,26]. Resistance to SA treatment may delay surgery and worsen surgical outcomes. Hence, predicting response or resistance with preoperative predictors or biomarkers is important. Non-invasive ML-based qTA might be a useful method for this purpose.

There has been only one study of qTA against rSI in predicting response to SA treatment [9]. The authors used first-order histogram analysis with very few texture features and reported that the overall diagnostic accuracy of the histogram-based model was 82.4% for predicting good response with an AUC-ROC value of 0.861. However, there was no validation. Furthermore, they reported that the predictive performance of the histogram-based method was not different from that of the visual T2-weighted intensity evaluation. In our analysis, none of the first-order features obtained from the original image was selected by the feature selection algorithm. Conversely, some first-order features extracted from filtered or transformed images were selected. Using a higher number of features and internal validation, we found that prediction by qTA was superior to rSI evaluation.

Regarding the definition and interpretation of the selected features for model development, TexF1 corresponds to local homogeneity in finely filtered images. TexF2 corresponds to the maximum signal intensity in coarsely filtered images. TexF3 indicates median signal intensity in low- and high-frequency decomposition images. TexF4 corresponds to spatial intensity changes. According to TexF1 and TexF4, the macroadenomas responsive to SAs were locally more homogeneous in finely filtered images and more non-uniform in the original images.
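To make the notion of local homogeneity behind TexF1 concrete, the sketch below computes a GLCM inverse difference normalised (Idn) value for a tiny 2D image. It uses a single horizontal offset of one pixel and a symmetric matrix, whereas the software used in the study aggregates over several directions on filtered images; the function name `glcm_idn` and the toy images are assumptions.

```python
def glcm_idn(image, n_levels):
    """Idn = sum over (i, j) of p(i, j) / (1 + |i - j| / n_levels).

    image holds discretised grey levels from 1 to n_levels.
    """
    counts = [[0] * n_levels for _ in range(n_levels)]
    for row in image:
        # Count horizontally adjacent pixel pairs in both directions.
        for left, right in zip(row, row[1:]):
            counts[left - 1][right - 1] += 1
            counts[right - 1][left - 1] += 1
    total = sum(map(sum, counts))
    return sum(
        counts[i][j] / total / (1 + abs(i - j) / n_levels)
        for i in range(n_levels)
        for j in range(n_levels)
    )


# Hypothetical two-level images: one uniform, one alternating.
uniform = [[1, 1, 1], [1, 1, 1]]
alternating = [[1, 2, 1], [2, 1, 2]]
idn_uniform = glcm_idn(uniform, n_levels=2)
idn_alternating = glcm_idn(alternating, n_levels=2)
```

The uniform image reaches the maximum value of 1, while the alternating pattern gives 1 / (1 + 1/2), about 0.67, so higher Idn corresponds to neighbouring pixels having more similar grey levels.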

Considering our small patient population, our classifier was evaluated with a complex nested cross-validation approach [27, 28]. It reduces the bias and gives an estimate of the error similar to that of independent validation [28]. Using the whole data set might also have been considered, but it might have led to bias because the same data set would be used for feature selection and model development, also called the ‘double-dipping phenomenon’ [16]. A random split of the data, creating separate training and validation data sets, could mimic external validation, but in such small data sets, chance can strongly affect the results.

Generalisation of these results is subject to several limitations. The number of patients was small considering the numerous texture features. We needed to exclude patients with post-surgical biochemical remission as it could be related to surgery. The risk of over-fitting is an important issue; however, the cross-validation technique was intended to minimise it [16, 17]. In addition, we used five-nearest neighbours (5-NN) for the same purpose. We could have used each 2D segmentation in order to increase the number of labelled data. However, considering the very small size of the tumours, this might have hampered texture analysis. Despite a uniform imaging protocol, slight differences are unavoidable in a retrospective study. We applied N4 bias field correction [10], normalisation [11], discretisation [12], and pixel rescaling [12] to minimise differences. Although one fourth of our patients had preoperative SA treatment, which could be seen as a bias, we only used preoperative and pretreatment MRI studies. The methods shown here can only be applied to GH-secreting macroadenomas and cannot be extrapolated to others.

Conclusions

The results suggest that ML-based qTA on T2-weighted MRI has the potential to predict response to SAs in acromegaly patients with a GH-secreting pituitary macroadenoma, and performs better than quantitative and qualitative T2-weighted rSI, or immunohistochemical granulation pattern evaluation.