Introduction

Uterine leiomyomas (LM) are very common, observed in nearly 40 % of reproductive-age women [1, 2]. Whereas traditionally symptomatic LM were treated with hysterectomy or myomectomy, currently they may be managed with standard surgery or minimally-/non-invasive methods such as uterine artery embolization, MRI-guided high-intensity focused ultrasound, or hormonal therapy [3]. While the latter techniques offer obvious benefits of faster recovery and organ-preservation, they do not provide histopathologic confirmation and could therefore allow leiomyosarcoma (LMS) to go unrecognized. LMS are rare, highly aggressive tumours that require prompt radical resection [4–6]. Unfortunately, distinguishing LMS from LM clinically is often impossible due to the similarity of the symptoms and laboratory data they produce [4–6].

MRI is the best imaging modality for assessing uterine masses before intervention [7]. Most LM are easily recognized as well-circumscribed, homogeneous low-T2-signal-intensity (SI) masses with avid post-contrast enhancement [7]. However, the appearance of LM may be affected by degeneration, oedema, and/or unusual patterns of growth, making their differentiation from LMS difficult [8, 9]. Conflicting results from prior studies have led to uncertainty regarding the value of qualitative MRI features for differentiating LMS from unusual/atypical LM (ALM) [9–18]. These studies had small numbers of patients with LMS, included few ALM, and grouped various uterine sarcomas (leiomyosarcoma, endometrial stromal sarcoma, carcinosarcoma, etc.) into one category despite their often having distinct MR features and clinical outcomes.

Furthermore, we are unaware of any published studies regarding the value of quantitative MR texture features for the differentiation of LMS from ALM. Texture analysis (TA) extracts local variations in pixel intensities using well-established mathematical formulas and provides a set of quantifiable metrics that may supplement radiologists’ qualitative image interpretations. Research suggests that TA may be of particular value for the differentiation of tumours with similar imaging characteristics on conventional imaging [19–23].

Thus, the aims of our study were to investigate whether qualitative MR features can distinguish LMS from ALM and to assess the feasibility of TA for categorizing these lesions.

Materials and methods

The Institutional Review Board approved this retrospective, HIPAA-compliant study and waived written informed consent.

Eligibility

We retrospectively searched our institutional database to identify all consecutive patients who 1) underwent myomectomy and/or hysterectomy, 2) had contrast-enhanced MRI within 6 months of surgery, and had either 3) histopathologically-proven leiomyosarcoma, high-grade spindle cell neoplasm, smooth muscle tumour of uncertain malignant potential (STUMP), highly cellular or cellular leiomyoma, mitotically active leiomyoma, “atypical” leiomyoma, leiomyoma with hydropic changes, myxoid leiomyoma (leiomyoma with myxoid stromal changes), epithelioid leiomyoma, or intravenous leiomyomatosis or 4) had ≥1 of the following keywords in their MRI report: atypical/unusual uterine/myometrial mass, cellular leiomyoma/fibroid, or leiomyosarcoma. The above electronic search of our institutional database yielded 144 patients with surgically resected myometrial masses between 1/1/2007 and 12/31/2013. One hundred and three patients were excluded due to lack of MRI examination prior to the surgery, MRI examination obtained without intravenous contrast administration or only partial coverage of the myometrial mass. The final study population consisted of 41 patients with histopathologically-confirmed diagnoses (ALM = 22, LMS = 19), none of whom had STUMP. As uniform acquisition parameters are required for machine learning methods, scans were excluded from TA if any of the following were present on the subjective assessment of axial T2-weighted imaging (T2WI): 1) fat saturation (ALM = 2, LMS = 2), 2) motion artefacts (ALM = 2, LMS = 5), 3) pixel size >1 standard deviation from the mean (ALM = 4, LMS = 2). Thus, TA was performed on the scans of 24 patients (ALM = 14, LMS = 10).

Histopathology

Histopathologic diagnoses rendered by fellowship-trained gynaecologic oncologic pathologists at the time of the initial surgical specimen evaluation served as our reference standard. These diagnoses were based on the Stanford criteria, supplemented by the World Health Organization’s Classification of Tumours of the Breast and Female Genital Organs [24, 25].

MR imaging protocol

MRIs were obtained at our institution (13/41, 32 %) or elsewhere (28/41, 68 %) and digitized into our picture archiving and communication system (Centricity PACS; GE Medical Systems, Milwaukee, WI). At our institution, MRIs were acquired on ≥1.5-Tesla systems (GE Medical Systems, Milwaukee, WI) using pelvic phased-array coils for signal reception. At a minimum, each study included axial and sagittal T2-weighted fast spin-echo images (repetition time msec/echo time msec, 3500–4500/85-120; bandwidth (BW), 16-32 kHz; section thickness (ST), 4 mm; intersection gap, 1 mm; field of view (FOV), 20–24 cm; ≥256 × 192 matrix), axial T1-weighted spin-echo images (typical parameters: 400-650/minimum; BW, 32 kHz; ST, 5 mm; intersection gap, 1 mm; FOV, 32–36 cm), and fat-suppressed three-dimensional spoiled gradient-recalled echo T1-weighted images (typical parameters: flip angle, 12°; BW, 62.5-83.3 kHz; ST, 3 mm with no intersection gap; FOV, 20-24 cm) obtained before and 1, 2, 3, and 4 min after intravenous administration of gadopentetate dimeglumine (Magnevist; Berlex Laboratories, Montville, NJ) at a dose of 0.1 mmol/kg of body weight.

All 28 outside MR imaging examinations (performed at 24 different imaging centres) met or exceeded standards agreed to by the investigators. These standards required ≥1.5-Tesla systems, phased-array surface coils, acquisition of axial and sagittal T2-weighted images, and spin-echo or gradient-echo axial T1-weighted images obtained before and after intravenous contrast agent administration. With the exception of one MRI examination, all contrast-enhanced sequences were obtained with fat saturation. For all sequences, the parameters were the following: field of view, 20–34 cm; section thickness, 6 mm or less; and 256 × 192 or greater matrix. We could not determine the type of the intravenous contrast agent administered or the timing protocol used at the outside institutions.

Qualitative MR feature analysis

One radiologist (D.F., not involved in MR feature interpretation) reviewed all MRI scans. For each of the patients with more than one myometrial mass (16/41, 39 %), this radiologist correlated MRI and relevant histopathologic findings and marked one index lesion per patient for both qualitative and quantitative assessment. Two blinded radiologists (Y.L. and J.C.) with 7 and 6 years of experience in oncologic MR imaging independently interpreted all MRI scans.

Each reader evaluated the following qualitative MR features in each lesion: 1) borders (smooth or nodular), 2) haemorrhage, 3) SI relative to the outer myometrium on T2WI, 4) heterogeneity on T2WI, 5) presence/location of cystic alteration(s) on T2WI (SI equal to that of urine), 6) fluid-fluid level(s) on T2WI, 7) flow voids on T2WI (round and/or serpiginous low SI vessels that enhance after contrast administration), 8) SI relative to the myometrium on contrast-enhanced images, and 9) presence/location of unenhanced areas (central versus none or non-central) on contrast-enhanced images (Fig. 1). As cystic or myxoid degeneration cannot be distinguished from necrosis on T2WI and contrast-enhanced images, we recorded the presence and location of well-demarcated unenhanced areas on contrast-enhanced images [12]. Most of the above qualitative MR features were selected because they had been evaluated in previously published reports [9–12, 15, 16]. Each reader also recorded the presence of pelvic fluid, peritoneal implants, and pelvic lymphadenopathy (i.e., short axis ≥0.8 cm).

Fig. 1

Illustrations of the four qualitative MR features that demonstrated the strongest statistical associations with LMS at histopathology. a Sagittal T2-weighted image shows a large uterine mass with nodular superior and posterior borders (white arrows). b Sagittal T2-weighted image demonstrates “T2 dark” area in the myometrial mass (white arrow). c Noncontrast T1-weighted fat-saturated image illustrates the presence of intra-lesional haemorrhage (white arrow). d Sagittal contrast-enhanced T1-weighted fat saturated image shows the presence of central unenhanced areas (black arrow)

Quantitative MR texture feature analysis

Image pre-processing

Using Insight ToolKit (ITK) software [26], the same two radiologists in consensus manually contoured all lesions (one lesion per patient in a total of 24 patients; ALM = 14, LMS = 10). On all axial T2WI with visible tumour, they traced the outer edge of each mass to segment the entire lesion and generate regions of interest (ROIs) (duration of measurement: ≤5 min per mass). To cover each lesion, ROIs were drawn on a median of 17 slices for ALM (interquartile range: 8-21) and 18 slices for LMS (interquartile range: 13-27). Variations in scanners and acquisition parameters led to the images having different intensity (gray-level) ranges. To ensure a comparable range of intensities (gray-levels) across the patients, all images were standardized through histogram matching using an in-house developed C++ wrapper around the HistogramMatchingImageFilter available in ITK [27].
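The idea behind histogram matching can be illustrated with a minimal sketch. This is not the ITK filter or the in-house wrapper described above; it is a simplified quantile-mapping version written in Python, in which the function name `match_histogram` and the use of flat intensity lists are assumptions made for illustration.

```python
from bisect import bisect_right


def match_histogram(source, reference):
    """Map each source intensity to the reference intensity at the same quantile.

    Simplified illustration only: both inputs are flat lists of intensities,
    and the mapping uses the empirical cumulative distributions directly.
    """
    src_sorted = sorted(source)
    ref_sorted = sorted(reference)
    n_src, n_ref = len(src_sorted), len(ref_sorted)
    matched = []
    for value in source:
        # Empirical cumulative probability of this source intensity, in (0, 1].
        quantile = bisect_right(src_sorted, value) / n_src
        # Position of the same quantile in the sorted reference intensities.
        idx = min(n_ref - 1, max(0, int(round(quantile * n_ref)) - 1))
        matched.append(ref_sorted[idx])
    return matched


# Toy example: a scan on a 10-40 scale mapped onto a 100-400 reference scale.
standardized = match_histogram([10, 20, 30, 40], [100, 200, 300, 400])
```

After such a mapping, the gray-level ranges of two scans become comparable, which is the property the texture features below rely on.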

Texture feature extraction

We employed in-house software implemented in MATLAB (MathWorks, Natick, MA) to extract a total of 21 texture features from each ROI. First, histogram-derived gray-level mean, standard deviation (SD), kurtosis, and skewness were computed from each T2-weighted intensity image, resulting in four texture features. Second, Gabor filters at two orientations (0° and 90°) and a single bandwidth (g = 1.4) were used to generate two Gabor edge images [28] (Supplementary Materials). Third, gray-level co-occurrence matrices (GLCM) were constructed from one T2-weighted intensity image and two Gabor edge images. Finally, Haralick texture features (energy, contrast, homogeneity, correlation, and entropy) were computed from the GLCM, yielding an additional five intensity image-based (Figs. 2 and 3) and 12 Gabor edge image-based texture features (Supplementary Materials) [29, 30].
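For readers unfamiliar with GLCM-derived features, the following Python sketch shows one common way to compute the five Haralick features named above from a small, already quantized image. It is not the in-house MATLAB software; the function names, the single horizontal pixel offset, the symmetric counting, and the specific feature definitions (for example, homogeneity as p/(1 + |i - j|)) are assumptions, since the source does not state which variants were used.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for one pixel offset.

    `image` is a list of rows holding integer gray levels in range(levels).
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def haralick(p):
    """Energy, contrast, homogeneity, correlation, and entropy of a GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    # The matrix is symmetric, so row and column means/variances coincide.
    mu = sum(i * v for i, _, v in cells)
    var = sum((i - mu) ** 2 * v for i, _, v in cells)
    cov = sum((i - mu) * (j - mu) * v for i, j, v in cells)
    return {
        "energy": sum(v * v for _, _, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
        "correlation": cov / var if var > 0 else 1.0,  # constant image: defined as 1
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
    }


# Toy images: a constant patch and a 2 x 2 checkerboard with two gray levels.
flat = haralick(glcm([[0, 0], [0, 0]], levels=2))
checker = haralick(glcm([[0, 1], [1, 0]], levels=2))
```

In this toy comparison the checkerboard has higher contrast and lower energy and homogeneity than the constant patch, which is the direction of change associated with greater textural heterogeneity.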

Fig. 2

a Axial T2-weighted image illustrates ALM. b - f Illustration of the intensity-based texture features (energy, contrast, homogeneity, correlation, and entropy) overlaid on the axial T2-weighted image

Fig. 3

a Axial T2-weighted image illustrates LMS. b - f Illustration of the intensity-based texture features (energy, contrast, homogeneity, correlation, and entropy) overlaid on the axial T2-weighted image

Texture feature selection and self-tuning spectral clustering

First, differences in texture features between LMS and ALM were assessed using the Wilcoxon signed-rank test. P-values were adjusted using Bonferroni correction to correct for the 21 comparisons.
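As a brief illustration of the Bonferroni step, the sketch below multiplies each raw p-value by the number of comparisons and caps the result at 1. The function name and the example p-values are hypothetical and are not taken from the study data; only the number of comparisons (21) comes from the text.

```python
def bonferroni(p_values, n_tests=None):
    """Return Bonferroni-adjusted p-values (raw p times number of tests, capped at 1)."""
    m = n_tests if n_tests is not None else len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values adjusted for the 21 texture-feature comparisons.
adjusted = bonferroni([0.001, 0.01, 0.04], n_tests=21)
```

Equivalently, a raw p-value must fall below 0.05/21 (about 0.0024) to remain significant at the 0.05 level after adjustment.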

Sixteen texture features that differed significantly between the two groups were selected for further analysis with self-tuning spectral clustering, a form of unsupervised clustering [31]. Making no assumptions regarding the form of data clusters, self-tuning spectral clustering separates the data into distinct groups using eigen decomposition of the affinity matrix. The self-tuning approach automatically identifies the ideal number of data clusters by analysing the eigen vectors of the affinity matrix (Supplementary Materials).
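The affinity matrix mentioned above can be sketched as follows. In self-tuning spectral clustering as it is commonly described, the affinity between two samples is scaled by a local distance for each sample, typically the distance to its k-th nearest neighbour. The Python sketch below builds only this matrix; the eigen decomposition and the selection of the number of clusters are omitted, and the function name, the default k = 7, and the toy points are assumptions rather than details taken from the study.

```python
import math


def local_scaling_affinity(points, k=7):
    """Affinity A[i][j] = exp(-d(i, j)^2 / (sigma_i * sigma_j)) with zero diagonal.

    sigma_i is the distance from point i to its k-th nearest neighbour.
    Requires at least two points.
    """
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    sigma = []
    for i in range(n):
        neighbours = sorted(dist[i][j] for j in range(n) if j != i)
        # Fall back to the farthest neighbour when fewer than k are available.
        sigma.append(neighbours[min(k, len(neighbours)) - 1])
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and sigma[i] > 0 and sigma[j] > 0:
                affinity[i][j] = math.exp(-dist[i][j] ** 2 / (sigma[i] * sigma[j]))
    return affinity


# Toy feature vectors forming two well-separated pairs.
A = local_scaling_affinity([(0, 0), (0, 1), (10, 10), (10, 11)], k=1)
```

Points within the same pair receive a high affinity and points in different pairs an affinity near zero, which is the structure the subsequent eigen decomposition exploits.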

Following the clustering, cluster labels from all ROIs belonging to the same lesion were pooled together and each lesion was assigned a cluster label using a majority-voting approach. For example, if a lesion had 20 ROIs with cluster label 1 and five ROIs with cluster label 2, the final lesion label was cluster 1.
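The majority-voting rule described above can be written in a few lines of Python. The function name is hypothetical, and the handling of ties (the label encountered first wins) is an assumption, as the source does not say how ties were resolved.

```python
from collections import Counter


def lesion_label(roi_labels):
    """Return the most frequent cluster label among a lesion's ROIs."""
    return Counter(roi_labels).most_common(1)[0][0]


# The example from the text: 20 ROIs in cluster 1 and five ROIs in cluster 2.
label = lesion_label([1] * 20 + [2] * 5)
```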

We repeated the above process of unsupervised clustering with only intensity image-based texture features (6/16) and only Gabor edge image-based texture features (10/16) to compare the performance of various sets of texture features (intensity image-based, Gabor edge image-based or both).

Statistical analysis

Continuous variables were summarized with means and ranges; categorical variables were described with frequencies and percentages. Relationships between clinical characteristics, pathologic findings, and LMS were assessed with the Wilcoxon rank-sum test for continuous variables and Fisher’s exact test for categorical variables. The associations between qualitative MR features and LMS were evaluated with the Fisher’s exact test. Four MRI features—nodular borders, haemorrhage, “T2 dark” areas, and central unenhanced areas—had the strongest statistical associations with LMS for both readers (i.e., lowest p-values) and were therefore included in further analysis. Multivariate logistic regression analysis was not possible since reliable parameter estimates could not be calculated because of the relatively small number of patients with LMS.

For each lesion, a qualitative MR feature score was determined by counting how many of the four most significant features (mentioned above) were present in the lesion. For each reader, diagnostic accuracy measures with exact 95 % CI were calculated using the presence of ≥1, ≥2, ≥3, and 4 features in a lesion as different cut-offs for diagnosing LMS.
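The scoring rule and the exact confidence intervals can be sketched as follows. The code assumes that "exact" refers to the Clopper-Pearson interval, which the source does not state explicitly, and it solves for the interval limits by bisection on the binomial distribution. All function names and the toy scores are hypothetical.

```python
from math import comb


def sensitivity_specificity(scores, is_lms, cutoff):
    """Call a lesion LMS when its feature score is >= cutoff."""
    pairs = list(zip(scores, is_lms))
    tp = sum(1 for s, y in pairs if y and s >= cutoff)
    fn = sum(1 for s, y in pairs if y and s < cutoff)
    tn = sum(1 for s, y in pairs if not y and s < cutoff)
    fp = sum(1 for s, y in pairs if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided confidence interval for a proportion x/n."""

    def solve(cdf_at, target):
        # cdf_at(p) decreases as p increases, so bisect on p in [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


# Toy data (not study data): scores count how many of the four features are present.
toy_scores = [4, 3, 2, 1, 0, 3]
toy_is_lms = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(toy_scores, toy_is_lms, cutoff=3)

# Interval for a proportion of 19 out of 19.
ci_low, ci_high = clopper_pearson(19, 19)
```

Repeating `sensitivity_specificity` with cut-offs of 1, 2, 3, and 4 gives the trade-off between sensitivity and specificity across the candidate criteria.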

Inter-reader agreement was assessed with Cohen’s kappa (k) and interpreted as follows: 1) 0-0.20, slight agreement; 2) 0.21-0.40, fair agreement; 3) 0.41-0.60, moderate agreement; 4) 0.61-0.80, substantial agreement; and 5) 0.81-1.00, almost perfect agreement [32]. P values less than 0.05 were considered statistically significant. All statistical analyses were performed using SAS 9.4 (SAS Institute, Cary, NC).
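A minimal Python sketch of unweighted Cohen's kappa for two readers, together with the interpretation bands listed above, is given below. The function names and the toy ratings are hypothetical, and the label returned for negative kappa values is an assumption because the quoted scale starts at 0.

```python
from collections import Counter


def cohen_kappa(reader1, reader2):
    """Unweighted Cohen's kappa for two equally long lists of ratings.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(reader1)
    observed = sum(a == b for a, b in zip(reader1, reader2)) / n
    c1, c2 = Counter(reader1), Counter(reader2)
    expected = sum(c1[c] * c2[c] for c in set(reader1) | set(reader2)) / n**2
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map kappa to the agreement bands quoted in the text."""
    if kappa < 0:
        return "below scale"  # assumption: the quoted scale does not cover negatives
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy ratings (1 = feature present, 0 = absent) for four lesions.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
band = interpret_kappa(kappa)
```

In the toy example the readers agree on three of four lesions (observed agreement 0.75) while chance agreement is 0.50, so kappa is 0.50.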

Results

Patients

Patient and lesion characteristics are summarized in Table 1. Patients with ALM were younger than those with LMS (p = 0.0068). Median tumour diameters at histopathology were similar (p = 0.12) between patients with ALM (median: 9.25 cm, range: 2.0-20.4 cm) and LMS (median: 11.4 cm, range: 7.0-20.0 cm). The most common symptoms leading to MRI examination were lower abdominal pain/pressure (17/41, 41.5 %), dysmenorrhea/menorrhagia (11/41, 27 %), or both of the above (4/41, 10 %).

Table 1 Patient and lesion characteristics

Qualitative MR imaging features

As shown in Table 2, for both readers, the following seven qualitative MRI features were observed significantly more frequently in LMS than in ALM: 1) nodular borders (p ≤ 0.0001), 2) intra-lesional haemorrhage (p < 0.0001), 3) “T2 dark” area(s) (p < 0.0001), 4) flow voids (p ≤ 0.0041), 5) unenhanced area(s) (p ≤ 0.0041), 6) heterogeneity on contrast-enhanced images (p = 0.0041), and 7) central unenhanced area(s) (p < 0.0001). For both readers, four of the above seven features (presence of nodular borders, intra-lesional haemorrhage, “T2 dark” area(s), and central unenhanced area(s)) demonstrated the strongest statistical association with LMS (p ≤ 0.0001 for each feature and reader).

Table 2 All qualitative MR features and their associations with LMS

For both readers, pelvic fluid, lesional SI and heterogeneity on T2WI, intra-lesional cystic alterations and their location, intra-lesional fluid-fluid levels, and lesional SI on contrast-enhanced images did not differ significantly between LMS and ALM (p-value range: 0.05-0.54). No patients had peritoneal implants or lymphadenopathy.

Diagnostic accuracy measures for the four most significant MR features (lesion borders, haemorrhage, “T2 dark” area(s), and location of unenhanced area(s)) are summarized in Table 3. When the presence of ≥3 of the four most significant MR features was used to diagnose LMS, the highest combined sensitivities (R1/R2: 1.00 [95 % CI: 0.82-1.00]/0.95 [95 % CI: 0.74-1.00]) and specificities (R1/R2: 0.95 [95 % CI: 0.77-1.00]/1.00 [95 % CI: 0.85-1.00]) were observed (Table 4).

Table 3 Diagnostic accuracy measures for each of the four qualitative MR features that had the strongest statistical associations with LMS
Table 4 Diagnostic accuracy measures for diagnosing LMS using different criteria to signify LMS

Inter-observer agreement

Inter-reader agreement regarding qualitative MR features was substantial to almost perfect (Table 5). In particular, agreement levels for lesion borders, haemorrhage, “T2 dark” area(s), and location of unenhanced area(s) were almost perfect (Table 5).

Table 5 Inter-reader agreement with regard to the qualitative MR features

Quantitative texture features

Comparison of texture feature values

Of the 21 extracted texture features, 16 (six intensity image-based and ten Gabor edge image-based) features differed significantly between LMS and ALM (p-values: <0.001-0.036) (Table 6).

Table 6 Comparison of texture feature values between LMS and ALM using a Wilcoxon signed-rank test with Bonferroni correction for multiple variables

In comparison to ALM, LMS were associated with higher intensity image-based and Gabor edge image-based contrast, lower energy, and lower homogeneity (p < 0.001 for each feature); furthermore, LMS had higher SD and lower kurtosis (p < 0.001 for both features), indicative of less peaked and more variable distribution of intensities. Taken together, these findings all suggested greater textural heterogeneity of LMS.

Interestingly, LMS demonstrated lower Gabor edge image-based entropy compared to ALM (p-values: 0.029-0.036). Intensity-based entropy did not differ significantly between ALM and LMS (p = 0.092).

Self-tuning spectral clustering

Self-tuning spectral clustering selected the ideal number of clusters by identifying the cluster groupings with the least cost. The cost measures for various data groupings were 1.04, 1.09, 1.03, and 1.07 for 2, 3, 4 and 5 clusters, respectively. Therefore, four clusters (1 ALM and 3 LMS clusters) were selected (Fig. 4). Self-tuning spectral clustering grouped the lesions with accuracy of 0.75, sensitivity of 0.70 (95%CI:0.35-0.93, 7/10 LMS), and specificity of 0.79 (95%CI:0.39-0.95, 11/14 ALM).
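The arithmetic in this paragraph can be checked with a short Python sketch. The cost values and lesion counts are taken from the text above; the variable names are hypothetical, and the false-negative and false-positive counts are derived as the complements of the reported correct classifications (7/10 LMS and 11/14 ALM).

```python
# Cost reported for each candidate number of clusters.
costs = {2: 1.04, 3: 1.09, 4: 1.03, 5: 1.07}
best_k = min(costs, key=costs.get)  # number of clusters with the lowest cost

# Lesion-level counts implied by 7/10 LMS and 11/14 ALM correctly grouped.
tp, fn = 7, 3   # LMS grouped as LMS / as ALM
tn, fp = 11, 3  # ALM grouped as ALM / as LMS

accuracy = (tp + tn) / (tp + fn + tn + fp)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

The lowest cost occurs at four clusters, and 18 of 24 lesions are grouped correctly, which corresponds to the reported accuracy of 0.75.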

Fig. 4

A plot demonstrating the results of self-tuning spectral clustering. For ease of illustration, only three of the 16 texture features that differed significantly between LMS and ALM were used to generate this figure. Self-tuning spectral clustering identified a total of four distinct data clusters (C1 through C4), comprising one ALM grouping (star) and three LMS groupings (blue, orange, and purple circles)

When only intensity image-based texture features were analysed, self-tuning spectral clustering identified four clusters (two ALM and two LMS) and achieved accuracy of 0.71, sensitivity of 0.70 (95%CI:0.35-0.93, 7/10 LMS) and specificity of 0.71 (95%CI:0.42-0.92, 10/14 ALM). When only Gabor edge image-based texture features were included, self-tuning spectral clustering selected four clusters (two ALM and two LMS) and categorized the lesions with accuracy of 0.58, sensitivity of 0.60 (95%CI:0.26-0.88, 6/10 LMS), and specificity of 0.57 (95%CI:0.29-0.82, 8/14 ALM).

Discussion

Accurate diagnosis of LMS is a clinical challenge. LM are much more common than LMS, and patients with these entities tend to present at similar ages and with similar clinical symptoms [33]. While rapid mass enlargement, particularly after menopause, may signify malignancy, it may also be observed with cellular or degenerating LM [33]. Furthermore, establishing the rate of tumour growth may not be possible when a large myometrial mass is discovered at the initial encounter. Serum markers such as LDH and LDH isozyme type 3 can be elevated not only with LMS but also with cellular and degenerated LM [11]. Similarly, CA-125 levels may overlap between LM and early-stage LMS [33, 34].

Our study first examined the ability of qualitative MRI features to differentiate LMS from atypical LM (ALM). The four qualitative MR features that had the strongest statistical associations with LMS—nodular borders, haemorrhage, “T2 dark” area(s), and central unenhanced area(s)—were reproducible between the two readers and accurately distinguished LMS from ALM. When the presence of ≥3 of these four features in a lesion was used as the criterion to diagnose LMS, the highest combined sensitivities (95-100 %) and specificities (95-100 %) were achieved. If externally validated, our results may help gynaecologic surgeons to better plan their surgical approaches and counsel women presenting for management of presumed leiomyomas.

Similarly to Sahdev et al., Schwartz et al. and Tanaka et al., we found that LMS appeared as large masses with ill-defined/nodular (rather than smooth) borders, heterogeneous iso-hyperintensity on T2WI, intralesional haemorrhage, and unenhanced areas on contrast-enhanced images [9, 10, 12]. Assuming that a combination of T1 and T2 hyperintensity and unenhanced areas signified malignancy, Tanaka et al. achieved moderate sensitivity (73 %) and high specificity (100 %) in differentiating benign and malignant smooth muscle tumours [12]. We attained higher sensitivity (95-100 %) and similarly high specificity (95-100 %) for distinguishing LMS from ALM by using the presence of ≥ 3 features to diagnose LMS.

Our findings contrast with those of Cornfield et al., who studied 17 ALM and nine malignant mesenchymal tumours (four LMS, two STUMP, two endometrial stromal sarcoma [ESS], one mixed ESS and smooth muscle tumour) and were unable to distinguish between these two groups of lesions using qualitative MR features [16]. Although they found the presence of ill-defined borders and reader “gestalt” (i.e., overall impression) to have the highest sensitivity and specificity, sensitivity was moderate at best (44-56 %). However, their study neither evaluated the location of unenhanced areas nor examined MR features other than lesion hyperintensity on T2WI.

Our study found high inter-observer agreement; however, it was based on the comparison of two readers from the same institution with similar levels of sub-specialty training and a common exposure to a large volume of gynaecologic oncologic examinations at a tertiary care cancer centre. It is possible that different results would have been obtained with other readers.

The second part of our study evaluated texture analysis (TA), where the only user activity was manual lesion segmentation. In comparison to ALM, LMS were associated with higher contrast and SD but lower energy, homogeneity, and kurtosis. These results indicate that LMS were more texturally heterogeneous than were ALM. Although higher entropy is typically associated with more heterogeneous tumours, we found Gabor edge image-based entropy of LMS to be lower than that of ALM. Furthermore, intensity image-based entropy did not differ significantly between LMS and ALM. The small differences in entropy we found may be explained by the fact that we focused on ALM rather than typically homogeneous, T2-hypointense classic LM.

We computed both intensity image-based and Gabor edge image-based texture features because we hypothesized that Gabor filters, with their superior edge detection properties, might be more sensitive to the inter-group differences in lesion borders, frequency of intralesional haemorrhage, and presence/location of necrosis. Using unsupervised clustering, we found that a combination of intensity-based and Gabor edge image-based texture features yielded the highest accuracy of 0.75, while Gabor edge image-based texture features alone had the lowest accuracy of 0.58. Thus, while edge-based texture features alone were insufficient to accurately capture the textural differences between LMS and ALM, the addition of edge-based to intensity-based texture features improved lesion categorization.

Our study had several limitations. First, it was retrospective, and we only included patients with surgically resected tumours and preoperative MR imaging, introducing a selection bias. This was necessary to ensure accurate image-pathology correlation. Second, only LMS and ALM were evaluated, which makes our conclusions less applicable to a broader patient population with various myometrial masses. This was intentional and justifiable given the characteristic imaging features of typical LM and the limited data on the ability of MRI to distinguish LMS from ALM. Third, our cohort lacked STUMP. Fourth, we were unable to retrospectively correlate qualitative MRI findings with histopathologic findings on a per-feature basis owing to the lack of histopathologic whole-tumour step-section tumour maps. Fifth, we could not perform multivariate logistic regression analysis for the selection of qualitative MR features because of the small sample size. Sixth, we were unable to assess the role of diffusion-weighted imaging (DWI) because few MRIs (13/41, 32 %) included it. Furthermore, available studies with DWI could not be compared meaningfully due to differences in acquisition parameters and scanner manufacturers. Seventh, the acquisition parameters were relatively heterogeneous between the scans since many studies were performed elsewhere. As uniform acquisition parameters are required for radiomics and machine learning methods, a number of patients were excluded from texture analysis. Only T2WI were used for TA because of their robustness and because they showed the least variation in acquisition parameters. We anticipate that the performance of TA may improve with the inclusion of textural information from additional sequences such as DWI and contrast-enhanced images. Lastly, a larger sample size and a distinct validation set are required to determine the efficacy of supervised machine learning for differentiating LMS from ALM.

While TA has the potential to assist in the differentiation of tumours in daily clinical practice, this would require the development of user-friendly tools for robust automated lesion segmentation and efficient extraction of texture features.

In summary, we identified four qualitative MRI features that had the strongest statistical association with LMS and found that the presence of ≥3 of them could accurately distinguish LMS from ALM; in addition, we found texture analysis to be a feasible semi-automated approach for lesion categorization. Future studies are needed to externally validate our results and to further evaluate the ability of machine learning techniques to correctly classify LMS and ALM.