Introduction

The detection and reporting of “black holes” (BH) is part of the standard neuroradiological evaluation in Multiple Sclerosis (MS). Indeed, given the correlation between BH and clinical disability [1,2,3], international MS guidelines recommend always including their presence and number in neuroradiological reports [4, 5]. Although acute inflammation is associated with transient BH [6, 7], prognostic relevance rests on chronic BH, defined as those persisting for at least 6 months [8, 9] in the absence of contrast enhancement [7], which represent areas of severe tissue destruction with irreversible axonal and neuronal loss [1, 3, 6].

Operationally, BH were defined more than 20 years ago as T1-weighted (T1w) hypointense lesions with a signal intensity between that of the gray matter (GM) and that of the cerebrospinal fluid (CSF) on Spin-Echo (SE) T1w images [10], corresponding to hyperintense lesions on T2w images [6, 11, 12]. Nonetheless, recent decades have seen increasing acquisition of 3D T1w sequences, not only in research settings (e.g., for quantitative assessment of brain atrophy) but also in everyday clinical practice [13]. These sequences, of which the Magnetization-Prepared RApid Gradient-Echo (MPRAGE) is the most representative, offer several advantages, such as increased spatial resolution and decreased acquisition time, but are characterized by a different tissue contrast compared to standard SE-T1w sequences [14]. Accordingly, it has been demonstrated that evaluating SE-T1w and 3D-Gradient-Echo (GrE)-T1w sequences leads to the identification of a different number of T1w hypointense lesions in MS [15]. So far, however, no information is available in the literature about intra-reader reproducibility, a crucial point in a condition such as MS, in which serial MRI scans are acquired. Furthermore, no previous work has investigated inter-reader reproducibility between neuroradiologists with different levels of experience in MS. Indeed, it can be hypothesized that newer generations of neuroradiologists are more likely to have been trained to evaluate 3D-GrE-T1w sequences, which are acquired increasingly widely and routinely.

Given this background, the aim of this study was to investigate the possible impact of different sequences (SE-T1w or 3D-GrE-T1w), image resolution, and level of training on the intra- and inter-rater reliability of BH identification in MS. Finally, as different degrees of microstructural change have been reported in SE-T1w compared to 3D-GrE-T1w hypointense lesions [16], we tested correlations between disability and BH identified on the different sequences to explore the clinical meaningfulness of the different assessment approaches.

Material and methods

Compliance with ethical standards

This study was approved by the local Ethics Committee, in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki Declaration and its later amendments. Written informed consent was obtained from all patients prior to enrolment.

Participants

In this single-center study, MRI data from MS patients, prospectively acquired from January 2019 to December 2021 in the context of a larger prospective MRI study, were selected. To be included, patients had to fulfill the following inclusion criteria: age ≥ 18 and ≤ 70 years; MS diagnosis according to the 2017 revision of the McDonald criteria [17]; absence of any medical condition associated with brain pathology other than MS; an Expanded Disability Status Scale (EDSS) score obtained within one week of the MRI exam; a Relapsing–Remitting (RR-MS) course according to Lublin et al. [18]. The following exclusion criteria were then applied: unavailability of a T1w sequence acquired after gadolinium administration; unavailability of both SE-T1w and 3D-GrE-T1w sequences acquired in the same MRI session; images of poor quality (e.g., due to motion artifacts); or patients with exclusively large confluent lesions.

A flowchart showing the number of patients included and excluded from the study is available in Fig. 1.

Fig. 1

Flowchart showing inclusion and exclusion criteria. Flowchart showing how the sample size of this study was reached after the application of inclusion and exclusion criteria. Abbreviations: MS = Multiple Sclerosis; SE = Spin-Echo; GrE = Gradient-Echo; T1w = T1-weighted

Image acquisition

All brain MRI scans were acquired on the same 3T scanner (Trio, Siemens Medical Systems, Erlangen, Germany) using the same acquisition protocol, which included a 3D Fluid-Attenuated Inversion Recovery sequence (FLAIR; TR = 6000 ms, TE = 396 ms, TI = 2200 ms, voxel size = 1 × 1 × 1 mm, 176 sagittal slices, no gap), a 2D SE-T1w sequence acquired before gadolinium administration (TR = 615 ms, TE = 8.5 ms, voxel size = 1 × 1 × 3 mm, 40 slices, no gap), and a 3D-GrE-T1w volume (MPRAGE; TR = 2500 ms, TE = 2.8 ms, TI = 900 ms, voxel size = 1 × 1 × 1 mm, 160 axial slices, no gap) acquired before and after contrast administration.

All T1w sequences were acquired with the same bicommissural orientation along the AC-PC line, to minimize possible errors in image evaluation due to multiplanar reconstruction of the 3D-GrE-T1w volume.

MRI data analysis

Three different sequences were evaluated by the readers in this study: the SE-T1w, the 3D-GrE-T1w (slice thickness = 1 mm) and the resliced (rs)GrE-T1w (slice thickness = 3 mm). Given the difference in resolution between the 2D SE-T1w sequence and the 3D-GrE-T1w volume, the latter was resliced to a 2D-GrE-T1w series with a slice thickness equal to that of the SE-T1w (3 mm). Comparing SE-T1w with rsGrE-T1w thus retains only the effect of the different tissue contrast on BH evaluation, while comparing rsGrE-T1w with 3D-GrE-T1w allows the possible effect of the different spatial resolution to be evaluated.
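The text does not state how the reslicing was performed. As a purely illustrative sketch, one simple possibility is to average each group of three consecutive 1 mm axial slices into a single 3 mm slice. The function name `reslice_axial`, the nested-list volume layout and the choice to drop incomplete trailing slabs are all assumptions made for this example, not the authors' procedure.

```python
from statistics import fmean


def reslice_axial(volume, factor=3):
    """Average every `factor` consecutive axial slices into one thicker slice.

    `volume` is a list of 2D slices, each a list of rows of intensities.
    Trailing slices that do not fill a complete slab are dropped (assumption).
    """
    n_slabs = len(volume) // factor
    resliced = []
    for s in range(n_slabs):
        slab = volume[s * factor:(s + 1) * factor]
        n_rows, n_cols = len(slab[0]), len(slab[0][0])
        # Voxel-wise mean across the slices that make up this slab.
        resliced.append([
            [fmean(sl[r][c] for sl in slab) for c in range(n_cols)]
            for r in range(n_rows)
        ])
    return resliced


# Toy volume: 7 slices of 2 x 2 voxels, slice k filled with the value k.
toy_volume = [[[float(k)] * 2 for _ in range(2)] for k in range(7)]
thick = reslice_axial(toy_volume, factor=3)
```

Under this assumption, a 160-slice volume at 1 mm would yield 53 slices at 3 mm, with one slice left over. Scanner or viewer software may instead use interpolation or overlapping slabs.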

All images were independently evaluated by two readers with different expertise: a neuroradiology fellow with 4 years of experience in the field of MS (Reader A) and a board-certified neuroradiologist with more than 10 years of experience in the MS field (Reader B).

Both readers evaluated the images a first time (T0) and again after a wash-out period of 30 days (T1).

To provide data in a different order and minimize possible learning curve effects, a random alphanumeric identification code was assigned to each sequence at T0 and randomly changed at T1.
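The recoding step can be sketched as follows. This is a minimal illustration only: the helper `assign_codes`, the six-character code length, the fixed seed and the example patient labels are hypothetical and are not taken from the study.

```python
import random
import string

ALPHABET = string.ascii_uppercase + string.digits


def assign_codes(series_ids, rng, used=(), length=6):
    """Give each series a fresh random code; return (mapping, reading order)."""
    taken = set(used)
    mapping = {}
    for sid in series_ids:
        code = "".join(rng.choices(ALPHABET, k=length))
        while code in taken:  # redraw on the rare collision
            code = "".join(rng.choices(ALPHABET, k=length))
        taken.add(code)
        mapping[sid] = code
    order = list(mapping.values())
    rng.shuffle(order)  # presentation order is randomized as well
    return mapping, order


rng = random.Random(2024)  # fixed seed so the key can be regenerated
series = [(p, s) for p in ("P01", "P02") for s in ("SE", "3D-GrE", "rsGrE")]
codes_t0, order_t0 = assign_codes(series, rng)
# At T1 every series gets a new code that was not used at T0.
codes_t1, order_t1 = assign_codes(series, rng, used=codes_t0.values())
```

Keeping the two mappings away from the readers lets the T0 and T1 counts be matched afterwards, while the readers cannot link a code to a patient, a sequence or their earlier reading.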

At all steps, images were evaluated with the readers being blinded to any clinical or demographic information.

According to the literature [7, 11, 19], BH were defined as non-enhancing T1w hypointense lesions with a minimal diameter of 3 mm and a signal intensity between that of the CSF and that of the GM, corresponding to FLAIR hyperintensities. An example of lesions defined as chronic BH on both SE-T1w and GrE-T1w sequences is shown in Fig. 2. Confluent or poorly defined lesions, as well as acute BH (i.e., those showing enhancement after gadolinium administration), were excluded from the lesion count. An example of confluent periventricular and acute BH lesions is shown in Fig. 3.

Fig. 2

Examples of chronic BH. In the upper row, SE-T1w (A) and 3D-GrE-T1w (B) images of a 58-year-old woman with MS. In the lower row, examples of lesions classified as chronic BH by an expert neuroradiologist (Reader B) on SE-T1w (C, arrows) and 3D-GrE-T1w (D, arrows), respectively

Fig. 3

Examples of large periventricular and active BH. In the upper row, SE-T1w (A) and 3D-GrE-T1w (B) images of a confluent periventricular BH in a 53-year-old woman with MS, which were not evaluated in this study. In the lower row, pre- (C) and post-contrast 3D-GrE-T1w (D) images showing an active BH in a 28-year-old woman with MS

Statistical analysis

All statistical analyses were performed using R (v. 4.2.1). Descriptive statistics are reported for demographics and lesion count.

Intraclass correlation coefficient (ICC) and corresponding 95% confidence intervals (CI) were employed to assess intra- and inter-reader reliability separately for each of the three sequences. Inter-reader reliability was assessed between readers’ T0 evaluations to minimize the influence of possible learning curve effects. According to the study by Koo and colleagues [20], values greater than 0.9 indicated excellent reliability, values ranging from 0.75 to 0.9 and from 0.5 to 0.75 indicated good and moderate reliability respectively, while values of less than 0.5 were indicative of poor reliability.
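The text does not specify which ICC form was used. As a hedged sketch, the block below assumes ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single measurement), computed from the ANOVA mean squares as (MSR − MSE) / (MSR + (k − 1)·MSE + k·(MSC − MSE)/n). It also maps a value to the reliability labels quoted above. The function names are illustrative, and confidence intervals are omitted.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ratings` holds one row per patient and one column per rating
    (two readers, or T0 and T1 of the same reader).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-patient mean square
    ms_cols = ss_cols / (k - 1)                 # between-rating mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def interpret_icc(icc):
    """Label an ICC using the thresholds quoted in the text."""
    if icc > 0.9:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.5:
        return "moderate"
    return "poor"


# Hypothetical BH counts for five patients, read by two readers.
example_counts = [[2, 3], [0, 0], [5, 4], [1, 1], [7, 8]]
example_icc = icc_2_1(example_counts)
```

If the readers are instead treated as fixed, or only consistency is of interest, a different ICC form applies and the value can differ noticeably, so the model choice is worth reporting alongside the coefficient.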

These analyses were also replicated using Cohen's kappa statistics (Supplementary Materials).
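For reference, unweighted Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance from each rater's marginal frequencies: kappa = (p_o − p_e) / (1 − p_e). The sketch below treats each BH count as a category, which is an assumption. The supplementary analysis may well have used a weighted variant or binned counts, which the main text does not specify.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    n = len(rater_a)
    # Observed proportion of cases on which the two raters agree exactly.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of the raters' marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both raters always give the same label.
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical example: presence (1) or absence (0) of BH in four patients.
kappa_example = cohen_kappa([0, 1, 1, 0], [0, 1, 0, 0])
```

In this toy case the raters agree on three of four patients (p_o = 0.75) and chance agreement is 0.5, giving kappa = 0.5.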

Possible correlations between BH number on each MRI sequence at T0 (Reader B) and patients’ clinical status, assessed via EDSS, were tested with Pearson correlation coefficient analysis.
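A minimal sketch of this correlation analysis is given below. The Pearson coefficient follows its standard definition. The 95% confidence interval uses the Fisher z-transformation, which is an assumption, since the text does not say how the reported intervals were derived. The function names and the example values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher z-transformation (needs n > 3)."""
    if abs(r) >= 1.0:
        return r, r  # atanh is undefined at +/-1
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Hypothetical BH counts and EDSS scores for five patients.
bh_counts = [1, 2, 3, 4, 5]
edss = [1.0, 3.0, 2.0, 5.0, 4.0]
r = pearson_r(bh_counts, edss)
ci_low, ci_high = fisher_ci(r, len(bh_counts))
```

Because EDSS is an ordinal scale, a rank-based coefficient such as Spearman's is a common alternative. It is mentioned here only as a design consideration, not as something the study did.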

Results

After the application of the inclusion and exclusion criteria, eighty-five MS patients were included in the analysis (M/F = 22/63; mean age = 36.0 ± 10.2 years; median EDSS = 2.0 [range: 2.0 – 3.0]).

Means, standard deviations and medians of BH counts for each reader, sequence and assessment session are reported in Table 1.

Table 1 Summary of descriptive statistics of the BH assessment

For both readers, the intra-reader ICC analysis showed that SE-T1w and rsGrE-T1w images achieved excellent reliability, whereas 3D-GrE-T1w scans achieved good reliability (ICC between 0.75 and 0.9). In particular, for Reader A, the highest intra-reader reliability was associated with SE-T1w images (ICC = 0.98, CI = 0.97–0.99), followed by rsGrE-T1w images (ICC = 0.95, CI = 0.92–0.97), while 3D-GrE-T1w images presented the lowest ICC value (ICC = 0.86, CI = 0.79–0.91) (Fig. 4). For Reader B, on the other hand, the highest intra-reader reliability was associated with rsGrE-T1w images (ICC = 0.94, CI = 0.91–0.96), followed by SE-T1w images (ICC = 0.91, CI = 0.86–0.95). 3D-GrE-T1w images presented the lowest ICC value in this case as well (ICC = 0.86, CI = 0.78–0.91) (Fig. 4).

Fig. 4

Results of ICC analysis for intra-reader reliability evaluation. Intraclass correlation coefficient and corresponding confidence intervals of BH assessment by the two readers (A and B, with 4 and 10 years of experience respectively) for the evaluated sequences. Abbreviations: ICC = Intraclass Correlation Coefficient; SE = Spin-Echo; GrE = Gradient-Echo; T1w = T1-weighted; rs = resliced

Finally, in the inter-reader ICC analysis between the Reader A and Reader B assessments at T0, all three sequences achieved good reliability (ICC between 0.75 and 0.9). Although the highest reliability was associated with SE-T1w images (ICC = 0.84, CI = 0.76–0.89), followed by 3D-GrE-T1w (ICC = 0.83, CI = 0.74–0.89) and rsGrE-T1w images (ICC = 0.81, CI = 0.72–0.87), the ICC values and their confidence intervals were similar across sequences. Comparable results were obtained with the Cohen's kappa analysis (Supplementary Materials).

For all sequences, a significant correlation was observed between BH number and EDSS score (SE-T1w: r = 0.25, p = 0.03, CI = 0.03–0.45; rsGrE-T1w: r = 0.30, p < 0.01, CI = 0.08–0.49; 3D-GrE-T1w: r = 0.28, p = 0.01, CI = 0.06–0.47).

Discussion

The present study demonstrates that, when applying the traditional definition of BH, the 3D-GrE-T1w sequence is prone to greater intra-reader variability than the SE-T1w, with this effect being driven by the higher spatial resolution of the 3D-GrE-T1w sequence. Indeed, when the same sequence was resliced to a slice thickness comparable to that usually acquired with SE-T1w (thus preserving the difference in tissue contrast while minimizing the possible effect of voxel resolution), we observed a reliability comparable to that achievable with the SE-T1w.

In recent years, 3D-GrE-T1w sequences have largely been preferred to SE-T1w, as they allow the relatively fast acquisition of whole-brain volumes, which are indispensable in research settings for morphometric segmentation and quantitative evaluation of GM volume and thickness, and reduce acquisition times, with a direct impact on everyday clinical practice [16, 21]. Furthermore, the higher sensitivity of these sequences in identifying small lesions and subtle differences in tissue contrast is well known [15, 21], and in the future they might also be used as inputs to machine-learning algorithms that could help improve lesion detection [22]. Regarding BH identification, the issue of reproducibility has gained growing interest over time [15, 23]. Indeed, reliability in BH identification directly affects neuroradiological reports, and consequently the information provided to neurologists. Our results indicate that spatial resolution, more than image contrast, is directly related to reliability. As such, evaluating a resliced 3D-GrE-T1w sequence in clinical practice could result in higher reproducibility of BH assessment, close if not comparable to that obtained with SE-T1w images. Interestingly, when assessing the inter-observer reliability between a less experienced reader and a more experienced neuroradiologist, we observed similar concordance for all sequences, with only a slightly higher agreement when the SE-T1w sequence was evaluated, suggesting that the level of expertise exerts a similar influence on variability independently of the sequence. A possible explanation for this finding may lie in the abovementioned widespread and increasing use of 3D-GrE-T1w images in clinical practice, which could have made even less experienced raters familiar with BH detection on this sequence.
On the other hand, whereas the latest update of the MAGNIMS–CMSC–NAIMS guidelines on MRI protocols in MS focuses on acquiring high-resolution T1w sequences [13], it is worth noting that some MS centers continue to acquire SE-T1w sequences. Thus, further studies are warranted to explore the reliability of these different sequences in BH assessment by readers without specific experience in the MS field, to better simulate the daily clinical setting.

When evaluating correlations between BH and disability, our results are in line with previous studies showing a correlation between BH number and EDSS [15, 16], which we observed for all three sequences. Indeed, we found a weak, although statistically significant, correlation between these variables, a result that was expected given the narrow EDSS range of our MS group. It is worth remembering that, while the pathological relevance of SE-T1w hypointense lesions is well understood, the changes underlying hypointense lesions on 3D-GrE-T1w sequences have not been unequivocally clarified [24]. Indeed, while more severe microstructural changes characterize SE-T1w compared to 3D-GrE-T1w hypointense lesions [16], the latter might represent the sum of a wide range of pathologic features, some of which could be reversible, such as edema and inflammation [15, 25].

This investigation has some limitations. First, as this was a single-center study, the relatively small sample size could have lowered the sensitivity of our analyses, also considering that our group of patients included only the RR-MS phenotype, whereas BH are known to be commonly found (although usually in a confluent manner) in progressive stages of the disease [26]. Second, we only explored possible correlations between BH number and EDSS, whereas it would be of interest to further confirm the prognostic role of BH by testing their correlation with other known prognostic biomarkers of the disease. Furthermore, since we analyzed only images acquired on a 3T scanner, this study lacks information about the reliability of these sequences at 1.5T. Despite the well-known limitations of routine acquisitions at a lower magnetic field strength [27], 1.5T scanners are widely used in clinical neuroradiological practice, and it would therefore be of interest to compare the reliability of these sequences at 1.5T. Moreover, we could not address the variability in BH assessment across the wide range of scanner vendors and platforms currently available, also in light of the systematic differences that can be present even in studies with consistent scanner field strength and manufacturer after protocol harmonization [28]. For these reasons, future multi-center prospective studies are warranted to evaluate whether the results presented here can be generalized to different scanner platforms and field strengths.

Despite these limitations, our study suggests that, to ensure reliability levels in BH count comparable with those of the standard SE-T1w, which is crucial in the neuroradiological workup of MS patients, assessment on a resliced GrE-T1w sequence should be recommended.