Introduction

Arterial vasospasm plays a critical role in the development of delayed cerebral ischemia (DCI) after aneurysmal subarachnoid hemorrhage (SAH). It occurs in 20–50% of patients following aneurysmal rupture and contributes substantially to poor outcome through a combination of secondary injury mechanisms [1,2,3,4,5,6]. Advancing the understanding of these mechanisms is facilitated by reproducible measurement of cumulative hemorrhage burden. In 1980, C. Miller Fisher first used initial CT imaging to risk-stratify patients for vasospasm [7]. The Fisher scale specifically defined thin SAH as less than 1 mm and thick SAH as greater than 1 mm. It was not meant to be an ordinal scale, as group 3 carries a higher risk of vasospasm than group 4; it has nonetheless been applied as an ordinal scale and reported incorrectly. It also does not account for the additive risk of ventricular blood. As a result, classification of patients by the Fisher scale is inconsistent, and errors are frequently noted in its application [8, 9].

Claassen et al. [10] revisited the Fisher scale in order to develop a simpler admission CT rating scale with superior predictive value for DCI. Claassen’s scale accounts for the separate and additive risk of thick SAH (completely filling one or more cisterns or fissures) and bilateral intraventricular hemorrhage. This score is often referred to as the “Modified Fisher scale” (mFS), although the manuscript made no mention of this name [11]. Some studies report this as the “Claassen scale” or “Columbia scale.” Frontera et al. [12, 13] subsequently coined the term “mFS,” which they found predictive of symptomatic vasospasm (not DCI). Their analysis utilized retrospective data collected on 1378 scans from a clinical trial, and no images or measurements were available. Therefore, no explicit measurement criteria were used in Frontera’s mFS to classify blood as thick or thin, and any IVH (not necessarily bilateral) was graded as present or absent (Fig. 1). Although Frontera et al. used different criteria from Claassen et al., many reference their work interchangeably when referring to the mFS (Table 1, Supplemental Table 1) [14, 15].
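As a concrete illustration, the Frontera et al. grading just described (thin vs. thick SAH with no explicit measurement cutoff, and any IVH scored as present or absent) reduces to a simple decision rule. The sketch below assumes the grade assignments as commonly summarized; the function name is our own, and the handling of IVH without SAH is our assumption, since the source papers leave that combination ambiguous:

```python
def modified_fisher_grade(sah: str, ivh: bool) -> int:
    """Modified Fisher Scale per Frontera et al., as commonly summarized.

    sah: "none", "thin", or "thick" -- note that Frontera et al. give no
         explicit millimeter cutoff separating thin from thick clot.
    ivh: True if blood is present in any ventricle (bilaterality is not
         required, unlike Claassen's scale).
    """
    if sah == "thick":
        return 4 if ivh else 3      # thick SAH, with / without IVH
    if sah == "thin":
        return 2 if ivh else 1      # thin SAH, with / without IVH
    if sah == "none" and not ivh:
        return 0                    # no SAH, no IVH
    # IVH without SAH is not explicitly graded in the source papers,
    # so this sketch refuses rather than guessing a grade.
    raise ValueError("combination not defined by the scale as published")
```

For example, a scan with thick cisternal clot and blood in a single ventricle would be graded 4 under Frontera's criteria, whereas Claassen's scale would require bilateral IVH for the equivalent grade.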

Fig. 1
figure 1

The modified Fisher Scale according to Frontera et al. criteria with representative images

Table 1 Comparison of CT imaging scales of vasospasm/delayed cerebral ischemia

While prior studies have assessed the interrater reliability (IRR) of the mFS between a limited number of investigators at a single institution and found moderate-to-good reliability [16, 17], IRR across institutions and across working definitions of the mFS criteria is unknown. In the present study, we hypothesized that attending physicians who routinely care for patients with SAH do not agree on the definitions of the mFS criteria and that, as a result, the mFS has limited IRR. We also performed a systematic literature review to assess for inconsistencies in the application of the mFS. As the National Institute of Neurological Disorders and Stroke (NINDS) Common Data Elements initiative has recently included the mFS as a highly recommended supplemental imaging grade [18], it is important that any misunderstanding of mFS grading be elucidated now and that the definitions of the scale criteria be properly understood, to increase its validity in clinical trials and large population studies. A search of clinicaltrials.gov (accessed October 14, 2020) revealed 18 active or recruiting studies involving aneurysmal SAH and vasospasm (VSP), including trials of prevention (cilostazol and nimodipine combined, clazosentan, milrinone, CSF alteration, stellate ganglion block) and prediction (acetazolamide challenge with perfusion, 18F-FDG PET/CT), as well as 26 completed studies. Despite the low prevalence of aneurysmal SAH, VSP is a very active area of study, and enrollment into these studies should rely on good IRR for the scale used to predict VSP.

Methods

Study Design

This study had two parts: a cross-sectional survey and a systematic review of the existing literature. The survey was administered online to physicians from multiple institutions through the research survey portion of the Neurocritical Care Society website, as well as to a convenience sample of personal email contacts through snowball sampling. This study was reviewed and approved by LVHN’s institutional review board (IRB) and qualified as Human Subjects Research in Exempt Category (2)(i).

Instrument (Supplemental Figure 1)

Fifteen admission CT scans of patients with SAH from the authors’ institutions were randomly selected, anonymized, and made into videos for ease of scrolling, then uploaded into a Google Form survey (Fig. 2, Supplemental Figure 1). Participants were asked to grade the mFS for each CT scan. Only self-identified attending physicians who assign the mFS were asked to participate. Additional data collected included how each participant defines “thick” versus “thin” clot and how “IVH” is scored, as well as demographic data with questions about medical training, experience with grading the mFS, and training in mFS administration. Surveys were anonymous.

Fig. 2
figure 2

A representative scan (scan 13) from the survey with participants’ modified Fisher Scale grading. Full scan can be seen here: https://youtu.be/znM-I282Kbs

Statistical Analysis

Descriptive statistics were used to summarize the training characteristics of the participants. Frequencies and percentages are presented for categorical variables, while the median and interquartile range (IQR) are presented for continuous and ordinal variables. Kendall’s coefficient of concordance was applied to determine interrater reliability (IRR). Kendall’s coefficient of concordance is appropriate when three or more raters rate the same subjects (the same raters assess each subject) on an ordinal or continuous scale; 0 indicates no concordance, and 1 indicates perfect concordance [19]. Subset analysis was performed to determine the IRR of several subgroups based on definitions of mFS criteria and level of training.
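For readers who wish to reproduce this statistic, Kendall's W can be computed from an m-rater by n-subject matrix of grades. The sketch below (function and variable names are our own) ranks each rater's grades and applies the standard formula W = 12S / (m²(n³ − n)); a production implementation would also apply the correction for tied ranks, which arise frequently with ordinal mFS grades:

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance (no tie correction).

    ratings: array-like of shape (m raters, n subjects).
    Returns W in [0, 1]: 0 = no concordance, 1 = perfect concordance.
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    # Rank each rater's grades across the n subjects (ties get average ranks)
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    rank_sums = ranks.sum(axis=0)                    # R_j for each subject
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # spread of rank sums
    return 12.0 * s / (m ** 2 * (n ** 3 - n))
```

Three raters assigning identical grades to four scans yields W = 1, while two raters in exactly opposite rank order yields W = 0; intermediate values such as the W = 0.586 reported below reflect partial agreement.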

Literature Search

In a secondary analysis, we performed a systematic review of original research citing the mFS paper by Frontera et al. as well as Claassen et al.’s paper, “The Fisher Scale Revisited.” We searched PubMed and Scopus (accessed February 8, 2020) for original research citing those two papers. We excluded case reports, review articles, studies not readily available in English, and studies that used the mFS for reporting complications of procedures. We assessed each paper for its inclusion of definitions of the mFS criteria, whether the definitions (if present) were correct, and how the scale was used in the study (as part of the demographics reported, as a variable in a predictive/correlative model, as a matched variable, as an adjusted variable, or as a comparator) (for definitions, see Supplemental Table 1).

Results

We received 47 responses to the survey—one response from a non-physician was excluded, leaving 46 responses. We could not determine a response rate due to our utilization of snowball sampling. There were 32 medical centers represented, reported as treating a median of 80 (IQR 50–100) patients with SAH per year. Nearly all participants had completed a fellowship in neurocritical care, but only approximately one quarter reported “formal training” in grading the mFS (Table 2).

Table 2 Demographics and training characteristics of participants

In reporting definitions of mFS criteria according to Frontera et al.’s criteria, only 24% of participants correctly identified that there is no explicit measurement distinguishing thick from thin SAH, but just over half (52%) correctly identified that any blood in any ventricle is scored as intraventricular blood (Table 3). Half of the participants recognized that there is a distinction between Claassen’s scale and the mFS, while 33% stated there was no distinction, and 17% reported not knowing whether there was a distinction. Most participants (72%) would take an online training module to standardize scoring of the mFS.

Table 3 Participants’ definitions of mFS components

In grading the 15 CT scans for the mFS without being provided criteria, the overall IRR by Kendall’s coefficient of concordance was W = 0.586 (p < 0.0005), a statistically significant, moderate level of agreement. Those who correctly identified the thin and ventricular blood definitions demonstrated a significantly better level of agreement (W = 0.727, p < 0.0005), while those who claimed to have had formal training performed similarly to the entire cohort (W = 0.588, p < 0.0005).

In a secondary analysis, we found 241 papers referencing Frontera et al.; 108 fit the inclusion criteria for evaluation. There were 421 papers referencing Claassen et al., and 91 fit the inclusion criteria. Accounting for overlap, there were a total of 164 original research papers utilizing the mFS, with only 17 explicitly listing Frontera et al.’s criteria when utilizing the mFS. Nine papers explicitly listed Claassen et al.’s criteria as the criteria for the mFS. While the majority did not state any criteria used to grade the mFS, several papers were unclear in their criteria, interchangeably using the Fisher and modified Fisher scales, partially listing criteria, or citing incorrect references for named scales (Supplemental Table 1). The majority of studies used the mFS as a variable for prediction or correlation (100 studies), and another 44 studies reported the mFS in demographics. In addition, the mFS was used as an adjusted variable (5 studies), a comparator variable (10 studies), inclusion criteria (2 studies), and a matched variable (2 studies) (Supplemental Table 1).

Discussion

Among attending neurointensivists from over 30 institutions with high volumes of patients with SAH, we found only moderate IRR of the mFS. Most participants reported being responsible for grading the mFS, and nearly one in five of the participants had published data utilizing the mFS.

Prior studies have found higher IRR compared to our data. Claassen et al. did not measure IRR for their grading scale but did find that the IRR for their SAH and IVH measurements indicated good (κw = 0.6–0.8) and excellent (κw = 0.8–1.0) agreement [10]. A retrospective study of 271 patients’ CT scans graded by four raters found the mFS to have moderate-to-good agreement (κw = 0.64) [16]. A later single-center study of 150 patients found similar results with two raters (κw = 0.61) [17]. Most recently, four raters from a single institution grading 165 patient scans were found to have moderate agreement for the mFS (κ = 0.42) [20]. Our study differs from prior work in the large number of raters and institutions represented. We think our study reflects the heterogeneity of raters that would contribute mFS grades to a large, multicenter clinical trial.

The burden of subarachnoid and intraventricular hemorrhage predicts symptomatic vasospasm and delayed cerebral ischemia [2, 21]. Hemolysis leads to inflammation, oxygen free-radical reactions, and endothelial injury that drives vasoconstriction [22]. The mFS holds tremendous research value as a grading system that facilitates risk stratification for symptomatic vasospasm based on blood burden. However, in order to be valid, a grading scale must have clear criteria and demonstrate good IRR. The mFS has clear utility in research and should not be replaced, but it should be standardized. According to our data, the mFS lacks good IRR, likely due to uncertainty regarding the scale criteria. We hypothesize that much of the confusion stems from the slight differences in the foundational papers by Claassen et al. and Frontera et al., in which the same author group used slightly different criteria to assess related but not identical outcomes (delayed cerebral ischemia vs. symptomatic vasospasm) [10, 12]. Our literature search found that many authors have attributed the mFS to Claassen et al. and sometimes integrate the original Fisher criteria for thin versus thick blood into the mFS, adding to the confusion. Of the 37 studies that provided some definition of the mFS, only 17 listed the correct criteria and 20 listed incorrect or incomplete criteria. Even the NINDS Common Data Elements Project Investigators incorrectly attributed the mFS to Claassen et al. in one publication [23], while defining thin blood and thick blood according to the original Fisher criteria in another [24]. The CDE states that it is an attempt to “harmonize and standardize data collected for clinical studies in neuroscience,” and if incorrect and inconsistent definitions are used, this will only further the confusion over the correct definition of the mFS. The incorrect scale criteria offered by commonly referenced websites such as mdcalc.com and UpToDate® compound the problem [11, 25].

In our study, the IRR of the mFS was only moderate. Of note, half of our participants were not aware that Claassen’s scale was distinct from the mFS, about half could correctly identify the criteria for IVH, and less than a quarter recognized the criteria for thin versus thick blood. Nonetheless, there is reason to believe that proper training could improve the IRR. The study with the highest IRR provided each rater with a detailed description of the scale [16], suggesting that simply providing the criteria can bolster the IRR of the scale. Similarly, in our study, those participants who could properly define the mFS score components showed good agreement.

With the inclusion of the mFS in the CDE, it is time to standardize the definition of and training for the mFS. Many other grading scales (clinical and radiographic) require standardized training prior to inclusion in clinical trials, with continued re-education and certification to ensure the validity of the data. For example, inclusion of the National Institutes of Health Stroke Scale (NIHSS) in any trial requires certification of the examiners. An online-/video-based training program for the NIHSS has improved the reliability of the scale and is now standard practice prior to inclusion in any study [26]. Similarly, interventional stroke trials require online training and certification for the Alberta Stroke Program Early CT Score (ASPECTS) after early studies showed insufficient interrater reliability [27, 28].

The consequences of the deficiencies in the IRR of the mFS are unknown. We found 164 original research articles referencing the mFS. A substantial portion found the scale to be an important predictor variable or an adjusted variable in a predictive model. The reproducibility of those results may depend on the consistency of mFS grading across institutions. As trials attempt to enrich their cohorts for the outcomes of symptomatic vasospasm and DCI [29, 30], reliable scoring would be important for clinical trial entry. Notably, studies that focused on subtypes of SAH using components of the mFS without naming the scale (such as a focus on thick clot) [30] showed benefit of therapy only for that subtype, underscoring the importance of good IRR in evaluating scans for therapeutic effect. Further, the validity of trial results based on an mFS with poor IRR should be evaluated for type II error.

Our study has several limitations. The survey participants were mostly neurologists with neurocritical care certification. Others, including neurosurgeons, neuroradiologists, or non-physicians, may grade the mFS at some institutions. Although our participants self-identified as being primarily responsible for grading the mFS in their patients, we cannot be sure that they represent mFS graders at large. In addition, our participants were relatively inexperienced, with a median of 5 years of neurocritical care practice; it is uncertain whether this is problematic, and others have reported no influence of experience on the IRR of the mFS [16]. We anticipated a higher response, though our recruitment methods did not allow for calculating a true response rate. To improve the response rate, we limited the number of CT scans reviewed and were therefore unable to show all possible visual permutations. Additionally, our sample was obtained through non-probability sampling. We hypothesize that respondents were more interested in mFS grading than those who did not complete the survey and thus may have been more familiar with the accurate definitions; if anything, this would bias results toward a higher rather than lower IRR, but we cannot be sure. The survey included videos of scans that could be scrolled but did not offer windowing or measurement tools, though the mFS does not require measurement. We did not receive any feedback from participants that technical factors impaired their grading.

Conclusion

IRR among raters in grading the mFS is inadequate and may be related to discrepancies regarding the definitions of the score criteria. The NINDS SAH Common Data Elements may require further clarification in order to standardize research in SAH. More importantly, the mFS may become a core tracking metric required for Comprehensive Stroke Centers and endorsed by the Joint Commission (like the Hunt and Hess Scale). Many other common data points, such as the NIHSS and ASPECTS for ischemic stroke, involve formal standardized training and certification with continuing education, especially if the data are to be used in research. The mFS would benefit from a similar formalized training program with certification, and 72% of participants agreed that they would take formal online training. Alternatively, an automated imaging pipeline capable of more accurately and rapidly measuring cisternal and ventricular hemorrhage may prove superior in facilitating large cohort studies evaluating underlying mechanisms of injury [24].