Introduction

Idiopathic intracranial hypertension (IIH) is a syndrome of elevated intracranial pressure for which no causative factor can be identified. Raised intracranial pressure typically manifests as papilloedema, that is swelling of the intraocular (prelaminar) portion of the optic nerve head, which can result in permanent visual impairment in approximately 10% of patients [1]. Assessment of the appearance of the optic disc (judging both the presence and the degree of papilloedema) is a key determinant when evaluating disease status and ultimately influences therapeutic decisions.

The Frisén classification is the most frequently used papilloedema grading system, being widely employed in both clinical and research environments in IIH [2]. Papilloedema is classified into six grades reflecting optic disc axonal distension (Table 1), and the original article included photographic examples. Although the Frisén classification is widely utilized, we felt that it has a number of limitations. Firstly, the grading does not take into account vascular changes at the disc such as venous stasis, hyperaemia, haemorrhages and infarcts, which can be of clinical importance. Secondly, the grading describes the disc changes seen in progressively developing papilloedema and consequently does not classify features observed in resolving papilloedema (e.g. residual optic disc halo). Finally, the system does not classify optic disc atrophy (appearing as disc pallor), a potential and irreversible consequence of prolonged severe papilloedema. Despite these limitations, the Frisén classification is frequently used as an outcome measure in research studies [3–6]. There is thus a need to further investigate the inter-rater reproducibility and the ability to discriminate changes in the optic discs using the Frisén classification.

Table 1 Frisén classification of optic disc swelling [2]

The aim of this study was to examine the reproducibility of the Frisén classification in a series of optic discs from patients with IIH reviewed by multiple observers. In addition, we evaluated for the first time the ability of the Frisén classification to discern changes in the appearance of optic discs over time. We hypothesised that the Frisén classification was not sufficiently sensitive to discriminate changes in optic disc appearance, and hence compared it with a simple strategy of disc ranking (discs were ranked in order of papilloedema severity) in order to determine which was the better method for discriminating changes in optic disc appearance.

Methods

Disc photographs were obtained from 25 patients with newly diagnosed (acute) IIH and 22 patients with chronic IIH (disease duration greater than 3 months). Pairs of photographs were obtained, one taken at enrolment into a research study, the other following a period of study intervention. Subjects with acute IIH had photographs taken 12 months apart as part of a prospective randomized controlled pilot trial of acetazolamide [7]. Subjects with chronic IIH had photographs taken 3 months following the introduction of an intensive low-energy diet as part of a prospective evaluation of the effect of weight loss in IIH [8]. Ethical approval for the study was obtained from the North West Multicentre Regional Ethics Committee and from Dudley Local Research Ethics Committee in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki. All study participants gave written informed consent. All photographs were acquired at the Birmingham and Midland Eye Centre, Sandwell and West Birmingham Hospitals NHS Trust, from subjects with dilated pupils using identical equipment comprising a Topcon TRC-50IX camera (Topcon, Singapore) with a Nikon D1X camera body (Nikon Optical, Milton Keynes, UK).

Photographs were assigned a computer-generated random numerical code known only to the investigators (A.S. and A.B.). The photographs from each eye taken at the start and end of the study intervention period were paired and randomly sequenced. Six observers (A.J., A.R., B.G., M.L., T.M., M.B.) blinded to patient identity, clinical information and chronology of the photographs examined the paired photographs using Windows Photo Gallery (Microsoft) without conferring. All observers were senior clinicians with expertise in optic disc evaluation. Each observer, guided by the descriptions of the Frisén classification and photographic examples, allocated a Frisén grade to each disc photograph. Additionally, a system of “disc ranking” was used to classify the pairs of optic disc photographs. The observers were asked to compare the severity of papilloedema in each pair of photographs and to choose the disc which they judged to demonstrate less papilloedema, or to label the discs the same if they considered that there was no difference in the extent of papilloedema between the two photographs.

Statistical analysis

Agreement amongst the reviewers in classifying the optic discs was evaluated. Initially agreement amongst all six reviewers was assessed. The probability of pairs of reviewers agreeing on the disc classification was then examined by analysing all possible pairings of the reviewers (e.g. A.J. and A.R., A.J. and B.G., A.J. and M.L. etc., totalling 15 possible combinations of observer pairings). For Frisén grading, in addition to evaluating agreement between pairs of reviewers, the extent of disagreement was also noted (i.e. classification differing by one, two, three, four or five Frisén grades). Agreement between all six reviewers, and subsequently pairs of reviewers, was assessed for disc ranking. Disagreement in disc ranking amongst pairs of reviewers was quantified as either “minor disagreement”, where one reviewer classified a photograph as demonstrating less papilloedema while the other reviewer classified both photographs as the same, or “major disagreement”, where reviewers chose different photographs from the pair of photographs as showing less papilloedema.
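The pairwise scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the analysis code used in the study: the function names, the reviewer labels R1 to R3 and the toy grades are all hypothetical, and the ranking choices are assumed to be coded as "first", "second" or "same".

```python
from collections import Counter
from itertools import combinations
from math import comb


def pairwise_grade_differences(grades_by_reviewer):
    """Tally absolute Frisén grade differences over every pair of reviewers.

    grades_by_reviewer maps a reviewer label to that reviewer's list of
    grades (0-5), with the photographs in the same order for every reviewer.
    A difference of 0 means the pair of reviewers agreed on that photograph.
    """
    differences = Counter()
    for first, second in combinations(sorted(grades_by_reviewer), 2):
        for g1, g2 in zip(grades_by_reviewer[first], grades_by_reviewer[second]):
            differences[abs(g1 - g2)] += 1
    return differences


def ranking_disagreement(choice_a, choice_b):
    """Classify two reviewers' rankings of one pair of photographs.

    Each choice names the photograph judged to show LESS papilloedema
    ("first" or "second"), or "same" if no difference was seen.
    """
    if choice_a == choice_b:
        return "agree"
    if "same" in (choice_a, choice_b):
        return "minor"  # one reviewer saw a difference, the other did not
    return "major"      # the reviewers chose opposite photographs


# Hypothetical toy data: three reviewers grading four photographs.
toy_grades = {
    "R1": [0, 1, 2, 5],
    "R2": [0, 2, 2, 0],
    "R3": [1, 1, 3, 4],
}
toy_differences = pairwise_grade_differences(toy_grades)
toy_agreement = toy_differences[0] / sum(toy_differences.values())

# Six reviewers give 15 distinct pairings, as stated in the text.
n_reviewer_pairs = comb(6, 2)
```

Counting every grade difference rather than only exact matches keeps the extent of disagreement (one to five grades) available from the same tally.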

The agreement between Frisén grading and disc ranking to identify the disc photograph with less papilloedema was then assessed for all reviewers. Complete agreement indicated concordance in allocating the lower Frisén grade to the photograph labelled as having less papilloedema. Minor disagreement indicated that a difference between a pair of photographs noted using one method was not noted using the other method (a difference being dissimilar Frisén grades or a disc identified as having less papilloedema on disc ranking). Major disagreement indicated that a disc of a pair noted as having less papilloedema on disc ranking was given the highest Frisén grade. Finally, the abilities of the two methods to discriminate changes in papilloedema were compared. Analyses were primarily carried out for the whole cohort, but disc photographs from patients with acute and chronic IIH were also separately analysed.
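As a rough illustration of these three categories, the comparison for a single reviewer and a single pair of photographs might be coded as follows. The function name and the "first"/"second"/"same" coding are hypothetical, and the sketch assumes that both methods reporting no difference counts as complete agreement, which the definitions above do not state explicitly.

```python
def compare_methods(grade_first, grade_second, ranking):
    """Compare Frisén grading with disc ranking for one pair of photographs.

    grade_first, grade_second: Frisén grades (0-5) given to the two photographs.
    ranking: "first" or "second" for the photograph ranked as showing LESS
             papilloedema, or "same" if the reviewer saw no difference.
    """
    # Translate the two grades into the same vocabulary as the ranking.
    if grade_first < grade_second:
        by_grade = "first"
    elif grade_second < grade_first:
        by_grade = "second"
    else:
        by_grade = "same"

    if by_grade == ranking:
        # Assumption: "same" by both methods is treated as agreement.
        return "complete agreement"
    if "same" in (by_grade, ranking):
        # Only one of the two methods detected a difference.
        return "minor disagreement"
    # The photograph ranked as less swollen received the higher grade.
    return "major disagreement"
```

For example, grades of 2 and 2 with a ranking of "first" would be a minor disagreement: ranking detected a change that grading did not.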

Statistical analyses were performed using SPSS 17.0 (SPSS, Chicago, IL). Continuous variables were analysed using descriptive statistics. Dichotomous variables were compared using the Sign test. The degree of agreement between rankings was calculated using Kendall’s tau-b statistic. Paired comparisons of disc ranking and Frisén grading were evaluated using McNemar’s test. The significance level was set at 0.05.
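For readers unfamiliar with McNemar’s test, the following is a minimal sketch of its exact (binomial) form in Python. It is not the SPSS procedure used in the study, and whether SPSS applied the exact or the chi-square version is not stated here. The function name is hypothetical; the inputs are the two discordant counts, for instance pairs judged different by ranking only and pairs judged different by Frisén grading only.

```python
from math import comb


def mcnemar_exact(only_method_a, only_method_b):
    """Two-sided exact McNemar p value from the two discordant counts.

    only_method_a: pairs in which only method A detected a difference.
    only_method_b: pairs in which only method B detected a difference.
    Concordant pairs carry no information for this test and are ignored.
    """
    n = only_method_a + only_method_b
    if n == 0:
        return 1.0
    k = min(only_method_a, only_method_b)
    # Under the null hypothesis each discordant pair falls on either side
    # with probability 0.5, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With hypothetical discordant counts of 10 and 2, the function returns 158/4096, approximately 0.039, which would be significant at the 0.05 level used here.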

Results

Photographs of each eye of 47 patients before and after therapeutic intervention were included. Thus a total of 188 photographs were assessed by each of the six observers. One observer omitted to rank the discs of one pair of photographs, but completed all the other assessments.

Frisén grading

The frequencies with which each Frisén grade (0–5) was allocated to the optic disc photographs by each of the reviewers are illustrated in Fig. 1a. Grades 1 and 2 were observed most frequently (30.2% and 29.7%, respectively). This was also the case when the optic discs from patients with acute and chronic IIH were considered separately (grade 1, acute 31.3%, chronic 29.0%; grade 2, acute 25.8%, chronic 34.1%). Grade 5 was least frequently allocated: 16 comparisons (1.4%), with most of these being in patients with acute IIH (12 comparisons) compared to those with chronic IIH (4 comparisons).

Fig. 1

Bar charts illustrating the Frisén grading results, and the degree of agreement between Frisén grading and disc ranking. a Distribution of Frisén grades from optic discs of patients with acute and chronic IIH. b Concordance between Frisén grade and disc ranking in identifying which of pairs of photographs demonstrated less papilloedema. Minor disagreement indicates that a disc was judged different by one method but the same by the other method; Major disagreement indicates that a disc was judged as showing less papilloedema by one method but as showing more severe papilloedema by the other method (*p < 0.05, disagreement between the two methods significantly more likely in patients with chronic IIH than in those with acute IIH). c Ability of Frisén grading (Frisén) and disc ranking (Rank) to distinguish pairs of photographs from patients with acute and chronic IIH. Same indicates that the pairs of disc photographs could not be distinguished; Different indicates that differences were noted in the pairs of photographs (***p < 0.001)

Complete agreement amongst all six reviewers in the allocation of Frisén grades to the optic disc photographs was rarely observed (3 of 188 photographs, 1.6%). Agreement amongst five out of six reviewers was more frequently noted (26 of 188 photographs, 13.8%). Agreement between pairs of reviewers in allocating a Frisén grade to the optic disc photographs was then evaluated (Table 2). Complete agreement between pairs of reviewers was observed in 1,019 of 2,820 comparisons (36.1%).

Table 2 Observer variability in Frisén grading showing the agreement between pairs of reviewers in allocating a Frisén grade to optic disc photographs of patients with acute and chronic IIH

The pairs of reviewers disagreed in the allocation of the Frisén grade in 63.9% (1,801) of the comparisons. Most frequently, the reviewers disagreed by only one Frisén grade (1,288 comparisons, 45.7%) with disagreement by five Frisén grades (major disagreement) only being noted for two comparisons (0.1%). Frisén grades were allocated with similar agreement to optic discs from patients with acute and chronic IIH (Kendall’s tau-b statistic 0.016, p = 0.360).
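The denominators behind these percentages follow directly from the study design, and a short Python check makes the arithmetic explicit. The variable names are illustrative only; every number is taken from the text.

```python
from math import comb

# 47 patients, two eyes each, photographed before and after intervention.
photographs = 47 * 2 * 2

# Six reviewers form 15 distinct pairs; each pair is compared on every photograph.
reviewer_pairs = comb(6, 2)
pairwise_comparisons = reviewer_pairs * photographs

agreements = 1019
disagreements = 1801
one_grade_apart = 1288
five_grades_apart = 2


def percent(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * count / total, 1)


pct_agree = percent(agreements, pairwise_comparisons)
pct_disagree = percent(disagreements, pairwise_comparisons)
pct_one_grade = percent(one_grade_apart, pairwise_comparisons)
pct_five_grades = percent(five_grades_apart, pairwise_comparisons)
```

Agreements and disagreements together account for all 2,820 pairwise comparisons, so the two percentages sum to 100%.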

It was not possible to determine whether the degree of papilloedema influenced the accuracy of allocating a Frisén grade, as the majority of optic discs evaluated in this study had mild to moderate papilloedema (Frisén grades 1 or 2) and few discs had severe papilloedema (Frisén grade 5). A cross-tabulation of the allocation of Frisén grades by each of the pairs of reviewers is shown in Supplementary Table 1.

Disc ranking

Complete agreement amongst all six reviewers in ranking pairs of photographs (agreement in identifying the optic disc with less papilloedema) was observed in 45.2% of the comparisons (42 of 93).

Agreement in the disc ranking amongst pairs of reviewers was evaluated (Table 3). The probability of agreement in the disc ranking was 70.0%. Rarely (4.1% of comparisons) did the reviewers completely disagree as to which of a pair of optic disc photographs demonstrated less papilloedema (i.e. one reviewer choosing a photograph as demonstrating less papilloedema and the other reviewer selecting the same photograph as demonstrating more papilloedema). More frequently (25.0% of comparisons), one reviewer would identify a difference in a pair of disc photographs and the other reviewer would judge that there was no difference in the same pair of photographs. Reviewers were significantly more likely to disagree on the disc ranking in patients with chronic IIH than in those with acute IIH (Kendall’s tau-b statistic 0.380, p < 0.001).

Table 3 Observer variability in disc ranking showing the agreement between pairs of reviewers in ranking the optic discs from patients with acute and chronic IIH

Agreement between Frisén grading and disc ranking

The agreement between the Frisén grading and disc ranking was assessed to evaluate the probability of the two methods agreeing as to which disc photograph demonstrated less papilloedema. The probability of the two methods agreeing was 77.8%, and the probability of the two methods differing was 22.2% (Fig. 1b). Most disagreements were minor (22% of comparisons), that is when a pair of photographs was considered to be different by disc ranking but the same by Frisén grading. Major disagreements were rare (0.2%), that is when one of a pair of photographs was considered to show less papilloedema by disc ranking but more severe papilloedema by Frisén grading. Disagreement between the disc rank and Frisén grade was more frequent amongst patients with chronic IIH than among those with acute IIH (28.2% and 17.3%, respectively; Kendall’s tau-b statistic 0.124, p = 0.003; Fig. 1b).

Comparison of sensitivity of Frisén grading and disc ranking

Frisén grading and disc ranking were compared to determine which system more frequently differentiated the appearance of pairs of optic disc photographs in patients with IIH. Pairs of photographs from patients with acute IIH were more frequently judged to be different than those from patients with chronic IIH for both Frisén grading (acute 72.7%, chronic 31.1%; Kendall’s tau-b statistic −0.416, p < 0.001) and disc ranking (acute 89.7%, chronic 58.9%; Kendall’s tau-b statistic −0.356, p < 0.001; Fig. 1c).
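To show how a tau-b value of this kind arises from a 2 × 2 table, the sketch below computes Kendall’s tau-b from a contingency table in plain Python. The cell counts are not reported in the text: they are back-calculated here from the quoted percentages, assuming 25 × 2 × 6 = 300 evaluations of acute pairs and 22 × 2 × 6 = 264 of chronic pairs, and assuming that the one omitted ranking belonged to a chronic pair. They should be read as illustrative reconstructions rather than study data.

```python
from math import sqrt


def kendall_tau_b(table):
    """Kendall's tau-b for a contingency table with ordered rows and columns."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        discordant += table[i][j] * table[k][m]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Number of observation pairs that are not tied on each variable.
    untied_rows = (n * n - sum(r * r for r in row_totals)) / 2
    untied_cols = (n * n - sum(c * c for c in col_totals)) / 2
    return (concordant - discordant) / sqrt(untied_rows * untied_cols)


# Rows: acute, chronic. Columns: judged same, judged different.
# ASSUMED counts, reconstructed from the reported percentages.
frisen_table = [[82, 218], [182, 82]]    # 72.7% and 31.1% judged different
ranking_table = [[31, 269], [108, 155]]  # 89.7% and 58.9% judged different

tau_frisen = kendall_tau_b(frisen_table)
tau_ranking = kendall_tau_b(ranking_table)
```

Under these assumptions the function gives approximately −0.416 for Frisén grading and −0.356 for disc ranking, in line with the reported statistics; the negative sign reflects the coding, with chronic disease (the higher row) associated with fewer pairs judged different.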

A difference in pairs of photographs was significantly more likely to be identified by disc ranking than by Frisén grading (75.3% vs. 53.2%; p < 0.001, McNemar’s test). The improved sensitivity of disc ranking compared to Frisén grading applied to pairs of photographs both from patients with acute and from those with chronic IIH. Consequently, despite the limited ability of both disc ranking and Frisén grading to differentiate pairs of photographs from patients with chronic IIH, disc ranking was significantly more likely to identify a difference in the pairs of photographs than Frisén grading (58.9% vs. 31.1%; p < 0.001, McNemar’s test; Fig. 1c).

Discussion

This is the first study to formally evaluate the use of Frisén grading to monitor changes in the appearance of the optic disc in IIH. The original Frisén classification was published together with fundus photographs (12 normal and 66 swollen discs of various aetiologies) classified by three reviewers who agreed in 49% of cases [2]. It is unclear if any photographs from patients with IIH were included. In this study, six observers each evaluated 188 optic disc photographs from patients with IIH. We evaluated the reproducibility and ability of the Frisén classification to discriminate serially measured optic discs of patients with IIH. Additionally we compared Frisén grading with a system of disc ranking.

Complete agreement amongst the six observers was noted in just 1.6% of discs assessed by Frisén grading compared with 45.2% of discs assessed by disc ranking. Similarly, the probability of pairs of reviewers agreeing on the disc classification was much lower for Frisén grading than for disc ranking (36.1% vs. 70.0%). Additionally, our analysis of Frisén grading in IIH by pairs of reviewers demonstrated lower reproducibility than that noted in the original description of Frisén grading, despite the original validation being carried out by three reviewers (36.1% vs. 49%) [2]. Thus, Frisén grading has poor inter-rater reproducibility in IIH.

The sensitivity of Frisén grading to differentiate pairs of disc photographs was only 53.2% and lower than that of disc ranking (75.3%). With regard to the likelihood of agreement between the two methods, Frisén grading and disc ranking disagreed in 22.2% of evaluations. In 22.0% of evaluations this was due to Frisén grading allocating identical grades to the discs whilst disc ranking noted a difference in the disc appearances.

Both systems of disc classification performed better in differentiating fundus images from patients with acute IIH than those from patients with chronic IIH. This probably reflects the more dramatic changes demonstrated in the discs of patients with acute disease. This might be expected, as the fundi of patients with acute IIH are likely to undergo change more rapidly than during the chronic phase. In addition, the pairs of disc photographs from patients with acute IIH were obtained 12 months apart compared to just 3 months apart for the pairs of photographs from patients with chronic IIH. Minimal differences in the appearance of the discs, which may have been more difficult to interpret, are likely to explain why reviewer agreement in disc ranking was significantly lower in pairs of photographs from patients with chronic rather than acute IIH (52.2% vs. 87.3%, p < 0.001). A difference in reviewer agreement between pairs of photographs from patients with acute and chronic IIH was not noted for Frisén grading. This probably reflects the limited discriminative ability of the classification, whereby discs with minimal differences in appearance were allocated the same grade. Therefore in the patients with chronic IIH, in whom changes in the appearance of the optic disc are likely to have been minimal, more raters were able to notice a difference in the appearance of the disc using disc ranking than Frisén grading (ranking 58.9% vs. Frisén grading 31.1%, p < 0.001). In addition, disc ranking was more reproducible (ranking 52.2% vs. Frisén grading 35.9%). This would suggest that in both a clinical and a research environment, disc ranking is a more useful tool than Frisén grading for monitoring changes in optic discs over time. We would suggest, however, that Frisén grading remains the tool of choice to characterize the degree of papilloedema, although this should be interpreted with some caution as inter-rater reproducibility was limited.

There are a number of limitations to this study. Firstly, as observers performed the assessment of the discs on a single occasion, intrarater reproducibility could not be measured. This would be of considerable interest in future studies. Secondly, the effect of the number of classification options within each grading system (six for Frisén grading, but only three for disc ranking) could have influenced the results. In practice, however, this effect was limited due to the majority of fundi being of Frisén grades 0, 1 or 2 (74.7%) restricting the majority of selections to three. Also, the p values quoted assume there is no correlation between observations for the two eyes of the same patient, or between observations of the same patient by different reviewers. However, with the relatively small degree of correlation observed in this study, all significant results would remain significant at the 5% level if the p values were adjusted for this correlation (data not shown). We also acknowledge that both eyes from each patient were evaluated; however, we feel that any potential bias from this method of analysis was negligible as the observers were not able to link the right and left eyes of individual patients when assessing the 188 disc photographs. Finally, the images obtained were two-dimensional, although this was also the case in the original description and validation of the Frisén classification [2]. In clinical practice, the use of a slit lamp to obtain a highly magnified three-dimensional view of the retina allows additional fundus features to be appreciated, which may influence the classification of the optic disc. Our study did not, nor did it intend to, assess the validity of the Frisén classification in classifying three-dimensional fundus images. Three-dimensional fundus images obtained from slit lamp examination are of limited use in clinical trials due to the need for blinding the observer to the patient’s identity. Stereoscopic retinal photography may, however, have a role in documenting the three-dimensional appearance of discs in IIH, and future analysis of the Frisén classification in this situation would be of interest. Studies are also required to establish the optimal number of reviewers necessary to meaningfully classify discs.

The Frisén classification is frequently employed in IIH clinics and research [3–5, 9, 10] to describe and monitor papilloedema, yet our study highlights its limited reproducibility and sensitivity in the evaluation of optic disc changes. Newer, more objective measures of papilloedema, such as ultrasound scanning and optical coherence tomography, are gaining popularity but are not yet widely available [8]. We suggest that the Frisén classification requires modification when applied to patients with IIH so as to reflect resolving papilloedema, optic atrophy and other more subtle changes noted on clinical examination. Until such a scheme becomes available and has been validated, it would appear that there is a role for simple ranking of discs in IIH. This method has been shown in this study to exhibit superior interobserver reproducibility and sensitivity to change, especially in patients with more longstanding disease.