Introduction

Since the House-Brackmann facial nerve grading system (HBGS) was introduced in 1983 and endorsed by the Facial Nerve Disorders Committee of the American Academy of Otolaryngology-Head and Neck Surgery in 1984 as the standard for reporting facial nerve function, there have been ongoing controversies about its suitability [11, 12]. Samii [25] and Kanzaki [15] reported that the HBGS is widely validated and generally accepted, and grading of facial nerve function is possible with only small interobserver variability [25], whereas other studies suggested only fair to moderate levels of interobserver agreement for precise functional outcome assessment [1, 2, 7, 29].

Multiple facial nerve grading systems were developed to improve the reliability and the interobserver agreement, such as the HBGS and its modifications [8, 11, 15], the Nottingham [20], the Sydney [5], the Sunnybrook Facial Grading Systems [24] and the MoReSS (standing for movement, rest, secondary defects and subjective scoring) [7].

However, it has been difficult to reach consensus at the Acoustic Neuroma Conferences since 1991 [15].

Facial nerve function serves as a quality control for the treatment modalities of vestibular schwannomas. Given the controversies about the HBGS and the wide availability of high-resolution cameras and modern video-analysis techniques, it is questionable whether subjective clinical assessments are still up to date.

Despite preliminary publications [5, 9, 13, 17], this is the first report investigating the interobserver variability of the HBGS at defined pre- and postoperative time points by four independent investigators in a dedicated group of patients of a previously published multi-center trial [26].

Methods

The primary aim of the underlying multi-center trial was to investigate the efficacy and safety of prophylactic parenteral nimodipine and hydroxyethyl starch (HES) treatment in vestibular schwannoma (VS) surgery [26]. For the present study, the treatment and control groups were pooled. Facial nerve function was investigated at defined time points (preoperatively, during the in-patient stay and 1 year after surgery) and documented photographically at rest and in motion as described by House and Brackmann [12]. Photographs were evaluated by the investigator of the multi-center trial and additionally by three blinded reviewers experienced in facial nerve disorders (neurologist, ENT, neurosurgeon) and classified using the HBGS. Interobserver variability was investigated if data from at least two reviewers were present.

All patients underwent resection via a retrosigmoid approach. In all cases, intraoperative neurophysiological monitoring including brainstem auditory evoked potentials (BAEP), continuous facial nerve electromyography (EMG) and direct facial nerve stimulation was applied, and the diagnosis of a schwannoma was confirmed histopathologically.

Results

Participant flow

A total of 112 patients were enrolled in the multicenter phase III trial [26]. There were nine dropouts before preoperative data acquisition. In one patient data concerning the preoperative facial nerve function was missing. Therefore, 102 patients were suitable for further investigation (Fig. 1).

Fig. 1 Participant flow diagram and preoperative assessment of facial nerve function

Preoperative function of the facial nerve was investigated by the four investigators in all 102 cases.

In the early postoperative course, 14 patients could not be analyzed because of missing photographs (n = 11) or assessment by only one investigator (n = 3) (Fig. 2). Therefore, 89 patients were suitable for further investigation in the early postoperative course. In 88 of these patients facial nerve function was rated by four and in one case by three investigators.

Fig. 2 Participant flow diagram and assessment of facial nerve function in the early postoperative course

Facial nerve function 1 year after surgery was assessed in 103 patients. Photographs were classified by only one investigator in four cases and consequently were excluded (Fig. 3). The photographs of the resulting 99 patients were rated by 4 investigators in 92 patients, by 3 investigators in 4 patients and by 2 investigators in 3 patients.

Fig. 3 Participant flow diagram and assessment of facial nerve function 1 year after surgery

Baseline data

The mean age of the patients was 49 years; 56% were female. Tumors were predominantly medium to large. Preoperative hearing was predominantly useful (Gardner-Robertson grade 1 or 2 in 63%). All patients had normal (97%) or only mildly impaired (3%) facial nerve function (Table 1).

Table 1 Baseline data

Assessment of facial nerve function

Interobserver variability differed considerably across the three time points, depending on the severity of facial nerve paresis. Preoperative facial nerve function, which was normal or only mildly impaired (HB grade I or II), was rated consistently in 97% of the patients (Fig. 1). Facial nerve function that had deteriorated in the early postoperative course was documented without dissent in only 36% of the patients, with a one-grade difference in 45%, a two-grade difference in 17% and a three-grade difference in 2% (Fig. 2). Within 1 year after surgery, facial nerve function had improved in most cases, resulting in a consistent assessment in 66%. Differing ratings were observed in 34% of the patients; of these, the deviation was one grade in 88% and two grades in 12% (Fig. 3).
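The grade differences reported here can be tabulated along the following lines. This is a minimal sketch, not the procedure used in the trial: it assumes that the "grade difference" of a patient is the gap between the highest and lowest HB grade assigned by the available reviewers, and the function names and example ratings are purely illustrative.

```python
from collections import Counter

def grade_spread(ratings):
    """Gap between the highest and lowest HB grade given to one patient.

    `ratings` holds one entry per reviewer; None marks a missing rating.
    Assumption: "grade difference" means max grade minus min grade.
    """
    present = [r for r in ratings if r is not None]
    if len(present) < 2:
        return None  # fewer than two reviewers: excluded, as in the Methods
    return max(present) - min(present)

def spread_distribution(patients):
    """Share of analyzable patients for each observed grade spread."""
    spreads = [grade_spread(p) for p in patients]
    counts = Counter(s for s in spreads if s is not None)
    total = sum(counts.values())
    return {s: counts[s] / total for s in sorted(counts)}

# Hypothetical ratings (HB grades 1-6), not data from the study.
example = [
    [1, 1, 1, 1],           # full agreement
    [2, 3, 2, 2],           # one-grade difference
    [3, 5, 4, 3],           # two-grade difference
    [2, None, None, None],  # only one reviewer: excluded
    [4, 4, None, 4],        # three reviewers, full agreement
]
print(spread_distribution(example))
```

A spread of 0 corresponds to an assessment without dissent; patients rated by a single reviewer drop out of the denominator.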

The severity of facial nerve paresis differed significantly between the early postoperative course, with a mean value of 2.4 [standard error of estimate (SE): 0.1, mean of all rated HB grades], and 1 year after surgery, with a mean value of 1.5 (SE: 0.1) (p < 0.001, t-test for dependent samples). Interobserver reliability also differed significantly between the early postoperative course and 1 year after surgery (p < 0.001, McNemar test).
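A McNemar test compares a paired yes/no outcome, here whether the reviewers agreed on a patient's grade at each of the two time points, and uses only the discordant pairs. The sketch below shows the exact (binomial) two-sided variant; whether the study used the exact or the chi-square form is not stated, and the function name and counts are illustrative assumptions.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: patients with agreement at the first time point only
    c: patients with agreement at the second time point only
    Under the null hypothesis, each discordant pair falls on either
    side with probability 0.5 (binomial distribution).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts, not data from the study.
p_value = mcnemar_exact(1, 9)
print(f"p = {p_value:.4f}")
```

Patients with the same agreement status at both time points do not enter the calculation.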

Patients whose ratings differed by two or more grades (n = 17) showed considerably worse facial nerve function (p < 0.001, Mann-Whitney U-test). In these patients the mean value of postoperative facial nerve paresis was 3.6 (SE: 0.2). Photographs of these 17 patients were assessed by 4 investigators. In contrast, the mean value of facial nerve paresis was 2.1 (SE: 0.2) in the 72 patients whose ratings were identical or differed by only one grade.

There was no tendency toward better ratings within the rater group of the operating neurosurgeons.

Discussion

This is the first study to show that there is a correlation between the severity of postoperative facial nerve paresis and the extent of interobserver variability using the HBGS. There were considerable limitations regarding interobserver agreement, especially in patients with moderate and severe facial nerve paresis. Former studies reported only interrater agreements or interobserver variabilities of the HBGS without considering the degree of facial nerve paresis [5, 9, 13, 17]. The investigators of this study had acquired at least 15 years’ experience in the assessment of facial nerve disorders. They are specialists in ENT, neurology and neurosurgery. Therefore, the study design (competence of the raters and the use of photographs) did not negatively influence the measure of agreement. In particular, there was no consistent tendency for a better rating by the operating center. This suggests an acceptable quality of monocentric studies reporting facial nerve function following surgical and non-surgical procedures.

Video clips (instead of photographs) of patients at rest and with a series of facial expressions may improve the agreement between different observers rating facial nerve function. However, Gordon et al. showed that photographs with standardized facial expressions are a reliable outcome measure for determining facial nerve function following vestibular schwannoma surgery [9].

A facial nerve grading system should report the clinical assessment as objectively as possible and reflect signs of recovery or changes in function following therapeutic intervention [8]. In principle, two approaches have been described: quantitative technology-based systems and subjective clinical assessments.

However, time restrictions in a purely clinical setting and the need for additional specialized equipment might be arguments in favor of the use of simple subjective standard rating scales. Even though it has been criticized as not being sufficiently sensitive to changes [20, 23] and for its interobserver variability [5, 6, 16, 20], the HBGS is the commonly used system and was adopted by the American Academy of Otolaryngology-Head and Neck Surgery as the standard for grading facial nerve recovery in 1984 [12]. Measurement of facial nerve function can be performed within a few minutes and does not require any equipment. However, there are problems with the noninclusion of synkinesis phenomena [10] and with the assessment of facial nerve function as HB grade III, because parasympathetic (“dry eye”) and intermediate nerve functions are not addressed in the classification [15].

To improve the interobserver agreement of subjective clinical classification, further developments were proposed. The HBGS was modified at the Tokyo consensus meeting in 2001 [15] and the Facial Nerve Grading Scale 2.0 was developed [8]. In many studies [1, 7, 29], the HBGS showed only a fair to moderate level of agreement (according to the Landis and Koch guidelines: <0, no agreement; 0–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1, almost perfect agreement) [17], which was in contrast to data of former validation studies (up to 0.77) [11, 23]. Burres and Fisch [3] suggested that assessment of facial nerve function should focus on motor function. The “Rough” facial nerve grading system is also based on facial motor function [2]. In 1994, the Nottingham System suggested the objective measurement of three facial expressions [20]. In 1995 and 1996, two similar objective scales were published: the Sydney System [4, 5] and the Sunnybrook Facial Grading System [24]. A comparative study between the Sunnybrook and HB facial grading scales investigating the repeatability and agreement demonstrated the superiority of the Sunnybrook classification [13].
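The Landis and Koch bands quoted above apply to kappa-type agreement coefficients. The following minimal sketch computes Cohen's kappa for two raters and maps it to those bands. It is an illustration only: the cited studies may have used other coefficients (for example, weighted or multi-rater kappa), and the function names and ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters grading the same patients."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed proportion of identical grades.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's grade frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used one and the same grade throughout
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Verbal label for a kappa value, using the bands quoted in the text."""
    if kappa < 0:
        return "no agreement"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical HB grades from two raters, not data from the study.
rater_1 = [1, 1, 2, 2, 3, 3, 4, 4]
rater_2 = [1, 1, 2, 3, 3, 3, 4, 5]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")
```

Unweighted kappa treats a one-grade and a three-grade disagreement alike; for an ordinal scale such as the HBGS, a weighted variant may be preferable.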

However, the pitfall of all subjective classifications of facial nerve function is only a fair to moderate interobserver agreement, which lies in their very nature and cannot be improved by more sophisticated measurement methods. Quite in line with this, a comparison of the HBGS with other subjective grading systems showed no significant improvement in interobserver reliability [28].

Prospective computerized pixel change analysis [19, 21, 27] and facogram-based objective assessment of facial nerve function showed promising results and may eliminate clinicians’ subjectivity to provide standardized measurement in the future [16, 18, 22].

Another method, the so-called “Moiré topography,” uses special cameras to measure facial contours. Investigations in 51 patients with facial palsy and 10 healthy volunteers showed a high correlation between the moiré indexes and the HBGS [30]. Neely et al. reported excellent test-retest reliability in 30 patients with a wide range of facial nerve paresis using image subtraction techniques on digital video recordings [21]. In the past, the lack of technical equipment was the greatest problem for computer analysis of facial nerve function [14]. In the future, easy access to high-resolution cameras and the development of evaluation software may provide objective assessment of facial nerve function in the clinical routine.

Conclusions

Further improvement of the assessment of facial nerve function will require an objective rating scale using modern techniques (video documentation and subsequent video analysis with motion analysis software), which nowadays may be easily applicable without involving extensive costs.