Introduction

Careful examination of the pupil by a trained practitioner is an established element of the neurologic examination [1–4]. Under normal conditions, the pupils are equal in both size and reactivity to light [5]. The detection of a non-reactive pupil in a patient with an acute neurological disease is considered an event of vital importance and often triggers a variety of diagnostic (e.g., stat brain CT) and therapeutic (e.g., mannitol infusion) maneuvers. Numerous clinicians have treated pupil size and reactivity as notable inputs to multi-parameter predictive models of outcome and have used such models to detect and localize intracranial mass lesions [6–9]. In the management and prognosis of traumatic brain injury (TBI), for example, abnormalities of pupillary response or anisocoria (pupil size asymmetry) have been associated with neurologic deterioration and secondary brain injury and are correlated with poor neurologic outcomes [10, 11]. The literature is replete with triage, prognostication, and treatment algorithms based in whole or in part on pupillary abnormality [12–17].

The traditional and most common method of pupillary assessment is performed by a practitioner using a handheld light source [2]. Automated pupillary assessment is now commercially available and is being used for patients hospitalized with neurologic injury [18, 19]. For example, the NeurOptics® NPi™-100 Pupillometer (Neuroptics Inc., Irvine, CA, USA) is a portable handheld device that provides automated readings of one pupil at a time. Using this device, the observer targets the pupil, which is then measured for size using proprietary software. Next, a burst (0.8 s) of light is emitted at a fixed distance from the patient, and a high-speed video recording of the pupillary response is used to measure the maximum pupil size, minimum pupil size, constriction velocity, and neurological pupil index (NPi™) [20]. Automated pupillometry may reduce observer bias and provide insight into intracranial pathology [21, 22]. However, three reports have found limitations in pupillometry: a case study [23], a study of healthy volunteers receiving phenylephrine versus sterile water [24], and an interrater reliability study that included only nurse examiners [25]. This study fills an important gap by examining the interrater reliability of subjective pupillary assessments in a diverse group of practitioners and by comparing subjective and objective pupillary assessments.

Methods

This prospective, blinded, observational study was approved by the Institutional Review Board at the University of Texas Southwestern Medical Center and is registered with ClinicalTrials.gov (NCT02296606). Only patients with a neurological or neurosurgical diagnosis and pre-existing orders for serial pupil/neurologic exams were approached for consent. Hospital practitioners, both physicians (MDs) and nurses (RNs), were informed via email and at meetings that they would be asked to participate as assessors. Only practitioners who would routinely perform pupil assessments were invited to participate (e.g., neurologist, neurosurgeon, or neuroscience RN). Each study subject was examined by a convenience pairing of practitioners so that study observations could be obtained without interrupting daily routines.

Following consent, pupil exams were performed and independently scored by two practitioners within a 5-min time window. Each observation required each practitioner to assess both pupils (OU). Thus, each observation provided three sets of scores (two practitioner scores and one device score) for the size, shape (traditional assessments only), and reactivity of the pupils. Care was taken to ensure that the ambient lighting and the physiological state of the patient were the same for both measurements. Each practitioner used a flashlight or penlight to observe and score the left (OS) and right (OD) pupil for initial size (in mm), shape (round or irregular), and reactivity (brisk [normal], sluggish [abnormal], or non-reactive [fixed]). To ensure that the study environment was representative of practice, practitioners were permitted to use the light source they would routinely use when performing a pupil exam.
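The scoring scheme described above can be pictured as a small record per eye. The following is only an illustrative sketch: the class and field names (`EyeScore`, `Observation`, `size_mm`, and so on) are hypothetical and do not come from the study's data collection forms, and expressing the device reading in the same reactivity categories as the practitioners is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

# Categories taken from the text; all names below are hypothetical.
REACTIVITY_LEVELS = ("brisk", "sluggish", "fixed")
SHAPES = ("round", "irregular")
EYES = ("OS", "OD")  # OS = left eye, OD = right eye


@dataclass(frozen=True)
class EyeScore:
    """One rater's score for a single eye."""
    size_mm: float
    reactivity: str
    shape: Optional[str] = None  # shape was scored in traditional exams only

    def __post_init__(self):
        if self.size_mm <= 0:
            raise ValueError("pupil size must be positive")
        if self.reactivity not in REACTIVITY_LEVELS:
            raise ValueError(f"unknown reactivity: {self.reactivity!r}")
        if self.shape is not None and self.shape not in SHAPES:
            raise ValueError(f"unknown shape: {self.shape!r}")


@dataclass(frozen=True)
class Observation:
    """Three sets of scores for one eye: two practitioners and one device."""
    eye: str
    practitioner_1: EyeScore
    practitioner_2: EyeScore
    device: Optional[EyeScore] = None  # None if no device reading was obtained

    def __post_init__(self):
        if self.eye not in EYES:
            raise ValueError(f"eye must be one of {EYES}")


example = Observation(
    eye="OS",
    practitioner_1=EyeScore(3.0, "brisk", "round"),
    practitioner_2=EyeScore(4.0, "sluggish", "round"),
    device=EyeScore(3.4, "brisk"),
)
```

Keeping the device score optional mirrors the fact that some pupillometer readings could not be obtained.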

After practitioners had completed the subjective pupil assessment, a member of the study team obtained OS and OD pupillometer readings using the automated pupillometer (NPi™-100) within 5 min of the practitioner exams, again ensuring similar lighting and physiological conditions. The pupillometer was set to ‘research mode’ so that practitioners and study team members were blinded to the actual size and reactivity measures obtained by the pupillometer.

Participants

Study subjects and practitioners were from two hospitals (a county teaching hospital and a university hospital) and from four units, including two neurocritical care units, a stroke unit, and a general neurology ward. One hundred and twenty-seven patients participated as study subjects (Table 1). Study subjects were predominantly male (56.7 %), Caucasian (88.1 %), and did not require surgical intervention (55.1 %). Nine (7.1 %) subjects had a prior ocular history that may have impacted pupillary exams (physical eye injury, cataract, glaucoma, or prosthetic implant). Study subjects were observed an average of 8.8 times (IQR = 3–9) during the study. Two hundred and twenty-two practitioners (RN = 194, MD = 28) participated in the study and performed traditional pupil assessments. The automated pupillometer (NPi™-100) assessments were performed by three trained research investigators.

Table 1 Demographics for patient-subjects

Statistical Analysis

Statistical analyses were performed using SAS v 9.3 for Windows. Cohen’s Kappa coefficient (k), a measure of interrater agreement, was calculated for pupil size, shape, and reactivity, first as a composite score and then as separate scores [26]. Kappa requires a square contingency table, that is, the same number of rows and columns. Therefore, where needed, null cells were inserted and assigned a weight of 0.000001 for interrater reliability [27]. Bland–Altman plots were constructed to examine measurement differences between human observers and pupillometer readings; regression lines were then fitted to examine for bias. Subjective pupil size estimates were scored as equal if the difference between the two raters was ≤1.0 mm and were dichotomized as ≤3 versus >3 mm. Kappa values were interpreted as slight (0–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), or almost perfect (0.81–1.0) [28].
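For readers unfamiliar with the statistic, a minimal Python sketch of unweighted Cohen's kappa and the interpretation bands quoted above follows. It is not the SAS procedure used in the study and does not reproduce the null-cell weighting. The function names and the example ratings are hypothetical. Because kappa is built from category counts for both raters, a category used by only one rater simply contributes zero to the chance-agreement term here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(k):
    """Descriptive band for kappa, using the cut points quoted in the text.

    Values below 0 are not covered by the quoted bands; labelling them
    'below chance' is an assumption of this sketch.
    """
    if k < 0:
        return "below chance"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical reactivity scores from two raters for six pupils.
rater_1 = ["brisk", "brisk", "fixed", "sluggish", "brisk", "fixed"]
rater_2 = ["brisk", "sluggish", "fixed", "sluggish", "brisk", "brisk"]
kappa = cohens_kappa(rater_1, rater_2)
label = interpret_kappa(kappa)
```

In this toy example the raters agree on four of six pupils, and chance agreement is 13/36, so kappa works out to 11/23 (about 0.48).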

Sample Size

The power calculation was performed using SAS v9.3, assuming K0 = 0.5, alpha = 0.05, and a power of 0.8 (β = 0.20). A targeted sample size of 1163 paired observations was based on the desire to remain conservative given the limited literature available from which to estimate the required sample size [20, 24, 29].

Results

There were a total of 1166 observations. Practitioners participated in an average of 6.5 observations (IQR 1–6). The results in Table 2 summarize findings from 2329 (1166 OS, 1163 OD) paired subjective pupillary assessments by practitioners, and 2192 (1099 OS, 1093 OD) automated pupillometer device assessments (3 OD observations were null for a patient with a prosthetic eye). For the sake of brevity, only the results of the OU examination for size, shape, and reactivity are presented in the text. The results of the separate OS and OD examinations were similar and are presented in Table 2. Practitioner agreement for the omnibus pupil assessment (size, shape, and reactivity) was low (k = 0.26, 95 % CI 0.23–0.29).

Table 2 Practitioner agreement for pupillary observations

Agreement on Pupil Size and Shape

Pupil size was dichotomized as being ≤3 versus >3 mm in size. Practitioner agreement on pupil size was moderate (k = 0.54; 95 % CI 0.50–0.57). Agreement with the device on pupil size was fair (k = 0.29; 95 % CI 0.27–0.32 and k = 0.31; 95 % CI 0.28–0.34 for the first and second practitioners, respectively). Figure 1 provides the full range of pupil size estimates for human and pupillometer observations. Anisocoria was scored as present if the difference between OS and OD was >1 mm. Practitioner agreement on anisocoria was moderate (k = 0.60; 95 % CI 0.54–0.64). Shape was scored as round or irregular. Agreement on pupil shape was moderate (k = 0.62; 95 % CI 0.55–0.69).
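The scoring rules used for size can be written down in a few lines. This is a sketch of the thresholds stated in the text (≤3 versus >3 mm, anisocoria when the OS to OD difference exceeds 1 mm, and sizes counted as equal when two raters differ by ≤1.0 mm). The function names are hypothetical.

```python
def size_category(size_mm):
    """Dichotomize pupil size as '<=3 mm' or '>3 mm'."""
    return "<=3 mm" if size_mm <= 3.0 else ">3 mm"


def has_anisocoria(os_mm, od_mm, threshold_mm=1.0):
    """Anisocoria is present only if the OS to OD difference exceeds the threshold."""
    return abs(os_mm - od_mm) > threshold_mm


def sizes_scored_equal(rater_a_mm, rater_b_mm, tolerance_mm=1.0):
    """Two raters' size estimates count as equal if they differ by <= tolerance."""
    return abs(rater_a_mm - rater_b_mm) <= tolerance_mm


# Hypothetical readings: a 1.0 mm difference is not anisocoria, 1.5 mm is.
borderline = has_anisocoria(3.0, 4.0)
clear_case = has_anisocoria(3.0, 4.5)
```

Note the asymmetry between the two rules: a difference of exactly 1.0 mm counts as rater agreement on size but does not count as anisocoria.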

Fig. 1 a Comparison of pupil size estimates by human observers and automated pupillometer. b Bland–Altman plots for difference in size (includes regression line with 95 % CI)

Agreement on Pupil Reactivity

Practitioner agreement for reactivity (reactive versus fixed) was moderate (k = 0.64; 95 % CI 0.58–0.71). Agreement with the automated pupillometry device on pupil reactivity (dichotomized as reactive or fixed) was moderate (k = 0.52; 95 % CI 0.44–0.60) for the first practitioner and fair (k = 0.40; 95 % CI 0.32–0.49) for the second practitioner. There were 189 practitioner observations of a fixed pupil. Of these, 94/189 (49.7 %) were scored as fixed by both practitioners and 63/189 (33.3 %) were scored as fixed by pupillometry. Practitioner agreement on pupil reactivity scored as fixed (non-reactive), sluggish, or brisk was fair (k = 0.40; 95 % CI 0.36–0.44). There were 83 observations of a non-reactive pupil as scored by the pupillometer. Of these, the first practitioner also scored the pupil as non-reactive in 58/83 (69.9 %) observations and the second practitioner scored the pupil as non-reactive in 46/83 (55.4 %) observations. When the sample was confined to only observations of fixed (non-reactive) pupils, practitioner agreement on reactivity was only fair [k = 0.28 (OS); k = 0.47 (OD)]. Notably, there were 21 observations where the pupillometer could not be used to determine whether the pupil was reactive (e.g., periorbital edema prevented the examiner from fully visualizing the pupil when using the pupillometer). Of 28 OS and 17 OD observations of a dilated pupil (>6.0 mm), practitioner agreement on reactivity was moderate or high [k = 0.78 (OS); k = 0.88 (OD)]. Of 407 OS and 450 OD observations of a small pupil (<3.0 mm), practitioner agreement on reactivity was fair [k = 0.59 (OS); k = 0.23 (OD)].

Agreement for RN and MD Practitioners

Variability within and between practitioners with different training was explored for four subsets: (1) the entire cohort, (2) both practitioners were RNs, (3) both practitioners were MDs, and (4) one RN and one MD performed observations (Table 2). Agreement for pupil size (≤3 vs. >3 mm) was similar for all four subsets (k = 0.54, 0.53, 0.63, and 0.54, respectively). Agreement for pupil reactivity (fixed versus reactive) was similar for all four subsets (k = 0.64, 0.67, 0.55, and 0.54, respectively). Agreement for pupil size (≤3 vs. >3 mm) was fair (k = 0.30; 95 % CI 0.27–0.32) between RNs and the device, and fair (k = 0.38; 95 % CI 0.31–0.45) between MDs and the device. Agreement for pupil reactivity (fixed versus reactive) was moderate (k = 0.47; 95 % CI 0.40–0.53) between RNs and the automated device, and moderate (k = 0.42; 95 % CI 0.22–0.61) between MDs and the automated device.

Discussion

This study explored interrater reliability of the traditional pupil exams performed by two independent practitioners, and the relationship between manual examinations and automated pupillometer results. The finding of limited interrater reliability for the size, shape, and reactivity scores between two practitioners confirms the findings from a smaller study examining traditional pupil exams from six practitioners and 20 patients. The high percent of agreement seen in Table 2 likely reflects the high number of normal pupillary findings. The majority of pupil exams were documented as round (96.9 %), briskly reactive (79.0 %), and 3–4 mm (50.9 %) in size.

Cohen’s Kappa is more robust than raw percent agreement when one option (e.g., reactive) is far more common, so that agreement can arise by chance or guessing. Lower kappa values likely reflect the lower rates of agreement seen when the pupil finding is abnormal [30]. This is demonstrated by the data presented in Table 3, where reactivity is dichotomized. The raw percent agreement for the expected outcome (reactive) is 95.7 % (2135/2230). The raw percent agreement for the unexpected outcome (fixed) is 49.7 % (94/189). Thus, kappa provides an omnibus index of the level of agreement.
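The relationship between these raw percentages and kappa can be illustrated with a short sketch built from the counts quoted above. The concordant counts (2135 reactive, 94 fixed) and the 95 discordant pairs implied by the denominators (2230 − 2135 = 189 − 94 = 95) come from the text. How those 95 discordant pairs divide between the two practitioners is not stated, so the 47/48 split below is purely an assumption, as are the function and variable names.

```python
def summarize_two_by_two(both_reactive, both_fixed, only_a_fixed, only_b_fixed):
    """Category-specific agreement and Cohen's kappa from a 2x2 table of counts."""
    n = both_reactive + both_fixed + only_a_fixed + only_b_fixed
    if n == 0:
        raise ValueError("table is empty")
    discordant = only_a_fixed + only_b_fixed
    overall = (both_reactive + both_fixed) / n
    # Agreement among pairs in which at least one rater chose the category.
    reactive_agreement = both_reactive / (both_reactive + discordant)
    fixed_agreement = both_fixed / (both_fixed + discordant)
    # Chance agreement from each rater's marginal totals.
    a_fixed = both_fixed + only_a_fixed
    b_fixed = both_fixed + only_b_fixed
    expected = (a_fixed * b_fixed + (n - a_fixed) * (n - b_fixed)) / (n * n)
    kappa = (overall - expected) / (1.0 - expected)
    return {
        "overall": overall,
        "reactive_agreement": reactive_agreement,
        "fixed_agreement": fixed_agreement,
        "kappa": kappa,
    }


summary = summarize_two_by_two(
    both_reactive=2135,  # from the text
    both_fixed=94,       # from the text
    only_a_fixed=47,     # ASSUMED split of the 95 discordant pairs
    only_b_fixed=48,     # ASSUMED split of the 95 discordant pairs
)
```

Under this assumed split, overall agreement is about 96 %, yet kappa comes out at roughly 0.64, because nearly all of the overall agreement is on the common "reactive" category while agreement on "fixed" is only about 50 %.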

Table 3 Comparison of manual observations between two observers with pupil reactivity scored as reactive or fixed

Although there was high agreement that a pupil was reactive (95.7 % or 2135/2230), there was only 49.7 % (94/189) agreement that the pupil was fixed. These data suggest that practitioners are generally in agreement that a pupil is normal, but often disagree when one practitioner reports an abnormal finding. Agreement on reactivity was highest when the pupil was >6.0 mm in size (91.1 % [41/45]). Automated pupillometry provides a fixed light source, for a fixed period of time, and then provides an objective measure of the change in pupil size [31]. Subjective pupil examination by a human observer is inadequate, especially when the findings are clinically relevant. The low agreement on a fixed pupil has specific clinical implications given that a physician may decide to obtain diagnostic imaging (brain CT), or treat intracranial pressure (e.g., mannitol), based on the finding of a fixed or non-reactive pupil.

Two abnormal findings, anisocoria and non-reactive pupils, were explored in depth. There is disagreement on the definition of anisocoria, with thresholds ranging from a pupil size difference of 0.25 mm to >1.0 mm [1, 3, 32]. To remain conservative, we scored anisocoria only if the OS to OD difference was >1.0 mm. The low agreement on anisocoria between practitioner and device assessments (k = 0.14) likely reflects the smaller number of anisocoria events noted in device readings (153/1166) compared to the number noted by the first (450/1161) or second practitioner (445/1159). It is noteworthy that anisocoria is a dynamic phenomenon and the low agreement may be attributable to temporal changes rather than actual state changes.

The absence of a pupillary light reflex (fixed or non-reactive pupil) as a new finding in a hospitalized patient may signal compression of the oculomotor nerve or distortion of the midbrain and warrants emergency assessment and treatment. When pupil exam findings were dichotomized as reactive versus non-reactive, there was moderate agreement among the entire cohort (k = 0.64). It is unknown whether the finding that practitioners scored 189 pupils as non-reactive, compared to 83 scored as non-reactive by pupillometry, reflects an inability of the human eye to perceive slight or slow movement. On the other hand, Kramer et al. [23] recently reported a case study wherein a neurologist who observed the pupil for 7–9 s was able to detect a 1 mm size change that was undetectable by pupillometry. This suggests a potential limitation in examining interrater reliability across a diverse population (e.g., a neurologist may have greater expertise). However, the selection of a diverse group of practitioners was purposeful and strengthens the external validity of the study. Internal validity of this study could have been enhanced by limiting enrollment to only attending neurointensivists performing examinations on patients in primary position with standardized light sources and environmental conditions (background illumination). The diverse group of practitioners in this study includes registered nurses, nurse practitioners, neurologists, neurosurgeons, and resident physicians with varied experience, each performing the exam that they would normally perform. Thus, our sample more accurately reflects the population of practitioners who would perform patient pupil exams throughout the day, document the results of their exam, and compare those results against prior results. Therefore, the findings from this study would be expected to have high generalizability.

Practitioners were not completely blinded to the aims of the study, though they were blinded to the findings of the second practitioner and the automated pupillometer. They were aware that their results were being compared and that a pupillometer device was being used. Each practitioner examined and scored pupils as they normally would. The limitation of this approach is lower internal validity, but the more accurate depiction of true practice enhances external validity, and thus the results are more easily generalizable across practice settings. Figure 1 demonstrates heteroscedasticity of size estimates, with average size underestimated in smaller pupils and overestimated in larger pupils. It is arguable that practitioners may have tried harder than normal to obtain an accurate assessment (reflected as higher interrater reliability), or that knowing there was a second examiner may have led them to be less precise (reflected as lower interrater reliability) [20]. However, given that 222 practitioners performed 4658 pupil observations, it seems reasonable to reject excessive observer effect.

A final limitation is the number of times an automated pupillometer reading could not be obtained (5.9 %). Baseline data on pupil function, shape, or reactivity were not collected, nor were any data collected on whether patients had a history of glaucoma or iridectomy, and it is possible that this could have impacted agreement. Automated pupillometer readings were generally completed in less than 1 min, and research staff were instructed to make no more than three attempts to obtain any one reading. The most common reasons for inability to obtain readings included periorbital edema, patient movement (especially in patients with impaired cognition), cataract, and prosthetic eye (scored as reactive by three practitioners). Although the difference was not statistically significant, the number of missing pupillometer readings was higher during the first half of the study than during the second half, suggesting that there may be an operator learning curve.

Conclusion

There is low interrater reliability among diverse practitioners performing a manual pupillary exam. These findings confirm and extend prior work suggesting that there is inadequate agreement between practitioners who solely rely upon traditional pupillary assessments for patients with neurologic injury [33]. There is a need to standardize the assessment of pupillary function in order to provide higher reliability. The use of automated pupillometry could be considered as a mechanism to standardize practice when there is a need for accurate assessment of the pupil size and reactivity.