Surgical practice will improve the skills of surgeons as they gain experience during their training. However, assessment of technical skill during training is paramount to ensure adequate levels of competency [1]. Furthermore, it enables the structuring of training programs, which can be tailored through review of progress at regular intervals [2]. This allows competency-based progression and can ensure certification of competent surgeons through consistent, objective assessment, with broader implications for quality of patient care and patient safety.

Beyond the use in training, assessment of surgeons also has legal implications. The increased awareness of iatrogenic injury incidence and medical errors in the past few years [3] has at times resulted in the skill and competence of healthcare professionals being questioned. Surgeons are now encouraged to demonstrate their competency by means such as revalidation and through publication of patient-reported and clinical outcomes.

Currently, numerous assessment tools exist for the evaluation of surgical technical skill. One method of assessment is the use of global rating scales and checklists [4], such as the Objective Structured Assessment of Technical Skills (OSATS) [4, 5]. However, the OSATS scale is limited by the potentially subjective nature of rating scales and the need for a trained observer and second rater to reduce the risk of bias [6].

Attempts to automate and objectify assessment systems have led to the development of motion analysis devices, which utilise electromagnetic, mechanical or optical systems [5]. Systems such as the Imperial College Surgical Assessment Device (ICSAD) [4, 5] utilise sensors attached to the back of the surgeon’s hands, within a locally generated weak electromagnetic field, to measure metrics such as movement path length. This has been shown to be effective in laparoscopic surgery, but is not validated in open surgery [5]. These devices also attach onto the surgeon’s hands, which may interfere with natural movements; they raise obvious issues concerning sterility during live surgery and are only able to capture data at short ranges.

Eye tracking has been shown to provide objective measurement of surgical skill [7]. It involves analysing the movement of the eyes and the behaviour of the pupils using infrared cameras, and the objective nature of the data output eliminates the need for expert subjective opinion during evaluation, such as is required for OSATS.

Previous research has suggested that experts, compared to non-experts, have more focused attention and elaborate visual representation during performance of a task [7, 8]. In eye tracking, this can be represented by increased fixation rates and a higher proportion of fixations within an area of interest [7]. A higher proportion of fixations in a certain area of interest suggests greater focused attention to that particular area. Dwell time is the duration of stay in an area, and it has been found that more important areas resulted in longer dwell times [9].

Though it has been used in static representations of surgical environments, eye tracking has not been trialed or validated in live surgery. Previous studies have shown differences in gaze patterns between expert and junior surgeons in simulated environments on laparoscopic tasks.

The aim of the study was to assess differences in eye metrics between surgeons of differing levels of expertise in open surgery.

Methods

This study was reviewed and approved by the local research ethics committee, reference 13/LO/0119. It was conducted at a single academic surgical centre in London, UK.

Case selection

All day-case open inguinal hernia operations were considered for inclusion in the study, subject to researcher availability. Junior surgeons who were at least in their third year of specialty training and had carried out a minimum of 40 open inguinal hernia operations as the main surgeon were included, as were all attending surgeons. The amount of coaching during the procedure was minimal, in keeping with the level of experience; however, it was not specifically recorded for this study. An attending was available during all the procedures if the junior surgeons needed help or advice. Patients under the age of 16 or those who did not wish to participate in the study were excluded.

Informed consent was obtained from both patients and surgeons. Surgeons were fitted with calibrated eye-tracking glasses before the start of the procedure. Data were recorded throughout the surgery, from initial incision until final skin suture. Patient demographics were recorded from patient medical records. A researcher present during the operations recorded extraneous distractions so that they could be omitted during analysis.

Apparatus

SensoMotoric Instruments (SMI, Berlin, Germany) eye-tracking glasses were used. They contain a high-definition scene camera recording the environment and two infrared cameras aimed at the eyes. Infrared light is beamed into the eyes whilst a camera records the position of the corneal surface reflection relative to the pupil [10]. The software calculates how the infrared eye cameras relate to the image given by the scene camera and shows what the eye is pointed at [11]. A cursor, representing gaze, appears on the video when played. Data were recorded on a personal digital assistant (PDA) attached to the surgeon’s back underneath the gown, then downloaded onto a laptop and analysed with proprietary software (BeGaze, SMI, Berlin, Germany). The data were also processed through a previously validated in-house software algorithm that extracts pupil metrics such as pupil dilation [12].

The glasses have a spatial accuracy of 0.1° visual angle and 0.5° precision. They indicate the location of gaze on a video image of the scene [11]. Gaze data are recorded at 30 Hz for offline analysis [13].

Assessment

Areas of interest (AOIs) were defined and divided into the operative site, sterile field, scrub nurse and operating theatre. The operative site was the area within the incision of the skin, the sterile field was the area within the sterile drapes and cleaned skin (excluding operative site), the scrub nurse included the instrument trolley and the operating theatre included any other area inside the operating theatre. The surgeons’ gaze was mapped (see Fig. 1), and eye metrics were measured based on those AOIs.

Fig. 1

Semantic gaze mapping on BeGaze with the reference view on the left and the video going through fixation by fixation on the right

The primary gaze metrics recorded were fixation frequency and dwell time. Fixation frequency is the rate of steady eye gaze on an object, and dwell time is the summed duration of fixations and saccades (rapid eye movements). Secondary endpoints were maximum pupil size, pupil rate of change and pupil entropy. Maximum pupil size is the largest size of the pupil during activity, pupil rate of change is the frequency of change in pupil size and pupil entropy is the predictability of pupil change (see Table 1). Pupil entropy is calculated by applying a low-pass filter to a moving average over 5 s, that is, a window of 150 samples on the 30 Hz eye-tracking glasses that were used. The sample was then analysed to determine “how chaotic” the signal is or, in other words, how much it changes. Segments of the recording that were not related to the operation itself, such as when a surgeon was showing a trainee scrub nurse how to mount a suture, were excluded from analysis.
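The pupil entropy calculation described above could be sketched roughly as follows. This is only an illustration: the text does not specify the entropy estimator, so the sketch assumes Shannon entropy over a histogram of the smoothed signal, and the function names and the bin count of 16 are hypothetical. The validated in-house algorithm [12] may well differ.

```python
import math
from collections import Counter

SAMPLE_RATE_HZ = 30                                # sampling rate stated in the text
WINDOW_SECONDS = 5                                 # moving-average window stated in the text
WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_SECONDS   # 150 samples


def moving_average(signal, window=WINDOW_SAMPLES):
    """Smooth a pupil-size series with a simple moving average (a basic low-pass filter)."""
    if len(signal) < window:
        raise ValueError("signal is shorter than the averaging window")
    total = sum(signal[:window])
    smoothed = [total / window]
    for i in range(window, len(signal)):
        # Slide the window by one sample: add the new value, drop the oldest.
        total += signal[i] - signal[i - window]
        smoothed.append(total / window)
    return smoothed


def shannon_entropy(values, n_bins=16):
    """Shannon entropy (bits) of a histogram of the values.

    The estimator and the bin count are assumptions made for this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a perfectly flat signal carries no uncertainty
    width = (hi - lo) / n_bins
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def pupil_entropy(pupil_sizes):
    """Entropy of the smoothed pupil-size signal for one recording segment."""
    return shannon_entropy(moving_average(pupil_sizes))
```

Under these assumptions, a constant pupil size gives an entropy of zero, while a signal that ranges widely after smoothing gives a value approaching log2 of the bin count.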

Table 1 Table of outcome parameters with definitions and units

After surgery, each case was rated for its level of difficulty on a scale of 1–7. Where the case was performed by an expert surgeon, it was rated by the subject; where it was performed by a junior surgeon, it was rated by the supervising senior surgeon. The primary surgeon (expert or junior) completed the NASA Task Load Index (NASA TLX) form to measure their subjective cognitive load for that particular case.

Analysis

The surgeons were separated into two groups: expert and junior. Experts were attending surgeons and senior residents who had been deemed independently competent for inguinal hernia surgery through procedure-based assessments and had performed a minimum of 100 inguinal hernia operations independently. A standard Lichtenstein inguinal hernia operation with mesh was performed in all cases with identical operative steps.

All gaze metrics were compared across the entire procedure between groups but, as anticipated, did not yield significant overall differences in outcome parameters. This is most likely related to the heterogeneity in anatomy of the hernias and differences in surgical technique, which render overall procedural comparisons difficult. To account for this, as planned, segmental procedure analysis was performed, focusing on the segments of the operation that were most standardised and independent of variations in patient anatomy:

  • Segment 1—beginning at cutting of mesh to when application of mesh was complete.

  • Segment 2—from when application of mesh was complete until closure of external oblique aponeurosis.

  • Segment 3—first stitch on external oblique aponeurosis to first stitch on subcutaneous tissue.

All videos were calibrated, and offset corrections for eye tracking were carried out as necessary. Fixation frequency and dwell time, normalised for segment time, were compared across groups for each segment of the procedure with respect to the different AOIs. Maximum pupil size, pupil rate of change and pupil entropy were compared across groups for each segment of the procedure alone.
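A minimal sketch of how per-AOI metrics might be normalised for segment time is given below. The event format (AOI label, event kind, duration in seconds) and the function name are hypothetical, and the text does not state the exact normalisation used; here both the fixation count and the dwell time are simply divided by the segment duration.

```python
from collections import defaultdict

# The four areas of interest defined in the Assessment section.
AOIS = ("operative site", "sterile field", "scrub nurse", "operating theatre")


def aoi_metrics(events, segment_duration_s):
    """Summarise gaze events for one segment of one case.

    events: iterable of (aoi, kind, duration_s) tuples, where kind is
    "fixation" or "saccade" (a hypothetical export format).
    Returns, per AOI, fixations per second and the fraction of the
    segment spent dwelling there (fixations plus saccades).
    """
    if segment_duration_s <= 0:
        raise ValueError("segment duration must be positive")
    fixation_count = defaultdict(int)
    dwell_s = defaultdict(float)
    for aoi, kind, duration_s in events:
        if kind == "fixation":
            fixation_count[aoi] += 1
        # Dwell time sums fixation and saccade durations within the AOI.
        dwell_s[aoi] += duration_s
    return {
        aoi: {
            "fixation_frequency": fixation_count[aoi] / segment_duration_s,
            "dwell_fraction": dwell_s[aoi] / segment_duration_s,
        }
        for aoi in AOIS
    }
```

Dividing by segment duration makes cases of different lengths comparable, which matters here because experts and juniors may take different amounts of time over the same segment.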

Statistical analyses compared eye metrics between expert and junior surgeons. The Mann–Whitney U test and the Wilcoxon test were carried out on the eye metrics through SPSS version 18.0 (IBM Corporation, Armonk, New York, USA). Medians and interquartile ranges (IQR) were calculated through Microsoft Excel (Microsoft Corp, Redmond, WA). A p value of less than 0.05 was considered statistically significant.
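For readers without access to SPSS, the group comparison can be approximated with the Python standard library alone. The sketch below is not the authors' procedure: it computes the Mann–Whitney U statistic and an exact two-sided permutation P value, whereas SPSS may report an asymptotic approximation, and the quartile convention used for the IQR ("inclusive") is an assumption. Function names are illustrative.

```python
from itertools import combinations
from statistics import median, quantiles


def median_iqr(values):
    """Median and interquartile range (quartile convention is an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def u_statistic(a, b):
    """Mann-Whitney U for group a: pairs where a exceeds b, with ties counted as half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def mann_whitney_exact(a, b):
    """U statistic and exact two-sided P value by enumerating all group splits.

    Only practical for small samples, since the number of splits grows
    combinatorially with the pooled sample size.
    """
    pooled = list(a) + list(b)
    n1, n2 = len(a), len(b)
    centre = n1 * n2 / 2  # expected U under the null hypothesis
    u_observed = u_statistic(a, b)
    observed_distance = abs(u_observed - centre)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(u_statistic(group_a, group_b) - centre) >= observed_distance:
            extreme += 1
    return u_observed, extreme / total
```

With 13 data sets split between two groups, the enumeration covers at most 1,716 splits, so the exact approach is feasible at this sample size.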

Results

A total of 25 cases were recorded over 8 weeks. Nine cases were lost due to equipment failure and three cases were discarded due to poor tracking quality. Thirteen full data sets were collected in total, performed by nine surgeons (eight males and one female) (see Table 2), with 630 min of video recorded.

Table 2 Surgeon demographics

AOI metrics

Experts, compared to juniors, had higher fixation frequency (see Table 3; Fig. 2) (1.86 [IQR 0.3] vs 0.96 [IQR 0.3]; P = 0.006) and dwell time (see Table 3; Fig. 3) (792 s [IQR 169 s] vs 469 s [IQR 109 s]; P = 0.028) at the operative site during application of mesh (segment 2). Closure of the external oblique (segment 3) also showed differences, with experts having a higher fixation frequency (1.79 [IQR 0.2] vs 1.20 [IQR 0.6]; P = 0.003) and dwell time (625 s [IQR 154 s] vs 448 s [IQR 147 s]; P = 0.032) at the operative site than juniors. For both segments 2 and 3, juniors split their attention more than experts, with reduced attention to the operative site and more to the sterile field. For cutting of mesh (segment 1), there was no significant difference in fixation frequency. However, the experts dwelled more on the sterile field (716 s [IQR 173 s] vs 268 s [IQR 297 s]; P = 0.019) (see Figs. 2, 3), whereas the juniors split their attention more, with reduced dwell time on the sterile field and greater dwell time on the operative site.

Table 3 Summary of fixation frequencies and dwell time (s) of the AOIs between experts and juniors, median [IQR]
Fig. 2

Box plots showing fixation frequency for expert and junior surgeons in A segment 2 and B segment 3 across the four areas of interest: operative site, sterile field, scrub nurse and theatre

Fig. 3

Box plots showing dwell time (s) for expert and junior surgeons in A segment 1, B segment 2 and C segment 3 across the four areas of interest: operative site, sterile field, scrub nurse and theatre

Pupil metrics

With application of mesh (segment 2), juniors had a larger left pupil size (see Table 4) (7.72 [IQR 0.15] vs 6.84 [IQR 0.64]; P = 0.032), left pupil entropy (3.84 [IQR 0.26] vs 3.12 [IQR 0.78]; P = 0.007) and right pupil entropy (3.93 [IQR 0.38] vs 2.85 [IQR 0.98]; P = 0.022). Experts showed a greater left pupil rate of change (0.0059 [IQR 0.0015] vs 0.0031 [IQR 0.0020]; P = 0.022).

Table 4 P values of pupil metrics for each segment between expert and junior surgeons

In closure of the external oblique (segment 3), juniors had a larger left (3.51 [IQR 0.37] vs 2.48 [IQR 0.43]; P = 0.007) and right (3.19 [IQR 0.31] vs 2.15 [IQR 0.73]; P = 0.015) pupil entropy and right maximum pupil size (7.17 [IQR 0.2] vs 6.68 [IQR 0.65]; P = 0.032). Experts had a higher left pupil rate of change (0.010 [IQR 0.0042] vs 0.0066 [IQR 0.0013]; P = 0.046).

Cognitive load

Case duration was significantly shorter for experts than juniors (29.0 min [IQR 10.1 min] vs 56.3 min [IQR 21.4 min]; P = 0.022) (see Table 5). In NASA TLX, the expert group reported significantly lower mental demand than the junior group (3 [IQR 2] vs 12 [IQR 5.2]; P = 0.038). Physical demand, temporal demand, effort, frustration and case difficulty were perceived as lower by experts compared with juniors but did not reach statistical significance. Experts rated their performance higher than juniors did, but this also did not reach statistical significance. None of the subjects reported problems or discomfort with wearing the glasses, or made any complaints relating to obstruction of the visual field.

Table 5 Median [IQR] case difficulty, duration and NASA TLX scores

Discussion

This paper presents the first eye-tracking study to compare eye metrics between expert and junior surgeons during live surgery. It provides important evidence towards validating the use of eye-tracking technology in assessment of surgical skill and demonstrates a difference in eye behaviour during several key stages of hernia surgery between surgeons of differing levels of expertise.

When examining the AOI eye metrics, there were higher fixation rates and dwell times at the operative site in expert surgeons during application of mesh (segment 2), suggesting greater attention to the task. Similarly, during closure of the external oblique (segment 3), experts had higher fixation rates and dwell times than junior surgeons at the operative site. Our findings are supported by previous research where expert surgeons have been shown to have more focused attention on a task, represented by fixation rates [7]. Experts may switch less often to instruments left on the sterile field and be able to apply the mesh efficiently, therefore not looking away from the operative site as often as junior surgeons; they may also be less susceptible to distracting stimuli. Experts also dwelled more on the sterile field than juniors during cutting of mesh to application of mesh (segment 1). This may be explained by the observation that juniors looked back at the operative site more often to measure out the appropriate mesh size, whereas this was not seen in the expert group. These findings are supported by the work of Gegenfurtner et al. [14], where it was found that experts, compared with non-experts, fixated more on task-relevant areas, and less so on task-redundant areas.

A study by Zheng et al. [15] found that juniors tend to focus their eyes more on the surgical monitor during laparoscopic cholecystectomy compared with experts, who would visually scan around the operating room more and acquire information about the patient’s condition, for example from the anaesthetic monitor. As surgical skills improve, surgeons would attend more to the environment [16]. This is opposite to the findings of this study, which may be due to open inguinal hernia repair being a simpler procedure compared with a laparoscopic cholecystectomy, where there are additional stimuli to consider.

A limited number of studies have assessed gaze strategies as part of the learning process during the execution of tasks in laparoscopic surgery [11, 13, 17]. Despite the different modes of visual interaction between laparoscopic and open surgery (e.g. magnification, fixed screen orientation and two-dimensionality), the results of this study suggest that similar principles in eye behaviour may apply to both open and laparoscopic procedures. Further analysis of these data reveals certain behavioural patterns associated with expert behaviour at various stages of the task. Analysis of the videos showed that expert surgeons were more likely to anticipate what instruments they would need in advance. This enabled them to request an instrument from the scrub nurse whilst continuing with the operation and keeping their gaze at the operative site, receiving the instrument from the scrub nurse directly into their hand. Analysis of our data suggested that junior surgeons realised what instrument they needed later, resulting in delays and inefficiencies, recorded as searching gaze patterns outside of the sterile field, e.g. to look at the scrub nurse’s trolley to determine what instrument was next required. Such findings also suggest avenues for future training and intervention to allow surgeons to carry out the operation with greater efficiency.

Pupil entropy was found to be significantly higher in juniors for both eyes during application of mesh (segment 2) and closure of the external oblique (segment 3), which suggests that the juniors are concentrating more with a higher cognitive demand. It also suggests that juniors, compared to experts, are processing more information at that time, making their pupil changes less predictable. This may be due to experts being able to carry out the task with lower cognitive demand. This is similar to the findings from AOIs where experts were seen to concentrate more at the operative site than juniors. This may be explained by junior surgeons focusing attention elsewhere at a number of locations, giving the higher pupil entropy.

Experts would be expected to experience lower cognitive workload compared with juniors due to increased automaticity associated with increased levels of experience [18]. This is reflected in the lower mental demand reported by experts, who have carried out the procedure numerous times and therefore find the task less challenging and stressful than the junior group. With practice, coordination of movement is mentally embedded and can be performed with minimal mental resource [18]. This is also reflected by the shorter case duration in experts compared with juniors.

The results reveal clear trends in physical and temporal demand, performance, effort and frustration between the two groups, which might have reached statistical significance with greater subject numbers. The lack of a significant difference in physical demand, performance, effort and frustration may be explained by the relative simplicity of the procedure, in which surgical trainees are expected to develop proficiency early in their training [19].

There are some limitations of this study to consider. The number of subjects was limited by the quality of eye tracking: several hardware and software issues resulted in videos where the quality of tracking was inadequate and unsuitable for analysis. No sample size calculation was performed, owing to this being an exploratory study and the first of its nature. It is, however, anticipated that data resulting from this study will enable calculations of this type to be performed for future studies. The experience of the scrub nurse and assistant may bias the results by affecting the behaviour and interaction of the surgeon. Though this was not explicitly recorded for this study, long durations where gaze was fixed on the scrub nurse or assistant due to their lack of experience, such as during teaching, were omitted from analysis. The number of staff, medical students and other distractions in the operating theatre will affect the eye behaviour of the surgeon through interaction and teaching. However, this is normal in a real-life clinical setting, making the results more applicable. The presence of an in situ researcher meant that we were able to record and minimise extraneous distractions. Although a potential source of bias, clear contiguous phases of recorded data that were unrelated to the procedure at hand were edited out, which hopefully minimised the distractions. We decided to concentrate on the operative steps of hernia repair that are most standardised and independent of patient anatomy to limit confounding related to patient factors such as anatomy or habitus. This resulted in the intentional exclusion of several stages of the procedure vital to successful hernia repair, in order to limit heterogeneity given our limited sample size in this exploratory study. Future studies incorporating larger numbers should be able to take into account the entire procedure. The Hawthorne effect could also have resulted in bias. However, as it applies equally to all subjects, it should have minimal effect on outcomes.

If eye tracking is used in assessment in future, deception may be possible by surgeons artificially increasing their fixation frequency and dwell time at the operative site. However, with further studies, we may be able to define an upper limit (i.e. ceiling of effect) of fixation frequency and dwell time, such that those who “cheated” would push their fixation and dwell above the upper limit. Future studies can also concentrate more on pupil metrics, which reflect a natural physiological behaviour over which the subject has no voluntary control.

Conclusion

Expert and junior surgeons exhibit differences in eye behaviour during key stages of an open inguinal hernia repair, which may be due to automaticity and proficiency developed through practice, resulting in lower mental demand and case duration. Eye tracking can potentially be used to objectively test the technical skills of a surgeon.

The methodology from this paper can be applied to different operations to validate the use of eye tracking in surgical skill assessment. Future work should also focus on validating eye metrics against performance for additional surgical procedures, with large sample sizes and variations of experience, and on further validation against OSATS or similar objective rating scales, with consideration of scrub nurse and assistant experience. In this study, the order of fixations was not examined, and future analysis of these data may reveal differences in the temporal fixation pattern, which can further distinguish behavioural patterns.