Introduction

The key to successful medical interventions is immediate access to the patient’s anatomical image data. Especially during minimally invasive procedures, in contrast to open surgery, physicians cannot see their target, the surrounding risk structures or their instruments inside the patient and therefore rely heavily on recent medical images and 3D models of the anatomy. Because of the special working conditions in the operating room (OR) and the interventional radiology suite, i.e., sterility, limited space and time pressure, physicians face challenging human–computer interaction tasks. These tasks include the control of medical image viewers, interactive registration of images and interaction with medical robots. In clinical routine, sterile covers enable the direct use of interaction devices, e.g., joysticks, touchscreens or control panels. In addition, foot pedals with limited functionality are used to control software directly, but interaction with software is still very often delegated to a nonsterile assistant via speech or gesture commands [25].

However, indirect interaction might be inefficient and error-prone. Technologies in the field of touchless human–computer interaction, e.g., range cameras, voice control or eye tracking, are promising: they offer new ways of interacting with medical software under sterile conditions. Bauer et al. [5] gave a first overview of touchless interaction in sterile environments. However, they focused on body and hand gesture interaction and did not broaden the scope to other promising modalities, such as voice recognition. Another synopsis of touchless interaction was given by O’Hara et al. [51], who also concentrated on body gestures, especially with the structured-light-based Microsoft Kinect 1 (Microsoft Corp., Redmond, WA, USA). They furthermore mention voice control as a useful addition to gesture input.

Given the growing interest in touchless interaction in the OR, we aim to give a broad and complete systematic overview of existing approaches that address the interaction challenges in the classic and the hybrid OR. We additionally discuss the main problems and future trends in intraoperative touchless gesture interaction.

Methods

In the following, the literature search strategy and inclusion criteria are described.

Search strategy

A systematic literature search for scientific papers was conducted using the PubMed database. For this purpose, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [42] were followed. Our PubMed search term consisted of 8 MeSH terms and 34 title/abstract search terms and is provided as supplementary material for this paper. Forward and backward searches were performed using PubMed and Google Scholar: similar, cited and citing papers of relevant literature from the PubMed database meeting the inclusion criteria (see below) were identified. Both the literature search and the review were performed by two reviewers independently.

Inclusion criteria

Relevant in the sense of this literature review are all scientific papers in English that deal with prototypes or systems in productive use in the immediate environment of an OR, which can be controlled partly or completely touchlessly and serve as aids to successfully complete the intervention. This includes voice recognition, body movement gestures and eye tracking.

Not included in this review is literature investigating the use of dictation software for radiological diagnosis as well as systems for rehabilitation, training or teaching (e.g., live streaming devices), since those approaches do not have a direct impact on the outcome of an intervention.

Results

The search strategy described above yielded 403 references from the PubMed database, of which 41 were relevant (see section “Inclusion criteria”). The Google Scholar search added 14 more relevant papers, for a total of 55 considered in this review. Thirty-three of the papers were published in peer-reviewed journals, and two are book chapters. Overviews of the implementations and evaluations of touchless gesture interaction systems for interventional use are presented in the tables in the corresponding sections, each containing a short description of the interaction type or device, the technical approach and, if given, the evaluation results of each paper.

Most of the authors describe methods for the touchless manipulation of medical image data (34 of 55). Other objectives are laparoscopic assistance (7), telerobotic assistance (5), OR control (5), robotic OR assistance (2) and intraoperative registration (2). The most popular device for touchless intraoperative gesture control is the Microsoft Kinect 1 structured-light-based range camera (21). Other relevant interaction devices or types in this review are stereo cameras (12), 9 of which are the Leap Motion Controller (LMC) (Leap Motion, Inc., San Francisco, CA, USA), body-worn inertial sensors (6), RGB cameras or webcams (5), voice recognition (7), eye tracking (4), the Intel RealSense Creative (Intel Corporation, Santa Clara, CA, USA) structured-light camera (1) and a time-of-flight range camera (1). Figure 1 illustrates the connections between the devices and methods listed in this review. Most (40) of the systems described in the references have been evaluated under laboratory conditions or in single experiments. Eight systems were tested in real interventions, and 7 research teams did not evaluate their work or did not provide information about it.

Fig. 1 Overview of the touchless interaction methods and devices used

Two papers with the same content as Wachs et al. [72] and one paper similar to Jacob and Wachs [28] were excluded, because they did not make an additional contribution to the research. We decided to include only the latest and most valuable of these publications in this literature review.

In the following, a synopsis of the relevant literature for each category is given. The papers are categorized by their objective in the OR, within which they are summarized in chronological order according to the interaction device used. Publications which provide new approaches or a significant contribution to the research area are described in greater detail than others.

Control of medical image viewers

A large amount of fundamental research has been done on interaction with the visualization of the patient’s anatomy (see Table 1). A camera-based approach was followed by Wachs et al. [72], who developed a vision-based hand gesture and posture capture system to control a medical image viewer with 7 gestures. During calibration, the hand is segmented from a camera image by subtracting a detected moving blob from the image background. The hand color is saved in a histogram as a look-up table and used as a reference in the gesture recognition process. The difference between two consecutive frames is computed and serves as a motion cue. Within a defined interaction area the user can browse, zoom and rotate medical images. The usability was evaluated based on interviews with one surgeon and a questionnaire. According to this, the system is easy to use and has short training times at a recognition rate of 96 %. A similar system was introduced by Achacon et al. [1]. Their hand gesture-controlled image viewer uses Haar-like features and the AdaBoost learning algorithm to train gestures as well as principal component analysis and distance matching to later recognize them in the camera image. The algorithm requires a clean background to work. Five unambiguous gestures were mapped onto the software functions. A 3-person experiment showed that the gesture recognition works better in a well-lit environment (96–100 % recognition rate) than in a dark one (16–96 %). The false-positive rate is high (64–75 % precision in the well-lit environment, 35–53 % in the dark environment).
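To make this class of appearance-plus-motion hand segmentation more concrete, the following is a minimal sketch assuming OpenCV; the function names, histogram bins and thresholds are illustrative and are not taken from [72] or [1].

```python
import cv2
import numpy as np

def build_hand_histogram(calibration_roi_bgr):
    """Build a hue-saturation histogram from a calibration patch of the hand."""
    hsv = cv2.cvtColor(calibration_roi_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def segment_hand(frame_bgr, prev_gray, hand_hist):
    """Combine histogram backprojection (appearance) with frame differencing (motion)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0, 1], hand_hist, [0, 180, 0, 256], scale=1)
    _, skin_mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    motion = cv2.absdiff(gray, prev_gray)
    _, motion_mask = cv2.threshold(motion, 25, 255, cv2.THRESH_BINARY)

    # The hand is assumed to be where skin color and recent motion coincide.
    hand_mask = cv2.bitwise_and(skin_mask, motion_mask)
    hand_mask = cv2.morphologyEx(hand_mask, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
    return hand_mask, gray  # return gray so the caller can reuse it as prev_gray
```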

Table 1 Touchless control of a medical image viewer, journal publications are marked with *

A different technique to generate a depth map similar to that of the Microsoft Kinect 1 is the time-of-flight (TOF) method. Soutschek et al. [66] used such a TOF camera to define 5 hand gestures based on thresholds in the depth map as well as in the RGB image to interact with medical images. A user study with 15 subjects revealed a 94 % gesture classification rate and real-time capability (10 FPS). The users assessed the system as intuitive and comfortable, with short response times.

Major issues of camera-based gesture control are the required line of sight and the fixed interaction area of the user. These problems can be avoided using inertial sensors worn on the head, wrist or body, which enable position-independent interaction. Schwarz et al. [64] introduced a technique which collects pose data from multiple body-worn inertial sensors and classifies them as low-dimensional body gestures. These are learned by the software beforehand and parameterized, which enables specialized and personalized gesture sets. A usability study with 10 subjects revealed good wearability of the system and a 90 % recognition rate. This system was later extended with a voice-based and a handheld switch unlock method by Bigdelou et al. [6]. Eight different gestures were defined and tested in a user study. The system does not inhibit the usual movements of the users and is responsive and accurate. The handheld switch to unlock the interaction is preferred over the voice trigger, possibly due to its faster response time. Jalaliniya et al. [29] presented a single wristband sensor and SensFloor capacitive floor sensors (Future-Shape GmbH, Höhenkirchen, Germany) with 12 different universally defined hand and foot gestures. While the foot gestures are used for toggling and switching, the hand gestures are used to interact with medical images. Single-output artificial neural networks recognize the gestures in the sensor data stream. A user study with 5 subjects resulted in 93 % recognition precision and 98 % recall. The users described the system as precise, intuitive and responsive. An approach with a myoelectric armband (Myo armband, Thalmic Labs Inc., Kitchener, Ontario, Canada) was taken by Hettig et al. [22]. This armband has 8 surface electromyographic sensors that sense electrical signals from the muscle contractions of the forearm. Five gestures were mapped onto four software functions, and haptic vibration feedback was implemented. Two user studies and one clinical test showed that the device is not robust enough for clinical use (recognition rates 56–86 %) and has a high false-positive recognition rate.
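The published inertial-sensor systems rely on learned, parameterized gesture models or artificial neural networks; as a simplified stand-in that illustrates the underlying windowing-and-classification idea, the following sketch reduces each sensor window to simple statistics and assigns it to the nearest gesture centroid, rejecting ambiguous input. All names and thresholds are illustrative and not taken from the cited papers.

```python
import numpy as np

def window_features(samples):
    """Reduce a window of inertial samples (N x channels, e.g., accel/gyro)
    to a fixed-length feature vector: per-channel mean and standard deviation."""
    return np.concatenate([samples.mean(axis=0), samples.std(axis=0)])

class CentroidGestureClassifier:
    """Learn one centroid per gesture from labeled example windows,
    then assign new windows to the nearest centroid (or reject them)."""

    def __init__(self, reject_distance=5.0):
        self.centroids = {}               # gesture label -> feature centroid
        self.reject_distance = reject_distance

    def train(self, labeled_windows):
        """labeled_windows: dict mapping gesture label -> list of sample windows."""
        for label, windows in labeled_windows.items():
            feats = np.stack([window_features(w) for w in windows])
            self.centroids[label] = feats.mean(axis=0)

    def classify(self, window):
        feat = window_features(window)
        label, dist = min(
            ((l, np.linalg.norm(feat - c)) for l, c in self.centroids.items()),
            key=lambda x: x[1],
        )
        # Reject ambiguous input instead of risking an unintended command.
        return label if dist < self.reject_distance else None
```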

In 2010, the introduction of the Microsoft Kinect 1, an inexpensive consumer structured-light-based depth sensor and RGB camera with a software development kit (SDK), user tracking and voice input capabilities, made it possible to easily build touchless gesture- and voice-controlled interfaces. Kirmizibayrak [33] first presented a comparison of a two-handed Kinect 1 3D rotation and target localization (2D slicing) interface for medical images with mouse control. A user study with 15 participants revealed that the two-handed gesture control outperforms mouse interaction in rotation tasks in terms of accuracy and task completion time; mouse interaction is slower but more accurate when localizing targets. Ebert et al. [14] introduced a different medical image viewer control. Voice commands and range camera input were mapped onto keyboard and mouse events. The voice recognition is provided by the operating system. With vocal commands, the interaction modes can be switched. The images are manipulated (windowing, scrolling, moving) with arm gestures. A comparison with mouse interaction in a user study with 10 subjects showed that mouse interaction is 1.4 times faster than gesture interaction; the overall usability was rated 3.4 out of 5, the accuracy of the gesture control 3.4 and the accuracy of the voice control 3. A very similar concept was presented by Suelze et al. [69]. Ruppert et al. [60] developed two solutions to interact with the software via hand or arm gestures. Hand recognition is realized with a depth threshold and by post-processing the noisy data. The center of gravity of the hand is then calculated and used for cursor movement and mouse events. The OpenNI framework and NiTE (Kinect 1 and skeletal data tracking frameworks working with the open source libfreenect driver) are the basis for the arm gesture interaction. The user can move the mouse cursor with the right hand, while lifting the left arm triggers click events. The authors did not provide evaluation details.
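The depth-threshold-and-centroid idea described by Ruppert et al. [60] can be sketched as follows; the depth window, pixel-count check and screen mapping are illustrative assumptions, and injecting the resulting position as an operating system mouse event (as done in [60, 68]) would require a platform-specific API that is not shown here.

```python
import numpy as np

def hand_cursor_from_depth(depth_frame, near_mm=400, far_mm=800,
                           screen_size=(1920, 1080)):
    """Segment the closest object (assumed to be the hand) by a depth window,
    compute its center of gravity and map it to screen coordinates."""
    mask = (depth_frame > near_mm) & (depth_frame < far_mm)
    if mask.sum() < 500:                      # too few pixels: no hand present
        return None
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()             # center of gravity of hand pixels
    h, w = depth_frame.shape
    # Normalize to [0, 1] and scale to the target screen resolution.
    return int(cx / w * screen_size[0]), int(cy / h * screen_size[1])
```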

The drawback of permanent user tracking and the resulting danger of triggering unintended actions was addressed by Jacob and Wachs [28] by determining the user’s intent from the torso orientation, previously executed commands and the time between subsequent commands. They created a Kinect 1-based gesture set, which is only active if the user is facing the display. A set of 10 gestures was chosen and trained with 10 surgeons. The interaction was evaluated in a user study with 20 subjects, which revealed a gesture recognition accuracy of 98 % and an intent recognition accuracy of 99 %. Gallo [16] compared Kinect 1 interaction with state-of-the-art trackball interaction in a medical image viewer. A gesture set was developed, and the task completion time at 95 and 80 % pose accuracy was measured in two user studies. The author concludes that the higher-degrees-of-freedom gesture set of the Kinect 1 performs better than the trackball in the ballistic phase, i.e., at 80 % pose accuracy, and worse in the correction phase (90 %). Another comparison was drawn by Hötker et al. [24]. Six voice commands and six hand gestures with the same functions for medical image manipulation were implemented. A user study with 10 subjects indicated that voice commands (97 %) are recognized better than the body gestures (88 %), with an overall false-positive rate of 30 %. The Kinect 1, a gyroscopic mouse and a tablet PC were compared by Chao et al. [10] in a study with 29 users. Five tasks had to be executed with each device. The highest usability was measured for the tablet (13.5 points), followed by the gyroscopic mouse (12.9 points) and the Kinect 1 (9.9 points). The task completion time was highest for the Kinect 1 (157 s) and lowest for the tablet (41 s). Only the measurement error did not differ significantly between the devices, at about 1 cm each. Riduwan et al. [58] segment the user’s hand from the depth image and use k-means clustering to find the hand pixels in the RGB image. After finding the hand’s contours with a Graham scan and Moore-neighbor tracing, the fingertips are detected by computing the convex hull. Finger gestures were defined, but not evaluated. Strickland et al. [68] developed a mouse-emulating gesture control as well. It includes visual feedback on a second monitor and was tested in a study of six surgeries. The system is claimed to be robust and reliable, but no data were reported. Similar interaction concepts were introduced by Tan et al. [70] and Yusoff et al. [79]. Kocev et al. [34] combined the range camera with a projector and calibrated them as a spatial augmented reality system to interact with information projected onto a deformable surface. A touchscreen-inspired multi-touch gesture set was implemented as well as contactless fingertip interaction. The algorithm execution time was evaluated and declared real-time capable. Silva et al. [65] developed their own skeletal tracking method similar to OpenNI and NiTE. With this method, a gesture set was developed and evaluated in a user study with 16 subjects and 10 tasks. OR information software compatible with Health Level 7 (HL7) has been developed by Nouei et al. [50]. It provides all available data about a patient in one application, which can be controlled touchlessly by finger or hand gestures using the depth and pixel data of the hand. Additionally, radio-frequency identification (RFID) tags are used to determine the role of the current user. The evaluation was carried out during 30 surgeries.
The surgeons and assistants pointed out the advantage of centralized, direct access to the patient data. Wipfli et al. [77] compared a Kinect 1 gesture interface with OR-typical task delegation and with mouse interaction for working with medical data. After a study with 30 participants they concluded that mouse interaction is significantly more efficient and yields significantly higher user satisfaction than gesture control and task delegation. However, there were no significant differences in error rates.
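Riduwan et al. [58] detect fingertips with a Graham scan and Moore-neighbor tracing; the sketch below substitutes OpenCV’s contour and convexity-defect functions (OpenCV 4 assumed) for the same hull-based fingertip idea, so it illustrates the principle rather than their exact pipeline.

```python
import cv2

def detect_fingertips(hand_mask, min_defect_depth=20.0):
    """Find fingertip candidates on a binary hand mask via the largest contour,
    its convex hull and the convexity defects between the fingers."""
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    hand = max(contours, key=cv2.contourArea)
    hull_idx = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull_idx)
    tips = []
    if defects is not None:
        for start, end, _, depth in defects[:, 0]:
            # Deep defects correspond to the valleys between extended fingers;
            # the hull points bounding them approximate the fingertips.
            if depth / 256.0 > min_defect_depth:
                tips.append(tuple(hand[start][0]))
                tips.append(tuple(hand[end][0]))
    return list(dict.fromkeys(tips))          # de-duplicate, keep order
```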

A depth map of the 3D space can also be obtained by calibrating two cameras as a stereo camera and calculating the disparity between the two images. Kipshagen et al. [32] introduced a medical image viewer that handles hand gestures captured with a stereo camera. After noise removal, the hand is segmented by its hand or glove color, followed by further denoising and gap closing. Low-frequency Fourier descriptors are used as unique feature vectors and compared to a previously trained feature database. By measuring distances and time between frames, a hand’s position offset and therefore its velocity and direction can be detected. A user study with 15 subjects resulted in a position error of less than 2 cm in 96 % and less than 1 cm in 50 % of the cases. The processing is done in real time. A very similar functional principle underlies the LMC, which is essentially a stereo camera with 3 infrared LEDs that illuminate the hand above the device so that it can be segmented more easily from the images. Bizzotto et al. (2014) [7] first implemented an LMC-based gesture control plugin for a medical image viewer. They used the freely available GameWave app to define their gestures and map them to the software’s functions. A study with 8 users was conducted, but no results or evaluation details were presented. The same approach was followed by Pauchot et al. (2015) [55], who also compared the LMC with the Kinect 1 qualitatively. The authors assert that the LMC has a reduced workspace, is less tiring, has greater precision and has a much smaller casing. Ebert et al. [13] developed a two-handed gesture set for browsing and manipulating medical images, but did not evaluate their approach. However, the authors experienced a small delay before the gesture recognition and a sensitivity to smudges on the device.
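As a reminder of how such a stereo setup yields depth, the following minimal sketch computes a disparity map from a rectified image pair with OpenCV’s block matcher and converts it to metric depth; the matcher parameters are illustrative, and the reviewed systems use their own matching and segmentation pipelines.

```python
import cv2

def depth_map_from_stereo(left_gray, right_gray, focal_px, baseline_m):
    """Compute a disparity map from a rectified stereo pair with block matching
    and convert it to metric depth via depth = f * B / disparity."""
    matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = matcher.compute(left_gray, right_gray).astype('float32') / 16.0
    disparity[disparity <= 0] = float('nan')   # invalid / unmatched pixels
    return focal_px * baseline_m / disparity   # depth in meters for each pixel
```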

Mauser et al. (2014) [37] use the LMC to control medical instruments as well as a medical image viewer. A difference from other gesture sets is the use of lock and unlock gestures to avoid unintended input, as already suggested in [7]. No evaluation details were presented. Rosa et al. (2014) [59] tested the feasibility of medical image viewer control with the LMC during 11 dental surgeries. Hand gestures were developed as well as two-finger gestures for scaling, rotating, windowing, browsing or measuring images. The feasibility was demonstrated without major technical errors. Mewes et al. (2015) [41] integrated a touchlessly controlled display into the radiation shield of a computed tomography (CT) angiography intervention room. The user interacts with 2D images and 3D planning models via hand gestures tracked with the LMC. The gesture set was designed to be intuitive and metaphoric. A user study with 12 subjects showed robustness problems with the 3D rotation, although all gestures were rated intuitive and self-descriptive. Saalfeld et al. [61] improved the gesture set of Mewes et al. (2015) [41] (especially the 3D rotation) and compared it to state-of-the-art touchscreen interaction. In a study with 10 subjects the task duration and the intuitiveness of the gesture set for medical image manipulation were measured. The interaction with the LMC is significantly slower, except for 3D rotation, which leads to the conclusion that high-dimensional gestures are better suited for more complex interaction tasks. Additionally, the touchscreen interaction was described as more intuitive, which is partly ascribed to the more frequent use of touchscreens on smartphones and tablets. A two-handed gesture set with a similar focus was developed by Opromolla et al. [52]. The evaluation with 10 users led to the conclusion that the LMC is too slow, not robust and not flexible enough for use in the OR; an advantage is the natural interaction with the software. Park et al. [53] developed a universal LMC gesture mapper that works with arbitrary medical image viewers. Either two-handed gestures or one-handed gestures with a foot pedal form the user interface, achieved by mapping hand gestures onto mouse events. The system is modular and battery-powered to provide maximum flexibility. In contrast to other publications, the evaluation with one surgeon showed the LMC to be significantly faster than mouse interaction, which the authors explain by the possibility of concurrent zooming and rotation. However, the gesture recognition rate ranged from 77 to 100 %, with a false-positive rate of 52 % for the double-click gesture.

Despite the strong focus on body gestures in this field, touchless interaction is possible not only through hand and arm movements, but also with voice recognition systems. Mentis et al. [39] partly integrated voice commands into their Kinect 1-based interaction and use them as a function trigger or mode switch, while hand gestures are used for continuous functions like browsing through a set of images. A surgeon examined the system qualitatively and found it useful for a clinical environment. However, no evaluation data were given.

Table 2 Touchless control of laparoscopic and endoscopic devices, journal publications are marked with *

Laparoscopic assistance

During laparoscopic interventions, the physician often needs an assistant to control the laparoscopic camera, the light or the insufflator. Interhuman communication in the OR lacks precision and requires much experience as a team. To eliminate the possible complications which such indirect control implies, El-Shallaly et al. [15] evaluated a commercial voice recognition interface. Via voice commands, the light can be activated, the camera can be set up and white-balanced, and the insufflator can be controlled. After treating 100 patients with and without the system, the authors concluded that significantly less time is spent switching components on and off than with manual control. Nevertheless, the authors underline that the absolute gain in efficiency is only about 1 min in total per operation. The same commercial system was evaluated by Salama and Schwaitzberg [63], who investigated its availability in comparison with an assistant. The nurse was not immediately ready to execute commands in 77 % of the cases in which voice commands were given to the voice control system, which implies that voice commands can make a laparoscopic intervention more productive. Not only laparoscopic but also endoscopic procedures can benefit from voice recognition assistance. Nathan et al. [46] presented a robotic scope holder which positions an endoscope and is controlled by distinct spoken commands from the physician. The system was evaluated with 10 cadaver heads. There is no significant increase in the time to set up the endoscope or software. The major advantages are that, due to the direct interaction, the system does not misinterpret commands the way a second surgeon or assistant might, and that the robot is not affected by fatigue.

A laparoscopic camera can be controlled with head movements as well as with spoken commands. Nishikawa et al. [49] first developed a camera-based head movement control for this scenario. The user is monitored via an RGB camera, and head movements are interpreted as gestures and mapped onto the laparoscopic camera actions tilt, pan, insert and retract. According to laboratory experiments with three users, the system is highly accurate and does not misdirect the camera. An in vivo experiment with a pig revealed signs of fatigue in the user’s neck. Wachs et al. [73] implemented a similar solution: if the head angle exceeds a defined threshold, the camera turns in the desired direction. A simulated surgery with 4 users revealed that face orientation control is slower than keyboard control, but easier to learn. A drawback is the absence of a lock gesture or command, which makes unintended actions possible. Yoshida et al. [78] successfully tested their finger-based head-mounted display (HMD) view interaction during a laparoscopic intervention. The user is presented with multiple views on the HMD, i.e., a video stream from the laparoscopic camera, medical image data and a video stream from the head-mounted camera. The number of fingertips the user holds in front of the camera selects the viewport.
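The threshold-based mapping from head orientation to camera motion used in [73] can be reduced to a few lines; the angle names, dead-zone size and command strings below are illustrative, and the insertion/retraction axis as well as the missing lock command discussed above are not modeled.

```python
def camera_command_from_head_pose(yaw_deg, pitch_deg, threshold_deg=15.0):
    """Map the tracked head orientation to a laparoscope motion command.
    The camera only moves while the head is turned beyond a dead zone,
    which reduces unintended motion from small natural head movements."""
    if pitch_deg > threshold_deg:
        return "tilt_up"
    if pitch_deg < -threshold_deg:
        return "tilt_down"
    if yaw_deg > threshold_deg:
        return "pan_right"
    if yaw_deg < -threshold_deg:
        return "pan_left"
    return "hold"                # inside the dead zone: keep the camera still
```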

Reilink et al. [57] followed a similar approach to [49] and [73], but with body-worn inertial sensors on the head to track the physician’s movements. Either a monitor or an HMD can be used as the display. Three algorithms were implemented: position-dependent, velocity-dependent and hybrid movement control. The physician resets the initial position with a foot pedal. Fifteen subjects tested two-directional gastroscope steering and preferred the velocity-dependent approach and the HMD. No delay between head and camera motion was noted. A disadvantage is that no information about the tip orientation is given, and thus the users do not know which further movements are possible.

See Table 2 for a short summary.

Table 3 Touchless telerobotic assistance, journal publications are marked with *

Telerobotic assistance

Telerobotic surgery enables surgeons to conduct more precise and less invasive operations than conventional methods do. Nevertheless, the control consoles are complex and difficult to handle. To facilitate telerobotic control, Mylonas et al. [45] developed an eye-tracking-assisted method to generate haptic constraints for the robot’s movements based on the physician’s gaze point (see Table 3). These constraints are experienced as haptic feedback with 6 degrees of freedom. The force opposing the surgeon’s movement depends on the distance between the eyes’ fixation point and the surgical instrument as well as on the underlying force profile (high, 1:1 scaling or linear spring). This way, unwanted movements of the robot can be avoided. Ten subjects tested the eye tracking and motor channeling, and 6 additional subjects used it with a commercial telerobot. The linear spring force profile performed best. The hands of the users were not overpowered, and no pre- or intraoperative registration was necessary. A similar system was used by Stoyanov et al. [67] to optimize ablation paths on the surface of heart tissue with nonparametric clustering. The surgeon’s fixation points are determined by measuring the corneal reflection of a fixed infrared light source relative to the center of the pupil. A user study with 8 subjects was conducted with a heart phantom model. The 3D path error was 2.2 mm; the path itself was jitter-free. Visentini-Scarzanella et al. [71] used this binocular eye tracking to localize the physician’s region of interest (ROI) on deformable tissue, which enables a semidense stereo surface reconstruction with reduced computational complexity and better resolution in the desired area. A decent 3D reconstruction is mandatory for dynamic active constraints, motion stabilization and image guidance. The authors tested the method on a silicone heart phantom with 15 fiducials. CT-generated 2D images and the heart were temporally and spatially aligned. The static reconstruction of smooth featureless areas showed a maximum error of 3.5 mm, and the real-time dynamic motion recovery error was 2.9 ± 2.3 mm. The difference between an eye-tracker-based autofocus and the built-in foot-pedal-based mechanical focus was investigated by Clancy et al. [11]. A liquid lens at the end of the commercial telerobot endoscope was used to automatically focus on the area the surgeon is fixating with the eyes. The evaluation with 17 subjects revealed that the eye-tracking autofocus method is not only faster but also feels more comfortable and natural to the users. The liquid lens response time was approximately 30 ms. One advantage over mechanical autofocus is that no moving parts are used, which implies a longer durability.
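The linear spring profile of such gaze-contingent constraints can be summarized in a short sketch: the force pulling the instrument back grows linearly with its distance from the 3D fixation point. Stiffness and clamping values are illustrative and not taken from [45], where the force is rendered through the robot’s haptic console.

```python
import numpy as np

def gaze_constraint_force(instrument_pos, fixation_pos, stiffness=50.0,
                          max_force=5.0):
    """Linear-spring haptic constraint: the further the instrument tip moves
    from the surgeon's 3D gaze fixation point, the stronger the force pulling
    it back (positions in meters, force in newtons)."""
    offset = np.asarray(fixation_pos) - np.asarray(instrument_pos)
    force = stiffness * offset                 # F = k * x (linear spring)
    magnitude = np.linalg.norm(force)
    if magnitude > max_force:                  # clamp for safety
        force *= max_force / magnitude
    return force
```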

Table 4 Touchless control of robotic OR assistance, journal publications are marked with *
Table 5 Touchless intraoperative registration, journal publications are marked with *

Another method to control a robot is hand gesture interaction with a range camera. Wen et al. [75] use a Kinect 1 to recognize the physician’s hand gestures, which control both a surgical robot that inserts a needle into the operation field and a projected augmented reality needle guidance on the patient. Two modes of operation are possible: manual and semiautomatic generation of ablation paths. The automatically generated trajectory can be revised directly on the patient with the radiofrequency ablation (RFA) planning models. The context can be selected with the palm; gestures are described by 90-dimensional feature descriptors. Twenty-two insertion tests were conducted to measure the accuracy of the whole system. The needle insertion error in a static scenario was less than 2 mm.

Robotic assistance

Apart from actually operating on patients, robots can also assist in the OR (Table 4). Li et al. [35] introduced a robotic scrub nurse, which hands medical instruments over to the physician after being instructed by hand gesture commands. A Microsoft Kinect 1 range camera is used to detect 5 different finger poses, each representing a single instrument, which is then delivered to the surgeon. Usability tests with 4 subjects revealed a 97 % gesture recognition rate, a 160-ms gesture recognition time and a 5- to 6-s total interaction time including a 2-s instrument delivery to a fixed spot. The delivery precision is 25 mm. Users rated the system moderately easy to use, remember and learn, and moderately comfortable and safe. The robot delivery is 0.83 s slower than human delivery.

Hartmann and Schlaefer [20] use a gesture-controlled robot to reposition the operating room light spot. An unlock gesture is used to activate the robot; afterward, either the center of the palm or the center point between both hands is followed. Eighteen users were involved in the evaluation, in which the light had to follow a fixed track. Only one-handed interaction was tested for precision and speed. The system is robust and reliable, and the unlock gesture is suitable for clinical use.

Intraoperative registration

In addition to simple medical image viewers, modern navigation systems also provide the possibility to plan needle insertion paths or to register anatomical images from different imaging modalities. Herniczek et al. [21] investigated the use of body-worn inertial sensors on the hand, worn under a sterile glove, to touchlessly place points for needle insertion guidance on ultrasound snapshots. Four gestures were trained. No detailed evaluation was presented; the authors claim a gesture recognition rate of 100 % for all but one gesture (92 %).

Gong et al. [17] introduced an interactive 2D/3D registration method with depth-camera-based hand gesture interaction. The user performs an initial alignment of a 3D model to X-ray images with two gestures. The gestures are processed via the skeletal and depth data, and a cursor is positioned directly via hand movements. Three users tested the system. The positioning error was 8.3 ± 5 mm with a positioning time of 140 ± 70 s.

Refer to Table 5 for a brief overview.

OR control

Sometimes the physician wants to control not only a specific navigation aid or instrument touchlessly, but also arbitrary other software in the OR. To facilitate this, Graetzel et al. [18] developed a system based on a stereo camera to control any OR software contactlessly with hand gestures. The gestures are performed in a 50 × 50 × 50 cm workspace and tracked and processed at 25 Hz. With this system, the user can move the mouse cursor and trigger mouse click events. The authors conducted a user study in the laboratory with 16 participants and a mock-up user interface. Click gestures were robust, but rapid gestures were not reliably detected; the pointer often jittered and hand–eye coordination problems occurred. Another test during an intervention showed that the physicians preferred the hold-and-click gesture over pushing the hand forward. Grange et al. [19] extended this solution. The physician is permanently monitored by the stereo camera and an RGB camera to infer context information, which can be used to automate the adaptation of equipment settings. The whole process is aware of the current workflow step. The authors did not evaluate their extension.

Table 6 Touchless OR control and logging, journal publications are marked with *

In many crisis situations the clinical staff in the OR have to use both hands to attend to the patient. Therefore, Alapetite [2] developed a voice recognition-based anesthesia record system for intraoperative use that allows anesthetists to control liquid flow and log events with spoken commands from a fixed dictionary. The dictionary was generated from the medications used in the hospital during the previous 2 years; the command language is nevertheless natural. The command recognition starts after an unlocking keyword and automatically stops after a fixed period. Free text input is also possible. Six anesthesia teams participated in the evaluation in two sessions each, comparing conventional keyboard and touchscreen input with voice recognition interaction. The study showed that mental workload is decreased by natural language interaction; to measure this, the authors introduced the “average queue of events” metric as a mental workload indicator. Perrakis et al. [56] compared the voice recognition software of two integrated OR environments (Siemens Integrated OR System SIOS and Karl Storz OR1) in a user study with 74 subjects. The evaluation covered the adjustment of the OR light, increasing gas pressure, switching on the video controller and controlling the endolight source. No significant difference in the number of repeated commands due to different accents was found; however, the SIOS voice recognition performed significantly better than that of the OR1. The authors conclude that the SIOS is more reliable, but all actions are performed faster manually than with voice control.
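The unlock-keyword-plus-timeout pattern used by Alapetite [2] can be illustrated with the following sketch; it is not the published implementation, the speech recognizer that delivers the phrases is assumed, and the wake word, timeout and command set are placeholders.

```python
import time

class GatedVoiceCommands:
    """Accept commands from a fixed dictionary only for a limited period
    after an unlocking keyword, as in keyword-gated OR voice interfaces."""

    def __init__(self, wake_word, commands, active_seconds=10.0):
        self.wake_word = wake_word            # e.g., "system"
        self.commands = commands              # phrase -> callback
        self.active_seconds = active_seconds
        self.active_until = 0.0

    def on_recognized(self, phrase):
        """Call this with each phrase returned by the speech recognizer."""
        now = time.monotonic()
        if phrase == self.wake_word:
            self.active_until = now + self.active_seconds
            return "listening"
        if now < self.active_until and phrase in self.commands:
            self.commands[phrase]()           # execute the mapped action
            return "executed"
        return "ignored"                      # locked or unknown phrase
```

A caller would, for example, register a hypothetical mapping such as `GatedVoiceCommands("system", {"increase flow": increase_flow})` and feed every recognized phrase into `on_recognized`.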

Meng et al. [38] not only developed another gesture-to-mouse-event mapper, but also present a system to connect the multi-user interaction to all software on the different displays in the OR. The user wears a structured-light sensor on the head and points with one finger in the direction in which the mouse cursor shall be moved. The RGBD sensor detects the finger and the corresponding display, which is calibrated with 2D markers. The sensor data are processed on a wearable computer that is connected wirelessly to different mobile devices. These are plugged into the different computers and control the cursor on each display. A user study with 7 participants was conducted in a simulated OR environment and revealed good usability. However, the system’s response to the users’ movements must be improved.

The aforementioned publications are summarized in Table 6.

Commercial state of the art

Some of the approaches described above have already found their way into commercial products. The company Therapixel developed a medical image viewer software called Fluid which can be controlled by a depth sensor that recognizes hand gestures. The GUI of the client software is particularly designed for touchless interaction. Different clients are connected to a central server which caches DICOM data, builds 3D reconstructions and serves as a PACS gateway. Another commercial solution is distributed by Gestsure. The product emerged from the system presented by Strickland et al. [68]. A Kinect 1 is used to map arm gestures onto mouse actions, i.e., a USB mouse is emulated and can be used with any (OR) software. A different approach was followed by TedCas. The company developed a connectivity box for medical applications called TedCube which takes a number of different gestural control sensors as input, independent of the operating system, and maps movements of hands, arms or eyes to keyboard or mouse commands. They also provide an LMC-based interface to control the mobile DICOM viewer TedSIGN. NZTech introduced a projection-based augmented reality interface to interact touchlessly with medical image data. The system consists of a ceiling-mounted projector–camera unit and an optional self-developed hand gesture sensor integrated into a table. A user interface with control elements for a medical image viewer software is projected onto an arbitrary surface in the sterile area. By hovering the hand over the projected buttons or by executing circle and pointing gestures above the sensor, the physician can trigger the desired actions directly and sterilely. SCOPIS GmbH provides touchless hand gesture control of a surgical navigation system based on an LMC. The user can interact with medical images and 3D visualizations via a mouse emulation or gesture interface.

Discussion

This literature review presents the first broad overview of systems providing touchless software interaction for sterile and direct intraoperative or interventional use. The list of publications contains various technical approaches for a diverse set of objectives. While most of the papers deal with touchless control of medical image viewers and laparoscopic devices, there are also publications regarding robotic and telerobotic assistance or general OR control.

Before 2013 there was great diversity in approaches to a natural user interface in the medical domain with RGB cameras, stereo cameras, one time-of-flight camera, body-worn inertial sensors or voice recognition, but only 22 papers had been published by that time (see Fig. 2), an average of 2.2 publications per year. The release of the consumer-grade console and PC gaming range camera Microsoft Kinect 1 in 2010 and of the Kinect SDK in 2012 obviously had a big impact on the human–computer interaction community. This depth sensor made it easy to create touchless interfaces at an affordable price, which is reflected by the relatively high number of 21 publications and one commercial product introducing touchless interfaces with the Kinect 1 since 2013. The market introduction of the LMC had a similar effect, leading to 10 publications and two companies adopting the device since 2014. Between 2014 and 2015, 10.3 papers were published per year on average. Compared to the years before, a rising trend in the number of publications, and thus in interest in the research area, can be seen.

Fig. 2 Number of publications dealing with touchless interaction in the OR per year; the number for 2016 is as of June

The high number of publications introducing very similar approaches to touchless medical image viewer gesture control shows that in the future the research community does not need to produce yet another gesture interface for medical image interaction in a specific application. Instead, more effort has to be put into improving and evaluating usability and intuitiveness. Especially in the medical domain, it is crucial to design user interfaces that help physicians to be more effective and thus shorten treatment duration. Nevertheless, only a few touchless interaction systems have been evaluated properly in this respect: 8 groups tested their systems in a real clinical setting; 7 did not provide evaluation details at all. The remaining 40 systems have been examined in the laboratory, but often with too few participants (fewer than ten), which can skew the results. For example, Nathan et al. [46] and Hötker et al. [24] obtained better voice recognition rates under laboratory conditions than Alapetite [2] or Perrakis et al. [56] did in a real OR environment. Hence, the research community needs to rethink its user interface evaluation methods to eventually enable the integration of touchless interaction capabilities into the OR. Hettig et al. [22], Saalfeld et al. [61], Nouei et al. [50], Opromolla et al. [52], Meng et al. [38] and Wipfli et al. [77] are examples of appropriate usability testing, since they follow a well-defined study concept, use standardized usability questionnaires and provide qualitative and quantitative data as proof. However, such usability studies always have to be reviewed critically, since the tested implementation and study setup might not be the best possible solution. It is therefore not possible to draw a conclusion about the applicability of an interaction device based on a single implementation.

Further research also needs to be done on hardware specialization and optimization. Most of the systems in the literature are built with commercial products like the Microsoft Kinect 1 or the LMC, which are general-purpose hardware designed for games and home use, with, in the case of the Kinect 1, a large working space but a very low resolution of the imprecise depth data. Custom-built, high-quality, use-case-specific devices could bring advantages in robustness as well as in usability. Additionally, it is promising to use multimodal interaction to fulfill the requirements of the clinical application. As proposed by Mentis et al. [39], voice commands could be used as triggers, e.g., for input unlocking or as a functionality switch, and hand or body gestures may be used for the continuous manipulation of parameters. If the user needs to interact with multiple applications on different displays, new gaze tracking approaches as in Meng et al. [38] can help to switch between them.

As proposed by Grange et al. [19] and Jacob and Wachs [28], clear software restrictions should be implemented to prevent unintended input. To better understand the intention of the interacting person, sensor fusion of all instruments and interaction modalities in the OR and interventional radiology suite will be of interest; it must therefore be possible to gather all the information in one place. The OR.NET project aims at providing a signal bus and software protocol for the safe interconnection of instruments in clinical environments. Together with standardized workflows for medical procedures [26, 27], much of the manual physician–computer interaction can be replaced by the automated presentation of relevant information in the respective workflow step. This way physicians can concentrate on the actual intervention, because the necessary interaction with software is reduced.

Another crucial condition for increasing the usability of touchless input devices is feedback for the user. Compared to mouse, keyboard and touchscreen interaction, there is no haptic feedback (except in Mylonas et al. [45]). Thus, as stated by Silva et al. [65], visual or even auditory feedback [30, 54, 74] is important to prevent confusion, but it is often not provided.

Many of the technical solutions presented in the field of intraoperative gesture interaction are prototypes emulating mouse interaction on top of existing medical software. Although there are good approaches for integrating touchless interaction more tightly into existing software via mobile devices [38], one has to move past the 2D WIMP paradigm to enable truly natural user interfaces [76]. The manufacturers of medical software will have to redesign their user interfaces, including menus and action triggers, so that users can benefit from 3D interaction devices.

Although most of the groups present very sophisticated and robust methods to control a medical image viewer, telerobotic systems or other intraoperative assistance (with recognition rates mostly above 90 %), it has to be noted that in the medical domain an almost always correct recognition of input commands is mandatory to decrease the risk for the patient and the workload of the clinical staff. Future research should therefore focus on further increasing the robustness and accuracy of touchless control. In the past years, dozens of sophisticated and easy-to-use machine learning frameworks have emerged, e.g., Theano, TensorFlow or the Stanford CoreNLP Natural Language Processing Toolkit [36]. These toolkits enable developers to train and classify voice commands or gestures [47, 48], given that training data are available, and to reach much better recognition rates than with conventional methods, even in the noisy or cluttered environment [12, 23, 43] of an OR or interventional radiology suite. Natural language processing is constantly improving. Chan et al. (2016) [9] presented a neural network that learns to transcribe speech utterances to characters at a word error rate of 14.1 % without a dictionary or language model. Before, as Cambria and White (2014) [8] pointed out, a large knowledge base was needed to match existing vocabulary and recorded voice. Speech recognition with large-vocabulary knowledge works at a word error rate of 8 % [62]. With improved semantic information acquisition [3] and the promising advantages of neural network integrated circuits [40], it will be possible for computers to understand humans offline, without the need for large data centers that are only reachable through internet connections.
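To make the framework argument concrete, the following is a minimal TensorFlow/Keras sketch of a gesture classifier trained on fixed-length feature windows; the data here are synthetic placeholders standing in for recorded, labeled gesture windows, and the network size and window dimensions are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Placeholder dataset: fixed-length windows of gesture features
# (e.g., flattened joint positions or IMU samples) with integer labels.
num_windows, window_len, channels, num_gestures = 2000, 60, 9, 8
x = np.random.rand(num_windows, window_len, channels).astype("float32")
y = np.random.randint(0, num_gestures, size=num_windows)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window_len, channels)),
    tf.keras.layers.Conv1D(32, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_gestures, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=10, validation_split=0.2)
```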

Considering new interaction hardware approaches, a disruptive element in interactive medical software will be augmented reality HMDs such as the Meta 2 (Meta Company, Portola Valley, CA, USA) or the Microsoft HoloLens. They provide lightweight augmented reality glasses, which can be used to display relevant information or images spatially aligned with the patient, and, in the case of the HoloLens, eye tracking, gesture and state-of-the-art speech recognition capabilities to interact with them. Such devices have the potential to bring together all information in one place, similar to existing mobile augmented reality solutions [4, 31, 44], and additionally give the opportunity to interact with all relevant data naturally via sophisticated speech recognition (as mentioned before) and touchless gestures. The affordable price will make prototypes pop up in the medical domain, as the Kinect 1 and the LMC did.

With these technologies around the corner, and if the research community redirects its efforts toward increasing usability, it will become possible for physicians to interact with medical software in a context-aware, workflow-step-dependent and, finally, hands-free manner.