Introduction

Computer-assisted navigation systems provide surgeons with rich and complex multimodal data, enhancing intraoperative diagnosis, decision making, and surgical maneuvers. Despite the high reliability of such systems, they have not yet been fully integrated into the surgical workflow. The dominant way of conveying information in current navigation systems is based on visual displays, a method that assists the surgeon only via the unisensory perceptual channel of vision. One explanation is that we are biologically trained to localize objects, together with their semantic meaning, visually on a static Cartesian grid1. In a dynamic interaction with a navigation system, however, objects' qualities are constantly transforming into new states over time. This challenges the surgeon's cognition and creates complications, especially in a high-intensity environment such as an operating room. A related hand–eye coordination challenge is that the surgeon's visual attention has to be divided between navigation displays and the actual operation area, including the surgical tools, targeted anatomy, and surrounding critical structures. Such complications have not been completely resolved even in more recent augmented reality (AR)-based systems, where overlaying multiple virtual visual cues on the display may lead to change blindness or inattentional blindness2,3.

In cognitive psychological research, it has been shown that multisensory integration facilitates information processing. Multisensory integration, that is, the combination of multiple independent but causally correlated information sources from different senses, including auditory, visual, and haptic, improves performance on a wide range of tasks4,5. Research in computer-assisted surgery has not yet fully taken advantage of multisensory feedback, and there are unanswered questions in this regard. Distributing navigation data across multiple alternative channels would unload the single visual modality, creating new possibilities for presenting interaction data with computer systems more intuitively. In this article, we highlight the importance of alternative perceptual modalities for navigated surgery, investigate potential solutions, and discuss the future of surgical navigation systems. Human auditory perception, as opposed to visual perception, is not tied to a spatialized, atemporal Cartesian grid. Therefore, sonic qualities such as texture, timbre, and rhythm, which unfold over time, are better suited to conveying the temporal aspects of objects' qualities. The idea of using sound as a source of information is well founded in sonification research, where sonification is often defined as the systematic transformation of data relations into perceived relations in an acoustic signal to facilitate communication or interpretation that is reproducible6,7,8. The auditory channel as an alternative perceptual modality to visual feedback has proven to be beneficial in different domains, such as process monitoring, data exploration, and navigation9,10. It has been shown that sonification is effective in enhancing athletic performance11, rehabilitating stroke patients12, and diagnosing clinically relevant pathology patterns13. The challenge of sonification design for surgical navigation is to incorporate the complex dimensionality of the application scenario into an integrated audio stream that meets the clinician's expectations in terms of reliability, usability, and time efficiency.

We hypothesize that a multisensory navigation system improves the surgeon's perception in highly precise interventional tasks. This article, as a first step toward multisensory navigation, introduces a novel standalone four-DOF sonification methodology for the pedicle screw placement task in lumbar spine surgery. To demonstrate the feasibility of the solution, we evaluated the method in a phantom study with 17 orthopedic surgeons in terms of effectiveness, usability, and learnability, comparing it with conventional 3D visual navigation, the established state of the art with respect to accuracy. Although the surgeons had more experience using visual feedback, the study results confirmed the reliability of sonification for surgical navigation tasks and demonstrated the potential behind the core idea of this research.

Clinical motivation

Severe pathological conditions of the spine, including deformity, trauma, degenerative disc disease, and spondylolisthesis, can be treated using the established orthopedic surgical technique called spinal fusion or spondylodesis14,15. Spinal fusion implants, which consist of specialized screws that are driven into the pedicles of the respective vertebrae, are used to achieve a fusion between two or more spine segments, thereby immobilizing the respective region and absorbing biomechanical forces. In modern approaches, the surgeon prepares a guiding hole for the smooth insertion of screws, using a surgical awl or by drilling K-wires. To determine the central position of the guiding hole within the pedicle, the surgeon uses bony landmarks for orientation16,17. Optimal positioning is crucial for avoiding screw perforation, which can cause serious injury to the spinal cord and its surrounding nerves and vessels. Hence, accurate pedicle screw placement is essential for a successful surgical outcome, and success depends on the experience and anatomical understanding of the surgeon, especially in severe cases such as scoliosis, kyphosis, or congenital anomalies, where the chance of perforation is even greater18.

There are three main techniques for pedicle screw placement: freehand, fluoroscopy guidance, and stereotactic navigation19,20. The misplacement rate, that is, the rate of screws perforating the pedicle cortex to any degree, ranges from 5 to 41% in the lumbar spine and from 3 to 55% in the thoracic spine for the freehand technique19. The high rate of misplaced screws in the freehand approach, together with variable pedicle morphology and vertebral body size, motivates computer-assisted systems to improve surgical accuracy19. However, there is some disagreement about the necessary accuracy of pedicle screw placement21. A careful analysis of related studies16,17,18,19,21,22 shows that accuracy and safety depend on several factors, such as the vertebral level in question, the definitions of thresholds and safety zones, whether the pedicle cortex has been perforated or not, the applied technique, and the availability of datasets for comparison studies. Some studies22,23 have considered the freehand technique accurate and safe for pedicle screw placement, and many surgeons believe that slightly imprecise pedicle screw placement remains asymptomatic. However, even asymptomatic cases can cause implant instability, prevent smooth fusion, or expedite adjacent-level degeneration21,24. Conventional fluoroscopy has not entirely solved the problem, as the misplacement rate has been reported as 31.9%21 and even higher in more challenging cases22.

Conversely, computer-assisted systems for pedicle screw navigation have been shown to be more accurate, with reduced complications25,26,27,28,29. Intraoperative image-guided navigation has evolved in recent years, as established approaches such as 2D and 3D fluoroscopic navigation have increased the rate of successful placements to 84.3% and 95.5%, respectively21. Furthermore, computer-assisted navigation reduces the reliance on intraoperative imaging and thereby the radiation dose required by conventional fluoroscopy30,31,32. However, although 3D fluoroscopic navigation is the most accurate current solution for pedicle screw placement and is accepted as a standard method according to different in vivo studies19,21, the adoption of such technologies in the surgical workflow has been slow, requiring further system improvements33,34. In a worldwide survey on the use of navigation in spine surgery conducted by Härtl et al.33, although 80% of 677 participants acknowledged the use of navigation systems, the authors concluded that current systems do not meet surgeons' expectations in terms of usability, time efficiency, and integration into the surgical workflow. Participants cited complexity of use and disruption of the surgical workflow as major factors. Additionally, they considered time-consuming training a prerequisite for the integration of such systems, a finding supported by Ryang and colleagues35. Current navigation systems predominantly provide surgeons with information through visual displays, increasing the surgeons' cognitive load and complicating hand–eye coordination. Surgeons must unnaturally divide their attention between the operation site and navigation displays36, or their field of view becomes cluttered with multiple holographic cues visualized on head-mounted displays. Visual distraction is problematic for surgeons, considering that they need to perceive and process complex navigation data at the highest level of precision in the intense and stressful situation of a surgical environment2,3.

Informed consent

The informed consent for publication of identifying information/images in an online open-access publication has been obtained from the study participants.

Related studies

Among surgical navigation systems, we focus on AR-based solutions, of which sonification is an emerging branch. AR has been shown to be beneficial for surgical applications37,38, in particular for orthopedic surgery39. AR technology has the advantage of superimposing preoperative planning on intraoperative anatomy, which, in the case of visual-centric AR, provides surgical navigation information in the surgeon's field of view. Previous studies have proposed a body of AR-based navigation solutions for pedicle screw placement40,41,42,43. Similar approaches based on tool-mounted mobile devices have been used to provide information in the line of sight of the surgeon44,45,46,47. However, all these approaches have utilized visual feedback as the sole feedback modality. As discussed in the "Introduction" and "Clinical motivation" Sections, the inherent limitations of purely visual feedback, such as change blindness and inattentional blindness2,3, are the motivation behind the research reported here.

Sonification for navigation purposes was initially designed as a natural application for people with visual impairment48,49,50,51,52,53 and has since been expanded to more general applications54,55,56. Sonification of one-dimensional data using primitive sound synthesis methods, such as in heart–lung machines, has already been integrated into surgical procedures. Such basic sonification methods do not extrapolate well to more complex multidimensional scenarios, as they lack consideration of psychoacoustics and sound design in their configuration. To address this problem, sonification methods57,58,59 have been proposed with more focus on usability and clinical integration, using more flexible and creative sound designs; however, these approaches are unsuitable for presenting precise navigation data.

Sonification methodologies for medical applications have mostly focused on image-guided navigation scenarios. Black et al.60, in a review paper, named three primary motivations for sonification in surgical navigation: (1) increasing awareness of structures surrounding the tracked instrument, (2) reducing attention to the screen or increasing attention to the patient or test phantom, and (3) helping clinicians correctly interpret (multidimensional) navigation data. Wegner et al.61 recommended different mapping ideas, such as 3D audio spatialization, for generalized 3D surgical instrument placement. Sonification in the form of proximity alerts has been proposed for endoscopic cranial base surgery62, temporal bone drilling63, protecting facial nerves during otologic surgery64, guiding cochlear implantation65, and fluorescence-guided resection of gliomas66. More elaborate approaches have been introduced in67,68,69,70, using continuous parameter-mapping sonification for surgical needle guidance in one dimension. Solutions for one-dimensional distance mapping have been investigated by Plazak et al.71, who proposed five different mapping strategies, and Roodaki et al.72, who introduced a sonification design based on physical modeling sound synthesis that requires minimal training.

Sonification research in recent years has aimed to expand in terms of data dimensionality and degrees of freedom (DOF). Parseihian et al.54 investigated the efficiency of different sonification strategies in terms of rapidity and precision for a one-dimensional guidance task. Sonification of multidimensional data is challenging72,73, and researchers have investigated the potential of spatial sound to overcome this challenge in 2D74 and 3D space75. Such approaches have been relatively successful when combined with visual guidance. Spatial sonification, as an intuitive and natural method with a high learnability rate, is suitable for orientation tasks54,76. However, spatial sound does not provide the precise distance and angular resolution required for precise surgical guidance tasks77: the resolution of spatial localization is approximately \(1^{\circ }\,\pm \,3^{\circ }\) along the frontal horizontal axis and decreases toward the sides, and the resolution of distance estimation is on the order of decimeters at short range78. Conversely, monaural sonification provides flexibility in design, and its efficiency is supported by our inherent perceptual capabilities: we can discriminate pitches in a range of 640–4000 steps78, 120 levels of loudness78, and 250 levels of sharpness78,79. Monaural approaches are efficient regarding dimensionality and resolution; however, they introduce design challenges in terms of intuitiveness and learnability. Sonification methods have been proposed for guidance in 2D55,56 and 3D76 space, providing information such as distance or orientation; these methods employ monaural sonic characteristics such as pitch, amplitude, and timbre.

A review of the state of the art reveals a lack of research on methodologies for surgical tool guidance in two or more dimensions that would be integrable into highly sensitive application scenarios such as pedicle screw placement. In pedicle screw placement, the surgeon aligns the drill with a predefined target trajectory, which can be mathematically defined by two points, the entry and angular target points. Optimal positioning of the tool on these two points requires tool movement in four DOF. To the best of our knowledge, there is no prior study on four-DOF sonification. There are essential questions to address regarding the approach's effectiveness and usability: to what level of precision and accuracy can sonification provide information with appropriate immediacy, while simultaneously achieving the level of usability required in surgical situations? The majority of methods lack clinical-grade, integrable sound designs, and Black et al.60 described current sonification approaches as simple. There are only a limited number of studies that have compared the effect of sonification with visual feedback, and there is a dearth of comprehensive studies on clinical evaluation and training.

Computer-assisted auditory navigation system

Our approach to providing auditory navigation assistance to surgeons consists of two main components, the navigation and sonification modules.

The navigation module comprises a workstation and an infrared optical tracking camera. Its goal is to provide, intraoperatively, the position and orientation of the drill controlled by the surgeon. Prior to the operation, the trajectories of the screws are preplanned on the basis of a preoperative computed tomography (CT) volume of the patient, and a registration method is performed to align the preoperative CT with the intraoperative coordinate system of the camera. Intraoperatively, using the camera, markers on the drill sleeve are tracked relative to reference markers on the patient's bed. The real-time position and orientation of the targets relative to the drill tip are sent to the workstation and used to compute error parameters (described in the "Error parameters" Section). This information is, in turn, transferred to the sonification module, which generates the output sounds accordingly.

Error parameters

We define pedicle screw placement as a four-DOF alignment task between the tracked drill sleeve, \(T_{tool}\), and the preoperatively planned trajectory, \(T_{target}\). The first two DOF correspond to the translation of \(T_{tool}\)'s tip point projected onto the entry point plane \(P_{entry}\). \(P_{entry}\) is defined by taking the main direction of the planned trajectory \(T_{target}\) as the plane's normal and the planned entry point into the bone as the center of the plane. Hence, both \(T_{target}\) and \(P_{entry}\) are updated according to each pedicle screw's planned trajectory. The entry point errors \(e_{x}\) and \(e_{y}\) are the components of the displacement between the center point of the \(P_{entry}\) plane and the projection of the drill sleeve's tip point onto \(P_{entry}\), in the mediolateral and caudocranial directions, respectively.

The remaining two DOF correspond to the orientation mismatch between \(T_{tool}\) and \(T_{target}\). This angular error is decomposed into two values, \(e_{\phi }\) and \(e_{\delta }\), which are Euler angle differences between the projections of the \(T_{tool}\) and \(T_{target}\) on the axial \(XY_a\) and sagittal \(YZ_a\) planes, respectively, in the anatomical coordinate system \(XYZ_a\), as illustrated in Fig. 1. The orientation error on the \(XZ_a\) plane is negligible because of the symmetry of the tool and the preplanned trajectory. The anatomical coordinate system stays constant throughout the execution of all pedicle screw placements.
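To make the geometry of the four error parameters concrete, the following sketch (an illustrative Python/NumPy reconstruction, not the authors' implementation) projects the tool tip onto \(P_{entry}\) to obtain \(e_{x}\) and \(e_{y}\), and computes \(e_{\phi }\) and \(e_{\delta }\) as planar angle differences; the anatomical axis convention (x mediolateral, y caudocranial, z anteroposterior) and the vector inputs are assumptions made for illustration.

```python
# Illustrative sketch only (not the authors' code): error parameters of the
# four-DOF alignment, assuming all vectors are expressed in the anatomical
# frame XYZ_a with x = mediolateral, y = caudocranial, z = anteroposterior.
import numpy as np

def entry_point_errors(tip, entry_point, target_dir):
    """e_x, e_y: in-plane components of the tip displacement projected onto P_entry."""
    n = np.asarray(target_dir, float)
    n /= np.linalg.norm(n)                          # plane normal = planned direction
    d = np.asarray(tip, float) - np.asarray(entry_point, float)
    d_in_plane = d - np.dot(d, n) * n               # projection of the displacement onto P_entry
    u = np.array([1.0, 0.0, 0.0]) - n[0] * n        # ~mediolateral axis within P_entry
    u /= np.linalg.norm(u)
    v = np.cross(n, u)                              # ~caudocranial axis within P_entry
    return np.dot(d_in_plane, u), np.dot(d_in_plane, v)

def angular_errors(tool_dir, target_dir):
    """e_phi, e_delta: angle differences of the direction projections on the
    axial (XY_a) and sagittal (YZ_a) planes, in degrees."""
    def plane_angle(vec, i, j):                     # angle of the projection onto plane (i, j)
        return np.arctan2(vec[j], vec[i])
    e_phi = plane_angle(target_dir, 0, 1) - plane_angle(tool_dir, 0, 1)     # axial plane
    e_delta = plane_angle(target_dir, 1, 2) - plane_angle(tool_dir, 1, 2)   # sagittal plane
    return np.degrees(e_phi), np.degrees(e_delta)
```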

Figure 1

Three cross-sectional views of the CT of the spine phantom model, including the corresponding errors (\(e_{x}, e_{y}, e_{\phi }, e_{\delta }\)). The target and tool are visualized in green and red, respectively. (a) corresponds to the coronal view, visualizing \(e_x\) and \(e_y\) projected onto \(P_{entry}\); (b) represents the axial view, visualizing \(e_{\phi }\); and (c) shows the sagittal view, including \(e_{\delta }\).

Four DOF sonification model

Interactive alignment model

The interaction model is designed with two interactive phases, namely the entry point phase (EP) and the angle phase (AP), each with two DOF. There are also two static phases: the initial phase (IP), in which \(T_{tool}\) has not yet entered the entry point working area (\(W_{EP}\)), and the final phase (FP), in which \(T_{tool}\) has reached \(T_{target}\). First, the projections of \(T_{tool}\) and \(T_{target}\) on the \(P_{entry}\) plane have to be aligned (EP); then, the tool orientation is aligned with \(T_{target}\) (AP) while the tooltip stays in place. This implies that when the interaction is in the AP, the tooltip has already been aligned to \(T_{target}\). If the tooltip deviates from \(T_{target}\) during the AP, the sonification returns to the sound mappings of the EP.

The transitions between these phases and states are carried out using a threshold mechanism with two control parameters, d and \(\theta\): d is the 2D Euclidean distance between the projections of \(T_{tool}\) and \(T_{target}\) on \(P_{entry}\), and \(\theta\) is the 3D Euler angular distance between \(T_{tool}\) and \(T_{target}\). The user's interaction with the sonification model starts when the tooltip enters \(W_{EP}\), a circle on the \(P_{entry}\) plane with radius \(r_{EP}\) around the target entry point. Furthermore, we define the angular working area \(W_{Ang}\), which includes all \(T_{tool}\) poses with an Euler angular distance of less than \(\theta _{Ang}\) from \(T_{target}\); i.e., \(\theta < \theta _{Ang}\). The alignment task is accomplished when \(T_{tool}\) is aligned with \(T_{target}\) in all four DOF (Fig. 2).

Figure 2

Four DOF alignment model with four phases, initial phase (IP), entry point phase (EP), angle phase (AP), and final phase (FP). EP and AP are two interactive phases with continuous mappings, whereas IP and FP are the static phases with constant mappings.

In the interactive phases, EP and AP, we define two thresholds, namely the target and transition zones. The transition to the next step, that is, from EP to AP or from AP to FP, is executed only when the tool reaches the transition zone. When \(T_{tool}\) exits the target zone, the alignment returns to the previous step, that is, from FP to AP or from AP to EP. In these cases, the user must again reach the transition zone to proceed to the next step. This threshold mechanism, with the gap between the target and transition zones, smooths the interaction with the system, avoiding unwanted transitions due to slight hand tremor of the surgeon or optical tracking jitter (Fig. 3).

Figure 3

Illustration of the thresholds for transition between phases. (a) the circles demonstrate thresholds for the transition between IP, EP and AP; (b) the cones represent the thresholds for transition between AP and FP.
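The phase logic and its hysteresis can be summarized by the short sketch below (our reading of the description above, written in Python; not the authors' code). The numeric thresholds are those reported later in the "Comparison study" Section.

```python
# Illustrative phase-transition logic with target/transition-zone hysteresis.
# d: in-plane distance on P_entry (mm); theta: Euler angular distance (deg).
R_EP, THETA_ANG = 20.0, 30.0        # working areas W_EP and W_Ang
D_TARGET, D_TRANS = 2.0, 0.5        # entry point target / transition zones (mm)
A_TARGET, A_TRANS = 1.5, 0.375      # angular target / transition zones (deg)

def next_phase(phase, d, theta):
    if phase == "IP":                               # waiting for the tip to enter W_EP
        return "EP" if d < R_EP else "IP"
    if phase == "EP":
        if d > R_EP:                                # tip left the working area
            return "IP"
        return "AP" if d < D_TRANS else "EP"        # advance only inside the transition zone
    if phase == "AP":
        if d > D_TARGET:                            # tip drifted off the entry point
            return "EP"
        return "FP" if theta < A_TRANS else "AP"
    if phase == "FP":
        return "AP" if theta > A_TARGET else "FP"   # orientation drifted out of the target zone
    return phase
```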

Mapping to acoustic features

Sonification mapping is based on a continuous stream of pulse tones generated using the well-known FM synthesis method80. In FM synthesis, a modulating oscillator modulates the frequency of a carrier oscillator, which is an efficient way of producing complex sounds rich in harmonics with only two oscillators. The input to the sonification function is the 4D vector \((e_{x}, e_{y}, e_{\phi }, e_{\delta })\), whose components are described in the "Error parameters" Section. These components control the fundamental frequency of the synthesizer and the pulse rate of the pulsing stream. The human auditory system can interpret fundamental frequency and pulse rate separately, as they are orthogonal sound features. Perceptual orthogonality means that two simultaneously sonified quantities are perceived separately by the listener; in particular, when an FM synthesizer parameter changes, the change can be unambiguously attributed to its corresponding sound attribute. As a result of the complex, nonlinear processing of the auditory system, all physical quantities of the sound field are practically capable of directly affecting all perceptual attributes of sound, which makes it extremely challenging to attain perceptual orthogonality76,78. Depending on the alignment phase, the system controls which parameters of the input vector are used for parameter mapping. In EP, \(e_{x}\) and \(e_{y}\) are mapped to the fundamental frequency and pulse rate, respectively, whereas in AP, \(e_{\phi }\) and \(e_{\delta }\) are used. The pulse rate is interpolated linearly, whereas exponential interpolation is used for the fundamental frequency, as the human auditory system perceives pitch on a logarithmic frequency scale.

Because both the EP and AP phases use the same implementation of the synthesis function, we apply different ranges for the fundamental frequency of the FM synthesis to create a higher contrast between the two alignment phases. An experienced surgeon can generally approximate the target entry point using anatomical landmarks, but finding the target angle in a 3D environment is considerably more challenging. Therefore, we set the frequency range in AP (2 octaves) to be larger than in EP (1 octave) in order to achieve a higher resolution in AP. Furthermore, high-pitch fluctuations of the sound in such micro-temporal interactive tasks are likely to cause fatigue, so a larger range is allocated to lower pitches. To facilitate learning, the range of pulse intervals in EP and AP is identical. The lower bound was selected because, in our design, values smaller than 0.1 s cannot be perceived as discrete pulses, so parameter changes would not be distinguishable. The upper bound was chosen as a trade-off between interaction delay and a larger mapping range. In the IP and FP, the sonification is limited to musical major and minor chords, respectively, both pulsing at constant but different rates, as listed in Table 1. In each interactive phase, when \(T_{tool}\) reaches the \(T_{target}\) value in only one dimension, an earcon is played to facilitate finding the target in the second dimension. These so-called optimum earcons consist of two sequential notes that differ slightly depending on which target dimension has been reached: the optimum earcon for \(e_{x}\) and \(e_{\phi }\) is the same, whereas a slightly different earcon is used for \(e_{y}\) and \(e_{\delta }\). To make the transitions clear, two additional earcons were designed, consisting of eight sequential notes in ascending order for the EP-to-AP transition and in descending order for the AP-to-EP transition, representing forward and backward movement in the procedure. The earcons' parameters, including the IP and FP chords, were selected such that they can conveniently be distinguished and learned by the user.
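As an illustration of the parameter mapping described above, the following sketch (a simplified Python example under our assumptions, not the authors' SuperCollider implementation) maps a normalized error linearly to a pulse interval and exponentially to a fundamental frequency. The 0.1 s lower pulse bound and the one- and two-octave ranges follow the text, whereas the base frequency and the upper pulse bound are assumed example values.

```python
# Illustrative mapping sketch: normalized error e in [0, 1] -> sound parameters.
def pulse_interval(e, t_min=0.1, t_max=1.0):
    """Linear mapping to the pulse interval in seconds (t_max is an assumed value)."""
    return t_min + e * (t_max - t_min)

def fundamental_frequency(e, f_base=220.0, octaves=1.0):
    """Exponential mapping over a pitch range of `octaves` (1 in EP, 2 in AP);
    f_base is an assumed example value."""
    return f_base * 2.0 ** (e * octaves)

# Example: in EP, |e_x| drives the frequency and |e_y| the pulse rate.
f = fundamental_frequency(0.5, octaves=1.0)   # ~311 Hz, halfway up one octave
dt = pulse_interval(0.5)                      # 0.55 s between pulses
```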

Table 1 Parameters for the FM synthesis mapping functions with the input data \(e = (e_{x}, e_{y}, e_{\phi }, e_{\delta }) \in [0, 1]\) for entry point phase (EP), angle phase (AP), initial phase (IP), and final phase (FP).

Comparison study

To compare the sonification and conventional visual navigation methods, we conducted an experiment with 17 orthopedic surgeons: 4 senior experts and 13 assistant surgeons. In the study, participants performed the pedicle screw placement procedure on phantoms. We used phantom models of the lower lumbar spine (manufactured by Synbone AG, Zizers, Switzerland) consisting of vertebrae L1–L5. The phantoms incorporate facet joints and discs, which create more realistic intervertebral movement. To simulate the surrounding anatomical landmarks similarly to the real surgical environment, we covered the phantoms with Play-Doh to hide the deeper and medial areas around the drilling surface, as shown in Fig. 4. Each surgeon drilled 20 pedicle screws on two phantoms, alternating between auditory and visual navigation. Our primary measures were the entry point distance error and the angular error between the executed and preplanned trajectories. For the procedure with conventional 3D visual navigation, participants performed the four-DOF alignment based on cross-sectional CT slices from three views: the coronal view visualizes \(e_{x}\) and \(e_{y}\) on the \(P_{entry}\) plane, aligned to the 3D anatomical coordinate system; the axial view visualizes \(e_{\phi }\) on the \(XY_{a}\) plane; and the sagittal view corresponds to \(e_{\delta }\) on the \(YZ_{a}\) plane (Fig. 1). For the visual model, similarly to the sonification model, tracking markers are used to track the drill sleeve's position relative to a reference marker fixed on the phantom's bed. The real-time processing of the tracking data is performed by the workstation and transferred to the visualization module, which renders the image on a visual display. Figure 4 shows the experimental environment.

Figure 4

The experimental setup: (a) the phantom covered with Play-Doh, (b) the task assisted with visualization, (c) the task assisted with sonification.

Starting with a preoperative CT of one of the phantoms, a senior spine surgeon planned 10 lumbar pedicle screws on L1–L5. The preplanned trajectories were aligned to each phantom before starting the trials using a landmark registration method. For the landmark registration, eight points were collected on the most lateral section of each transverse process of L1–L4. L5 was excluded because we observed slight variations among the L5 levels of different phantoms; therefore, a higher error for the L5 evaluation would be expected.

We used the fusionTrack 500 real-time optical tracking system (Atracsys) and passive infrared markers for tool tracking. The tracking targets on the drill sleeve (3.2 mm, No. 03.614.010, Synapse System) and on the phantom's bed were designed with four passive spheres each. Pivot calibration81 was performed on the drill sleeve target to transform the tracking coordinates to the center of the drill sleeve's tip. The real-time processing of tracking data (at 50 Hz) was implemented using the ImFusionSuite software (ImFusion GmbH, Munich, Germany; https://www.imfusion.com). The generation of sounds was implemented with the SuperCollider3 (https://supercollider.github.io/) software platform for audio synthesis. The communication between the ImFusionSuite and SuperCollider modules was established using the OSC networking protocol82. Finally, the generated audio signal was sent to a pair of two-way bass-reflex studio monitors and played to the surgeons.
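For illustration, streaming the error vector over OSC could look like the sketch below, written with the python-osc package; the address pattern "/error" and the message layout are our assumptions (the study streamed data from ImFusionSuite to SuperCollider, whose language typically listens on UDP port 57120).

```python
# Illustrative OSC streaming sketch (not the authors' implementation).
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 57120)   # SuperCollider's default language port

def send_errors(e_x, e_y, e_phi, e_delta, phase):
    # one message per tracking frame (tracking data were processed at 50 Hz)
    client.send_message("/error",
                        [float(e_x), float(e_y), float(e_phi), float(e_delta), phase])
```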

The working area's radius, \(r_{EP}\), was set to 20 mm, and the working area's angle, \(\theta _{Ang}\), was set to \(30^{\circ }\). The target zone thresholds for both alignment phases (EP and AP) were set to 2 mm and \(1.5^{\circ }\), and the transition zone thresholds were set to 0.5 mm and \(0.375^{\circ }\). The choice of these parameters was based on a pilot experiment with an expert spine surgeon; the optimal values depend on the accuracy of the tracking system, registration, and calibration.

Each participant received a short introduction to the method (\(\approx\) 5 min). The trials consisted of two phases, a training phase and an execution phase. In the training phase, the participants were asked to conduct 10 alignment tasks with the aid of sonification, on L1–L5 on both sides of the phantom. In the execution phase, they were asked to conduct the alignment and drilling on two phantoms, resulting in 20 executions on the same vertebral levels. The executions were divided into four sequences, each assisted with either visualization (V) or sonification (S). We randomized the order of the sequences between subjects as VSVS or SVSV. Each subject started on either the left or right side of the first phantom and on the opposite side of the second phantom, again in a uniformly randomized order.

The secondary outcome measures were the alignment time and the participants' cognitive load. The alignment time is defined as the duration between two events, namely the start of the alignment and the start of drilling. The timestamps were recorded by the trial examiner, who pushed a button at each event. The cognitive load was assessed by asking the participants to respond to a questionnaire, including the NASA Task Load Index (NASA-TLX), a subjective workload assessment measure, and four additional questions provided by the authors. The additional questions were as follows: Q1: Which method helped you better to find the target entry point? Q2: Which method helped you better to find the target angle? Q3: How do you evaluate the overall usability of both systems? Q4: Which navigation feedback method would you like to use in the future?

Ethics statements

This study does not fall within the scope of the Human Research Act (HRA). According to the clarification of responsibility approved by the Institutional Ethics Committee of Canton of Zurich, Switzerland (BASEC-Nr. Req-2021-00820), authorization from the ethics committee was not required.

Evaluation and results

Evaluation

We compared the preoperatively planned trajectories with the postoperative CT of the drilled phantoms. To detect the exact drilled path, cylindrical graphite sticks with the same diameter as the drill (\(3\,\mathrm{mm}\)) were inserted into the phantoms before taking the postoperative CT. The average length of the graphite sticks was \(5\,\mathrm{cm}\). The centers of the first and last disks of each graphite stick were manually labeled for every drilled screw. We also marked the actual point where the drill had entered the bone phantom. The actual entry point may be slightly deeper in the bone phantom than in the planned trajectory because part of the bone surface was removed by the surgeons to create a flat surface on the pedicle, stabilizing the drill sleeve and preventing sliding; a similar procedure is performed during real surgery.

The preoperative and postoperative CT volumes were registered using an image-based registration method. To minimize movement and possible deformation between vertebrae, we did not remove the Play-Doh before taking the postoperative CT, which caused appearance differences between the CT volumes. To resolve this issue, we masked the image-based registration to a 3-mm area around the segmented vertebral surface. The registration algorithm was manually initialized within its capture range, and a nonlinear optimizer with the \(\hbox {LC}^{2}\)83 similarity metric was used to register the volumes.

Results

The results of the post-CT analysis revealed a total mean error of \(1.82\,\mathrm{mm} \pm 0.89\,\mathrm{mm}\) for the entry point and \(1.75^{\circ }\,\pm \,1.01^{\circ }\) for the angle, as deviations from the planned trajectories (CT error, \(n\,=\,336\)). In comparison, the system-generated data, which were used to generate both the visual and auditory feedback, resulted in a mean error of \(0.82\,\mathrm{mm}\,\pm \,0.46\,\mathrm{mm}\) and \(0.88^{\circ }\,\pm \,0.47^{\circ }\) (feedback error, \(n\,=\,323\)). We estimated our system error (tracking, registration, and calibration) by pairwise subtraction of the CT and feedback errors (\(n\,=\,314\)), yielding a mean of \(0.98\,\mathrm{mm}\,\pm \,0.77\,\mathrm{mm}\) and \(0.82^{\circ }\,\pm \,0.92^{\circ }\). The mean CT error for visualization (\(n\,=\,167\)) was \(1.67\,\mathrm{mm}\,\pm \,0.87\,\mathrm{mm}\) and \(1.78^{\circ }\, \pm \,1.04^{\circ }\), and for sonification (\(n\,=\,167\)) it was \(1.96\,\mathrm{mm}\,\pm \,0.88\,\mathrm{mm}\) and \(1.69^{\circ }\,\pm \,0.96^{\circ }\). The details of the errors over the expertise groups and the spinal levels are presented in Figs. 5 and 6 and Table 2.

Figure 5

The mean (CT) error of the angle (top) and entry point (bottom) over the expertise groups. The white dots in the middle of the box plots represent the mean.

Figure 6

The mean (CT) error of the angle (left) and entry point (right), per spinal level. The white dots in the middle of the box plots represent the mean.

Table 2 The mean errors of the angle (ANG) and entry point (EP) over the vertebral levels L1–L5.

We expect an error margin of less than 3 s in the recording of each alignment's completion time. The mean alignment time was \(33.5\,\mathrm{s}\,\pm \,16.1\,\mathrm{s}\) for visualization and \(44.1\,\mathrm{s}\,\pm \,21.6\,\mathrm{s}\) for sonification. The details of the alignment times of the first and second executions on the same level are shown in Fig. 7.

Figure 7

Alignment time of the first and second executions over the spinal levels.

Fourteen individuals (3 experts and 11 assistant surgeons) returned the questionnaires. The small sample size \((n=14)\) made it necessary to determine the distribution of the variables visualization (V) and sonification (S) before selecting an appropriate statistical method. A Shapiro-Wilk test was performed and did not reveal any evidence of non-normality \((V: W=0.97, p\;value=0.88; S: W=0.91, p\;value=0.17)\). On the basis of this outcome, we analyzed the NASA-TLX data using a t-test for independent samples with unequal variances. The results indicated a P value of 0.59; therefore, we failed to reject the null hypothesis of equal means in the samples. Accordingly, we found no significant difference in cognitive load between sonification and visualization. The responses to Q1 and Q2 (which method better helped to find the target entry point and the target angle, respectively) had the same proportions; i.e., 10 individuals voted for visualization, 3 for sonification, and 1 believed there was no difference between the methods. In response to Q3 (overall usability), 6 individuals responded that visualization was more usable than sonification, 5 were of the opinion that both methods were equally usable, 3 believed the sonification method was better than visualization in terms of usability, and no one chose the option "none of the methods are acceptable". Finally, in response to Q4 (which method would you like to use in the future), the majority of respondents (12 individuals) preferred a system that combines both methods, one each voted for visualization and sonification, and no one chose the option "none of them".
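The questionnaire analysis above follows a standard pattern, sketched below with SciPy for illustration (the NASA-TLX score arrays are synthetic placeholders, not the study data, and whether the original analysis used exactly these routines is an assumption).

```python
# Illustrative sketch: normality check followed by a Welch t-test.
import numpy as np
from scipy import stats

def compare_tlx(scores_v, scores_s, alpha=0.05):
    """Shapiro-Wilk normality check on both samples, then a t-test for
    independent samples with unequal variances (Welch)."""
    _, p_v = stats.shapiro(scores_v)
    _, p_s = stats.shapiro(scores_s)
    if min(p_v, p_s) < alpha:
        print("Warning: evidence of non-normality; consider a rank-based test.")
    _, p = stats.ttest_ind(scores_v, scores_s, equal_var=False)
    return p

# Usage with synthetic placeholder scores (n = 14 per condition, not the study data):
rng = np.random.default_rng(0)
p_value = compare_tlx(rng.normal(50, 15, 14), rng.normal(52, 15, 14))
```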

Discussion

We proposed four-DOF sonification as a novel method for pedicle screw placement, investigating an alternative path toward multisensory assistive technology in the surgical context. The challenge was to design a clinically compatible and accurate system that simultaneously fulfills usability requirements and is competitive with its more conventional visual counterpart. The results of the comparison study against the state of the art, as presented in the "Evaluation and results" Section, offer clear support for the idea behind this research.

Accuracy

Many clinical and anatomic studies have considered the accuracy of pedicle screw placement as the rate of successful screw placements, where a successful placement has often been defined as one fully contained in the pedicle cortex without any degree of perforation. The degrees of misplacement have been defined as \(<2\,\mathrm{mm}\) (Grade A), 2–4 mm (Grade B), and \(>4\,\mathrm{mm}\) (Grade C)19,21,85. Under this definition, the accuracy of pedicle screw placement using 2D and 3D visual navigation has been reported as 84.3% and 95.5%, respectively21. To determine whether a particular system enables the safe performance of the task, we also need to specify the safety requirements. The clinical safety requirements depend on the type of procedure and the patient's anatomy, and the margin of error for a given pedicle depends on factors such as the size of the screw and the critical dimensions of the pedicle, such as the isthmus. Rampersaud et al.84 proposed a mathematical analysis method for calculating safe margins for the pedicle screw placement task: the maximum entry point and angular error tolerances for L1–L5, given 6.5-mm pedicle screws, are 0.65–3.8 mm and \(2.1{-}12^{\circ }\), respectively. The entry point error was defined as the distance between the actual screw insertion point and the ideal starting point for the screw (at the central axis of the pedicle), and the angular error as the angular deviation between the screw trajectory and the ideal trajectory (parallel to the central pedicle axis). For the same pedicle, the error tolerances increase when a smaller-diameter screw is used. Similarly to this approach, we calculated the error based on the deviations of the actual drilling trajectories from the preplanned targets (the ideal trajectories).

The overall accuracy of our navigation setup needs to be sufficiently high for the pedicle screw placement task so that we can conduct a valid comparison between the sonification and visualization methods. The accuracies for L4 and L5 in both modalities satisfy the accuracy requirements suggested by Rampersaud et al.84, as highlighted in Table 2. Conversely, the results for L1–L3 do not fully meet these requirements (assuming a 6.5-mm-diameter screw). However, during the actual procedure, the surgeon first drills a guiding hole, with a 3-mm diameter in our case, and then inserts a wider screw, with a 6.5-mm diameter, which enables the surgeon to manually refine the trajectory based on haptic feedback and the mechanical constraints of the pedicle wall. Therefore, the practical safety thresholds provide a slightly higher tolerance than Rampersaud's suggested thresholds. Moreover, as the state-of-the-art navigation method for pedicle screw placement has not yet provided a 100% success rate, we conclude that the navigation setup provided an acceptable range of accuracy for comparing the sonification and visualization methods. The evaluation of the sonification condition's error (\(1.96\,\mathrm{mm},\,1.69^{\circ }\)) indicated an accuracy similar to that of the visualization condition (\(1.67\,\mathrm{mm},\,1.78^{\circ }\)), both demonstrating better results than those reported in40 (\(3.35\,\mathrm{mm},\,2.74^{\circ }\)) and43 (\(2.77\,\mathrm{mm},\,3.38^{\circ }\)). Considering the estimated system error (\(0.98\,\mathrm{mm},\,0.73^{\circ }\)), which includes registration errors, calibration errors, and tracking data noise, we assume lower error bounds of \(0.65\,\mathrm{mm}\,\text {and}\,0.84^{\circ }\) for visualization and \(0.96\,\mathrm{mm}\,\text {and}\,0.79^{\circ }\) for sonification.

The first step in assessing accuracy was to conduct t-tests for independent samples, with the null hypothesis of equal means. Since the system feedback supports users in finding the target, we do not a priori assume a non-normal distribution for the alignment error. According to the Shapiro-Wilk test at the significance level \((P < 0.05)\), the experts' samples for the entry point in both V and S \((n=38)\) showed no evidence of a non-normal distribution. In contrast, the Shapiro-Wilk test showed a deviation from normality among the novices for the entry point and angle errors in V and for the angle error in S. One possible explanation is that novices exhibit relatively unstable performance, which results in more extreme values. Furthermore, the sample size of the novices \((n=117)\) is large enough that the Shapiro-Wilk test detects even the smallest deviation from normality (p value \(< 0.05\)), regardless of whether the variable is generally expected to be normally distributed. Considering these two facts, we conducted a Lilliefors test (a normality test based on the Kolmogorov-Smirnov test), which did not demonstrate evidence of non-normality for the novices. The error over all participants does not follow a normal distribution because it includes samples from both expertise groups with different average accuracies. However, because the Levene test failed to reject equality of variances, all other t-test assumptions are satisfied in this case; given the advantages of parametric tests, we therefore conducted t-tests over all participants as well. The t-tests for the entry point error among the experts and for the angle error in both expertise groups failed to reject the null hypothesis; that is, they did not demonstrate any significant difference between the two methods in these cases, which is consistent with the research hypothesis.

To investigate to what extent both methods have a similar effect, we applied an equivalence test for two independent samples, the two one-sided t-tests (TOST) procedure. TOST operates on an equivalence interval (EI) with lower and upper limits \((-\Delta _{L}, \Delta _{U})\) and two composite null hypotheses, H0-1: \(\Delta \le -\Delta _{L}\) and H0-2: \(\Delta \ge \Delta _{U}\). If both null hypotheses can be statistically rejected, we can conclude that the difference between the sonification and visualization samples, \(\Delta\), falls within the EI, \(-\Delta _{L}< \Delta <\Delta _{U}\), which is considered equivalent. The results indicate statistically significant equivalence of both methods within the EI \(\pm \,0.46\,\mathrm{mm}\), \(\pm \,0.23^{\circ }\) (\(P\,<\,0.05,\,n\,=\,155\)). Adding the upper limit of the resultant EI to the sonification error, we can estimate errors of \(2.42\,\mathrm{mm}\,\text {and}\,1.92^{\circ }\), which is still comparable with state-of-the-art visual navigation40,41. Details of the EI for the different expertise groups for the actual and feedback errors are presented in Table 3. In general, we observed a larger EI for the entry point than for the angle, because the entry point threshold (2 mm) was set larger than the angle threshold (\(1.5^{\circ }\)). Considering that the threshold mechanism in sonification does not provide the user with any feedback beyond the threshold level, a discussion of accuracy beyond this level would not be meaningful, and the sonification outcome there could present a random effect. The thresholds were set empirically during the pilot study as a compromise between accuracy and the user's convenience. With a system design offering more accurate tracking, the thresholds could be reduced, leading to more accurate results for sonification.
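For reference, the TOST procedure can be implemented with two one-sided t-tests on shifted samples, as in the sketch below (our generic implementation, not the authors' code; the use of the Welch variant and of SciPy are assumptions). Equivalence is claimed when both one-sided null hypotheses are rejected, i.e., when the larger of the two p values is below \(\alpha\).

```python
# Illustrative TOST sketch for two independent samples.
import numpy as np
from scipy import stats

def tost_ind(sample_s, sample_v, delta_low, delta_up):
    """Equivalence test for Delta = mean(S) - mean(V) within (-delta_low, delta_up)."""
    s = np.asarray(sample_s, dtype=float)
    v = np.asarray(sample_v, dtype=float)
    # H0-1: Delta <= -delta_low -> one-sided "greater" test on the shifted sample
    _, p1 = stats.ttest_ind(s + delta_low, v, equal_var=False, alternative="greater")
    # H0-2: Delta >= delta_up   -> one-sided "less" test on the shifted sample
    _, p2 = stats.ttest_ind(s - delta_up, v, equal_var=False, alternative="less")
    return max(p1, p2)   # overall TOST p value; < alpha implies equivalence within the EI
```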

According to the resultant equivalence intervals and the safety requirements presented in the literature84, we conclude that the sonification method is statistically equivalent in accuracy to the visualization method for the pedicle screw placement task, which supports the research's primary hypothesis.

Table 3 The smallest EI of the TOST reaching significance at \(P < 0.05\) and at \(P < 0.025\) (adjusted using the Bonferroni correction for the multiple comparisons problem), over the expertise groups, including both CT and feedback errors for entry point (EP) and angle (ANG).

Learning curve

To determine the training effect in both expertise groups, we compared the completion time and errors, as measures of performance, for two consecutive executions on the same vertebral level. To confirm learnability, we have to determine whether the mean duration of the second execution decreased compared with that of the first execution, without a significant decrease in accuracy. Hence, we conducted Wilcoxon signed-rank tests on the alignment times of both executions. The P values for each expertise group are shown in Table 4. Moreover, to detect any decrease in accuracy, we performed Wilcoxon signed-rank tests (\(\alpha = 0.05\)) on both the entry point and angle errors, which failed to reject the null hypothesis of no difference between the executions. The mean differences between the two executions are shown in Table 5.
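The paired comparison between the first and second executions corresponds to the following minimal sketch (illustrative only; the input arrays would hold the paired measurements per participant and vertebral level).

```python
# Illustrative paired test for the learning-curve analysis.
from scipy import stats

def improved(first, second, alpha=0.05):
    """Wilcoxon signed-rank test on paired measurements, e.g. alignment times
    of the first and second executions on the same level."""
    _, p = stats.wilcoxon(first, second)
    return p < alpha, p
```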

The mean differences in both errors (Table 5) and the results of the Wilcoxon signed-rank tests indicate that the accuracy remained consistent across both executions for all expertise groups. As presented in Tables 4 and 5, the experts demonstrated a significant decrease in time for both visualization and sonification. We observe this pattern in the less experienced group only for visualization (\(P\,<\,0.001\) in Tables 4 and 5). Considering the error consistency and the observed time patterns, we can conclude high learnability of both modalities for the experts; for the assistants, however, this was demonstrated only in the visualization condition.

Our interpretation is that the experience level of the expert group enabled them to focus more on learning the untrained auditory navigation method. Conversely, the assistant surgeons required more of their cognitive processing capacity for executing the screw drilling task, and therefore, they had a negative learning rate in the sonification condition. Because the visual navigation was more familiar for both groups, they could improve their speed on this modality. However, further research is required to accurately evaluate the effect of training for the sonification method and for developing a full picture of its learning curve.

Table 4 The P values of the Wilcoxon signed-rank tests on the alignment times of the first and second executions.
Table 5 The mean differences in entry point (CT) error, angle (CT) error, and alignment time between the first and second executions for each expertise group.

Multisensory processing and research outlook

Although a plurality of the questionnaire respondents preferred visualization (Q1–Q3), the absolute majority (12 out of 14) imagined that a desirable future system would combine the advantages of both modalities. Such responses from the field's experts are in accordance with the equivalent accuracy results, the NASA-TLX assessments, and the principles of multisensory perception. Multisensory solutions result in increased performance and recall, particularly in intense and complex sensory scenarios86,87,88. Research questions that could be asked in future studies include the following: To what extent can each modality convey complex information accurately? When is it better, or preferable, for the perceptual modalities to be presented in a complementary fashion, and in which situations do they have to provide redundant contextual information? How do our decisional resources respond to each perceptual cue? Future research in computer-assisted surgery can focus on investigating possible answers to such fundamental questions in the application field of surgical navigation, as the foundation is well established in cognitive science89,90.

Furthermore, previous sonification research50,51,52,53,54,55,56,76 has investigated different sonification strategies for navigation, providing a preliminary basis for further research. Future studies should take the pros and cons of sonification paradigms into account. For instance, spatial sonification, as an intuitive and natural method with a high learnability rate, is suitable for orientation tasks54,76; however, we cannot disregard its limitations with respect to resolution78. Additionally, spatial sonification may cause localization anomalies such as front–back confusion, vague distance and elevation perception, and orientation errors91,92. On the other hand, monaural sonification as a candidate approach provides efficiency78,79 and flexibility in design. Nonetheless, it requires a more prolonged learning phase, which can also depend on the design concept and parameterization.

Four-DOF sonification is a useful tool for scenarios with complex dimensionality and accuracy challenges, as demanded by surgical applications. We divided multiple dimensions into subsets and controlled switching between them using a threshold mechanism. This idea is expandable to contexts with higher dimensionality. Nonetheless, issues such as intuitiveness and the learning process have to be considered. Finally, we should consider that monaural sonification is a rather new approach, evolving in terms of dimensionality and interaction design. Even though the presented study exhibited promising results, future research should investigate the effect of enhanced learning phases on performance.

Conclusions

In this article, we have outlined the problem of complex data perception in high-intensity environments, such as the operating room, and highlighted the importance of multisensory processing and the development of creative solutions to overcome the information overload issue. To investigate the effects of multisensory processing, we conducted a study with 17 medical professionals in a lab environment using a spinal bone phantom and compared two different techniques for surgical navigation assistance. We proposed the four-DOF sonification method, a stand-alone audio-based solution, for navigating pedicle screw placement in spinal fusion surgery and compared it with state-of-the-art visual navigation. Four-DOF sonification did not demonstrate any statistically significant difference in performance compared with visual navigation. Considering the resultant equivalence intervals and the safety requirements, we conclude that the proposed method can be reliably used for the pedicle screw placement task. Moreover, the results for the secondary metrics, such as cognitive load, usability, and learnability, despite not reaching statistical significance, provided evidence that has led to valuable discussions and could open new paths of research for interdisciplinary teams of biomedical engineers, cognitive psychologists, sonification designers, and medical experts. The novel design concept of the method supports the idea of accurate sonification of high-dimensional data within a complex interactive task scenario. This study is a first step toward enhancing our understanding of perceptual multisensory processing in the surgical context (see Supplementary information).