Introduction

Image-guided surgical navigation systems (SNS) have become an increasingly effective clinical assistance tool for minimally invasive surgery. Since the concept was first proposed, this technology has developed rapidly and been applied in various fields, including orthopedics, neurosurgery and otorhinolaryngology. Generally, before the operation, imaging diagnosis with preoperative computed tomography (CT) or magnetic resonance imaging (MRI) is performed to analyze the surrounding anatomical tissues and to design surgical trajectories using computer-assisted preoperative planning software. At the time of surgery, under the guidance of a tracking system, the relative positions among the surgical tools, anatomical structures and planned trajectories can be visualized on a computer screen, guaranteeing operation accuracy and reliability [2, 3].

However, in accordance with strict operating-room requirements, everything the surgeon touches must be sterile, and disinfecting the hardware of the surgical navigation system is extremely troublesome and time-consuming [4]. Therefore, standard devices such as a keyboard and mouse used for human–computer interaction (HCI) are a latent vector of infection, posing risks to both patients and surgeons. Fortunately, three-dimensional hand gesture recognition based on a depth camera, as an efficient touch-free interface, has attracted increasing research interest [5, 6]. In general, non-contact hand gesture recognition approaches can be divided into two categories: (1) static hand gesture recognition, which mainly relies on distinguishing different static hand postures [7, 8]. Unfortunately, this category is infeasible in clinical application because of potential interference from complex surgical postures. (2) Dynamic gesture recognition, in which both single poses and continuous multi-label gestures can be distinguished by detecting the beginning and end of a specific gesture within a continuous motion trajectory [9,10,11]. However, almost all dynamic pose recognition approaches require extracting the beginning and ending points of a specific gesture, which is itself a complex task.

Therefore, combined with a depth camera, a gesture recognition algorithm is proposed on the basis of an optimized long short-term memory structure, i.e., multi-LSTM, which allows multiple isolated inputs and takes the relationships between the input layers into account. The multi-LSTM network was then attached to an in-house oral and maxillofacial surgical navigation system to serve as a non-contact user interface, and a phantom study was conducted to evaluate its clinical feasibility and reliability.

Methodology

The architecture of multi-LSTM

LSTM, an optimized recurrent neural network structure, can exploit long-range relationships in data on the basis of internal purpose-designed memory cells [12, 13]. Figure 1 presents a single LSTM memory cell, and its data flow can be formulated as:

$$ I_{t} = \delta \left( {u_{xi} * x_{t} + w_{hi} * h_{t - 1} + b_{i} } \right) $$
(1)
$$ F_{t} = \delta \left( { u_{xf} * x_{t} + w_{hf} * h_{t - 1} + b_{f} } \right) $$
(2)
$$ O_{t} = \delta \left( { u_{xo} * x_{t} + w_{ho} * h_{t - 1} + b_{o} } \right) $$
(3)
$$ C_{t} = F_{t} * C_{t - 1} + I_{t} * \tanh \left( {u_{xc} * x_{t} + w_{hc} * h_{t - 1} + b_{c} } \right) $$
(4)
$$ h_{t} = O_{t} * \tanh (C_{t} ) $$
(5)

where subscripts \( t \) and \( t - 1 \) denote the current and previous moment, respectively; \( \delta \) denotes the sigmoid function; \( I_{t} \), \( F_{t} \) and \( O_{t} \) are the values of the input gate, forget gate and output gate, respectively; \( u_{xi} \), \( u_{xf} \) and \( u_{xo} \) are the weights applied to the input components of the different gates; \( w_{hi} \), \( w_{hf} \) and \( w_{ho} \) are the corresponding weights of the previous output \( h_{t - 1} \); \( b_{i} \), \( b_{f} \), \( b_{o} \) and \( b_{c} \) are the bias terms; \( C_{t} \) represents the state of the current memory cell; and \( h_{t} \) is the cell output.
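For illustration, the following is a minimal NumPy sketch of one forward step of such a memory cell following Eqs. (1)–(5); the weight and bias names mirror the notation above, and the parameter dictionary `p` is a hypothetical container introduced only for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM forward step following Eqs. (1)-(5); p holds the weights and biases."""
    i_t = sigmoid(p["u_xi"] @ x_t + p["w_hi"] @ h_prev + p["b_i"])   # input gate, Eq. (1)
    f_t = sigmoid(p["u_xf"] @ x_t + p["w_hf"] @ h_prev + p["b_f"])   # forget gate, Eq. (2)
    o_t = sigmoid(p["u_xo"] @ x_t + p["w_ho"] @ h_prev + p["b_o"])   # output gate, Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(p["u_xc"] @ x_t + p["w_hc"] @ h_prev + p["b_c"])  # cell state, Eq. (4)
    h_t = o_t * np.tanh(c_t)                                         # cell output, Eq. (5)
    return h_t, c_t
```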

Fig. 1 LSTM cell

Figure 2 illustrates the architecture of multi-LSTM, which is mainly composed of two rows of associated serial LSTM cells. The calculation formulas of \( I1_{t} \), \( F1_{t} \), \( O1_{t} \) and \( C1_{t} \) are the same as Eqs. (1)–(4). For the upper row, the calculation of each node can be expressed as follows:

Fig. 2 Architecture of multi-LSTM

$$ h1_{t} = O1_{t} * \tanh (C1_{t} ) $$
(6)
$$ I2_{t} = \delta \left( {q_{hi} * h1_{t} + u2_{xi} * x2_{t} + w2_{hi} * h2_{t - 1} + b2_{i} } \right) $$
(7)
$$ F2_{t} = \delta \left( {q_{hf} * h1_{t} + u2_{xf} * x2_{t} + w2_{hf} * h2_{t - 1} + b2_{f} } \right) $$
(8)
$$ O2_{t} = \delta (q_{ho} * h1_{t} + u2_{xo} * x2_{t} + w2_{ho} * h2_{t - 1} + b2_{o} ) $$
(9)
$$ C2_{t} = F2_{t} * C2_{t - 1} + I2_{t} * \tanh \left( {q_{hc} * h1_{t} + u2_{xc} * x2_{t} + w2_{hc} * h2_{t - 1} + b2_{c} } \right) $$
(10)
$$ h2_{t} = O2_{t} * \tanh (C2_{t} ) $$
(11)
$$ y = \delta \left( {s1 * h1 + s2 * h2 + b} \right) $$
(12)

Compared with Eqs. (1)–(5), an additional term \( h1_{t} \) is added to \( C2_{t} \) and to the three gates \( I2_{t} \), \( F2_{t} \) and \( O2_{t} \). Meanwhile, the outputs of both rows, \( h1 \) and \( h2 \), contribute to the final prediction.
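A minimal sketch of one multi-LSTM step is given below, reusing the `sigmoid` and `lstm_step` functions from the sketch above for the first row and following Eqs. (7)–(12) for the second row; the dictionaries `p1`, `p2` and `pc` holding the row-specific and coupling weights are hypothetical containers introduced only for illustration.

```python
def multi_lstm_step(x1_t, x2_t, state, p1, p2, pc):
    """One multi-LSTM step: row 1 follows Eqs. (1)-(5), row 2 follows Eqs. (7)-(11)."""
    h1_prev, c1_prev, h2_prev, c2_prev = state
    # Row 1: a standard LSTM cell driven by the first input x1_t (e.g., wrist samples).
    h1_t, c1_t = lstm_step(x1_t, h1_prev, c1_prev, p1)
    # Row 2: each gate also receives h1_t through the coupling weights q_h*.
    i2 = sigmoid(pc["q_hi"] @ h1_t + p2["u_xi"] @ x2_t + p2["w_hi"] @ h2_prev + p2["b_i"])  # Eq. (7)
    f2 = sigmoid(pc["q_hf"] @ h1_t + p2["u_xf"] @ x2_t + p2["w_hf"] @ h2_prev + p2["b_f"])  # Eq. (8)
    o2 = sigmoid(pc["q_ho"] @ h1_t + p2["u_xo"] @ x2_t + p2["w_ho"] @ h2_prev + p2["b_o"])  # Eq. (9)
    c2_t = f2 * c2_prev + i2 * np.tanh(pc["q_hc"] @ h1_t + p2["u_xc"] @ x2_t
                                       + p2["w_hc"] @ h2_prev + p2["b_c"])                   # Eq. (10)
    h2_t = o2 * np.tanh(c2_t)                                                                # Eq. (11)
    # The final prediction combines both row outputs, Eq. (12).
    y = sigmoid(pc["s1"] @ h1_t + pc["s2"] @ h2_t + pc["b_y"])
    return y, (h1_t, c1_t, h2_t, c2_t)
```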

Training the multi-LSTM

Data acquisition

In order to combine gesture recognition with the in-house surgical navigation system BeiDou-SNS (School of Mechanical Engineering, Shanghai Jiao Tong University) [1], the motion of the right hand is used to control the cursor movement, and gestures of the left hand are designed to trigger mouse operations. To investigate the reliability of the gesture recognition algorithm, the trajectory data of both wrist and elbow from 10 participants were collected with a Kinect RGB-depth camera V 1.0 for Windows (Microsoft Inc., USA) to train the multi-LSTM network. During acquisition, the upward direction of the camera was aligned with the operator's vertical orientation, and the imaging plane was adjusted to be parallel to the operator's coronal plane. Then, as shown in Fig. 3, the operator performed the following instructions:

Fig. 3 Gesture schematic. a Waving upward, b waving downward, c waving leftward, d waving rightward

  1. Wave upward: first, make the wrist–elbow line perpendicular to the coronal plane and keep the elbow stationary; then, wave the hand upward until the wrist–elbow line is perpendicular to the transverse plane; finally, return to the original position;

  2. Wave downward: the same as (1) except for waving downward in the second step;

  3. Wave leftward: similar to (1) except for moving leftward until the wrist–elbow line is perpendicular to the sagittal plane in the second step;

  4. Wave rightward: the same as (3) except for moving rightward in the second step;

  5. Other moving or stationary states: arbitrary movement, as long as it differs from the above four categories.

Each participant performed each of the aforementioned five gestures 50 times, yielding 500 samples per gesture. Since too few training samples may cause poor performance, a large amount of data is required to train the model appropriately. To alleviate this limitation, we applied a data augmentation method to generate additional training data on the basis of the collected data.
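As a hedged illustration of such augmentation, the sketch below generates an additional trajectory by slightly rotating a recorded wrist or elbow trajectory about the vertical axis and adding Gaussian noise, the two operations mentioned in the Results section; the rotation range and noise level used here are assumed values.

```python
import numpy as np

def augment_trajectory(traj, max_angle_deg=10.0, noise_std=0.005, rng=None):
    """Return an augmented copy of a (T, 3) joint trajectory:
    a small random rotation about the vertical (y) axis plus Gaussian jitter."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    rot_y = np.array([[ np.cos(a), 0.0, np.sin(a)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
    traj = np.asarray(traj, dtype=float)
    return traj @ rot_y.T + rng.normal(0.0, noise_std, traj.shape)
```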

Gesture recognition training

As shown in Fig. 4, the red lines are the motion trajectories and the green dots are the coordinate positions of the wrist at different moments. As velocity variations of the hand movement lead to an uneven distribution of the elbow and wrist positions, cubic spline interpolation was introduced to preprocess these points before they enter the network.
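As a sketch of this preprocessing step, the following resamples an unevenly spaced joint trajectory to a fixed number of points with SciPy's cubic spline, parameterized by cumulative arc length; the sample count of 30 matches the network input described below, while the small distance threshold for dropping stationary duplicates is an assumption.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_trajectory(points, n_samples=30):
    """Resample an unevenly spaced (T, 3) joint trajectory to n_samples points
    using a cubic spline parameterized by cumulative arc length."""
    points = np.asarray(points, dtype=float)
    step = np.linalg.norm(np.diff(points, axis=0), axis=1)
    keep = np.concatenate(([True], step > 1e-6))      # drop stationary duplicates
    points = points[keep]
    d = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate(([0.0], np.cumsum(d)))         # cumulative arc length
    spline = CubicSpline(s, points, axis=0)
    return spline(np.linspace(0.0, s[-1], n_samples))
```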

Fig. 4 Process of gesture recognition

We instantiated the multi-LSTM with a hidden size of 128 to learn the trajectory-to-gesture mapping from the interpolated data. The input size was 60, comprising 30 wrist input units in \( x_{1} \) and 30 elbow input units in \( x_{2} \). As shown in Fig. 4, the output size \( O = 5 \) corresponds to the gesture classes, encoded as one-hot binary codes: \( 0 0 0 0 1 \) to \( 1 0 0 0 0 \) represent waving upward, downward, leftward, rightward and other motions, respectively. All the weights were initialized with sparse connections, and the bias vectors were initialized to 0. In addition, L2 regularization and early stopping were adopted to avoid overfitting. The main parameters are listed in Table 1.
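The gate-level coupling of multi-LSTM is not available as an off-the-shelf layer; the Keras sketch below therefore only approximates it by feeding the first row's output sequence into the second row through concatenation, while reproducing the stated input sizes, the five-class one-hot output, L2 regularization and early stopping. All layer choices beyond those stated above are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks

# Two separate inputs: 30 wrist samples and 30 elbow samples, 3 coordinates each.
wrist = layers.Input(shape=(30, 3), name="wrist")
elbow = layers.Input(shape=(30, 3), name="elbow")

h1 = layers.LSTM(128, return_sequences=True)(wrist)               # first row
h2 = layers.LSTM(128)(layers.Concatenate()([elbow, h1]))          # second row, fed by h1
y = layers.Dense(5, activation="softmax",
                 kernel_regularizer=regularizers.l2(1e-4))(
        layers.Concatenate()([h1[:, -1, :], h2]))                 # combine both row outputs

model = tf.keras.Model([wrist, elbow], y)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit([X_wrist, X_elbow], Y_onehot, validation_split=0.25, callbacks=[early_stop])
```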

Table 1 Training parameters used in multi-LSTM network

Integration with the surgical navigation system

In order to combine gesture recognition with the surgical navigation system, the aforementioned signs along the four directions were mapped to a left button click, right button click, middle wheel forward and middle wheel backward mouse event, respectively. Furthermore, the motion of the right hand was used to control the movement of the cursor. Theoretically, according to the following equations, the moving vector of the cursor can be obtained by a linear mapping of the hand motion vector.

$$ \frac{{\partial \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {d} }}{\partial t} = \frac{{p_{\text{h}} \left( t \right) - p_{\text{h}} \left( {t - 1} \right)}}{\Delta t} $$
(13)
$$ P_{\text{c}} \left( t \right) = P_{\text{c}} \left( {t - 1} \right) + w * \frac{{\partial \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {d} }}{\partial t} $$
(14)

where \( p_{\text{h}} \left( t \right) \) and \( p_{\text{h}} \left( {t - 1} \right) \) represent the hand position at the current and previous moment, respectively; therefore, \( \partial \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {d} /\partial t \) is the hand motion vector; \( w \) is the mapping factor, which is initialized according to the screen resolution; and \( P_{\text{c}} \left( t \right) \) and \( P_{\text{c}} \left( {t - 1} \right) \) represent the cursor position at the current and previous moment, respectively.

However, due to the positional drift of the depth camera itself and the synergetic effect of the limbs, the results of direct linear mapping are barely satisfactory. Therefore, a mapping factor that is dynamically adjusted according to a tanh function was adopted:

$$ x = \left\| \frac{{\partial \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {d} }}{\partial t} \right\| $$
(15)
$$ f\left( x \right) = \tanh \left( x \right) = \frac{{{\text{e}}^{x} - {\text{e}}^{ - x} }}{{{\text{e}}^{x} + {\text{e}}^{ - x} }} $$
(16)
$$ P_{\text{c}} \left( t \right) = P_{\text{c}} \left( {t - 1} \right) + w * f\left( x \right) * \frac{{\partial \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {d} }}{\partial t} $$
(17)

where \( x \) is the norm of \( \partial \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}} {d} /\partial t \), i.e., the hand motion speed.
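A minimal sketch of this cursor mapping, combining Eqs. (13) and (15)–(17), is shown below; taking only the in-plane components of the motion vector and clipping the cursor to the screen bounds are assumptions added for illustration.

```python
import numpy as np

def update_cursor(p_cursor, p_hand, p_hand_prev, dt, w, screen):
    """Map the hand motion vector to a cursor displacement, Eqs. (13), (15)-(17).
    The mapping factor w is modulated by tanh of the hand speed so that slow
    (noisy) motion is damped and fast motion is not over-amplified."""
    v = (p_hand - p_hand_prev) / dt            # Eq. (13): hand motion vector
    speed = np.linalg.norm(v)                  # Eq. (15)
    gain = np.tanh(speed)                      # Eq. (16)
    p_new = p_cursor + w * gain * v[:2]        # Eq. (17), using the in-plane components
    return np.clip(p_new, [0.0, 0.0], screen)  # keep the cursor within the screen
```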

According to the correspondence between signs and mouse operations, the gesture recognition network was attached to BeiDou-SNS as a sub-thread, and a sliding input model was employed in the recognition approach. As shown in Fig. 5, the latest N sets of data collected from the camera are fed into the network for each sign judgment. The advantages of this input model are twofold: first, it eliminates the need to search for the starting point of each gesture, so that both isolated and continuous gestures can be recognized; second, the length of the input data can be adjusted according to the speed of the user's movement.
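A possible implementation of the sliding input buffer is sketched below; the class and method names are hypothetical, and the window length of 30 matches the network input size described above.

```python
from collections import deque
import numpy as np

class SlidingGestureInput:
    """Keep the latest N wrist/elbow samples and feed them to the network at
    every new frame, so no explicit gesture start/end detection is needed."""
    def __init__(self, window=30):
        self.wrist = deque(maxlen=window)
        self.elbow = deque(maxlen=window)

    def push(self, wrist_xyz, elbow_xyz):
        """Add one frame; return True once the window is full and ready for inference."""
        self.wrist.append(wrist_xyz)
        self.elbow.append(elbow_xyz)
        return len(self.wrist) == self.wrist.maxlen

    def window_arrays(self):
        """Return the current window as two (N, 3) arrays for the two network inputs."""
        return np.asarray(self.wrist), np.asarray(self.elbow)
```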

Fig. 5 Sliding inputs of multi-LSTM

Phantom experiment validation

A phantom study of zygomatic implant (ZI) placement was conducted to validate the reliability of the HCI of the surgical navigation system based on the proposed multi-LSTM. ZI surgery was proposed by Brånemark [14, 15] in 1989 to assist massive grafting surgery or to rehabilitate patients who had undergone maxillectomy. As a long trajectory is required in implant embedment, a tiny angular deviation or entry point error can lead to an intolerable terminal point error [16].

Serving as fiducial markers, eight bone-anchored titanium mini-screws (length of 11.0 mm, square cavity of 1.0 mm, diameter of 1.6 mm, CIBEI®, Shanghai, China) were inserted into a resin craniomaxillofacial model. The quantity and distribution of the markers followed the registration criteria [17]. After that, a cone beam computed tomography (CBCT) scan (Planmeca, Helsinki, Finland; resolution of 0.33 mm/pixel, slice thickness of 0.4 mm) was performed. Then the CBCT DICOM data were transferred into an in-house oral and maxillofacial planning software [18] and four ZI paths were planned. The resin model and the pre-surgical planning are shown in Fig. 6a, b, respectively.

Fig. 6 Procedure and results of the phantom experiment. a Preoperative resin model with eight titanium fiducial markers inserted into the maxilla bilaterally; b preoperative virtual model and the four zygomatic implant paths; c postoperative resin model with four implants; d intraoperative screen snapshot; e real-time skeleton of the operator: different signs were recognized according to the trajectories of the left wrist and elbow, and the cursor position followed the motion of the right hand; f intraoperative non-contact manipulation of the surgical navigation system via the Kinect RGB-depth camera

First of all, the Kinect RGB-depth camera was activated to control the in-house BeiDou-SNS, as shown in Fig. 6f. Then, with the assistance of the Kinect RGB-depth camera and under the guidance of an NDI Polaris tracker (accuracy of 0.25 mm, Northern Digital Inc., Canada), the phantom experiment of zygomatic implant placement was conducted on a PC with an Intel Core i7-7700 3.60 GHz CPU, 8 GB memory, a 64-bit Windows 10 operating system and a 3 GB NVIDIA GeForce GTX 1060. The operations were as follows:

  1. Moved the right hand and waved the left hand upward to open files, including the pre-surgical DICOM images, the configuration files of the tracking system and the planned paths;

  2. Waved the left hand leftward or rightward to browse the DICOM images by switching the image slice and zooming the current image;

  3. Waved the left hand downward to activate the tracking system;

  4. Moved the right hand and waved the left hand upward to calibrate the surgical instruments and to register the image coordinate space to the world coordinate space by starting the corresponding functions.

Results

Recognition accuracy evaluation

As no public dataset meets our training requirements, we recorded 3D trajectories of the elbow and wrist from 10 participants. Each participant performed each of the aforementioned five gestures 50 times, providing 500 instances per gesture. Three-quarters of the recorded data served as training data and the rest as testing data. Meanwhile, both coordinate rotation and noise addition were used to augment the training data. To investigate the reliability of the proposed gesture recognition algorithm, tenfold cross-validation was performed on sign judgment, and the mean accuracy was 96% ± 3%.

The results of the phantom experiment

In the phantom experiment, several gestures had to be repeated because they were occasionally not recognized or incorrectly judged. Statistically, gestures toward the up, down and right directions could be distinguished with an accuracy of 92%, while the recognition precision of leftward waves was around 80%. During the whole experiment, there was no human–computer interaction except the control via the Kinect. Along the planned trajectories, four zygomatic implants were successfully placed, as shown in Fig. 6c. After the four implants had been inserted, the 3D model was CBCT-scanned again to obtain postoperative images, which were then fused with the preoperative ones, and three parameters, namely entry point deviation, exit point deviation and angular deviation, were used to evaluate the accuracy of zygomatic implant placement. As shown in Fig. 7, the three deviations of the four zygomatic implants were measured: the average deviations between planned and placed implants were 1.22 mm and 1.70 mm for the entry and exit points, respectively, while the angular deviation ranged from 0.4° to 2.9°, which meets clinical requirements. The details are listed in Table 2.
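For reference, the three deviation measures can be computed from the fused images as sketched below, assuming the planned and placed entry/exit points are available as 3D coordinates in millimeters; the function name is hypothetical.

```python
import numpy as np

def implant_deviations(planned_entry, planned_exit, placed_entry, placed_exit):
    """Entry, exit and angular deviations between a planned and a placed implant
    axis, with all points given in the fused image coordinate system (mm)."""
    entry_dev = np.linalg.norm(placed_entry - planned_entry)
    exit_dev = np.linalg.norm(placed_exit - planned_exit)
    a = planned_exit - planned_entry
    b = placed_exit - placed_entry
    cos_angle = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return entry_dev, exit_dev, angle_deg
```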

Fig. 7 Fusion of the preoperative and postoperative images and illustration of the planned–placed deviations of the implants

Table 2 Planned–placed deviation of four implants

Normally, using a mouse for human–computer interaction is quite reliable. The default mouse click frequency in Windows XP is in the range of 1–5 Hz, depending on the response rate of user-defined double-click events. By comparison, the gesture recognition rate of our system depends on the data update frequency of the RGB-depth camera and the system data acquisition frequency. For different users, the time to complete a wave is recorded to initialize a user-specific data collection frequency before using the system, ensuring the integrity of the intercepted gesture trajectories. By default, the joint position acquisition frequency of our gesture recognition framework is set to 60 Hz, and the input of the network requires 30 wrist position units and 30 elbow position units; therefore, the default recognition rate is 2 Hz, which is slightly slower than using a mouse. The maximum recognition rate can be improved by using an RGB-depth camera with a higher frame rate or a higher data acquisition frequency.

Discussion and conclusion

As the 3D positions of targets can be captured continuously and in real time by RGB-D cameras, various gesture recognition approaches have been proposed for HCI, disease detection, robotics and so on [19, 20]. However, distinguishing the beginning and ending points of gestures is a requisite step in those methods and increases the algorithmic complexity.

In this study, we proposed an optimized sign judgment structure named multi-LSTM, built on traditional LSTM, as a method of HCI. To meet clinical requirements, the gesture recognition algorithm was integrated with an in-house surgical navigation system to control the user interface, and a phantom study of zygomatic implant placement was conducted to validate its feasibility. The results showed that the non-contact interface based on multi-LSTM could be a promising tool to eliminate the disinfection problem for both patients and surgeons.

Although the results of this study appear satisfactory, there are two limitations. First, compared with other algorithms, it requires a longer time to train the model for different users; nevertheless, it achieves a high speed of gesture recognition in our online test. Second, when more than one person is within its detection range, the Kinect depth camera traces several skeletons simultaneously, causing confusion about the recognition target. Therefore, a user-specific gesture recognition algorithm is expected in further development.