1 Introduction

Endovascular surgery (ES) is widely employed to treat thrombus, tumor, embolism, and other vascular lesions, owing to its advantages over traditional thoracotomy and craniotomy, such as less bleeding, fewer complications, smaller trauma, and quicker recovery [1]. An ES robot is considered a promising technology to further improve operation accuracy, efficiency, and safety [2]. Current studies on ES robots mainly focus on the master control device with haptic feedback [3,4,5,6,7,8], the mechanism and control of the slave manipulator [9,10,11], the active catheter [12,13,14,15], the sensing system for surgical state acquisition [9, 16,17,18,19], and forewarning [20,21,22]. Unlike an autonomous robot system performing surgical tasks on static rigid tissue [23] or on soft tissue [24] with rigid instruments, current ES robot systems always adopt a master-slave paradigm [25, 26], as shown in Fig. 1. As a result, execution of surgical tasks with such an ES robot is entirely controlled by the surgeon and limited by the individual surgeon's skills.

Fig. 1

Schematic diagram of the robotic-assisted ES

Surgeons acquire their skills through a long period of study in the form of explicit learning and training in the form of implicit learning [27]. During ES, the surgeon estimates the spatial relationship between the guide wire and the vessel wall from a digital subtraction angiography (DSA) image. Meanwhile, the surgeon recognizes the operating force condition of the guide wire through haptic sense. The surgeon then decides on the operating action based on surgical experience. In this paper, this action decision-making procedure is defined as the surgeon's skills. The surgical state information shown in the DSA image is dynamic and unstructured, and the inference from the surgical state to the operating action is complex and highly non-linear. In addition, given the large variability of patients' tissues and the differences between surgeons' operating methods, explicit modeling of the surgeons' skills is practically limited or even impossible.

Machine learning (ML) methods have been employed for surgeons' skill learning and autonomous surgical robots [28], including recurrent neural networks (RNN) [29], deep neural networks (DNN) [30], Gaussian mixture models (GMM) [31], and hidden Markov models (HMM) [32,33,34]. However, these methods are limited by their inputs, which are the coordinate position of the gripper [29], the base position, rotation vector, and tool position [30], or guide wire tip motion data obtained by an electromagnetic position sensor [31,32,33]. They are not capable of understanding the unstructured surgical state image. A method combining non-rigid registration and GMM was proposed by Yang et al. [35], which is the state of the art for an autonomous ES robot. Although it can adapt to new vascular anatomy, only the vessel centerline is mapped to the actions, not the vessel contour morphology. Moreover, this method is limited to generating a fixed action sequence from a fixed starting point.

A deep convolutional neural network (CNN) offers a promising way to deal with the challenges of unstructured surgical state perception and surgeons' skill learning. Deep CNNs have demonstrated an outstanding capability of understanding raw pixel images in image classification [36], tumor diagnosis, game playing, and even robot grasping. A deep CNN successfully overcame the enormous search space and the difficulty of evaluating board positions and moves in the game of Go [37]. A deep CNN-based model learned control policies for seven Atari 2600 games from image input and surpassed a human expert on three of them [38]. A method employing three-dimensional CNNs was proposed for automated pulmonary nodule detection from volumetric computed tomography (CT) data [39]. Further, deep CNNs have been successfully used to map visual input to control commands for robot grasping [40,41,42]. A deep Q-Network (DQN)-based system was proposed that autonomously learns robot controllers from image input without any prior knowledge of configuration [43]. However, to our knowledge, no study has been reported on an autonomous ES robot driven directly by image input with deep CNN methods.

In this paper, a CNN-based framework is proposed for unstructured surgical state perception and surgeons' skill learning for an ES robot. A two-dimensional (2-D) deep CNN, defined as the action probability estimator, is developed for unstructured surgical state perception with image input. A multi-input one-dimensional (1-D) deep CNN, defined as the operating force mode recognizer, is proposed to recognize the operating force mode of the guide wire. An eye-hand collaborative servoing algorithm then combines these two networks for action execution under closed-loop control. Demonstrations of two representative surgical tasks, vessel branch passing and obstacle passing, are collected under laboratory settings for network training.

2 Methodology

2.1 Overview of the CNN-based framework

During the clinical ES procedure, the surgeon applies pushing, pulling, and rotating actions to the guide wire end outside the sheath to deliver the guide wire tip to the target area along the vessel lumen. The surgeon makes action decisions by estimating the surgical state, which is presented mainly by two kinds of information: the visual information in the DSA image and the operating force felt on the hands. The surgeon estimates the spatial relationship between the guide wire and the vessel contours from the visual information, while ensuring surgical safety by adjusting the operating force. The spatial information between the guide wire and the vessel contours is unstructured and dynamic because of their flexible and deformable physical properties. Although non-rigid registration and GMM methods have been proposed [35], only the vessel centerline is mapped to the actions; important information, including the vessel contour morphology and the operating force, is not taken into account. For operating force mode recognition, a threshold method was adopted in Jian et al.'s research [21]. However, as the guide wire is continuously inserted into the vessel lumen, the interaction force between the guide wire body and the vessel wall is complex and variable, so it is difficult to classify the operating force mode by simply setting a safety threshold on the operating force, which is always detected at the guide wire end.

A CNN-based framework is proposed in this paper, as shown in Fig. 2. It mainly consists of two phases: phase I is data collection and phase II is automatic operation. In phase I, the master-slave robot is used to collect data for training the networks used in automatic operation. A human surgeon controls the slave robot through the master robot to manipulate the guide wire for specific surgical tasks, while the actions detected by the master robot and the operating force detected by the slave robot, as well as the images detected by the camera under laboratory conditions or by the DSA device under clinical conditions, are recorded as demonstration data. In phase II, for automatic operation, the slave robot is controlled by the trained models (i.e., the action probability estimator and the operating force mode recognizer). In each closed control loop of automatic operation, the current surgical state image is taken as the input of the trained action probability estimator, and the guide wire operating force is taken as the input of the trained operating force mode recognizer. The eye-hand collaborative mechanism infers the optimal action from the outputs of these two learned models. The inferred optimal action is then executed by the slave robot to operate the guide wire for the surgical task.

Fig. 2

Diagram of the proposed CNN-based framework

Four preconditions are set for this work. (1) The vessel contour is visible in the medical image. Under clinical conditions, the vessel contour can be obtained from the DSA image. (2) Camera images are used as a substitute for the DSA image under experimental conditions for training data collection and evaluation experiments. The vessel contour and catheter contour are extracted using a Canny operator in the preprocessing procedure. For the DSA image under clinical conditions, the textures could be extracted with specific extraction and tracking methods for medical images [44,45,46,47,48,49] in the preprocessing procedure. (3) The target point of the catheter tip in a given task is set by a human operator. (4) The guide wire operating force detected by the force sensor inside the slave manipulator is used to represent the surgeon's operating force on the guide wire.

2.2 Action probability estimator

An action probability estimator is proposed to predict the probability distribution over candidate actions in a given action space based on the current surgical state image. To make accurate predictions, the action probability estimator should be able not only to extract features of the spatial relationship between the guide wire and the vessel contour but also to map the extracted features to action probabilities. The action probability estimator is developed based on a CNN, as shown in Fig. 3. The input of the network is the preprocessed current surgical state image, and the output is the estimated action probability distribution.

Fig. 3

Architecture of the action probability estimator

Firstly, the contours of the moving guide wire and the vessel wall in the surgical state image are extracted using a Canny operator. The surgical task is represented by marking the target area and the vessel centerline from the starting area to the target area in the image.
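As a hedged illustration of this preprocessing step, the sketch below extracts contour edges with OpenCV's Canny operator. The threshold values, smoothing kernel, output resolution, and file name are illustrative assumptions rather than parameters reported in this paper.

```python
# Hedged sketch of the preprocessing step: contour extraction with a Canny
# operator on a grayscale surgical state image. Thresholds, blur kernel, and
# output size are assumed values for illustration only.
import cv2
import numpy as np

def preprocess_state_image(gray_image: np.ndarray,
                           low_thresh: int = 50,
                           high_thresh: int = 150,
                           out_size: tuple = (128, 128)) -> np.ndarray:
    """Extract contour edges from a grayscale surgical state image."""
    blurred = cv2.GaussianBlur(gray_image, (5, 5), 0)     # suppress sensor noise
    edges = cv2.Canny(blurred, low_thresh, high_thresh)   # guide wire and vessel contours
    edges = cv2.resize(edges, out_size)                   # fixed network input size
    return edges.astype(np.float32) / 255.0               # normalize to [0, 1]

# Example usage with a camera frame (assumed path):
# frame = cv2.imread("state_frame.png", cv2.IMREAD_GRAYSCALE)
# net_input = preprocess_state_image(frame)[None, ..., None]  # add batch/channel dims
```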

Then, an 8-layer CNN is used to extract features from the preprocessed images, as shown in Fig. 3. It has been shown that using more kernels of smaller size can achieve better performance than using fewer kernels of larger size [48]. At each convolutional layer, the kernels sweep over the input (i.e., the output of the previous layer) step by step to extract a stack of higher-level features. A kernel with trained parameters detects a certain kind of meaningful feature contained in the local information. Then, a bias term is added and a non-linear activation function is applied. The 2-D convolutional layer can be formulated as (1):

$$ h_i^l(x,y)=\sigma\left(b_i^l+\sum_k\sum_{u,v} h_k^{l-1}(x-u,\,y-v)\, W_{ki}^l(u,v)\right) $$
(1)

where \( h_i^l \) and \( h_k^{l-1} \) are respectively the ith 2-D feature map in the lth layer and the kth 2-D feature map in the previous layer, \( W_{ki}^l\in R^2 \) is the kth 2-D convolutional kernel mapping \( h_k^{l-1} \) to \( h_i^l \); \( h_i^l(x,y) \), \( h_k^{l-1}(x-u,y-v) \), and \( W_{ki}^l(u,v) \) denote element values, with (x, y) being the coordinates of \( h_i^l \) and (u, v) being the coordinates of \( W_{ki}^l \); \( b_i^l \) is the bias term, and σ(⋅) is the non-linear activation function, for which the rectified linear unit (ReLU) (σ(a) = max(0, a)) is used in this paper.
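For concreteness, the following is a minimal NumPy rendering of Eq. (1) for a single output feature map; it is an illustrative sketch, not the implementation used in this work. The explicit loops mirror the summation indices (k over input maps, (u, v) over the kernel), and, like common deep learning frameworks, the sketch computes the cross-correlation form, which differs from the flipped-kernel convention in Eq. (1) only by a reflection of the kernel. Shapes and values are arbitrary.

```python
# Minimal NumPy rendering of Eq. (1) for one output feature map.
# Computes the cross-correlation form (kernel not flipped) with a ReLU.
import numpy as np

def conv2d_layer(h_prev: np.ndarray, W: np.ndarray, b: float) -> np.ndarray:
    """h_prev: (K, H, W) input maps; W: (K, kh, kw) kernels; returns one output map."""
    K, H, Wd = h_prev.shape
    _, kh, kw = W.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            s = b
            for k in range(K):            # sum over input feature maps
                for u in range(kh):       # sum over kernel rows
                    for v in range(kw):   # sum over kernel columns
                        s += h_prev[k, x + u, y + v] * W[k, u, v]
            out[x, y] = s
    return np.maximum(out, 0.0)           # ReLU activation sigma(a) = max(0, a)
```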

After the first, second, and fifth convolutional layers, max pooling layers are adopted to further reduce the feature dimension. Max pooling is a form of non-linear downsampling. It helps avoid overfitting during training by reducing the number of parameters. Max pooling is performed over the whole feature map, so the max pooled feature represents a global feature [30]. Surgeons' skills are mainly represented by the evolution of the spatial relationship between the guide wire and the vessel wall contained in the texture features of the medical image sequence, so the features reflecting the surgeons' skills are retained after every max pooling layer.

After the third max pooling layer, the feature maps are flattened and then connected to two fully connected layers. The fully connected layers fit the non-linear relationship between the low-dimensional feature maps and the action probability distribution. The fully connected layer can be expressed as (2):

$$ {h}^f=\sigma \left({b}^f+{W}^f{h}^{f-1}\right) $$
(2)

where \( h^{f-1} \) is the input feature vector of the fth fully connected layer, \( h^f \) is the output feature vector, \( W^f \) is the weight matrix, \( b^f \) is the bias term, and σ(⋅) is the ReLU activation function.

Denoting the neuron vector in the output layer by \( h^o \), the number of neurons of \( h^o \) equals the number of actions in the given candidate action space \( A_t = [a_0, a_1, a_2, a_3, a_4] \) (i.e., pushing forward, pushing forward synchronously with rotating, rotating, pulling back synchronously with rotating, and pulling back; the displacement of the pushing and pulling actions is set to 0.6 mm, and the angular displacement of the rotating action is set to 6°). Then, a Softmax layer calculates the probability of each action by the Softmax regression \( p_a(h^o)=\exp(h_a^o)/\sum_{a=0}^{A-1}\exp(h_a^o) \), where \( h_a^o \) is the ath output value.
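A minimal Keras sketch of the action probability estimator is given below. It follows the layout described above under the assumption that the eight layers comprise five convolutional and three max pooling layers (pooling after the first, second, and fifth convolutions, with flattening after the third pooling). The filter counts, kernel sizes, dense widths, and input resolution are assumptions, since they are not listed here.

```python
# Hedged Keras sketch of the action probability estimator: convolutional
# feature extraction, max pooling after the 1st, 2nd, and 5th conv layers,
# flattening, two fully connected layers, and a 5-way Softmax over the
# candidate action space. Layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_action_probability_estimator(input_shape=(128, 128, 1), n_actions=5):
    x = inputs = layers.Input(shape=input_shape)
    filters = [16, 32, 64, 64, 128]        # assumed filter counts for the five conv layers
    pool_after = {0, 1, 4}                  # pooling after conv layers 1, 2, and 5
    for i, f in enumerate(filters):
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        if i in pool_after:
            x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)     # first fully connected layer
    x = layers.Dense(128, activation="relu")(x)     # second fully connected layer
    outputs = layers.Dense(n_actions, activation="softmax")(x)   # action probabilities
    return models.Model(inputs, outputs)
```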

2.3 Operating force mode recognizer

Recognition of the operating force condition can be defined as a two-class classification problem. If the guide wire tip encounters an obstacle, the operating force condition is defined as the abnormal force mode; otherwise, it is defined as the normal force mode. As shown in Fig. 4, a multi-input 1-D CNN is proposed as the operating force mode recognizer. It takes both the current operating force value and the most recent operating force values as multiple inputs. In this way, the variation tendency of the operating force is also taken into account when recognizing the operating force mode.

Fig. 4

Architecture of the operating force mode recognizer

The operating force value at the current time is defined as \( F_{ti} \). The operating force sequence consisting of the M most recent operating force values up to \( F_{ti} \) is defined as \( F_t \) (M is set to 50 in this work). The sampling frequency of the force sensor is 90 Hz. \( F_t \) is taken as the input of the first convolutional layer, which is followed by a second convolutional layer, two fully connected layers, and a Softmax output layer. The 1-D convolutional layer can be expressed as (3):

$$ \eta_i^l(x)=\sigma\left(b_i^l+\sum_k\sum_u \eta_k^{l-1}(x-u)\, W_{ki}^l(u)\right) $$
(3)

where \( \eta_i^l \) is the ith output feature vector of the lth 1-D convolutional layer, \( \eta_k^{l-1} \) is the kth input feature vector of the lth 1-D convolutional layer, \( W_{ki}^l\in R \) is the kth kernel of the lth 1-D convolutional layer, \( b_i^l \) is the bias term, and σ(⋅) is the ReLU activation function.

To prevent the current operating force value \( F_{ti} \) from being submerged during convolution, \( F_{ti} \) is directly taken as part of the input feature of the first fully connected layer. The first and second fully connected layers can be expressed respectively as (4) and (5):

$$ {\eta}^{f_1}=\sigma \left({b}^{f_1}+{W}^{f_1}\left({F}_t,{F}_{ti}\right)\right) $$
(4)
$$ {\eta}^{f_2}=\sigma \left({b}^{f_2}+{W}^{f_2}{\eta}^{f_1}\right) $$
(5)

where (\( F_t \), \( F_{ti} \)) and \( \eta^{f_1} \) are respectively the input and output feature vectors of the first fully connected layer, \( \eta^{f_1} \) and \( \eta^{f_2} \) are respectively the input and output feature vectors of the second fully connected layer, \( W^{f_1} \) and \( W^{f_2} \) are the weight matrices of the two fully connected layers, \( b^{f_1} \) and \( b^{f_2} \) are the bias terms, and σ(⋅) is the ReLU activation function.
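The following is a hedged Keras sketch of the operating force mode recognizer: the force sequence \( F_t \) (M = 50 samples) passes through two 1-D convolutional layers, the current value \( F_{ti} \) is concatenated into the input of the first fully connected layer so that it is not submerged by the convolution, and a two-way Softmax separates the normal and abnormal force modes. Filter counts, kernel sizes, and dense widths are assumptions.

```python
# Hedged Keras sketch of the multi-input 1-D CNN force mode recognizer.
# Layer sizes are illustrative assumptions; the structure follows the text.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_force_mode_recognizer(seq_len=50):
    seq_in = layers.Input(shape=(seq_len, 1), name="F_t")    # recent force sequence
    cur_in = layers.Input(shape=(1,), name="F_ti")           # current force value
    x = layers.Conv1D(16, 5, activation="relu")(seq_in)      # first 1-D conv layer
    x = layers.Conv1D(32, 5, activation="relu")(x)           # second 1-D conv layer
    x = layers.Flatten()(x)
    x = layers.Concatenate()([x, cur_in])                    # inject F_ti before the FC layers
    x = layers.Dense(64, activation="relu")(x)               # first fully connected layer
    x = layers.Dense(32, activation="relu")(x)               # second fully connected layer
    out = layers.Dense(2, activation="softmax", name="force_mode")(x)  # normal vs. abnormal
    return models.Model([seq_in, cur_in], out)
```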

2.4 Eye-hand collaborative servoing algorithm

In this section, an eye-hand collaborative servoing algorithm is proposed to realize autonomous control of the slave robot under a closed-loop architecture. It combines the outputs of the action probability estimator and the operating force mode recognizer. If the recognition result of the operating force mode recognizer is the normal force mode, the action with the maximum probability is executed. If the recognition result is the abnormal force mode, the servoing algorithm controls the slave manipulator to avoid the obstacle with the given avoiding actions (i.e., pulling back the guide wire by a certain distance and then rotating it by a certain angle). The eye-hand collaborative servoing algorithm is illustrated below:

Algorithm: eye-hand collaborative servoing
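A hedged Python sketch of this servoing loop is given below. The robot, camera, and force-sensor interfaces (grab_state_image, read_force, execute_action, reached_target) are hypothetical placeholders, the mapping of action indices to the candidate actions follows the order assumed in Sect. 2.2, the 100 s limit mirrors the time limit used in the evaluation experiments, and preprocess_state_image refers to the preprocessing sketch above; the exact stopping logic is an assumption.

```python
# Hedged sketch of the eye-hand collaborative servoing loop. All hardware
# interfaces are hypothetical placeholders, not the authors' actual API.
import time
import numpy as np

ACTIONS = ["push", "push_rotate", "rotate", "pull_rotate", "pull"]  # assumed order of A_t

def eye_hand_servoing(estimator, recognizer, robot, camera, force_sensor,
                      time_limit=100.0):
    start = time.time()
    while time.time() - start < time_limit:
        image = preprocess_state_image(camera.grab_state_image())     # current surgical state
        f_seq, f_cur = force_sensor.read_force(window=50)             # F_t and F_ti
        probs = estimator.predict(image[None, ..., None], verbose=0)[0]
        mode = recognizer.predict([f_seq[None, ..., None],
                                   np.array([[f_cur]])], verbose=0)[0]
        if np.argmax(mode) == 0:                      # normal force mode
            robot.execute_action(ACTIONS[int(np.argmax(probs))])
        else:                                         # abnormal force mode: avoid obstacle
            robot.execute_action("pull")              # pull back a certain distance
            robot.execute_action("rotate")            # then rotate a certain angle
        if robot.reached_target():
            return True                               # task completed
    return False                                      # time limit exceeded
```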

2.5 Hardware setup and data collection

To collect data for network training, an ES robot [26] is used to build the hardware setup, as shown in Fig. 5a. It mainly consists of a Geomagic Touch X device as the master controller, a slave manipulator, a vessel model, and a grayscale monocular camera. During data collection, surgeons control the slave manipulator via the master controller to operate the guide wire inside the vessel model, as shown in Fig. 5b. The master controller records the operating actions, and the force sensor inside the slave manipulator detects the operating force, as shown in Fig. 5c. Meanwhile, the camera acquires the surgical state image. Two kinds of vessel models are used for data collection and evaluation experiments, as shown in Fig. 6. One branch of the medical model and three designed vessel models are used for data collection, while another branch of the medical model and a fourth designed vessel model are used for evaluation experiments. The proposed CNN-based method is implemented with TensorFlow on a graphics workstation (NVIDIA Quadro K4000 GPU).

Fig. 5

Experimental setup. a Overview of the ES robot system and the vessel model. b The guide wire inserted into the vessel by the slave manipulator. c The force sensor inside the slave manipulator

Fig. 6

Vessel models. a One branch in the medical vessel model for data collection and another for test. b Three designed vessel models for data collection and one for test

Three kinds of experiments in different cases are designed. The starting area and target area are given by an operator.

Case I

The medical vessel model is used. Case I(a): the guide wire tip is moved from the starting area and is oriented towards the objective branch, so it can be pushed directly into the objective branch. Case I(b): the guide wire tip is moved from the starting area and is oriented away from the objective branch, so it must first be rotated and then pushed into the objective branch. One hundred fifty demonstrations are conducted by 5 surgeons, and about 5500 samples are collected for each case.

Case II

The designed vessel models are used. Case II(a): the guide wire tip is moved from the starting area and is oriented towards the objective branch, so it can be pushed directly into the objective branch. Case II(b): the guide wire tip is moved from the starting area and is oriented away from the objective branch, so it must first be rotated and then pushed into the objective branch. Case II(c): the guide wire tip has moved into the wrong branch; it must first be pulled back to the entrance of the branch, then rotated and pushed into the objective branch. Two hundred demonstrations are conducted by 5 surgeons, and about 11,000 samples are collected for each case.

Case III

The medical vessel model is used, with a simulated plaque pasted on the vessel wall. The guide wire is advanced from the starting area and encounters the plaque. Case III(a): the guide wire tip is not blocked by the obstacle, and the guide wire is pushed forward directly and passes the obstacle. Case III(b): the guide wire tip is blocked by the obstacle, and the guide wire presses against the obstacle and bends gradually. Fifty demonstrations are conducted by 5 surgeons for each case.

2.6 Training the network

To avoid overfitting, the collected data are augmented 15-fold using flipping, rotating, and adding Gaussian noise; about 704,000 samples are finally obtained. The loss function used to train the network is based on cross-entropy. Denoting all trainable parameters by θ, the loss function can be expressed as (6):

$$ L(\theta)=-\frac{1}{N}\sum_{j=1}^{N}\sum_{c=0}^{A-1} 1\{y_A^{(j)}=c\}\,\ln P\big(\hat{y}_A^{(j)}=c\,\big|\,I^{(j)};\theta\big) $$
(6)

where \( I^{(j)} \) is the input image batch, \( y_A^{(j)} \) is the corresponding label set, \( \hat{y}_A^{(j)} \) is the predicted action probability, and P(⋅) is the probability of classifying \( I^{(j)} \) as class c, which can be expressed as (7):

$$ P\big(\hat{y}_A^{(j)}=c\,\big|\,I^{(j)};\theta\big)=\frac{e^{\theta_c^T I^{(j)}}}{\sum_{l=1}^{k} e^{\theta_l^T I^{(j)}}} $$
(7)

As for training the operating force mode recognizer, the loss function is also based on a cross-entropy method.

An Adam (adaptive moment estimation) optimizer is used to train the networks. The batch size is 32. The learning rate is initially set to 0.005 and is reduced with a decay factor of 0.909 at each training step until it reaches 0.0001.
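The following Keras sketch illustrates this training configuration: an Adam optimizer with a learning rate starting at 0.005, decaying by a factor of 0.909 at each step, and floored at 0.0001, a batch size of 32, and a cross-entropy loss (Eq. (6)) in label-index form. The custom schedule class and the dataset placeholders (train_images, train_labels) are assumptions, and build_action_probability_estimator refers to the sketch in Sect. 2.2.

```python
# Hedged sketch of the training setup: Adam, batch size 32, cross-entropy loss,
# and a per-step exponential learning rate decay floored at 0.0001.
import tensorflow as tf

class FlooredExponentialDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_lr=0.005, decay=0.909, floor=0.0001):
        self.initial_lr, self.decay, self.floor = initial_lr, decay, floor

    def __call__(self, step):
        lr = self.initial_lr * tf.pow(self.decay, tf.cast(step, tf.float32))
        return tf.maximum(lr, self.floor)   # stop decaying once the floor is reached

model = build_action_probability_estimator()                  # sketch from Sect. 2.2
model.compile(optimizer=tf.keras.optimizers.Adam(FlooredExponentialDecay()),
              loss="sparse_categorical_crossentropy",          # Eq. (6) with integer labels
              metrics=["accuracy"])
# Placeholder dataset tensors; in practice these hold the augmented demonstrations:
# model.fit(train_images, train_labels, batch_size=32, epochs=50, validation_split=0.1)
```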

3 Evaluation experiments and results

In this section, the evaluation experiments aim to answer the following questions: (1) How well does the proposed CNN-based method compare with the state of the art? (2) How well does the proposed eye-hand collaborative operation perform?

3.1 Comparison with the state of the art

Vessel branch passing tests are conducted to compare the proposed CNN-based method with Yang's non-rigid registration and GMM method [35]. Because the operating force is not taken into account in Yang's method, the evaluation tests do not include an obstacle passing task. For detailed comparison, evaluation tests in different cases with different training data are conducted, as shown in Table 1. For convenience, the data collected in each case are named after that case; for instance, the data collected in case I are defined as case I data. In addition, the data set formed by randomly mixing the case I data and case II data is defined as multi-data. For quantitative comparison, the evaluation metrics are defined as follows:

Table 1 Comparison between our method and Yang’s method

(1) Success rate (SR): for a given surgical task, if the guide wire tip reaches the target area within a limit of 100 s, the test is counted as a success; otherwise, it is counted as a failure. Fifty tests are conducted for each case.

(2) Average operating time (AOT): the efficiency of task execution is measured by the average operating time. A longer operating time implies a larger X-ray radiation dose, which causes more harm to the surgeon and the patient.

Although Yang's method achieves a relatively higher SR and shorter AOT, our method shows the capability of adapting to different situations. As shown in Table 1, for the tests on the medical model in case I(a) and case I(b), Yang's method trained respectively with case I(a) data and case I(b) data achieves SRs of 100% and 94%, which are relatively higher than those of our method. In addition, the AOTs with Yang's method are shorter than those with our method, with the largest AOT difference of 28.5% in case I(a). However, Yang's method trained with case I(a) data cannot complete the task in case I(b). In fact, the guide wire tip directly moves into the wrong branch in all the case I(a) tests with Yang's method trained with case I(b) data. Similar results are also observed in the case II tests with Yang's method trained with non-corresponding data. The reason is that the GMM in Yang's method can only be trained with data from a relatively fixed starting point and generates a fixed action command sequence; when the starting point changes, the robot is still controlled by the generated command sequence, so the guide wire cannot reach the target area. In contrast, our method trained with case I data can perform the task in both case I(a) and case I(b), and it shows the same behavior in the case II tests. This is because our method can learn the non-linear features in mixed data. Also, based on the closed-loop control architecture, our method estimates the current surgical state and then generates the optimal action command at each time step, rather than preplanning and executing a fixed action command sequence.

The AOTs of our method trained with the multi-data are reduced by 22.0% and 25.2% in case I(a) and case I(b), respectively, compared with those trained with case I data. The reason for the time reduction is a decrease in wrong actions. Despite data augmentation, the original vessel shape is constant in the case I data, whereas the case II data contain more vessel shapes. Combining case I data with case II data therefore increases the sample diversity, so the network is more likely to treat the vessel shape as a feature, which improves its recognition accuracy. The AOTs with case II data and multi-data are at the same level in the case II tests. The reason might be that a single kind of training data with wide variation cannot obviously improve the recognition capability of the network. In some particular situations, reciprocating motion of the guide wire is observed, which prolongs the task completion time. In these situations, the spatial relationship between the guide wire and the vessel wall differs from most of the training samples; as a result, the recognition accuracy of the action probability estimator is low and it might output wrong actions. When the number of wrong actions is large enough, the test can fail because the guide wire moves into a wrong vessel branch or the time limit is exceeded. This also indicates the importance of sample diversity for the recognition capability of the action probability estimator.

In 61 of the total 250 tests of our method trained with multi-data in case I and case II, the guide wire tip moves into the wrong vessel branch, and in 53 of these 61 tests, the guide wire tip is successfully pulled back, rotated to a suitable direction, and then pushed towards the target area. This indicates our method's capability of recovering from failure situations (Fig. 7), which is important for application in a dynamic surgical state. In contrast, Yang's method does not show this capability. Further, it should be pointed out that demonstrations of recovery from a wrong vessel branch are not included in the case I data, but are included in the case II(c) data. This indicates that the network trained with multi-data learns the skill of recovering from failure situations and exhibits it in the case I tests. It also indicates that the network successfully learns to recognize the spatial relationship between the guide wire and the vessel contours.

Fig. 7

A test demonstrating the capability of recovering from a failure situation

3.2 Performance evaluation of eye-hand collaborative operation

To evaluate the eye-hand collaborative servoing algorithm, tests of a task combining obstacle passing and vessel branch passing are conducted in the testing vessel branch of the medical vessel model. The guide wire can pass the obstacle only with a suitable orientation. In most of the evaluation tests, both the robot and the surgeon make several attempts before the guide wire tip passes the obstacle. If the task is completed within 100 s, the test is counted as a success; otherwise, it is counted as a failure. Two hundred tests are conducted, achieving a success rate of 90.5%. As shown in Fig. 8, the operating times of the evaluation tests are distributed mainly between 40 and 80 s. The AOT of the eye-hand collaborative operation is 55.4 s, which is close to the 44.1 s of the surgeon's tests, although the maximum operating time of 139.2 s is larger than the surgeon's maximum of 51.7 s.

Fig. 8

Distribution of operating times

Figure 9a and k show respectively the guide wire tip trajectories of a representative robotic test and a manual test. The trajectories show that the robot controlled by the eye-hand collaborative servoing algorithm performs the task through a procedure similar to the manual test. Owing to the obstacle passing attempts, both trajectories are relatively tangled near the simulated plaque. After passing the simulated plaque, the trajectory in the robotic test is as smooth as that in the manual test.

Fig. 9

Comparison of guide wire tip trajectory and detailed operating procedure. a–j Robotic test. k–t Manual test

The detailed moving procedures of the guide wire in these two tests are shown in panels b–j and l–t of Fig. 9, respectively. Three obstacle passing attempts are observed in these two tests; across all the robotic and manual tests, the number of obstacle passing attempts ranges from 0 to 6. Points B, D, F, L, N, and P in Fig. 10 correspond respectively to Fig. 9b, d, j, l, n, t. It can be seen that the operating force mode recognizer correctly recognizes the abnormal force mode at points B, D, and F, after which the obstacle avoiding actions are correctly executed. The action probability estimator then correctly understands the surgical state, and rotating actions are executed near the junction to adjust the guide wire tip towards the objective branch.

Fig. 10

Comparison of operating force. Points B, D, F, L, N, and P in this figure correspond respectively to panels b–j and l–t of Fig. 9

Further, the operating forces at points B, D, F, L, N, and P do not share the same value, as shown in Fig. 10. This indicates that the operating force mode recognizer identifies the force mode not simply according to a threshold force value but according to both the variation tendency and the instantaneous value of the operating force, which is similar to the surgeon's manner of operating force mode recognition. In addition, the maximum operating forces during obstacle avoiding attempts in all the evaluation tests range from 0.24 to 0.79 N, which is at the same level as those in the manual tests (0.29 to 0.83 N).

The results of the operating force comparison indicate that the operating force mode recognizer learns the surgeon's skills and shows a performance similar to the manual tests, which helps ensure surgical safety. The overall results demonstrate that the task combining obstacle passing and vessel branch passing can be performed autonomously by the robot with the eye-hand collaborative servoing algorithm.

4 Discussion

Despite the unstructured surgical state and the surgeon's implicit skills, surgical tasks in ES that require human cognition and experience can be executed autonomously without manual preprogramming for different situations. We demonstrate the feasibility and potential of the proposed CNN-based method for unstructured surgical state perception and the surgeon's skill learning in ES.

The experimental results show the CNN-based method's capability of adapting to different situations. Owing to its high non-linear mapping capacity, the CNN can be trained with sample data collected under different conditions, unlike the non-rigid registration and GMM method [35]; as a result, it can adapt to different situations. Owing to its closed-loop control architecture and its ability to perceive an unstructured surgical state, our method also shows the capability of recovering from failure situations. Recovery from failure is a critical issue for the application of deep learning methods in real-world robotics [50]. Our method achieves this capability by including the case II(c) samples, which contain demonstrations of recovery from a wrong branch, in the training data. Interestingly, the network trained with multi-data learns the skill of recovering from failure situations and exhibits it in the case I tests, even though the case I data contain no samples of recovery from a wrong branch. This demonstrates the ability of the CNN-based method to transfer learned skills to different situations, which is inspirational for the issue of experience sharing among multiple ES robots trained with demonstration data collected from different surgeons with different skills.

Under a closed-loop control architecture, the eye-hand collaborative servoing algorithm effectively integrates the capabilities of the action probability estimator and the operating force mode recognizer. The action probability estimator chooses the optimal action based on unstructured surgical state perception, while the operating force mode recognizer maintains an operating force level similar to manual operation, which benefits surgical safety. These capabilities are obtained through learning from demonstrations rather than manual preprogramming. Further, the inputs to the networks are an image and a force signal, which are unrelated to the kinematics and dynamics of the surgical robot. This means that calibration between the camera and robot coordinates, which many current methods require [24], is not needed.

The main reason for the failure cases is insufficient sample diversity. Because human demonstrations always follow certain patterns, some particular situations encountered during the robotic tests are not similar to those in the samples, and in these situations wrong actions might be chosen and executed. The SR could be improved in future work by enhancing the generalization of the network with more training data from various high-fidelity surgery scenarios.

The intent of the proposed CNN-based framework is not to replace human surgeons but to enhance the autonomy of the ES robot, since improving the autonomy of the surgical robot could expand human capacity and capability in human-robot collaborative surgery [51]. In human-robot collaborative surgery, the slave robot would be controlled by both the human surgeon (through a master robot) and the trained networks within a specific collaborative architecture. For instance [52], repetitive and low-risk surgical tasks could be performed autonomously by the robot under the surgeon's supervision, while the surgeon focuses on high-risk tasks. In this way, the burden on the surgeon could be reduced, and more of the surgeon's attention could be devoted to complex and high-risk tasks.

5 Conclusion

In this paper, we report a CNN-based method for unstructured surgical state perception and the surgeon's skill learning for an ES robot. The results demonstrate the feasibility and potential of the CNN-based method for enhancing the autonomy of the ES robot. In addition, because kinematic and dynamic models of the robot are unnecessary, the CNN-based method can potentially be extended to many other kinds of surgical robots.

Future studies will include the following: enhancement of network generalization by collecting more varied demonstration data, 3-D perception of the surgical state by taking images from different views as network input, and a specific architecture for human-robot collaborative operation.