
1 Introduction

Laparoscopic surgery is a minimally invasive procedure that reproduces the principles of conventional surgery with minimal physical trauma. Compared to open surgery, this approach is more beneficial to the patient but significantly increases the complexity of the surgical gestures. The constraints for surgeons are mostly ergonomic, concerning both the manipulation of surgical instruments (reduced instrument mobility due to the fixed insertion points on the abdominal wall, loss of tactile sense) and the visualization of the surgical scene (limited field of view, indirect view of the surgical scene, endoscope manipulation). Performing a laparoscopy therefore demands considerable adaptability from surgeons and involves a long learning curve.

Automatic localization of instruments can help address several limitations of laparoscopy and assist surgeons during an intervention. For instance, the authors of [1] propose to localize instruments in space in a surgical trainer, based on a projective model and gradient image processing. In [2], a similar approach is proposed (also in a surgical trainer), with the addition of an extended Kalman filter to extract the edges of the instruments.

In [3], the authors use the instrument insertion point as a constraint and a probabilistic algorithm to find instruments, with the aim of controlling a robotic endoscope holder to assist surgeons during surgery.

All these methods use a gradient approach to extract instrument edges in the image. However, such approaches are sensitive to noise, illumination and shadows, which can lead to a segmentation that is insufficient for robust localization of instruments in the image [4]. To overcome this problem, we propose to use a 2D Frangi filter [5] to obtain a robust detection of instrument edges. We present an algorithm to localize and track surgical instruments in endoscopic images in real-time. Our algorithm also estimates the 3D position and orientation of the instruments from 2D information in the images, knowing the camera and instrument models.

2 Instrument Localization and Tracking Framework

The principle of our instrument detection algorithm consists of three steps:

  • roughly identifying all regions corresponding to the location of an instrument in each laparoscopic image (Sect. 2.1),

  • refining the instrument detection within the identified regions (Sect. 2.2),

  • estimating the 3D pose of the instrument (Sect. 2.3).

After an initial detection, the segmentation is constrained by the localization in the previous images to track the instrument.

2.1 Rough Extraction of Instruments Regions

First, the laparoscopic color image (Fig. 2a) is converted from the RGB color space to the CIELab color space. The L channel, corresponding to the luminance, is removed to make the method insensitive to the illumination variations inherent to laparoscopic surgery. We thus obtain a grayscale image composed of the a and b channels (Fig. 2b), corresponding to the chromaticity \(C_{ab}=\sqrt{a^2+b^2}\). This color space is more robust for challenging images than commonly used color spaces such as HSV [7] or RGB, see Fig. 1. We then binarize this grayscale image using the automatic Otsu thresholding approach [8]. Since laparoscopic instruments have a color very distinct from the background (they are usually black, metallic, or blue/green), instrument pixels appear as white and background pixels as black (Fig. 2c). Of course, this pre-processing step is noisy: some background pixels appear as white and some tool pixels as black (Fig. 2c). We disconnect touching regions using a simple distance transform [9] of the binary image, and refine the separation with an erosion step using a cross-shaped kernel (Fig. 2d). Finally, we use a contour detection algorithm [10] to extract the extreme outer contour of each region as an oriented bounding box (see Fig. 3b). Based on the observation that laparoscopic instruments have a long and thin cylindrical shape, we eliminate bounding boxes whose elongation (ratio of the longer to the shorter side) is inferior to 2 (red boxes in Fig. 3c).
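For illustration, this rough extraction step maps to standard OpenCV primitives, as sketched below under our own assumptions (a 3×3 cross-shaped kernel, and OpenCV's 8-bit Lab encoding where a and b are stored with an offset of 128); the function name and helper choices are ours, not the authors' code.

```cpp
// Sketch of the rough region extraction (Sect. 2.1) with OpenCV.
// Illustrative only: kernel sizes and the function name are our assumptions.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

std::vector<cv::RotatedRect> extractCandidateRegions(const cv::Mat& bgr)
{
    // RGB -> CIELab; drop the luminance channel L
    cv::Mat lab;
    cv::cvtColor(bgr, lab, cv::COLOR_BGR2Lab);
    std::vector<cv::Mat> ch;
    cv::split(lab, ch);                          // ch[0]=L, ch[1]=a, ch[2]=b

    // Chromaticity C_ab = sqrt(a^2 + b^2); OpenCV stores a,b offset by 128
    cv::Mat a, b, chroma;
    ch[1].convertTo(a, CV_32F, 1.0, -128.0);
    ch[2].convertTo(b, CV_32F, 1.0, -128.0);
    cv::magnitude(a, b, chroma);
    cv::normalize(chroma, chroma, 0, 255, cv::NORM_MINMAX, CV_8U);

    // Automatic Otsu binarization: instrument pixels vs. background
    cv::Mat bin;
    cv::threshold(chroma, bin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    // Disconnect touching regions: distance transform, re-binarization, erosion
    cv::Mat dist;
    cv::distanceTransform(bin, dist, cv::DIST_L2, 3);
    cv::normalize(dist, dist, 0, 255, cv::NORM_MINMAX, CV_8U);
    cv::threshold(dist, bin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    cv::erode(bin, bin, cv::getStructuringElement(cv::MORPH_CROSS, cv::Size(3, 3)));

    // Extreme outer contours -> oriented bounding boxes; keep only elongated ones
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(bin, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    std::vector<cv::RotatedRect> boxes;
    for (const auto& c : contours) {
        cv::RotatedRect box = cv::minAreaRect(c);
        const float longSide  = std::max(box.size.width, box.size.height);
        const float shortSide = std::min(box.size.width, box.size.height);
        if (shortSide > 0.f && longSide / shortSide >= 2.f)   // elongation test
            boxes.push_back(box);
    }
    return boxes;
}
```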

Fig. 1. Typical images obtained by color-to-grayscale conversion. (a) Original image (b) Saturation-modified channel in HSV space [7] (c) Chromaticity \(C_{ab}\) of CIELab space (Color figure online)

Fig. 2. Segmentation of a surgical instrument in 2D images. (a) Original image (b) Chromaticity \(C_{ab}\) of CIELab space (c) Segmentation using Otsu's thresholding (d) Conversion of the binary image using the distance transform (e) Disconnection of regions in the binary image using the distance transform (f) Binarization of the distance transform image (Color figure online)

Fig. 3. Edge detection of a surgical instrument in 2D images. (a) Original image (b) Edge detection (c) Potential instrument bounding boxes obtained from image (b) (green) and incompatible bounding boxes (red) (Color figure online)

Fig. 4. Extraction of instrument edges. (a) and (b) Two images extracted from the same surgery at different time intervals (c) Image extracted from another surgery (d), (e) and (f) Edge detection in images (a), (b) and (c) by the Canny filter with thresholds \(T_L=30\) and \(T_H=90\) (h), (i) and (j) Edge detection by the Frangi filter in images (a), (b) and (c) with parameters \(\sigma =2\), \(\beta =0.5\) and \(c=0.5\text {max}(S)\)

2.2 Fine Extraction of Instrument Edges

Now that we have potential bounding boxes for the instruments, we search for instrument edges within each bounding box. To do so, we use a Frangi filter [5], which is the major contribution of this paper. We compared the Frangi filter to the classical Canny filter [6] for the detection of instrument edges (see Fig. 4). The Canny filter is the most classical gradient approach, based on the Sobel filter. It uses a hysteresis thresholding that requires finding two optimal thresholds for accurate extraction of the edges of an instrument. However, as shown in Fig. 4, the conditions of the surgical scene evolve during an intervention, so thresholds determined initially may no longer be optimal and can cause false detections. The advantage of the approach based on the Frangi filter is that it can be applied to different surgery conditions without adjusting the filter parameters. This filter is classically used for vessel detection in medical images. It is based on the computation of the eigenvalues \(\lambda _1, \lambda _2\) of the image's Hessian matrix, such that \(|\lambda _1|\leqslant |\lambda _2|\). The Hessian matrix is obtained by convolving the image with second-order derivatives of a Gaussian kernel with standard deviation \(\sigma \).

The Frangi filter function can be defined as:

$$\begin{aligned} V_0 = \left\{ \begin{array}{ll} 0 &{} \text {if}~ \lambda _2>0,\\ \exp \left( -\frac{R_B^2}{2\beta ^2}\right) \left( 1-\exp \left( -\frac{s^2}{2c^2}\right) \right)  &{} \text {otherwise} \end{array} \right. \end{aligned}$$
(1)

where \(R_B = \frac{\lambda _1}{\lambda _2}\) is the blobness measure, \(s=\sqrt{\lambda _1^2+\lambda _2^2}\) is the structureness measure, and \(c\) and \(\beta \) are parameters that adjust the filter sensitivity. After applying the Frangi filter, each pixel value \(V_0\) indicates the probability that the pixel belongs to a tubular structure. Here, we do not use the Frangi filter to extract the whole cylindrical shape of the instrument: the instrument's diameter in the image varies depending on its orientation relative to the endoscope (i.e. we cannot fix the standard deviation \(\sigma \)). Instead, we apply the filter with a very low \(\sigma \) in order to highlight the instrument edges (Fig. 5b). Finally, we identify the two borders of an instrument: the bounding box is extended and separated into two areas in order to search for the top and bottom borders of the instrument separately using the Hough transform [11] with a very low threshold, as illustrated in Fig. 5b. At this step, we can eliminate lines that are incompatible with a surgical instrument based on the relative orientation and position of the detected lines (as illustrated in Fig. 5c).
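As a concrete illustration, a minimal 2D Frangi response following Eq. (1) can be sketched as below, where the Hessian is approximated by Gaussian smoothing followed by Sobel second derivatives and \(c\) is set to \(0.5\,\text{max}(s)\) as in Sect. 3; this is our own sketch, not the authors' implementation.

```cpp
// Sketch of the 2D Frangi response of Eq. (1). The Hessian is approximated by
// Gaussian smoothing followed by Sobel second derivatives; names are ours.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>

cv::Mat frangi2D(const cv::Mat& gray8u, double sigma = 2.0, double beta = 0.5)
{
    cv::Mat img, blurred;
    gray8u.convertTo(img, CV_32F);
    const int k = 2 * static_cast<int>(std::ceil(3.0 * sigma)) + 1;  // kernel size
    cv::GaussianBlur(img, blurred, cv::Size(k, k), sigma);

    // Second-order derivatives (entries of the Hessian matrix)
    cv::Mat Ixx, Iyy, Ixy;
    cv::Sobel(blurred, Ixx, CV_32F, 2, 0, 3);
    cv::Sobel(blurred, Iyy, CV_32F, 0, 2, 3);
    cv::Sobel(blurred, Ixy, CV_32F, 1, 1, 3);

    // Eigenvalues with |lambda_1| <= |lambda_2|; structureness s = sqrt(l1^2 + l2^2)
    cv::Mat l1(img.size(), CV_32F), l2(img.size(), CV_32F), s(img.size(), CV_32F);
    for (int y = 0; y < img.rows; ++y)
        for (int x = 0; x < img.cols; ++x) {
            const float hxx = Ixx.at<float>(y, x), hyy = Iyy.at<float>(y, x),
                        hxy = Ixy.at<float>(y, x);
            const float d = std::sqrt((hxx - hyy) * (hxx - hyy) + 4.f * hxy * hxy);
            float a = 0.5f * (hxx + hyy - d), b = 0.5f * (hxx + hyy + d);
            if (std::abs(a) > std::abs(b)) std::swap(a, b);
            l1.at<float>(y, x) = a;
            l2.at<float>(y, x) = b;
            s.at<float>(y, x)  = std::sqrt(a * a + b * b);
        }

    double sMax = 0.0;
    cv::minMaxLoc(s, nullptr, &sMax);
    const double c = 0.5 * std::max(sMax, 1e-6);         // c = 0.5 max(s)

    cv::Mat V0 = cv::Mat::zeros(img.size(), CV_32F);
    for (int y = 0; y < img.rows; ++y)
        for (int x = 0; x < img.cols; ++x) {
            const float lb = l2.at<float>(y, x);
            if (lb > 0.f) continue;                      // V0 = 0 if lambda_2 > 0
            const float Rb = l1.at<float>(y, x) / lb;    // blobness
            const float ss = s.at<float>(y, x);          // structureness
            V0.at<float>(y, x) =
                std::exp(-(Rb * Rb) / float(2.0 * beta * beta)) *
                (1.f - std::exp(-(ss * ss) / float(2.0 * c * c)));
        }
    return V0;   // per-pixel probability of belonging to a tubular structure
}
```

The response within each extended bounding box can then be thresholded and passed to a standard line detector such as cv::HoughLines to recover the two borders of the instrument.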

Fig. 5. Estimation of instrument poses in the image and in space. (a) Original image (b) Expansion and separation of a compatible bounding box in the Frangi-filtered image (c) Instrument borders refinement process (d) Detection of the instrument borders (e) Instrument tip detection in the Frangi image (f) Instrument pose in the image (g) Geometric representation of an instrument in space (h) Illustration of the computed instrument position in space

2.3 Estimation of 3D Pose of the Instruments

The two borders of an instrument define two planes \(\Sigma _i\) of normal \(\mathbf{n_i}\), tangent to the instrument and passing through the optical center \(\mathbf{C}\) of the camera (see Fig. 5g). The camera calibration can be obtained with a classical chessboard calibration procedure such as [12]. The intersection of these two planes is a line \(\mathbf{D:(C,e_1)}\) passing through the optical center of the camera, with a direction vector \(\mathbf{e_1}\) parallel to the central axis of the instrument. This line gives the direction of the instrument's central axis in space. In order to fully describe the tool's pose in space, we also need to find a point \(\mathbf{P}\) on the instrument's axis. To do so, we follow the approach proposed in [3]: the instrument is modeled as a finite cylinder of radius \({\rho }\) (see Fig. 5g). Such a point \(\mathbf{P}\) can easily be computed in the plane perpendicular to the instrument's axis (Fig. 5h). Indeed, \(\mathbf{P}\) must respect the condition:

$$\begin{aligned} \lambda \mathbf{m_1}-\rho \mathbf{n_1}=\lambda \mathbf{m_2}+\rho \mathbf{n_2} \end{aligned}$$
(2)

where \(\mathbf{m_i}=\mathbf{e_1}\times \mathbf{n_i}\), \(\lambda \) is the distance from the optical center to the tangent points \(\mathbf{S_i}\), and \(\mathbf{n_i}\) is the normal to the plane \(\Sigma _i\). Taking the dot product of both sides of Eq. 2 with \(\mathbf{n_1}+\mathbf{n_2}\), we can compute \(\lambda \) and obtain:

$$\begin{aligned} \overrightarrow{\mathbf{CP}}=\lambda \mathbf{m_1}-\rho \mathbf{n_1}=\rho \frac{\left\| \mathbf{n_1}+\mathbf{n_2} \right\| ^2}{(\mathbf{m_1}-\mathbf{m_2})\cdot (\mathbf{n_1}+\mathbf{n_2})}\mathbf{m_1}-\rho \mathbf{n_1} \end{aligned}$$
(3)

Then, we search for the position of the instrument's tip \(\mathbf{t_{im}}\) in the Frangi image, along the projection of the instrument's axis \(\mathbf{(P,e_1)}\) in the image (see Fig. 5e). The pixel along this line with the maximum grey level in the Frangi image is considered to be the tip. Finally, we find the 3D position of the instrument's tip \(\mathbf{T}\) as the intersection of \(\mathbf{(P,e_1)}\) with the projection line of the tool's tip \(\mathbf{(C,t_{im})}\).
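As a worked example of Eqs. (2) and (3), assuming the two unit tangent-plane normals \(\mathbf{n_1}\), \(\mathbf{n_2}\) have already been recovered from the detected borders and the camera calibration, the axis direction \(\mathbf{e_1}\) and the point \(\mathbf{P}\) can be computed as sketched below (our own illustration, with hypothetical names).

```cpp
// Worked example of Eqs. (2)-(3): compute the axis point P in the camera frame
// (optical center C at the origin) from the two unit tangent-plane normals
// n1, n2 and the known instrument radius rho.
#include <opencv2/core.hpp>

cv::Vec3d instrumentAxisPoint(const cv::Vec3d& n1, const cv::Vec3d& n2, double rho)
{
    // Axis direction e1: intersection of the two tangent planes
    cv::Vec3d e1 = n1.cross(n2);
    e1 *= 1.0 / cv::norm(e1);

    const cv::Vec3d m1 = e1.cross(n1);            // m_i = e_1 x n_i
    const cv::Vec3d m2 = e1.cross(n2);
    const cv::Vec3d nSum = n1 + n2;

    // Eq. (2) gives lambda (m1 - m2) = rho (n1 + n2); dot both sides with (n1 + n2)
    const double lambda = rho * nSum.dot(nSum) / (m1 - m2).dot(nSum);

    // Eq. (3): vector CP = lambda m1 - rho n1
    return lambda * m1 - rho * n1;
}
```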

2.4 Tracking of Surgical Instruments

For our instrument tracking algorithm, we assume that an instrument does not undergo large displacements between two successive images. In the initial step (first image), we find the instrument as described in Sects. 2.1 and 2.2. In the following images, we find the candidate bounding boxes, but we refine the instrument search only inside the bounding box most compatible with the position and orientation of the instrument in the previous image. If the instrument is not found over several images, we re-initialize the algorithm. When several instruments are present, it is possible to track all the visible instruments or a particular one. Since only one instrument can be inserted at a time through an insertion point \(\mathbf{I}\) on the abdominal wall, an instrument can be identified by its insertion point, which can easily be computed by applying a pivot algorithm to \(\mathbf{(P,e_1)}\).
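A possible structure for the per-frame tracking step is sketched below. The pose structure, the compatibility score and the helpers extractCandidateRegions and refineInstrument are hypothetical placeholders standing in for the steps of Sects. 2.1 and 2.2–2.3 and for the position/orientation consistency test; this is not the authors' code.

```cpp
// Per-frame tracking sketch (Sect. 2.4). Helpers are illustrative placeholders.
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

struct InstrumentPose { cv::Point2f tip; float angle; bool valid = false; };

// Placeholders for the previous sections: region extraction (Sect. 2.1) and
// edge/pose refinement inside one bounding box (Sects. 2.2-2.3).
std::vector<cv::RotatedRect> extractCandidateRegions(const cv::Mat& frame);
InstrumentPose refineInstrument(const cv::Mat& frame, const cv::RotatedRect& box);

// Simple compatibility score: closeness of a candidate box to the previous tip
// (a placeholder for the position/orientation consistency test).
static double compatibility(const cv::RotatedRect& box, const InstrumentPose& prev)
{
    if (!prev.valid) return 0.0;
    const cv::Point2f d = box.center - prev.tip;
    return -std::hypot(d.x, d.y);
}

InstrumentPose trackFrame(const cv::Mat& frame, const InstrumentPose& prev, int& lost)
{
    const std::vector<cv::RotatedRect> boxes = extractCandidateRegions(frame);

    // Keep only the candidate box most compatible with the previous pose
    // (on the very first image, every candidate box would be refined instead).
    const cv::RotatedRect* best = nullptr;
    double bestScore = -1e9;
    for (const auto& b : boxes) {
        const double score = compatibility(b, prev);
        if (best == nullptr || score > bestScore) { best = &b; bestScore = score; }
    }

    InstrumentPose pose;                         // invalid by default
    if (best != nullptr)
        pose = refineInstrument(frame, *best);   // Frangi edges + pose estimation

    // If the instrument is lost over several consecutive images, the caller
    // re-initializes the full detection of Sects. 2.1 and 2.2.
    lost = pose.valid ? 0 : lost + 1;
    return pose;
}
```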

Fig. 6. Results of the tracking on the laparoscopic image sequences. (a) Sequence 1: Monopolar hook instrument (b) Sequence 2: Monopolar hook instrument (c) Sequence 3: Needle holder instrument (d) Example of a bad tip detection

Table 1. Laparoscopic images statistics
Fig. 7. Experimental test bench to evaluate the 3D pose estimation accuracy, with a printout of a surgical scene as background.

Fig. 8. Estimation of the instrument's pose in space. (a) Calibration step to find the rigid transformation \(\mathbf{T}\) (b) Evaluation of the 3D pose estimation accuracy, with the reference pose obtained with the robot in black and the pose computed with our method in green (Color figure online)

3 Experiments and Results

Our algorithm is implemented in C\(^{++}\) using the OpenCV and OpenMP libraries. For the computations, we used an Intel Xeon PC (2.67 GHz, 3.48 GB RAM). The 2D evaluation was performed on real laparoscopic images (720\(\,\times \,\)556). The 3D evaluation was performed on a laparoscopy test bench using an OLYMPUS OTV600 CCD camera and an IC Imaging Source frame grabber (\(720\,\times \,480\), 25 fps). To achieve a fast processing time, the image resolution is divided by 2 for the region extraction and by 4 for the Frangi filter. We evaluated the 2D tracking of our algorithm on three in-vivo video sequences of laparoscopic rectopexies, obtained through the Digestive Department of Grenoble Hospital, containing challenging situations (see Fig. 6).

In these images, the tip position and orientation of the instrument were compared to manual annotations. The results obtained for each sequence are presented in Table 1, with a mean error of 16.10 pixels (std. dev. 28.98) for the tip position, a mean error of 0.90\(^{\circ }\) (std. dev. 0.88\(^{\circ }\)) for the 2D orientation, and a processing rate of 30 Hz. Videos of this evaluation are included in the supplementary material.

To evaluate the accuracy of the 3D pose estimation, we performed experiments on a test bench (see Fig. 7) consisting of a surgery trainer box on which a commercial robotic instrument holder is directly positioned, with a printout of a surgical scene as background. We compared the 3D tip position of the instrument found by our algorithm to the 3D tip position given by the robot, expressed in the camera frame. This required calibrating the system to find the rigid transformation \(\mathbf{T}\) between the robot and camera frames such that \(\mathbf{p_{cam}^{frangi}=Tp_{cam}^{robot}}\). \(\mathbf{T}\) is obtained by pointing 12 points of a chessboard, for 6 chessboard positions, with the instrument carried by the robot (see Fig. 8). These 12 points can be expressed in the camera frame thanks to a standard extrinsic camera calibration procedure [12], and are also measured in the robot frame. We then solve a classical least-squares problem to find the rigid transformation between the two sets of 3D points, coupled with RANSAC to eliminate outliers. We obtain a camera calibration root mean square (RMS) error of 0.25 pixels and \(\mathbf{T}\) with an RMS error of 1.2 mm. Figure 9 shows an example of the robot trajectory and of our tracking method for a series of instrument movements. The results for 380 measurements are presented in Table 2. In all the results presented, we fixed the Frangi filter parameters to \(\sigma =2\), \(\beta =0.5\) and \(c=0.5\text {max}(s)\), following recommendations from the literature.
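For completeness, the closed-form least-squares rigid registration between the two 3D point sets (the classical SVD/Kabsch solution, without the surrounding RANSAC loop) can be written as sketched below; this is a generic illustration, not the authors' code.

```cpp
// Least-squares rigid registration between two sets of corresponding 3D points
// (Kabsch method via SVD). Finds R, t minimizing sum || R * p_robot + t - p_cam ||^2.
// Assumes at least 3 non-degenerate correspondences; RANSAC loop omitted.
#include <opencv2/core.hpp>
#include <vector>

void rigidTransform(const std::vector<cv::Point3d>& robotPts,
                    const std::vector<cv::Point3d>& camPts,
                    cv::Mat& R, cv::Mat& t)
{
    const int n = static_cast<int>(robotPts.size());
    cv::Mat cR = cv::Mat::zeros(3, 1, CV_64F), cC = cv::Mat::zeros(3, 1, CV_64F);
    for (int i = 0; i < n; ++i) {
        cR += cv::Mat(robotPts[i]);
        cC += cv::Mat(camPts[i]);
    }
    cR = cR * (1.0 / n);                                 // centroids
    cC = cC * (1.0 / n);

    cv::Mat H = cv::Mat::zeros(3, 3, CV_64F);            // cross-covariance matrix
    for (int i = 0; i < n; ++i)
        H += (cv::Mat(robotPts[i]) - cR) * (cv::Mat(camPts[i]) - cC).t();

    cv::Mat w, U, Vt;
    cv::SVD::compute(H, w, U, Vt);
    cv::Mat D = cv::Mat::eye(3, 3, CV_64F);
    D.at<double>(2, 2) = (cv::determinant(Vt.t() * U.t()) < 0) ? -1.0 : 1.0;
    R = Vt.t() * D * U.t();                              // rotation (no reflection)
    t = cC - R * cR;                                     // translation
}
```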

Fig. 9. Robot trajectory (red) and our tracking method trajectory (green) for 380 instrument positions. (a) 3D trajectories (b), (c) and (d) X, Y and Z trajectories with respect to the camera frame (of normal Z) (Color figure online)

Table 2. Error of the 3D pose estimation with our method compared to the position obtained with a robotic instrument holder

4 Conclusion

We presented a surgical instrument tracking algorithm based on image processing. It estimates the 2D/3D pose of the instruments in real-time without artificial fiducials. An extensive 2D evaluation on real surgical videos shows that our 2D pose estimation is accurate and robust on a wide range of realistic cases. In difficult situations, such as a suturing gesture, we can lose accuracy on the instrument's tip position, but the orientation remains correct. A machine learning approach such as [13], applied in the neighbourhood of our estimated tip position, could increase the accuracy of the tip detection. Our approach for 3D pose estimation was validated on a test bench using a printout of a surgery background. Although this might lack realism, we consider that the robustness of the proposed method on realistic images was already shown extensively in the 2D case. This 3D evaluation provides the precision range we can expect when the 2D detection works well. The greatest errors are found in the depth estimation along the z axis. This error could be reduced by using a stereoscopic endoscope.

Our 2D localization approach is robust and accurate enough to control a robotic endoscope holder. Even if the Frangi filter might not be the most obvious choice for edge detection, we showed that it works better than classical approaches. Other, more sophisticated edge detection approaches could easily be compared on our image database. The 3D pose estimation could be useful for surgical gesture recognition or for co-manipulation, if we are able to increase the depth precision. Another application could be the online calibration of non-rigidly-linked robotic endoscope and instrument holders, which could lead to less bulky surgical systems. Our next step will be to evaluate the 3D pose estimation more extensively in conditions closer to clinical reality (cadaver experiments).