Introduction

In the context of minimally invasive surgery (MIS), surgical cameras such as laparoscopes, arthroscopes, and pass-through head-mounted displays (HMDs) are often used as a surrogate for direct vision. They provide a superficial view of the anatomy and are incapable of visualizing internal structures beneath the organ surface [14]. One way to enhance surgical video is to overlay preoperative medical images (such as CT and MRI) or intraoperative ultrasound directly onto the video in an augmented reality (AR) environment. This necessitates a highly accurate hand-eye calibration [11, 24] between the optical axis of the camera and the spatial measuring device. Inaccurate calibration results in misalignment between virtual and real objects in the image overlay, creating additional mental burden for the surgeon and potential errors in instrument placement. Once properly calibrated, advanced visualization techniques can be used to facilitate surgical planning [1] and surgical guidance [16, 18].

Fig. 1

Tracked cameras used for this study: a a commercial webcam (C920, Logitech, USA) commonly used for medical training/teaching purposes, b a head-mounted display (Oculus Rift, Oculus VR, LLC, USA) with pass-through stereo cameras (Ovrvision Pro, Ovrvision Inc., Japan), potentially used in augmented reality surgical guidance, and c a clinical laparoscope (Olympus). The HMD has a built-in optical tracking system (the tracking IREDs are shown as cyan dots) which is not utilized in this study. A passive DRF is rigidly attached to each of the three cameras

Hand-eye calibration is an active research topic in robotics [22], where the camera view (i.e. the eye) must be linked with the kinematics of the robotic system (i.e. the hand). Most approaches rely on imaging salient features of a stationary object from different poses and then solving for rotation and translation either separately [24], jointly [5], or iteratively [6, 10]. Hand-eye calibration in a surgical environment is non-trivial [23], since issues such as sterilization, sensor attachment, and real-time computation requirements remain challenging. In tracked MIS, surgical instruments and patient anatomy are augmented with a dynamic reference frame (DRF), allowing their poses to be determined in a common coordinate system. Thus, an additional tracked calibration device may be used to aid the process of hand-eye calibration. Perhaps the simplest method is the Procrustean approach, where the hand-eye calibration is reduced to paired-point registration [12, 21]. Both Voruganti and Bartz [25] and Chen et al. [4] used a calibrated, tracked planar chessboard pattern for hand-eye calibration, where the 3D position of the chessboard corners is determined both in the tracker's coordinate system (via tracking) and in the camera's coordinate frame (by solving some form of the Perspective-n-Point problem). Calibration of the chessboard to its DRF is a possible contributor to the fiducial localization error (FLE), and while simple and effective, the reported accuracy for these approaches remains sub-optimal [4].

Contribution

We present an accurate hand-eye calibration framework linking a surgical camera and an external spatial measurement device. This framework requires minimal user interaction and compares favourably with other algorithms found in the current literature. We demonstrate the applicability of this framework to three different types of camera commonly used in medical training, surgical planning and image-guided surgery. We propose to use a ball-tip stylus as the calibration device and formulate the hand-eye calibration as a Procrustean point-to-line registration. The numerical algorithm has a very compact formulation (“Appendix”), requiring minimal measurements (typically 12–15 tracked images) to converge to a stable and accurate calibration.

Methods

Without loss of generality, we assume the optical characteristics of the camera lens have been determined accurately by other means (such as Zhang [27]) and that the images are undistorted (i.e. both radial and tangential distortions are removed). All lenses are assumed to have a fixed focal length. A passive optical spatial measurement device (Spectra, NDI, Canada) was used for this study, although other forms of tracking can be easily incorporated.

Apparatus and image processing

Three cameras were used in this study (Fig. 1): a commercial webcam (C920, Logitech, USA), an HMD with pass-through stereo cameras (Ovrvision Pro, Ovrvision Inc., Japan), and a stereo laparoscope (surgical laparoscope, Olympus). Camera specifications are listed in Table 1. For the purpose of evaluating the efficacy of our framework, the DRFs are rigidly attached as close to the camera lenses as possible. While this spatial arrangement may not be clinically plausible (in the case of the rigid laparoscope), it has the advantage of minimizing tracking error due to the lever-arm effect of tracking uncertainty.

Table 1 Acquisition specifications for the cameras used in this study
Fig. 2

Two examples of a ball-tip stylus as a calibrator for hand-eye calibration: a a commercial stylus augmented with a ball tip of 10.0 mm radius, and b a custom tracked stylus with a ball tip of 30.0 mm radius

Fig. 3

a Image captured by the Ovrvision Pro device with the optically tracked stylus in view (ball-tip radius of 30.0 mm). b Image following colour thresholding: red objects are displayed as white and all other colours as black. c Detected circle drawn on a black image. d Detected circle outlined, with its centre found to sub-pixel accuracy. Similarly, e–h show images captured with the surgical laparoscope (Olympus) with a shallow field of view (ball-tip radius of 10.0 mm)

We formulate the hand-eye calibration as a Perspective-n-Point problem that can be solved efficiently using a Procrustean point-line registration [3]. A calibration tool in the form of a ball-tip stylus was designed, where the size of the ball tip accommodates the viewing depth of a particular camera (Fig. 2). The ball tip was painted red to facilitate automatic segmentation. The centre of the ball tip can be accurately calibrated by pivoting it against a hemispherical divot of matching radius, a hollow inverted-cone divot, or a hollow tube divot.

When imaged by a camera, the red ball tip is projected as a circular pattern, which can be segmented automatically. First, the acquired colour image (Fig. 3a) is transformed into HSV colour space, in which a colour-thresholding technique is applied to locate red objects (Fig. 3b). The colour-thresholded image is smoothed by a Gaussian and a median filter to reduce noise and artefacts. A Hough transform is then applied to detect a circular pattern within the smoothed image, which provides a segmentation of the centroid of the projected ball tip with sub-pixel accuracy (Fig. 3c). The detected circle is drawn back onto the segmented image for visual validation (Fig. 3d). The detected sub-pixel location of the ball tip is recorded in conjunction with the tracker pose. This process is fully automatic and eliminates possible sources of error due to manual interaction and segmentation.
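
To make this pipeline concrete, the following is a minimal Python/OpenCV sketch of the segmentation step. The HSV thresholds, filter kernel sizes, and Hough parameters are illustrative values only; they must be tuned per camera and lighting condition and are not the exact settings used in this work.

```python
import cv2
import numpy as np

def segment_ball_tip(bgr_image):
    """Locate the projected red ball tip; return its centre (x, y) and radius in pixels."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so threshold two hue bands and combine them.
    mask = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255)) | \
           cv2.inRange(hsv, (170, 120, 70), (180, 255, 255))
    # Smooth the binary mask with Gaussian and median filters to suppress noise.
    mask = cv2.GaussianBlur(mask, (9, 9), 2)
    mask = cv2.medianBlur(mask, 5)
    # Hough transform for a single dominant circle; the centre is sub-pixel.
    circles = cv2.HoughCircles(mask, cv2.HOUGH_GRADIENT, dp=1,
                               minDist=mask.shape[0],
                               param1=100, param2=20, minRadius=5, maxRadius=200)
    if circles is None:
        return None
    x, y, r = circles[0, 0]
    return (x, y), r
```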

Hand-eye calibration as Perspective-n-Point

Given the camera matrix, a 3D coordinate system can be defined that is centred at the optical centre of the camera lens, with its principal axis pointing toward the scene. Using the ideal pinhole camera model, the intrinsic parameters of a camera can be represented as:

$$\begin{aligned} M = \begin{bmatrix} f_x&0&c_x \\ 0&f_y&c_y \\ 0&0&1 \end{bmatrix} \end{aligned}$$
(1)

where \((f_x,f_y)\) are the focal lengths (in pixel units) and \((c_x,c_y)\) is the principal point. A point \(Q=(X,Y,Z)^T\) in 3D space can be projected onto the image by:

$$\begin{aligned} q = M Q \end{aligned}$$
(2)

where \(q=(x,y,w)^T\). Given a pixel location, however, only the corresponding ray can be computed by:

$$\begin{aligned} r = M^{-1} q \end{aligned}$$
(3)

Using the camera model in canonical form, a pixel can be represented by the homogeneous coordinate \(q=[x,y,1]^T\).
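
As a small worked example of Eqs. 1–3, the following sketch back-projects a pixel to its viewing ray; the intrinsic values are hypothetical placeholders, not those of any camera in this study.

```python
import numpy as np

# Hypothetical intrinsics (f_x, f_y, c_x, c_y) for illustration only (Eq. 1).
M = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

def pixel_to_ray(u, v, M):
    """Back-project pixel (u, v) to a unit viewing ray in camera space (Eq. 3)."""
    q = np.array([u, v, 1.0])        # canonical homogeneous pixel q = [x, y, 1]^T
    r = np.linalg.solve(M, q)        # r = M^{-1} q
    return r / np.linalg.norm(r)     # direction of the ray from the camera origin

ray = pixel_to_ray(400.0, 300.0, M)  # every such ray passes through the camera origin
```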

Fig. 4

a Conceptualization of point-line registration: the segmented ball-tip centroid and the camera origin form a line which, after registration, must pass through the 3D location (in tracker space) of the tracked stylus, and b projection error is defined as the Euclidean distance between the location of the segmented ball tip (in the image) and the projected location (shown as a black circle)

Using the calibrated ball-tip stylus as a calibration device, for each measurement we record the 3D location of the centre of the ball tip (\(Q_i = [X_i,Y_i,Z_i]^\mathrm{T}\) in tracker space), as well as its projection onto the image (\(q_i=[x_i,y_i,1]^\mathrm{T}\)). A line r emanating from the centre of the camera through the centroid of the projected ball can be formulated using Eq. 3 which, after calibration, must pass through the centre of the tracked ball tip (Fig. 4a). This scenario is identical to the Perspective-n-Point (PnP) problem in computer vision, for which we previously presented an efficient solution [3]. Our algorithm requires only simple matrix operations, and the computational requirement is minimal. Refer to “Appendix” for a MATLAB implementation.
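
The following Python/NumPy sketch illustrates the structure of that iteration: alternately project the transformed tracker-space points onto their corresponding rays, then solve an orthogonal Procrustes (Arun/Kabsch) problem between the points and their projections. Variable names and the convergence test here are illustrative; the MATLAB code in the "Appendix" remains the reference implementation.

```python
import numpy as np

def point_line_registration(Q, V, max_iters=200, tol=1e-10):
    """Procrustean point-line registration (sketch).

    Q : (N, 3) ball-tip centres in tracker space.
    V : (N, 3) unit ray directions in camera space (all rays pass
        through the camera origin, as in Eq. 3).
    Returns R, t such that R @ Q[i] + t lies (approximately) on ray V[i].
    """
    R, t, prev = np.eye(3), np.zeros(3), np.inf
    for _ in range(max_iters):
        P = Q @ R.T + t                                # points in camera space
        Y = V * np.sum(P * V, axis=1, keepdims=True)   # closest point on each ray
        # Orthogonal Procrustes between Q and Y (Arun/Kabsch via SVD).
        qc, yc = Q.mean(axis=0), Y.mean(axis=0)
        H = (Q - qc).T @ (Y - yc)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflections
        R = Vt.T @ D @ U.T
        t = yc - R @ qc
        err = np.sum((Q @ R.T + t - Y) ** 2)
        if abs(prev - err) < tol:
            break
        prev = err
    return R, t
```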

Since there is a one-to-one correspondence between the points and lines, we refer to our algorithm as the Procrustean point-line calibration. In our prior work [3], this algorithm was compared against six other well-known PnP algorithms including Efficient PnP and Efficient PnP with Gauss–Newton refinement [15], Procrustean PnP [9], generalized Procrustean PnP [8], the generalized Fiore algorithm [8], and the Orthogonal Iteration algorithm [17]. We concluded that our point-line solution, despite having a very compact formulation, performed favourably against the other algorithms. We also demonstrated that our algorithm requires a minimum of three paired point-line measurements to establish a stable solution (assuming arbitrary line orientation), with the accuracy of the registration (as assessed by the target registration error, or TRE) increasing as the number of measurements increases. The sharpest drop in TRE occurs after 6–7 measurements, reaching a plateau after 12–15 measurements are acquired [2, 3].

Validation

Several well-known hand-eye calibration algorithms with open-source implementations are available in the current literature. Tsai [24] presented an algorithm that utilizes a series of images rotated around a calibration board. The Navy algorithm [19] solves AX = XB on the Euclidean group. The Inria [11] calibration technique utilizes the calibration frame instead of the camera frame to reduce error. Dual [5] expresses the transformation using the dual quaternion product for relative position and orientation. Lastly, Branch and Bound [10] by Heller et al. minimizes an objective function based on the epipolar constraint and was shown to be globally optimal with respect to the \(L_{\infty }\) norm. These calibration techniques provide a basis for comparison for our proposed solution.

The ball-tip stylus is used as a validation tool, providing an automatic assessment framework with minimal user interaction. Each camera was calibrated using the six hand-eye calibration algorithms. An independent set of images capturing the ball-tip stylus at varying depths was acquired. The projection error, defined as the distance (in pixel space) between the 3D location of the ball tip (in tracker space) projected onto the image and the centre of the segmented circle (in video), is reported for each camera/hand-eye calibration algorithm combination (Fig. 4b). We choose not to report the back-projection error, which associates the error with a physical unit, as projection and back-projection represent the same measure of quality (angular error) expressed in different coordinate systems.
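
As a concrete illustration of this metric, the following sketch computes the projection error for a single validation measurement, assuming an undistorted image and a hand-eye result expressed as a rotation R and translation t from tracker space to camera space; the function name and argument layout are illustrative.

```python
import numpy as np

def projection_error(Q_tracker, q_pixel, R, t, M):
    """Pixel-space projection error for one validation measurement (sketch).

    Q_tracker : (3,) ball-tip centre in tracker space.
    q_pixel   : (2,) automatically segmented circle centre in the image.
    R, t      : hand-eye transform mapping tracker space into camera space.
    M         : (3, 3) intrinsic matrix of Eq. 1.
    """
    p_cam = R @ Q_tracker + t                    # tracker -> camera coordinates
    q = M @ p_cam                                # project through the intrinsics (Eq. 2)
    projected = q[:2] / q[2]                     # dehomogenize to pixel coordinates
    return np.linalg.norm(projected - q_pixel)   # Euclidean distance in pixels
```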

Fig. 5

Accuracy assessment for various hand-eye calibration algorithms at varying poses: (red) our proposed Procrustean point-line calibration, (green) Tsai, (yellow) Inria, Dual, and Navy, (white) Branch and Bound, and (grey) ground truth (automatic segmentation). For this particular camera, we were not able to achieve accurate hand-eye calibration using the Tsai [24] algorithm

Data collection

The intrinsic parameters for all three cameras were determined using the method described by Zhang [27] as implemented in OpenCV, using a square-checkerboard pattern. Each camera was calibrated multiple times to ensure accuracy and consistency. Per-camera intrinsic parameters and validation data were held constant when testing different hand-eye calibration methods, effectively isolating the detected errors to algorithm behaviour. The ball-tip stylus was calibrated multiple times using pivot calibration [26], achieved using an inverted cone as a reciprocal surface for ideal pivoting. A typical root-mean-square pivot calibration error for the two styluses shown in Fig. 2 was less than 0.5 mm, which can be achieved consistently.
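
For context, a common algebraic formulation of pivot calibration solves, in least squares, for the tip offset in the DRF frame and the fixed pivot point in tracker space from a set of tracked poses. The following sketch shows one standard variant; it is not necessarily the exact method of [26].

```python
import numpy as np

def pivot_calibration(Rs, ps):
    """Algebraic pivot calibration (sketch).

    Rs : list of (3, 3) DRF rotations; ps : list of (3,) DRF translations.
    While pivoting, R_i @ tip + p_i equals the same fixed pivot point b
    for every pose i, giving the stacked linear system
        [R_i  -I] [tip; b] = -p_i.
    """
    A = np.vstack([np.hstack([R, -np.eye(3)]) for R in Rs])
    d = -np.concatenate(ps)
    x, *_ = np.linalg.lstsq(A, d, rcond=None)
    tip, pivot = x[:3], x[3:]
    rms = np.sqrt(np.mean((A @ x - d) ** 2))  # residual as an error estimate
    return tip, pivot, rms
```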

For all acquired tracked images, the ball-tip stylus was stabilized using a passive arm and moved throughout the viewing frustum of the camera. Static images were acquired and associated with the tracking data. Each measurement, including moving the passive arm, typically took less than 10 s to acquire. To achieve accurate hand-eye calibration, the ball-tip stylus was moved throughout the viewing frustum to maximize the spread of the fiducial placement [2]. Based on our prior work [3], we collected 12–15 measurements for each camera to derive the hand-eye calibration using our algorithm, as additional measurements may not necessarily improve the fitness of the calibration in any meaningful way.

An independent set of tracked images was acquired for validation purposes, using the ball-tip styluses, for each of the three cameras. All hand-eye calibration algorithms were evaluated using the same set of validation data, effectively isolating differences in the projection error to algorithm behaviour. All validation images were acquired with varying stylus poses (particularly the orientation), so that any bias or inaccuracy in pivot calibration would be apparent in our validation data analysis (“Results” section).

Results

Accuracy of the hand-eye calibration as assessed by the projection error can be visualized directly on the tracked images, as shown in Fig. 5. Using the ball-tip stylus, the centroid of the circular projection is automatically segmented (“Apparatus and image processing” section). The segmented centre is drawn on the original colour image, serving as the ground truth (shown as a grey marker). The 3D location of the tracked stylus is then projected through the hand-eye calibration (as well as the camera intrinsics) and displayed on the image as markers of different colours. The Euclidean distance between each coloured marker and the ground truth is reported as the projection error (in pixels).

Fig. 6

Projection errors for hand-eye calibration algorithms applied to the Logitech webcam, evaluated over varying depths

The accuracy of the commercial webcam as assessed by the projection error is shown in Fig. 6, with the distance between the ball tip and the centre of the camera ranging from 150.0 to 700.0 mm. For this particular camera, we were unable to achieve a repeatable hand-eye calibration using the Tsai [24] algorithm after several attempts; thus, its result is omitted from Fig. 6. The Inria, Navy, and Dual algorithms produced similar results and therefore overlap on the graph. The Tsai algorithm produced the largest projection error due to unstable calibration, whereas our proposed Procrustean point-line algorithm produced the best result. The Branch and Bound algorithm produced consistent results, outperforming the Inria, Navy, and Dual algorithms. Our results suggest that there is no correlation between the projection error and the distance from the camera. A summary of the accuracy for the webcam is listed in Table 2.

Table 2 Summary of the accuracy assessment for the Logitech C920 webcam in terms of projection error

Using the result of these hand-eye calibrations, an augmented reality (AR) visualization system was implemented which may be used for the purposes of training and instruction. A patient-specific lumbar vertebra (L2) phantom was manufactured based on a patient CT and registered to an optical DRF. As shown in Fig. 7, the overlay of the virtual representation of the spine coincides precisely with the phantom in the image.

Fig. 7

Projection of a 3D model onto an image following hand-eye calibration of the Logitech camera using the proposed Procrustean point-line algorithm

Accuracy results for the pass-through HMD, as assessed by projection error, are shown in Fig. 8 and summarized in Table 3. They demonstrate a similar trend; our proposed Procrustean point-line solution performed the best, followed closely by the Branch and Bound algorithm. The projection error does not correlate with the distance between the tracked stylus and the camera.

Fig. 8

Projection errors for hand-eye calibration algorithms applied to the Ovrvision Pro camera, evaluated over varying depths

When properly calibrated, the popular Tsai [24] algorithm produced accurate results, although slightly worse than the Branch and Bound [10] algorithm. For both the webcam and the pass-through HMD, the Navy [19] and Inria [11] algorithms performed almost identically. Both the webcam and the HMD pass-through cameras were calibrated using the custom tool, with a ball tip of 30.0 mm radius (Fig. 2b).

Table 3 Summary of the accuracy assessment for the Ovrvision Pro in terms of projection error

The surgical laparoscope by Olympus, originally used in the da Vinci Surgical System, has a limited viewing depth (roughly 250.0 mm). We were only able to calibrate this laparoscope using our proposed algorithm and the Branch and Bound algorithm, possibly due to the optics and the data selection requirement [20]. The projection error as a function of the distance from the camera is shown in Fig. 9 and summarized in Table 4. The stylus with a ball tip of 10.0 mm radius (Fig. 2a) was used to accommodate the limited field of view.

Fig. 9

Projection errors for hand-eye calibration algorithms applied to the Olympus surgical laparoscope, evaluated over varying depths

Fig. 10

Visualization of the projection error for the Olympus surgical laparoscope. The centre of the ball tip is automatically segmented and shown as a grey marker. The blue marker shows the projection based on our proposed algorithm, and the white marker shows the projection based on the Branch and Bound algorithm

For the surgical laparoscope with its limited field of view, our proposed algorithm outperformed the Branch and Bound [10] algorithm in terms of maximum and mean projection errors by a factor of 3. The surgical laparoscope was difficult to calibrate using the standard techniques, possibly due to the requirement to collect measurements in a dense and controlled manner [20]. For all three cameras, our proposed algorithm consistently performed better than existing techniques. Visualizations of typical projection errors for the Olympus surgical laparoscope are depicted in Fig. 10.

Table 4 Summary of the accuracy assessment for the Olympus surgical laparoscope in terms of projection error

Discussion

The results presented in this paper demonstrate that our proposed Procrustean point-line hand-eye calibration is well suited for navigated minimally invasive surgery. It is scalable to cameras with varying fields of view, ranging from a commercial webcam to a surgical camera. Its versatility, ease of implementation, and ease of data collection make our proposed algorithm a suitable candidate for either preoperative or intraoperative hand-eye calibration. In our laboratory setup, a highly accurate hand-eye calibration can be achieved using 12–15 images, which can be acquired in less than 3 min.

For all the cameras tested, our algorithm delivered the best performance when compared against five other well-understood algorithms. In particular, the Branch and Bound algorithm is considered the state of the art, as it was shown to be globally optimal with respect to the \(L_{\infty }\)-norm. We note that all five publicly available algorithms rely on imaging salient features of an object from varying poses; thus, they are inherently sensitive to the range of poses between measurements as well as the quality of the lens/images acquired. In particular, one needs to optimize the tracked image acquisition carefully for both the Tsai [24] and the Branch and Bound [10] algorithms, as these algorithms tend to fail when the pose space is not sampled densely enough. Schmidt et al. [20] addressed the issue of data selection in detail.

As we present our algorithm as a Procrustean registration, its performance can be understood in terms of fiducial localization error (FLE) and target registration error (TRE). It is well understood that the TRE is proportional to the FLE and is influenced by the fiducial configuration [7]. The ball-tip stylus, as a calibrator, can be calibrated accurately through pivot calibration. The spherical tip is projected onto the image as a circle, which can be segmented accurately and robustly. Both of these factors minimize the contribution to FLE. In addition, since this calibrator can be placed anywhere within the viewing frustum of the camera, the fiducial configuration can be optimized with respect to any target inside the frustum. The ease of optimizing data collection in terms of fiducial configuration is a possible explanation for the superior performance of our proposed algorithm. One possible future direction is to optimize fiducial placement based on a TRE prediction model [2]. A simple heuristic, such as placing 16 fiducials evenly across a \(4\times 4\) grid on the image (see the sketch below), will almost guarantee a reasonable calibration.
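
A helper of the following kind could generate the target pixel locations at which to present the ball tip during acquisition; the 10% border margin is an illustrative choice, not a value from this study.

```python
import numpy as np

def grid_targets(width, height, n=4, margin=0.10):
    """Return n*n pixel locations spread evenly over the image, at which
    to place the ball tip during data collection (the 4x4 heuristic)."""
    xs = np.linspace(margin * width, (1.0 - margin) * width, n)
    ys = np.linspace(margin * height, (1.0 - margin) * height, n)
    return [(x, y) for y in ys for x in xs]

targets = grid_targets(640, 480)  # 16 evenly spaced targets on a VGA image
```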

Other Procrustean approaches, such as Voruganti and Bartz [25], employ a planar chessboard, which requires its own calibration. In our experience, calibration of the tracked chessboard, which is often aided by a calibrated stylus, also contributes to FLE [4]. Minimizing the FLE contributed by three tracked objects requires a considerable amount of engineering effort. In our approach, since we use a tracked stylus as the calibrator directly, the need to calibrate an intermediate calibrator is eliminated.

We acknowledge that using a stylus as both the hand-eye calibration and validation tool may potentially introduce a systematic bias in favour of our algorithm. The use of the ball-tip stylus, nonetheless, provided a common evaluation framework for all the hand-eye calibration techniques. Employing a separate validation apparatus may introduce additional error (such as its own calibration); it would then be ambiguous whether experimental error originated from the calibration tool or the validation tool. Furthermore, if there were a more accurate validation tool suitable for clinical deployment, that tool should be used for calibration instead. The same paradox has been recognized in the context of ultrasound calibration for quite some time [13]. In our validation framework, any rotational and translational error in the image plane can be detected quantitatively using the projection error. Translational error along the viewing direction of the camera would manifest itself as a scaling error, which would be apparent in the image overlay. Figure 7 provides anecdotal evidence of minimal scaling error using our algorithm.

Our proposed point-line calibration is an iterative algorithm and is sensitive to the line configuration. If all measurements were made such that the lines are parallel, our point-line registration would never reach a unique solution. In the case of hand-eye calibration of a camera, where the lines form a bundle of rays at the camera origin, we have shown that our algorithm always converges from an uninitialized state as long as more than six measurements are acquired [3]. A good initial estimate would reduce the number of iterations required by our algorithm; the Efficient PnP [15] or the weak-perspective camera model estimate used by Orthogonal Iteration [17] would serve as suitable initial estimates.
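
As one way to obtain such a seed, OpenCV's EPnP solver can produce an initial pose from the same paired measurements; the wrapper below is illustrative only and is not part of the proposed framework.

```python
import cv2
import numpy as np

def initial_estimate(Q, q, M):
    """Seed the point-line iteration with an EPnP pose estimate (sketch).

    Q : (N, 3) ball-tip centres in tracker space (N >= 4 for EPnP).
    q : (N, 2) segmented pixel centres; M : (3, 3) intrinsic matrix.
    """
    ok, rvec, tvec = cv2.solvePnP(Q.astype(np.float64), q.astype(np.float64),
                                  M, distCoeffs=None, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return np.eye(3), np.zeros(3)      # fall back to an uninitialized state
    R0, _ = cv2.Rodrigues(rvec)            # rotation vector -> rotation matrix
    return R0, tvec.ravel()
```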

Building on the TRE prediction model for point-line registration that was recently introduced [2], one possible future direction is to assess the fitness of the hand-eye calibration as soon as a new measurement is acquired. Both our experimental results and theoretical prediction [3] suggest that accurate calibration can be achieved using 12 to 15 measurements, with diminishing improvement when more measurements are available. The ability to establish an accurate calibration using minimal data acquisition may be advantageous in a clinical setting.

The pass-through HMD, the Oculus Rift, has a built-in spatial measuring device based on IREDs and inertial sensors. As shown in Fig. 1b, these IRED emitters are visible to the optical spatial measuring device. Under the standard setting, these IREDs flash in a binary pattern for 2 s out of every 20 s to compensate for the temporal drift of the inertial sensors. We are exploring the possibility of tracking these IREDs directly using the optical spatial measuring device, both to improve tracking accuracy and to improve ergonomics.

Conclusion

We present a hand-eye calibration for surgical cameras, using a ball-tip stylus as a calibrator. The calibration is formulated as a Procrustean point-line registration, requiring only a small number of measurements while achieving high accuracy. We demonstrated that the proposed algorithm performs as well as or better than the most common existing methods. The Procrustean formulation allows a target registration error prediction model to be used as a real-time assessment of the fitness of the calibration.