1 Introduction

Three-dimensional (3D) image overlay based augmented reality (AR) surgical navigation has been reported many times in the literature [35, 8]. The 3D image, which has both horizontal and vertical parallax, is displayed by a lens-array monitor (3D display) and can be observed without wearing glasses. In our recent work [12], we proposed an integrated 3D image overlay based surgical visualization solution including 3D image rendering, distortion correction, and spatial projection. Given an anatomical mesh model derived from CT data, the 3D image of the model, which has the same geometric dimensions as the original organ, can be rendered in real time with the help of a graphics processing unit (GPU). To superimpose the 3D image on the patient's body with correct overlay, the pose of the anatomical model (i.e., the digitalized patient organ in the image space) with respect to the 3D display has to be determined intraoperatively. 3D measurement systems (e.g., the Polaris tracking system) are usually employed for this purpose. The pose determination is composed of two steps. One is to calculate the transformation from the 3D display to the 3D measurement system, which is called display-camera calibration. The other is to calculate the transformation from the 3D measurement system to the patient (i.e., the pose of the anatomical model with respect to the 3D measurement system), which is called image registration. The device calibration is performed offline only once because the 3D display and the 3D measurement system remain relatively fixed during surgery. However, the image registration is affected by the patient's movement, which requires the image registration process to be performed automatically in real time.

In the previous work [35, 8], an optical tracking system was employed for image registration using a manual marker-based registration method. The involvement of markers hampers the standard surgical workflow, and the attachment of markers is either invasive or infeasible in many cases. Automatic markerless image registration is therefore preferable in surgical navigation. In our previous work [10, 13], a low-priced stereo camera replaced the Polaris tracking system for the image registration task in oral and maxillofacial surgery. A teeth contour tracking method was proposed to calculate the pose of the patient's teeth with respect to the stereo camera without manual intervention. However, this method is a 3D-3D matching method requiring that the organ to be registered have sharp 3D contour features, and that these features be easily reconstructed three-dimensionally by the stereo camera. Such conditions are quite strict and hence limit the applicable scope of the 3D contour tracking method. Furthermore, in the previous method, only silhouette features in the captured image pair are used, while other useful visual clues such as image gradients on non-silhouette edges are ignored. Incorporating this information could improve the accuracy and robustness of image registration.

In this study, we further simplify the hardware for image registration by using a single camera. We present a 3D surgical overlay system with automatic markerless image registration. The image registration is achieved by matching the 2D shape of the teeth mesh model with the intraoperative 2D image captured by the camera. The display-camera calibration is performed by solving a perspective-n-point (PnP) problem (i.e., estimating the camera's extrinsic parameters given its intrinsic parameters and n 3D-2D point correspondences). The proposed system enables real-time, correctly aligned 3D image overlay on the patient's head and neck area for surgical visualization in oral and maxillofacial surgery.

2 System Overview

The proposed system consists of a 3D display, a translucent mirror (AR window), a monochrome camera and a workstation for information processing, as shown in Fig. 1(a). The 3D display is composed of a high pixel per inch (ppi) liquid crystal display (LCD) and a hexagonal lens array which is placed in front of the LCD. The 3D image of a mesh model (e.g., in the form of an STL file) is created using computer generated integral imaging (CGII) techniques and can be projected to a specified location and orientation with respect to the 3D display [12]. The 3D image is further overlaid on the patient by the translucent mirror. Surgeons will see a superimposed 3D image through the AR window to acquire visualized anatomical information. In this study, the goal of the system is to realize intraoperative augmented reality visualization in oral and maxillofacial surgery.

Figure 1(b) shows the coordinate systems involved in the overlay system. We denote by \(T_C\), \(T_D\), and \(T_M\) the camera, 3D display, and model coordinate systems, respectively. \(T_D\) is the world coordinate system in which the 3D image is rendered, as described in our CGII rendering algorithm [12]. \(T_M\) represents the image space in which the patient is digitalized (e.g., by CT scanning). To overlay the 3D image correctly on the patient, the transformation from \(T_D\) to \(T_M\), denoted by \(\varvec{T}_D^M\), should be determined. Because \(\varvec{T}_D^M = \varvec{T}_D^C\varvec{T}_C^M\), this raises two problems: display-camera calibration and image registration.
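For concreteness, the composition \(\varvec{T}_D^M = \varvec{T}_D^C\varvec{T}_C^M\) can be carried out with \(4\times 4\) homogeneous matrices; the short Python sketch below uses made-up placeholder poses purely for illustration.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative placeholder values, not measured data:
# T_D_C: display -> camera transformation (from display-camera calibration, Sect. 3)
# T_C_M: camera -> model transformation   (from image registration, Sect. 4)
T_D_C = make_transform(np.eye(3), [0.0, 50.0, 300.0])
T_C_M = make_transform(np.eye(3), [10.0, 0.0, 700.0])

# Composition used to place the rendered 3D image on the patient.
T_D_M = T_D_C @ T_C_M
```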

Fig. 1. (a) System overview. (b) Coordinate systems.

3 Display-Camera Calibration

Display-camera calibration determines the transformation \(\varvec{T}_D^C\) and can be formulated as a camera extrinsic calibration problem. The 3D image of a known geometry in \(T_D\) is projected by the 3D display, and a 2D image of this 3D image is captured by the camera through the AR window. 3D-2D correspondences are established in the captured image. The pose \((\varvec{R}_C^D, \varvec{t}_C^D)\) of the 3D display with respect to the camera can be estimated by solving

$$\begin{aligned} \min _{\langle \varvec{R}_C^D,\varvec{t}_C^D\rangle }\sum _{i=1}^N||\varvec{K}\left( \varvec{R}_C^D,\varvec{t}_C^D\right) \varvec{X}_i-\varvec{x}_i||^2 \end{aligned}$$
(1)

where \(\varvec{X}_i\leftrightarrow \varvec{x}_i, i = 1\cdots N\) are the 3D-2D correspondences in homogeneous coordinates; \(\varvec{K}\) is the intrinsic matrix of the camera; the norm \(||\cdot||\) denotes the underlying 2D distance in the image plane between the projection of \(\varvec{X}_i\) and \(\varvec{x}_i\).

Figure 2(a) shows the calibration model, which is a \(5\times 5\) planar ball array. The captured 2D image through the AR window is shown in Fig. 2(b). The centers of the projected balls are automatically detected using a simple threshold followed by ellipse fitting. Figure 2(c) shows an example of ball center detection. Equation (1) is the well-known PnP problem, which can be solved using a nonlinear least-squares technique [1]. \(\varvec{T}_D^C\) is the inverse of the pose \((\varvec{R}_C^D, \varvec{t}_C^D)\).
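A minimal Python sketch of this calibration step, using OpenCV's PnP solver, is given below; the 10 mm ball spacing, the intrinsic values, and the synthesized detections are illustrative placeholders, not the actual system parameters.

```python
import numpy as np
import cv2

# Known 3D ball positions of the 5x5 planar calibration model in T_D (mm);
# the 10 mm spacing is an illustrative assumption, not the actual array pitch.
object_points = np.array(
    [[i * 10.0, j * 10.0, 0.0] for j in range(5) for i in range(5)], dtype=np.float64)

# Placeholder intrinsics (the real values come from the offline camera calibration).
K = np.array([[2000.0, 0.0, 1024.0],
              [0.0, 2000.0, 1024.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)  # lens distortion already removed during camera calibration

# In the real system these are the detected ball centers (Fig. 2(c)); here we
# synthesize them from a known pose so the sketch runs stand-alone.
rvec_true = np.array([0.1, -0.2, 0.05])
tvec_true = np.array([5.0, -10.0, 400.0])
image_points, _ = cv2.projectPoints(object_points, rvec_true, tvec_true, K, dist)

# Solve the PnP problem (1): pose (R_C^D, t_C^D) of the 3D display w.r.t. the camera.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)
pose = np.eye(4)                       # (R_C^D, t_C^D) as a 4x4 matrix
pose[:3, :3] = R
pose[:3, 3] = tvec.ravel()
T_D_C = np.linalg.inv(pose)            # display-camera transformation T_D^C
```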

Fig. 2. (a) Calibration model. (b) Captured image of (a). (c) Automatic ball center detection.

4 Image Registration

Image registration determines the transformation \(\varvec{T}_C^M\). Unlike the display-camera transformation, which remains fixed during surgery, \(\varvec{T}_C^M\) changes whenever the patient moves, so the image registration must be performed automatically in real time. We propose a 3D-2D registration method that matches the patient's 3D teeth model (created from preoperative CT data) with the intraoperative 2D camera image, based on Ulrich's method [9]. Because teeth are rigid and can be easily exposed to an optical camera, they serve as "natural markers" for registering anatomical information in the head and neck area.

4.1 Problem Formulation

Our problem is to find the best pose \((\varvec{R},\varvec{t})\) such that the 2D projection of the 3D model under the projection matrix \(\varvec{K}(\varvec{R},\varvec{t})\) is most consistent with the 2D image. 2D projection views of the 3D model can be rendered using a computer graphics API such as OpenGL. To measure the consistency between the projected model shape \(E_\mathrm {2D}\) and the image \(I(x,y)\), we use the following similarity measure proposed by Steger [7]

$$\begin{aligned} s(E_\mathrm {2D},I) = \frac{1}{N}\sum _{i = 1}^{N}\frac{|\langle \nabla I(x_i,y_i), \varvec{d}_i\rangle |}{\Vert \nabla I(x_i,y_i) \Vert \cdot \Vert \varvec{d}_i\Vert } \end{aligned}$$
(2)

where \(E_\mathrm {2D}\overset{\mathrm {def}}{=}\{x_i,y_i, \varvec{d}_i\}_{i = 1}^{N}\) is a set of edge points \((x_i,y_i)\) with associated direction vectors \(\varvec{d}_i\) representing the normal directions; \(\nabla I(x_i,y_i)\) is the image gradient at \((x_i,y_i)\); \(\langle \cdot ,\cdot \rangle \) denotes the dot product. The absolute value \(|\cdot |\) in the numerator ignores local changes of contrast polarity. \(s(E_\mathrm {2D},I)\) lies in the range [0, 1].
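As an illustration, the measure (2) can be evaluated with a few lines of numpy, assuming the image gradients have been precomputed (e.g., with a Sobel filter); the variable names are illustrative.

```python
import numpy as np

def shape_similarity(edge_points, edge_dirs, grad_x, grad_y, eps=1e-9):
    """Similarity (2) between a projected 2D shape E_2D and an image.

    edge_points : (N, 2) integer array of edge pixel coordinates (x_i, y_i)
    edge_dirs   : (N, 2) array of associated direction vectors d_i
    grad_x/y    : image gradient components, e.g. from a Sobel filter
    """
    xs, ys = edge_points[:, 0], edge_points[:, 1]
    g = np.stack([grad_x[ys, xs], grad_y[ys, xs]], axis=1)   # gradient at each edge point
    dots = np.abs(np.sum(g * edge_dirs, axis=1))             # |<grad I, d_i>|
    norms = np.linalg.norm(g, axis=1) * np.linalg.norm(edge_dirs, axis=1)
    return float(np.mean(dots / (norms + eps)))              # value in [0, 1]
```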

Given \((\varvec{R},\varvec{t})\), the projected 2D shape \(E_\mathrm {2D}\) of the 3D model is extracted as follows. First, the intrinsic matrix \(\varvec{K}\) is used to set the view frustum of the virtual camera so that the virtual camera has the same projection geometry as the real camera. Then, the 3D model is rendered into a 3-channel RGB image whose RGB values represent the normal vector of the corresponding surface of the 3D model. Next, the image tensor (\(2\times 2\) matrix) at each pixel of the RGB image is calculated, whose largest eigenvalue represents the edge strength of the pixel [6]. The edge strength corresponds to the face angle of the corresponding 3D edge of the model. Subsequently, a threshold is applied to the edge strength to suppress pixels whose corresponding face angle is below a certain value (e.g., 30\(^\circ \)). Finally, non-maximum suppression is performed for edge thinning, and the remaining edge pixels with their gradient vectors constitute the 2D shape \(E_\mathrm {2D}\). Figure 3 shows the extracted 2D shape of a left molar model with a suppression angle of 35\(^\circ \). A straightforward idea is to find the optimal \((\varvec{R},\varvec{t})\) so that (2) is maximized.
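The edge-strength computation on the rendered normal map can be sketched as follows (a simplified illustration; the mapping from the eigenvalue to the face angle used for the threshold, as well as the non-maximum suppression, follow [6] and are omitted here).

```python
import numpy as np
import cv2

def edge_strength_from_normal_map(normal_rgb):
    """Largest eigenvalue of the 2x2 image tensor of a 3-channel normal map.

    normal_rgb : (H, W, 3) float image whose channels encode surface normals.
    Returns an (H, W) edge-strength map; large values occur where the rendered
    normal changes abruptly, i.e. along 3D edges of the model.
    """
    gx = cv2.Sobel(normal_rgb, cv2.CV_64F, 1, 0, ksize=3)   # per-channel d/dx
    gy = cv2.Sobel(normal_rgb, cv2.CV_64F, 0, 1, ksize=3)   # per-channel d/dy
    # Accumulate the 2x2 tensor [[a, b], [b, c]] over the three channels.
    a = np.sum(gx * gx, axis=2)
    b = np.sum(gx * gy, axis=2)
    c = np.sum(gy * gy, axis=2)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    return 0.5 * (a + c) + np.sqrt(0.25 * (a - c) ** 2 + b ** 2)
```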

Fig. 3. (a) Rendered RGB image of a molar model. (b) Projected 2D shape of (a). (c) Associated direction vectors.

4.2 Aspect Graph Based Matching

It is impossible to optimize (2) directly unless the starting point is quite close to the true pose, and we have no prior knowledge of the model's pose. We therefore adopt a view-based approach: the aspect graph based matching method proposed by Ulrich [9] is used for fast 3D-2D pose estimation.

Offline Aspect Graph Building. Views are generated by specifying viewpoints (virtual camera positions) in a spherical coordinate system (SCS) whose origin is set to the center of the 3D model's bounding box. The viewpoint range is specified by \([r_{\min }, r_{\max }]\), \([\varphi _{\min }, \varphi _{\max }]\), and \([\theta _{\min }, \theta _{\max }]\), which define a spherical quadrilateral; r, \(\varphi \), and \(\theta \) represent the radius (viewing distance), longitude, and latitude, respectively. The generated views are clustered into aspects according to their mutual similarities, calculated using (2) on the overlapping pixels. An aspect in this context is a cluster of views (possibly a single view) whose mutual similarities are higher than a threshold \(t_\mathrm {c}\); a complete-linkage clustering method is used for this purpose (see the sketch below). After the clustering is finished, the aspects are downsampled to the next higher image pyramid level and the clustering process is repeated. As a result, we obtain a hierarchical aspect graph spanning several image pyramid levels.
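A minimal sketch of the complete-linkage view clustering, assuming the pairwise view similarities have already been computed with (2), could look as follows; scipy's hierarchical clustering is used here purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_views(similarity, t_c):
    """Complete-linkage clustering of views into aspects.

    similarity : (V, V) symmetric matrix of pairwise view similarities from (2)
    t_c        : similarity threshold; views within one aspect are mutually >= t_c
    Returns an array of aspect labels, one per view.
    """
    # Convert similarities to distances and to condensed form for scipy.
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    # Complete linkage bounds the maximum pairwise distance inside a cluster,
    # so every pair of views within an aspect has similarity >= t_c.
    Z = linkage(condensed, method='complete')
    return fcluster(Z, t=1.0 - t_c, criterion='distance')
```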

Online 2D Matching. After the hierarchical aspect graph has been built, it can be used in the online search phase for pose estimation. We assume the proposed method is used for real-time image registration, in which case the input is a video stream. A tracking-matching-optimization strategy is proposed for robust and fast pose estimation. The output of the tracking confines the search space at the top image level to a tight bounding box encompassing the object. The tracking-learning-detection (TLD) framework [2] is incorporated to track the 2D appearance of the 3D model over the video stream at a higher image level whose resolution is close to \(512\times 512\). Let \(B_\mathrm {target}^n\) be the tracked bounding box at the top image level n, whose image is denoted by \(I^n(x,y)\). All aspects at the top level of the hierarchical aspect graph are examined within \(B_\mathrm {target}^n\) in \(I^n(x,y)\). An aspect is represented by its shape features \(E_\mathrm {2D} = \{x_i,y_i, \varvec{d}_i\}_{i=1}^N\). To search for a match of an aspect, \(E_\mathrm {2D}\) is scaled, rotated, and translated by discrete steps as follows

$$\begin{aligned} x_i' = x_i\sigma \cos \gamma - y_i\sigma \sin \gamma + t_x \end{aligned}$$
(3)
$$\begin{aligned} y_i' = x_i\sigma \sin \gamma + y_i\sigma \cos \gamma + t_y \end{aligned}$$
(4)

where \(\sigma \) is the scaling factor, \(\gamma \) is the rotation angle, and \((t_x, t_y)\) is the translation. The similarity between the transformed \(E_\mathrm {2D}\) and the image \(I^n(x,y)\) is calculated using (2). The 2D poses \((\sigma ,\gamma ,t_x,t_y)\) whose similarity exceeds a threshold \(t_s\) are stored in a candidate list. These candidates are either refined in the child aspects by searching close neighboring poses, or discarded if their similarity falls below \(t_s\). All candidates are tracked down the hierarchy until the bottom image level is reached, and the candidate with the highest similarity score is taken as the best match. The 3D pose is then recovered from the 2D pose of the best match at the bottom level and the pose of its associated aspect.
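The transformation (3)-(4) of an aspect's shape features can be sketched as below; in the actual search, the transformed shape is evaluated with (2) over discrete steps of \((\sigma ,\gamma ,t_x,t_y)\) inside \(B_\mathrm {target}^n\). The numeric values in the example are illustrative only.

```python
import numpy as np

def transform_shape(points, dirs, sigma, gamma, tx, ty):
    """Apply Eqs. (3)-(4) to an aspect's shape features E_2D.

    points : (N, 2) edge point coordinates (x_i, y_i)
    dirs   : (N, 2) associated direction vectors d_i
    """
    c, s = np.cos(gamma), np.sin(gamma)
    R = np.array([[c, -s], [s, c]])
    pts = sigma * points @ R.T + np.array([tx, ty])   # scaled, rotated, translated points
    drs = dirs @ R.T                                   # direction vectors are rotated only
    return pts, drs

# Example: rotate a unit square of edge points by 30 degrees, scale by 1.2,
# and shift it by (100, 50) pixels.
pts = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
drs = np.array([[0., -1.], [1., 0.], [0., 1.], [-1., 0.]])
new_pts, new_drs = transform_shape(pts, drs, 1.2, np.deg2rad(30.0), 100.0, 50.0)
```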

4.3 3D Pose Refinement

The 3D pose obtained from matching is usually not accurate enough because of the discrete step widths used in the search. Iterative optimization in the manner of an iterative closest point (ICP) algorithm is therefore performed to refine the 3D pose by alternately establishing feature correspondences and estimating the pose. The 3D point \((X_i,Y_i,Z_i)\) corresponding to the shape feature \((x_i,y_i)\) is recovered using the OpenGL z-buffer value. The sub-pixel edge point \((x'_i,y'_i)\) near \((x_i,y_i)\) is localized along the direction of \(\varvec{d}_i\) using the method proposed in [11]. The refined pose is calculated by solving the PnP problem for the correspondences \((X_i,Y_i,Z_i)\leftrightarrow (x'_i,y'_i)\). The above procedure is repeated until convergence; in practice, a few iterations are sufficient.
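A sketch of the refinement loop is given below; project_model_edges and find_subpixel_edge are hypothetical placeholders for the OpenGL z-buffer back-projection and the sub-pixel edge search of [11], which are not reproduced here.

```python
import numpy as np
import cv2

def refine_pose(rvec, tvec, K, image, n_iter=5):
    """ICP-style refinement of the matched 3D pose (Sect. 4.3).

    project_model_edges() and find_subpixel_edge() are placeholders for the
    OpenGL z-buffer back-projection and the 1D sub-pixel edge search of [11].
    """
    dist = np.zeros(5)  # lens distortion already removed
    for _ in range(n_iter):
        # 1. Project the model at the current pose; recover, for every projected
        #    shape feature (x_i, y_i, d_i), its 3D point (X_i, Y_i, Z_i) from the z buffer.
        pts3d, pts2d, dirs = project_model_edges(rvec, tvec, K)   # hypothetical helper
        # 2. Search along d_i for the nearest sub-pixel image edge point (x'_i, y'_i).
        pts2d_refined = find_subpixel_edge(image, pts2d, dirs)    # hypothetical helper
        # 3. Re-estimate the pose from the (X_i,Y_i,Z_i) <-> (x'_i,y'_i) correspondences,
        #    starting from the current estimate.
        _, rvec, tvec = cv2.solvePnP(pts3d, pts2d_refined, K, dist,
                                     rvec=rvec, tvec=tvec, useExtrinsicGuess=True,
                                     flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec
```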

5 Experiments and Results

5.1 Experimental Setting

Figure 4(a) shows the experimental scene. The LCD (6.4 inch) of the 3D display has a resolution of \(1024\times 768\) pixels with a pixel pitch of 0.13 mm (200 ppi). The micro lens array has lens pitches of 0.89 mm and 1.02 mm in the vertical and horizontal directions, respectively.

A mandibular phantom was created with a 3D printer from a real patient's CT data, as shown in Fig. 4(b). Fiducial points with known positions in the image (model) space were made in the front teeth area and the molar area of the phantom for accuracy evaluation. The front teeth model and the molar model shown in Fig. 4(c) are used for image registration; which model is used depends on the exposed area (front teeth area or molar area).

Fig. 4. (a) Experimental scene. (b) Mandibular phantom with fiducial points. (c) Teeth models for image registration.

The camera (UI-3370CP-M-GL, IDS Imaging Development Systems GmbH, Germany) employed in the experiments has a resolution of \(2048\times 2048\) pixels with a maximum frame rate of 80 frames per second (fps). Camera calibration was performed in advance to obtain the intrinsic matrix and remove lens distortion. The computer used in the experiments has an Intel\(^\circledR \) Core™ i7-4820K CPU (3.7 GHz) and an NVIDIA\(^\circledR \) GeForce GTX TITAN Black GPU. The GPU is used to accelerate the aspect graph building process and the online matching process by parallel computing.

5.2 Display-Camera Calibration

Figure 5 shows the automatic processing flow of the display-camera calibration. The original image was smoothed with a Gaussian filter. White spots were segmented using a threshold, followed by classification according to their roundness and area. The contours of the extracted spots were approximated by ellipses, whose centers were used for the PnP estimation. The PnP estimation yielded a geometric estimation error of 1.9 pixels.
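A possible implementation sketch of this spot detection pipeline with OpenCV is shown below; the threshold and roundness values are illustrative, not the values used in the experiments.

```python
import cv2
import numpy as np

def detect_ball_centers(image, min_area=50, min_roundness=0.8):
    """Detect projected ball centers for display-camera calibration (Fig. 5).

    image : 8-bit single-channel image from the monochrome camera.
    The min_area and min_roundness values are illustrative defaults.
    """
    blurred = cv2.GaussianBlur(image, (5, 5), 0)
    _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    centers = []
    for c in contours:
        area = cv2.contourArea(c)
        if area < min_area or len(c) < 5:               # fitEllipse needs >= 5 points
            continue
        perimeter = cv2.arcLength(c, True)
        roundness = 4.0 * np.pi * area / (perimeter ** 2)  # 1.0 for a perfect circle
        if roundness < min_roundness:
            continue
        (cx, cy), _, _ = cv2.fitEllipse(c)               # ellipse center approximates ball center
        centers.append((cx, cy))
    return np.array(centers, dtype=np.float64)
```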

Fig. 5. Processing flow of display-camera calibration.

Fig. 6. Image registration results using (a) front teeth model and (b) left molar model. (c) Target registration evaluation.

5.3 Image Registration Evaluation

Image registration was performed by matching the model with the video stream captured through the AR window. The distance between the camera and the phantom was approximately 710 mm. The registration process took approximately 0.2 s, yielding an update rate of 5 fps. Figure 6(a) and (b) show the image registration results using the front teeth model and the left molar model (see Fig. 4(c)), respectively, with critical structures (tooth roots and nerve channels) and fiducial points (in red) overlaid on the camera's view.

Target registration errors (TREs) at the fiducial points (see Fig. 4(b)) were calculated to evaluate the registration accuracy as follows. A surgical tool (dental drill) was used to approach each fiducial point on the phantom under the guidance of the virtually overlaid fiducial points on the camera's view. The physical distance between the indicated position and the real position of a fiducial point on the phantom was measured as the error at that point, and the average error distance in an evaluation area was taken as the TRE of that area. For the TRE calculation in the front teeth (left molar) area, the front teeth (left molar) model was used for image registration. Figure 6(c) shows the accuracy evaluation process. The evaluation yielded TREs of 0.8 mm in the front teeth area (15 points) and 1.1 mm in the molar area (18 points).

5.4 3D Surgical Overlay

After the display-camera calibration and the image registration, the spatial information required for the 3D display is available. Figure 7 shows the 3D overlay of the tooth roots and nerve channels observed through the AR window. The visualized information could be used to guide surgical operations.

Fig. 7. 3D surgical overlay by (a) molar matching and (b) front teeth matching.

6 Conclusion

This paper presents a 3D surgical overlay system with automatic markerless image registration using a single camera. Teeth are rigid and easily exposed to a camera, which makes it possible to match a teeth model with an intraoperative camera image for the image registration task. The registered 3D images representing anatomical structures are superimposed on the patient via a translucent mirror for augmented reality surgical visualization. In this study, the application to dental surgery was demonstrated using the proposed system. Given that the maxillary teeth are fixed to the skull, the proposed method may also be used for surgical navigation in the craniofacial region; in that case, compensating for errors in areas far from the registration features remains a challenging task.