Keywords

1 Introduction

Since the step-by-step procedure for endoscopic sinus surgery (ESS) was first developed [14] in the early 1980s, interventions through the nasal cavities have become predominantly minimally invasive due to faster recovery times and reduced facial scarring. However, ESS comes with its own challenges. For instance, the 3D operating field is transformed into a 2D video display and the field of view (FOV) of the surgeon is limited to that of the endoscope. These can cause difficulty in estimating nearby anatomy that is not in the FOV of the endoscope. Knowledge of nearby anatomy is especially critical during surgeries through the nasal cavity since nasal cavities are small and complex with thin boundaries separating them from critical structures like the brain, eyes, carotid arteries, optic nerves, etc. Therefore, minimally invasive ESS through the nasal cavity requires a preoperative patient computed tomography (CT) scan, which is used by surgical navigation systems to orient surgeons with respect to critical anatomy.

Several navigation systems have been introduced that register endoscopic views to preoperative patient CT image. Systems that use electromagnetic or optical trackers require markers to be attached to the endoscope, which can interfere with surgical workflow. Vision-based navigation systems, however, do not add any additional hardware to the surgical space. Many vision-based navigation systems compute rigid registrations between features from endoscopic video and CT image. The iterative closest point (ICP) algorithm [2] is a standard two-step registration algorithm that iterates between finding correspondences between feature sets and finding the transformation that best aligns the matched points until convergence. Several ICP variants have also been explored [21]. In addition to position, orientation [6, 20], contours [4], and noise models [11, 22] have been used to improve matches. However, patient anatomy undergoes change between CT image acquisition and surgery [12]. Patients are also administered decongestants before surgery which further modifies anatomy. Rigid registration methods have shown deterioration in performance in the presence of tissues that undergo change due to decongestants [15, 16]. However, prior work has shown that principal component analysis (PCA) modes can capture the physiological changes that occur in the nasal cavity, i.e., the expanding and contracting of erectile tissues on the nasal turbinates [26]. Therefore, deformable variants of ICP that use PCA-based statistical shape models (SSMs) [9] to additionally solve for shape parameters have also been explored [13, 27]. However, these methods compute registrations to the mean shape, deforming the mean shape to estimate patient, and do not take prior patient information into consideration. Further, they do not provide confidence measures on the estimated shape.

Our registration simultaneously computes video-CT registration and deforms patient shape using SSMs. We also estimate how errors in deformed patient shape estimation increase as distance from video features increases and provide confidence measures. We evaluate our method on simulated and in vivo data.

2 Method

The method in [27] is formulated according to the following likelihood function:

$$\begin{aligned} f_{\mathrm {match\_def}}(\mathbf {x},\mathbf {y};\theta , \mathbf {s}, \bar{\mathbf {V}}, \mathbf {W}) = f_{\mathrm {match}}(\mathbf {x};\mathrm {T_{ssm}}(\mathbf {y}, \mathbf {s}), \theta ) \cdot f_{\mathrm {shape}}(\mathrm {T_{ssm}}(\mathbf {y}, \mathbf {s});\mathbf {s}), \end{aligned}$$

where \(f_{\mathrm {match}}\) finds the oriented point \(\mathbf {y}= (\mathbf {y}_\mathbf {p}, \hat{\mathbf {y}}_\mathbf {n})\) on the current shape, \(\psi \), that maximizes the likelihood of a match with oriented sample point \(\mathbf {x}= (\mathbf {x}_\mathbf {p}, \hat{\mathbf {x}}_\mathbf {n})\) from video, and \(f_{\mathrm {shape}}\) is the likelihood of shape deformation. \(\theta \) represents parameters of the non-deformable registration (i.e., rotation, R, translation, t, and scale, a), \(\mathbf {s}= \{s\}\) represents the shape parameters, \(\bar{\mathbf {V}}\) is the mean shape, \(\mathbf {W}\) the weighted modes of variation, and \(\mathrm {T_{ssm}}\) is the deformable transformation applied to \(\bar{\mathbf {V}}\). \(f_{\mathrm {match}}\) can be any likelihood-based registration objective, such as those presented in [3, 5, 6]. In this paper, we use the \(f_{\mathrm {match}}\) defined in [6, 27], which incorporates an anisotropic Gaussian noise model and an anisotropic Kent noise model to account for errors in position and orientation, respectively [6]. Assuming both position and orientation errors are zero-mean, i.i.d, \(f_{\mathrm {match}}\) for each \(\mathbf {x}\) transformed by a current similarity transform, \([a,\mathbf {R},\mathbf {t}]\), is defined as [27]:

$$\begin{aligned} {\begin{matrix} &{}f_{\mathrm {match}}(\mathbf {x}; \mathbf {y},\mathbf {\Sigma _{\mathrm {x}}}, \mathbf {\Sigma _{\mathrm {y}}}, \kappa , \beta , \hat{\mathbf {\gamma }}_1, \hat{\mathbf {\gamma }}_2, a, \mathbf {R}, \mathbf {t}) = \frac{1}{\sqrt{(2\pi )^3 |\mathbf {\Sigma }|}\cdot c(\kappa ,\beta )}\\ &{}\cdot e^{-\frac{1}{2}(\mathbf {y}_\mathbf {p}-a\mathbf {R}\mathbf {x}_\mathbf {p}-\mathbf {t})\mathbf {^T}\mathbf {\Sigma }^{-1}(\mathbf {y}_\mathbf {p}-a\mathbf {R}\mathbf {x}_\mathbf {p}-\mathbf {t})-\kappa \hat{\mathbf {y}}_\mathbf {n}\mathbf {^T}\mathbf {R}\hat{\mathbf {x}}_\mathbf {n}+ \beta \left( \left( \hat{\mathbf {\gamma }}_1\mathbf {^T}\mathbf {R}\hat{\mathbf {x}}_\mathbf {n}\right) ^2-\left( \hat{\mathbf {\gamma }}_2\mathbf {^T}\mathbf {R}\hat{\mathbf {x}}_\mathbf {n}\right) ^2 \right) }, \end{matrix}} \end{aligned}$$

where \(\mathbf {\Sigma }= \mathbf {R}\mathbf {\Sigma _{\mathrm {x}}}\mathbf {R}\mathbf {^T}+ \mathbf {\Sigma _{\mathrm {y}}}\), \(\mathbf {\Sigma _{\mathrm {x}}}\) and \(\mathbf {\Sigma _{\mathrm {y}}}\) are the covariance matrices representing noise in \(\mathbf {x}\) and \(\mathbf {y}\), \(\kappa = \frac{1}{\sigma ^2}\) is the concentration parameter of orientation noise, \(\sigma \) is the standard deviation, and \(\beta = e\frac{\kappa }{2}\) (\(e \in [0, 1]\) is the eccentricity) controls the anisotropy of orientation noise along with \(\hat{\mathbf {\gamma }}_{1}\) and \(\hat{\mathbf {\gamma }}_{2}\), which are the major and minor axes of the elliptical level sets of the Kent distribution on the unit sphere [6, 18]. \(\hat{\mathbf {y}}_\mathbf {n}\), \(\hat{\mathbf {\gamma }}_{1}\), \(\hat{\mathbf {\gamma }}_{2}\) are orthogonal. Similarly, assuming each vertex, \(\mathbf {v}\in \mathbf {V}\), on a shape deforms independently and deformations are Gaussian distributed [23]:

$$\begin{aligned} f_{\mathrm {shape}}(\mathbf {V}; \mathbf {s}) = \prod _{i=1}^{n_\mathbf {v}}f_{\mathrm {vertex}}(\mathbf {v}_i; \mathbf {s}), \ \text {where}\ f_{\mathrm {vertex}}(\mathbf {v}; \mathbf {s}) = \prod _{i=1}^{n_\mathbf {m}}\frac{1}{(2\pi )^{3/2}} . e^{\frac{\Vert s_i\Vert _2^2}{2}}, \end{aligned}$$
(1)

where \(n_\mathbf {v}\) is the number of vertices in the shape and \(n_\mathbf {m}\) is the number of PCA modes used to estimate deformation.

This formulation forces the mean shape to be the most likely shape [27] and cannot accommodate prior patient information. That is, in a generalized formulation for the exponent in Eq. 1, \(e^{\frac{\Vert s_i- \mu _i\Vert _2^2}{2}}\), \(\mu _i\), which are the mode weights corresponding to the most likely shape, are simply set to \(0,\,\forall i\), since the mean shape produces \(\mathbf {0}\) mode weights. This is a good assumption when patient CT is unavailable [25, 27]. However, if patient CT is available, then patient shape should be assumed to be the most likely shape. If patient shape, \(\mathbf {V}^*\), is segmented such that it has correspondences with the mean shape [26, 28], \(\mu \) can be computed by projecting the mean subtracted patient shape onto the SSM modes,

$$\begin{aligned} \mu _i = \mathbf {m}_i^\mathrm {T}(\mathbf {V}^* - \bar{\mathbf {V}})/\sqrt{\lambda _i}, \end{aligned}$$

where \(\mathbf {m}\) and \(\lambda \) are the modes and mode weights of variation. \(\mathbf {m}\) and \(\lambda \) can be obtained by performing PCA on a set of shapes with corresponding vertices [9]:

$$\begin{aligned} \mathbf {\Sigma _{\mathrm {SSM}}}= \frac{1}{n_\mathbf {s}}\sum _{j=1}^{n_\mathbf {s}} (\mathbf {V}_j - \bar{\mathbf {V}})(\mathbf {V}_j-\bar{\mathbf {V}})\mathbf {^T}= [\mathbf {m}_1 \ldots \mathbf {m}_{n_\mathbf {s}}] \begin{bmatrix} \lambda _1 &{} &{} \\ &{} \ddots &{} \\ &{} &{} \lambda _{n_\mathbf {s}} \end{bmatrix} [\mathbf {m}_1 \ldots \mathbf {m}_{n_\mathbf {s}}]\mathbf {^T}. \end{aligned}$$

Finally, the deformation applied to \(\bar{\mathbf {V}}\) based on the current \(\mathbf {s}\) is defined as

$$\begin{aligned} \mathrm {T_{ssm}}(\mathbf {y}_{\mathbf {p}_i}) = \sum _{j=1}^3 \eta _i^{(j)}\mathrm {T_{ssm}}(\mathbf {v}_i^{(j)}), \ \ \text {where}\ \ \mathrm {T_{ssm}}(\mathbf {v}_i) = \bar{\mathbf {v}}_i + \sum _{j=1}^{n_\mathbf {m}} s_j\mathbf {w}_j^{(i)}, \end{aligned}$$

\(\eta _i^{(j)}\) are the 3 barycentric coordinates that define the position of \(\mathbf {y}_i\) on a triangle on \(\psi \), and \(\mathbf {w}_i = \sqrt{\lambda _i}\mathbf {m}_i\) are the weighted modes of variation [27].

Finally, we associate confidence with regions of the estimated shape. We expect errors to be lower where sampled points are matched to the shape and higher as distance from these points increases since these areas are unobserved. To verify this, we associate per vertex errors with distance from the centroid of inlying matched points,

$$\begin{aligned} \mathbf {d}_i = \Vert \mathbf {v}_i - \mathbf {c}\Vert _2, \ \ \text {where} \ \ \mathbf {c}= \frac{1}{n_{\mathrm {inliers}}} \sum _{i=1}^{n_{\mathrm {inliers}}} \mathbf {y}_i, \end{aligned}$$

and model our confidence based on observations in simulation. \(n_{\mathrm {inliers}}\) is the number of inlying matched points.

3 Experiments and Analysis

Since our method initializes the shape estimation problem closer to the optimal solution, we expect our registrations to converge faster and produce lower mean errors since it is less likely for our optimization to converge to a non-optimal solution. To verify these expectations, we evaluate our method on simulated and in vivo data. All experiments are run on a 3.4 GHz Intel Core i7 CPU, 16 GB RAM.

3.1 Simulated Data

We perform a leave-one-out experiment using right nasal cavity meshes from a 53 CT dataset [1, 7, 8, 10]. The left-out shape is perturbed to simulate a patient with modified anatomy. Points are sampled from regions of the perturbed left-out shape that would be visible to an endoscope (Fig. 3A). Position noise with \(\sigma = 1\times 1\) mm\(^2\) in plane and 2 mm out of plane (i.e., \(1\times 1\times 2\) mm\(^3\)) and orientation noise with \(\sigma =10^{\circ }\) and \(e=0.5\) are added to the sampled points. Offsets in intervals [0, 10] mm and \([0, 10]^{\circ }\) are applied to the sample positions and orientations, respectively. The perturbed left-out shape is estimated using our method, i.e., with \(\mu \) set to weights from the original left-out shape, and using prior work, i.e., with \(\mu =\mathbf {0}\). Estimation of \(\mathbf {s}\) is constrained to \([-3, 3]\) standard deviations. Both registrations are run with two noise assumptions: first, assuming the noise in the samples is known, and second, assuming the noise is unknown and initializing the noise estimates to \(2\times 2\times 4\) mm\(^3\) and \(20^{\circ }\ (e=0.5)\) for position and orientation, respectively. To evaluate our results, we first compute the total registration error (tRE) by computing the Hausdorff distance (HD) between the deformed left-out shape and the estimated shape transformed to the coordinate frame of the registered sample points. Next, we compute the total shape error (tSE) by computing the HD between the two shapes in the same coordinate frame.

Fig. 1.
figure 1

Simulation experiment with known noise shows statistically significant (indicated by *) decrease in (A) tSE, (B) tRE, (C) number of iterations, and (D) runtime when registration is initialized to patient shape (green) rather than mean shape (red). (Color figure online)

In both cases, we observed statistically significantFootnote 1 decrease in errors when registration is initialized to patient weights, with errors lower when noise is known (Fig. 1A and B) compared to when noise is unknown (Fig. 2A and B). We also observe that there is a statistically significant decrease in the number of iterations required for convergence, which also leads to decrease in runtime. Although there is a bigger decrease in number of iterations and runtime for convergence with known noise (Fig. 1C and D), results with unknown noise (Fig. 2C and D) show that for similar number of iterations and runtime, we achieve lower errors when registration is initialized to patient shape. Our current CPU implementation is embarrassingly parallelizable and can be further optimized to improve runtime.

Fig. 2.
figure 2

Simulation experiment with unknown noise shows statistically significant (indicated by *) decrease in (A) tSE, (B) tRE, (C) number of iterations, and (D) runtime when registration is initialized to patient shape (green) rather than mean shape (red). (Color figure online)

We also observe, as expected, that shape estimation errors are lower where correspondences to sample points are found (Fig. 3B). Per vertex tSE shows little change near the centroid of the matched points, but quickly increases away from it (Fig. 4A). Therefore, we model our confidence in shape estimation as an exponential decay as distance from matched points increases. Figure 4B shows an estimated left-out shape from the experiment with unknown noise with ground truth errors, while Fig. 4C shows our estimated confidence in shape estimation.

3.2 In Vivo Data

Our clinical evaluation was performed on anonymized endoscopic videos of the nasal cavity collected from 4 consenting patients under an IRB approved protocol. These videos were used to train a self-supervised depth-estimation network that leverages established multi-view stereo methods like structure from motion (SfM) for learning [17]. This method produces dense and accurate point clouds from single frames of monocular endoscopic videos. Reconstructions from nearby frames were aligned using relative camera motion to produce dense reconstructions covering large areas in the nasal cavities. 3000 randomly sampled points from 14 such reconstructions were deformably registered both to the mean shape and the patient shape assuming noise with \(\sigma = 1\times 1\times 2\) mm\(^3\) and \(30^\circ \ (e=0.5)\) in position and orientation, respectively. The SSM used is pre-built using our 53 CT dataset and does not include CTs from any of the patients scoped for clinical evaluation. Rigid registrations to the respective patient shapes with the same parameters were also computed for comparison [6]. All registrations were manually initialized.

Fig. 3.
figure 3

(A) Right nasal cavity mesh with an example of points (yellow) sampled from the inferior turbinate, lateral and septal walls, and cavity floor - regions visible to an endoscope when entering the cavity. (B) Mean tSE over registrations (from L-R) initialized at patient shape and mean shape, with known and unknown noise each, and run with 50 modes plotted on the mean shape. Shape estimation errors are higher away from matched points as well as when mean shape is used for initialization (arrow). (Color figure online)

Fig. 4.
figure 4

(A) Trend in shape estimation errors as distance from matched points increases. (B) Ground truth shape estimation errors on an estimated left-out shape and (C) our estimated confidence. Blue indicates high confidence while red indicates low confidence. (Color figure online)

Since in vivo data lacks ground truth, we evaluate our registration using residual errors between matches computed by the registration. However, residual error can be misleading since it does not take into consideration the orientations of matched points. Therefore, we also report registration confidence based on the agreement in the orientations of corresponding points. [27] showed that registration confidence decreased with increasing chi-square CDF values, p, computed using orientation agreement scores. For simplicity, we show increasing confidence with increasing \(q = 1-p\).

Fig. 5.
figure 5

(A) Confidence based on orientation agreement plotted for registrations computed with reconstructions from different video sequences. Initializations at patient shape (green) produce higher confidence in registration, only showing no confidence for sequences 01 and 04 from Patient 4. Several registrations initialized at mean shape (red) show no confidence. (B) Deformed patient 2 with confidence estimates when aligned with features extracted from (C) video sequence 02. (D) Points (red with blue normals) overlayed on the deformed shape (gray). (Color figure online)

Both deformable registration methods produced statistically significant improvements over rigid registration to patient shape, which produced a mean residual error of 0.7 (±0.26) mm. Residual errors between deformable registration initialized at mean shape and initialized at patient shape produced the same residual error of 0.44 (±0.14) mm. However, confidence in registration using orientation agreement was higher when initialized to patient shape (Fig. 5A), implying that our method produced better alignment. We are also able to visualize confidence in estimated shapes. Confidence in estimated Patient 2 is shown in Fig. 5B along with the alignment that produced the estimated shape (Fig. 5C and D).

4 Discussion and Future Work

In this work, we show that we can deformably register endoscopic videos to patient CT using PCA modes to deform patient shape. This method produces statistically significant improvements in registration errors as well as iterations and runtime to convergence compared to prior PCA-based deformable registration methods [24]. We believe that these results bring us a step closer to providing accurate patient-specific navigation during endoscopic sinus procedures without using any external tools like electromagnetic or optical trackers.

We did not compare our method to other prior registration methods like coherent point drift (CPD) due to memory limitations [19]. CPD computes a \(n_\mathbf {v}\times n_\mathbf {v}\) matrix which is stored in memory, resulting in large memory overhead even for medium sized meshes. Our method does not suffer from such limitations. Further, we expect our method to perform as well [24] or better since CPD only deforms the parts of meshes where sample points are matched which can result in unnatural deformations in the mesh. Since our method is driven by statistics learned from population, it is much less likely for our method to produce unnatural deformations. Our method also computes confidence falloff in estimated shapes as distance from registered feature points increases. Providing this information during surgical navigation is critical since it allows surgeons to modulate their trust in the navigation system.

In the future, we will analytically evaluate the uncertainty in different regions of the estimated shape and improve the runtime of our algorithm. We are also working on building SSMs from a larger population in order to better capture the extent of variations in the nasal cavity. Another goal we hope to work towards is automating registration initialization so our method can be seamlessly integrated into the surgical navigation workflow. Finally, we would like to emphasize that our method can enable highly accurate navigation and reduce risk of damage to critical structures towards the start of a procedure. We hope that future work that can account for non-physiological changes that occur during surgery can be integrated with our method to enable accurate navigation throughout endoscopic sinus procedures.