Introduction

Minimally invasive surgeries (MIS) are appreciated for their many advantages over traditional open surgery, including a decreased risk of infection due to minimal incisions and faster recovery times for patients [15]. The shortcomings of MIS are mainly associated with the loss of a direct 3D view of the surgical scene as well as cumbersome and non-intuitive tool manipulation. With the advent of robotic MIS systems, many of these shortcomings have been alleviated. Greater precision, improved dexterity, and enhanced 3D immersive visualization for surgeons are three important advantages of robotic over traditional MIS.

One of the application areas in which robotic MIS is expanding is partial nephrectomy. Approximately 208,000 new cases of kidney cancer are diagnosed in the world each year [10]. The goal of partial nephrectomy, an effective treatment for localized renal cancers, is to excise only the cancerous regions and spare as much healthy tissue as possible to preserve kidney function. According to Gill et al. [14], a partial nephrectomy procedure is organized into five main stages: (1) bowel mobilization; (2) hilar dissection and control; (3) identification and demarcation of tumour margins; (4) resection of tumour; and (5) reconstruction of the kidney (renorrhaphy). Among these, hilar dissection and tumour demarcation stand out as two critical and daunting stages requiring significant expertise.

Hilar dissection allows the surgeon access to the renal hilum, where the flow of blood into and out of the kidney is controlled by clamping the vessels. Improper vessel control due to overlooked accessory renal vessels can cause significant bleeding during resection [28]. The identification and demarcation stage is also critical as, during this stage, the surgeon localizes the tumour mass and identifies the resection margins. Proper demarcation speeds up the succeeding stage of tumour resection, improves the preservation of kidney function by sparing as much healthy tissue as possible, and avoids tumour recurrence by correctly delineating all of the cancerous tissue from the healthy tissue.

Both stages are complex tasks. In the hilar dissection and vessel clamping stage, some vessels might be accidentally missed due to substantial variability in patients' vasculature and the amount of fat that surrounds the kidney and hides the vessels [31]. Tumour identification and demarcation is difficult due to noise associated with clutter such as bleeding or smoke from cutting, poor endoscopic image colour/texture contrast between different structures, occluding surgical tools, and limited 3D visibility of the structures of interest, where only surfaces are observable in the camera feed. If the surgeon's level of experience is limited, the incidence of missed vessels or incorrect demarcation may significantly jeopardize the surgical outcome.

To minimize mistakes during the operation, surgeons typically rely on previously viewed preoperative 3D scans, e.g. computed tomography (CT), and mentally reconstruct the locations of various structures during surgery. Transferring this mental abstraction from the 3D data to the 2D view is an error-prone procedure, especially if the surgeon's level of experience is limited. To facilitate this task, many efforts have been made towards augmenting the endoscopic view. These methods vary from directly segmenting the endoscopic scene (e.g. using level sets) to registering the preoperative data onto the intraoperative endoscopic scene.

Table 1 Categorization and comparison between state-of-the-art methods for intraoperative guidance and our proposed method

Related works

Various approaches have attempted to reduce the difficulty of intraoperative navigation by relying on multi-modal registration to align the preoperative data with the intraoperative view, for example, by fusing a 3D preoperative image volume with 2D intraoperative ultrasound [11], intraoperative MR [16], or 2D X-ray [25, 33]. In this work, we focus on augmenting the endoscopic video as it remains the staple modality in MIS. Some recent works ignored preoperative data and segmented the endoscopic view using active contour-based methods [12, 13] or parameter-sensitive morphological operations and thresholding techniques [21]. Since these approaches rely only on colour/intensity information, they often fail due to noise and clutter from bleeding and smoke. In addition, these methods focused on segmenting only one object in an endoscopic scene.

Other techniques register 3D preoperative data onto 2D intraoperative view(s) manually, e.g. by using invasive fiducials and performing tracking [17, 18, 26, 27, 29]. In contrast to these manual registrations, the methods proposed by Yim et al. [32] and Merritt et al. [20] are able to find the 3D pose automatically. Yet, none of the aforementioned methods can handle the free-form deformation of tissues that usually occurs due to respiratory motion and/or surgical intervention. Furthermore, these techniques are unable to segment critical occluded structures, e.g. vessels covered by fat.

To account for non-rigid tissue deformation, Agudo et al. [1] exploited well-known finite element method (FEM) analysis and proposed an online framework to model the non-rigid deformations of shapes in a video sequence. However, their method was designed for a single-camera setting. In our previous work [23], we incorporated generative statistical shape models of tissues into a 3D pose tracking and endoscopic video segmentation framework. Although the statistical model used is capable of encoding non-rigid deformations, these deformations are based on a heterogeneous training population and are not patient-specific. In addition, our previous method required manual 3D–2D registration for the first frame of the video and was incapable of localizing hidden vessels.

To localize vessels in the endoscopic view, different methods have been proposed. These include both hardware solutions, such as near-infrared fluorescence imaging [30], and algorithmic methods based on colour/intensity information from the endoscope that highlight vasculature based on perfusion models [8]. However, due to sensitivity to the dose administered and the cost involved, hardware solutions are not widely used. Moreover, none of these methods [8, 30] are able to identify occluded vessels that are hidden under thick layers of fat. A detailed survey of such vessel localization techniques was presented in our recent paper [2]. In our recent works [2, 3], we proposed a purely image-based method for labelling occluded vasculature by detecting minute pulsatile vessel motion. Feedback from our clinical collaborators revealed that the raw visualization of our fuzzy labels alone is not ideal for guidance as (1) the labels are susceptible to noise, (2) they cannot identify parts of the vasculature that are deeply hidden under other tissue, and (3) the visualizations are not easy to interpret compared to methods based on pre- to intra-operative alignment. This motivated us to integrate the complementary advantages of our previously proposed techniques in [2] and [23] into a single framework.

Fig. 1 Our surgical image guidance system leverages vessel pulsation and visual appearance cues to guide the 3D preoperative to endoscopic intraoperative alignment

Two important aspects have to be considered when segmenting an endoscopic video: noise/clutter due to light reflection from uneven tissue surfaces and smoke from cutting tissue, and structure occlusion either by tools or by other tissues (e.g. vessels occluded by fat, or endophytic tumours occluded by healthy kidney tissue). In this paper, we address these two problems by proposing a variational technique to augment the endoscopic surgical view. Our framework estimates the 3D pose and deformations of multiple structures in the preoperative data and uses them to aid the segmentation of multiple visible and occluded structures in an endoscopic scene. Our key contribution in this work is how we leverage vessel pulsation and endoscopic visual cues to guide the 3D preoperative to endoscopic intraoperative alignment. Our surgical image guidance system combines signal processing and machine learning cues: the first performs spatio-temporal colour waveform analysis to detect vessel pulsation, and the second uses random decision forests (RF) to learn colour and textural visual patterns of tissue types in endoscopy. These cues are used to guide a multi-structure, preoperative to intraoperative alignment process that allows for both rigid transformations and heterogeneous, physically based, patient-specific non-rigid deformations. Table 1 presents a detailed summary of the capabilities of state-of-the-art methods compared to our proposed technique. To demonstrate the potential utility of our method to surgeons, we evaluated it in a retrospective study of fifteen challenging robot-assisted partial nephrectomy (RAPN) clinical cases.

Method

Problem statement and notation

Given 3D preoperative data, let \(\mathcal {S}_\mathrm{pre}=\{S_1,\ldots , S_N\}\) be the set of N segmented structures of interest in the preoperative spatial domain \({\varOmega }_\mathrm{pre} \subset \mathbb {R}^3\), where \(S_i\) represents the surface of the ith structure. Having M camera views of the surgical scene, let \(I_m:{\varOmega }_\mathrm{2D}^m\subset \mathbb {R}^2\rightarrow \mathbb {R}^3\) represent the 3-channel RGB image of the mth camera view. We adopt the 3D geometry defined by the first camera (\(m = 1\)) as the reference domain of the surgical scene, \({\varOmega }_\mathrm{srg}\subset \mathbb {R}^3\). Our objective is to augment the intraoperative surgical view by localizing and visualizing the N structures of interest in the 2D endoscopic images (\(I_1,\ldots ,I_M\)). To do so, we transform and deform the objects of interest in 3D such that their projections onto the 2D camera views (e.g. their silhouettes) align with the corresponding structures in \(I_1,\ldots ,I_M\). Figure 1 illustrates an overview of our pipeline.

We implicitly represent the boundary of each structure in \(I_m\) by a level set function \(\phi ^i_m:{\varOmega }_\mathrm{2D}^m\rightarrow \mathbb {R}\) such that

$$\begin{aligned} {\left\{ \begin{array}{ll} \phi ^i_m(\varvec{x})>0, \quad &{}\varvec{x} \text { is inside the } i\text {th structure in } I_m, \\ \phi ^i_m(\varvec{x})=0, \quad &{}\varvec{x} \text { is on the boundary of the } i\text {th structure in } I_m, \\ \phi ^i_m(\varvec{x})<0, \quad &{}\varvec{x} \text { is outside the } i\text {th structure in } I_m. \end{array}\right. } \end{aligned}$$
(1)

Let \(\varvec{P}^{\mathrm{pre},i}_{\ell }=(X^{\mathrm{pre},i}_{\ell },Y^{\mathrm{pre},i}_{\ell },Z^{\mathrm{pre},i}_{\ell })\in S_i\) be the coordinates of the \(\ell \)th point on \(S_i\). The level set \(\phi ^i_m\) is calculated as:

$$\begin{aligned} \phi _m^i(\varvec{x})={\text {SDM}}\bigg (\partial \Big (\mathcal {P}_m\big (\varvec{T}(\varvec{P}^{\mathrm{pre},i});\pi _m\big )\Big )\bigg ), \end{aligned}$$
(2)

where \(\varvec{T}\) is a spatial transformation consisting of a non-rigid component and a rigid component, which together map the 3D models (\(\varvec{P}^\mathrm{pre}\)) from \({\varOmega }_\mathrm{pre}\) to the reference surgical domain \({\varOmega }_\mathrm{srg}\). \(\mathcal {P}_m:{\varOmega }_\mathrm{srg}\rightarrow {\varOmega }_\mathrm{2D}^m\) is the projection from the surgical scene frame of reference to \({\varOmega }_\mathrm{2D}^m\) given the corresponding camera parameters \(\varvec{\pi }=\{\pi _1,\ldots ,\pi _M\}\), \(\partial \) is the spatial 2D derivative of the projected model (i.e. resulting in the boundary of the structures), and SDM(.) is the signed distance map. In less technical terms, we position and deform the segmented 3D model (using \(\varvec{T}\)), virtually image (or project) the 3D model using the camera parameters (\(\varvec{\pi }\)), detect the boundary of the projected image (using \(\partial \)), and finally represent the boundary using level sets; an illustrative sketch of this pipeline is given at the end of this subsection. We define \(\varvec{T}\) as:

$$\begin{aligned} \varvec{T} (\varvec{P}^{\mathrm{pre},i})=\varvec{R}\big (\varvec{P}^{\mathrm{pre},i}+\varvec{{\varPsi }}^i \varvec{u}^i\big )_{[3\times n_i]} +\varvec{t}, \end{aligned}$$
(3)

where the columns of \(\varvec{{\varPsi }}^i\) are the tissue-specific modes of vibration of the ith structure and \(\varvec{u}^i\) are the corresponding weights that are used to non-rigidly deform the segmented preoperative models \(\varvec{P}^{\mathrm{pre},i}\) (see “Patient-specific heterogeneous deformation model” section for more details). \(\varvec{R}\) and \(\varvec{t}\) are the rotation matrix and translation vector in 3D space, respectively, giving the rigid component of transformation \(\varvec{T}\). Note that the number of elements in \(\varvec{P}^{\mathrm{pre},i}\) is \(3\times n_i\) (\(n_i\) is the number of points in \(\mathcal {S}_i\)); however, they are stacked as a long \(3n_i\times 1\) column in (3). The subscript \((\ldots )_{[3\times n_i]}\) denotes the reshaping of \((\varvec{P}^{\mathrm{pre},i}+\varvec{{\varPsi }}^i \varvec{u}^i)\) from \(3n_i\times 1\) to \(3\times n_i\). Also, note that \(\varvec{{\varPsi }}^i \varvec{u}^i\) deforms the whole ith tissue (or object \(\mathcal {S}_i\)) and not just a single point on \(\mathcal {S}_i\). We emphasize that \(\varvec{R}\), \(\varvec{t}\), and \(\varvec{u}^i\) in (3) are the same for the 3D models regardless of which of the M camera viewpoints is considered. Having N structures appearing in M camera images, we define and minimize the following energy functional E to simultaneously align the preoperative models with the intraoperative images and segment them:

$$\begin{aligned}&E(\varvec{T},\varvec{U},\varvec{{\varPhi }};\varvec{P}^\mathrm{pre},I_1,\ldots ,I_M,\varvec{\pi })\nonumber \\&\quad =\sum _{n=1}^N\sum _{m=1}^M\int _{{\varOmega }_\mathrm{2D}^{m}} (2\rho ^n_m-1)(\varvec{x})\;H(\phi _m^n(\varvec{x}))\,{\text {d}}\varvec{x}, \end{aligned}$$
(4)

where H(.) is the Heaviside function, \(\rho ^n_m(\varvec{x})\) are the regional terms that provide the cues towards which the level set is moved during the optimization and measure the agreement of the image pixels \(\varvec{x}\) with the learnt statistical models (see “Data terms: appearance of structures in endoscopic images” section) of the nth structure in \(I_m\), \(\varvec{{\varPhi }}=\{\phi _1^1,\ldots ,\phi _1^N,\ldots ,\phi _M^1,\ldots ,\phi _M^N\}\), and \(\varvec{U}=\{\varvec{u}^1,\ldots ,\varvec{u}^N\}\). The above energy functional is similar to the Chan–Vese model [7], with the main difference lying in the optimization: in the Chan–Vese method, the energy is minimized with respect to the level set function \(\phi \), whereas in our case we optimize E with respect to the pose and shape parameters. To find the optimum pose and deformations of the structures, we optimize E with respect to \(\varvec{T}\) and \(\varvec{U}\) (with \(\varvec{{\varPhi }}\) updated accordingly).
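As a concrete illustration of Eq. (4), the following short sketch (an assumption for illustration, not the authors' code) evaluates the energy from precomputed regional terms and level set maps, using a smooth Heaviside approximation that is common in level-set methods.

```python
# Minimal sketch of Eq. (4); `rho[n][m]` and `phi[n][m]` are H x W arrays for the
# nth structure seen from the mth camera. Names and the smoothing width are assumed.
import numpy as np

def heaviside(phi, eps=1.0):
    """Smooth Heaviside approximation commonly used in level-set methods."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def energy(rho, phi):
    """E = sum_n sum_m of the integral of (2*rho^n_m - 1) * H(phi^n_m) over pixels."""
    E = 0.0
    for rho_n, phi_n in zip(rho, phi):               # loop over structures
        for rho_nm, phi_nm in zip(rho_n, phi_n):     # loop over cameras
            E += np.sum((2.0 * rho_nm - 1.0) * heaviside(phi_nm))
    return E
```

Pixels whose regional term indicates a good match (small \(\rho\)) lower the energy when they fall inside the projected model, which is what drives the pose and shape updates described below.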

The success of the proposed optimization depends on how accurately we model the 3D non-rigid deformation and how well we identify candidate pixels belonging to different structures in the endoscopic scenes. Some structures are hidden (e.g. occluded vessels), while others have a complicated visual appearance. In the following sections, we explain how we choose our data terms [\(\rho ^n_m\) in (4)] and how we generate the structures' deformations and the final segmentation.
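Before moving to the data terms, the following minimal sketch (in Python, not the authors' implementation) makes the geometric pipeline of Eqs. (2), (3), and (11) concrete: it deforms and poses a surface model, projects it with a pinhole camera, and converts a rendered silhouette into a signed distance map. The function names and the silhouette rendering step are our own assumptions.

```python
# Sketch of the deform -> pose -> project -> level-set pipeline (Eqs. 2, 3, 11).
import numpy as np
from scipy.ndimage import distance_transform_edt

def transform_points(P_pre, Psi, u, R, t):
    """Eq. (3): P_pre is (n_i, 3) vertices; Psi is (3*n_i, q) modes; u is (q,) weights."""
    deformed = P_pre.reshape(-1) + Psi @ u            # stacked 3*n_i vector
    return (R @ deformed.reshape(-1, 3).T).T + t      # posed (n_i, 3) points

def project_pinhole(P_srg, fx, fy, cx, cy):
    """Eq. (11): perspective projection of posed 3D points into pixel coordinates."""
    X, Y, Z = P_srg.T
    return np.stack([fx * X / Z + cx, fy * Y / Z + cy], axis=1)

def silhouette_to_level_set(mask):
    """Eq. (2): signed distance map of a filled binary silhouette
    (positive inside the structure, negative outside, near zero on the boundary)."""
    inside = distance_transform_edt(mask)             # distance to background
    outside = distance_transform_edt(~mask)           # distance to the silhouette
    return inside - outside

# Example usage (hypothetical inputs): rasterize the projected model into a binary
# silhouette `rendered_mask`, then phi = silhouette_to_level_set(rendered_mask).
```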

Data terms: appearance of structures in endoscopic images

In RAPN, the focus is on three structures: the vasculature (v), kidney (k), and tumour (t). In this context, we set the number of objects to \(N=3\) in (4). The regional terms of the vessels, kidney, and tumour are denoted by \(\rho ^1=\rho ^v\), \(\rho ^2=\rho ^k\), and \(\rho ^3=\rho ^t\), respectively.

Vascular pulsatile motion

Computing the regional term \(\rho ^v\) from appearance alone is difficult as blood vessels are typically hidden under a layer of fat. These regions may, however, be identified by their characteristic pulsatile motion, which is nearly invisible to the naked eye but still detectable. This regional term is computed by first extracting pulsatile motion features from local phase information using our method proposed in [3]. In this section, we briefly describe the framework for extracting these features and the corresponding data term used to drive our segmentation.

A video frame captured from camera m, denoted \(I_m(\varvec{x},t)\), maps a given pixel \(\varvec{x}\in {\varOmega }_\mathrm{2D}^m\) at time t to an intensity value and can be represented as a function of local displacements \(d(\varvec{x},t)\) with respect to the first frame of the video, \(I_m(\varvec{x},0) = f(\varvec{x})\), such that \(I_m(\varvec{x},t) = f(\varvec{x} + d(\varvec{x},t))\). Our regional term \(\rho ^v\) is extracted from the local motions specified by \(d(\varvec{x},t)\). To approximate these motions from the video, we perform a wavelet decomposition of the video into sub-bands such that

$$\begin{aligned} I_m(\varvec{x},t) = f(\varvec{x} + d(\varvec{x},t)) \approx \sum ^{\infty }_{\omega = -\infty } A_{\omega }e^{i\omega (\varvec{x} + \hat{d}(\varvec{x},t))}, \end{aligned}$$
(5)

with each sub-band representing a complex sinusoid \( S_{\omega }(\varvec{x},t) = A_{\omega }e^{i\omega (\varvec{x} + \hat{d}(\varvec{x},t))}\) at spatial frequency \(\omega \), where \(\hat{d}(\varvec{x},t)\) is our approximation to the local motion. The local phase of each sub-band is defined as \({\varPhi }_\omega (\varvec{x},t) = \arg (S_\omega ) = \omega (\varvec{x} + \hat{d}(\varvec{x},t))\). Since \(\omega \hat{d}(\varvec{x},t)\) is the only motion-related component of the phase that varies with time, we can isolate it from the zero-frequency component \(\omega \varvec{x}\) by applying a mean-free temporal bandpass filter

$$\begin{aligned} \mathcal {H}(\varvec{x},t) = 2 B_H \mathrm {sinc}(2 B_H t) - 2 B_L \mathrm {sinc}(2 B_L t), \end{aligned}$$
(6)

where \(B_L\) and \(B_H\) are the low and high frequency cut-offs, respectively. The response of the temporal bandpass filter is denoted by \(B_{\omega }(\varvec{x},t) = {\varPhi }_{\omega } *\mathcal {H} = \omega \hat{d}_{\mathcal {H}}(\varvec{x},t)\), where \(\hat{d}_{\mathcal {H}}\) are the components of the motion that lie in the passband of the filter. The passband is tuned to the typical heart rate of a patient so that we isolate the components of the local motion that are synchronous with the heart rate and hence with vascular pulsation.
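For illustration, a minimal sketch of this temporal filtering step is given below; the kernel length, frame-rate handling, and cut-off values (e.g. 0.8–2.0 Hz, roughly covering typical heart rates) are assumptions, and the input is the local phase of a single sub-band.

```python
# Sketch of the sinc-difference bandpass kernel of Eq. (6) applied along time.
import numpy as np

def bandpass_kernel(b_low, b_high, fps, length=121):
    """h(t) = 2*B_H*sinc(2*B_H*t) - 2*B_L*sinc(2*B_L*t), sampled at the video fps.
    b_low / b_high are in Hz; np.sinc is the normalized sinc, matching Eq. (6)."""
    t = (np.arange(length) - length // 2) / fps
    h = 2 * b_high * np.sinc(2 * b_high * t) - 2 * b_low * np.sinc(2 * b_low * t)
    return h - h.mean()                      # enforce a mean-free (zero-DC) kernel

def bandpass_phase(phase, fps, b_low=0.8, b_high=2.0):
    """phase: (T, H, W) local phase of one sub-band; returns B_omega(x, t)."""
    h = bandpass_kernel(b_low, b_high, fps)
    return np.apply_along_axis(lambda s: np.convolve(s, h, mode='same'), 0, phase)
```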

The bandpassed phases \(B_{\omega }\) are then denoised and combined across scales and orientations using the methods outlined in [2] to obtain fuzzy labels

$$\begin{aligned} L=\frac{1}{C}\sum _{\forall \omega } \frac{|\tilde{Q}_{\omega }|}{2\pi \omega }, \end{aligned}$$
(7)

where C is a normalizing factor and \(\tilde{Q}_{\omega }\) is the denoised \(B_{\omega }\). These labels may still contain false positives; to attenuate them and obtain better visual cues of the hidden vessels, we process L in a temporal window (i.e. a 2 s video clip) by summing L over time and denoising the result with a spatial median filter. We denote the resulting denoised fuzzy labels by \(L^\mathrm{avg}\). Then, we calculate the probability of a pixel belonging to a pulsating vessel (artery or vein) as:

$$\begin{aligned} P^v(\varvec{x}|L^\mathrm{avg})=\frac{1}{2}+ \frac{1}{\pi } \arctan \left( \frac{L^\mathrm{avg}(\varvec{x})-0.3}{\epsilon } \right) , \end{aligned}$$
(8)

where we chose \(\epsilon \) to be 0.1. The above equation maps \(L^\mathrm{avg}\) to the probabilistic range [0, 1]. The value 0.3 in (8), chosen empirically, amplifies values greater than 0.3 and suppresses smaller values in \(L^\mathrm{avg}\). The data term corresponding to vessels is defined as \(\rho ^v=-\log (P^v(\varvec{x}|L^\mathrm{avg}))\).
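The mapping of Eq. (8) and the resulting data term can be summarized in a few lines; the sketch below uses the constants reported above (offset 0.3, \(\epsilon = 0.1\)), while the array names and the small clipping constant are assumptions.

```python
# Sketch of Eq. (8) and the vessel data term rho^v = -log(P^v).
import numpy as np

def vessel_regional_term(L_avg, offset=0.3, eps=0.1):
    """L_avg: temporally averaged, median-filtered fuzzy labels (H x W)."""
    p_v = 0.5 + np.arctan((L_avg - offset) / eps) / np.pi   # soft threshold into [0, 1]
    p_v = np.clip(p_v, 1e-6, 1.0)                           # guard the logarithm
    rho_v = -np.log(p_v)                                     # regional term for vessels
    return p_v, rho_v
```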

Fig. 2 Patch selection and initialization. a Patch selection (blue background/tools, yellow kidney, green tumour). b Probability map of background, kidney and tumour. c Initial pose (before convergence). d Recovered pose and shape (after convergence)

Learning of tissue appearance

To calculate the data term of a visible structure, we extract a variety of image features (\(\mathcal {A}_m\)) from each image \(I_m\) and train an RF classifier to distinguish between the different structures. In particular, we capture the textural patterns and colour variations via local colour histograms of the normalized RGB and YCbCr channels. The probabilities of pixel \(\varvec{x}\) belonging to the kidney (\(P^k(\varvec{x})\)) and tumour (\(P^t(\varvec{x})\)) are estimated by training an RF consisting of \(N_t\) binary decision trees. To train the RF, we select a few \(20\times 20\) patches in \(I_m,~m=1,\ldots ,M\) from the different structures. In practice, surgeons may (virtually) select these patches with the help of surgical tools. We emphasize that, unlike feature-based methods, e.g. [27], our method does not require any correspondence between the 3D CT and the 2D intraoperative data. Figure 2a shows a sample seeding on an example endoscopic scene of real clinical data. After training, for each pixel \(\varvec{x}\), the feature channels \(\mathcal {A}_m(\varvec{x})\) are propagated through each RF tree, resulting in the probability \(P^{n,j}(\varvec{x}|\mathcal {A}_m(\varvec{x}))\) for the jth tree and nth structure. These probabilities are combined into the forest's joint probability \(P^n(\varvec{x}|I_m(\varvec{x}))=\frac{1}{N_t}\sum _{j=1}^{N_t}P^{n,j}(\varvec{x}|\mathcal {A}_m(\varvec{x}))\) to determine the probability of \(\varvec{x}\) belonging to the nth structure. Figure 2b illustrates examples of the region probability maps for the frame shown in Fig. 2a. The regional terms of the kidney and tumour are then calculated as \(\rho ^k_m=- \log P^k(\varvec{x}|I_m(\varvec{x}))\) and \(\rho ^t_m=- \log P^t(\varvec{x}|I_m(\varvec{x}))\), respectively.
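As an illustrative sketch of this appearance model (not the authors' implementation), the snippet below builds local colour-histogram features for labelled patches and trains a random forest; the histogram bin count, feature layout, assumed [0, 1] channel scaling, and use of scikit-learn are assumptions consistent with the description above.

```python
# Sketch: colour-histogram features from 20x20 patches + random forest training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def patch_histogram_features(patch_rgb, patch_ycbcr, bins=8):
    """Concatenate histograms of the normalized-RGB and YCbCr channels of one patch.
    Both inputs are assumed to be (20, 20, 3) float arrays scaled to [0, 1]."""
    rgb = patch_rgb / (patch_rgb.sum(axis=-1, keepdims=True) + 1e-6)  # normalized RGB
    feats = []
    for img in (rgb, patch_ycbcr):
        for c in range(3):
            h, _ = np.histogram(img[..., c], bins=bins, range=(0.0, 1.0), density=True)
            feats.append(h)
    return np.concatenate(feats)

def train_forest(X, y, n_trees=70):
    """Train the RF on patch features X (n_samples, n_features) and labels y
    (e.g. 0 = background/tools, 1 = kidney, 2 = tumour)."""
    return RandomForestClassifier(n_estimators=n_trees).fit(X, y)

# At test time (assumed usage): P = forest.predict_proba(X_pixels) gives per-pixel
# class probabilities, and rho_k = -log(P[:, kidney]), rho_t = -log(P[:, tumour]).
```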

Patient-specific heterogeneous deformation model

Due to the noisy endoscopic images, the regional terms alone are not able to provide robust cues for guiding the 3D–2D alignment. To obtain a reliable result, we constrain the space of possible transformations to patient-specific shape models obtained from the preoperative data (\(\varvec{P}^\mathrm{pre}\)). To account for non-rigid deformation of structures, we include the tissue-specific modes of vibration (\(\varvec{{\varPsi }}^i\)) in our framework as described in “Problem statement and notation” section.

The modes of vibration \(\varvec{{\varPsi }}^i\) are obtained by solving the generalized eigendecomposition problem \(\varvec{K}^i\varvec{{\varPsi }}^i=\varvec{M}^i\varvec{{\varPsi }}^i\varvec{{\varLambda }}^i\), where \(\varvec{K}^i\) and \(\varvec{M}^i\) are the stiffness and mass matrices, respectively, and \(\varvec{{\varLambda }}^i\) is a diagonal matrix of eigenvalues associated with the eigenvectors (\(\varvec{{\varPsi }}^i\)), where higher eigenvalues correspond to higher frequencies. In contrast to traditional variational methods that update the level set function representing the segmentation to adapt it to the image data, the use of \(\varvec{{\varPsi }}\) favours shape deformations that are biomechanically plausible (as dictated by \(\varvec{K}\) and \(\varvec{M}\)).

As we increase the number of modes, more deformations are allowed, which enables the deformations to be modelled more accurately. Given all possible modes of shape variability, we are able to produce/recover all possible deformations of that shape. However, as we include more deformation modes, the computational complexity increases and recovering the proper deformation parameters becomes more difficult. In addition, too many modes of variation may result in extreme and unrealistic deformations. We empirically found that choosing more than 6 modes results in an unjustified increase in complexity, as deformations may become unrealistic and no noticeable improvement in accuracy is observed. Since, according to Pentland et al. [24], the lowest frequencies correspond to rigid-body motions and global deformations, and we already encode the global transformation (translation and rotation) through \(\varvec{R}\) and \(\varvec{t}\) in Eq. (3), we selected eigenvectors corresponding to larger eigenvalues to focus on finer deformation details.
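A minimal sketch of this modal analysis, assuming precomputed symmetric stiffness and mass matrices, is given below; the exact rule for discarding the lowest-frequency (near rigid-body) modes and keeping six of the remaining ones is our interpretation of the text.

```python
# Sketch: generalized eigendecomposition K Psi = M Psi Lambda for modes of vibration.
import numpy as np
from scipy.linalg import eigh

def modes_of_vibration(K, M=None, n_modes=6, skip=6):
    """Return selected eigenvectors (columns) and eigenvalues of the (3n x 3n)
    stiffness matrix K. With unit masses, M is the identity (as used in the paper).
    `skip` discards the lowest-frequency modes, which are close to rigid-body motion."""
    if M is None:
        M = np.eye(K.shape[0])
    eigvals, eigvecs = eigh(K, M)            # eigenvalues returned in ascending order
    sel = slice(skip, skip + n_modes)
    return eigvecs[:, sel], eigvals[sel]
```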

We use the average Hounsfield unit (HU) associated with each structure in the preoperative CT to approximate the stiffness of each tissue. This way, the stiffness of each structure is proportional to its tissue density. We used a simple linear elastic model to deform the objects of interest in order to lower the computational complexity; exploring more advanced elastic models might improve the results, but at the cost of additional complexity. In the next section, we show how we optimize E in (4).

Optimization of the multi-structure pose and deformation

We optimize (4) with respect to \(\varvec{T}\), i.e. with respect to \(\varvec{R}=\{\alpha ,\beta ,\gamma \}\), \(\varvec{t}=\{t_x,t_y,t_z\}\), and \(\varvec{u}=\{u_1,\ldots ,u_q \}\), where \(\alpha \), \(\beta \), and \(\gamma \) are the rotation angles around the x, y, and z axes, respectively, and \(t_x\), \(t_y\), and \(t_z\) are the translations in the x, y, and z directions. The derivative of E with respect to each of the unknown parameters is:

$$\begin{aligned} \frac{\partial E}{\partial \xi _\ell }=\sum _{n\in \{v,k,t\}}\sum _{m=1}^M\int _{{\varOmega }_\mathrm{2D}^m}\rho ^n_m(\varvec{x})\frac{\partial H(\phi _m^n(\varvec{x}))}{\partial \xi _\ell }\,{\text {d}}\varvec{x}, \end{aligned}$$
(9)

where \(\xi _{\ell }=\{\alpha ,\beta ,\gamma ,t_x,t_y,t_z,u_1,\ldots ,u_q\}\) and

$$\begin{aligned} \frac{\partial H(\phi _m^n(\varvec{x}))}{\partial \xi _\ell }&=\frac{\partial H(\phi _m^n(\varvec{x}))}{\partial \phi _m^n}\bigg (\frac{\partial \phi _m^n}{\partial x}\frac{\partial x}{\partial \xi _\ell }+\frac{\partial \phi _m^n}{\partial y}\frac{\partial y}{\partial \xi _\ell }\bigg )\nonumber \\&=\delta (\phi _m^n) \begin{bmatrix} \frac{\partial \phi _m^n}{\partial x}\; \frac{\partial \phi _m^n}{\partial y} \end{bmatrix} \begin{bmatrix} \frac{\partial x}{\partial \xi _\ell }\\ \frac{\partial y}{\partial \xi _\ell } \end{bmatrix}. \end{aligned}$$
(10)

In (10), \(\delta (.)\) is the Dirac delta function. We use the centred finite difference to calculate \(\frac{\partial \phi }{\partial x}\) and \(\frac{\partial \phi }{\partial y}\). Every 2D point \(\varvec{x}=(x,y)\) in \(I_m\) has at least one corresponding 3D point \(\varvec{P}^\mathrm{srg}=(X^\mathrm{srg},Y^\mathrm{srg},Z^\mathrm{srg})\) in the surgical domain \({\varOmega }_\mathrm{srg}\). We calibrated the stereo camera and obtained the intrinsic camera parameters. For \(m=1\), \(\varvec{x}\) and \(\varvec{P}^\mathrm{srg}\) are related by:

$$\begin{aligned} \begin{bmatrix} x\\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_{1x}&\quad 0&\quad c_{1x} \\ 0&\quad f_{1y}&\quad c_{1y} \\ 0&\quad 0&\quad 1 \end{bmatrix} \begin{bmatrix} X^\mathrm{srg} \\ Y^\mathrm{srg}\\ Z^\mathrm{srg} \end{bmatrix}, \end{aligned}$$
(11)

where \(\pi _1=(f_{1x},f_{1y},c_{1x},c_{1y})\) are the first camera parameters. Hence, we have

$$\begin{aligned} \frac{\partial x}{\partial \xi _\ell }=\frac{f_{1x}}{{Z^\mathrm{srg}}^2}\left( Z^\mathrm{srg}\frac{\partial X^\mathrm{srg}}{\partial \xi _\ell }-X^\mathrm{srg}\frac{\partial Z^\mathrm{srg}}{\partial \xi _\ell }\right) . \end{aligned}$$
(12)

We similarly calculate \(\frac{\partial y}{\partial \xi _\ell }\). Also, each 3D point \(\varvec{P}^\mathrm{srg}\) is related to \(\varvec{P}^\mathrm{pre}\) by \(\varvec{P}^\mathrm{srg}=\varvec{R}\varvec{P}^\mathrm{pre}+\varvec{t}\). Therefore, for \(\xi _\ell \in \{\alpha ,\beta ,\gamma ,t_x,t_y,t_z\}\), the derivatives \(\frac{\partial X^\mathrm{srg}}{\partial \xi _\ell }\), \(\frac{\partial Y^\mathrm{srg}}{\partial \xi _\ell }\), and \(\frac{\partial Z^\mathrm{srg}}{\partial \xi _\ell }\) in (12) are easily calculated given the choice of transformation function. To update the non-rigid deformation of the structures, we need to calculate the derivatives of E with respect to the shape parameters \(\varvec{u}^i\). These shape parameters and the energy functional are related through the 2D coordinate \(\varvec{x}\) in E (Eq. (4)), which is derived from the corresponding 3D point in the surgical domain \(\varvec{P}^\mathrm{srg}=(X^\mathrm{srg},Y^\mathrm{srg},Z^\mathrm{srg})\) and the preoperative 3D point \(\varvec{P}^\mathrm{pre}\). According to (3), the transformed point depends on the weights of the modes of vibration, \(\varvec{u}^i\). Therefore, for the shape parameters \(\xi _\ell \in \{u_1,\ldots ,u_q\}\), the derivative of a 3D point \(\varvec{P}^\mathrm{srg}\) in the surgical domain with respect to \(\xi _\ell \) is \(\frac{\partial \varvec{P}^\mathrm{srg}}{\partial \xi _\ell }=\varvec{R}\cdot \psi _\ell \), where \(\psi _\ell \) is the \(\ell \)th mode of vibration in \(\varvec{{\varPsi }}\). To avoid any unrealistic shape deformation, we limit the shape parameters to vary by no more than three times the standard deviation (\(\varvec{{\varLambda }}^i\)). For multiple camera views (\(m>1\)), the extrinsic parameters (\(\varvec{R}^\mathrm{ext}_m\), \(\varvec{t}^\mathrm{ext}_m\)) have to be considered in calculating the derivatives, i.e. \(\varvec{R}\) in the above equations is replaced by \(\varvec{R}^\mathrm{ext}_m \varvec{R}\). The boundaries of the segmented structures in \(I_m\) are the zero level sets of their corresponding level set functions \(\phi ^n_m\), which are obtained by (2) after finding the optimal \(\varvec{T}\).
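To illustrate how these updates fit together, the following simplified sketch performs gradient descent over the pose and shape parameters; unlike the paper, which uses the analytic derivatives of Eqs. (9)–(12), this stand-in uses finite differences of the same energy purely for clarity, and the step size and iteration count are assumed.

```python
# Simplified optimization loop over xi = (alpha, beta, gamma, t_x, t_y, t_z, u_1..u_q).
import numpy as np

def optimize_pose(xi0, energy_of_params, step=1e-2, iters=200, h=1e-4):
    """`energy_of_params(xi)` evaluates Eq. (4) after rebuilding T and the level
    sets phi from the parameter vector xi (see the earlier sketches)."""
    xi = np.asarray(xi0, dtype=float).copy()
    for _ in range(iters):
        grad = np.zeros_like(xi)
        for k in range(xi.size):                  # central finite differences
            e = np.zeros_like(xi); e[k] = h
            grad[k] = (energy_of_params(xi + e) - energy_of_params(xi - e)) / (2 * h)
        xi -= step * grad                         # descend the energy
    return xi
```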

Fig. 3 First eight cases of the qualitative comparison of our proposed method with the state-of-the-art method [3]. a Original endoscopic image. b Ground truth of venous (cyan), arterial (red), kidney (brown) and tumour (green) structures provided in [3]. c Segmentation results of vessels using [3]. d Our results; kidney and tumour are shown in yellow and green, respectively

Fig. 4 Final seven cases of the qualitative comparison of our proposed method with the state-of-the-art method [3]. a Original endoscopic image. b Ground truth of venous (cyan), arterial (red), kidney (brown) and tumour (green) structures provided in [3]. c Segmentation results of vessels using [3]. d Our results; kidney and tumour are shown in yellow and green, respectively

Materials and experiments

For validation, we applied our framework to fifteen different clinical cases of robot-assisted partial nephrectomy. All endoscopic videos were acquired by a da Vinci\(^\circledR \) Si surgical system (Intuitive Surgical, California, USA), and each frame was resized to \(480\times 270\) pixels for efficiency. The default parameters suggested in our previous works [2, 3] were used to detect the vascular motion cues. We used \(N_t=70\) trees to train the RF for learning the appearance of kidney and tumour. Higher values of \(N_t\) did not improve accuracy but increased complexity.

Table 2 Quantitative comparison for kidney (K) and vessel (V) segmentation: our method versus [2]

We used the patient-specific 3D segmented kidney, tumour, and vasculature models. To set the tissue-specific modes of vibration, for simplicity, we assumed that each structure can be modelled as a set of mutually interconnected unit masses, i.e. \(\varvec{M}\) is the identity matrix and can be dropped from the eigendecomposition equation in our case. We set the stiffness of each tissue to be proportional to its corresponding HU (higher HU means higher density and hence higher stiffness). The HU is calculated from the preoperative DICOM meta-data as \({\text {HU}}={\text {CT}}~{\text {pixel}}~{\text {value}}\times \mu _s+\mu _i\), where \(\mu _s\) and \(\mu _i\) are the rescale slope and rescale intercept values stored in the CT meta-data. We manually initialized \(\varvec{T}\) such that the projections of the 3D models intersect the corresponding organs. This initialization does not need to be close to the solution. Figure 2c shows the initial pose which, despite being poorly placed, converges to a reasonable pose as shown in Fig. 2d. However, we emphasize that an unreasonable initialization will result in a wrong pose estimate because of our local optimization framework. We should also mention that if most of the surface of the objects is occluded in the 2D scene, our method cannot find the correct pose. For our experiments, we asked surgeons to stop moving the tools for \(\sim \)10 s so our method could compute the regional terms.
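For reference, a short sketch of this HU rescaling (assuming pydicom and the standard RescaleSlope/RescaleIntercept tags in the CT meta-data) is shown below; the proportionality constant linking mean HU to stiffness is an assumed illustration, not a value from the paper.

```python
# Sketch: compute the mean HU of a segmented structure from a CT slice.
import numpy as np
import pydicom

def mean_hu(dicom_path, structure_mask):
    """HU = CT pixel value * rescale slope + rescale intercept, averaged over the
    voxels of one segmented structure (structure_mask is a boolean array)."""
    ds = pydicom.dcmread(dicom_path)
    hu = (ds.pixel_array.astype(np.float32) * float(ds.RescaleSlope)
          + float(ds.RescaleIntercept))
    return float(hu[structure_mask].mean())

# The per-tissue stiffness can then be set proportional to the mean HU,
# e.g. k_i = c * mean_hu_i for some assumed scaling constant c.
```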

We compared the segmentations obtained through our guidance system with the same ground truth presented in [2, 3] (Figs. 3, 4). Note how the noisy segmentations in (c) are improved in (d) by incorporating the preoperative prior information. We also quantitatively compared our proposed method with [2] in Table 2.

The average runtime of our unoptimized MATLAB code to process the vessel pulsation in a 4-s clip (120 frames) was 65 s. The runtime for pose estimation and segmenting the structures depends on the initial pose of the organs. The average runtime to find the pose and segment the structures for an initialization similar to Fig. 2c is \(\sim \)16 s on a standard 3.40 GHz CPU.

Discussion and conclusions

We proposed a new technique for localizing both visible and occluded structures in an endoscopic view by estimating the 3D pose and deformations of the structures of interest in the 3D surgical space. Our framework leverages preoperative data, as a source of patient-specific prior knowledge, as well as vasculature pulsation (by analysing local phase information) and endoscopic visual cues (by training a random decision forest) in order to accurately segment the highly noisy and cluttered environment of an endoscopic video. To handle the non-rigid deformation of different structures, we incorporated a tissue-specific, physically based deformation model. To make the non-rigid deformation of each structure closer to reality, we used the HU value of each structure in the preoperative CT and assigned a specific stiffness to each deformable model. Our results on in vivo clinical cases of partial nephrectomy illustrate the potential of the proposed framework for augmented reality applications in MIS.

There are several directions to extend this work. Our variational framework is highly parallelizable, and we do not foresee any obstacles towards a GPU implementation for real-time pose estimation and endoscopic video segmentation. In addition, we believe that leveraging stereo views as well as encoding depth information into the proposed energy functional can improve the performance.

We attribute some of the observable differences between the ground truth and our results both to the local optimization framework we used and to the error in the alignment of the ground truth. As mentioned in Amir-Khalili et al. [2], because the preoperative model was rigidly aligned to the endoscopic video, an alignment error of 4–7 mm exists in cases where the organs have been significantly retracted by the surgical instruments or by mobilization of other organs. We believe that, despite the visible differences between the two, our current solution is one step closer to an ideal solution than the ground truth, as our method allows for non-rigid modes of vibration. Generating a ground truth that accounts for the non-rigid deformations due to mobilization and retraction would require volumetric intraoperative imaging, such as cone beam CT, or possibly implanting fiducials. The use of such imaging techniques is not feasible as it exposes the patient and clinicians to ionizing radiation, and implanting fiducials is intrusive and invasive and hence not recommended.

Also, in our future work, we will explore the use of an additional shape variation component that is orthogonal to the restricted shape model, as described by Andrews and Hamarneh [4], since this allows for exploring larger shape variability without a noticeable increase in complexity. Although we limited the modes of vibration to vary by no more than three times the corresponding eigenvalue (to avoid any unrealistic shape deformation), we might still obtain similar projections from two different 3D deformations. This is due to the fact that we lose information during the 3D to 2D transformation. We believe that this is another interesting future direction that is worth investigating.

Given that the proposed method uses a local optimization technique, leveraging the work of our own group and others on convexification techniques [4–6, 9, 19, 22] could make the method less sensitive (or insensitive) to initialization.

Finally, improved estimates of elasticity parameters (e.g. using elastography imaging) will likely more accurately constrain the space of non-rigid deformations.