Transition State Clustering: Unsupervised Surgical Trajectory Segmentation for Robot Learning

Krishnan, Sanjay; Garg, Animesh; Patil, Sachin; Lea, Colin; Hager, Gregory; Abbeel, Pieter; Goldberg, Ken

doi:10.1007/978-3-319-60916-4_6

Sanjay Krishnan⁵,
Animesh Garg⁶,
Sachin Patil⁵,
Colin Lea⁷,
Gregory Hager⁷,
Pieter Abbeel⁵ &
…
Ken Goldberg^5,6

Part of the book series: Springer Proceedings in Advanced Robotics ((SPAR,volume 3))

Abstract

A large and growing corpus of synchronized kinematic and video recordings of robot-assisted surgery has the potential to facilitate training and subtask automation. One of the challenges in segmenting such multi-modal trajectories is that demonstrations vary spatially, temporally, and contain random noise and loops (repetition until achieving the desired result). Segments of task trajectories are often less complex, less variable, and allow for easier detection of outliers. As manual segmentation can be tedious and error-prone, we propose a new segmentation method that combines hybrid dynamical systems theory and Bayesian non-parametric statistics to automatically segment demonstrations. Transition State Clustering (TSC) models demonstrations as noisy realizations of a switched linear dynamical system, and learns spatially and temporally consistent transition events across demonstrations. TSC uses a hierarchical Dirichlet Process Gaussian Mixture Model to avoid having to select the number of segments a priori. After a series of merging and pruning steps, the algorithm adaptively optimizes the number of segments. In a synthetic case study with two linear dynamical regimes, where demonstrations are corrupted with noise and temporal variations, TSC finds up to a 20% more accurate segmentation than GMM-based alternatives. On 67 recordings of surgical needle passing and suturing tasks from the JIGSAWS surgical training dataset [7], supplemented with manually annotated visual features, TSC finds 83% of needle passing segments and 73% of the suturing segments found by human experts. Qualitatively, TSC also identifies transitions overlooked by human annotators.

Access provided by CONRICYT-eBooks. Download chapter PDF

Temporal clustering of surgical activities in robot-assisted surgery

Article Open access 05 May 2017

Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery

Article 24 March 2021

Surgical Gesture Segmentation and Recognition

1 Introduction

Recorded demonstrations of robot-assisted minimally invasive surgery (RMIS) have been used for surgical skill assessment [7], development of finite state machines for automation [13, 25], learning from demonstration (LfD) [29], and calibration [22]. Intuitive Surgical’s da Vinci robot facilitated over 570, 000 procedures in 2014 [11]. There are proposals to record all of Intuitive’s RMIS procedures similar to flight data recorders (“black boxes”) in airplanes [12], which could lead to a proliferation of data. While these large datasets have the potential to facilitate learning and autonomy; the length and variability of surgical trajectories pose a unique challenge. Each surgical trajectory may represent minutes of multi-modal observations, may contain loops (failures and repetitions until achieving the desired result), and even identical procedures can vary due to differences in the environment. In this setting, typical techniques for establishing spatial and temporal correspondence that employ continuous deformations can be unreliable (e.g., Dynamic Time Warping [14] and spline-based registration [31]).

Segmentation of a task into sub-tasks can be valuable since individual segments are less complex, less variable, and allow for easier detection and rejection of outliers. Trajectory segmentation in robotics is an extensively studied problem [4, 5, 16, 20, 21, 26, 30]. However, prior work in robotic surgery focuses on the supervised problem setting, either requiring manual segmentation of example trajectories or using a set of pre-defined primitive motions called “surgemes” [21, 30, 36]. Manual labelling requires specifying consistent segmentation criteria and applying these criteria to across demonstrations, which can be time-consuming and unreliable. Similarly, it can be challenging to manually construct a dictionary of primitives at the correct level of abstraction.

Outside of surgery, there have been several proposals for unsupervised segmentation [5, 16, 20, 26], where the criteria are learned from data without a pre-defined dictionary. The salient feature of these approaches is a clustering or local regression model to identify locally similar states. Inherently, the success of unsupervised approaches is dependent on how well the demonstrations match the assumptions of the model (i.e., the definition of “similar”). In surgery, the tissue and environment may vary greatly between demonstrations making it difficult to directly compare different trajectories. Our insight is that while the trajectories may be very different, there can be a common latent structure in the demonstrations that can be learned from the data. Segmentation can be performed with respect to these latent parameters leading to robust segmentation criteria.

Transition State Clustering (TSC) combines hybrid dynamical system theory with Bayesian statistics to learn such a structure. We model demonstrations as repeated realizations of an unknown noisy switched linear dynamical system [8]. TSC identifies changes in local linearity in each demonstration, and leans a model to infer regions of the state-space at which switching events occur. These regions are generated from a hierarchical nonparametric Bayesian model, where the number of regions are determined by a Dirichlet Process and the shape of the regions are determined by a mixture of multivariate Gaussian random variables. A series of merging and pruning steps (controlled by user-specified parameters $\delta $ and $\rho $ respectively) remove outlier transition states.

We also explore how to use the video data that accompanies kinematic data in surgical demonstration recordings. In this work, we explore improving segmentation through hand-engineered visual features. We manually label the video stream with two features: a binary variable identifying object grasp events and a scalar variable indicating surface penetration depth. We evaluate results with and without these visual features (Sect. 5.4). In future work, we will explore automated methods to construct featurized representations of the video data.

2 Related Work and Background

Motion Primitives and Skill Learning: Motion primitives are segments that discretize the action-space of a robot, and can facilitate faster convergence in LfD [10, 23, 27]. On the other hand, TSC discretizes the state-space, which can be interpreted as segmenting a task and not a trajectory. Much of the initial work in motion primitives considered manually identified segments, but recently, Niekum et al. [26] proposed learning the set of primitives from demonstrations using the Beta-Process Autoregressive Hidden Markov Model (BP-AR-HMM). Calinon et al. [2] also build on a large corpus of literature of unsupervised skill segmentation including the task-parameterized movement model [6], and GMMs for segmentation [5].

The ideas in Niekum et al. inspire the results presented in this work, namely, the use of Bayesian non-parametric models for segmentation and switched linear models. Unlike Niekum et al. and our work, Calinon et al. do not employ Bayesian non-parametrics or multimodal data. In Niekum et al. transition events are only dependent on the current dynamical regime, and in TSC they also depend on the current state (as illustrated in Fig. 1 with a dashed line). In this paper, we extend this line of work with non-parametric clustering on a GMM based model, and account for specific challenges such as looping and inconsistency in surgical demonstrations.

Handling Temporal Inconsistency: The most common model for handling demonstrations that have varying temporal characteristics is Dynamic Time Warping (DTW). However, DTW is a greedy dynamic programming approach which assumes that trajectories are largely the same up-to some smooth temporal deformations. When there are significant variations due to looping or spurious states, this model can give unreliable results [14], as shown by our results.

Another common model for modeling temporal inconsistencies is the Finite State Markov Chain model with Gaussian Mixture Emissions (GMM$+$HMM) [1, 3, 15, 34]. These models, impose a probabilistic grammar on the segment transitions and can be learned with an EM algorithm. However, they can be sensitive to hyper-parameters such as the number of segments and the amount of data [32]. The problem of robustness in GMM$+$HMM (or closely related variants) has been addressed using down-weighting transient states [17] and sparsification [9]. In TSC, we explore whether it is sufficient to know transition states without having to fully parametrize a Markov Chain for accurate segmentation. In Fig. 1, we compare the graphical models of GMM$+$HMM, and TSC. The TSC model applies Dirichlet Process priors to automatically set the number of hidden states (regimes).

The TSC algorithm finds spatially and temporally similar transition states across demonstrations, and it does not have to model correlations between switching events–in essence, using the current state as a sufficient statistic for switching behavior. On the other hand, the typical GMM$+$HMM model learns a full $k\times k$ transition matrix. Consequently, we empirically find that the TSC model is robust to noise and temporal variation, especially for a small number of demonstrations.

Surgical Task Recognition: Surgical robotics has largely studied the problem of supervised segmentation using either segmented examples or a pre-defined dictionary of motions (similar to motion primitives). For example, given manually segmented videos, Zappella et al. [36] use features from both the videos and kinematic data to classify surgical motions. Simiarly, Quellec et al. [28] use manually segmented examples as training for segmentation and recognition of surgical tasks based on archived cataract surgery videos. The dictionary-based approaches are done with a domain-specific set of motion primitives for surgery called “surgemes”. A number of works (e.g., [19, 21, 33, 35]), use the surgemes to bootstrap learning segmentation.

3 Problem Setup and Model

The TSC model is summarized by the hierarchical graphical model in the previous section (Fig. 1). Here, we formalize each of the levels of the hierarchy and describe the assumptions in this work.

Dynamical System Model: Let $\mathscr {D}=\{d_i\}$ be the set of demonstrations where each $d_i$ is a trajectory $\mathbf {x}(t)$ of fully observed robot states and each state is a vector in $\mathbb {R}^d$. We model each demonstration as a switched linear dynamical system. There is a finite set of $d \times d$ matrices $\{A_1,\ldots ,A_k\}$, and an i.i.d zero-mean additive Gaussian Markovian noise process W(t) which accounts for noise in the dynamical model:

$$ \mathbf {x}(t+1) = A_{i}\mathbf {x}(t) + W(t) \text { : } A_i \in \{A_1,\ldots ,A_k\} $$

Transitions between regimes are instantaneous where each time t is associated with exactly one dynamical system matrix $1,\ldots ,k$.

Transition States and Times: Transition states are defined as the last states before a dynamical regime transition in each demonstration. Each demonstration $d_i$ follows a switched linear dynamical system model, therefore there is a time series of regimes A(t) associated with each demonstration. Consequently, there will be times t at which $A(t) \ne A(t+1)$.

We model the occurrence of these events as a stochastic process conditioned on the current state. Switching events are governed by a latent function of the current state $S:\mathscr {X}\mapsto \{0,1\}$, and we have noisy observations of switching events $\widehat{S}(x(t)) = S(x(t)+Q(t))$, where Q(t) is a i.i.d noise process. Thus, across all demonstrations, the observed switching events induce a probability density f(x) over the state space $\mathscr {X}$. In this paper, we focus on the problem where f(x) is a Mixture of Gaussian densities.

Transition State Clusters: Across all demonstrations, we are interested in aggregating nearby (spatially and temporally) transition states together. The goal of transition state clustering is to find a mixture model for f that approximately recovers the true latent function S. Consequently, a transition state cluster is defined as a clustering of the set of transition states across all demonstrations; partitioning these transition states into m non-overlapping similar groups:

$$ \mathscr {C} = \{C_1, C_2,\ldots ,C_m\} $$

Every $U_i$ can be represented as a sequence of integers indicating that transition states assignment to one of the transition state clusters $U_i=[1,2,4,2]$.

Consistency: We assume, demonstrations are consistent, meaning there exists a non-empty sequence of transition states $\mathscr {U}^*$ such that the partial order defined by the elements in the sequence (i.e., $s_1$ happens before $s_2$ and $s_3$) is satisfied by every $U_i$. For example,

$$\begin{aligned} U_1 = [1,3,4]\text {, }U_2 = [1,1,2,4]\text {, }\mathscr {U}^*=[1,4] \end{aligned}$$

A counter example,

$$\begin{aligned} U_1 = [1,3,4]\text {, }U_2 = [2,5]\text {, }\mathscr {U}^*\text { no solution} \end{aligned}$$

Intuitively, this condition states that there have to be a consistent ordering of actions over all demonstrations up to some additional regimes (e.g., spurious actions).

Loops: Loops are common in surgical demonstrations. For example, a surgeon may attempt to insert a needle 2–3 times. When demonstrations have varying amounts of retrials it is challenging. In this work, we assume that these loops are modeled as repeated transitions between transition state clusters, which is justified in our experimental datasets, for example,

$$\begin{aligned} U_1 = [1,3,4]\text {, }U_2 = [1,3,1,3,1,3,4]\text {, }\mathscr {U}^*=[1,3,4] \end{aligned}$$

Our algorithm will compact these loops together into a single transition.

Minimal Solution: Given a consistent set of demonstrations, that have additional regimes and loops, the goal of the algorithm is to find a minimal solution, $\mathscr {U}^*$ that is loop-free and respects the partial order of transitions in all demonstrations.

Given a set of demonstrations $\mathscr {D}$, the Transition State Clustering problem is to find a set of transition state clusters $\mathscr {C}$ such that they represent a minimal parametrization of the demonstrations.

Multi-modal TSC: This model can similarly be extended to states derived from sensing. Suppose at every time t, there is a feature vector z(t). Then the augmented state of both the robot spatial state and the features denoted is:

$$ \mathbf {x}(t) = \left( {\begin{array}{c}x(t)\\ z(t)\end{array}}\right) $$

In our experiments, we worked the da Vinci surgical robot with two 7-DOF arms, each with 2 finger grippers. Consider the following feature representation which we used in our experiments:

1.
Gripper grasp. Indicator that is 1 if there is an object between the gripper, 0 otherwise.
2.
Surface Penetration. In surgical tasks, we often have a tissue phantom. This feature describes whether the robot (or something the robot is holding like a needle) has penetrated the surface. We use an estimate of the truncated penetration depth to encode this feature. If there is no penetration, the value is 0, otherwise the value of penetration is the robot’s kinematic position in the direction orthogonal to the tissue phantom.

4 Transition State Clustering

In this section, we describe the hierarchical clustering process of TSC. This algorithm is a greedy approach to learning the parameters in the graphical model in Fig. 1. We decompose the hierarchical model into stages and fit parameters to the generative model at each stage. The full algorithm is described in Algorithm 1.

4.1 Background: Bayesian Statistics

One challenge with mixture models is hyper-parameter selection, such as the number of clusters. Recent results in Bayesian statistics can mitigate some of these problems. The basic recipe is to define a generative model, and then use Expectation Maximization to fit the parameters of the model to observed data. The generative model that we will use is called a mixture model, which defines a probability distribution that is a composite of multiple distributions.

One flexible class of mixture models are Gaussian Mixture Models (GMM), which are described generatively as follows. We first sample some c from a categorical distribution, one that takes on values from (1...K), with probabilities $\phi $, where $\phi $ is a K dimensional simplex:

$$ c \sim cat(K,\phi ) $$

Then, given the event $\{ c = i \}$, we specify a multivariate Gaussian distribution:

$$ x_i \sim N(\mu _i, \Sigma _i) $$

The insight is that a stochastic process called the Dirichlet Process (DP) defines a distribution over discrete distributions, and thus instead we can draw samples of $cat(K,\phi )$ to find the most likely choice of K via EM. The result is the following model:

$$\begin{aligned} (K,\phi ) \sim DP(H,\alpha ) \qquad c \sim cat(K,\phi ) \qquad X \sim N(\mu _i, \Sigma _i) \end{aligned}$$

(1)

After fitting the model, every observed sample of $x \sim X$ will have a probability of being generated from a mixture component $P(x \mid c = i)$. Every observation x will have a most likely generating component. It is worth noting that each cluster defines an ellipsoidal region in the feature space of x, because of the Gaussian noise model $N(\mu _i, \Sigma _i)$.

We denote this entire clustering method in the remainder of this work as DP-GMM. We use the same model at multiple levels of the hierarchical clustering and we will describe the feature space at each level. We use a MATLAB software package to solve this problem using a variational EM algorithm [18].

4.2 Transition States Identification

The first step is to identify a set of transition states for each demonstration in $\mathscr {D}$. To do this, we have to fit a switched dynamic system model to the trajectories. Suppose there was only one regime, then this would be a linear regression problem:

$$ \arg \min _A \Vert A X_t - X_{t+1}\Vert $$

where $X_t$ and $X_{t+1}$ are matrices where each column vector is corresponding x(t) and $x(t+1)$. Moldovan et al. [24] showed that fitting a jointly Gaussian model to $n(t) = \left( {\begin{array}{c}\mathbf {x}(t+1)\\ \mathbf {x}(t)\end{array}}\right) $ is equivalent to Bayesian Linear Regression.

Therefore, to fit a switched linear dynamical system model, we can fit a Mixture of Gaussians (GMM) model to n(t) via DP-GMM. Each cluster learned signifies a different regime, and co-linear states are in the same cluster. To find transition states, we move along a trajectory from $t=1,\ldots ,t_f$, and find states at which n(t) is in a different cluster than $n(t+1)$. These points mark a transition between clusters (i.e., transition regimes).

4.3 Transition State Pruning

We consider the problem of outlier transitions, ones that appear only in a few demonstrations. Each of these regimes will have constituent vectors where each n(t) belongs to a demonstration $d_i$. Transition states that mark transitions to or from regimes whose constituent vectors come from fewer than a fraction $\rho $ demonstrations are pruned. $\rho $ should be set based on the expected rarity of outliers. In our experiments, we set the parameter $\rho $ to 80% and show the results with and without this step.

4.4 Transition State Compaction

Once we have transition states for each demonstration, and have applied pruning, the next step is to remove transition states that correspond to looping actions, which are prevalent in surgical demonstrations. We model this behavior as consecutive linear regimes repeating, i.e., transition from i to j and then a repeated i to j. We apply this step after pruning to take advantage of the removal of outlier regimes during the looping process. These repeated transitions can be compacted together to make a single transition.

The key question is how to differentiate between repetitions that are part of the demonstration and ones that correspond to looping actions–the sequence might contain repetitions not due to looping. To differentiate this, as a heuristic, we threshold the L2 distance between consecutive segments with repeated transitions. If the L2 distance is low, we know that the consecutive segments are happening in a similar location as well. In our datasets, this is a good indication of looping behavior. If the L2 distance is larger, then repetition between dynamical regimes might be happening but the location is changing.

For each demonstration, we define a segment $\mathbf {s}^{(j)}[t]$ of states between each transition states. The challenge is that $\mathbf {s}^{(j)}[t]$ and $\mathbf {s}^{(j+1)}[t]$ may have a different number of observations and may be at different time scales. To address this challenge, we apply Dynamic Time Warping (DTW). Since segments are locally similar up-to small time variations, DTW can find a most-likely time alignment of the two segments.

Let $\mathbf {s}^{(j+1)}[t^*]$ be a time aligned (w.r.t to $\mathbf {s}^{(j)}$) version of $\mathbf {s}^{(j+1)}$. Then, after alignment, we define the $L_2$ metric between the two segments:

$$ d(j,j+1) = \frac{1}{T}\sum _{t=0}^T (\mathbf {s}^{(j)}[i] - \mathbf {s}^{(j+1)}[i^*])^2 $$

when $d\le \delta $, we compact two consecutive segments. $\delta $ is chosen empirically and a larger $\delta $ leads to a sparser distribution of transition states, and smaller $\delta $ leads to more transition states. For our needle passing and suturing experiments, we set $\delta $ to correspond to the distance between two suture/needle insertion points–thus, differentiating between repetitions at the same point versus at others.

However, since we are removing points from a time-series this requires us to adjust the time scale. Thus, from every following observation, we shift the time stamp back by the length of the compacted segments.

4.5 State-Space Clustering

After compaction, there are numerous transition states at different locations in the state-space. If we model the states at transition states as drawn from a GMM model:

$$ {x}(t) \sim N(\mu _i, \Sigma _i) $$

Then, we can apply the DP-GMM again to cluster the state vectors at the transition states. Each cluster defines an ellipsoidal region of the state-space space.

4.6 Time Clustering

Without temporal localization, the transitions may be ambiguous. For example, in circle cutting, the robot may pass over a point twice in the same task. The challenge is that we cannot naively use time as another feature, since it is unclear what metric to use to compare distance between $\left( {\begin{array}{c}\mathbf {x}(t)\\ t\end{array}}\right) $. However a second level of clustering by time within each state-space cluster can overcome this issue. Within a state cluster, if we model the times which change points occur as drawn from a GMM $t \sim N(\mu _i, \sigma _i)$, and then we can apply DP-GMM to the set of times. We cluster time second because we observe that the surgical demonstrations are more consistent spatially than temporally. This groups together events that happen at similar times during the demonstrations. The result is clusters of states and times. Thus, a transition states $m_k$ is defined as tuple of an ellipsoidal region of the state-space and a time interval.

5 Results

5.1 Experiment 1. Synthetic Example of 2-Segment Trajectory

In our first experiment, we segment noisy observations from a two regime linear dynamical system. Figure 2 illustrates examples from this system under the different types of corruption. Since there is a known a ground truth of two segments, we measure the precision (average fraction of observations in each segment that are from the same regime) and recall (average fraction of observations from each regime segmented together) in recovering these two segments. We can jointly consider precision and recall with the F1 Score which is the harmonic mean of the two. We compare three techniques against TSC: K-Means (only spatial), GMM$+$T (using time as a feature in a GMM), GMM$+$HMM (using an HMM to model the grammar). For the GMM techniques, we have to select the number of segments, and we experiment with $k=1,2,3$ (i.e., a slightly sub-optimal parameter choice compared to $k=2$). In this example, for TSC, we set the two user-specified parameters to $\delta =0$ (merge all repeated transitions), and $\rho =80\%$ (prune all regimes representing less than $80\%$ of the demonstrations).

First, we generate 100 noisy observations (additive zero mean Gaussian noise) from the system without loops or spurious states–effectively only measuring the DP-GMM versus the alternatives. Figure 3a shows the F1-score as a function of the noise in the observations. Initially, for an appropriate parameter choice $k=2$ both of the GMM-based methods perform well and at low noise levels the DP-GMM used by our work mirrors this performance. However, if the parameter is set to be $k=3$, we see that the performance significantly degrades. $k=1$ corresponds to a single segment which has a F1 score of 0.4 on all figures. The DP-GMM mitigates this sensitivity to the choice of parameter by automatically setting the value. Furthermore, as the noise increases, the $80\%$ pruning of DP-GMM mitigates the effect of outliers leading to improved accuracy.

In Fig. 3b, we look at the accuracy of each technique as a function of the number of demonstrations. GMM$+$HMM has more parameters to learn and therefore requires more data. GMM$+$T converges the fastest, TSC requires slightly more data, and the GMM$+$HMM requires the most. In Fig. 3c, we corrupt the observations with spurious dynamical regimes. These are random transition matrices which replace one of the two dynamical regimes. We vary the rate at which we randomly corrupt the data, and measure the performance of the different segmentation techniques as a function of this rate. Due to the pruning, TSC gives the most accurate segmentation. The Dirichlet process groups the random transitions in different clusters and the small clusters are pruned out. On the other hand, the pure GMM techniques are less accurate since they are looking for exactly two regimes.

In Fig. 3d, introduce corruption due to loops and compare the different techniques. A loop is a step that returns to the start of the regime randomly, and we vary this random rate. For an accurately chosen parameter $k=2$, for the GMM−HMM, it gives the most accurate segmentation. However, when this parameter is set poorly $k=3$, the accuracy is significantly reduced. On the other hand, using time as a GMM feature (GMM$+$T) does not work since it does not know how to group loops into the same regime.

5.2 Surgical Experiments: Evaluation Tasks

We describe the three tasks used in our evaluation, and show manually segmented versions in Fig. 4. This will serve as ground truth when qualitatively evaluating our segmentation on real data.

Circle Cutting: In this task, we have a 5 cm diameter circle drawn on a piece of gauze. The first step is to cut a notch into the circle. The second step is to cut clockwise. Next, the robot transitions to the other side cutting counter clockwise. Finally, the robot finishes the cut at the meeting point of the two incisions. As the left arm’s only action is maintain the gauze in tension, we exclude it from the analysis. In Fig. 4a, we mark 6 manually identified transitions points for this task from [25]: (1) start, (2) notch, (3) finish 1st cut, (4) cross-over, (5) finish 2nd cut, and (6) connect the two cuts. For the circle cutting task, we collected 10 demonstrations by non-experts familiar with operating the da Vinci Research Kit (dVRK).

We apply our method to the JIGSAWS dataset [7] consisting of surgical activity for human motion modeling. The dataset was captured using the da Vinci Surgical System from eight surgeons with different levels of skill performing five repetitions each of Needle Passing and Suturing.

Needle Passing: We applied our framework to 28 demonstrations of the needle passing task. The robot passes a needle through a loop using its right arm, then its left arm to pull the needle through the loop. Then, the robot hands the needle off from the left arm to the right arm. This is repeated four times as illustrated with a manual segmentation in Fig. 4b.

Suturing: Next, we explored 39 examples of a 4 throw suturing task (Fig. 4c). Using the right arm, the first step is to penetrate one of the points on right side. The next step is to force the needle through the phantom to the other side. Using the left arm, the robot pulls the needle out of the phantom, and then hands it off to the right arm for the next point.

5.3 Experiment 2. Pruning and Compaction

In Fig. 5, we highlight the benefit of pruning and compaction using the Suturing task as exemplar. First, we show the transition states without applying the compaction step to remove looping transition states (Fig. 5a). We find that there are many more transition states at the “insert” step of the task. Compaction removes the segments that correspond to a loop of the insertions. Next, we show the all of the clusters found by DP-GMM. The centroids of these clusters are marked in Fig. 5b. Many of these clusters are small containing only a few transition states. This is why we created the heuristic to prune clusters that do not have transition states from at least 80% of the demonstrations. In all, 11 clusters are pruned by this rule.

5.4 Experiment 3. Can Vision Help?

In the next experiment, we evaluate TSC in a featurized state space that incorporates states derived from vision (Described in Sect. 5.1). We illustrate the transition states in Fig. 6 with and without visual features on the circle cutting task. At each point where the model transitions, we mark the end-effector (x, y, z) location. In particular, we show a region (red box) to highlight the benefits of these features. During the cross-over phase of the task, the robot has to re-enter the notch point and adjust to cut the other half of the circle. When only using the end-effector position, the locations where this transition happens is unreliable as operators may approach the entry from slightly different angles. On the other hand, the use of a gripper contact binary feature clusters the transition states around the point at which the gripper is in position and ready to begin cutting again. In the subsequent experiments, we use the same two visual features.

5.5 Experiment 4. TSC Evaluation

Circle Cutting: Figure 7a shows the transition states obtained from our algorithm. And Fig. 7b shows the TSC clusters learned (numbered by time interval midpoint). The algorithm found 8 clusters, one of which was pruned out using our $\rho =80\%$ threshold rule.

The remaining 7 clusters correspond well to the manually identified transition points. It is worth noting that there is one extra cluster (marked $2'$), that does not correspond to a transition in the manual segmentation. At $2'$, the operator finishes a notch and begins to cut. While at a logical level notching and cutting are both penetration actions, they correspond to two different linear transition regimes due to the positioning of the end-effector. Thus, TSC separates them into different clusters even though a human annotator may not do so.

Needle Passing: In Fig. 8a, we plot the transition states in (x, y, z) end-effector space for both arms. We find that these transition states correspond well to the logical segments of the task (Fig. 4b). These demonstrations are noisier than the circle cutting demonstrations and there are more outliers. The subsequent clustering finds 9 (2 pruned). Next, Fig. 8b–c illustrate the TSC clusters. We find that again TSC learns a small parametrization for the task structure with the clusters corresponding well to the manual segments. However, in this case, the noise does lead to a spurious cluster (4 marked in green). One possible explanation is that the two middle loops are in close proximity and demonstrations contain many adjustments to avoid colliding with the loop and the other arm while passing the needle through leading to numerous transition states in that location.

Suturing: In Fig. 9, we show the transition states and clusters for the suturing task. As before, we mark the left arm in orange and the right arm in blue. This task was far more challenging than the previous tasks as the demonstrations were inconsistent. These inconsistencies were in the way the suture is pulled after insertion (some pull to the left, some to the right, etc.), leading to transition states all over the state space. Furthermore, there were numerous demonstrations with looping behaviors for the left arm. In fact, the DP-GMM method gives us 23 clusters, 11 of which represent less than 80% of the demonstrations and thus are pruned (we illustrate the effect of the pruning in the next section). In the early stages of the task, the clusters clearly correspond to the manually segmented transitions. As the task progresses, we see that some of the later clusters do not.

5.6 Experiment 5. Comparison to “Surgemes”

Surgical demonstrations have an established set of primitives called surgemes, and we evaluate if segments discovered by our approach correspond to surgemes. In Table 1, we compare the number of TSC segments for needle passing and suturing to the number of annotated surgeme segments. A key difference between our segmentation and number of annotated surgemes is our compaction and pruning steps. To account for this, we first select a set of surgemes that are expressed in most demonstrations (i.e., simulating pruning), and we also apply a compaction step to the surgeme segments. In case of consecutive appearances of these surgemes, we only keep the 1 instance of each for compaction. We explore two metrics: TSC-Surgeme the fraction of TSC clusters with only one surgeme switch (averaged over all demonstrations), and Surgeme-TSC the fraction of surgeme switches that fall inside exactly one TSC clusters.

Table 1 83 and $73\%$ of transition clusters for needle passing and suturing respectively contained exactly one surgeme transition

Full size table

6 Conclusion and Future Work

We presented Transition State Clustering (TSC), which leverages hybrid dynamical system theory and Bayesian statistics to robustly learn segmentation criteria. To learn these clusters, TSC uses a hierarchical Dirichlet Process Gaussian Mixture Model (DP-GMM) with a series of merging and pruning steps. Our results on a synthetic example suggest that the hierarchical clusters are more robust to looping and noise, which are prevalent in surgical data. We further applied our algorithm to three surgical datasets and found that the transition state clusters correspond well to hand annotations and transitions w.r.t motions from a pre-defined surgical motion vocabulary called surgemes.

There are a number of important open-questions for future work. First, we believe that the growing maturity of Convolutional Neural Networks can facilitate transition state clustering directly from raw data (e.g., pixels), as opposed to the features studied in this work, and is a promising avenue for future work. Next, we are also particularly interested in closing-the-loop and using segmentation to facilitate optimal control or reinforcement learning. Finally, we are also interested in relaxing the consistency and normality assumptions in our parameter inference algorithm.

References

Asfour, T., Gyarfas, F., Azad, P., Dillmann, R.: Imitation learning of dual-arm manipulation tasks in humanoid robots. In: 2006 6th IEEE-RAS International Conference on Humanoid Robots, pp. 40–47 (2006)
Google Scholar
Calinon, S.: Skills learning in robots by interaction with users and environment. In: 2014 11th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 161–162. IEEE (2014)
Google Scholar
Calinon, S., Billard, A.: Stochastic gesture production and recognition model for a humanoid robot. In: Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems 2004, (IROS 2004), vol. 3, pp. 2769–2774 (2004)
Google Scholar
Calinon, S., Halluin, F.D., Caldwell, D.G., Billard, A.G.: Handling of multiple constraints and motion alternatives in a robot programming by demonstration framework. In: 9th IEEE-RAS International Conference on Humanoid Robots, 2009, Humanoids 2009, pp. 582–588. IEEE (2009)
Google Scholar
Calinon, S., D’halluin, F., Sauser, E.L., Caldwell, D.G., Billard, A.G.: Learning and reproduction of gestures by imitation. IEEE Robot. Autom. Mag. 17(2), 44–54 (2010)
Article Google Scholar
Calinon, S., Bruno, D., Caldwell, D.G.: A task-parameterized probabilistic model with minimal intervention control. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3339–3344 (2014)
Google Scholar
Gao, Y., Vedula, S., Reiley, C., Ahmidi, N., Varadarajan, B., Lin, H., Tao, L., Zappella, L., Bejar, B., Yuh, D., Chen, C., Vidal, R., Khudanpur, S., Hager, G.: The jhu-isi gesture and skill assessment dataset (jigsaws): a surgical activity working set for human motion modeling. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2014)
Google Scholar
Goebel, R., Sanfelice, R.G., Teel, A.: Hybrid dynamical systems. IEEE Control Syst. 29(2), 28–93 (2009)
Article MathSciNet Google Scholar
Grollman, D.H., Jenkins, O.C.: Incremental learning of subtasks from unsegmented demonstration. In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 261–266. IEEE (2010)
Google Scholar
Ijspeert, A., Nakanishi, J., Schaal, S.: Learning attractor landscapes for learning motor primitives. In: Neural Information Processing Systems (NIPS), pp. 1523–1530 (2002)
Google Scholar
Intuitive Surgical: Annual report (2014). http://investor.intuitivesurgical.com/phoenix.zhtml?c=122359&p=irol-IRHome
Johns Hopkins: Surgical robot precision. http://eng.jhu.edu/wse/magazine-winter-14/print/surgical-precision
Kehoe, B., Kahn, G., Mahler, J., Kim, J., Lee, A., Lee, A., Nakagawa, K., Patil, S., Boyd, W., Abbeel, P., Goldberg, K.: Autonomous multilateral debridement with the raven surgical robot. In: International Conference on Robotics and Automation (ICRA) (2014)
Google Scholar
Keogh, E.J., Pazzani, M.J.: Derivative dynamic time warping. SIAM
Google Scholar
Kruger, V., Herzog, D., Baby, S., Ude, A., Kragic, D.: Learning actions from observations. IEEE Robot. Autom. Mag. 17(2), 30–43 (2010)
Article Google Scholar
Krüger, V., Tikhanoff, V., Natale, L., Sandini, G.: Imitation learning of non-linear point-to-point robot motions using dirichlet processes. In: 2012 IEEE International Conference on Robotics and Automation (ICRA), pp. 2029–2034. IEEE (2012)
Google Scholar
Kulić, D., Nakamura, Y.: Scaffolding on-line segmentation of full body human motion patterns. In: 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2008, pp. 2860–2866. IEEE (2008)
Google Scholar
Kurihara, K., Welling, M., Vlassis, N.A.: Accelerated variational dirichlet process mixtures. In: Advances in Neural Information Processing Systems, pp. 761–768 (2006)
Google Scholar
Lea, C., Hager, G.D., Vidal, R.: An improved model for segmentation and recognition of fine-grained activities with application to surgical training tasks. In: WACV (2015)
Google Scholar
Lee, S.H., Suh, I.H., Calinon, S., Johansson, R.: Autonomous framework for segmenting robot trajectories of manipulation task. Auton. Robots 38(2), 107–141 (2014)
Article Google Scholar
Lin, H., Shafran, I., Murphy, T., Okamura, A., Yuh, D., Hager, G.: Automatic detection and segmentation of robot-assisted surgical motions. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 802–810. Springer (2005)
Google Scholar
Mahler, J., Krishnan, S., Laskey, M., Sen, S., Murali, A., Kehoe, B., Patil, S., Wang, J., Franklin, M., Abbeel, P.K.G.: Learning accurate kinematic control of cable-driven surgical robots using data cleaning and gaussian process regression. In: International Conference on Automated Sciences and Engineering (CASE), pp. 532–539 (2014)
Google Scholar
Manschitz, S., Kober, J., Gienger, M., Peters, J.: Learning movement primitive attractor goals and sequential skills from kinesthetic demonstrations. Robot. Auton. Syst. 74(5), 97–107 (2015)
Article Google Scholar
Moldovan, T., Levine, S., Jordan, M., Abbeel, P.: Optimism-driven exploration for nonlinear systems. In: International Conference on Robotics and Automation (ICRA) (2015)
Google Scholar
Murali, A., Sen, S., Kehoe, B., Garg, A., McFarland, S., Patil, S., Boyd, W., Lim, S., Abbeel, P., Goldberg, K.: Learning by observation for surgical subtasks: multilateral cutting of 3d viscoelastic and 2d orthotropic tissue phantoms. In: International Conference on Robotics and Automation (ICRA) (2015)
Google Scholar
Niekum, S., Osentoski, S., Konidaris, G., Barto, A.: Learning and generalization of complex tasks from unstructured demonstrations. In: International Conference on Intelligent Robots and Systems (IROS), pp. 5239–5246. IEEE (2012)
Google Scholar
Pastor, P., Hoffmann, H., Asfour, T., Schaal, S.: Learning and generalization of motor skills by learning from demonstration. In: International Conference on Robotics and Automation (ICRA), pp. 763–768. IEEE (2009)
Google Scholar
Quellec, G., Lamard, M., Cochener, B., Cazuguel, G.: Real-time segmentation and recognition of surgical tasks in cataract surgery videos. IEEE Trans. Med. Imag. 33(12), 2352–2360 (2014)
Article Google Scholar
Reiley, C.E., Plaku, E., Hager, G.D.: Motion generation of robotic surgical tasks: learning from expert demonstrations. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 967–970. IEEE (2010)
Google Scholar
Rosen, J., Brown, J.D., Chang, L., Sinanan, M.N., Hannaford, B.: Generalized approach for modeling minimally invasive surgery as a stochastic process using a discrete markov model. IEEE Trans. Biomed. Eng. 53(3), 399–413 (2006)
Article Google Scholar
Schulman, J., Ho, J., Lee, C., Abbeel, P.: Learning from demonstrations through the use of non-rigid registration
Google Scholar
Tang, H., Hasegawa-Johnson, M., Huang, T.S.: Toward robust learning of the gaussian mixture state emission densities for hidden markov models. In: 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp. 5242–5245. IEEE (2010)
Google Scholar
Tao, L., Zappella, L., Hager, G.D., Vidal, R.: Surgical gesture segmentation and recognition. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013, pp. 339–346. Springer (2013)
Google Scholar
Vakanski, A., Mantegh, I., Irish, A., Janabi-Sharifi, F.: Trajectory learning for robot programming by demonstration using hidden markov model and dynamic time warping. IEEE Trans. Syst. Man Cybern. Part B Cybern. 42(4), 1039–1052 (2012)
Article Google Scholar
Varadarajan, B., Reiley, C., Lin, H., Khudanpur, S., Hager, G.: Data-derived models for segmentation with application to surgical assessment and training. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 426–434. Springer (2009)
Google Scholar
Zappella, L., Bejar, B., Hager, G., Vidal, R.: Surgical gesture classification from video and kinematic data. Med. Image Analysis 17(7), 732–745 (2013)
Article Google Scholar

Download references

Acknowledgements

This research was supported in part by a seed grant from the UC Berkeley Center for Information Technology in the Interest of Society (CITRIS), by the U.S. National Science Foundation under Award IIS-1227536: Multilateral Manipulation by Human-Robot Collaborative Systems. This work has been supported in part by funding from Google and Cisco. We also thank Florian Pokorny, Jeff Mahler, and Michael Laskey.

Author information

Authors and Affiliations

EECS, UC Berkeley, Berkeley, CA, USA
Sanjay Krishnan, Sachin Patil, Pieter Abbeel & Ken Goldberg
IEOR, UC Berkeley, Berkeley, CA, USA
Animesh Garg & Ken Goldberg
Computer Science Department, The Johns Hopkins University, Baltimore, MD, USA
Colin Lea & Gregory Hager

Authors

Sanjay Krishnan
View author publications
You can also search for this author in PubMed Google Scholar
Animesh Garg
View author publications
You can also search for this author in PubMed Google Scholar
Sachin Patil
View author publications
You can also search for this author in PubMed Google Scholar
Colin Lea
View author publications
You can also search for this author in PubMed Google Scholar
Gregory Hager
View author publications
You can also search for this author in PubMed Google Scholar
Pieter Abbeel
View author publications
You can also search for this author in PubMed Google Scholar
Ken Goldberg
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Sanjay Krishnan or Animesh Garg .

Editor information

Editors and Affiliations

Istituto Italiano di Tecnologia, Genova, Italy, University of Pisa, Pisa, Italy , Pisa, Italy
Antonio Bicchi
Inst. für Informatik, Albert-Ludwigs-Universität Freiburg Inst. für Informatik, Freiburg, Germany
Wolfram Burgard

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Krishnan, S. et al. (2018). Transition State Clustering: Unsupervised Surgical Trajectory Segmentation for Robot Learning. In: Bicchi, A., Burgard, W. (eds) Robotics Research. Springer Proceedings in Advanced Robotics, vol 3. Springer, Cham. https://doi.org/10.1007/978-3-319-60916-4_6

Download citation

DOI: https://doi.org/10.1007/978-3-319-60916-4_6
Published: 25 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60915-7
Online ISBN: 978-3-319-60916-4
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics