1 Introduction

Recorded demonstrations of robot-assisted minimally invasive surgery (RMIS) have been used for surgical skill assessment [7], development of finite state machines for automation [13, 25], learning from demonstration (LfD) [29], and calibration [22]. Intuitive Surgical’s da Vinci robot facilitated over 570, 000 procedures in 2014 [11]. There are proposals to record all of Intuitive’s RMIS procedures similar to flight data recorders (“black boxes”) in airplanes [12], which could lead to a proliferation of data. While these large datasets have the potential to facilitate learning and autonomy; the length and variability of surgical trajectories pose a unique challenge. Each surgical trajectory may represent minutes of multi-modal observations, may contain loops (failures and repetitions until achieving the desired result), and even identical procedures can vary due to differences in the environment. In this setting, typical techniques for establishing spatial and temporal correspondence that employ continuous deformations can be unreliable (e.g., Dynamic Time Warping [14] and spline-based registration [31]).

Segmentation of a task into sub-tasks can be valuable since individual segments are less complex, less variable, and allow for easier detection and rejection of outliers. Trajectory segmentation in robotics is an extensively studied problem [4, 5, 16, 20, 21, 26, 30]. However, prior work in robotic surgery focuses on the supervised problem setting, either requiring manual segmentation of example trajectories or using a set of pre-defined primitive motions called “surgemes” [21, 30, 36]. Manual labelling requires specifying consistent segmentation criteria and applying these criteria to across demonstrations, which can be time-consuming and unreliable. Similarly, it can be challenging to manually construct a dictionary of primitives at the correct level of abstraction.

Outside of surgery, there have been several proposals for unsupervised segmentation [5, 16, 20, 26], where the criteria are learned from data without a pre-defined dictionary. The salient feature of these approaches is a clustering or local regression model to identify locally similar states. Inherently, the success of unsupervised approaches is dependent on how well the demonstrations match the assumptions of the model (i.e., the definition of “similar”). In surgery, the tissue and environment may vary greatly between demonstrations making it difficult to directly compare different trajectories. Our insight is that while the trajectories may be very different, there can be a common latent structure in the demonstrations that can be learned from the data. Segmentation can be performed with respect to these latent parameters leading to robust segmentation criteria.

Transition State Clustering (TSC) combines hybrid dynamical system theory with Bayesian statistics to learn such a structure. We model demonstrations as repeated realizations of an unknown noisy switched linear dynamical system [8]. TSC identifies changes in local linearity in each demonstration, and leans a model to infer regions of the state-space at which switching events occur. These regions are generated from a hierarchical nonparametric Bayesian model, where the number of regions are determined by a Dirichlet Process and the shape of the regions are determined by a mixture of multivariate Gaussian random variables. A series of merging and pruning steps (controlled by user-specified parameters \(\delta \) and \(\rho \) respectively) remove outlier transition states.

We also explore how to use the video data that accompanies kinematic data in surgical demonstration recordings. In this work, we explore improving segmentation through hand-engineered visual features. We manually label the video stream with two features: a binary variable identifying object grasp events and a scalar variable indicating surface penetration depth. We evaluate results with and without these visual features (Sect. 5.4). In future work, we will explore automated methods to construct featurized representations of the video data.

2 Related Work and Background

Motion Primitives and Skill Learning: Motion primitives are segments that discretize the action-space of a robot, and can facilitate faster convergence in LfD [10, 23, 27]. On the other hand, TSC discretizes the state-space, which can be interpreted as segmenting a task and not a trajectory. Much of the initial work in motion primitives considered manually identified segments, but recently, Niekum et al. [26] proposed learning the set of primitives from demonstrations using the Beta-Process Autoregressive Hidden Markov Model (BP-AR-HMM). Calinon et al. [2] also build on a large corpus of literature of unsupervised skill segmentation including the task-parameterized movement model [6], and GMMs for segmentation [5].

The ideas in Niekum et al. inspire the results presented in this work, namely, the use of Bayesian non-parametric models for segmentation and switched linear models. Unlike Niekum et al. and our work, Calinon et al. do not employ Bayesian non-parametrics or multimodal data. In Niekum et al. transition events are only dependent on the current dynamical regime, and in TSC they also depend on the current state (as illustrated in Fig. 1 with a dashed line). In this paper, we extend this line of work with non-parametric clustering on a GMM based model, and account for specific challenges such as looping and inconsistency in surgical demonstrations.

Handling Temporal Inconsistency: The most common model for handling demonstrations that have varying temporal characteristics is Dynamic Time Warping (DTW). However, DTW is a greedy dynamic programming approach which assumes that trajectories are largely the same up-to some smooth temporal deformations. When there are significant variations due to looping or spurious states, this model can give unreliable results [14], as shown by our results.

Another common model for modeling temporal inconsistencies is the Finite State Markov Chain model with Gaussian Mixture Emissions (GMM\(+\)HMM) [1, 3, 15, 34]. These models, impose a probabilistic grammar on the segment transitions and can be learned with an EM algorithm. However, they can be sensitive to hyper-parameters such as the number of segments and the amount of data [32]. The problem of robustness in GMM\(+\)HMM (or closely related variants) has been addressed using down-weighting transient states [17] and sparsification [9]. In TSC, we explore whether it is sufficient to know transition states without having to fully parametrize a Markov Chain for accurate segmentation. In Fig. 1, we compare the graphical models of GMM\(+\)HMM, and TSC. The TSC model applies Dirichlet Process priors to automatically set the number of hidden states (regimes).

The TSC algorithm finds spatially and temporally similar transition states across demonstrations, and it does not have to model correlations between switching events–in essence, using the current state as a sufficient statistic for switching behavior. On the other hand, the typical GMM\(+\)HMM model learns a full \(k\times k\) transition matrix. Consequently, we empirically find that the TSC model is robust to noise and temporal variation, especially for a small number of demonstrations.

Fig. 1
figure 1

a A finite-state Hidden Markov Chain with Gaussian Mixture Emissions (GMM\(\,{+}\,\)HMM), and b TSC model. TSC uses Dirichlet Process Priors and the concept of transition states to learn a robust segmentation

Surgical Task Recognition: Surgical robotics has largely studied the problem of supervised segmentation using either segmented examples or a pre-defined dictionary of motions (similar to motion primitives). For example, given manually segmented videos, Zappella et al. [36] use features from both the videos and kinematic data to classify surgical motions. Simiarly, Quellec et al. [28] use manually segmented examples as training for segmentation and recognition of surgical tasks based on archived cataract surgery videos. The dictionary-based approaches are done with a domain-specific set of motion primitives for surgery called “surgemes”. A number of works (e.g., [19, 21, 33, 35]), use the surgemes to bootstrap learning segmentation.

3 Problem Setup and Model

The TSC model is summarized by the hierarchical graphical model in the previous section (Fig. 1). Here, we formalize each of the levels of the hierarchy and describe the assumptions in this work.

Dynamical System Model: Let \(\mathscr {D}=\{d_i\}\) be the set of demonstrations where each \(d_i\) is a trajectory \(\mathbf {x}(t)\) of fully observed robot states and each state is a vector in \(\mathbb {R}^d\). We model each demonstration as a switched linear dynamical system. There is a finite set of \(d \times d\) matrices \(\{A_1,\ldots ,A_k\}\), and an i.i.d zero-mean additive Gaussian Markovian noise process W(t) which accounts for noise in the dynamical model:

$$ \mathbf {x}(t+1) = A_{i}\mathbf {x}(t) + W(t) \text { : } A_i \in \{A_1,\ldots ,A_k\} $$

Transitions between regimes are instantaneous where each time t is associated with exactly one dynamical system matrix \(1,\ldots ,k\).

Transition States and Times: Transition states are defined as the last states before a dynamical regime transition in each demonstration. Each demonstration \(d_i\) follows a switched linear dynamical system model, therefore there is a time series of regimes A(t) associated with each demonstration. Consequently, there will be times t at which \(A(t) \ne A(t+1)\).

We model the occurrence of these events as a stochastic process conditioned on the current state. Switching events are governed by a latent function of the current state \(S:\mathscr {X}\mapsto \{0,1\}\), and we have noisy observations of switching events \(\widehat{S}(x(t)) = S(x(t)+Q(t))\), where Q(t) is a i.i.d noise process. Thus, across all demonstrations, the observed switching events induce a probability density f(x) over the state space \(\mathscr {X}\). In this paper, we focus on the problem where f(x) is a Mixture of Gaussian densities.

Transition State Clusters: Across all demonstrations, we are interested in aggregating nearby (spatially and temporally) transition states together. The goal of transition state clustering is to find a mixture model for f that approximately recovers the true latent function S. Consequently, a transition state cluster is defined as a clustering of the set of transition states across all demonstrations; partitioning these transition states into m non-overlapping similar groups:

$$ \mathscr {C} = \{C_1, C_2,\ldots ,C_m\} $$

Every \(U_i\) can be represented as a sequence of integers indicating that transition states assignment to one of the transition state clusters \(U_i=[1,2,4,2]\).

Consistency: We assume, demonstrations are consistent, meaning there exists a non-empty sequence of transition states \(\mathscr {U}^*\) such that the partial order defined by the elements in the sequence (i.e., \(s_1\) happens before \(s_2\) and \(s_3\)) is satisfied by every \(U_i\). For example,

$$\begin{aligned} U_1 = [1,3,4]\text {, }U_2 = [1,1,2,4]\text {, }\mathscr {U}^*=[1,4] \end{aligned}$$

A counter example,

$$\begin{aligned} U_1 = [1,3,4]\text {, }U_2 = [2,5]\text {, }\mathscr {U}^*\text { no solution} \end{aligned}$$

Intuitively, this condition states that there have to be a consistent ordering of actions over all demonstrations up to some additional regimes (e.g., spurious actions).

Loops: Loops are common in surgical demonstrations. For example, a surgeon may attempt to insert a needle 2–3 times. When demonstrations have varying amounts of retrials it is challenging. In this work, we assume that these loops are modeled as repeated transitions between transition state clusters, which is justified in our experimental datasets, for example,

$$\begin{aligned} U_1 = [1,3,4]\text {, }U_2 = [1,3,1,3,1,3,4]\text {, }\mathscr {U}^*=[1,3,4] \end{aligned}$$

Our algorithm will compact these loops together into a single transition.

Minimal Solution: Given a consistent set of demonstrations, that have additional regimes and loops, the goal of the algorithm is to find a minimal solution, \(\mathscr {U}^*\) that is loop-free and respects the partial order of transitions in all demonstrations.

Given a set of demonstrations \(\mathscr {D}\), the Transition State Clustering problem is to find a set of transition state clusters \(\mathscr {C}\) such that they represent a minimal parametrization of the demonstrations.

Multi-modal TSC: This model can similarly be extended to states derived from sensing. Suppose at every time t, there is a feature vector z(t). Then the augmented state of both the robot spatial state and the features denoted is:

$$ \mathbf {x}(t) = \left( {\begin{array}{c}x(t)\\ z(t)\end{array}}\right) $$

In our experiments, we worked the da Vinci surgical robot with two 7-DOF arms, each with 2 finger grippers. Consider the following feature representation which we used in our experiments:

  1. 1.

    Gripper grasp. Indicator that is 1 if there is an object between the gripper, 0 otherwise.

  2. 2.

    Surface Penetration. In surgical tasks, we often have a tissue phantom. This feature describes whether the robot (or something the robot is holding like a needle) has penetrated the surface. We use an estimate of the truncated penetration depth to encode this feature. If there is no penetration, the value is 0, otherwise the value of penetration is the robot’s kinematic position in the direction orthogonal to the tissue phantom.

4 Transition State Clustering

In this section, we describe the hierarchical clustering process of TSC. This algorithm is a greedy approach to learning the parameters in the graphical model in Fig. 1. We decompose the hierarchical model into stages and fit parameters to the generative model at each stage. The full algorithm is described in Algorithm 1.

4.1 Background: Bayesian Statistics

One challenge with mixture models is hyper-parameter selection, such as the number of clusters. Recent results in Bayesian statistics can mitigate some of these problems. The basic recipe is to define a generative model, and then use Expectation Maximization to fit the parameters of the model to observed data. The generative model that we will use is called a mixture model, which defines a probability distribution that is a composite of multiple distributions.

One flexible class of mixture models are Gaussian Mixture Models (GMM), which are described generatively as follows. We first sample some c from a categorical distribution, one that takes on values from (1...K), with probabilities \(\phi \), where \(\phi \) is a K dimensional simplex:

$$ c \sim cat(K,\phi ) $$

Then, given the event \(\{ c = i \}\), we specify a multivariate Gaussian distribution:

$$ x_i \sim N(\mu _i, \Sigma _i) $$

The insight is that a stochastic process called the Dirichlet Process (DP) defines a distribution over discrete distributions, and thus instead we can draw samples of \(cat(K,\phi )\) to find the most likely choice of K via EM. The result is the following model:

$$\begin{aligned} (K,\phi ) \sim DP(H,\alpha ) \qquad c \sim cat(K,\phi ) \qquad X \sim N(\mu _i, \Sigma _i) \end{aligned}$$
(1)

After fitting the model, every observed sample of \(x \sim X\) will have a probability of being generated from a mixture component \(P(x \mid c = i)\). Every observation x will have a most likely generating component. It is worth noting that each cluster defines an ellipsoidal region in the feature space of x, because of the Gaussian noise model \(N(\mu _i, \Sigma _i)\).

We denote this entire clustering method in the remainder of this work as DP-GMM. We use the same model at multiple levels of the hierarchical clustering and we will describe the feature space at each level. We use a MATLAB software package to solve this problem using a variational EM algorithm [18].

4.2 Transition States Identification

The first step is to identify a set of transition states for each demonstration in \(\mathscr {D}\). To do this, we have to fit a switched dynamic system model to the trajectories. Suppose there was only one regime, then this would be a linear regression problem:

$$ \arg \min _A \Vert A X_t - X_{t+1}\Vert $$

where \(X_t\) and \(X_{t+1}\) are matrices where each column vector is corresponding x(t) and \(x(t+1)\). Moldovan et al. [24] showed that fitting a jointly Gaussian model to \(n(t) = \left( {\begin{array}{c}\mathbf {x}(t+1)\\ \mathbf {x}(t)\end{array}}\right) \) is equivalent to Bayesian Linear Regression.

Therefore, to fit a switched linear dynamical system model, we can fit a Mixture of Gaussians (GMM) model to n(t) via DP-GMM. Each cluster learned signifies a different regime, and co-linear states are in the same cluster. To find transition states, we move along a trajectory from \(t=1,\ldots ,t_f\), and find states at which n(t) is in a different cluster than \(n(t+1)\). These points mark a transition between clusters (i.e., transition regimes).

4.3 Transition State Pruning

We consider the problem of outlier transitions, ones that appear only in a few demonstrations. Each of these regimes will have constituent vectors where each n(t) belongs to a demonstration \(d_i\). Transition states that mark transitions to or from regimes whose constituent vectors come from fewer than a fraction \(\rho \) demonstrations are pruned. \(\rho \) should be set based on the expected rarity of outliers. In our experiments, we set the parameter \(\rho \) to 80% and show the results with and without this step.

4.4 Transition State Compaction

Once we have transition states for each demonstration, and have applied pruning, the next step is to remove transition states that correspond to looping actions, which are prevalent in surgical demonstrations. We model this behavior as consecutive linear regimes repeating, i.e., transition from i to j and then a repeated i to j. We apply this step after pruning to take advantage of the removal of outlier regimes during the looping process. These repeated transitions can be compacted together to make a single transition.

figure a

The key question is how to differentiate between repetitions that are part of the demonstration and ones that correspond to looping actions–the sequence might contain repetitions not due to looping. To differentiate this, as a heuristic, we threshold the L2 distance between consecutive segments with repeated transitions. If the L2 distance is low, we know that the consecutive segments are happening in a similar location as well. In our datasets, this is a good indication of looping behavior. If the L2 distance is larger, then repetition between dynamical regimes might be happening but the location is changing.

For each demonstration, we define a segment \(\mathbf {s}^{(j)}[t]\) of states between each transition states. The challenge is that \(\mathbf {s}^{(j)}[t]\) and \(\mathbf {s}^{(j+1)}[t]\) may have a different number of observations and may be at different time scales. To address this challenge, we apply Dynamic Time Warping (DTW). Since segments are locally similar up-to small time variations, DTW can find a most-likely time alignment of the two segments.

Let \(\mathbf {s}^{(j+1)}[t^*]\) be a time aligned (w.r.t to \(\mathbf {s}^{(j)}\)) version of \(\mathbf {s}^{(j+1)}\). Then, after alignment, we define the \(L_2\) metric between the two segments:

$$ d(j,j+1) = \frac{1}{T}\sum _{t=0}^T (\mathbf {s}^{(j)}[i] - \mathbf {s}^{(j+1)}[i^*])^2 $$

when \(d\le \delta \), we compact two consecutive segments. \(\delta \) is chosen empirically and a larger \(\delta \) leads to a sparser distribution of transition states, and smaller \(\delta \) leads to more transition states. For our needle passing and suturing experiments, we set \(\delta \) to correspond to the distance between two suture/needle insertion points–thus, differentiating between repetitions at the same point versus at others.

However, since we are removing points from a time-series this requires us to adjust the time scale. Thus, from every following observation, we shift the time stamp back by the length of the compacted segments.

4.5 State-Space Clustering

After compaction, there are numerous transition states at different locations in the state-space. If we model the states at transition states as drawn from a GMM model:

$$ {x}(t) \sim N(\mu _i, \Sigma _i) $$

Then, we can apply the DP-GMM again to cluster the state vectors at the transition states. Each cluster defines an ellipsoidal region of the state-space space.

4.6 Time Clustering

Without temporal localization, the transitions may be ambiguous. For example, in circle cutting, the robot may pass over a point twice in the same task. The challenge is that we cannot naively use time as another feature, since it is unclear what metric to use to compare distance between \(\left( {\begin{array}{c}\mathbf {x}(t)\\ t\end{array}}\right) \). However a second level of clustering by time within each state-space cluster can overcome this issue. Within a state cluster, if we model the times which change points occur as drawn from a GMM \(t \sim N(\mu _i, \sigma _i)\), and then we can apply DP-GMM to the set of times. We cluster time second because we observe that the surgical demonstrations are more consistent spatially than temporally. This groups together events that happen at similar times during the demonstrations. The result is clusters of states and times. Thus, a transition states \(m_k\) is defined as tuple of an ellipsoidal region of the state-space and a time interval.

Fig. 2
figure 2

a Observations from a dynamical system with two regimes, b Observations corrupted with Gaussian Noise, c Observations corrupted with a spurious inserted regime (red), and d Observations corrupted with an inserted loop(green)

5 Results

5.1 Experiment 1. Synthetic Example of 2-Segment Trajectory

In our first experiment, we segment noisy observations from a two regime linear dynamical system. Figure 2 illustrates examples from this system under the different types of corruption. Since there is a known a ground truth of two segments, we measure the precision (average fraction of observations in each segment that are from the same regime) and recall (average fraction of observations from each regime segmented together) in recovering these two segments. We can jointly consider precision and recall with the F1 Score which is the harmonic mean of the two. We compare three techniques against TSC: K-Means (only spatial), GMM\(+\)T (using time as a feature in a GMM), GMM\(+\)HMM (using an HMM to model the grammar). For the GMM techniques, we have to select the number of segments, and we experiment with \(k=1,2,3\) (i.e., a slightly sub-optimal parameter choice compared to \(k=2\)). In this example, for TSC, we set the two user-specified parameters to \(\delta =0\) (merge all repeated transitions), and \(\rho =80\%\) (prune all regimes representing less than \(80\%\) of the demonstrations).

First, we generate 100 noisy observations (additive zero mean Gaussian noise) from the system without loops or spurious states–effectively only measuring the DP-GMM versus the alternatives. Figure 3a shows the F1-score as a function of the noise in the observations. Initially, for an appropriate parameter choice \(k=2\) both of the GMM-based methods perform well and at low noise levels the DP-GMM used by our work mirrors this performance. However, if the parameter is set to be \(k=3\), we see that the performance significantly degrades. \(k=1\) corresponds to a single segment which has a F1 score of 0.4 on all figures. The DP-GMM mitigates this sensitivity to the choice of parameter by automatically setting the value. Furthermore, as the noise increases, the \(80\%\) pruning of DP-GMM mitigates the effect of outliers leading to improved accuracy.

Fig. 3
figure 3

Accuracy as a function of noise: TSC, K-Means, GMM\(+\)T (GMM with time), GMM\(+\)HMM (GMM with HMM). a The Dirichlet Process used in TSC reduces sensitivity to parameter choice and is comparable to GMM techniques using the optimal parameter choice, b HMM based methods need more training data as they have to learn transitions, c TSC clusters are robust to spurious regimes, and d TSC clusters are robust to loops–without having to know the regimes in advance

In Fig. 3b, we look at the accuracy of each technique as a function of the number of demonstrations. GMM\(+\)HMM has more parameters to learn and therefore requires more data. GMM\(+\)T converges the fastest, TSC requires slightly more data, and the GMM\(+\)HMM requires the most. In Fig. 3c, we corrupt the observations with spurious dynamical regimes. These are random transition matrices which replace one of the two dynamical regimes. We vary the rate at which we randomly corrupt the data, and measure the performance of the different segmentation techniques as a function of this rate. Due to the pruning, TSC gives the most accurate segmentation. The Dirichlet process groups the random transitions in different clusters and the small clusters are pruned out. On the other hand, the pure GMM techniques are less accurate since they are looking for exactly two regimes.

In Fig. 3d, introduce corruption due to loops and compare the different techniques. A loop is a step that returns to the start of the regime randomly, and we vary this random rate. For an accurately chosen parameter \(k=2\), for the GMM−HMM, it gives the most accurate segmentation. However, when this parameter is set poorly \(k=3\), the accuracy is significantly reduced. On the other hand, using time as a GMM feature (GMM\(+\)T) does not work since it does not know how to group loops into the same regime.

5.2 Surgical Experiments: Evaluation Tasks

We describe the three tasks used in our evaluation, and show manually segmented versions in Fig. 4. This will serve as ground truth when qualitatively evaluating our segmentation on real data.

Circle Cutting: In this task, we have a 5 cm diameter circle drawn on a piece of gauze. The first step is to cut a notch into the circle. The second step is to cut clockwise. Next, the robot transitions to the other side cutting counter clockwise. Finally, the robot finishes the cut at the meeting point of the two incisions. As the left arm’s only action is maintain the gauze in tension, we exclude it from the analysis. In Fig. 4a, we mark 6 manually identified transitions points for this task from [25]: (1) start, (2) notch, (3) finish 1st cut, (4) cross-over, (5) finish 2nd cut, and (6) connect the two cuts. For the circle cutting task, we collected 10 demonstrations by non-experts familiar with operating the da Vinci Research Kit (dVRK).

We apply our method to the JIGSAWS dataset [7] consisting of surgical activity for human motion modeling. The dataset was captured using the da Vinci Surgical System from eight surgeons with different levels of skill performing five repetitions each of Needle Passing and Suturing.

Needle Passing: We applied our framework to 28 demonstrations of the needle passing task. The robot passes a needle through a loop using its right arm, then its left arm to pull the needle through the loop. Then, the robot hands the needle off from the left arm to the right arm. This is repeated four times as illustrated with a manual segmentation in Fig. 4b.

Suturing: Next, we explored 39 examples of a 4 throw suturing task (Fig. 4c). Using the right arm, the first step is to penetrate one of the points on right side. The next step is to force the needle through the phantom to the other side. Using the left arm, the robot pulls the needle out of the phantom, and then hands it off to the right arm for the next point.

Fig. 4
figure 4

Hand annotations of the three tasks: a circle cutting, b needle passing, and c suturing. Right arm actions are listed in dark blue and left arm actions are listed in yellow

5.3 Experiment 2. Pruning and Compaction

In Fig. 5, we highlight the benefit of pruning and compaction using the Suturing task as exemplar. First, we show the transition states without applying the compaction step to remove looping transition states (Fig. 5a). We find that there are many more transition states at the “insert” step of the task. Compaction removes the segments that correspond to a loop of the insertions. Next, we show the all of the clusters found by DP-GMM. The centroids of these clusters are marked in Fig. 5b. Many of these clusters are small containing only a few transition states. This is why we created the heuristic to prune clusters that do not have transition states from at least 80% of the demonstrations. In all, 11 clusters are pruned by this rule.

Fig. 5
figure 5

We first show the transition states without compaction (in black and green), and then show the clusters without pruning (in red). Compaction sparsifies the transition states and pruning significantly reduces the number of clusters

Fig. 6
figure 6

a We show the transition states without visual features, b and with visual features. Marked in the red box is a set of transitions that cannot always be detected from kinematics alone

5.4 Experiment 3. Can Vision Help?

In the next experiment, we evaluate TSC in a featurized state space that incorporates states derived from vision (Described in Sect. 5.1). We illustrate the transition states in Fig. 6 with and without visual features on the circle cutting task. At each point where the model transitions, we mark the end-effector (xyz) location. In particular, we show a region (red box) to highlight the benefits of these features. During the cross-over phase of the task, the robot has to re-enter the notch point and adjust to cut the other half of the circle. When only using the end-effector position, the locations where this transition happens is unreliable as operators may approach the entry from slightly different angles. On the other hand, the use of a gripper contact binary feature clusters the transition states around the point at which the gripper is in position and ready to begin cutting again. In the subsequent experiments, we use the same two visual features.

5.5 Experiment 4. TSC Evaluation

Circle Cutting: Figure 7a shows the transition states obtained from our algorithm. And Fig. 7b shows the TSC clusters learned (numbered by time interval midpoint). The algorithm found 8 clusters, one of which was pruned out using our \(\rho =80\%\) threshold rule.

The remaining 7 clusters correspond well to the manually identified transition points. It is worth noting that there is one extra cluster (marked \(2'\)), that does not correspond to a transition in the manual segmentation. At \(2'\), the operator finishes a notch and begins to cut. While at a logical level notching and cutting are both penetration actions, they correspond to two different linear transition regimes due to the positioning of the end-effector. Thus, TSC separates them into different clusters even though a human annotator may not do so.

Fig. 7
figure 7

a The transition states for the circle cutting task are marked in black. b The TSC clusters, which are clusters of the transition states, are illustrated with their 75% confidence ellipsoid

Needle Passing: In Fig. 8a, we plot the transition states in (xyz) end-effector space for both arms. We find that these transition states correspond well to the logical segments of the task (Fig. 4b). These demonstrations are noisier than the circle cutting demonstrations and there are more outliers. The subsequent clustering finds 9 (2 pruned). Next, Fig. 8b–c illustrate the TSC clusters. We find that again TSC learns a small parametrization for the task structure with the clusters corresponding well to the manual segments. However, in this case, the noise does lead to a spurious cluster (4 marked in green). One possible explanation is that the two middle loops are in close proximity and demonstrations contain many adjustments to avoid colliding with the loop and the other arm while passing the needle through leading to numerous transition states in that location.

Fig. 8
figure 8

a The transition states for the task are marked in orange (left arm) and blue (right arm). bc The TSC clusters, which are clusters of the transition states, are illustrated with their 75% confidence ellipsoid for both arms

Suturing: In Fig. 9, we show the transition states and clusters for the suturing task. As before, we mark the left arm in orange and the right arm in blue. This task was far more challenging than the previous tasks as the demonstrations were inconsistent. These inconsistencies were in the way the suture is pulled after insertion (some pull to the left, some to the right, etc.), leading to transition states all over the state space. Furthermore, there were numerous demonstrations with looping behaviors for the left arm. In fact, the DP-GMM method gives us 23 clusters, 11 of which represent less than 80% of the demonstrations and thus are pruned (we illustrate the effect of the pruning in the next section). In the early stages of the task, the clusters clearly correspond to the manually segmented transitions. As the task progresses, we see that some of the later clusters do not.

Fig. 9
figure 9

a The transition states for the task are marked in orange (left arm) and blue (right arm). bc The clusters, which are clusters of the transition states, are illustrated with their 75% confidence ellipsoid for both arms

5.6 Experiment 5. Comparison to “Surgemes”

Surgical demonstrations have an established set of primitives called surgemes, and we evaluate if segments discovered by our approach correspond to surgemes. In Table 1, we compare the number of TSC segments for needle passing and suturing to the number of annotated surgeme segments. A key difference between our segmentation and number of annotated surgemes is our compaction and pruning steps. To account for this, we first select a set of surgemes that are expressed in most demonstrations (i.e., simulating pruning), and we also apply a compaction step to the surgeme segments. In case of consecutive appearances of these surgemes, we only keep the 1 instance of each for compaction. We explore two metrics: TSC-Surgeme the fraction of TSC clusters with only one surgeme switch (averaged over all demonstrations), and Surgeme-TSC the fraction of surgeme switches that fall inside exactly one TSC clusters.

Table 1 83 and \(73\%\) of transition clusters for needle passing and suturing respectively contained exactly one surgeme transition

6 Conclusion and Future Work

We presented Transition State Clustering (TSC), which leverages hybrid dynamical system theory and Bayesian statistics to robustly learn segmentation criteria. To learn these clusters, TSC uses a hierarchical Dirichlet Process Gaussian Mixture Model (DP-GMM) with a series of merging and pruning steps. Our results on a synthetic example suggest that the hierarchical clusters are more robust to looping and noise, which are prevalent in surgical data. We further applied our algorithm to three surgical datasets and found that the transition state clusters correspond well to hand annotations and transitions w.r.t motions from a pre-defined surgical motion vocabulary called surgemes.

There are a number of important open-questions for future work. First, we believe that the growing maturity of Convolutional Neural Networks can facilitate transition state clustering directly from raw data (e.g., pixels), as opposed to the features studied in this work, and is a promising avenue for future work. Next, we are also particularly interested in closing-the-loop and using segmentation to facilitate optimal control or reinforcement learning. Finally, we are also interested in relaxing the consistency and normality assumptions in our parameter inference algorithm.