
1 Introduction

In recent years, the analysis of surgical motion has received growing interest, following the development of devices enabling automated capture of surgeon motions, such as tracking, robotic, and training systems. Surgical training programs now often include surgical simulators equipped with sensors for automatic recording of surgical motions [1, 2]. The ability to collect surgical motion data brings unprecedented opportunities for automated, objective analysis and assessment of surgical trainees' progression. The main goal of this effort is to support surgeons in acquiring technical skills, as these have been shown to correlate with a reduction in patient complications [3]. Hence, automated evaluation of surgical skill level is an important step in improving surgical patient care.

This article tackles the issue of identifying discriminative and interpretable patterns of surgical practice from recordings of surgical motions. We define a pattern as a series of actions or events in the kinematic data that together are distinctive of a specific gesture or skill level. We show that, by using these patterns, we can reach beyond the simple classification of observed surgeons into categories (e.g., expert, novice) by providing quantitative, evidence-supported feedback to the trainee as to where he or she can improve. The proposed approach, based on the SAX-VSM algorithm [4], considers surgical motion as continuous multi-dimensional time series and starts by discretizing them into sequences of letters (i.e., strings) using Symbolic Aggregate approXimation (SAX) [5]. In turn, SAX sequences are decomposed into subsequences of a few consecutive letters via a sliding window. The relative frequencies of these subsequences, i.e., the number of times they appear in a given sequence or in a set of sequences, are then used to identify discriminative patterns that characterize specific surgical motions. To discover the patterns, we rely on the Vector Space Model (VSM) [6], which was originally proposed as an algebraic model for representing collections of text documents. The identified discriminative patterns are then used to perform classification by identifying them in to-be-classified recordings. Furthermore, by highlighting discriminative patterns in the visualization of the original motion data, we are able to provide an intuitive visual explanation of why a specific skill assessment is given. We evaluated our method on the kinematic data from the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [7], which is currently the largest publicly accessible database for surgical gesture analysis. Our experiments show that the proposed method accurately classifies gestures and skill levels. The main contributions of this paper are:

  • A framework for identifying discriminative and interpretable patterns in surgical activity motion based on SAX [5] and VSM [4].

  • An experimental evaluation highlighting the relevance of the proposed method for gesture classification and skill assessment.

  • A visualization technique enabling self-assessment of trainee skills.

2 Background

Surgical motion analysis is mainly based on kinematic data recorded by surgical robots [8, 9] and on video data [10,11,12]. Kinematic data usually include multiple attributes such as the positions of the robot's tools, rotations, and velocities. From such data, a significant amount of work has been devoted to the segmentation of surgical tasks into more detailed gestures [13,14,15]. Segmenting surgical motion into gestures makes it possible to obtain a finer description of a surgical task, leading to more detailed feedback on skill assessment [16, 17]. Previous work on gesture segmentation using kinematic and video data relies on Hidden Markov Models [18, 19], Conditional Random Fields [20], and Linear Dynamical Systems [11]. The main drawback of these approaches is the difficulty for the trainee to understand the output and to use it as feedback to improve performance. In contrast, our approach seeks not only to identify that a surgical motion has been performed by a novice surgeon, but also to explain why it has been classified as such. This step is critical in justifying why the trainee is still considered a novice and in helping him or her focus on the specific steps that require improvement.

3 Method

3.1 Symbolic Aggregate ApproXimation (SAX)

We propose to use Symbolic Aggregate approXimation (SAX) [5] to discretize the input time series [21]. For a time series T of length n, SAX obtains a lower-dimensional representation by first performing a z-normalization and then dividing the time series into s equal-sized segments. Next, for each segment, SAX computes the mean value and maps it to a symbol according to a pre-defined set of breakpoints dividing the data space into \(\alpha \) equiprobable regions, where \(\alpha \) is the user-specified alphabet size. While dimensionality reduction is a desirable feature for exploring global patterns, the high compression ratio (n/s) significantly affects performance in cases where localized phenomena are of interest. Thus, for local pattern discovery, SAX is typically applied to a set of subsequences that represent local features – a technique called subsequence discretization [22], which is implemented via a sliding window. Note that other time-series discretization approaches could have been used at this step [23].
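To make the procedure concrete, below is a minimal Python sketch of this discretization, assuming the standard Gaussian breakpoint table commonly used with SAX; the function name `sax` and its signature are ours, not part of any reference implementation.

```python
import numpy as np
from scipy.stats import norm

def sax(series, n_segments, alphabet_size):
    """Minimal SAX sketch: z-normalize, PAA, then map segment means to letters."""
    series = np.asarray(series, dtype=float)
    # z-normalization (small epsilon guards against constant subsequences)
    series = (series - series.mean()) / (series.std() + 1e-8)
    # Piecewise Aggregate Approximation: mean of each of the s equal-sized segments
    means = np.array([seg.mean() for seg in np.array_split(series, n_segments)])
    # Breakpoints splitting N(0, 1) into alphabet_size equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    # Each mean falls into one region, encoded as a letter 'a', 'b', ...
    return "".join(chr(ord("a") + i) for i in np.searchsorted(breakpoints, means))

# Example: an 8-point series reduced to a 4-letter word over a 4-letter alphabet
print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4, alphabet_size=4))  # -> "abcd"
```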

3.2 Bag of Words Representation of Kinematic Data

Following the approach proposed in [4], a sliding window technique is used to convert a time series T of length n into a set of m SAX words, where \(m = (n-l_s) + 1\) and \(l_s\) is the sliding window length. A sliding window of length \(l_s\) is slid across the time series T, and the overlapping extracted subsequences are converted into SAX words and put in a collection. This collection is the bag of words representation of the original time series T.

In the case of kinematic data, this process is performed independently for each dimension of the data (e.g., x coordinate, y coordinate, etc.). All features are normalized on a per-trial, per-feature basis. Each word extracted from each dimension of the data is postfixed with the name of the dimension (e.g., x, y, etc.). We assume that, depending on the gesture or the skill level to classify, different kinematic features can be relevant. Note that this methodology can be used regardless of the available kinematic data (e.g., number of features). Figure 1 illustrates the conversion of kinematic data for one trial into a bag of words using SAX.
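The per-dimension bag-of-words construction can be sketched as follows, reusing the hypothetical `sax` function from the previous sketch; the dimension-postfix convention (`word_x`, `word_y`, ...) is only our illustration of the idea described above.

```python
from collections import Counter

def bag_of_words(trial, window_len, alphabet_size):
    """Sketch: build a bag of dimension-tagged SAX words for one trial.

    `trial` maps a feature name (e.g. 'x', 'y') to its time series. Each
    sliding-window subsequence is discretized with `sax` (previous sketch),
    here with one letter per point, and postfixed with the feature name.
    """
    bag = Counter()
    for feature, series in trial.items():
        for start in range(len(series) - window_len + 1):
            window = series[start:start + window_len]
            word = sax(window, n_segments=window_len, alphabet_size=alphabet_size)
            bag[word + "_" + feature] += 1
    return bag

# Toy trial with two kinematic features
trial = {"x": [0.1, 0.5, 0.9, 0.4, 0.2, 0.8], "y": [1.2, 1.1, 0.7, 0.3, 0.9, 1.0]}
print(bag_of_words(trial, window_len=4, alphabet_size=3))
```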

Fig. 1. Conversion of kinematic data for one trial into a bag of words using SAX [5] and a sliding window of size 4 (in red). (Color figure online)

3.3 Vector Space Model (VSM)

We rely on the original definition of the vector space model as it is known in Information Retrieval (IR) [4, 6]. The tf*idf weight for a term t is defined as the product of two factors: term frequency (tf) and inverse document frequency (idf). The first factor corresponds to the logarithmically scaled term frequency [24]:

$$\begin{aligned} \text {tf}_{t,d}= \left\{ \begin{array}{rl} \text {log}(1+\text {f}_{t,d}), & \text {if}~\text {f}_{t,d} > 0\\ 0, & \text {otherwise} \end{array} \right. \end{aligned}$$
(1)

where t is the term, d is a bag of words (a document in IR terms), and \(\text {f}_{t,d}\) is the frequency of t in d. The inverse document frequency [24] is defined as

$$\begin{aligned} \text {idf}_{t,D}=\text {log}\frac{|D|}{|\{d \in D: t \in d\}|} = \text {log}\frac{N}{\text {df}_t} \end{aligned}$$
(2)

where N is the cardinality of a corpus D (the total number of classes) and the denominator \(\text {df}_{t}\) is the number of bags where the term t appears. Then, tf*idf weight value for a term t in the bag d of a corpus D is defined as

$$\begin{aligned} \text {tf*idf}(t,d,D) = \text {tf}_{t,d} \times \text {idf}_{t,D} = \text {log}(1+ \text {f}_{t,d}) \times \text {log}\frac{N}{\text {df}_t} \end{aligned}$$
(3)

for all cases where \(\text {f}_{t,d} > 0\) and \(\text {df}_t > 0\), or zero otherwise.

Once all frequencies are computed, the term frequency matrix becomes the term weight matrix, whose columns are used as class term weight vectors to perform classification using cosine similarity. For two vectors a and b, the cosine similarity is based on their inner product and is defined as

$$\begin{aligned} \text {similarity}(\mathbf {a},\mathbf {b}) = \cos (\theta ) = \frac{\mathbf {a}\cdot \mathbf {b}}{\Vert \mathbf {a}\Vert \cdot \Vert \mathbf {b}\Vert } \end{aligned}$$
(4)
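The following is a compact Python sketch of Eqs. (1)–(3) and of the cosine similarity of Eq. (4), operating on sparse dictionaries rather than an explicit term weight matrix; names such as `tfidf_vectors` are hypothetical.

```python
import math

def tfidf_vectors(class_bags):
    """Sketch of Eqs. (1)-(3): one tf*idf weight vector per class bag.

    `class_bags` maps a class label to its bag of words (term -> frequency).
    """
    n_classes = len(class_bags)
    vocab = set().union(*class_bags.values())
    # document frequency: number of class bags in which each term appears
    df = {t: sum(1 for bag in class_bags.values() if t in bag) for t in vocab}
    return {
        label: {t: math.log(1 + f) * math.log(n_classes / df[t]) for t, f in bag.items()}
        for label, bag in class_bags.items()
    }

def cosine_similarity(a, b):
    """Eq. (4) on sparse dictionary vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Note that a term appearing in every class bag receives an idf of log(1) = 0, which is how patterns common to all classes are discarded.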

3.4 Training and Classifying Kinematic Data

The training step starts by transforming the kinematic data into the SAX representation using two parameters: the size of the sliding window \(l_s\) and the size of the alphabet \(\alpha \). Then, the algorithm builds a corpus of N bags corresponding to the subsequences extracted from the N classes of kinematic data, i.e., same skill level or same gesture depending on the application. The tf*idf weighting is then applied to create N real-valued weight vectors of equal length, representing the different classes of kinematic data.

In order to classify unlabeled kinematic data, the method transforms it into a term frequency vector using exactly the same sliding window and SAX parameters used for training. It then computes the cosine similarity (Eq. 4) between this term frequency vector and the N tf*idf weight vectors representing the training classes. The unlabeled kinematic data is assigned to the class whose vector yields the maximal cosine similarity value.
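Putting the previous sketches together, training and classification could look like the following; `train` and `classify` are hypothetical helper names and assume labelled trials are given as (label, trial) pairs.

```python
from collections import Counter

def train(labelled_trials, window_len, alphabet_size):
    """Sketch: merge the bags of all trials of each class, then apply tf*idf weighting."""
    class_bags = {}
    for label, trial in labelled_trials:
        class_bags.setdefault(label, Counter()).update(
            bag_of_words(trial, window_len, alphabet_size))
    return tfidf_vectors(class_bags)

def classify(trial, class_weights, window_len, alphabet_size):
    """Assign the class whose tf*idf vector is most cosine-similar to the trial's term frequencies."""
    freq = bag_of_words(trial, window_len, alphabet_size)
    return max(class_weights, key=lambda label: cosine_similarity(freq, class_weights[label]))
```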

Fig. 2. Snapshots of the three surgical tasks in the JIGSAWS dataset (from left to right): suturing, knot-tying, needle-passing [7].

4 Experimental Evaluation

The JIGSAWS dataset [7] includes 8 subjects with 3 different skill levels (novice, intermediate, and expert) performing 3–5 trials of three tasks (suturing, knot tying, and needle passing). Figure 2 illustrates the three tasks. Each trial lasts about 2 min and is represented by the kinematic data of both master and slave manipulators of the da Vinci robotic surgical system, recorded at a constant rate of 30 Hz. The kinematic data consist of 76 motion variables, including positions and velocities of both master and slave manipulators. All trials in the JIGSAWS dataset were manually segmented into 15 surgical gestures. Videos of the trials are also available and are synchronized with the kinematic data. A detailed description of the dataset is available in [25].

Our training step first transforms the kinematic data time series into the SAX representation, configured by two parameters: the sliding window length (\(l_s\)) and the SAX alphabet size (\(\alpha \)). The number of segments per window was kept equal to the length of the window, which means that every point of the time series was transformed into a letter. This choice was made to allow us to map the patterns back onto the original time series. Parameters \(l_s\) and \(\alpha \) were optimized using cross-validation on the training data. As they can differ for each specific classification problem, their values are provided along with the experimental results.
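As an illustration of this parameter selection step, a simplified grid search with an inner k-fold cross-validation on the training data only could look like the sketch below; the actual folding scheme used in the paper (LOSO/LOUO, next section) differs, and all function names are hypothetical.

```python
import random

def cv_accuracy(labelled_trials, window_len, alphabet_size, n_folds=5, seed=0):
    """Sketch: k-fold cross-validated accuracy for one (l_s, alpha) setting."""
    trials = list(labelled_trials)
    random.Random(seed).shuffle(trials)
    folds = [trials[i::n_folds] for i in range(n_folds)]
    correct, total = 0, 0
    for i, test_fold in enumerate(folds):
        train_fold = [t for j, fold in enumerate(folds) if j != i for t in fold]
        weights = train(train_fold, window_len, alphabet_size)
        for label, trial in test_fold:
            correct += classify(trial, weights, window_len, alphabet_size) == label
            total += 1
    return correct / total

def select_parameters(labelled_trials, window_lengths, alphabet_sizes):
    """Grid search over (l_s, alpha), keeping the best cross-validated setting."""
    return max(
        ((ls, a) for ls in window_lengths for a in alphabet_sizes),
        key=lambda p: cv_accuracy(labelled_trials, *p),
    )
```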

Table 1. Gesture classification performance, assuming known boundaries and using kinematic data only.

4.1 Gesture Classification

We considered the gesture boundaries to be known and used the kinematic data alone. We present results for two cross-validation configurations provided with the JIGSAWS data [7]. In the first configuration – leave-one-supertrial-out (LOSO) – for each iteration of the cross-validation (five in total), one trial of each subject was left out for testing and the remaining trials were used for training. In the second configuration – leave-one-user-out (LOUO) – for each iteration of the cross-validation (eight in total), all the trials belonging to a particular subject were left out for testing. These are the standard benchmark configurations provided in [7]. We report micro (average of total correct predictions across all classes) and macro (average of true positive rates for each class) performance results as defined in [25].
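For clarity, the two benchmark splits can be sketched as follows, assuming trials are indexed by a (subject, repetition) pair; this is our reading of the configurations described above, not code from [7].

```python
def loso_folds(trials, n_reps=5):
    """Leave-one-supertrial-out: fold i holds out the i-th repetition of every subject.

    `trials` maps (subject, repetition) to a (label, trial) pair.
    """
    return [
        ([v for (s, r), v in trials.items() if r != i],   # training set
         [v for (s, r), v in trials.items() if r == i])   # test set
        for i in range(1, n_reps + 1)
    ]

def louo_folds(trials):
    """Leave-one-user-out: each fold holds out all the trials of one subject."""
    subjects = sorted({s for s, _ in trials})
    return [
        ([v for (s, _), v in trials.items() if s != held_out],
         [v for (s, _), v in trials.items() if s == held_out])
        for held_out in subjects
    ]
```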

Table 1 presents the results for gesture classification assuming known boundaries and using kinematic data only. For comparison purposes, we also report state-of-the-art results for Linear Dynamical Systems (LDS) and Hidden Markov Models (HMM) from [25]. The proposed method outperforms both the LDS and HMM methods in terms of micro and macro performance for the three tasks and the two cross-validation configurations. These results show that our method accurately identifies patterns that are specific to a gesture motion. One interesting feature of the proposed method is its ability to use different kinematic data depending on the gesture. As our method computes the frequencies for each component of the kinematic data for each gesture independently, the most discriminative attributes of a given gesture naturally stand out. Furthermore, the tf*idf weighting discards the motion patterns that are common to every gesture (i.e., irrelevant for classification as not distinctive of any class).

The LOUO configuration is known to be particularly challenging, because we attempt to classify the gestures of a subject without having seen any of his or her other attempts. The good performance of our approach can be explained by its ability to identify highly discriminative patterns that are the most distinctive of each gesture. These results also indicate that our method generalizes well, as shown by its ability to accurately classify gestures from unobserved trainees.

Table 2. Skill classification performance per trial using kinematic data only.

4.2 Skills Classification

For skill classification, we performed experiments to identify the skill level (novice, intermediate, or expert) at the trial level. In this experiment, we used the leave-one-supertrial-out (LOSO) cross-validation configuration. Table 2 presents the results for the three tasks and reports micro and macro performances. The results are better for the Suturing and Needle Passing tasks than for the Knot Tying task. The weaker performance on the Knot Tying task can be explained by the small difference between the Expert and Intermediate subjects for this task (the mean GRS is 17.7 and 17.1 for expert and intermediate, respectively). We also report the state-of-the-art results from [19] for the Suturing task. The SHMM approach gives better results for the per-trial classification configuration as it uses global temporal information, whereas our method focuses on local patterns regardless of their location within the larger time series. Furthermore, the SHMM approach [19] uses gesture boundaries to learn the temporal model, while our method does not use this information.

Fig. 3. Example of interpretable feedback using a heat-map visualization of subsequence importance for class identification. The value at each point corresponds to the combination of the tf*idf weights of all patterns that cover the point.

4.3 Interpretable Patterns Visualization

Our approach outputs a set of discriminative patterns weighted by class specificity for each of the input classes. These lists of ranked patterns can be studied to better understand what makes each class distinctive. As the use of tf*idf (Eq. (3)) discards patterns that are common to all classes, only patterns with discriminative power remain.

The list of weighted discriminative patterns can be used to visualize, on a given trial, where the areas specific to the trial's current skill level are located. We propose to use a heat map-like visualization technique that provides immediate insight into the layout of the "important" class-characteristic patterns (as described in [4]). Figure 3 shows, for the Suturing task, the individual 5th trials of subjects B (Novice) and E (Expert), using the (x, y, z) coordinates of the right hand. In this figure, we used the tf*idf weight vectors of the 5th fold, for the Novice class on subject B and for the Expert class on subject E. The red areas correspond to specific motions that are correlated with a skill level. For subject B (Fig. 3a), these areas correspond to motions that were only observed among the novices. By contrast, green areas correspond to motions that are common to all subjects regardless of their skill. This visualization provides rich information about what makes a specific skill level distinctive and can also be used to provide individual, personalized feedback. As the videos of the trials are also available, this result can be displayed side by side with the videos in order to show the trainee the movements that are specific. Note that a more detailed analysis could be performed by observing which kinematic data features are specific in these areas or by performing the analysis on a per-gesture basis. Visualizations (like Fig. 3) for all the subject trials of the Suturing task are available on the companion webpage (Footnote 1).
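The heat-map values themselves can be computed as sketched below: for one feature of one trial, each point accumulates the tf*idf weights of every class-specific pattern whose window covers it (reusing the hypothetical `sax` helper and the per-class weight dictionaries from the earlier sketches).

```python
import numpy as np

def heatmap_values(series, feature, class_weights, window_len, alphabet_size):
    """Sketch: per-point importance as the sum of the tf*idf weights of all
    class-characteristic patterns whose sliding window covers the point."""
    values = np.zeros(len(series))
    for start in range(len(series) - window_len + 1):
        window = series[start:start + window_len]
        word = sax(window, n_segments=window_len, alphabet_size=alphabet_size)
        # weight of this pattern for the class of interest (0 if not discriminative)
        values[start:start + window_len] += class_weights.get(word + "_" + feature, 0.0)
    return values  # e.g. colour the original series from green (low) to red (high)
```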

Note that, as the tf*idf weight vectors are computed prior to the classification step, it is possible to display this heat-map visualization in real time during the trial. We provide a video on the companion webpage that shows the real-time computation of this visualization while a trainee performs a suturing task. We believe that this tool is an interesting addition to existing learning tools for surgery, as it provides feedback on which parts of an exercise have been used to classify the attempt.

5 Conclusion

In this paper, we presented a new method for the discovery of discriminative and interpretable patterns in surgical activity motion. Our method uses SAX to discretize the kinematic data into sequences of letters. A sliding window is then used to build bags of words. Finally, the tf*idf framework is applied to identify class-characteristic motion patterns. Experiments performed on the JIGSAWS dataset have shown that our method successfully classifies gestures and skill levels. A strong advantage of the proposed technique is its ability to provide precise quantitative feedback alongside the classification results. Of course, the evaluation of our visualization approach still needs to be performed within a training curriculum.