
1 Introduction

Action segmentation is the task of predicting the action label for each frame of an input video. It is usually studied in the context of activities performed by a single person, where temporal smoothness of the actions is assumed. Fully supervised approaches for action segmentation [1, 21, 26, 34] already achieve good performance on this task. Most of them make frame-wise predictions [1, 21, 26] while trying to model the temporal relationship between the action labels, and they usually suffer from over-segmentation. Recent works [13, 34] try to overcome the over-segmentation problem by finding the action boundaries and temporally smoothing the predictions inside each action segment, but such post-processing still cannot guarantee temporal smoothness.

Action segmentation inference is the problem of making smooth segment-wise predictions from frame-wise probabilities, given a known grammar of the actions and their average lengths [30]. The typical inference in action segmentation involves solving an expensive Viterbi-like dynamic programming problem that finds the best action sequence and its corresponding segment lengths. In the literature, weakly supervised action segmentation approaches [18, 25, 29, 30, 32] usually rely on such an inference stage at test time. Despite being very useful for action segmentation, inference remains the main computational bottleneck in the action segmentation pipeline [32].

In this paper, we propose FIFA, a fast anytime approximate inference procedure that achieves performance comparable to dynamic-programming-based Viterbi decoding at a fraction of the computational time. Instead of relying on dynamic programming, we formulate the energy function as an approximate differentiable function of the segment length parameters and use gradient-descent-based methods to search for a configuration that minimizes the approximate energy function. Given a transcript of actions and a corresponding initial length configuration, we define the energy function as a sum over segment-level energies. The segment-level energy consists of two terms: a length energy term that penalizes deviations from a global length model and an observation energy term that measures the compatibility between the current configuration and the predicted frame-wise probabilities. A naive approach to model the observation energy would be to sum up the negative log probabilities of the action labels defined by the length configuration. However, such an approach is not differentiable with respect to the segment lengths, and in order to optimize the energy using gradient-descent-based methods, the observation energy has to be differentiable with respect to them. To this end, we construct a plateau-shaped mask for each segment which temporally locates the segment within the video. This mask is parameterized by the segment length, the position in the video, and a sharpness parameter. The observation energy is then defined as the product of the segment masks and the predicted frame-wise negative log probabilities, followed by a sum-pooling operation. Finally, a gradient-descent-based method is used to find a configuration of the segment lengths that minimizes the total energy.

FIFA is a general inference approach and can be applied at test time on top of different action segmentation approaches for fast inference. We evaluate our approach on top of state-of-the-art methods for weakly supervised temporal action segmentation, weakly supervised action alignment, and fully supervised action segmentation. Results on the Breakfast [16] and Hollywood extended [4] datasets show that FIFA achieves state-of-the-art results on most metrics. Compared to exact inference using Viterbi decoding, FIFA is at least 5 times faster. Furthermore, FIFA is an anytime algorithm that can be stopped after each step of the gradient-based optimization; it therefore provides a better speed vs. accuracy trade-off than exact inference.

2 Related Work

In this section, we highlight recently proposed works addressing fully and weakly supervised action segmentation.

Fully Supervised Action Segmentation. In fully supervised action segmentation, frame-level labels are used for training. Initial attempts applied action classifiers in a sliding-window fashion over the video frames [15, 31]. However, these approaches did not capture the dependencies between action segments. With the objective of capturing context over long video sequences, context-free grammars [28, 33] or hidden Markov models (HMMs) [17, 19, 22] are typically combined with frame-wise classifiers. Recently, temporal convolutional networks have shown good performance for the temporal action segmentation task using encoder-decoder architectures [21, 24] or multi-stage architectures [1, 26]. Many approaches further improve the multi-stage architectures by applying post-processing based on boundary-aware pooling [13, 34] or graph-based reasoning [12]. Without any inference, most fully supervised approaches therefore suffer from over-segmentation at test time.

Weakly Supervised Action Segmentation. To reduce the annotation cost, many approaches that rely on a weaker form of supervision have been proposed. Earlier approaches apply discriminative clustering to align video frames to movie scripts [7]. Bojanowski et al. [4] proposed to use transcripts, i.e., ordered lists of actions, as supervision. Indeed, many approaches rely on this form of supervision to train a segmentation model using connectionist temporal classification [11], dynamic time warping [5], or energy-based learning [25]. In [6], an iterative training procedure is used to refine the transcript, and a soft labeling mechanism is further applied at the boundaries between action segments. Kuehne et al. [18] applied a speech recognition system based on an HMM and a Gaussian mixture model (GMM) to align video frames to transcripts. The approach generates pseudo ground truth labels for the training videos and iteratively refines them. A similar idea has recently been used in [19, 29]. Richard et al. [30] combined a frame-wise loss function with the Viterbi algorithm to generate the target labels. At inference time, these approaches iterate over the training transcripts and select the one that best matches the test video. By contrast, Souri et al. [32] predict the transcript alongside the frame-wise scores at inference time. State-of-the-art weakly supervised action segmentation approaches require time-consuming dynamic programming based inference at test time.

Energy-Based Inference. In energy-based inference methods, gradient descent is used at inference time as described in [23]. The goal is to minimize an energy function that measures the compatibility between the input variables and the predicted variables. This idea has been exploited for many structured prediction tasks such as image generation [8, 14] and machine translation [10], and underlies structured prediction energy networks [3]. Belanger and McCallum [2] relaxed the discrete output space for multi-label classification tasks to a continuous one and used gradient descent to approximate the solution. Gradient-based methods have also been used for other applications such as generating adversarial examples [9] and learning text embeddings [20].

3 Background

The following sections introduce all the concepts and notations required to understand the proposed FIFA methodology.

3.1 Action Segmentation

In action segmentation, we want to temporally localize all the action segments occurring in a video. In this paper, we consider the case where the actions come from a predefined set of M classes (a background class is used to cover uninteresting parts of a video). The input video of length T is usually represented as a sequence of d-dimensional feature vectors \(x_{1:T} = (x_1, \dots , x_T)\). These features are extracted offline and are assumed to be the input to the action segmentation model. The output of action segmentation can be represented in two ways:

  • Frame-wise representation \(y_{1:T} = (y_1, \dots , y_T)\) where \(y_t\) represents the action label at time t.

  • Segment-wise representation \(s_{1:N} = (s_1, \dots , s_N)\) where segment \(s_n\) is represented by both the action label of the segment \(c_n\) and its corresponding length \(\ell _n\), i.e., \(s_n = (c_n, \ell _n)\). The ordered list of actions \(c_{1:N}\) is usually referred to as the transcript.

These two representations are equivalent, i.e., it is possible to compute one from the other. In order to convert the segment-wise to the frame-wise representation, we introduce a mapping \(\alpha (t; c_{1:N}, \ell _{1:N})\) which outputs the action label at frame t given the segment-wise labeling.
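As an illustration, the mapping can be implemented in a few lines; the following is a minimal Python sketch with names of our choosing, not code from the paper:

```python
def alpha(t, transcript, lengths):
    """Return the action label of frame t, given the segment labels
    c_{1:N} (transcript) and the segment lengths l_{1:N} (lengths)."""
    boundary = 0
    for label, length in zip(transcript, lengths):
        boundary += length
        if t < boundary:
            return label
    return transcript[-1]  # clamp frames beyond the last boundary

# Example: segments ('A', 3) and ('B', 2) yield frame labels A A A B B
frame_labels = [alpha(t, ['A', 'B'], [3, 2]) for t in range(5)]
```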

The target labels used to train a segmentation model depend on the level of supervision. In fully supervised action segmentation [1, 26, 34], the target label for each frame is provided. In weakly supervised approaches [25, 30, 32], however, only the ordered list of action labels is provided during training while the segment lengths remain unknown.

Recent fully supervised approaches for action segmentation like MS-TCN [1] and its variants directly predict the frame-wise representation \(y_{1:T}\) by choosing the action label with the highest probability for each frame independently. This sometimes results in over-segmented predictions.

Conversely, recent weakly supervised action segmentation approaches like NNV [30] and follow-up work include an inference stage during testing where they explicitly predict the segment-wise representation. This inference stage involves a dynamic programming algorithm for solving an optimization problem which is a computational bottleneck for these approaches.

3.2 Inference in Action Segmentation

During testing, the inference stage involves solving an optimization problem to find the most likely segmentation for the input video, i.e.,

$$\begin{aligned} c_{1:N}, \ell _{1:N} = \underset{\hat{c}_{1:N}, \hat{\ell }_{1:N}}{\mathrm {argmax}} \Big \{ p(\hat{c}_{1:N}, \hat{\ell }_{1:N} | x_{1:T}) \Big \}. \end{aligned}$$
(1)

Given the transcript \(c_{1:N}\), the inference stage boils down to finding the segment lengths \(\ell _{1:N}\) by aligning the transcript to the input video, i.e.,

$$\begin{aligned} \ell _{1:N} = \underset{\hat{\ell }_{1:N}}{\mathrm {argmax}} \Big \{ p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) \Big \}. \end{aligned}$$
(2)

In approaches like NNV [30] and CDFL [25], the transcript is found by iterating over the transcripts seen during training and selecting the one that achieves the most likely alignment according to (2). In MuCon [32], the transcript is predicted by a sequence-to-sequence network.

The probability defined in (2) is decomposed by making an independence assumption between frames

$$\begin{aligned} \begin{aligned} p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) = \prod _{t=1}^{T} p\big ( \alpha (t; c_{1:N}, \hat{\ell }_{1:N}) | x_t \big ) \cdot \prod _{n=1}^{N} p\big ( \hat{\ell }_n | c_n \big ) \end{aligned} \end{aligned}$$
(3)

where \(p\big ( \alpha (t)| x_t \big )\) is referred to as the observation model and \(p\big ( \ell _n | c_n \big )\) as the length model. Here, \(\alpha (t)\) is the mapping from time t to the action label given the segment-wise labeling. The observation model estimates the frame-wise action probabilities and is implemented using a neural network. The length model constrains the inference defined in (2) with the assumption that the lengths of segments of the same action follow a particular probability distribution. The segment length is usually modelled by a Poisson distribution with a class-dependent mean parameter \(\lambda _{c_n}\), i.e.,

$$\begin{aligned} p\big ( \ell _n | c_n \big ) = \frac{\lambda _{c_n}^{\ell _n} \mathrm {exp}(- \lambda _{c_n}) }{\ell _n!}. \end{aligned}$$
(4)

This optimization problem is solved using an expensive dynamic-programming-based Viterbi decoding [30]. For details on how to solve it using Viterbi decoding, please refer to the supplementary material.
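For reference, the negative log-probability of this Poisson length model, which reappears inside the energy function of Sect. 4, can be evaluated as in the following plain-Python sketch (ours, not the authors' code):

```python
from math import lgamma, log

def poisson_neg_log_prob(length, lam):
    """-log p(l | c) for the Poisson length model of Eq. (4):
    -l * log(lam) + lam + log(l!), where log(l!) = lgamma(l + 1)."""
    return -length * log(lam) + lam + lgamma(length + 1)
```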

4 FIFA: Fast Inference Approximation

Our goal is to introduce a fast inference algorithm for action segmentation that is applicable in both weakly supervised and fully supervised settings and flexible enough to work with different action segmentation methods. To this end, we introduce FIFA, a novel approach for fast inference in action segmentation.

Fig. 1. Overview of the FIFA optimization process. At each step of the optimization, a set of masks is generated using the current length estimates. Using the generated masks and the frame-wise negative log probabilities, the observation energy is calculated in an approximate but differentiable manner. The length energy is calculated from the current length estimates and added to the observation energy to obtain the total energy value. Taking the gradient of the total energy with respect to the length estimates, we update them using a gradient step.

In the following, for brevity, we write the mapping \(\alpha (t; c_{1:N}, \ell _{1:N})\) simply as \(\alpha (t)\). Maximizing probability (2) can be rewritten as minimizing the negative logarithm of that probability

$$\begin{aligned} \begin{aligned} \mathrm {argmax} \bigg \{ p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) \bigg \} = \mathrm {argmin} \bigg \{- \log \big ( p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) \big ) \bigg \} \end{aligned} \end{aligned}$$
(5)

which we refer to as the energy \(E(\ell _{1:N})\). Using (3) the energy is rewritten as

$$\begin{aligned} \begin{aligned} E(\ell _{1:N}) =&- \log \bigg (p(\ell _{1:N} | x_{1:T}, c_{1:N}) \bigg )\\ =&- \log \bigg (\prod _{t=1}^{T} p\big ( \alpha (t) | x_t \big ) \cdot \prod _{n=1}^{N} p\big ( \ell _n | c_n \big ) \bigg ) \\ =&\underbrace{\sum _{t=1}^{T} - \log p\big ( \alpha (t) | x_t \big )}_{\textstyle {E_{o}}} + \underbrace{\sum _{n=1}^{N} - \log p\big ( \ell _n | c_n \big )}_{\textstyle {E_{\ell }}}. \end{aligned} \end{aligned}$$
(6)

The first term in (6), \(E_o\), is referred to as the observation energy. It measures the cost of assigning the labels to the frames and is computed from the frame-wise probability estimates. The second term, \(E_\ell \), is referred to as the length energy. It is the cost of each segment taking a particular length, given an assumed average length for actions of each class.

We propose to optimize the energy defined in (6) using gradient-based optimization in order to avoid time-consuming dynamic programming. We start with an initial estimate of the lengths (obtained from the length model of each approach or computed from the training data when available) and update this estimate to minimize the energy function.

As the energy function \(E(\ell _{1:N})\) is not differentiable with respect to the lengths, we derive a relaxed, approximate energy function \(E^*(\ell _{1:N})\) that is differentiable.

4.1 Approximate Differentiable Energy \(E^*\)

The energy function E as defined in (6) is not differentiable for two reasons. First, the observation energy term \(E_o\) is not differentiable because of the \(\alpha (t)\) function. Second, the length energy term \(E_\ell \) is not differentiable because it expects natural numbers as input and cannot be evaluated on the real values used in gradient-based optimization. Below we describe how we approximate each term to make it differentiable.

Approximate Differentiable Observation Energy. Consider an \(N \times T\) matrix P containing negative log probabilities, i.e.,

$$\begin{aligned} P[n, t] = -\log p(c_n|x_t). \end{aligned}$$
(7)

Furthermore, we define a mask matrix M of the same size \(N \times T\) where

$$\begin{aligned} M[n, t] = {\left\{ \begin{array}{ll} 0 &{} \text {if } \alpha (t) \ne c_n \\ 1 &{} \text {if } \alpha (t) = c_n \end{array}\right. }. \end{aligned}$$
(8)

Using the mask matrix we can rewrite the observation energy term as

$$\begin{aligned} E_o = \sum _{t=1}^{T} \sum _{n=1}^{N} M[n, t] \cdot P[n, t]. \end{aligned}$$
(9)

In order to make the observation energy term differentiable with respect to the lengths, we propose to construct an approximate differentiable mask matrix \(M^*\). We use the following smooth and parametric plateau function

$$\begin{aligned} f(t|\lambda ^c, \lambda ^w, \lambda ^s) = \frac{1}{(e^{\lambda ^s(t-\lambda ^c-\lambda ^w)} + 1)(e^{\lambda ^s(-t+\lambda ^c-\lambda ^w)} + 1)} \end{aligned}$$
(10)

from [27]. This plateau function has three parameters and is differentiable with respect to all of them: \(\lambda ^c\) controls the center of the plateau, \(\lambda ^w\) its width, and \(\lambda ^s\) its sharpness.
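The plateau function can be written compactly with sigmoids, since \(1/(e^x + 1) = \sigma (-x)\). The following sketch uses PyTorch so that the function stays differentiable in all its parameters; it is a direct transcription of (10), with our own naming:

```python
import torch

def plateau(t, center, width, sharpness):
    """Smooth plateau f(t | lambda^c, lambda^w, lambda^s) of Eq. (10):
    approximately 1 for t in [center - width, center + width] and
    decaying smoothly to 0 outside."""
    rising = torch.sigmoid(sharpness * (t - center + width))    # left edge
    falling = torch.sigmoid(-sharpness * (t - center - width))  # right edge
    return rising * falling
```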

While the sharpness \(\lambda ^s\) of the plateau functions is fixed as a hyper-parameter of our approach, the center \(\lambda ^c\) and the width \(\lambda ^w\) are computed from the lengths \(\ell _{1:N}\). First, we calculate the starting position \(b_n\) of each plateau function as

$$\begin{aligned} b_1 = 0, b_n = \sum _{n' = 1}^{n-1} \ell _{n'}. \end{aligned}$$
(11)

We can then define both the center and the width parameters of each plateau function as

$$\begin{aligned} \begin{aligned} \lambda _n^c&= b_n + \ell _n / 2,\\ \lambda _n^w&= \ell _n / 2 \end{aligned} \end{aligned}$$
(12)

and define each row of the approximate mask as

$$\begin{aligned} M^*[n, t] = f(t| \lambda _n^c, \lambda _n^w, \lambda ^s). \end{aligned}$$
(13)

Now we can calculate a differentiable approximate observation energy similar to (9) as

$$\begin{aligned} E^*_o = \sum _{t=1}^{T} \sum _{n=1}^{N} M^*[n, t] \cdot P[n, t]. \end{aligned}$$
(14)
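Putting (11)-(14) together, the approximate observation energy can be sketched as below, reusing the plateau function above and assuming P is an \(N \times T\) torch tensor; the log-space parameterization of the lengths anticipates the discussion at the end of this section:

```python
def approximate_observation_energy(log_lengths, P, sharpness=0.1):
    """E*_o of Eq. (14): sum of the element-wise product of the soft
    masks M* (Eq. 13) and the negative log-probabilities P (N x T)."""
    lengths = torch.exp(log_lengths)                 # keeps lengths positive
    starts = torch.cumsum(lengths, dim=0) - lengths  # b_n of Eq. (11)
    centers = starts + lengths / 2                   # lambda^c_n of Eq. (12)
    widths = lengths / 2                             # lambda^w_n of Eq. (12)
    t = torch.arange(P.shape[1], dtype=P.dtype)      # frame indices 0..T-1
    masks = plateau(t[None, :], centers[:, None], widths[:, None], sharpness)
    return (masks * P).sum()
```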

Approximate Differentiable Length Energy. For the gradient-based optimization, we must relax the length values to be positive real numbers instead of natural numbers. As the Poisson distribution (4) is only defined on natural numbers, we use a substitute distribution defined on real numbers. As a replacement, we experiment with a Laplace distribution and a Gaussian distribution. In both cases, the scale or width parameter of the distribution is assumed to be fixed.

We can rewrite the length energy \(E_\ell \) as the approximate length energy

$$\begin{aligned} \begin{aligned} E^*_\ell (\ell _{1:N}) = \sum _{n=1}^{N} - \log p(\ell _n|\lambda ^{\ell }_{c_n}), \end{aligned} \end{aligned}$$
(15)

where \(\lambda ^{\ell }_{c_n}\) is the expected length of a segment of action \(c_n\). In the case of the Laplace distribution, this length energy is

$$\begin{aligned} \begin{aligned} E^*_\ell (\ell _{1:N}) = \frac{1}{Z} \sum _{n=1}^{N} | \ell _n - \lambda ^{\ell }_{c_n}|, \end{aligned} \end{aligned}$$
(16)

where Z is a constant normalization factor. This means that the length energy penalizes any deviation from the expected average length linearly. Similarly, for the Gaussian distribution, the length energy is

$$\begin{aligned} \begin{aligned} E^*_\ell (\ell _{1:N}) = \frac{1}{Z} \sum _{n=1}^{N} | \ell _n - \lambda ^{\ell }_{c_n}|^2, \end{aligned} \end{aligned}$$
(17)

which means that the Gaussian length energy will penalize any deviation from the expected average length quadratically.
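Both substitute length energies are simple to implement; in the sketch below, the normalization factor Z is treated as a fixed scale, which is our assumption rather than a detail specified in the paper:

```python
def laplace_length_energy(lengths, expected, Z=1.0):
    """E*_l of Eq. (16): linear penalty on deviations from the
    expected class-wise segment lengths."""
    return torch.abs(lengths - expected).sum() / Z

def gaussian_length_energy(lengths, expected, Z=1.0):
    """E*_l of Eq. (17): quadratic penalty on the same deviations."""
    return ((lengths - expected) ** 2).sum() / Z
```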

Fig. 2. Speed vs. accuracy trade-off of different inference approaches applied to the MuCon method. Using FIFA we achieve a better speed vs. accuracy trade-off compared to frame sampling or hypothesis pruning in exact inference.

Fig. 3. Effect of the length energy multiplier for the Laplace and Gaussian length energies. Accuracy is calculated on the Breakfast dataset using FIFA applied to the MuCon approach trained in the weakly supervised action segmentation setting.

In order to maintain positive length values during the optimization process, we estimate the lengths in log space and convert them to absolute space only to compute the approximate mask matrix \(M^*\) and the approximate length energy \(E^*_\ell \).

Approximate Energy Optimization. The total approximate energy function is defined as a weighted sum of both the approximate observation and the approximate length energy functions

$$\begin{aligned} E^*(\ell _{1:N}) = E^*_o(\ell _{1:N}, Y) + \beta E^*_\ell (\ell _{1:N}) \end{aligned}$$
(18)

where \(\beta \) is the multiplier for the length energy.

Given an initial length estimate \(\ell ^0_{1:N}\), we iteratively update it to minimize the total energy. Figure 1 illustrates one optimization step of our approach. During each step, we first calculate the energy \(E^*\) and then the gradients of the energy with respect to the length values. Using the calculated gradients, we update the length estimate with a gradient descent update rule such as SGD or Adam. After a fixed number of gradient steps (50 in our experiments), we obtain the final segment lengths.
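A minimal sketch of the full optimization loop, assuming the helper functions from the previous sketches; the hyper-parameter values are illustrative defaults, not the paper's exact settings:

```python
def fifa_inference(P, expected_lengths, beta=0.05, steps=50, lr=0.1,
                   sharpness=0.1):
    """Minimize E* = E*_o + beta * E*_l (Eq. 18) over log-lengths with
    Adam. P is the (N x T) negative log-probability matrix for a given
    transcript; expected_lengths holds the initial length estimates."""
    log_lengths = torch.log(expected_lengths).clone().requires_grad_(True)
    optimizer = torch.optim.Adam([log_lengths], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        lengths = torch.exp(log_lengths)
        energy = (approximate_observation_energy(log_lengths, P, sharpness)
                  + beta * laplace_length_energy(lengths, expected_lengths))
        energy.backward()
        optimizer.step()
    return torch.exp(log_lengths).detach()  # final segment lengths
```

Because the free parameters live in log space, the recovered lengths stay positive by construction throughout the optimization.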

If the transcript of a test video is available, it is used directly, e.g., when it is predicted by the MuCon [32] approach or provided in the weakly supervised action alignment setting. If the transcript is not known, e.g., for fully supervised approaches or CDFL [25] in weakly supervised action segmentation, we perform the optimization for each transcript seen during training and select the most likely one based on the final energy value at the end of the optimization.

In the weakly supervised setting, the initial length estimates are calculated from the length model of each approach, whereas in the fully supervised setting the average length of each action class is calculated from the training data and used as the initial estimate. The initial length estimates also serve as the expected length parameters in the length energy.

The choice of the optimizer, the number of steps, the learning rate, and the mask sharpness remain hyper-parameters of our approach.

5 Experiments

5.1 Evaluation Protocols and Datasets

We evaluate FIFA on three different tasks: weakly supervised action segmentation, fully supervised action segmentation, and weakly supervised action alignment. Results for action alignment are included in the supplementary material. We obtain the source code of the state-of-the-art approaches for each of these tasks and train a model using each method's standard training configuration. Then we apply FIFA as a replacement for an existing inference stage or as an additional inference stage.

We evaluate our model on the Breakfast [16] and Hollywood extended [4] datasets for the three different tasks. Details of the datasets are included in the supplementary material.

5.2 Results and Discussions

In this section, we study the speed vs. accuracy trade-off and the impact of the length model. Additional ablation experiments are included in the supplementary material.

Speed vs. Accuracy Trade-off. One of the major benefits of FIFA is that it is anytime: it provides the flexibility of choosing the number of optimization steps, which can be used to trade off speed against accuracy. For exact inference, we can instead use frame sampling, i.e., lowering the temporal resolution of the input features, or hypothesis pruning, i.e., beam search, to trade off speed against accuracy.

Figure 2 plots the speed vs. accuracy trade-off of exact inference compared to FIFA. We observe that FIFA provides a much better speed vs. accuracy trade-off than frame sampling for exact inference. The best performance is achieved after 50 steps, with a \(5.9\%\) improvement in MoF accuracy compared to not performing any optimization (0 steps).

Impact of the Length Energy Multiplier. For the length energy, we assume that the segment lengths follow a Laplace distribution. Figure 3 shows the impact of the length energy multiplier on the weakly supervised action segmentation performance on the Breakfast dataset. The choice of this parameter depends on the dataset. While the best accuracy is achieved with a multiplier of 0.05, our approach is robust to the choice of this hyper-parameter on this dataset. We further experimented with a Gaussian length energy. However, as shown in the figure, the performance is much worse compared to the Laplace energy. This is due to the quadratic penalty, which dominates the total energy, biases the optimization towards the initial estimate, and makes it ignore the observation energy.

Impact of the Length Initialization. Since FIFA starts with an initial estimate of the lengths, the choice of initialization might have an impact on the performance. Table 1 shows the effect of initializing the lengths with equal values compared to using the length model of MuCon [32] for weakly supervised action segmentation on the Breakfast dataset. As shown in the table, FIFA is more robust to the initialization than exact inference, as its drop in performance is approximately half that of exact inference.

Table 1. Impact of the length initialization for MuCon using exact inference and FIFA for weakly supervised action segmentation on the Breakfast dataset.
Table 2. Results for weakly supervised action segmentation on the Breakfast dataset. \(^*\) indicates results obtained by running the code on our machine.
Table 3. Results for fully supervised action segmentation setup on the Breakfast dataset. \(^*\) indicates results obtained by running the code on our machine.

5.3 Comparison to State of the Art

In this section, we compare FIFA to other state-of-the-art approaches.

Weakly Supervised Action Segmentation. We apply FIFA on top of two state-of-the-art approaches for weakly supervised action segmentation, namely MuCon [32] and CDFL [25], on the Breakfast dataset [16] and report the results in Table 2. FIFA applied to CDFL achieves a 12 times faster inference speed while obtaining results comparable to exact inference. FIFA applied to MuCon achieves a 5 times faster inference speed and sets a new state of the art on the Breakfast dataset on most of the metrics.

For the sake of completeness, we also report the inference speed of ISBA [6] and NNV [30] in Table 2 (since the source code of D3TW [5] is not available, we could not measure its inference speed). ISBA has the fastest inference time, as it does not perform any optimization during testing. However, ISBA makes frame-wise predictions, which results in over-segmentation and low performance across all metrics.

Similarly, for the Hollywood extended dataset [4], we apply FIFA to MuCon [32] and report the results in Table 4. FIFA applied to MuCon achieves a 4 times faster inference speed while obtaining results comparable to MuCon with exact inference.

Fully Supervised Action Segmentation. In the fully supervised action segmentation setting, we apply FIFA on top of MS-TCN [1] and its variant MS-TCN++ [26] on the Breakfast dataset [16] and report the results in Table 3. MS-TCN and MS-TCN++ do not perform any inference at test time. This usually results in over-segmentation and low F1 and Edit scores. Applying FIFA on top of these approaches improves the F1 and Edit scores significantly. FIFA applied on top of MS-TCN achieves state-of-the-art performance on most metrics.

For the Hollywood extended dataset [4], we train MS-TCN [1] and compare exact inference (EI) to FIFA in Table 5. We observe that MS-TCN with an inference algorithm achieves new state-of-the-art results on this dataset. FIFA performs comparably to or better than exact inference on this dataset.

Table 4. Results for weakly supervised action segmentation on the Hollywood extended dataset. Time is reported in seconds. \(^*\) indicates results obtained by running the code on our machine.
Table 5. Results for fully supervised action segmentation on the Hollywood extended dataset. \(^*\) indicates results obtained by running the code on our machine. EI stands for Exact Inference.

5.4 Qualitative Example

A qualitative example of the FIFA optimization process is depicted in Fig. 4. For further qualitative examples, failure cases, and details please refer to the supplementary material.

Fig. 4. Visualization of the FIFA optimization process. On the right, the values of the total approximate energy are plotted. On the left, the negative log probability values, the ground truth segmentation, the optimization initialization, the masks, and the segmentation after inference are shown.

6 Conclusion

In this paper, we proposed FIFA, a fast approximate inference procedure for action segmentation and alignment. Unlike previous methods, our approach does not rely on dynamic-programming-based Viterbi decoding for inference. Instead, FIFA optimizes a differentiable energy function that can be minimized using gradient descent, which allows for fast and accurate inference at test time. We evaluated FIFA on top of fully and weakly supervised methods trained on the Breakfast and Hollywood extended datasets. The results show that FIFA achieves comparable or better performance while being at least 5 times faster than exact inference.