
1 Introduction

Action segmentation is the task of predicting the action label for each frame of an input video. It is usually studied in the context of activities performed by a single person, where temporal smoothness of the actions is assumed. Fully supervised approaches for action segmentation [1, 21, 26, 34] already achieve good performance on this task. Most of them make frame-wise predictions [1, 21, 26] while trying to model the temporal relationship between the action labels, and they usually suffer from over-segmentation. Recent works [13, 34] try to overcome the over-segmentation problem by finding the action boundaries and temporally smoothing the predictions inside each action segment, but such post-processing still cannot guarantee temporal smoothness.

Action segmentation inference is the problem of making smooth segment-wise predictions from frame-wise probabilities, given a known grammar of the actions and their average lengths [30]. The typical inference in action segmentation involves solving an expensive Viterbi-like dynamic programming problem that finds the best action sequence and its corresponding segment lengths. In the literature, weakly supervised action segmentation approaches [18, 25, 29, 30, 32] usually rely on such an inference stage at test time. Despite being very useful for action segmentation, inference remains the main computational bottleneck in the action segmentation pipeline [32].

In this paper, we propose FIFA, a fast anytime approximate inference procedure that achieves performance comparable to dynamic-programming-based Viterbi decoding at a fraction of the computational time. Instead of relying on dynamic programming, we formulate the energy function as an approximate differentiable function of the segment length parameters and use gradient-descent-based methods to search for a configuration that minimizes the approximate energy function. Given a transcript of actions and a corresponding initial length configuration, we define the energy function as a sum over segment-level energies. The segment-level energy consists of two terms: a length energy term that penalizes deviations from a global length model and an observation energy term that measures the compatibility between the current configuration and the predicted frame-wise probabilities. A naive approach to model the observation energy would be to sum up the negative log probabilities of the action labels defined by the length configuration. However, such an approach is not differentiable with respect to the segment lengths, and in order to optimize the energy using gradient-descent-based methods, the observation energy has to be differentiable with respect to them. To this end, we construct a plateau-shaped mask for each segment which temporally locates the segment within the video. This mask is parameterized by the segment length, the position in the video, and a sharpness parameter. The observation energy is then defined as the product of the segment masks and the predicted frame-wise negative log probabilities, followed by a sum-pooling operation. Finally, a gradient-descent-based method is used to find a configuration of the segment lengths that minimizes the total energy.

FIFA is a general inference approach and can be applied at test time on top of different action segmentation approaches for fast inference. We evaluate our approach on top of state-of-the-art methods for weakly supervised temporal action segmentation, weakly supervised action alignment, and fully supervised action segmentation. Results on the Breakfast [16] and Hollywood extended [4] datasets show that FIFA achieves state-of-the-art results on most metrics. Compared to exact inference using Viterbi decoding, FIFA is at least 5 times faster. Furthermore, FIFA is an anytime algorithm that can be stopped after each step of the gradient-based optimization; it therefore provides a better speed vs. accuracy trade-off than exact inference.

2 Related Work

In this section, we highlight recently proposed works addressing fully and weakly supervised action segmentation.

Fully Supervised Action Segmentation. In fully supervised action segmentation, frame-level labels are used for training. Initial attempts applied action classifiers in a sliding-window fashion over the video frames [15, 31]. However, these approaches did not capture the dependencies between action segments. With the objective of capturing context over long video sequences, context-free grammars [28, 33] or hidden Markov models (HMMs) [17, 19, 22] are typically combined with frame-wise classifiers. Recently, temporal convolutional networks have shown good performance for the temporal action segmentation task using encoder-decoder architectures [21, 24] or multi-stage architectures [1, 26]. Many approaches further improve the multi-stage architectures by applying post-processing based on boundary-aware pooling [13, 34] or graph-based reasoning [12]. Without any inference, most fully supervised approaches therefore suffer from over-segmentation at test time.

Weakly Supervised Action Segmentation. To reduce the annotation cost, many approaches that rely on a weaker form of supervision have been proposed. Earlier approaches apply discriminative clustering to align video frames to movie scripts [7]. Bojanowski et al. [4] proposed to use transcripts, i.e., ordered lists of actions, as supervision. Indeed, many approaches rely on this form of supervision to train a segmentation model using connectionist temporal classification [11], dynamic time warping [5], or energy-based learning [25]. In [6], an iterative training procedure is used to refine the transcript, and a soft labeling mechanism is further applied at the boundaries between action segments. Kuehne et al. [18] applied a speech recognition system based on an HMM and a Gaussian mixture model (GMM) to align video frames to transcripts. The approach generates pseudo ground truth labels for the training videos and iteratively refines them. A similar idea has recently been used in [19, 29]. Richard et al. [30] combined a frame-wise loss function with the Viterbi algorithm to generate the target labels. At inference time, these approaches iterate over the training transcripts and select the one that best matches the test video. By contrast, Souri et al. [32] predict the transcript alongside the frame-wise scores at inference time. State-of-the-art weakly supervised action segmentation approaches require time-consuming dynamic programming based inference at test time.

Energy-Based Inference. In energy-based inference methods, gradient descent is used at inference time as described in [23]. The goal is to minimize an energy function that measures the compatibility between the input variables and the predicted variables. This idea has been exploited for many structured prediction tasks such as image generation [8, 14] and machine translation [10], and underlies structured prediction energy networks [3]. Belanger and McCallum [2] relaxed the discrete output space for multi-label classification tasks to a continuous one and used gradient descent to approximate the solution. Gradient-based methods have also been used for other applications such as generating adversarial examples [9] and learning text embeddings [20].

3 Background

The following sections introduce all the concepts and notations required to understand the proposed FIFA methodology.

3.1 Action Segmentation

In action segmentation, we want to temporally localize all the action segments occurring in a video. In this paper, we consider the case where the actions come from a predefined set of M classes (a background class is used to cover uninteresting parts of a video). The input video of length T is usually represented as a sequence of d-dimensional feature vectors \(x_{1:T} = (x_1, \dots , x_T)\). These features are extracted offline and are assumed to be the input to the action segmentation model. The output of action segmentation can be represented in two ways:

  • Frame-wise representation \(y_{1:T} = (y_1, \dots , y_T)\) where \(y_t\) represents the action label at time t.

  • Segment-wise representation \(s_{1:N} = (s_1, \dots , s_N)\) where segment \(s_n\) is represented by both the action label of the segment \(c_n\) and its corresponding length \(\ell _n\), i.e., \(s_n = (c_n, \ell _n)\). The ordered list of actions \(c_{1:N}\) is usually referred to as the transcript.

These two representations are equivalent, i.e., it is possible to compute one from the other. In order to convert the segment-wise to the frame-wise representation, we introduce a mapping \(\alpha (t; c_{1:N}, \ell _{1:N})\) which outputs the action label at frame t given the segment-wise labeling.
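As an illustration, the mapping can be implemented in a few lines; the following is a minimal Python sketch with names of our choosing, not code from the paper:

```python
def alpha(t, transcript, lengths):
    """Return the action label of frame t, given the segment labels
    c_{1:N} (transcript) and the segment lengths l_{1:N} (lengths)."""
    boundary = 0
    for label, length in zip(transcript, lengths):
        boundary += length
        if t < boundary:
            return label
    return transcript[-1]  # clamp frames beyond the last boundary

# Example: segments ('A', 3) and ('B', 2) yield frame labels A A A B B
frame_labels = [alpha(t, ['A', 'B'], [3, 2]) for t in range(5)]
```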

The target labels used to train a segmentation model depend on the level of supervision. In fully supervised action segmentation [1, 26, 34], the target label for each frame is provided. In weakly supervised approaches [25, 30, 32], however, only the ordered list of action labels is provided during training while the segment lengths remain unknown.

Recent fully supervised approaches for action segmentation like MS-TCN [1] and its variants directly predict the frame-wise representation \(y_{1:T}\) by choosing the action label with the highest probability for each frame independently. This sometimes results in over-segmented predictions.

Conversely, recent weakly supervised action segmentation approaches like NNV [30] and follow-up work include an inference stage during testing where they explicitly predict the segment-wise representation. This inference stage involves a dynamic programming algorithm for solving an optimization problem which is a computational bottleneck for these approaches.

3.2 Inference in Action Segmentation

During testing, the inference stage involves solving an optimization problem to find the most likely segmentation for the input video, i.e.,

$$\begin{aligned} c_{1:N}, \ell _{1:N} = \underset{\hat{c}_{1:N}, \hat{\ell }_{1:N}}{\mathrm {argmax}} \Big \{ p(\hat{c}_{1:N}, \hat{\ell }_{1:N} | x_{1:T}) \Big \}. \end{aligned}$$
(1)

Given the transcript \(c_{1:N}\), the inference stage boils down to finding the segment lengths \(\ell _{1:N}\) by aligning the transcript to the input video, i.e.,

$$\begin{aligned} \ell _{1:N} = \underset{\hat{\ell }_{1:N}}{\mathrm {argmax}} \Big \{ p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) \Big \}. \end{aligned}$$
(2)

In approaches like NNV [30] and CDFL [25], the transcript is found by iterating over the transcripts seen during training and selecting the one that achieves the most likely alignment according to (2). In MuCon [32], the transcript is predicted by a sequence-to-sequence network.

The probability defined in (2) is decomposed by making an independence assumption between frames

$$\begin{aligned} \begin{aligned} p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) = \prod _{t=1}^{T} p\big ( \alpha (t; c_{1:N}, \hat{\ell }_{1:N}) | x_t \big ) \cdot \prod _{n=1}^{N} p\big ( \hat{\ell }_n | c_n \big ) \end{aligned} \end{aligned}$$
(3)

where \(p\big ( \alpha (t)| x_t \big )\) is referred to as the observation model and \(p\big ( \ell _n | c_n \big )\) as the length model. Here, \(\alpha (t)\) is the mapping from time t to the action label given the segment-wise labeling. The observation model estimates the frame-wise action probabilities and is implemented using a neural network. The length model constrains the inference defined in (2) with the assumption that the lengths of segments of the same action follow a particular probability distribution. The segment length is usually modelled by a Poisson distribution with a class-dependent mean parameter \(\lambda _{c_n}\), i.e.,

$$\begin{aligned} p\big ( \ell _n | c_n \big ) = \frac{\lambda _{c_n}^{\ell _n} \mathrm {exp}(- \lambda _{c_n}) }{\ell _n!}. \end{aligned}$$
(4)

This optimization problem is solved using an expensive dynamic-programming-based Viterbi decoding [30]. For details on how to solve it using Viterbi decoding, please refer to the supplementary material.
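For reference, the negative log-probability of this Poisson length model, which reappears inside the energy function of Sect. 4, can be evaluated as in the following plain-Python sketch (ours, not the authors' code):

```python
from math import lgamma, log

def poisson_neg_log_prob(length, lam):
    """-log p(l | c) for the Poisson length model of Eq. (4):
    -l * log(lam) + lam + log(l!), where log(l!) = lgamma(l + 1)."""
    return -length * log(lam) + lam + lgamma(length + 1)
```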

4 FIFA: Fast Inference Approximation

Our goal is to introduce a fast inference algorithm for action segmentation that is applicable in both weakly supervised and fully supervised settings and flexible enough to work with different action segmentation methods. To this end, we introduce FIFA, a novel approach for fast inference in action segmentation.

Fig. 1. Overview of the FIFA optimization process. At each step of the optimization, a set of masks is generated using the current length estimates. Using the generated masks and the frame-wise negative log probabilities, the observation energy is calculated in an approximate but differentiable manner. The length energy is calculated from the current length estimates and added to the observation energy to obtain the total energy value. Taking the gradient of the total energy with respect to the length estimates, we update them using a gradient step.

In the following, for brevity, we write the mapping \(\alpha (t; c_{1:N}, \ell _{1:N})\) simply as \(\alpha (t)\). Maximizing probability (2) can be rewritten as minimizing the negative logarithm of that probability

$$\begin{aligned} \begin{aligned} \mathrm {argmax} \bigg \{ p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) \bigg \} = \mathrm {argmin} \bigg \{- \log \big ( p(\hat{\ell }_{1:N} | x_{1:T}, c_{1:N}) \big ) \bigg \} \end{aligned} \end{aligned}$$
(5)

which we refer to as the energy \(E(\ell _{1:N})\). Using (3) the energy is rewritten as

$$\begin{aligned} \begin{aligned} E(\ell _{1:N}) =&- \log \bigg (p(\ell _{1:N} | x_{1:T}, c_{1:N}) \bigg )\\ =&- \log \bigg (\prod _{t=1}^{T} p\big ( \alpha (t) | x_t \big ) \cdot \prod _{n=1}^{N} p\big ( \ell _n | c_n \big ) \bigg ) \\ =&\underbrace{\sum _{t=1}^{T} - \log p\big ( \alpha (t) | x_t \big )}_{\textstyle {E_{o}}} + \underbrace{\sum _{n=1}^{N} - \log p\big ( \ell _n | c_n \big )}_{\textstyle {E_{\ell }}}. \end{aligned} \end{aligned}$$
(6)

The first term in (6), \(E_o\), is referred to as the observation energy. It measures the cost of assigning the labels to the frames and is computed from the frame-wise probability estimates. The second term, \(E_\ell \), is referred to as the length energy. It is the cost of each segment taking a particular length, given an assumed average length for actions of each class.

We propose to optimize the energy defined in (6) using gradient-based optimization in order to avoid time-consuming dynamic programming. We start with an initial estimate of the lengths (obtained from the length model of each approach or computed from the training data when available) and update this estimate to minimize the energy function.

As the energy function \(E(\ell _{1:N})\) is not differentiable with respect to the lengths, we derive a relaxed, approximate energy function \(E^*(\ell _{1:N})\) that is differentiable.

4.1 Approximate Differentiable Energy \(E^*\)

The energy function E as defined in (6) is not differentiable for two reasons. First, the observation energy term \(E_o\) is not differentiable because of the \(\alpha (t)\) function. Second, the length energy term \(E_\ell \) is not differentiable because it expects natural numbers as input and cannot be evaluated on the real values used in gradient-based optimization. Below we describe how we approximate each term to make it differentiable.

Approximate Differentiable Observation Energy. Consider an \(N \times T\) matrix P containing negative log probabilities, i.e.,

$$\begin{aligned} P[n, t] = -\log p(c_n|x_t). \end{aligned}$$
(7)

Furthermore, we define a mask matrix M of the same size \(N \times T\) where

$$\begin{aligned} M[n, t] = {\left\{ \begin{array}{ll} 0 &{} \text {if } \alpha (t) \ne c_n \\ 1 &{} \text {if } \alpha (t) = c_n \end{array}\right. }. \end{aligned}$$
(8)

Using the mask matrix we can rewrite the observation energy term as

$$\begin{aligned} E_o = \sum _{t=1}^{T} \sum _{n=1}^{N} M[n, t] \cdot P[n, t]. \end{aligned}$$
(9)

In order to make the observation energy term differentiable with respect to the lengths, we propose to construct an approximate differentiable mask matrix \(M^*\). We use the following smooth and parametric plateau function

$$\begin{aligned} f(t|\lambda ^c, \lambda ^w, \lambda ^s) = \frac{1}{(e^{\lambda ^s(t-\lambda ^c-\lambda ^w)} + 1)(e^{\lambda ^s(-t+\lambda ^c-\lambda ^w)} + 1)} \end{aligned}$$
(10)

from [27]. This plateau function has three parameters and is differentiable with respect to all of them: \(\lambda ^c\) controls the center of the plateau, \(\lambda ^w\) its width, and \(\lambda ^s\) its sharpness.
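The plateau function can be written compactly with sigmoids, since \(1/(e^x + 1) = \sigma (-x)\). The following sketch uses PyTorch so that the function stays differentiable in all its parameters; it is a direct transcription of (10), with our own naming:

```python
import torch

def plateau(t, center, width, sharpness):
    """Smooth plateau f(t | lambda^c, lambda^w, lambda^s) of Eq. (10):
    approximately 1 for t in [center - width, center + width] and
    decaying smoothly to 0 outside."""
    rising = torch.sigmoid(sharpness * (t - center + width))    # left edge
    falling = torch.sigmoid(-sharpness * (t - center - width))  # right edge
    return rising * falling
```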

While the sharpness \(\lambda ^s\) of the plateau functions is fixed as a hyper-parameter of our approach, the center \(\lambda ^c\) and the width \(\lambda ^w\) are computed from the lengths \(\ell _{1:N}\). First, we calculate the starting position \(b_n\) of each plateau function as

$$\begin{aligned} b_1 = 0, b_n = \sum _{n' = 1}^{n-1} \ell _{n'}. \end{aligned}$$
(11)

We can then define both the center and the width parameters of each plateau function as

$$\begin{aligned} \begin{aligned} \lambda _n^c&= b_n + \ell _n / 2,\\ \lambda _n^w&= \ell _n / 2 \end{aligned} \end{aligned}$$
(12)

and define each row of the approximate mask as

$$\begin{aligned} M^*[n, t] = f(t| \lambda _n^c, \lambda _n^w, \lambda ^s). \end{aligned}$$
(13)

Now we can calculate a differentiable approximate observation energy similar to (9) as

$$\begin{aligned} E^*_o = \sum _{t=1}^{T} \sum _{n=1}^{N} M^*[n, t] \cdot P[n, t]. \end{aligned}$$
(14)
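Putting (11)-(14) together, the approximate observation energy can be sketched as below, reusing the plateau function above and assuming P is an \(N \times T\) torch tensor; the log-space parameterization of the lengths anticipates the discussion at the end of this section:

```python
def approximate_observation_energy(log_lengths, P, sharpness=0.1):
    """E*_o of Eq. (14): sum of the element-wise product of the soft
    masks M* (Eq. 13) and the negative log-probabilities P (N x T)."""
    lengths = torch.exp(log_lengths)                 # keeps lengths positive
    starts = torch.cumsum(lengths, dim=0) - lengths  # b_n of Eq. (11)
    centers = starts + lengths / 2                   # lambda^c_n of Eq. (12)
    widths = lengths / 2                             # lambda^w_n of Eq. (12)
    t = torch.arange(P.shape[1], dtype=P.dtype)      # frame indices 0..T-1
    masks = plateau(t[None, :], centers[:, None], widths[:, None], sharpness)
    return (masks * P).sum()
```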

Approximate Differentiable Length Energy. For the gradient-based optimization, we must relax the length values to be positive real numbers instead of natural numbers. As the Poisson distribution (4) is only defined on natural numbers, we use a substitute distribution defined on real numbers. As a replacement, we experiment with a Laplace distribution and a Gaussian distribution. In both cases, the scale or width parameter of the distribution is assumed to be fixed.

We can rewrite the length energy \(E_\ell \) as the approximate length energy

$$\begin{aligned} \begin{aligned} E^*_\ell (\ell _{1:N}) = \sum _{n=1}^{N} - \log p(\ell _n|\lambda ^{\ell }_{c_n}), \end{aligned} \end{aligned}$$
(15)

where \(\lambda ^{\ell }_{c_n}\) is the expected length of a segment of action \(c_n\). In the case of the Laplace distribution, this length energy is

$$\begin{aligned} \begin{aligned} E^*_\ell (\ell _{1:N}) = \frac{1}{Z} \sum _{n=1}^{N} | \ell _n - \lambda ^{\ell }_{c_n}|, \end{aligned} \end{aligned}$$
(16)

where Z is a constant normalization factor. This means that the length energy penalizes any deviation from the expected average length linearly. Similarly, for the Gaussian distribution, the length energy is

$$\begin{aligned} \begin{aligned} E^*_\ell (\ell _{1:N}) = \frac{1}{Z} \sum _{n=1}^{N} | \ell _n - \lambda ^{\ell }_{c_n}|^2, \end{aligned} \end{aligned}$$
(17)

which means that the Gaussian length energy will penalize any deviation from the expected average length quadratically.
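Both substitute length energies are simple to implement; in the sketch below, the normalization factor Z is treated as a fixed scale, which is our assumption rather than a detail specified in the paper:

```python
def laplace_length_energy(lengths, expected, Z=1.0):
    """E*_l of Eq. (16): linear penalty on deviations from the
    expected class-wise segment lengths."""
    return torch.abs(lengths - expected).sum() / Z

def gaussian_length_energy(lengths, expected, Z=1.0):
    """E*_l of Eq. (17): quadratic penalty on the same deviations."""
    return ((lengths - expected) ** 2).sum() / Z
```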

Fig. 2. Speed vs. accuracy trade-off of different inference approaches applied to the MuCon method. Using FIFA we achieve a better speed vs. accuracy trade-off compared to frame sampling or hypothesis pruning in exact inference.

Fig. 3. Effect of the length energy multiplier for the Laplace and Gaussian length energies. Accuracy is calculated on the Breakfast dataset using FIFA applied to the MuCon approach trained in the weakly supervised action segmentation setting.

In order to maintain positive length values during the optimization process, we estimate the lengths in log space and convert them to absolute space only to compute the approximate mask matrix \(M^*\) and the approximate length energy \(E^*_\ell \).

Approximate Energy Optimization. The total approximate energy function is defined as a weighted sum of both the approximate observation and the approximate length energy functions

$$\begin{aligned} E^*(\ell _{1:N}) = E^*_o(\ell _{1:N}, Y) + \beta E^*_\ell (\ell _{1:N}) \end{aligned}$$
(18)

where \(\beta \) is the multiplier for the length energy.

Given an initial length estimate \(\ell ^0_{1:N}\), we iteratively update it to minimize the total energy. Figure 1 illustrates one optimization step of our approach. During each step, we first calculate the energy \(E^*\) and then the gradients of the energy with respect to the length values. Using the calculated gradients, we update the length estimate with a gradient descent update rule such as SGD or Adam. After a fixed number of gradient steps (50 in our experiments), we obtain the final segment lengths.
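A minimal sketch of the full optimization loop, assuming the helper functions from the previous sketches; the hyper-parameter values are illustrative defaults, not the paper's exact settings:

```python
def fifa_inference(P, expected_lengths, beta=0.05, steps=50, lr=0.1,
                   sharpness=0.1):
    """Minimize E* = E*_o + beta * E*_l (Eq. 18) over log-lengths with
    Adam. P is the (N x T) negative log-probability matrix for a given
    transcript; expected_lengths holds the initial length estimates."""
    log_lengths = torch.log(expected_lengths).clone().requires_grad_(True)
    optimizer = torch.optim.Adam([log_lengths], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        lengths = torch.exp(log_lengths)
        energy = (approximate_observation_energy(log_lengths, P, sharpness)
                  + beta * laplace_length_energy(lengths, expected_lengths))
        energy.backward()
        optimizer.step()
    return torch.exp(log_lengths).detach()  # final segment lengths
```

Because the free parameters live in log space, the recovered lengths stay positive by construction throughout the optimization.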

If the transcript of a test video is available, it is used directly, e.g., when it is predicted by the MuCon [32] approach or provided in the weakly supervised action alignment setting. If the transcript is not known, e.g., for fully supervised approaches or CDFL [25] in weakly supervised action segmentation, we perform the optimization for each transcript seen during training and select the most likely one based on the final energy value at the end of the optimization.

In the weakly supervised setting, the initial length estimates are calculated from the length model of each approach, whereas in the fully supervised setting the average length of each action class is calculated from the training data and used as the initial estimate. The initial length estimates also serve as the expected length parameters in the length energy.

The choice of the optimizer, the number of steps, the learning rate, and the mask sharpness remain hyper-parameters of our approach.

5 Experiments

5.1 Evaluation Protocols and Datasets

We evaluate FIFA on three different tasks: weakly supervised action segmentation, fully supervised action segmentation, and weakly supervised action alignment. Results for action alignment are included in the supplementary material. We obtain the source code of the state-of-the-art approaches for each of these tasks and train a model using each method's standard training configuration. Then we apply FIFA as a replacement for an existing inference stage or as an additional inference stage.

We evaluate our model on the Breakfast [16] and Hollywood extended [4] datasets for the three different tasks. Details of the datasets are included in the supplementary material.

5.2 Results and Discussions

In this section, we study the speed vs. accuracy trade-off and the impact of the length model. Additional ablation experiments are included in the supplementary material.

Speed vs. Accuracy Trade-off. One of the major benefits of FIFA is that it is anytime: it provides the flexibility of choosing the number of optimization steps, which can be used to trade off speed against accuracy. For exact inference, we can instead use frame sampling, i.e., lowering the temporal resolution of the input features, or hypothesis pruning, i.e., beam search, to trade off speed against accuracy.

Figure 2 plots the speed vs. accuracy trade-off of exact inference compared to FIFA. We observe that FIFA provides a much better speed vs. accuracy trade-off than frame sampling for exact inference. The best performance is achieved after 50 steps, with a \(5.9\%\) improvement in MoF accuracy compared to not performing any optimization (0 steps).

Impact of the Length Energy Multiplier. For the length energy, we assume that the segment lengths follow a Laplace distribution. Figure 3 shows the impact of the length energy multiplier on the weakly supervised action segmentation performance on the Breakfast dataset. The choice of this parameter depends on the dataset. While the best accuracy is achieved with a multiplier of 0.05, our approach is robust to the choice of this hyper-parameter on this dataset. We further experimented with a Gaussian length energy. However, as shown in the figure, the performance is much worse compared to the Laplace energy. This is due to the quadratic penalty, which dominates the total energy, biases the optimization towards the initial estimate, and makes it ignore the observation energy.

Impact of the Length Initialization. Since FIFA starts with an initial estimate of the lengths, the choice of initialization might have an impact on the performance. Table 1 shows the effect of initializing the lengths with equal values compared to using the length model of MuCon [32] for weakly supervised action segmentation on the Breakfast dataset. As shown in the table, FIFA is more robust to the initialization than exact inference, as its drop in performance is approximately half that of exact inference.

Table 1. Impact of the length initialization for MuCon using exact inference and FIFA for weakly supervised action segmentation on the Breakfast dataset.
Table 2. Results for weakly supervised action segmentation on the Breakfast dataset. \(^*\) indicates results obtained by running the code on our machine.
Table 3. Results for fully supervised action segmentation setup on the Breakfast dataset. \(^*\) indicates results obtained by running the code on our machine.

5.3 Comparison to State of the Art

In this section, we compare FIFA to other state-of-the-art approaches.

Weakly Supervised Action Segmentation. We apply FIFA on top of two state-of-the-art approaches for weakly supervised action segmentation, namely MuCon [32] and CDFL [25], on the Breakfast dataset [16] and report the results in Table 2. FIFA applied to CDFL achieves a 12 times faster inference speed while obtaining results comparable to exact inference. FIFA applied to MuCon achieves a 5 times faster inference speed and sets a new state of the art on the Breakfast dataset on most of the metrics.

For the sake of completeness, we also report the inference speed of ISBA [6] and NNV [30] in Table 2 (since the source code of D3TW [5] is not available, we could not measure its inference speed). ISBA has the fastest inference time, as it does not perform any optimization during testing. However, ISBA makes frame-wise predictions, which results in over-segmentation and low performance across all metrics.

Similarly, for the Hollywood extended dataset [4], we apply FIFA to MuCon [32] and report the results in Table 4. FIFA applied to MuCon achieves a 4 times faster inference speed while obtaining results comparable to MuCon with exact inference.

Fully Supervised Action Segmentation. In the fully supervised action segmentation setting, we apply FIFA on top of MS-TCN [1] and its variant MS-TCN++ [26] on the Breakfast dataset [16] and report the results in Table 3. MS-TCN and MS-TCN++ do not perform any inference at test time. This usually results in over-segmentation and low F1 and Edit scores. Applying FIFA on top of these approaches improves the F1 and Edit scores significantly. FIFA applied on top of MS-TCN achieves state-of-the-art performance on most metrics.

For the Hollywood extended dataset [4], we train MS-TCN [1] and compare exact inference (EI) to FIFA in Table 5. We observe that MS-TCN with an inference algorithm achieves new state-of-the-art results on this dataset. FIFA performs comparably to or better than exact inference on this dataset.

Table 4. Results for weakly supervised action segmentation on the Hollywood extended dataset. Time is reported in seconds. \(^*\) indicates results obtained by running the code on our machine.
Table 5. Results for fully supervised action segmentation on the Hollywood extended dataset. \(^*\) indicates results obtained by running the code on our machine. EI stands for Exact Inference.

5.4 Qualitative Example

A qualitative example of the FIFA optimization process is depicted in Fig. 4. For further qualitative examples, failure cases, and details please refer to the supplementary material.

Fig. 4. Visualization of the FIFA optimization process. On the right, the values of the total approximate energy are plotted. On the left, the negative log probability values, the ground truth segmentation, the optimization initialization, the masks, and the segmentation after inference are shown.

6 Conclusion

In this paper, we proposed FIFA, a fast approximate inference procedure for action segmentation and alignment. Unlike previous methods, our approach does not rely on dynamic-programming-based Viterbi decoding for inference. Instead, FIFA optimizes a differentiable energy function that can be minimized using gradient descent, which allows for fast and accurate inference at test time. We evaluated FIFA on top of fully and weakly supervised methods trained on the Breakfast and Hollywood extended datasets. The results show that FIFA achieves comparable or better performance while being at least 5 times faster than exact inference.