1 Introduction

Deep learning has greatly improved action recognition, particularly for short videos containing only a single action label. However, action segmentation of long untrimmed videos with many action segments remains a challenge: the model must correctly predict the action of every frame in a long video with fine-grained class labels. Many researchers have therefore turned to temporal convolutional networks (TCN) [15] for action segmentation. Nevertheless, because annotating every frame of a long video is costly, the amount of labeled data in public datasets is limited, making it hard to train such models effectively. It is therefore crucial for action segmentation to exploit unlabeled data effectively during training, without requiring additional large-scale manual annotation.

Fig. 1.

The overview of Contrastive Temporal Domain Adaptation (CTDA) for video action segmentation with three modules: self-supervised learning (SSL), feature extraction, and action prediction. In the SSL module, domain adaptation and contrastive learning are performed together from the source and target domains. In the feature extraction module, one prediction generation stage and several refinement stages are included for better representation.

Apart from training with limited annotated data, another challenge of action segmentation is the significant spatio-temporal variation between training and test data. More specifically, spatio-temporal variation means that the same action is executed by different subjects with different styles. This can be regarded as a domain adaptation problem [24]: the source domain consists of videos with annotated labels, while the target domain consists of unlabeled videos. It is therefore essential to minimize this discrepancy and enhance the generalization ability of an action segmentation model.

This paper proposes a new Contrastive Temporal Domain Adaptation (CTDA) framework for action segmentation with limited training/source videos. We are mainly inspired by the closely related method SSTDA [3]. Specifically, we embed recent contrastive learning approaches into the following auxiliary domain prediction tasks. In binary domain prediction, the model performs one of the two auxiliary tasks, predicting the domain (source or target) of each frame from frame-level features that embed local temporal dynamics. In sequential domain prediction, the other auxiliary task, the model predicts the domain composition of video segments using video-level features that embed global temporal dynamics from untrimmed videos. Besides these two self-supervised auxiliary tasks, we also introduce contrastive learning into our model to obtain stronger feature representations. For contrastive learning, a negative pair is defined as two video frames/segments from different domains, while a positive pair consists of frames/segments from the same domain. The contrastive objective forces videos of the same domain closer together in the latent space while pushing videos from different domains apart. Thus, contrastive learning enhances the prediction ability of the two auxiliary tasks without any annotations. Furthermore, we devise a multi-stage architecture to better obtain the final action segmentation results, with two improvements: (a) the prediction generation stage uses a dual dilated layer to combine larger and smaller receptive fields; (b) it is followed by an efficient channel attention module to increase the information interaction across channels.
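To make the pairing rule concrete, the following is a minimal sketch of how such a domain-based contrastive objective could be written in PyTorch. The function name, feature shapes, and temperature are our own assumptions for illustration, not the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def domain_contrastive_loss(src_feat, tgt_feat, temperature=0.1):
    """Illustrative sketch: frames/segments from the same domain form positive
    pairs and are pulled together; cross-domain pairs are negatives and are
    pushed apart. src_feat, tgt_feat: (N, D) feature batches (assumed shapes)."""
    src = F.normalize(src_feat, dim=1)
    tgt = F.normalize(tgt_feat, dim=1)
    feats = torch.cat([src, tgt], dim=0)                       # (2N, D)
    domains = torch.cat([torch.zeros(src.size(0)),             # 0 = source
                         torch.ones(tgt.size(0))])             # 1 = target
    sim = feats @ feats.t() / temperature                      # pairwise similarity
    self_mask = torch.eye(feats.size(0), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float('-inf'))            # ignore self-pairs
    pos_mask = (domains.unsqueeze(0) == domains.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_mask.sum(dim=1)
    return loss.mean()
```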

Fig. 1 illustrates the CTDA framework with its three modules: (1) the self-supervised learning (SSL) module, (2) the feature extraction module, and (3) the action prediction module. In the SSL module, the labeled data constitutes the source domain and the unlabeled data constitutes the target domain; domain adaptation and contrastive learning are then performed together. The feature extraction module consists of one prediction generation stage and several refinement stages. The action prediction module outputs the predicted action for each frame of a long video. We explore diverse combinations of recent domain adaptation and contrastive learning approaches to reduce the domain discrepancy (i.e., spatio-temporal variations) and find that contrastive learning combined with domain adaptation improves video action segmentation.

In summary, this paper makes three major contributions:

(1) We are the first to propose a novel contrastive temporal domain adaptation (CTDA) approach to video action segmentation. Different from the closely-related method SSTDA [3], we apply the latest contrastive learning approaches to two auxiliary tasks within the domain adaptation framework.

(2) The proposed CTDA improves the prediction generation stage of the multi-stage architecture with a dual dilated layer that combines larger and smaller receptive fields, followed by an embedded efficient channel attention module that increases the information interaction across channels for action segmentation.

(3) Extensive experiments show that CTDA achieves the best results on two public benchmarks and maintains comparable performance even with less training data.

Fig. 2.

Network architecture of the proposed Contrastive Temporal Domain Adaptation (CTDA) framework for action segmentation. The two auxiliary tasks (binary and sequential domain prediction) are performed by domain adaptation and contrastive learning together within the self-supervised learning paradigm. We improve the prediction generation stage with a dual dilated layer, followed by an embedded efficient channel attention module.

2 Related Work

Domain Adaptation. DA is a representative transfer learning setting in which the source and target tasks are identical but the data distributions differ. Moreover, the data in the source domain are labeled, whereas the data in the target domain are not. RTN [20] uses a residual structure to learn the classifier difference. JAN [21] proposes JMMD to align the joint distributions of features. MADA [25] shows that aligning semantic information leads to better-aligned feature spaces. In this work, we investigate various domain adaptation approaches for learning from unlabeled target videos, combined with the same contrastive learning approaches in the CTDA framework.

Contrastive Learning. Contrastive learning is a recently emerged approach to self-supervised learning. SimCLRv2 [4] explores larger ResNet models and increases the capacity of the non-linear projection network. Momentum Contrast (MoCo) [13] builds a dynamic dictionary, and MoCov2 [5] adds an MLP-based projection head to obtain better representations. SimSiam [6] maximizes the similarity between two augmented views to learn meaningful representations. BYOL [12] uses two networks that interact and learn from each other without negative pairs. We evaluate self-supervised learning with these latest contrastive learning approaches, combined with the same domain adaptation approaches, in the CTDA framework.

Action Segmentation. Long-term dependencies are important for video action segmentation and high-level video understanding. The temporal convolutional network (TCN) [15] was first proposed for action segmentation with an encoder-decoder architecture. MS-TCN [7] designs multi-stage TCNs, and MS-TCN++ [19] further introduces a dual dilated layer to decouple the different phases. Very recently, self-supervised temporal domain adaptation (SSTDA) [3] has been introduced for the DA problem. This paper combines contrastive learning with temporal domain adaptation in a self-supervised manner, embedding the latest contrastive learning approaches in the auxiliary tasks to further reduce the spatio-temporal discrepancy for action segmentation.

3 Methodology

3.1 Framework Overview

Action segmentation is a high-level video understanding task that depends on a large amount of labeled videos with high-cost annotation. We redesign the self-supervised learning module to obtain better representations and to distinguish the differences between similar activities. To this end, we propose a novel Contrastive Temporal Domain Adaptation (CTDA) framework that integrates domain adaptation and contrastive learning in both auxiliary tasks.

Fig. 2 shows the proposed CTDA framework in detail. We select SSTDA [3] as the backbone because it is the first to integrate domain adaptation into MS-TCN and achieves strong performance. Furthermore, we improve the prediction generation stage with a dual dilated layer to combine larger and smaller receptive fields more effectively, followed by an embedded efficient channel attention module, which avoids feature dimension reduction while increasing the information interaction across channels. With these improvements, CTDA obtains better metrics and achieves comparable results with fewer training videos.

3.2 Self-supervised Learning Module

This paper combines contrastive learning with SSTDA [3] to enhance the SSL module, embedding the latest contrastive learning approaches in both binary and sequential domain prediction. Therefore, domain adaptation and contrastive learning are performed together in both auxiliary tasks.

In the self-supervised learning module, videos are processed in two ways and the two branches run in parallel. The first operates on video frames: the first auxiliary task is binary domain prediction, determining whether each frame comes from the source or the target domain. The second operates on video segments: the second auxiliary task is sequential domain prediction, predicting the correct domain combination of the segments. In summary, we embed contrastive learning at the corresponding positions of the two auxiliary tasks of domain adaptation (local and global), so that the two auxiliary predictions are jointly performed by domain adaptation and contrastive learning. Contrastive learning in the CTDA framework thereby improves the performance of the SSL module. A rough sketch of how the two auxiliary heads and the contrastive terms could be combined is given below.
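The sketch below is a hypothetical illustration of the two auxiliary heads sitting on top of the extracted features; every name, shape, and loss weight is an assumption for illustration rather than the actual SSTDA/CTDA code.

```python
import torch
import torch.nn as nn

class AuxiliaryHeads(nn.Module):
    """Hypothetical sketch of the two self-supervised heads.
    The binary head predicts each frame's domain (local); the sequential head
    predicts the domain ordering of shuffled video segments (global)."""
    def __init__(self, feat_dim, num_orderings):
        super().__init__()
        self.binary_head = nn.Linear(feat_dim, 2)             # source vs. target per frame
        self.seq_head = nn.Linear(feat_dim, num_orderings)    # domain ordering per video

    def forward(self, frame_feat, segment_feat):
        # frame_feat: (T, D) frame-level features, segment_feat: (S, D) segment features
        binary_logits = self.binary_head(frame_feat)          # (T, 2)
        seq_logits = self.seq_head(segment_feat.mean(dim=0))  # (num_orderings,)
        return binary_logits, seq_logits

# Assumed total self-supervised loss (weights are placeholders):
#   L_ssl = CE(binary_logits, frame_domains) + CE(seq_logits, ordering_label)
#         + lambda_local  * domain_contrastive_loss(src_frame_feat, tgt_frame_feat)
#         + lambda_global * domain_contrastive_loss(src_seg_feat,  tgt_seg_feat)
```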

3.3 Feature Extraction Module

We build on the SSTDA code base [3] and improve the prediction generation stage with a dual dilated layer, followed by efficient channel attention, which avoids feature dimension reduction while increasing the information interaction across channels.

Prediction Generation Stage. We follow the architecture of SSTDA [3] and make two improvements: (1) a well-designed dual dilated layer to obtain receptive fields of different sizes, and (2) an embedded efficient channel attention (ECA) module for feature interaction between channels. The prediction generation stage is described in Fig. 3(a). Specifically, in Net1, SSTDA uses only one convolution with dilation factor \(2^l\); we introduce a dual dilated layer (DDL) for optimization, with dilation factors \(2^l\) and \(2^{L-l}\) at each layer l, where L denotes the number of layers. To increase the information interaction between channels, we embed efficient channel attention as shown in Fig. 4.
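The sketch below illustrates, under our own assumptions about channel counts and layer layout, how a dual dilated layer with dilation factors \(2^l\) and \(2^{L-l}\) and a 1-D ECA block could be implemented over features of shape (batch, channels, time); it is a simplified illustration rather than the exact layer used in the paper.

```python
import torch
import torch.nn as nn

class DualDilatedLayer(nn.Module):
    """Sketch of a dual dilated layer: two parallel dilated temporal convolutions
    with factors 2**l and 2**(L-l), fused and added back through a residual path."""
    def __init__(self, channels, l, L):
        super().__init__()
        self.conv_small = nn.Conv1d(channels, channels, 3, padding=2**l, dilation=2**l)
        self.conv_large = nn.Conv1d(channels, channels, 3, padding=2**(L-l), dilation=2**(L-l))
        self.fuse = nn.Conv1d(2 * channels, channels, 1)
        self.out = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                                   # x: (B, C, T)
        y = torch.cat([self.conv_small(x), self.conv_large(x)], dim=1)
        y = self.out(torch.relu(self.fuse(y)))
        return x + y                                        # residual connection

class ECA(nn.Module):
    """Sketch of efficient channel attention for 1-D features: channel weights come
    from a small 1-D convolution over the temporally pooled signal, so no
    dimensionality reduction of the channels is needed."""
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):                                   # x: (B, C, T)
        w = x.mean(dim=-1, keepdim=True)                    # (B, C, 1) global temporal pooling
        w = self.conv(w.transpose(1, 2)).transpose(1, 2)    # cross-channel interaction
        return x * torch.sigmoid(w)                         # re-weight channels
```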

Refinement Stage. Recent work shows that combining larger and smaller receptive fields with a DDL only in the prediction generation stage [19] already achieves better performance. As shown in Fig. 3(b), we follow the design of SSTDA [3] without changes in the refinement stage: it is a simplified version of the prediction generation stage in which the DDL is replaced by a single dilated convolution with factor \(2^l\), and the other operations remain unchanged.

Fig. 3.

An illustration of the prediction generation stage and the refinement stages.

Fig. 4.

An illustration of efficient channel attention (ECA).

3.4 Implementation Details

We build on the SSTDA code base [3] in PyTorch [23]. Following the official implementation, we use the same four-stage architecture; see Fig. 2 for details.
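For orientation, below is a highly simplified sketch of the four-stage arrangement (one prediction generation stage followed by three refinement stages, each refining the previous stage's softmax output); the internals shown here are placeholders, not the official code.

```python
import torch.nn as nn

class MultiStageModel(nn.Module):
    """Sketch of the four-stage architecture: stage 1 generates initial frame-wise
    predictions from input features, stages 2-4 refine the previous predictions."""
    def __init__(self, feat_dim, num_classes, channels=64):
        super().__init__()
        self.generation = nn.Sequential(
            nn.Conv1d(feat_dim, channels, 1),
            # ... dual dilated layers and ECA would go here ...
            nn.Conv1d(channels, num_classes, 1))
        self.refinements = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(num_classes, channels, 1),
                # ... plain dilated layers would go here ...
                nn.Conv1d(channels, num_classes, 1))
            for _ in range(3)])

    def forward(self, x):                        # x: (B, feat_dim, T)
        outputs = [self.generation(x)]
        for stage in self.refinements:
            outputs.append(stage(outputs[-1].softmax(dim=1)))
        return outputs                           # per-stage frame-wise logits
```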

Table 1. Comparison with the state-of-the-arts for action segmentation on 50Salads and GTEA datasets.

4 Experiments

4.1 Datasets and Evaluation Metrics

Datasets. This paper selects two challenging action segmentation datasets: GTEA [8] and 50Salads [30]. Since the standard training/validation splits contain different people performing the same actions, a series of domain-shift problems may arise. We therefore follow the strategy first proposed in [3].

The GTEA dataset covers seven kinds of daily kitchen activities (e.g., making cheese sandwiches and tea) performed by four human subjects and recorded from a camera mounted on the actor's head, for a total of 28 egocentric videos. The 50Salads dataset consists of 50 videos covering 17 activity categories related to salad preparation in the kitchen (e.g., cut tomato), performed by 25 human subjects who each prepare two different salads.

Following previous work, we pre-extract I3D feature sequences for the videos of both datasets, using a model trained on the Kinetics dataset [1], and feed these features as input to our framework.

Evaluation Metrics. To evaluate performance, we adopt the same three widely used evaluation metrics as in [15]: frame-wise accuracy, segmental edit distance (Edit), and segmental F1 score.
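For reference, the sketch below computes frame-wise accuracy and a segmental edit score from frame-level label sequences. It follows our reading of the standard definitions; the exact normalization used in [15] may differ.

```python
def frame_accuracy(pred, gt):
    """Fraction of frames whose predicted label matches the ground truth."""
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def to_segments(labels):
    """Collapse a frame-wise label list into its sequence of action segments."""
    segs = [labels[0]]
    for label in labels[1:]:
        if label != segs[-1]:
            segs.append(label)
    return segs

def edit_score(pred, gt):
    """Segmental edit score: 1 minus the normalized Levenshtein distance between
    the predicted and ground-truth segment sequences (penalizes over-segmentation)."""
    p, g = to_segments(pred), to_segments(gt)
    d = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        d[i][0] = i
    for j in range(len(g) + 1):
        d[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (p[i - 1] != g[j - 1]))  # substitution
    return 1.0 - d[len(p)][len(g)] / max(len(p), len(g))
```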

4.2 Experimental Results

Comparison to State-of-the-Art. We evaluate the CTDA framework on two challenging action segmentation datasets: GTEA [8] and 50Salads [30]. As shown in Table 1, the CTDA framework achieves the highest performance on both datasets.

On 50Salads, a relatively small dataset, the CTDA framework achieves significant improvement. Compared with SSTDA [3], the F1@{10,25,50} scores are 85.5, 84.2, and 76.9, respectively, an increase of about 3 points on average. Of the other two metrics, segmental edit distance and frame-wise accuracy, the former is relatively more important because it penalizes over-segmentation errors; edit distance also improves by about 3 points, and frame-wise accuracy rises from 83.2 to 84.5. In summary, when using the full training data, our CTDA framework improves the F1 scores and segmental edit distance on average by 3 points over SSTDA, 5 points over MS-TCN++, and 10 points over MS-TCN.

On GTEA, another small dataset, our CTDA framework also achieves an improvement, though smaller than on 50Salads, probably because the dataset and its activities are relatively simple. From Table 1, CTDA achieves a slight increase in F1 score and segmental edit distance compared to other approaches, with similar frame-wise accuracy, indicating that we further reduce the impact of over-segmentation and obtain better video action segmentation.

Table 2. Results of CTDA with various embedded CL methods when using less labeled training data (m%) on 50Salads.
Table 3. Ablation Study on CL.
Table 4. Ablation Study on DA.
Table 5. Ablation Study on dual dilated layer (DDL).
Table 6. Ablation Study on embedded efficient channel attention (ECA).

Performance with Less Training Data. Given the results in Table 1, where our CTDA framework achieves improvements using unlabeled target videos, we further train it with fewer labeled frames in this section. The process follows the SSTDA setup [3]. Table 2 reports the results of training with different percentages of labeled training data (m%) using four recent contrastive learning approaches (MoCov2 [5], SimCLRv2 [4], SimSiam [6], and BYOL [12]). For 50Salads, we reduce the labeled training data from 95% to 65% in steps of 10%, with a lower limit of 60%.

To facilitate the analysis, we take BYOL embedded in the CTDA framework as an example. Using only 60% of the data for training, the model still achieves strong performance: F1@{10,25,50} of 78.8, 76.4, and 67.5, segmental edit distance of 70.5, and frame-wise accuracy of 81.3, which exceeds SSTDA trained with 65% of the data and the best results of MS-TCN trained with 100% of the data. With 75% of the data, our performance exceeds MS-TCN++ with 100% training data on all metrics, and with 85% it also exceeds SSTDA with 100% training data. In general, the CTDA framework needs only 85% of the training data to obtain the best performance on 50Salads. Table 2 also reports the results for the other contrastive learning approaches embedded in the CTDA framework.

Fig. 5.

Qualitative results of various temporal action segmentation approaches, with action segments color-coded (best viewed in color). (a) Activity Make salad by Subject 04 from the 50Salads dataset. (b) Activity Make coffee from the GTEA dataset.

4.3 Ablation Study

Different Contrastive Learning Approaches. We conduct extensive experiments on four contrastive learning approaches (MoCov2 [5], SimCLRv2 [4], SimSiam [6], and BYOL [12]), embedding them after the binary classifier that predicts whether a frame belongs to the source or target domain (local mode), after the segment classifier that predicts a segment sequence (global mode), and after both (both mode). Table 3 compares the different contrastive learning settings on 50Salads. The results show that (1) the local contrastive learning mode outperforms the non-contrastive baseline on average, (2) using both contrastive learning modes outperforms the local mode alone, and (3) consequently, performing both auxiliary tasks with domain adaptation and contrastive learning together maximizes the effect of self-supervised learning, rather than applying it to binary domain prediction only.

Different Domain Adaptation Approaches. We perform an ablation study on different domain adaptation approaches to determine which one learns best from unlabeled target videos when combined with the same contrastive learning approach in our CTDA framework. We take BYOL [12] as the contrastive learning approach; see Table 4 for details. The CTDA framework combined with SSTDA clearly outperforms the other domain adaptation approaches. The results indicate that SSTDA aligns temporal dynamics better than the other DA strategies when combined with contrastive learning within our CTDA framework.

Different Embedded Components of the Framework. We further conduct two experiments on the embedded components of the CTDA framework to demonstrate the effectiveness of the dual dilated layer (DDL) and efficient channel attention (ECA). Table 5 shows the impact of the DDL, and Table 6 shows the effect of ECA. Comparing the two tables, both DDL and ECA improve the results, and the DDL is the more critical of the two.

4.4 Qualitative Effects

Fig. 5 shows qualitative segmentation results on the two datasets: (a) the activity Make salad by Subject 04 from 50Salads, and (b) the activity Make coffee from GTEA. We analyze Fig. 5(b) in more detail. The activity Make coffee includes a series of actions (e.g., take, open, scoop, pour, close, put, stir, and background). Each action segmentation approach in Fig. 5 predicts the action categories and segment durations, and our framework improves the sequence accuracy. For example, at the beginning of Fig. 5(b), the CTDA framework predicts the temporal boundaries of scoop and pour more accurately, and later better predicts the boundaries of take and pour, adapting to the target domain more effectively. Therefore, the CTDA framework reduces prediction errors and achieves better action segmentation performance.

5 Conclusions

This paper proposes a novel self-supervised Contrastive Temporal Domain Adaptation (CTDA) framework for efficiently utilizing unlabeled target videos when training videos are limited. We are the first to combine contrastive learning and temporal domain adaptation for action segmentation. Furthermore, a multi-stage architecture is devised to obtain the final action segmentation results. Extensive experiments on two action segmentation benchmarks demonstrate that CTDA achieves the best results. In future work, we will further improve the prediction of action segments, addressing cases where the estimated temporal boundaries deviate excessively from the ground truth by embedding an additional module to refine the segment boundary information.