
1 Introduction

The difficulty of collecting large-scale data and labels has promoted research on few-shot learning. Building upon the success of few-shot learning for image understanding tasks [10, 16, 26, 33, 36, 37, 53, 54, 62], many works have begun to focus on the few-shot action recognition task [28]. Once realized, such techniques could greatly alleviate the cost of video labeling [20] and promote real-world applications where labels in certain scenarios are hard to acquire [25].

Most few-shot action recognition works determine the category of a query video using its similarity to the few labeled support videos. Many works [18, 73, 74] follow ProtoNet [47] to first learn a prototype for each video and compute the video similarity based on the similarity of the prototypes. To better consider the temporal dependencies in videos (e.g., temporal ordering and relation), recent works construct sub-sequence prototypes using different parts of the videos and calculate video similarity by matching the support-query prototypes [5, 44].

While temporal dependencies are considered, limitations still exist in previous methods. First, without considering spatial information, previous methods cannot fully exploit the spatiotemporal relations in videos, which are needed to distinguish actions like “put A on B” and “put A beside B” that differ only in the relative position of the objects. Second, the sub-sequence prototypes come from fixed temporal locations, so they cannot handle well actions that happen at different speeds. Third, it is computationally costly to exhaustively compute the similarity between all pairs of sub-sequence prototypes [44].

To address the limitations and achieve more robust few-shot action recognition, we explore how to: (1) better generate prototypes that can robustly encode spatiotemporal relation in the videos, (2) enable the prototypes to flexibly encode the actions done with different lengths and speeds, and (3) match the prototypes between two videos without exhaustively comparing all prototype pairs.

One straightforward way to address the first point is to use object features extracted from object bounding boxes [22]. However, in our preliminary experiments we observed only a limited performance gain when previous methods [44, 70] directly use them as additional inputs. In this work, we propose a multi-relation encoder to effectively encode the object features by considering the spatiotemporal relations among objects across frames, the temporal relations between frame-wise features, as well as the relations between object features and frame-wise features.

For the second and third points, we propose to generate global prototypes that consider all frames in the input video. This is done by taking advantage of the self-attention mechanism of Transformers [51]. Since it is difficult to represent a wide variety of actions with a single prototype, we instead use a group of prototypes to represent each action. We compute support-query similarity with a fixed 1-to-1 matching, i.e., the i-th prototype of the query video is always matched with the i-th prototype of the support video. Thus, during training, each prototype tries to capture a certain aspect of the video. To prevent all prototypes from collapsing to the same representation, we apply a diversity loss when learning the prototypes, so that they are encouraged to capture different aspects of the action (e.g., one prototype captures the start of the action, and another captures how the action evolves).

However, learning prototypes that represent different aspects of the action (e.g., its start or end) is a difficult task even with annotation [9, 40, 41]. Thus, it is not sufficient to compute video similarity only with the global prototypes. To make the similarity measurement more reliable, we generate another group of focused prototypes, where each prototype is encouraged to focus on certain timestamps of the video. Since actions may happen in different parts of the videos and at different speeds [34], directly comparing the same timestamps of two videos is not appropriate. We therefore use bipartite matching to match the focused prototypes between the support and query videos, enabling the comparison of actions with different lengths and speeds.

Our method uses the compound of two groups of prototypes, which we call compound prototypes, for calculating video similarity. On multiple benchmark datasets, our method outperforms previous methods significantly when only one example is available, demonstrating the superiority of using our compound prototypes in similarity measurement.

To summarize, our key contributions include:

  • A novel method for few-shot action recognition based on generating and matching compound prototypes.

  • Our method achieves state-of-the-art performance on multiple benchmark datasets [7, 19, 30], outperforming previous methods by a large margin.

  • A detailed ablation study showing the usefulness of leveraging object information for few-shot action recognition and demonstrating how the two groups of prototypes encode the video from complementary perspectives.

2 Related Works

Few-Shot Image Classification methods can be broadly divided into three categories. Transfer-learning based methods [14, 45, 57] use pre-training and fine-tuning to improve the performance of deep backbone networks on few-shot learning. The second line of work focuses on rapidly learning an optimized classifier with limited training data [1, 2, 17, 21, 46, 58, 67]. The third direction is based on metric learning, whose goal is to learn more generalizable feature embeddings under a certain distance metric [27, 29, 47, 52]. The key to metric-learning based methods is to generate robust representations of data under a certain metric so that they can generalize to novel categories with few labeled samples. Our work falls into this line of research and focuses on the more challenging video setting.

Few-Shot Action Recognition methods [3, 24, 34, 42, 43, 50, 61, 74, 75] mainly fall into the metric-learning framework. Many works follow the scheme of ProtoNet [47] and compute video similarity based on generated prototypes. For learning better prototypes, ProtoGAN [32] synthesizes additional features, CMN [72, 73] uses a multi-layer memory network, while ARN [69] uses jigsaws for self-supervision and enhances the video-level representation via spatial and temporal attention. There are also methods that perform pretraining with semantic labels [56, 59, 60] or use additional information such as depth [18] to augment video-level prototypes. Our method uses another form of additional information: object bounding boxes from a pretrained object detector [22].

The temporal variance of actions forms a major challenge in few-shot action recognition [34]. To better model temporal dependencies, recent research puts more focus on generating and matching sub-video level prototypes. OTAM [5] uses a generalized dynamic time warping technique [8] to monotonically match the prototypes between query and support videos. ITA-Net [70] first implicitly aggregates information for each frame using the other frames, and then conducts a 1-to-1 matching of all prototypes. However, these two methods use frame-wise prototypes and thus cannot capture higher-level temporal relations spanning multiple frames. Recently, TRX [44] constructs prototypes of different cardinalities for query and support videos, and calculates similarity by matching all prototype pairs. However, since TRX can only match prototypes of the same cardinality (e.g., three-frame prototypes match with three-frame prototypes), it is hard to align actions that evolve at different speeds (e.g., one takes 2 frames while the other takes 4 frames). Also, exhaustively matching all pairs is computationally expensive.

We summarize three main differences from previous works: (1) We encode spatiotemporal object information to form more robust prototypes. (2) We generate a compound of global and focused prototypes to represent actions from diverse perspectives. (3) The two groups of prototypes are efficiently matched to robustly compute video similarity.

Transformers [51] have recently achieved remarkable results in computer vision [6, 13, 15, 38, 64, 68]. FEAT [65] represents early work applying a Transformer to the few-shot learning task, and TRX [44] first introduced Transformers [15] into few-shot action recognition. Different from TRX, we apply a Transformer encoder-decoder to generate compound prototypes, and show that this is more effective in the few-shot action recognition scenario.

3 Method

3.1 Problem Setting

In few-shot action recognition, a model aims to classify an unlabeled (query) video into one of the target categories, each of which has only a limited number of labeled examples (the support set) [5, 44]. We follow the common practice [17, 52] of episodic training, where in each episode a C-way K-shot problem is sampled: the support set \(\mathcal {S}=\{\boldsymbol{X}^{j}_s\}_{j=1}^{C\times K}\) is composed of \(C\times K\) labeled videos from C different classes, where each class contains K samples. The query set contains N unlabeled videos \(\mathcal {Q} = \{\boldsymbol{X}_q^i\}_{i=1}^N\). The goal is to classify each video in the query set as one of the C classes.
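To make the episodic protocol concrete, below is a minimal sketch of C-way K-shot episode sampling. The `videos_by_class` mapping and the per-class query sampling are illustrative assumptions, not part of our method.

```python
import random

def sample_episode(videos_by_class, C=5, K=1, N=5):
    # videos_by_class: hypothetical dict mapping class name -> list of video paths
    classes = random.sample(sorted(videos_by_class), C)     # pick C classes for this episode
    support, query = [], []
    for label, cls in enumerate(classes):
        clips = random.sample(videos_by_class[cls], K + 1)  # assumes >= K+1 videos per class
        support += [(clip, label) for clip in clips[:K]]    # C*K labeled support videos
        query.append((clips[K], label))                     # one held-out query per class
    return support, query[:N]                               # N query videos to classify
```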

3.2 Proposed Method

We propose a new method for few-shot action recognition that generates and matches compound prototypes between query and support videos. As shown in Fig. 1, given an input video, a feature embedding network first extracts global (frame-wise) features \(\boldsymbol{F}_g\). To better model actions involving multiple objects, we obtain object bounding boxes from a pretrained object detector [22] and extract object features \(\boldsymbol{F}_o\) using the same embedding network. A multi-relation encoder then uses \(\boldsymbol{F}_g\) and \(\boldsymbol{F}_o\) to output multi-relation features \(\boldsymbol{F}_m\) containing spatiotemporal global-object relations. Next, a compound prototype decoder generates global prototypes \(\boldsymbol{P}_g\) and focused prototypes \(\boldsymbol{P}_f\) for each video. During similarity calculation, we use fixed 1-to-1 matching on the global prototypes and bipartite matching on the focused prototypes, encouraging the similarity to be computed robustly from diverse perspectives. We introduce the details of each component below.

Fig. 1. Illustration of our proposed method on a 3-way 1-shot problem. First, the videos are processed by an embedding network to acquire global (frame-level) features \(\boldsymbol{F}_g\) and object features \(\boldsymbol{F}_o\). Features \(\boldsymbol{F}_g\) and \(\boldsymbol{F}_o\) are equipped with 1D and 3D positional encoding (PE), respectively, and then used by a multi-relation encoder (Sect. 3.2) to encode global-global, global-object and object-object information into a multi-relation feature \(\boldsymbol{F}_m\). Using \(\boldsymbol{F}_m\), a Transformer-based compound prototype decoder transforms the learnable tokens \(\boldsymbol{T}_g, \boldsymbol{T}_f\) into compound prototypes that represent the input video (Sect. 3.2). The compound prototypes consist of several global prototypes \(\boldsymbol{P}_g\) (green squares) and focused prototypes \(\boldsymbol{P}_f\) (blue squares). Different losses and matching strategies are applied to the two groups, so that each global prototype captures a certain aspect of the action summarized from the whole video, and each focused prototype focuses on a specific temporal location of the video. The final similarity score is calculated as the average similarity of all matched prototype pairs between support and query videos. (Color figure online)

Feature Embedding. For each input video \(\boldsymbol{X} \in \mathcal {S} \cup \mathcal {Q}\), we sample T frames following the sampling strategy of TSN [55]. We use an embedding network to acquire a global (frame-level) feature representation for each video, \(\boldsymbol{F}_g\in \mathbb {R}^{T\times d}\), where d is the feature dimension. Additionally, we extract object features via ROI-Align [22] using the predicted bounding boxes on each frame. We keep only the B most confident boxes on each frame, forming object features \(\boldsymbol{F}_o\in \mathbb {R}^{BT\times d}\).
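As a rough illustration of this step, the sketch below pools frame-wise features from a ResNet-50 feature map and extracts object features with ROI-Align; the exact pooling and detector interface are assumptions for illustration, and the boxes are assumed to be given in image coordinates.

```python
import torch
import torchvision

# ResNet-50 truncated before global pooling, so we can pool both per frame and per box.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feature_map = torch.nn.Sequential(*list(resnet.children())[:-2])  # outputs (N, 2048, h, w)

def embed_video(frames, boxes):
    # frames: (T, 3, H, W) sampled frames; boxes: list of T tensors, each (B, 4) as (x1, y1, x2, y2)
    fmap = feature_map(frames)                 # (T, 2048, h, w)
    F_g = fmap.mean(dim=(2, 3))                # (T, d) frame-wise (global) features
    scale = fmap.shape[-1] / frames.shape[-1]  # map image coordinates to feature-map coordinates
    rois = torchvision.ops.roi_align(fmap, boxes, output_size=1, spatial_scale=scale)
    F_o = rois.flatten(1)                      # (B*T, d) object features
    return F_g, F_o
```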

Multi-relation Encoder. To generate prototypes that are more discriminative for actions involving multiple objects, we propose a multi-relation encoder to encode the spatiotemporal information from \(\boldsymbol{F}_g\) and \(\boldsymbol{F}_o\). We specifically consider three relations: the global-global (frame-wise) relation, the global-object relation, and the object-object relation, and apply a Transformer [51] as the base architecture to allow relation modeling across frames. As shown in Fig. 1, the encoder consists of three relation encoding transformers (RETs). The global-global RET (RET\(_{gg}\)) and the object-object RET (RET\(_{oo}\)) are identical except for their inputs. They take as input the global feature \(\boldsymbol{F}_g\) and the object feature \(\boldsymbol{F}_o\), respectively, and use the input to generate the query \(\boldsymbol{Q}\), key \(\boldsymbol{K}\) and value \(\boldsymbol{V}\) vectors of the Transformer:

$$\begin{aligned} \boldsymbol{F}_{gg} = RET _{gg}(\boldsymbol{Q}=\boldsymbol{K}=\boldsymbol{V}=\boldsymbol{F}_g) \end{aligned}$$
(1)
$$\begin{aligned} \boldsymbol{F}_{oo} = RET _{oo}(\boldsymbol{Q}=\boldsymbol{K}=\boldsymbol{V}=\boldsymbol{F}_o). \end{aligned}$$
(2)

The global-object RET (RET\(_{go}\)) works differently: it uses \(\boldsymbol{F}_g\) as the query vector and \(\boldsymbol{F}_o\) as the key and value vectors:

$$\begin{aligned} \boldsymbol{F}_{go} = RET _{go}(\boldsymbol{Q}=\boldsymbol{F}_g,\; \boldsymbol{K}=\boldsymbol{V}=\boldsymbol{F}_o). \end{aligned}$$
(3)

The output size of each RET equals the size of its input query vector. Thus, each of the T frames has \(B+2\) features of dimension d. We concatenate the three outputs into a multi-relation feature \(\boldsymbol{F}_m\in \mathbb {R}^{(B+2)T\times d}\).

Positional encoding (PE) has been shown to be effective in transformer-based architectures [38, 39, 51]. We also use PE but omit it from the equations for simplicity. For \(\boldsymbol{F}_{g}\) we use 1D PE to encode the temporal location of each frame, and for \(\boldsymbol{F}_{o}\) we apply 3D PE, encoding both the spatial and temporal location of each object.
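A minimal sketch of this encoder, written with single attention layers and omitting positional encodings and normalization, might look as follows; the head count and the use of `nn.MultiheadAttention` are our assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class MultiRelationEncoder(nn.Module):
    def __init__(self, d=2048, heads=8):
        super().__init__()
        self.ret_gg = nn.MultiheadAttention(d, heads, batch_first=True)  # global-global relation
        self.ret_oo = nn.MultiheadAttention(d, heads, batch_first=True)  # object-object relation
        self.ret_go = nn.MultiheadAttention(d, heads, batch_first=True)  # global-object relation

    def forward(self, F_g, F_o):
        # F_g: (1, T, d) frame-wise features; F_o: (1, B*T, d) object features
        F_gg, _ = self.ret_gg(F_g, F_g, F_g)         # Eq. (1): Q = K = V = F_g   -> (1, T, d)
        F_oo, _ = self.ret_oo(F_o, F_o, F_o)         # Eq. (2): Q = K = V = F_o   -> (1, B*T, d)
        F_go, _ = self.ret_go(F_g, F_o, F_o)         # Eq. (3): Q = F_g, K = V = F_o -> (1, T, d)
        return torch.cat([F_gg, F_go, F_oo], dim=1)  # F_m: (1, (B+2)*T, d)
```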

Compound Prototype Decoder. The compound prototype decoder also follows the Transformer architecture [6, 11, 49, 51], so that the prototypes can be generated by attending to the features of all frames. As shown in Fig. 1, the inputs to the prototype decoder are two groups of learnable tokens \(\boldsymbol{T}_g\in \mathbb {R}^{m_{g} \times d}\) and \(\boldsymbol{T}_f\in \mathbb {R}^{m_{f} \times d}\). A multi-head attention layer first encodes the tokens into \(\boldsymbol{\hat{T}}_g\) and \(\boldsymbol{\hat{T}}_f\), then another multi-head attention layer transforms them into two groups of prototypes \(\boldsymbol{P}_g=\{\boldsymbol{p}_{g,k}\}_{k=1}^{m_g}\in \mathbb {R}^{m_{g} \times d}\) and \(\boldsymbol{P}_f=\{\boldsymbol{p}_{f,k}\}_{k=1}^{m_f}\in \mathbb {R}^{m_{f} \times d}\). For simplicity, we omit the subscripts \(_{g,f}\) and all normalization layers, so the equations can be written as:

$$\begin{aligned} \boldsymbol{Q} = \boldsymbol{\hat{T}}\boldsymbol{W}_Q, \quad \boldsymbol{K} = \boldsymbol{F}_m\boldsymbol{W}_K, \quad \boldsymbol{V} = \boldsymbol{F}_m\boldsymbol{W}_V, \end{aligned}$$
(4)

where \(\boldsymbol{W}_Q, \boldsymbol{W}_K, \boldsymbol{W}_V \in \mathbb {R}^{d \times d}\) are linear projection weights, then we have

$$\begin{aligned} \boldsymbol{A} = softmax(\frac{\boldsymbol{Q}\boldsymbol{K}^T}{\sqrt{d}}), \quad \boldsymbol{P} = FFN(\boldsymbol{A}\boldsymbol{V}), \end{aligned}$$
(5)

where \(\boldsymbol{A} \in \mathbb {R}^{m\times (B+2)T}\) contains the attention weights and FFN denotes a feed-forward network.
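To make Eqs. 4 and 5 concrete, here is a single-head sketch of the attention step that turns the encoded tokens into prototypes; the dimensions and the FFN shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrototypeAttention(nn.Module):
    def __init__(self, d=2048):
        super().__init__()
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, T_hat, F_m):
        # T_hat: (m, d) encoded learnable tokens; F_m: ((B+2)*T, d) multi-relation features
        Q, K, V = self.W_Q(T_hat), self.W_K(F_m), self.W_V(F_m)    # Eq. (4)
        A = torch.softmax(Q @ K.t() / K.shape[-1] ** 0.5, dim=-1)  # (m, (B+2)*T) attention weights
        P = self.ffn(A @ V)                                        # Eq. (5): prototypes, (m, d)
        return P, A
```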

To encourage the prototypes to capture different aspects of the action, we apply constraints on the two groups of prototypes individually. For the global prototypes \(\boldsymbol{P}_g\), we add a diversity loss that penalizes their pairwise similarity:

$$\begin{aligned} L_{div} = \sum _{i\ne j} sim(\boldsymbol{p}_{g,i},\; \boldsymbol{p}_{g,j}), \end{aligned}$$
(6)

where sim denotes the cosine similarity function.

Learning \(\boldsymbol{P}_g\) to robustly represent each aspect of the action (e.g., the start of the action) is difficult even with annotation [63, 66]. To increase the overall robustness, for the focused prototypes \(\boldsymbol{P}_f\) we instead add a regularizer on the self-attention weights \(\boldsymbol{A}_f\) so that different \(\boldsymbol{p}_f\) focus on different temporal locations of the video:

$$\begin{aligned} L_{att} = \sum _{i\ne j} sim(\boldsymbol{\alpha }_{f,i},\; \boldsymbol{\alpha }_{f,j}). \end{aligned}$$
(7)

Here \(\boldsymbol{\alpha }_{f,i} \in \mathbb {R}^{(B+2)T}\) denotes the i-th row in \(\boldsymbol{A}_f\).
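Both Eq. 6 and Eq. 7 sum pairwise cosine similarities over the rows of a matrix, so in a sketch a single helper suffices (shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(X):
    # X: (m_g, d) global prototypes P_g for L_div, or (m_f, (B+2)*T) attention rows A_f for L_att
    X = F.normalize(X, dim=-1)
    S = X @ X.t()                                                    # (m, m) cosine similarities
    off_diag = ~torch.eye(len(X), dtype=torch.bool, device=X.device)
    return S[off_diag].sum()                                         # sum over all i != j

# L_div = pairwise_similarity_loss(P_g);  L_att = pairwise_similarity_loss(A_f)
```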

Compound Prototype Matching. We use a different matching strategy for each group of prototypes when calculating the overall similarity between two videos. As shown in Fig. 1, for the global prototypes \(\boldsymbol{P}_g\), we match the query and support prototypes in a 1-to-1 manner, i.e., the i-th prototype of the query video is always matched with the i-th prototype of the support video. To calculate the global prototypes’ overall similarity score between video a and video b, we average the similarity scores of all \(m_g\) global prototypes:

$$\begin{aligned} s_g^{a,b} = \frac{1}{m_g}\sum _{k=1}^{m_g}sim(\boldsymbol{p}_{g,k}^a,\; \boldsymbol{p}_{g,k}^b). \end{aligned}$$
(8)

Thus, in order to maximize the similarity of correct video pairs and minimize that of incorrect video pairs during episodic training, each \(\boldsymbol{p}_g\) learns to encode a certain aspect of the action, e.g., the start of the action. This phenomenon is supported by our experiments in Sect. 4.

For the focused prototypes \(\boldsymbol{P}_f\), we apply a bipartite matching-based similarity measure. Since actions may happen at different temporal positions in the videos, bipartite matching enables the temporal alignment of actions, allowing the comparison of actions with different lengths and speeds. Formally, for \(\boldsymbol{P}_f^a\) of video a and \(\boldsymbol{P}_f^b\) of video b, we find a bipartite matching between the two sets of prototypes by searching for the permutation of \(m_f\) elements with the highest cosine similarity using the Hungarian algorithm [31]. Denoting the best permutation by \(\sigma \), the similarity score of \(\boldsymbol{P}_f^a\) and \(\boldsymbol{P}_f^b\) is calculated by:

$$\begin{aligned} s_f^{a,b} = \frac{1}{m_f}\sum _{k=1}^{m_f} sim (\boldsymbol{p}_{f,k}^a,\; \boldsymbol{p}_{f,\sigma (k)}^b). \end{aligned}$$
(9)

Finally, the similarity score is computed as a weighted average of \(s_g\) and \(s_f\): \(s^{a,b} = \lambda _1 s_g^{a,b} + \lambda _2 s_f^{a,b}\). During training, this similarity score is directly used as the logit for the cross-entropy loss \(L_{ce}\). The total loss is a weighted sum of three terms: \(L = w_1L_{ce} + w_2L_{div} + w_3L_{att}\). During inference, we assign to the query video the label of the most similar video in the support set.
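The sketch below illustrates the full similarity computation under the notation above (fixed 1-to-1 matching for \(\boldsymbol{P}_g\), bipartite matching for \(\boldsymbol{P}_f\)), with SciPy's `linear_sum_assignment` standing in for the Hungarian algorithm; it is a simplified instance, not the exact implementation.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def video_similarity(Pg_a, Pf_a, Pg_b, Pf_b, lam1=0.5, lam2=0.5):
    # Pg_*: (m_g, d) global prototypes; Pf_*: (m_f, d) focused prototypes
    s_g = F.cosine_similarity(Pg_a, Pg_b, dim=-1).mean()              # Eq. (8): fixed 1-to-1 matching
    sim = F.normalize(Pf_a, dim=-1) @ F.normalize(Pf_b, dim=-1).t()   # (m_f, m_f) cosine similarities
    row, col = linear_sum_assignment(-sim.detach().cpu().numpy())     # best permutation sigma
    s_f = sim[torch.as_tensor(row), torch.as_tensor(col)].mean()      # Eq. (9)
    return lam1 * s_g + lam2 * s_f                                    # weighted combination s^{a,b}

# During training, the scores of a query against the C support classes act as logits for L_ce.
```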

4 Experiments

We conduct experiments on four public datasets. Kinetics [7] and Something-Something V2 (SSv2) [19] are the two most frequently used benchmarks for few-shot action recognition. Both datasets are split into 64/12/24 classes for train/val/test. For SSv2, we use both the split from CMN [72] (SSv2\(^\circ \)) and the split from OTAM [5] (SSv2\(^\sharp \)). Recently, Zhang et al. [69] proposed new splits for the HMDB [30] and UCF [48] datasets; we also conduct experiments on these two datasets using the splits from [69].

Since our method is competitive in both the standard 1-shot 5-way setting and the 5-shot 5-way setting, in this section we only present 1-shot results and place the 5-shot results in the supplementary material due to the page limit. Following previous works, we report the average result over 10,000 test episodes.

Baselines. We compare our method with recent works reporting state-of-the-art performance, including MatchNet [52], CMN [73], OTAM [5], TRN [71], ARN [69], TRX [44], and ITA-Net [70]. Following [74], we also compare with the few-shot image classification model FEAT [65], which is also based on Transformers. Since no previous work uses an object detector for few-shot action recognition, for a fair comparison we conduct experiments in two settings: (1) we give baseline methods the same input (both \(\boldsymbol{F}_g\) and \(\boldsymbol{F}_o\)) as our method and compare the performance; we denote the baselines in this setting as “Baseline+”. (2) We discard the object detector in our method and use only \(\boldsymbol{F}_g\) as input and \(RET_{gg}\) as the encoder; we denote our method in this setting as “Ours-”.

To enable previous methods to take object features as input, we compare with TRX+, FEAT+, and ITA-Net+ because they also use transformer-based architectures like ours, so no modification of the model architecture is needed. For completeness, we also compare with MatchNet+ and TRN+, which do not use transformer architectures. For these two methods, we reshape the object features \(\boldsymbol{F}_o\) to \(\boldsymbol{F}_o' \in \mathbb {R}^{T\times Bd}\) as input. Since these two methods were not designed to take object features as input, we train an ensemble of two networks, one with \(\boldsymbol{F}_g\) as input and the other with \(\boldsymbol{F}_o'\) as input. We use the public implementation of TRX and implement ITA-Net ourselves. More details about the baseline implementations can be found in the supplementary material. Recently, several works [4, 56, 74] pretrain the backbone on the meta-training set and found this pretraining useful for few-shot action recognition. To remain comparable with the majority of prior works, we do not follow this setting in our experiments.

Implementation Details. We use a ResNet-50 [23] pretrained on ImageNet [12] as the backbone of the embedding network, and a fixed Mask-RCNN [22] trained on the COCO dataset [35] as the bounding box extractor. We select the \(B=3\) most confident object bounding boxes per frame. The pre-processing steps follow OTAM [5]: we sample \(T=8\) frames with random crop during training and center crop during inference. The model and the backbone are optimized using SGD with an initial learning rate of 0.001, decayed by 0.1 every 20 epochs. The embedding network, except its first BatchNorm layer, is fine-tuned with a 1/10 learning rate. Training stops when the loss on the validation set exceeds the average of the previous 5 epochs. Unless otherwise stated, we report results using \(m_g=m_f=8\), \(\lambda _1=\lambda _2=0.5\), \(w_1=1, w_2=w_3=0.1\).
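As a rough sketch of this optimization setup, the snippet below uses placeholder modules; the momentum value and the parameter grouping are illustrative assumptions not specified above.

```python
import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)  # placeholder for the ResNet-50 embedding network
head = nn.Linear(8, 8)      # placeholder for the encoder, decoder, and learnable tokens

optimizer = torch.optim.SGD(
    [{"params": head.parameters()},
     {"params": backbone.parameters(), "lr": 1e-4}],  # backbone fine-tuned at 1/10 of the base lr
    lr=1e-3, momentum=0.9)                            # momentum value is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # decay by 0.1 every 20 epochs
```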

Table 1. Results of 5-way 1-shot experiments on 5 dataset splits. Methods marked with * indicate results of our implementation, with the originally reported results shown in parentheses. The upper and bottom blocks show results without and with object features, respectively. For ITA-Net, the result on SSv2\(^\circ \) comes from Table 5 of [70].

4.1 Results

Table 1 compares our method with the baselines. In the upper block, our model slightly outperforms previous works on 4 of the 5 dataset splits even without using the object detector. This suggests that, compared with other prototype generation methods, our approach of generating and matching compound prototypes enables better similarity measurement for few-shot action recognition. From the comparison of the baseline methods with their “+” variants, we can see that these methods cannot fully exploit the information brought by the object features. When object features are added as input, our full model significantly outperforms previous methods using the same input. Compared with our method in the upper block, the object features significantly strengthen our compound prototype matching scheme, improving the accuracy on the SSv2 datasets by over 10%, on Kinetics by 7.7%, and on HMDB by over 20%. In the ablation study, we show that even better performance can be achieved by carefully adjusting the numbers of prototypes \(m_g\) and \(m_f\).

Table 2. Comparison of different methods with and without object information, and with and without our proposed encoder.
Table 3. Comparison of our model using different encoding relations and different numbers of global and focused prototypes.

Our method does not achieve state-of-the-art results on the UCF dataset. One reason is that classes in the UCF dataset can easily be distinguished from appearance alone, which causes our multi-relation encoder to overfit. If we remove this encoder and directly feed the concatenation of object and global features to the compound prototype decoder, our method achieves a state-of-the-art accuracy of 87.7. This reveals one limitation of our method: it tends to overfit on simpler datasets.

4.2 Ablation Study

Effect of Object Features. While previous studies have leveraged additional information for few-shot action recognition [18], no work has investigated the use of object features as in our method. One may argue that the performance improvement of our method comes only from the use of object features. However, Table 1 shows that the performance gain of previous methods is limited when object features are additionally used as input. Here, we show that a boost in performance only occurs when object features, our multi-relation encoder, and our decoder are used together.

We test the performance of multiple methods with and without object information, and also with and without our multi-relation encoder. In Table 2, the first block uses neither object features nor our multi-relation encoder, and the second block uses our encoder but with only the global-global relation \(RET_{gg}\). Comparing these two blocks, we can see that our multi-relation encoder improves the performance of all methods, but not significantly. In the third block of Table 2, we concatenate each frame-wise feature with its corresponding object features as input. The comparison between this block and the first block shows the improvement brought by the object features: around 2–5% on the SSv2 dataset and 2–6% on the Kinetics dataset. Finally, in the fourth block of Table 2, both the object features and our multi-relation encoder are used. Comparing this block with the second and third blocks, all methods improve further. This shows that using our multi-relation encoder to consider multiple relations across frames better leverages the information brought by the object features. Among all methods in the fourth block of Table 2, our method enjoys the largest performance gain, indicating that while the object features bring additional information, our method leverages this information best for the few-shot action recognition task.

Impact of Multi-relation Feature Encoding. To see how the multi-relation feature encoding contributes to the performance, we conduct an ablation study of our method using only subsets of \(\{\boldsymbol{F}_{gg}, \boldsymbol{F}_{go}, \boldsymbol{F}_{oo}\}\). We also vary the number of global prototypes \(m_g\) and focused prototypes \(m_f\) to see the influence of feature encoding on each group of prototypes.

Results are shown in Table 3. In the experiments with \(m_g=m_f=8\), the SSv2\(^\sharp \) dataset benefits greatly from the object-object feature \(\boldsymbol{F}_{oo}\), while the Kinetics dataset benefits more from the global-global feature \(\boldsymbol{F}_{gg}\). This is reasonable since the SSv2 dataset includes more actions involving multiple objects. In the experiments with \(m_g=16\), the global prototypes work almost equally well with each of the three encoded features on both datasets. The experiments with \(m_f=16\) suggest that focused prototypes work better with global-global relations. When using all three features (last row of each block), the comparison between different choices of \(m_g\) and \(m_f\) indicates that the two groups of prototypes capture complementary aspects of the action, since performance improves significantly when both groups of prototypes are present. A more detailed figure showing the accuracy difference for each action class can be found in the supplementary material.

4.3 Analysis of Compound Prototypes

The core of our proposed method is the generation and matching of compound prototypes. In this section we conduct extensive experiments to get a more comprehensive understanding of the two groups of prototypes.

Fig. 2. Visualization of the self-attention weights of two global prototypes \(\boldsymbol{p}_g\) and two focused prototypes \(\boldsymbol{p}_f\) on each timestamp of the input. Attention weights higher than average (0.125) are marked in black. We can see that the global prototypes capture a certain aspect of the action in the video, regardless of temporal location: \(\boldsymbol{p}_{g,2}\) - the start of the action; \(\boldsymbol{p}_{g,6}\) - the frames without a hand. Meanwhile, the focused prototypes mainly attend to fixed timestamps of the video: \(\boldsymbol{p}_{f,1}\) - the end of the video; \(\boldsymbol{p}_{f,2}\) - the middle part of the video. The example on the left comes from the SSv2\(^\circ \) dataset and the example on the right is from SSv2\(^\sharp \). Video similarity scores s and similarity scores of matched prototypes \(\boldsymbol{p}_*\sim \boldsymbol{p}_*\) are shown at the bottom.

What Aspect of the Action Does Each Prototype Capture? We address this question by investigating the attention operation that generates the prototypes. From Eq. 5, the attention weight \(\boldsymbol{\alpha }\in \mathbb {R}^{(B+2)T}\) of each prototype indicates from which parts of the video it gathers its information. To better understand the prototypes, we visualize this attention in Fig. 2 using two 1-shot 2-way examples. In the visualization, we average the \(B+2\) attention weights within each frame, forming \(\boldsymbol{\tilde{\alpha }} \in \mathbb {R}^{T}\), and show this averaged value for each of the \(T=8\) frames. For clarity, we only show two global prototypes and two focused prototypes per video. We also show the video similarity scores and the similarity of matched prototypes at the bottom of the figure.
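A sketch of this averaging step is given below; the token layout of \(\boldsymbol{F}_m\) (T global-global tokens, then T global-object tokens, then the B*T object-object tokens in frame-major order) is our assumption for illustration.

```python
import torch

def per_frame_attention(alpha, B=3, T=8):
    # alpha: ((B+2)*T,) attention row of one prototype over all encoded tokens.
    # Assumed layout: [T global-global | T global-object | B*T object-object, frame-major].
    a_gg, a_go, a_oo = alpha[:T], alpha[T:2 * T], alpha[2 * T:].view(T, B)
    return (a_gg + a_go + a_oo.sum(dim=1)) / (B + 2)  # (T,) averaged attention per frame
```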

The visualization is shown in Fig. 2. We first analyze the attention of each prototype. In all videos, the global prototype \(\boldsymbol{p}_{g,2}\) has high attention weights at the start of the action (not the start of the video), and \(\boldsymbol{p}_{g,6}\) pays more attention to the frames whose appearance differs from the other frames (no hand present). This is expected since \(L_{div}\) forces each global prototype to be different, while the 1-to-1 matching encourages corresponding global prototypes to focus on similar aspects so that the correct video similarity can be predicted. For the focused prototypes, \(\boldsymbol{p}_{f,1}\) usually gives high attention to the last few frames, and \(\boldsymbol{p}_{f,2}\) pays more attention to the middle frames. This is also expected since \(L_{att}\) discourages the focused prototypes from attending to similar temporal locations, and bipartite matching allows similar actions to be matched even when they occur at different temporal locations of the videos. A similar phenomenon exists in the object detection task [6], where each object query focuses on detecting objects in a specific spatial region of the image.

In the left example, both prototype pairs \(<\boldsymbol{p}_{g,2}^a,\boldsymbol{p}_{g,2}^b>\) and \(<\boldsymbol{p}_{g,6}^a,\boldsymbol{p}_{g,6}^b>\) have high similarity scores (0.53 and 0.49, shown at the bottom of the example). This indicates that videos a and b have similar starts and similar intra-video appearance changes, so the query action is correctly classified as “Pretending to take something from somewhere”. The right example shows the effectiveness of the focused prototypes. Through Hungarian matching, \(\boldsymbol{p}_{f,1}^x\) is matched with \(\boldsymbol{p}_{f,2}^y\). Since they both encode the frames where the hands have just tipped the objects over, these two prototypes give high similarities, enabling the query action “Tipping something over” to be correctly recognized.

Fig. 3. Visualization of the self-attention weights of \(\boldsymbol{P}_g\) and \(\boldsymbol{P}_f\) for all samples on the test sets of the SSv2\(^\sharp \) and Kinetics datasets.

Fig. 4. Performance on the SSv2\(^\sharp \) and Kinetics datasets when changing the number of global/focused prototypes.

A statistical analysis of the attention weights is shown in Fig. 3, which plots the average response of the first 4 global prototypes and the first 4 focused prototypes over all videos of the test set. As a result of the loss functions \(L_{div}, L_{att}\) and the matching strategies, on both the SSv2\(^\sharp \) and Kinetics datasets \(\boldsymbol{P}_g\) (first 4 rows) has a more uniform attention distribution, while \(\boldsymbol{P}_f\) has clear temporal regions of focus. The diversity of the prototypes ensures a robust representation of the videos, so that the similarity between videos can be better computed during few-shot learning.

How Much Does Each Group of Prototypes Contribute? To answer this, we test our method using different numbers of prototypes (\(m_g\) and \(m_f\)) and show the results in Fig. 4. Although the best combination of \(m_g\) and \(m_f\) differs for each dataset, the performance improves as the number of prototypes increases and, beyond a certain threshold, saturates due to overfitting on the training data. The best results on both SSv2\(^\sharp \) (62.0) and Kinetics (86.9) are achieved when \(m_f\) is larger than \(m_g\). Although \(m_g=m_f=8\) is not the optimal setting, we use it in Sect. 4.1 and Sect. 4.2 since it is the most stable setting across all datasets. A method to automatically choose the number of prototypes is left for future work.

Fig. 5. Class accuracy improvement when our method uses \(m_g=m_f=8\) prototypes compared to: orange bars: \(m_g=16\), \(m_f=0\); blue bars: \(m_g=0\), \(m_f=16\). S is short for “something”. (Color figure online)

We also show the per-class accuracy improvement when our method uses both groups of prototypes compared with using only one group. In Fig. 5, orange bars denote the accuracy difference between the \(m_g=m_f=8\) setting and the \(m_g=16, m_f=0\) setting, which indicates the performance gain from introducing the focused prototypes. Blue bars, analogously, show the accuracy improvement brought by the global prototypes. On the SSv2\(^\sharp \) dataset, combining the two groups of prototypes helps better distinguish some hard classes like “pulling S out of S”, “pulling S from left to right” and “pushing S from right to left”. Across the two datasets, we observe that the focused prototypes are more effective on Kinetics. This is because the Kinetics dataset focuses more on appearance, which is better captured and compared by the focused prototypes.

Will Wrong Bipartite Matching Destroy the Temporal Ordering of Actions? Although we observe a large performance gain from \(\boldsymbol{P}_f\) in Table 3 and Fig. 5, bipartite matching unavoidably produces some wrongly matched prototype pairs alongside the correct ones. Our additional experiments in the supplementary material show that filtering out matched prototype pairs with low similarity negatively affects the convergence of the model. One reason is that the positional encoding implicitly encodes the temporal ordering of the frames within each prototype. During training, the similarity scores of the wrongly matched pairs are learned to be small, so that the final similarity score is dominated by the correctly matched prototype pairs.

5 Conclusion

In this work, we introduced a novel method for few-shot action recognition that generates global and focused prototypes and compares video similarity based on them. When generating the prototypes, we encode spatiotemporal object relations to better handle actions that involve multiple objects. The two groups of prototypes are encouraged to capture different aspects of the input through different loss functions and matching strategies. In future work, we will explore a more flexible prototype matching strategy that can avoid mismatches in the bipartite matching.