1 Introduction

In recent years, we have seen major progress in spatially and temporally detecting actions in videos [1,2,3,4,5,6,7,8,9,10]. For this task, the bounding box of each person and the corresponding action labels need to be estimated for each frame as shown in Fig. 1. Such approaches, however, require the same type of dense annotations for training, which makes collecting and annotating datasets for spatio-temporal action detection very expensive.

Fig. 1.

The image shows a scene where two persons are talking. In this case, two persons perform multiple actions at the same time. Person A, indicated by one bounding box, performs the actions Stand, Listen to, and Watch. Person B, indicated by the other bounding box, performs the actions Stand, Talk to, and Watch. While in the supervised setting this information is also given for training, we study for the first time a weakly supervised setting where the video clip is only annotated by the actions Stand, Listen to, Talk to, and Watch, without any bounding boxes or associations to the present persons. (Color figure online)

To alleviate this problem, weakly supervised approaches have been proposed [11,12,13] where the bounding boxes are not given, but only the actions that occur in a video clip. Despite the promising results of weakly supervised approaches for spatio-temporal action detection, current approaches are limited to video clips that predominantly contain a single actor performing a single action, as in the UCF-101 [14] and JHMDB [15] datasets. However, most real-world videos are more complex and contain multiple actors performing multiple actions simultaneously. In this paper, we move a step forward and introduce the task of weakly supervised multi-label spatio-temporal action detection with multiple actors in a video. The goal is to infer a list of multiple actions for each actor in a given video clip as in the fully supervised case [5,6,7,8,9,10]. In the weakly supervised setting, however, only the actions occurring in each training video are known; any spatio-temporal information about the persons performing these actions is not provided. This is illustrated in Fig. 1, which shows two people standing and chatting. The video clip is only annotated by the four occurring actions Stand, Listen to, Talk to, and Watch. Additional information such as bounding boxes or the number of present persons is not provided. In contrast to previous experimental settings for weakly supervised learning, the proposed task is much more challenging since a video clip can contain multiple persons, each person can perform multiple actions at the same time, and multiple persons can perform the same action. For instance, both persons in Fig. 1 perform the actions Stand and Watch at the same time.

In order to address multi-label spatio-temporal action detection in the proposed weakly supervised setup, we first introduce a baseline that uses multi-instance and multi-label (MIML) learning [16,17,18]. Second, we introduce a novel approach that is better suited for the multi-label setting. Instead of modeling the class probabilities for each action class, we build the power set of all possible action combinations and model the probability of each subset of actions. Using a set representation has the advantage that we directly model the combination of multiple occurring actions instead of the probabilities of single actions. Since computing the probabilities for the full power set becomes intractable as the number of action classes increases, we assign an action set to each detected person under the constraint that the assignment is consistent with the annotation of the video clip. This is done by linear programming, which maximizes the overall gain across all plausible actor and action-subset combinations. We evaluate the proposed approach on the challenging AVA 2.2 dataset [19], which is currently the only dataset that can be used for evaluating this task. In our experiments, we show that the proposed approach outperforms the MIML baseline by a large margin and achieves \(83\%\) of the mAP of a model trained with full supervision.

In summary, the contribution of this paper is three-fold:

  • We introduce the novel task of weakly supervised multi-label spatio-temporal action detection with multiple actors.

  • We introduce a first baseline for this task based on multi-instance and multi-label learning.

  • We propose a novel approach based on an action set representation.

2 Related Work

Spatio-Temporal Action Detection. A popular approach for fully supervised spatio-temporal action detection comprises the joint detection and linking of bounding boxes [1, 3, 4, 20]. These linked bounding boxes form tubelets which are subsequently classified. Recently, many methods [9, 10, 21, 22] use standard person detectors for actor localization and focus on learning spatio-temporal interactions implicitly or explicitly. All these approaches, however, require that each frame is annotated with person locations and the corresponding action labels. Since such dense annotations are expensive to obtain on a large scale, recent approaches [8, 19, 23] deal with temporally sparse annotations, where the action labels and locations are annotated only for a subset of frames. Even though this reduces the annotation effort, these methods still require person-specific bounding boxes and their actions. Very few methods [11, 13] explore the possibility of weakly supervised learning. Most of these methods, such as [24, 25], use multiple instance learning to recognize distinct action characteristics. These works, however, consider the case where a single person performs no more than one action.

Actor-Action Associations. Actor-action associations have been key to identifying actions in both fully supervised and weakly supervised settings. [26] performs soft actor-action association using tags as pre-training on a very large dataset for fully supervised action recognition. With respect to weak supervision, a few approaches use movie subtitles [27, 28] or transcripts [29, 30] to temporally align actions to frames. In terms of actor-action associations for multiple persons, [31, 32] associate a single action with various persons. To the best of our knowledge, our work is the first to perform multi-person and multi-label associations.

Multi-instance and Multi-label Learning. In the past, many MIML algorithms [33, 34] have been proposed. For example, [17] propose the MIMLBoost and MIMLSVM algorithms, based on boosting and SVMs, respectively. [35] optimize a regularized rank-loss objective. MIML has also been used for different computer vision applications such as scene classification [16], multi-object recognition [18], and image tagging [36]. Recently, MIML-based approaches have been used for action recognition [32, 37].

3 Multi-label Action Detection and Recognition

Given a video clip with multiple actors where each actor can perform multiple actions at the same time as shown in Fig. 1, the goal is to localize these actors and predict for each actor the corresponding actions. In contrast to fully supervised learning, where bounding boxes with multiple action labels are given for training, we address for the first time a weakly supervised setting where only a list of actions is provided for each video clip during training. This is a very challenging task as we do not know how many actors are present and each actor can perform multiple actions at the same time. This is in contrast to weakly supervised spatio-temporal action localization where it is assumed that only one person is in the video and that the person does not perform more than one action at a given point in time.

In order to address this problem, we first discuss a baseline, which uses multi-instance and multi-label (MIML) learning [16,17,18], in Sect. 4. In Sect. 5, we then propose a novel method that uses a set representation instead of a representation of individual actions. This means that we build the power set of all possible action combinations from the annotation of a video clip. For example, the power set \(\varOmega \) for the three action labels Listen, Talk, and Watch is given by {\(\varnothing \), {Listen}, {Talk}, {Watch}, {Listen,Talk}, {Listen,Watch}, {Talk,Watch}, {Listen,Talk,Watch}}. We then assign one set \(\omega _i \in \varOmega \setminus \varnothing \) to each actor \(a_i\) under the constraint that each action c occurs at least once, i.e., \(c \in \bigcup _i \omega _i\). Using a set representation has the advantage that we directly model the combination of multiple occurring actions instead of the probabilities of single actions.
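To make the set representation concrete, the following minimal Python sketch builds the power set of a clip's annotated actions with itertools; the function name and label strings are illustrative and not part of our implementation:

```python
# Minimal sketch: build the power set of a clip's annotated actions.
# The empty set is included here to match the definition of Omega above;
# the assignment step later excludes it.
from itertools import chain, combinations

def power_set(labels):
    """Return all subsets of the given action labels, including the empty set."""
    return [frozenset(s) for s in chain.from_iterable(
        combinations(labels, r) for r in range(len(labels) + 1))]

omega = power_set(["Listen", "Talk", "Watch"])
print(len(omega))  # 8 subsets, matching the example above
```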

4 Multi-instance and Multi-label (MIML) Learning

One way to address the weakly supervised learning problem is to use multiple-instance learning. Since we have a multi-label problem, i.e., an actor can perform multiple actions at the same time, we use the concept of multi-instance and multi-label (MIML) learning [16,17,18]. We first use a person detector [38] to spatially localize the actors in a frame t and a 3D CNN such as I3D [39] or Slowfast [10] to predict the action probabilities, similar to fully supervised methods [8, 9]. However, we use the MIML loss to train the networks.

We denote by \(A_t = \{a^t_1,a^t_2, \ldots , a^t_{n_t}\}\) the detected bounding boxes and by \(f(a^t_i)\) the class probabilities that are predicted by the 3D-CNN. Let Y be the vector which contains the annotations of the video clip, i.e., \(Y(c)=1\) if the action class c occurs in the video clip and \(Y(c)=0\) otherwise. In other words, the bag \(A_t\) is labeled by \(Y(c)=1\) if at least one actor performs the action c and by \(Y(c)=0\) if none of the actors performs the action. The MIML loss is then given by

$$\begin{aligned} \mathcal {L}_{MIML} = \mathcal {L}\left( Y, \max _{i} f(a^t_{i}) \right) \end{aligned}$$
(1)

where \(\mathcal {L}\) is the binary cross entropy. This means that the class probability should be close to one for at least one bounding box if the action is present and it should be close to zero for all bounding boxes if the action class is not present.
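For illustration, the MIML loss (1) can be written in a few lines of PyTorch. This is a minimal sketch under the definitions above; the tensor names and toy shapes are our own:

```python
# Minimal PyTorch sketch of the MIML loss (1). `probs` holds the per-actor
# class probabilities f(a^t_i); `y` is the clip-level annotation vector Y.
import torch
import torch.nn.functional as F

def miml_loss(probs, y):
    """probs: (n_actors, C) action probabilities; y: (C,) with y[c] = 1 iff c occurs."""
    bag_probs, _ = probs.max(dim=0)         # max over all detected actors per class
    return F.binary_cross_entropy(bag_probs, y)

probs = torch.sigmoid(torch.randn(3, 60))   # e.g., 3 detected actors, 60 classes
y = torch.zeros(60); y[:4] = 1.0            # four actions annotated in the clip
print(miml_loss(probs, y))
```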

5 Actor-Action Association

Fig. 2.

Overview of the proposed approach. Given a training video clip with action labels {A1, A2, A3, A4}, we first detect persons in the video. We then train a 3D CNN with a graph RNN that models the spatio-temporal relations between the detected persons using the MIML loss to obtain initial estimates of the action logits. During actor-action association, subsets of the action labels are assigned to each detected person. The training of the network is continued using the MIML loss and the actor-action associations.

While multi-instance and multi-label learning as discussed in Sect. 4 already provides a good baseline for the new task of weakly supervised multi-label action detection, we propose in this section a novel method that outperforms the baseline by a large margin. As discussed in Sect. 3, the main idea is to change the representation from individual action labels to sets of actions. This means that we have one probability for each subset of actions \(\omega \in \varOmega \) instead of C probabilities, where C is the number of action labels. We discuss how the probability of a set of actions is estimated in Sect. 5.1. Due to the weakly supervised setting, not all combinations of subsets are possible for each video clip. We therefore assign an action set \(\omega \in \varOmega \) to each actor a under the constraint that the assignment is consistent with the annotation of the video clip, i.e., each annotated action c needs to occur at least once and actions that are not annotated must not occur. The assignment is discussed in Sect. 5.2.

Figure 2 illustrates the complete approach. As described in Sect. 4, we use a 3D CNN such as I3D [39] or Slowfast [10]. Since the actors in a frame often interact with each other, we use a graph to model the relations between the actors. The graph connects all actors and we use a graph RNN to infer the action probabilities for each actor based on the spatial and temporal context. In our approach, we use the hierarchical Graph RNN (HGRNN) [7] where the features per node are obtained by ROI pooling over the 3D CNN feature maps. The HGRNN and 3D CNN are learned using the MIML loss (1). From the action class probabilities, we infer the action set probabilities as described in Sect. 5.1 and we infer the action set for each actor as described in Sect. 5.2. Finally, we train the HGRNN and the 3D CNN based on the assignments. This will be discussed in Sect. 5.3.

5.1 Power Set of Actions

In principle, we could modify our network to predict the probability for each subset of all action classes instead of the probabilities for all action classes. However, this is infeasible since the power set of all actions is very large. If C is the number of actions in a dataset, the power set of all actions consists of \(2^{C}\) subsets. Already with 50 action classes, we would need to predict the probabilities of over one quadrillion subsets. Instead, we use an idea that was proposed for HEX graphs [40], where the probabilities of a hierarchy are computed from the probabilities of the leaf nodes. While we do not use a hierarchy, we can compute the probability of a subset of actions from the predictions of a network for individual actions.

Let \(s_{c} \in (-\infty ,\infty )\) denote the logit that is predicted by the network for the action class c. The probability of a subset of actions \(\omega \) can then be computed by

$$\begin{aligned} p_{\omega } = \frac{\exp \left( \sum _{c \in \omega } s_c\right) }{\sum _{\omega '}\exp \left( \sum _{c \in \omega '} s_c\right) }. \end{aligned}$$
(2)

The normalization term, however, is still infeasible to compute since we would still need to sum over all possible subsets \(\omega '\) of the dataset's actions.

Since our goal is the assignment of a subset of actions \(\omega \) to each actor, we do not need to compute the full probability (2). Instead of using the power set of all actions, we build the power set only for the actions that are provided as weak labels for each training video clip. This means that the power set will differ for each video clip. For the example shown in Fig. 1, we build the power set \(\varOmega \) for the actions Stand, Listen, Talk, and Watch. In this example, \(\vert \varOmega \vert =16\). We exclude \(\varnothing \) since in the used dataset each actor is annotated with at least one action. Furthermore, we multiply \(p_{\omega }\) with the confidence d of the person detector. The scoring function \(p_{\omega ,i}\) that we use for the assignment of a subset \(\omega \in \varOmega \setminus \varnothing \) to a detected actor \(a_i\) is therefore given by

$$\begin{aligned} p_{\omega ,i} = \frac{\exp \left( \sum _{c \in \omega } s_{c,i}\right) d_i}{\sum _{\omega ' \in \varOmega \setminus \varnothing }\exp \left( \sum _{c \in \omega '} s_{c,i}\right) } \end{aligned}$$
(3)

where \(s_{c,i}\) is the predicted logit for action c and person \(a_i\). Taking the detection confidence \(d_i\) of person \(a_i\) into account is necessary to reduce the impact of false positives that usually have a low detection confidence.
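Since the per-clip power set is small, the scoring function (3) is cheap to evaluate per actor. The following NumPy sketch illustrates it for a single actor; the logits and the detection confidence are illustrative placeholders:

```python
# Sketch of the scoring function (3) for one actor: scores over the power set
# of the clip's annotated actions, excluding the empty set.
from itertools import chain, combinations
import numpy as np

def subset_scores(logits, labels, det_conf):
    """logits: dict mapping action c to the logit s_{c,i}; det_conf: d_i."""
    subsets = list(chain.from_iterable(
        combinations(labels, r) for r in range(1, len(labels) + 1)))
    raw = np.array([sum(logits[c] for c in omega) for omega in subsets])
    e = np.exp(raw - raw.max())             # shift by the max for numerical stability
    return subsets, det_conf * e / e.sum()

logits = {"Stand": 2.1, "Listen": 0.7, "Talk": -1.3, "Watch": 1.5}
subsets, p = subset_scores(logits, list(logits), det_conf=0.9)
print(subsets[int(p.argmax())])             # highest-scoring action subset
```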

5.2 Actor-Action Association

While the scoring function (3) indicates how likely a given subset of actions \(\omega \in \varOmega \setminus \varnothing \) fits to an actor \(a_i\), it does not take all information that is available for each video clip into account. For instance, we know that each annotated action is performed by at least one actor. In order to exploit this additional knowledge, we find the optimal assignment of action subsets to actors based on the constraints that each actor performs at least one action and that each action c occurs at least once, i.e., \(c \in \bigcup _i \omega _i\). Since we build the power set only from the actions that occur in a video clip, which we denote by L, the power set \(\varOmega (L)\) varies for each training video clip.

Fig. 3.

For the annotated actions \(L= \{1,2,3\}\) and the actors \(A=\{a_1,a_2,a_3,a_4\}\), the figure illustrates various actor-action assignments. While assignment a) satisfies all constraints, b) violates (5) since two subsets are assigned to actor \(a_1\), and c) violates (6) since action 1 is not part of any assigned subset.

The association of subsets \(\omega \in \varOmega (L) \setminus \varnothing \) to actors \(A = \{a_1, a_2, \ldots , a_n\}\) can be formulated as a binary linear program where the binary variable \(x_{\omega ,i}\) is one if the subset \(\omega \) is assigned to actor \(a_i\) and zero otherwise. The optimal assignment is the assignment with the highest score (4). While the first constraint (5) enforces that exactly one subset \(\omega \) is assigned to each actor \(a_i\), the second constraint (6) enforces that \(c \in \bigcup _{(\omega ,i): x_{\omega ,i} = 1} \omega \) for all \(c \in L\), i.e., each annotated action is part of at least one assigned subset. Note that (6) rephrases this constraint such that it can be used for optimization, where the indicator function \({\mathbbm {1}}_{\omega }(c)\) is one if \(c \in \omega \) and zero otherwise. The left hand side of the inequality therefore counts the number of assigned subsets that contain the action class c. Since this number must be larger than zero, it ensures that each action \(c \in L\) is assigned to at least one actor. The complete binary linear program is thus given by:

$$\begin{aligned} \mathop {\mathrm {argmax}}\limits _{x_{\omega ,i}} \;&\sum _{i=1}^{n}\sum _{\omega \in \varOmega (L) \setminus \varnothing } p_{\omega ,i} x_{\omega ,i} \end{aligned}$$
(4)
$$\begin{aligned} \text {subject to} \;&\sum _{\omega \in \varOmega (L) \setminus \varnothing } x_{\omega ,i} = 1 \qquad \qquad \qquad \qquad \qquad \forall i = 1, \ldots , n \end{aligned}$$
(5)
$$\begin{aligned}&\sum _{i=1}^{n} \sum _{\omega \in \varOmega (L) \setminus \varnothing } {\mathbbm {1}}_{\omega }(c)\, x_{\omega ,i} \ge 1 \qquad \qquad \qquad \qquad \forall c \in L \nonumber \\&x_{\omega ,i} \in \{0,1\} \qquad \qquad \qquad \forall \omega \in \varOmega (L) \setminus \varnothing ; \; \forall i = 1, \ldots , n. \end{aligned}$$
(6)

Figure 3 illustrates the constraints.
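For illustration, the binary linear program (4)-(6) can be solved with an off-the-shelf MILP solver. The following sketch uses scipy.optimize.milp; the scores stand in for \(p_{\omega ,i}\) and are randomly generated toy data:

```python
# Sketch of the actor-action association (4)-(6) as a binary linear program.
from itertools import chain, combinations
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def assign_action_sets(scores, subsets, labels):
    """scores: (n_actors, n_subsets) array of p_{omega,i}."""
    n, m = scores.shape
    c = -scores.reshape(-1)                 # milp minimizes, so negate the gain (4)
    # Constraint (5): exactly one subset per actor.
    A_actor = np.zeros((n, n * m))
    for i in range(n):
        A_actor[i, i * m:(i + 1) * m] = 1.0
    # Constraint (6): every annotated action is covered at least once.
    A_cover = np.zeros((len(labels), n * m))
    for k, action in enumerate(labels):
        for j, omega in enumerate(subsets):
            if action in omega:
                A_cover[k, j::m] = 1.0      # subset j contains the action, for every actor
    res = milp(c,
               constraints=[LinearConstraint(A_actor, 1, 1),
                            LinearConstraint(A_cover, 1, np.inf)],
               integrality=np.ones(n * m),  # binary via integrality and [0, 1] bounds
               bounds=Bounds(0, 1))
    x = res.x.reshape(n, m).round().astype(bool)
    return [next(subsets[j] for j in range(m) if x[i, j]) for i in range(n)]

labels = ["Stand", "Listen", "Talk", "Watch"]
subsets = list(chain.from_iterable(
    combinations(labels, r) for r in range(1, len(labels) + 1)))
scores = np.random.rand(2, len(subsets))    # two detected actors, toy scores
print(assign_action_sets(scores, subsets, labels))
```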

5.3 Training

We first train the network using the MIML loss (1) to obtain initial estimates of the logits \(s_{c,i}\). We then assign subsets of actions to the detected persons using the scoring function (3). Finally, we train our network using the loss

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{MIML} + \alpha \sum _{i=1}^{n_t} \mathcal {L}\left( \hat{Y}_{\omega ^t_i}, f(a^t_{i})\right) \end{aligned}$$
(7)

where \(\omega ^t_i\) denotes the action subset that has been assigned to actor \(a^t_i\) in frame t and \(\hat{Y}_{\omega ^t_i}\) is a vector with \(\hat{Y}_{\omega ^t_i}(c) = 1\) if \(c \in \omega ^t_i\) and \(\hat{Y}_{\omega ^t_i}(c) = 0\) otherwise. \(\mathcal {L}\) is the binary cross entropy. Since \(\mathcal {L}_{MIML}\) is computed once per frame but \(\mathcal {L}(\hat{Y}_{\omega ^t_i}, f(a^t_{i}))\) is computed for each detected person, we use \(\alpha =0.3\) to compensate for this difference.
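A minimal PyTorch sketch of the combined loss (7) could look as follows, reusing the bag-level MIML term from Sect. 4; how the binary cross entropy is reduced over classes is our assumption:

```python
# Sketch of the combined loss (7). `y_assigned[i]` is the binary target vector
# derived from the action subset assigned to actor i by the linear program.
import torch
import torch.nn.functional as F

def combined_loss(probs, y_clip, y_assigned, alpha=0.3):
    """probs: (n, C) per-actor probabilities; y_clip: (C,) clip labels;
    y_assigned: (n, C) assignment targets from the actor-action association."""
    bag_probs, _ = probs.max(dim=0)
    miml = F.binary_cross_entropy(bag_probs, y_clip)
    per_actor = F.binary_cross_entropy(probs, y_assigned, reduction="none")
    return miml + alpha * per_actor.mean(dim=1).sum()   # sum over actors i
```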

6 Experiments

6.1 Dataset and Implementation Details

We use the AVA 2.2 dataset [19] for evaluation. The dataset contains 235 videos for training, 64 videos for validation, and 131 videos for testing, covering 60 action classes. The persons often perform multiple actions at the same time, and the videos contain multiple persons. For each annotated person, a bounding box is provided. An example is given in Fig. 1. Only one frame per second is annotated. The accuracy is measured by the mean average precision (mAP) over all actions with a bounding box IoU threshold of 0.5, as described in [19]. In the weakly supervised setting, we use only the present actions for training, but not the bounding boxes.
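For reference, the bounding box matching criterion of this evaluation protocol is plain intersection-over-union; a small self-contained sketch with toy boxes:

```python
# Sketch of the IoU criterion used for evaluation: a detection matches a
# ground-truth box when IoU >= 0.5. Boxes are given as (x1, y1, x2, y2).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3, below the 0.5 threshold
```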

To detect persons, we use Faster RCNN [41] with ResNeXt-101 [38] as backbone. The detector was pre-trained on ImageNet and fine-tuned on the COCO dataset. In our experiments, we report results for two 3D CNNs, namely I3D [39] and Slowfast [10]. I3D is pre-trained on Kinetics-400. For Slowfast, we use the ResNet-101 + NL (\(8\,\times \,8\)) version that is pre-trained on Kinetics-600. The temporal scope was set to 64 frames with a stride of 2. For HGRNN, we use a temporal window of 11 frames. For training, we use the SGD optimizer until the validation error saturates. The learning rate with linear warmup was set to 0.04 and 0.025 for I3D and Slowfast, respectively. The batch size was set to 16. We use cropping as data augmentation, where we crop images of size \(224 \times 224\) pixels from frames with an image resolution of \(256\,\times \,256\) pixels.

Table 1. Comparison of MIML with the proposed method. The proposed approach outperforms MIML for both I3D and Slowfast.
Fig. 4.

Comparison of MIML with the proposed method. The plot shows the per-class mAP for the 10 most frequently occurring classes in the training set. The actions are sorted by the number of occurrences in decreasing order from left to right. A plot with all 60 action classes is part of the supplementary material.

6.2 Experimental Results

Comparison of MIML with Proposed Method. Table 1 shows the comparison of the proposed approach with the multi-instance and multi-label (MIML) baseline on the validation set. When I3D is used as 3D CNN, the proposed approach improves over the MIML baseline by \(+3.2\%\). When Slowfast is used, the accuracy of all methods is higher, but the improvement of the proposed approach over the MIML approach remains nearly the same at \(+3.3\%\). We also report the result when HGRNN is trained only with the MIML loss. In this case, the actor-action association is not used, and we denote this setting by MIML+HGRNN. While HGRNN improves the results since it models the spatio-temporal relations between persons better than a 3D CNN alone, the proposed actor-action assignment improves the mAP compared to MIML+HGRNN by \(+2.1\%\) and \(+2.0\%\) for I3D and Slowfast, respectively. Figure 4 shows the improvement of the proposed approach over the MIML baseline for the 10 action classes that occur most frequently in the training set. A few qualitative results are shown in Fig. 5.

Table 2. Results of various actor-action assignment approaches using HGRNN with different 3D CNNs. The Frequent-5 and Least-10 columns show the average mAP over the 5 most frequently and the 10 least frequently occurring classes in the training set.
Table 3. Performance with ground-truth bounding boxes for evaluation. The results show the improvement in mAP on the validation set when ground-truth bounding boxes (GT bb) instead of detected bounding boxes (Detected bb) are used for evaluation. Furthermore, the results are reported when the model is trained with full supervision.

Impact of Actor-Action Association. In Table 1, we have observed that the actor-action association improves the accuracy. In Table 2, we analyze the impact of the actor-action association in more detail. We use HGRNN with both I3D and Slowfast as 3D CNN backbone. In case of MIML+HGRNN, the actor-action association is not used. We also report the result when we perform the association directly by the confidences without solving a binary linear program; we denote this setting by Proposed Approach w/o LP. In this case, we associate an action with an actor if the class probability is greater than 0.5. For I3D, the association without LP improves the results mainly for the most frequent classes, with almost no improvement on the least frequent classes. For Slowfast, the association without LP even decreases the performance in comparison to MIML+HGRNN. Solving the linear program instead results in better associations for both I3D and Slowfast.

Impact of the Object Detector. We use the Faster RCNN person detector with ResNeXt [38], which achieves \(90.10\%\) mAP for person detection on the AVA training set and \(90.45\%\) on the AVA validation set. Despite this high detection performance, we analyze how much the accuracy improves if the detected bounding boxes are replaced with the ground-truth bounding boxes during evaluation. Note that the ground-truth bounding boxes are not used for training, but only for evaluation. The results are shown in Table 3. We observe that the performance improves by \(+7.0\%\) and \(+7.2\%\) mAP on the validation set for I3D and Slowfast, respectively. We also report the results when the approach is trained with full supervision. In this case, the network is trained on the ground-truth bounding boxes and the ground-truth action labels per bounding box. Compared to the fully supervised approach, our weakly supervised approach achieves around \(83\%\) of the mAP for both 3D CNNs (\(17.3\%\) vs. \(20.7\%\) for I3D and \(25.1\%\) vs. \(30.1\%\) for Slowfast) if detected bounding boxes are used for evaluation. The gap gets even smaller when ground-truth bounding boxes are used for evaluation. In this case, the relative performance is \(95.7\%\) for I3D and \(90.5\%\) for Slowfast. This demonstrates that the proposed approach learns the actions very well despite the weak supervision.

Table 4. Comparison to fully supervised approaches. We also report the result of our approach if it is trained with full supervision. Note that we do not use multi-scale and horizontal flipping augmentation as in Slowfast++.
Fig. 5.

Qualitative results. The left column shows the ground-truth annotations, the middle column the results of the MIML baseline, and the right column the results of the proposed method. The colors only distinguish different persons and are otherwise irrelevant. The predicted action classes with confidence scores are shown on top of the estimated bounding boxes. The proposed approach correctly recognizes more action classes per bounding box compared to MIML. Both methods also detect genuine actions that are not annotated in the dataset, as can be seen from the missing persons in the second and fourth rows. The bias of the proposed method towards the background is visible in the last row, where the “swim” action is associated with both persons. Best viewed using the zoom function of the PDF viewer.

Comparison to Fully Supervised Methods. Since this is the first approach that addresses weakly supervised learning for multi-label and multi-person action detection, we cannot compare with other weakly supervised approaches. Instead, we compare our approach with the state of the art for fully supervised action detection in Table 4. Our approach is competitive with fully supervised approaches [5,6,7,8]. When we train our approach with full supervision, we improve over Slowfast [10] by \(+1.1\%\) mAP on the validation set. While the Slowfast++ network performs slightly better, it uses additional data augmentation and a different network configuration. We expect that these changes would improve our approach as well.

7 Conclusion

In this paper, we introduced the challenging task of weakly supervised multi-label spatio-temporal action detection with multiple actors. We first introduced a baseline based on multi-instance and multi-label learning. We furthermore presented a novel approach where the multi-label problem is represented by the power set of the action classes. In this context, we assign an element of the power set to each detected person using linear programming. We evaluated our approach on the challenging AVA dataset, where the proposed method outperforms the MIML approach. Despite the weak supervision, the proposed approach is competitive with fully supervised approaches.