1 Introduction

In recent years, we have seen major progress in spatially and temporally detecting actions in videos [1,2,3,4,5,6,7,8,9,10]. For this task, the bounding box of each person and the corresponding action labels need to be estimated for each frame as shown in Fig. 1. Such approaches, however, require the same type of dense annotations for training, which makes collecting and annotating datasets for spatio-temporal action detection very expensive.

Fig. 1.

The image shows a scene where two persons are talking. In this case, two persons perform multiple actions at the same time. Person A, indicated by one bounding box, performs the actions Stand, Listen to, and Watch. Person B, indicated by the other bounding box, performs the actions Stand, Talk to, and Watch. While in the supervised setting this information is also given for training, we study for the first time a weakly supervised setting where the video clip is only annotated by the actions Stand, Listen to, Talk to, and Watch, without any bounding boxes or associations to the present persons. (Color figure online)

To alleviate this problem, weakly supervised approaches have been proposed [11,12,13] where the bounding boxes are not given, but only the actions that occur in a video clip. Despite the promising results of weakly supervised approaches for spatio-temporal action detection, current approaches are limited to video clips that predominantly contain a single actor performing a single action, as in the UCF-101 [14] and JHMDB [15] datasets. However, most real-world videos are more complex and contain multiple actors performing multiple actions simultaneously. In this paper, we move a step forward and introduce the task of weakly supervised multi-label spatio-temporal action detection with multiple actors in a video. The goal is to infer a list of multiple actions for each actor in a given video clip as in the fully supervised case [5,6,7,8,9,10]. In the weakly supervised setting, however, only the actions occurring in each training video are known; any spatio-temporal information about the persons performing these actions is not provided. This is illustrated in Fig. 1, which shows two people standing and chatting. The video clip is only annotated by the four occurring actions Stand, Listen to, Talk to, and Watch. Additional information such as bounding boxes or the number of present persons is not provided. In contrast to previous experimental settings for weakly supervised learning, the proposed task is much more challenging since a video clip can contain multiple persons, each person can perform multiple actions at the same time, and multiple persons can perform the same action. For instance, both persons in Fig. 1 perform the actions Stand and Watch at the same time.

In order to address multi-label spatio-temporal action detection in the proposed weakly supervised setup, we first introduce a baseline that uses multi-instance and multi-label (MIML) learning [16,17,18]. Second, we introduce a novel approach that is better suited for the multi-label setting. Instead of modeling the class probabilities for each action class, we build the power set of all possible action combinations and model the probability of each subset of actions. Using a set representation has the advantage that we directly model the combination of multiple occurring actions instead of the probabilities of single actions. Since computing the probabilities for the full power set becomes intractable as the number of action classes increases, we assign an action set to each detected person under the constraint that the assignment is consistent with the annotation of the video clip. This is done by linear programming, which maximizes the overall gain across all plausible actor and action-subset combinations. We evaluate the proposed approach on the challenging AVA 2.2 dataset [19], which is currently the only dataset that can be used for evaluating this task. In our experiments, we show that the proposed approach outperforms the MIML baseline by a large margin and achieves \(83\%\) of the mAP of a model trained with full supervision.

In summary, the contribution of this paper is three-fold:

  • We introduce the novel task of weakly supervised multi-label spatio-temporal action detection with multiple actors.

  • We introduce a first baseline for this task based on multi-instance and multi-label learning.

  • We propose a novel approach based on an action set representation.

2 Related Work

Spatio-Temporal Action Detection. A popular approach for fully supervised spatio-temporal action detection comprises the joint detection and linking of bounding boxes [1, 3, 4, 20]. These linked bounding boxes form tubelets which are subsequently classified. Recently, many methods [9, 10, 21, 22] use standard person detectors for actor localization and focus on learning spatio-temporal interactions implicitly or explicitly. All these approaches, however, require that each frame is annotated with person locations and the corresponding action labels. Since such dense annotations are expensive to obtain on a large scale, recent approaches [8, 19, 23] deal with temporally sparse annotations, where the action labels and locations are annotated only for a subset of frames. Even though this reduces the annotation effort, these methods still require person-specific bounding boxes and their actions. Very few methods [11, 13] explore the possibility of weakly supervised learning. Most of these methods, such as [24, 25], use multiple instance learning to recognize distinct action characteristics. These works, however, consider the case where a single person performs no more than one action.

Actor-Action Associations. Actor-action associations have been key to identifying actions in both fully supervised and weakly supervised settings. [26] performs soft actor-action association using tags as pre-training on a very large dataset for fully supervised action recognition. With respect to weak supervision, a few approaches use movie subtitles [27, 28] or transcripts [29, 30] to temporally align actions to frames. In terms of actor-action associations for multiple persons, [31, 32] associate a single action with various persons. To the best of our knowledge, our work is the first to perform multi-person and multi-label associations.

Multi-instance and Multi-label Learning. In the past, many MIML algorithms [33, 34] have been proposed. For example, [17] propose the MIMLBoost and MIMLSVM algorithms, based on boosting and SVMs, respectively. [35] optimize a regularized rank-loss objective. MIML has also been used for different computer vision applications such as scene classification [16], multi-object recognition [18], and image tagging [36]. Recently, MIML-based approaches have been used for action recognition [32, 37].

3 Multi-label Action Detection and Recognition

Given a video clip with multiple actors where each actor can perform multiple actions at the same time as shown in Fig. 1, the goal is to localize these actors and predict for each actor the corresponding actions. In contrast to fully supervised learning, where bounding boxes with multiple action labels are given for training, we address for the first time a weakly supervised setting where only a list of actions is provided for each video clip during training. This is a very challenging task as we do not know how many actors are present and each actor can perform multiple actions at the same time. This is in contrast to weakly supervised spatio-temporal action localization where it is assumed that only one person is in the video and that the person does not perform more than one action at a given point in time.

In order to address this problem, we first discuss a baseline, which uses multi-instance and multi-label (MIML) learning [16,17,18], in Sect. 4. In Sect. 5, we then propose a novel method that uses a set representation instead of a representation of individual actions. This means that we build the power set of all possible action combinations from the annotation of a video clip. For example, the power set \(\varOmega \) for the three action labels Listen, Talk, and Watch is given by {\(\varnothing \), {Listen}, {Talk}, {Watch}, {Listen,Talk}, {Listen,Watch}, {Talk,Watch}, {Listen,Talk,Watch}}. We then assign one set \(\omega _i \in \varOmega \setminus \varnothing \) to each actor \(a_i\) under the constraint that each action c occurs at least once, i.e., \(c \in \bigcup _i \omega _i\). Using a set representation has the advantage that we directly model the combination of multiple occurring actions instead of the probabilities of single actions.
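To make the set representation concrete, the following minimal Python sketch builds the power set of a clip's annotated actions with itertools; the function name and label strings are illustrative and not part of our implementation:

```python
# Minimal sketch: build the power set of a clip's annotated actions.
# The empty set is included here to match the definition of Omega above;
# the assignment step later excludes it.
from itertools import chain, combinations

def power_set(labels):
    """Return all subsets of the given action labels, including the empty set."""
    return [frozenset(s) for s in chain.from_iterable(
        combinations(labels, r) for r in range(len(labels) + 1))]

omega = power_set(["Listen", "Talk", "Watch"])
print(len(omega))  # 8 subsets, matching the example above
```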

4 Multi-instance and Multi-label (MIML) Learning

One way to address the weakly supervised learning problem is to use multiple-instance learning. Since we have a multi-label problem, i.e., an actor can perform multiple actions at the same time, we use the concept of multi-instance and multi-label (MIML) learning [16,17,18]. We first use a person detector [38] to spatially localize the actors in a frame t and a 3D CNN such as I3D [39] or Slowfast [10] to predict the action probabilities, similar to fully supervised methods [8, 9]. However, we use the MIML loss to train the networks.

We denote by \(A_t = \{a^t_1,a^t_2, \ldots , a^t_{n_t}\}\) the detected bounding boxes and by \(f(a^t_i)\) the class probabilities that are predicted by the 3D-CNN. Let Y be the vector which contains the annotations of the video clip, i.e., \(Y(c)=1\) if the action class c occurs in the video clip and \(Y(c)=0\) otherwise. In other words, the bag \(A_t\) is labeled by \(Y(c)=1\) if at least one actor performs the action c and by \(Y(c)=0\) if none of the actors performs the action. The MIML loss is then given by

$$\begin{aligned} \mathcal {L}_{MIML} = \mathcal {L}\left( Y, \max _{i} f(a^t_{i}) \right) \end{aligned}$$
(1)

where \(\mathcal {L}\) is the binary cross entropy. This means that the class probability should be close to one for at least one bounding box if the action is present and it should be close to zero for all bounding boxes if the action class is not present.
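For illustration, the MIML loss (1) can be written in a few lines of PyTorch. This is a minimal sketch under the definitions above; the tensor names and toy shapes are our own:

```python
# Minimal PyTorch sketch of the MIML loss (1). `probs` holds the per-actor
# class probabilities f(a^t_i); `y` is the clip-level annotation vector Y.
import torch
import torch.nn.functional as F

def miml_loss(probs, y):
    """probs: (n_actors, C) action probabilities; y: (C,) with y[c] = 1 iff c occurs."""
    bag_probs, _ = probs.max(dim=0)         # max over all detected actors per class
    return F.binary_cross_entropy(bag_probs, y)

probs = torch.sigmoid(torch.randn(3, 60))   # e.g., 3 detected actors, 60 classes
y = torch.zeros(60); y[:4] = 1.0            # four actions annotated in the clip
print(miml_loss(probs, y))
```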

5 Actor-Action Association

Fig. 2.

Overview of the proposed approach. Given a training video clip with action labels {A1, A2, A3, A4}, we first detect persons in the video. We then train a 3D CNN with a graph RNN that models the spatio-temporal relations between the detected persons using the MIML loss to obtain initial estimates of the action logits. During actor-action association, subsets of the action labels are assigned to each detected person. The training of the network is continued using the MIML loss and the actor-action associations.

While multi-instance and multi-label learning as discussed in Sect. 4 already provides a good baseline for the new task of weakly supervised multi-label action detection, we propose in this section a novel method that outperforms the baseline by a large margin. As discussed in Sect. 3, the main idea is to change the representation from individual action labels to sets of actions. This means that we have one probability for each subset of actions \(\omega \in \varOmega \) instead of C probabilities, where C is the number of action labels. We discuss how the probability of a set of actions is estimated in Sect. 5.1. Due to the weakly supervised setting, not all combinations of subsets are possible for each video clip. We therefore assign an action set \(\omega \in \varOmega \) to each actor a under the constraint that the assignment is consistent with the annotation of the video clip, i.e., each annotated action c needs to occur at least once and actions that are not annotated must not occur. The assignment is discussed in Sect. 5.2.

Figure 2 illustrates the complete approach. As described in Sect. 4, we use a 3D CNN such as I3D [39] or Slowfast [10]. Since the actors in a frame often interact with each other, we use a graph to model the relations between the actors. The graph connects all actors and we use a graph RNN to infer the action probabilities for each actor based on the spatial and temporal context. In our approach, we use the hierarchical Graph RNN (HGRNN) [7] where the features per node are obtained by ROI pooling over the 3D CNN feature maps. The HGRNN and 3D CNN are learned using the MIML loss (1). From the action class probabilities, we infer the action set probabilities as described in Sect. 5.1 and we infer the action set for each actor as described in Sect. 5.2. Finally, we train the HGRNN and the 3D CNN based on the assignments. This will be discussed in Sect. 5.3.

5.1 Power Set of Actions

In principle, we could modify our network to predict the probability for each subset of all action classes instead of the probabilities for all action classes. However, this is infeasible since the power set of all actions is very large. If C is the number of actions in a dataset, the power set of all actions consists of \(2^{C}\) subsets. Already with 50 action classes, we would need to predict the probabilities of over one quadrillion subsets. Instead, we use an idea that was proposed for HEX graphs [40], where the probabilities of a hierarchy are computed from the probabilities of the leaf nodes. While we do not use a hierarchy, we can compute the probability of a subset of actions from the predictions of a network for individual actions.

Let \(s_{c} \in (-\infty ,\infty )\) denote the logit that is predicted by the network for the action class c. The probability of a subset of actions \(\omega \) can then be computed by

$$\begin{aligned} p_{\omega } = \frac{\exp \left( \sum _{c \in \omega } s_c\right) }{\sum _{\omega '}\exp \left( \sum _{c \in \omega '} s_c\right) }. \end{aligned}$$
(2)

The normalization term, however, is still infeasible to compute since we would still need to sum over all possible subsets \(\omega '\) of the dataset's actions.

Since our goal is the assignment of a subset of actions \(\omega \) to each actor, we do not need to compute the full probability (2). Instead of using the power set of all actions, we build the power set only for the actions that are provided as weak labels for each training video clip. This means that the power set will differ for each video clip. For the example shown in Fig. 1, we build the power set \(\varOmega \) for the actions Stand, Listen, Talk, and Watch. In this example, \(\vert \varOmega \vert =16\). We exclude \(\varnothing \) since in the used dataset each actor is annotated with at least one action. Furthermore, we multiply \(p_{\omega }\) with the confidence d of the person detector. The scoring function \(p_{\omega ,i}\) that we use for the assignment of a subset \(\omega \in \varOmega \setminus \varnothing \) to a detected actor \(a_i\) is therefore given by

$$\begin{aligned} p_{\omega ,i} = \frac{\exp \left( \sum _{c \in \omega } s_{c,i}\right) d_i}{\sum _{\omega ' \in \varOmega \setminus \varnothing }\exp \left( \sum _{c \in \omega '} s_{c,i}\right) } \end{aligned}$$
(3)

where \(s_{c,i}\) is the predicted logit for action c and person \(a_i\). Taking the detection confidence \(d_i\) of person \(a_i\) into account is necessary to reduce the impact of false positives that usually have a low detection confidence.
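Since the per-clip power set is small, the scoring function (3) is cheap to evaluate per actor. The following NumPy sketch illustrates it for a single actor; the logits and the detection confidence are illustrative placeholders:

```python
# Sketch of the scoring function (3) for one actor: scores over the power set
# of the clip's annotated actions, excluding the empty set.
from itertools import chain, combinations
import numpy as np

def subset_scores(logits, labels, det_conf):
    """logits: dict mapping action c to the logit s_{c,i}; det_conf: d_i."""
    subsets = list(chain.from_iterable(
        combinations(labels, r) for r in range(1, len(labels) + 1)))
    raw = np.array([sum(logits[c] for c in omega) for omega in subsets])
    e = np.exp(raw - raw.max())             # shift by the max for numerical stability
    return subsets, det_conf * e / e.sum()

logits = {"Stand": 2.1, "Listen": 0.7, "Talk": -1.3, "Watch": 1.5}
subsets, p = subset_scores(logits, list(logits), det_conf=0.9)
print(subsets[int(p.argmax())])             # highest-scoring action subset
```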

5.2 Actor-Action Association

While the scoring function (3) indicates how likely a given subset of actions \(\omega \in \varOmega \setminus \varnothing \) fits to an actor \(a_i\), it does not take all information that is available for each video clip into account. For instance, we know that each annotated action is performed by at least one actor. In order to exploit this additional knowledge, we find the optimal assignment of action subsets to actors based on the constraints that each actor performs at least one action and that each action c occurs at least once, i.e., \(c \in \bigcup _i \omega _i\). Since we build the power set only from the actions that occur in a video clip, which we denote by L, the power set \(\varOmega (L)\) varies for each training video clip.

Fig. 3.

For the annotated actions \(L= \{1,2,3\}\) and the actors \(A=\{a_1,a_2,a_3,a_4\}\), the figure illustrates various actor-action assignments. While assignment a) satisfies all constraints, b) violates (5) since two subsets are assigned to actor \(a_1\), and c) violates (6) since action 1 is not part of any assigned subset.

The association of subsets \(\omega \in \varOmega (L) \setminus \varnothing \) to actors \(A = \{a_1, a_2, \ldots , a_n\}\) can be formulated as a binary linear program where the binary variable \(x_{\omega ,i}\) is one if the subset \(\omega \) is assigned to actor \(a_i\) and zero otherwise. The optimal assignment is the assignment with the highest score (4). While the first constraint (5) enforces that exactly one subset \(\omega \) is assigned to each actor \(a_i\), the second constraint (6) enforces that \(c \in \bigcup _{(\omega ,i): x_{\omega ,i} = 1} \omega \) for all \(c \in L\), i.e., each annotated action is part of at least one assigned subset. Note that (6) rephrases this constraint such that it can be used for optimization, where the indicator function \({\mathbbm {1}}_{\omega }(c)\) is one if \(c \in \omega \) and zero otherwise. The left hand side of the inequality therefore counts the number of assigned subsets that contain the action class c. Since this number must be larger than zero, it ensures that each action \(c \in L\) is assigned to at least one actor. The complete binary linear program is thus given by:

$$\begin{aligned} \mathop {\mathrm {argmax}}\limits _{x_{\omega ,i}} \;&\sum _{i=1}^{n}\sum _{\omega \in \varOmega (L) \setminus \varnothing } p_{\omega ,i} x_{\omega ,i} \end{aligned}$$
(4)
$$\begin{aligned} \text {subject to} \;&\sum _{\omega \in \varOmega (L) \setminus \varnothing } x_{\omega ,i} = 1 \qquad \qquad \qquad \qquad \qquad \forall i = 1, \ldots , n \end{aligned}$$
(5)
$$\begin{aligned}&\sum _{i=1}^{n} \sum _{\omega \in \varOmega (L) \setminus \varnothing } {\mathbbm {1}}_{\omega }(c)\, x_{\omega ,i} \ge 1 \qquad \qquad \qquad \qquad \forall c \in L \nonumber \\&x_{\omega ,i} \in \{0,1\} \qquad \qquad \qquad \forall \omega \in \varOmega (L) \setminus \varnothing ; \; \forall i = 1, \ldots , n. \end{aligned}$$
(6)

Figure 3 illustrates the constraints.
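For illustration, the binary linear program (4)-(6) can be solved with an off-the-shelf MILP solver. The following sketch uses scipy.optimize.milp; the scores stand in for \(p_{\omega ,i}\) and are randomly generated toy data:

```python
# Sketch of the actor-action association (4)-(6) as a binary linear program.
from itertools import chain, combinations
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def assign_action_sets(scores, subsets, labels):
    """scores: (n_actors, n_subsets) array of p_{omega,i}."""
    n, m = scores.shape
    c = -scores.reshape(-1)                 # milp minimizes, so negate the gain (4)
    # Constraint (5): exactly one subset per actor.
    A_actor = np.zeros((n, n * m))
    for i in range(n):
        A_actor[i, i * m:(i + 1) * m] = 1.0
    # Constraint (6): every annotated action is covered at least once.
    A_cover = np.zeros((len(labels), n * m))
    for k, action in enumerate(labels):
        for j, omega in enumerate(subsets):
            if action in omega:
                A_cover[k, j::m] = 1.0      # subset j contains the action, for every actor
    res = milp(c,
               constraints=[LinearConstraint(A_actor, 1, 1),
                            LinearConstraint(A_cover, 1, np.inf)],
               integrality=np.ones(n * m),  # binary via integrality and [0, 1] bounds
               bounds=Bounds(0, 1))
    x = res.x.reshape(n, m).round().astype(bool)
    return [next(subsets[j] for j in range(m) if x[i, j]) for i in range(n)]

labels = ["Stand", "Listen", "Talk", "Watch"]
subsets = list(chain.from_iterable(
    combinations(labels, r) for r in range(1, len(labels) + 1)))
scores = np.random.rand(2, len(subsets))    # two detected actors, toy scores
print(assign_action_sets(scores, subsets, labels))
```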

5.3 Training

We first train the network using the MIML loss (1) to obtain initial estimates of the logits \(s_{c,i}\). We then assign subsets of actions to the detected persons using the scoring function (3). Finally, we train our network using the loss

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{MIML} + \alpha \sum _{i=1}^{n_t} \mathcal {L}\left( \hat{Y}_{\omega ^t_i}, f(a^t_{i})\right) \end{aligned}$$
(7)

where \(\omega ^t_i\) denotes the action subset that has been assigned to actor \(a^t_i\) in frame t and \(\hat{Y}_{\omega ^t_i}\) is a vector with \(\hat{Y}_{\omega ^t_i}(c) = 1\) if \(c \in \omega ^t_i\) and \(\hat{Y}_{\omega ^t_i}(c) = 0\) otherwise. \(\mathcal {L}\) is the binary cross entropy. Since \(\mathcal {L}_{MIML}\) is computed once per frame but \(\mathcal {L}(\hat{Y}_{\omega ^t_i}, f(a^t_{i}))\) is computed for each detected person, we use \(\alpha =0.3\) to compensate for this difference.
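A minimal PyTorch sketch of the combined loss (7) could look as follows, reusing the bag-level MIML term from Sect. 4; how the binary cross entropy is reduced over classes is our assumption:

```python
# Sketch of the combined loss (7). `y_assigned[i]` is the binary target vector
# derived from the action subset assigned to actor i by the linear program.
import torch
import torch.nn.functional as F

def combined_loss(probs, y_clip, y_assigned, alpha=0.3):
    """probs: (n, C) per-actor probabilities; y_clip: (C,) clip labels;
    y_assigned: (n, C) assignment targets from the actor-action association."""
    bag_probs, _ = probs.max(dim=0)
    miml = F.binary_cross_entropy(bag_probs, y_clip)
    per_actor = F.binary_cross_entropy(probs, y_assigned, reduction="none")
    return miml + alpha * per_actor.mean(dim=1).sum()   # sum over actors i
```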

6 Experiments

6.1 Dataset and Implementation Details

We use the AVA 2.2 dataset [19] for evaluation. The dataset contains 235 videos for training, 64 videos for validation, and 131 videos for testing, covering 60 action classes. The persons often perform multiple actions at the same time, and the videos contain multiple persons. For each annotated person, a bounding box is provided. An example is given in Fig. 1. Only one frame per second is annotated. The accuracy is measured by the mean average precision (mAP) over all actions with a bounding box IoU threshold of 0.5, as described in [19]. In the weakly supervised setting, we use only the present actions for training, but not the bounding boxes.
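For reference, the bounding box matching criterion of this evaluation protocol is plain intersection-over-union; a small self-contained sketch with toy boxes:

```python
# Sketch of the IoU criterion used for evaluation: a detection matches a
# ground-truth box when IoU >= 0.5. Boxes are given as (x1, y1, x2, y2).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3, below the 0.5 threshold
```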

To detect persons, we use Faster RCNN [41] with ResNeXt-101 [38] as backbone. The detector was pre-trained on ImageNet and fine-tuned on the COCO dataset. In our experiments, we report results for two 3D CNNs, namely I3D [39] and Slowfast [10]. I3D is pre-trained on Kinetics-400. For Slowfast, we use the ResNet-101 + NL (\(8\,\times \,8\)) version that is pre-trained on Kinetics-600. The temporal scope was set to 64 frames with a stride of 2. For HGRNN, we use a temporal window of 11 frames. For training, we use the SGD optimizer until the validation error saturates. The learning rate with linear warmup was set to 0.04 and 0.025 for I3D and Slowfast, respectively. The batch size was set to 16. We use cropping as data augmentation, where we crop images of size \(224 \times 224\) pixels from frames with an image resolution of \(256\,\times \,256\) pixels.

Table 1. Comparison of MIML with the proposed method. The proposed approach outperforms MIML for both I3D and Slowfast.
Fig. 4.

Comparison of MIML with the proposed method. The plot shows the per-class mAP for the 10 most frequently occurring classes in the training set. The actions are sorted by the number of occurrences in decreasing order from left to right. A plot with all 60 action classes is part of the supplementary material.

6.2 Experimental Results

Comparison of MIML with Proposed Method. Table 1 shows the comparison of the proposed approach with the multi-instance and multi-label (MIML) baseline on the validation set. When I3D is used as 3D CNN, the proposed approach improves over the MIML baseline by \(+3.2\%\). When Slowfast is used, the accuracy of all methods is higher, but the improvement of the proposed approach over the MIML approach remains nearly the same at \(+3.3\%\). We also report the result when HGRNN is trained only with the MIML loss. In this case, the actor-action association is not used, and we denote this setting by MIML+HGRNN. While HGRNN improves the results since it models the spatio-temporal relations between persons better than a 3D CNN alone, the proposed actor-action assignment improves the mAP compared to MIML+HGRNN by \(+2.1\%\) and \(+2.0\%\) for I3D and Slowfast, respectively. Figure 4 shows the improvement of the proposed approach over the MIML baseline for the 10 action classes that occur most frequently in the training set. A few qualitative results are shown in Fig. 5.

Table 2. Results of various actor-action assignment approaches using HGRNN with different 3D CNNs. The Frequent-5 and Least-10 columns show the average mAP over the 5 most frequently and the 10 least frequently occurring classes in the training set.
Table 3. Performance with ground-truth bounding boxes for evaluation. The results show the improvement in mAP on the validation set when ground-truth bounding boxes (GT bb) instead of detected bounding boxes (Detected bb) are used for evaluation. Furthermore, the results are reported when the model is trained with full supervision.

Impact of Actor-Action Association. In Table 1, we have observed that the actor-action association improves the accuracy. In Table 2, we analyze the impact of the actor-action association in more detail. We use HGRNN with both I3D and Slowfast as 3D CNN backbone. In case of MIML+HGRNN, the actor-action association is not used. We also report the result when we perform the association directly by the confidences without solving a binary linear program; we denote this setting by Proposed Approach w/o LP. In this case, we associate an action with an actor if the class probability is greater than 0.5. For I3D, the association without LP improves the results mainly for the most frequent classes, with almost no improvement on the least frequent classes. For Slowfast, the association without LP even decreases the performance in comparison to MIML+HGRNN. Solving the linear program instead results in better associations for both I3D and Slowfast.

Impact of the Object Detector. We use the Faster RCNN person detector with ResNeXt [38], which achieves \(90.10\%\) mAP for person detection on the AVA training set and \(90.45\%\) on the AVA validation set. Despite this high detection performance, we analyze how much the accuracy improves if the detected bounding boxes are replaced with the ground-truth bounding boxes during evaluation. Note that the ground-truth bounding boxes are not used for training, but only for evaluation. The results are shown in Table 3. We observe that the performance improves by \(+7.0\%\) and \(+7.2\%\) mAP on the validation set for I3D and Slowfast, respectively. We also report the results when the approach is trained with full supervision. In this case, the network is trained on the ground-truth bounding boxes and the ground-truth action labels per bounding box. Compared to the fully supervised approach, our weakly supervised approach achieves around \(83\%\) of the mAP for both 3D CNNs (\(17.3\%\) vs. \(20.7\%\) for I3D and \(25.1\%\) vs. \(30.1\%\) for Slowfast) if detected bounding boxes are used for evaluation. The gap gets even smaller when ground-truth bounding boxes are used for evaluation. In this case, the relative performance is \(95.7\%\) for I3D and \(90.5\%\) for Slowfast. This demonstrates that the proposed approach learns the actions very well despite the weak supervision.

Table 4. Comparison to fully supervised approaches. We also report the result of our approach if it is trained with full supervision. Note that we do not use multi-scale and horizontal flipping augmentation as in Slowfast++.
Fig. 5.

Qualitative results. The left column shows the ground-truth annotations, the middle column the results of the MIML baseline, and the right column the results of the proposed method. The colors only distinguish different persons and are otherwise irrelevant. The predicted action classes with confidence scores are shown on top of the estimated bounding boxes. The proposed approach correctly recognizes more action classes per bounding box compared to MIML. Both methods also detect genuine actions that are not annotated in the dataset, as can be seen from the missing persons in the second and fourth rows. The bias of the proposed method towards the background is visible in the last row, where the “swim” action is associated with both persons. Best viewed using the zoom function of the PDF viewer.

Comparison to Fully Supervised Methods. Since this is the first approach that addresses weakly supervised learning for multi-label and multi-person action detection, we cannot compare with other weakly supervised approaches. Instead, we compare our approach with the state of the art for fully supervised action detection in Table 4. Our approach is competitive with fully supervised approaches [5,6,7,8]. When we train our approach with full supervision, we improve over Slowfast [10] by \(+1.1\%\) mAP on the validation set. While the Slowfast++ network performs slightly better, it uses additional data augmentation and a different network configuration. We expect that these changes would improve our approach as well.

7 Conclusion

In this paper, we introduced the challenging task of weakly supervised multi-label spatio-temporal action detection with multiple actors. We first introduced a baseline based on multi-instance and multi-label learning. We furthermore presented a novel approach where the multi-label problem is represented by the power set of the action classes. In this context, we assign an element of the power set to each detected person using linear programming. We evaluated our approach on the challenging AVA dataset, where the proposed method outperforms the MIML approach. Despite the weak supervision, the proposed approach is competitive with fully supervised approaches.