1 Introduction

Group Activity Recognition is an extensively studied topic in the video content analysis area [1]. Traditional video captioning techniques such as LSTM-YT [2] and S2VT [3] use recurrent neural networks, specifically LSTMs [4], to train models on video-sentence pairs [1, 2, 3, 5]. These models learn the association between a sequence of video frames and a sequence of words to generate a description of the video [3]. Krishna et al. pointed out that such video captioning approaches only work for short videos with a single major event [5]. They therefore introduced a new captioning module that uses contextual information from the timeline to describe all the events in a video clip [5]. However, video captioning approaches are still very limited in generating detailed descriptions of human actions and their interactions [6]. Recent studies in pose estimation, human action recognition, and group activity recognition show the capability to describe human actions and human group activities in more detail [7, 8].

Human action recognition and group activity recognition are important problems in video understanding [9, 10]. Action and activity recognition techniques have been widely applied in areas such as social behavior understanding, sports video analysis, and video surveillance. To better understand a video scene that includes multiple persons, it is essential to understand both each individual's action and their collective activity. Actor Relation Graph (ARG) based group activity recognition is the state-of-the-art model that focuses on capturing the appearance and position relationships between the actors in the scene and performing action and group activity recognition [9].

In this paper, we propose several approaches to improve the functionality and performance of the Actor Relation Graph based model, which we call the IARG model, so that it performs better group activity recognition under extreme lighting conditions. To enhance human action and group activity recognition performance, we apply MobileNet [11] in the CNN layer and use normalized cross-correlation (NCC) and the sum of absolute differences (SAD) to calculate the pair-wise appearance similarity used to build the Actor Relation Graph. We also introduce a visualization module that plots each input video frame with predicted bounding boxes on each human object, together with the predicted individual actions and group activity. Output examples are shown in Fig. 2.

2 Related Work

2.1 Video Captioning Based Group Activity Recognition

One important study area of group activity understanding is video captioning. In 2015, S. Venugopalan et al. proposed an end-to-end sequence-to-sequence model that exploits recurrent neural networks, specifically Long Short-Term Memory (LSTM [4]) networks, trained on video-sentence pairs to associate a sequence of frames in a video with a sequence of words and generate captions describing the event in the video [3]. A stack of two LSTMs was used to learn the temporal structure of the frame sequence and the sequence model of the generated sentences. In this approach, the entire video sequence needs to be encoded with the LSTM network at the beginning; long video sequences can therefore lead to vanishing gradients and prevent the model from being trained successfully [5].

In 2017, R. Krishna et al. introduced a Dense-Captioning Events (DCE) model that can detect multiple events and generate a description for each event using contextual information from past, concurrent, and future events in a single pass over the video [5]. Their process is divided into two steps: event detection and description of the detected events. The DCE model leverages a multi-scale variant of the deep action proposal model to localize temporal proposals of interest in both short and long video sequences. In addition, a captioning LSTM model with an attention mechanism is introduced to exploit context from the past and future.

In 2018, X. P. Li et al. introduced a novel attention-based framework called Residual attention-based LSTM (Res-ATT [12]). This model benefits from the existing attention mechanism and further integrates residual mapping into a two-layer LSTM network to avoid losing previously generated word information. The residual attention-based decoder is designed with five separate parts: a sentence encoder, temporal attention, a visual and sentence feature fusion layer, a residual layer, and an MLP [12]. The sentence encoder is an LSTM layer that extracts important syntactic information from a sentence. The temporal attention is designed to identify the importance of each frame. The visual and sentence feature fusion LSTM layer mixes natural language information with image features, and the residual layer is proposed to reduce transmission loss. The MLP layer predicts the next word to generate a description in natural language [12]. However, while all of the video captioning approaches above perform well on individual activity recognition, they do not perform well on group activity recognition.

2.2 Pose Estimation Based Group Activity Recognition

To better understand a video scene that includes multiple persons, it is essential to understand each individual's action. OpenPose is an open-source real-time system for 2D multi-person pose detection [8]. It is now also widely used for body and facial landmark detection in video frames [7, 13, 14]. It produces a spatial encoding of pairwise relationships between body parts for a variable number of people, followed by greedy bipartite graph matching, to output the 2D keypoints for all people in the image. In this approach, both the prediction of part affinity fields (PAFs) and the detection of confidence maps are refined at each stage [8, 15, 16]. This improves real-time performance while maintaining the accuracy of each component. The publicly available OpenPose library supports jointly detecting human body, hand, and facial keypoints on a single image, which provides 2D human pose estimation for our proposed system.

Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells are widely used for human action recognition among the emerging accessible human activity recognition methods. F. M. Noori et al. propose an approach that first extracts anatomical keypoints from RGB images using the OpenPose library, then obtains extra temporal motion features by considering the movements across consecutive video frames, and finally classifies the features into the associated activities using an RNN with LSTM [7]. Improved performance is shown when recognizing activities performed by several different subjects from various camera angles. However, pose estimation based human action recognition performs poorly on multi-person interactions and group activity recognition.

2.3 Learning Actor Relation Graphs for Group Activity Recognition

Human action recognition and group activity recognition are important problems in group activity understanding [9]. In 2019, J. Wu et al. proposed using an Actor Relation Graph (ARG) to model relationships between actors and recognize group activity in scenes with multiple persons [9]. Using the ARG in a multi-person scene, the relations between actors, with respect to appearance similarity and relative location, are inferred and captured. Compared with using a CNN to extract person-level features and later aggregate them into a scene-level feature, or using an RNN to capture temporal information over densely sampled frames, learning with ARG is less computationally expensive and more flexible when dealing with variation in the group activity. Given a video sequence with bounding boxes and ground-truth action labels for the actors in the scene, the trained network can recognize individual actions and the group activity in a multi-person scene. For long-range video clips, ARG's efficiency is improved by forcing relational connections only within a local neighborhood and randomly dropping several frames, which maintains the diversity of the training samples and reduces the risk of overfitting. At the beginning of the training process, the actors' features are extracted by a CNN and the RoIAlign model [17] using the provided bounding boxes. After obtaining the feature vectors for the actors in the scene, multiple actor relation graphs are built to represent diverse relation information for the same set of actor features. Finally, a Graph Convolutional Network (GCN) performs learning and inference to recognize individual actions and the group activity based on the ARG. Two classifiers, one for individual actions and one for group activity recognition, are applied to the pooled ARG features. A scene-level representation is generated by max-pooling the individual actor representations and is later used for group activity classification. However, the ARG-based model performs poorly on videos with extreme brightness and contrast conditions.

3 Our Approach

We propose an improved Actor Relation Graph based model (IARG) that focuses on group activity recognition without the restrictions of extreme brightness and contrast conditions. An overview of the improved ARG-based model is shown in Fig. 1.

Fig. 1. The improved ARG-based human action and group activity recognition model.

The model first extracts actor features from sampled video frames with manually labeled bounding boxes using a CNN and RoIAlign [17]. Next, it builds an N by d feature matrix, where a d-dimensional vector represents each actor's bounding box and N denotes the total number of bounding boxes across the video frames. Actor relation graphs are then built to capture each actor's appearance and position relationships in the scene. Afterward, the model uses a Graph Convolutional Network (GCN) to analyze the relationships between actors from the ARG. Finally, the original and relational features are aggregated and used by two separate classifiers to perform action and group activity recognition [9]. Since this study focuses mainly on group activity recognition, the individual action recognition is not very accurate because the model only uses the region of interest and a CNN to perform action recognition.
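
As a rough illustration, the following PyTorch sketch shows how an N by d actor feature matrix could flow through a relation graph and one graph-convolution layer into the two classifiers. All shapes, layer sizes, and names are hypothetical placeholders and do not reproduce the exact implementation of [9].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, d = 12, 1024                      # hypothetical: 12 actor boxes, d-dimensional features
num_actions, num_activities = 5, 4   # hypothetical label counts

features = torch.randn(N, d)              # stand-in for CNN + RoIAlign actor features
G = F.softmax(torch.randn(N, N), dim=1)   # stand-in for the actor relation graph (built as in Sect. 4)

gcn_layer = nn.Linear(d, d, bias=False)        # one graph-convolution layer
relational = F.relu(G @ gcn_layer(features))   # aggregate features over related actors
fused = features + relational                  # combine original and relational features

action_head = nn.Linear(d, num_actions)
activity_head = nn.Linear(d, num_activities)

action_scores = action_head(fused)                        # per-actor action scores
activity_scores = activity_head(fused.max(dim=0).values)  # max-pooled scene-level activity scores
```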

To improve the overall accuracy, we propose to apply MobileNet [11] in the CNN layer to extract image feature maps and to use normalized cross-correlation (NCC) and the sum of absolute differences (SAD) to calculate the pair-wise appearance similarity when building the Actor Relation Graph. More details of our proposed methodology are given in Sect. 4.

To make our model's output easier to interpret, we also introduce a visualization module that plots each input video frame with predicted bounding boxes on each human object, together with the predicted individual actions and group activity. Output examples are shown in Fig. 2.
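
A hedged sketch of such a visualization step is shown below, assuming OpenCV is available and that boxes are given as (x1, y1, x2, y2) pixel coordinates; the function and argument names are illustrative rather than taken from our code base.

```python
import cv2

def visualize_frame(frame, boxes, action_labels, activity_label):
    """frame: BGR image array; boxes: list of (x1, y1, x2, y2) ints; labels: strings."""
    for (x1, y1, x2, y2), action in zip(boxes, action_labels):
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)      # actor bounding box
        cv2.putText(frame, action, (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)     # predicted individual action
    cv2.putText(frame, "Group activity: " + activity_label, (10, 25),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)         # predicted group activity
    return frame
```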

4 Methodology

In our model, building the Actor Relation Graph is the key step. J. Wu et al. have shown that the ARG can represent the graph structure of pair-wise relation information between each pair of actors in each frame and use this relation information for group activity understanding [9].

To better understand the relationship between two actors, both appearance features and position information are used to construct the ARG. The relation value is defined as the composite function below, where the function \(f_a\) indicates the appearance relation and the function \(f_s\) indicates the position relation. Here \(x^a_i\) and \(x^a_j\) refer to actor i's and actor j's appearance features, while \(x^s_i\) and \(x^s_j\) refer to actor i's and actor j's location features (the center coordinates of each actor's bounding box). The function h fuses the appearance and position relations into a scalar weight [9]:

$$\begin{aligned} G_{ij} = h(f_a(x^a_i, x^a_j ), f_s(x^s_i, x^s_j)). \end{aligned}$$
(1)

Normalization is further applied to each actor node with the SoftMax function so that the sum of all corresponding values for each actor node always equals one [9]:

$$\begin{aligned} \textbf{G}_{i j}=\frac{f_{s}\left( \textbf{x}_{i}^{s}, \textbf{x}_{j}^{s}\right) \exp \left( f_{a}\left( \textbf{x}_{i}^{a}, \textbf{x}_{j}^{a}\right) \right) }{\sum _{j=1}^{N} f_{s}\left( \textbf{x}_{i}^{s}, \textbf{x}_{j}^{s}\right) \exp \left( f_{a}\left( \textbf{x}_{i}^{a}, \textbf{x}_{j}^{a}\right) \right) }. \end{aligned}$$
(2)
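
A minimal sketch of Eq. (2) is given below, assuming \(f_a\) is an N by N matrix of appearance relations and \(f_s\) is an N by N binary position mask (both are defined in the following subsections); the small epsilon guard is our own addition to avoid division by zero when a row of the mask is empty.

```python
import torch

def build_relation_graph(f_a: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """f_a, f_s: (N, N) tensors. Returns the row-normalized relation graph G of Eq. (2)."""
    weights = f_s * torch.exp(f_a)                                      # mask and exponentiate
    return weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-12)  # each row sums to one
```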

4.1 Appearance Relation

In J. Wu's paper, the Embedded Dot-Product is implemented to compute the similarity between two actors' appearance features (the image features inside each actor's bounding box) in an embedding space [9]. The corresponding function is written as:

$$\begin{aligned} f_{a}\left( \textbf{x}_{i}^{a}, \textbf{x}_{j}^{a}\right) =\frac{\theta \left( \textbf{x}_{i}^{a}\right) ^{\textrm{T}} \phi \left( \textbf{x}_{j}^{a}\right) }{\sqrt{d_{k}}}. \end{aligned}$$
(3)

Here \(\theta \) and \(\phi \) are two functions of the form Wx + b, in which W and b are learnable weights. These learnable transformations of the original appearance features allow the relation between two actors to be captured in a subspace.
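
A sketch of the embedded dot-product in Eq. (3) is shown below; the embedding dimension simply reuses \(d_k = 256\) from Sect. 5.2, and the input feature dimension d is a placeholder.

```python
import torch
import torch.nn as nn

d, d_k = 1024, 256          # d is a placeholder feature size; d_k = 256 as in Sect. 5.2
theta = nn.Linear(d, d_k)   # theta(x) = Wx + b with learnable W, b
phi = nn.Linear(d, d_k)     # phi(x)   = Wx + b with learnable W, b

def embedded_dot_product(x_a: torch.Tensor) -> torch.Tensor:
    """x_a: (N, d) actor appearance features. Returns the (N, N) relation matrix f_a."""
    return theta(x_a) @ phi(x_a).T / (d_k ** 0.5)   # scaled dot-product in embedding space
```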

Our model evaluates two other methods for the appearance relation calculation: the Normalized cross-correlation (NCC) and the sum of absolute differences (SAD).

Normalized cross-correlation (NCC) is a method to evaluate the degree of similarity between two compared images. Because the brightness of the compared images can vary due to lighting and exposure conditions, the images are first normalized so that NCC gives a more accurate similarity score. The advantage of normalized cross-correlation is that it is less sensitive to linear changes in the amplitude of illumination in the two compared images, and the corresponding function can be written as follows [18]:

$$\begin{aligned} \varphi _{{x}_{i}^{a} {x}_{j}^{a}}^{\prime }(t)=\frac{\varphi _{{x}_{i}^{a} {x}_{j}^{a}}(t)}{\sqrt{\varphi _{{x}_{i}^{a} {x}_{i}^{a}}(0) \varphi _{{x}_{j}^{a} {x}_{j}^{a}}(0)}}. \end{aligned}$$
(4)

The normalized quantity \(\varphi _{{x}_{i}^{a} {x}_{j}^{a}}^{\prime }(t)\) varies between −1 and 1, where a value of 1 indicates an exact match between the two images and a value of 0 indicates no match. The NCC value helps us better understand the appearance relation between each pair of actors.
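
As a hedged sketch, one way to compute an NCC score for a pair of actors' appearance features is shown below; the mean subtraction implements the brightness normalization mentioned above and is an assumption about the exact preprocessing rather than a detail taken from [18].

```python
import torch

def ncc(x_i: torch.Tensor, x_j: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalized cross-correlation of two flattened appearance features, in [-1, 1]."""
    a = x_i.flatten() - x_i.mean()          # remove mean to reduce sensitivity to brightness
    b = x_j.flatten() - x_j.mean()
    return (a @ b) / (a.norm() * b.norm() + eps)   # normalize by the zero-lag autocorrelations
```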

The sum of absolute differences (SAD) is another method we evaluate to calculate the appearance relation when building the ARG. SAD calculates the distance between two matrices by computing the sum of the absolute differences of their components:

$$\begin{aligned} {\text {SAD}}\left( {x}_{i}^{a}, {x}_{j}^{a}\right) =\sum _{k=1}^{n}\left| {x}_{ik}^{a}-{x}_{jk}^{a}\right| . \end{aligned}$$
(5)

Since SAD is more resistant to extreme values in the data, it is more robust when comparing appearance features and better captures the appearance relation.
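
A minimal sketch of Eq. (5) follows. Since SAD is a distance (smaller means more similar), how it is mapped to a relation value, for example by negation, is an assumption here rather than a detail from the paper.

```python
import torch

def sad(x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
    """Sum of absolute differences between two flattened appearance features (Eq. 5)."""
    return (x_i.flatten() - x_j.flatten()).abs().sum()

def sad_relation(x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
    """Assumed mapping to a similarity-like relation value: negate the SAD distance."""
    return -sad(x_i, x_j)
```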

4.2 Position Relation

In addition, spatial structural information is considered to better capture the position relation between actors. A distance mask is applied so that only signals from entities that are not far apart are kept. Since local relations are more crucial than global relations for group activity understanding, the position relation \(f_s\) between two actors is computed from their Euclidean distance as:

$$\begin{aligned} f_{s}\left( \textbf{x}_{i}^{s}, \textbf{x}_{j}^{s}\right) =\mathbb {I}\left( d\left( \textbf{x}_{i}^{s}, \textbf{x}_{j}^{s}\right) \le \mu \right) , \end{aligned}$$
(6)

where \(\mathbb {I}(\cdot )\) denotes the indicator function, \(d(\textbf{x}_{i}^{s}, \textbf{x}_{j}^{s})\) calculates the Euclidean distance between the center coordinates of two actors’ bounding boxes, and \(\mu \) is a distance threshold.
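
A sketch of Eq. (6) is given below, assuming the bounding-box centers of all N actors are stacked into a single tensor; the function name is illustrative.

```python
import torch

def position_relation(centers: torch.Tensor, mu: float) -> torch.Tensor:
    """centers: (N, 2) bounding-box center coordinates. Returns the (N, N) {0, 1} mask f_s."""
    dist = torch.cdist(centers, centers)   # pairwise Euclidean distances d(x_i^s, x_j^s)
    return (dist <= mu).float()            # indicator: 1 if within threshold mu, else 0
```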

5 Experiments and Results

5.1 Datasets

In this paper, we use the public group activity recognition datasets, the Collective Activity dataset and its augmented version [19], to train and test our model. The dataset contains 74 video scenes, each including multiple persons. Manually defined bounding boxes for each person and the ground truth of their actions and the group activity are labeled in each frame.

5.2 Implementation Details and Results

We use a minibatch size of 16 with a learning rate of 0.0001 and train our network for 100 epochs. The individual action loss weight \(\lambda = 1\) is used. The GCN parameters are set as \(d_k = 256\) and \(d_s = 32\), and 1/5 of the image width is adopted as the distance mask threshold \(\mu \). The default backbone CNN for feature extraction is Inception-v3 [20], and the default appearance relation function is the embedded dot-product. Our implementation is based on the PyTorch framework and runs on two RTX 2080 Ti GPUs.
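
For reference, the hyperparameters above can be summarized as the following configuration sketch; the key names are ours and not those of the actual code base.

```python
config = {
    "batch_size": 16,
    "learning_rate": 1e-4,
    "epochs": 100,
    "action_loss_weight": 1.0,        # lambda for the individual action loss
    "d_k": 256,                       # appearance-relation embedding dimension
    "d_s": 32,                        # position embedding dimension
    "distance_mask_ratio": 0.2,       # mu = image_width / 5
    "backbone": "inception_v3",       # default; "mobilenet" is the lighter alternative
    "appearance_relation": "embedded_dot_product",  # alternatives: "ncc", "sad"
}
```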

Evaluation 1 - Evaluate with Different Backbone Networks. In this subsection, we conduct detailed studies on the Collective Activity dataset to compare relation modeling with the proposed backbone networks, using group activity prediction accuracy as the assessment metric. The results of the experiments are shown in Table 1.

In our two-stage training, we first fine-tune the ImageNet pre-trained backbone network on a randomly selected frame from each training sample. Then, in stage 2, the weights of the feature extraction part of the backbone network are fixed, and we further train the network with the GCN, calculating the appearance relation with the embedded dot-product. We begin our experiments with Inception-v3 [20] as the backbone network. The first stage of training takes approximately 2.6 h, while the second stage takes longer, about 3.3 h. With Inception-v3 [20], the group activity recognition accuracy after stage 1 training reaches 90.91%. With the additional GCN training in stage 2, our model yields a higher recognition accuracy of 92.71%.

We further adopt MobileNet [11] as the backbone network to boost the speed of our model. MobileNet [11] is a lightweight but efficient deep convolutional neural network. With MobileNet [11], the training time of stage 1 is 1.8 h, which reduces the time spent in stage 1 by 32.7%. The training time of stage 2 is 2.4 h, which is 26% less than the training time of Inception-v3 at the same stage. In summary, the training speed of our model is increased by 35%. However, the activity recognition accuracy drops slightly from 92.71% to 91.44%.

Table 1. Accuracy (%) of group activity prediction from two backbone networks

Evaluation 2 - Evaluate with Different Appearance Relation Functions. In this experiment, we evaluate the group activity recognition performance with different appearance relation functions. We first train and validate the group activity recognition performance based on the default Inception-v3 [20] backbone and the embedded dot-product for the appearance relation calculation. The best result we obtain is 92.71%. Then we replace the appearance relation function with normalized cross-correlation (NCC) to build the actor relation graph, and the best result we achieve is 93.50%. We further evaluate the sum of absolute differences (SAD) function to calculate the appearance similarity, and the best score we achieve is 93.98%.

Table 2. Accuracy (%) of group activity prediction on the Inception-v3 [20] backbone network with different appearance relation functions

Evaluation 2 shows that our proposed model with either NCC or SAD as the appearance relation function achieves better group activity prediction accuracy, as expected. The results of the experiments are shown in Table 2.

6 Conclusion

This paper utilizes the Actor Relation Graph (ARG) based model with novel improvements for group activity recognition. To enhance our model's performance, we learn the ARG and perform appearance relation reasoning on the graph using normalized cross-correlation (NCC) and the sum of absolute differences (SAD). In addition, to improve the computational speed, we introduce MobileNet [11] as the backbone network of our proposed model. Extensive experiments demonstrate that the proposed methods are robust and effective for enhancing both accuracy and speed on the Collective Activity dataset. Since our project focuses mainly on group activity recognition, the individual action recognition is not very accurate because we only use the region of interest and a CNN to perform action recognition.