1 Introduction

The massive streams of visual information captured by CCTV surveillance and body-worn cameras cannot be easily monitored by human operators, particularly in the field of law enforcement. To assist law enforcement officers in their daily tasks and to improve their operational and investigation capabilities, several tools have been developed to automatically process and analyse such video streams and alert human operators when events of interest, such as abnormal activities, take place. Abnormalities can be understood broadly as deviations from normal behaviour: unknown states, deviant actions, or outliers. This work focuses on systems that aim to recognise actions of interest performed by humans or vehicles and to assign each action to one of a set of predefined categories. Leveraging the significant advancements in deep learning, state-of-the-art action recognition methods are based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [10, 12]. The architectures of such activity recognition systems typically consist of two parts: a feature extractor and a classifier. To this end, this work proposes an end-to-end activity recognition framework that extracts visual features from video streams and classifies them into predefined activities. The proposed framework is evaluated using the VIRAT [8] dataset and the activities considered in the TRECVID Activities in Extended Video (ActEV) evaluation series [3].

The main contributions of this work are the proposal of a complete end-to-end activity recognition framework based on deep learning neural networks, the investigation of early and late fusion techniques in the context of this framework, and the extensive evaluation experiments using the VIRAT dataset. Moreover, since some of the ActEV activities are fine-grained, we group similar activities together so as to consider coarser-grained activities that are likely to be of more interest to general activity-based recognition systems; we have thus performed evaluation experiments using both the finer- and the coarser-grained activities.

The remainder of the chapter is structured as follows. Section 9.2 discusses related work and relevant datasets, Sect. 9.3 presents the proposed framework, Sect. 9.4 describes the experimental setup and presents the evaluation results, and Sect. 9.5 concludes this work.

2 Related Work

State-of-the-art activity recognition methods are based on deep learning techniques. Simonyan et al. [9] proposed a 2D convolution-based architecture that takes as input the visual frames and stacked optical-flow features, forming a two-stream neural network that learns the motion and the appearance of the input video simultaneously. Ji et al. [5] proposed a 3D convolution-based approach in order to extract spatio-temporal features, while Tran et al. [12] also trained a 3D convolutional neural network. Hara et al. [4] extended previous works that make use of 3D convolutional kernels of size 3×3×3 by employing varied kernel sizes and very deep convolutional neural networks. They also concluded that the Kinetics [6] dataset, consisting of more than 300,000 videos that depict 400 human-related activities, can be widely employed for training and testing activity recognition systems, similarly to the wide use of the ImageNet [2] dataset for training object detection systems.

Apart from Kinetics, several other datasets have been built for the activity recognition problem. HMDB-51 [7] is one such dataset; it consists of 6766 videos, with a mean duration of approximately 3 seconds, categorised into 51 human activities extracted from movies. ActivityNet [1] is another such dataset, consisting of around 20,000 videos categorised into 200 human activities. Finally, both the videos of the VIRAT [8] dataset and their annotations are provided by the National Institute of Standards and Technology (NIST, https://www.nist.gov/) in the context of the TRECVID Activities in Extended Video (ActEV, https://actev.nist.gov/) evaluation series.

3 Activity Recognition Framework

This work follows the supervised learning paradigm for human-related activity recognition, employing a deep neural network architecture, namely the 3D ResNet [4]. This 3D convolution-based architecture achieves fast processing and can thus perform human activity recognition in (near) real time while simultaneously processing frames in batches. In particular, the architectures with 18, 50, and 101 layers, as described in [4], have been deployed.

The 3D-ResNet-18 architecture consists of basic blocks, each comprising two 3D convolutional layers followed by batch normalisation and ReLU (rectified linear unit) activation layers, as depicted on the left part of Fig. 9.1. The other two architectures (3D-ResNet-50 and 3D-ResNet-101) follow the bottleneck approach (see right part of Fig. 9.1), where each bottleneck block consists of three 3D convolutional layers followed by batch normalisation and ReLU activation layers, with 1×1×1 convolution kernels for the first and third layers and 3×3×3 for the middle one. A sketch of both block types is given below.
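As an illustration, the following is a minimal PyTorch sketch of the two block types; strides, downsampling, and the surrounding network of [4] are omitted for brevity, and the ReLU is placed after the residual addition, as is usual in ResNets.

```python
import torch.nn as nn


class BasicBlock3D(nn.Module):
    """Basic block: two 3x3x3 convolutions, each followed by batch norm."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection


class Bottleneck3D(nn.Module):
    """Bottleneck block: 1x1x1 -> 3x3x3 -> 1x1x1 convolutions."""

    expansion = 4

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.conv1 = nn.Conv3d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm3d(mid_channels)
        self.conv2 = nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(mid_channels)
        self.conv3 = nn.Conv3d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the channel count changes.
        self.shortcut = (nn.Identity() if in_channels == out_channels else
                         nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))
```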

Fig. 9.1 3D-ResNet basic and bottleneck blocks (adapted from [4])

Finally, it should be noted that weights pre-trained on the Kinetics dataset [6] were loaded for all architectures. The Kinetics dataset was selected since it covers a large number of human activity classes (400) and contains videos that were not collected from domain-specific sources (e.g. movies or soccer games), but rather from diverse sources uploaded to YouTube.
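As a hedged sketch of this step, the snippet below loads a Kinetics-400 pre-trained 3D ResNet-18 and replaces its 400-class head with one for the 18 ActEV activities; torchvision's r3d_18 is used here only as a stand-in for the models of [4], whose weights are distributed separately.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in for the 3D ResNets of [4]: a comparable 18-layer 3D ResNet
# pre-trained on Kinetics-400.
model = r3d_18(pretrained=True)

# Replace the 400-class Kinetics classifier with an 18-class ActEV head.
model.fc = nn.Linear(model.fc.in_features, 18)
```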

4 Experiments

This section reports on the experimental evaluation of the proposed activity recognition framework by presenting first the datasets used in our experiments (Sect. 9.4.1), then the overall experimental setup (Sect. 9.4.2), and finally the evaluation results of our experiments (Sect. 9.4.3).

4.1 Dataset

In order to evaluate the proposed method, we selected the dataset provided by NIST under the ActEV evaluation series. This dataset was selected since it contains several human activities and vehicle actions that can be considered abnormal in particular contexts. ActEV considers activities where one or more people generate movements or interact with objects (or groups of objects), such as other people (P) and vehicles (V). Specifically, ActEV defines and annotates the 18 human activities and vehicle actions listed in Table 9.1. The ActEV dataset consists of a total of 2466 annotated activities in its training and validation sets, extracted from 118 videos of the VIRAT (release 1.0 and 2.0) dataset (http://viratdata.org/). The training set consists of 64 videos containing 1338 annotated activities, while the validation set consists of 54 videos containing 1128 annotated activities. The test set is not considered, as its annotations are not publicly available. The distribution of the activities for both the training and validation sets is depicted in Fig. 9.2. As can be observed, ActEV is a challenging dataset, as it is highly unbalanced.

Table 9.1 The activities officially defined by ActEV
Fig. 9.2 ActEV dataset activities distribution

As some of the ActEV activities are rather fine-grained, we have also grouped similar activities together, so as to consider coarser-grained activities that are likely to be of interest to more general activity-based recognition systems (e.g. recognition of vehicle-related activities). Table 9.2 lists these so-called super-activities, while Fig. 9.3 depicts the distribution of the super-activities for the training and validation sets, which, as before, is also highly unbalanced.

Table 9.2 ActEV activities grouped into "super-activities"
Fig. 9.3 ActEV dataset super-activities distribution

4.2 Experimental Setup

The aim of the evaluation experiments was to assess the effectiveness of the activity recognition system; they therefore focused on processing and analysing only the parts of the video streams where some form of activity had been observed. To this end, frames were first extracted from all videos; more specifically, one out of every four frames was extracted. Then, only the frames that depict an activity were kept and stored as PNG images, as sketched below.
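A minimal sketch of this preprocessing step, assuming OpenCV is available (file names and directory layout are illustrative):

```python
import os

import cv2


def extract_frames(video_path, out_dir, step=4):
    """Store one out of every `step` frames of a video as PNG images."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.png"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```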

The same training strategy was followed for each experiment. Specifically, the batch size was set to 32, the total number of epochs to 200, and stochastic gradient descent [11] was used as the optimiser with an initial learning rate of 0.1. A "reduce on plateau" schedule with a maximum patience of 10 epochs was applied: this strategy reduces the learning rate by a factor once learning stagnates, i.e. when no improvement is seen for a "patience" number of epochs. Furthermore, five scale factors (1.0, 0.84, 0.70, 0.59, 0.49) were used for data augmentation, and a corner cropping strategy was also applied, i.e. the cropped box is randomly selected from the four corners and the centre of the frame.
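In PyTorch terms, the optimiser and schedule described above might be set up as follows; the momentum value, the stand-in model, and the random data are assumptions made purely for a self-contained sketch.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Stand-in model and data; in the actual experiments the model is a 3D ResNet
# and the loader yields batches of 32 frame clips.
model = nn.Linear(512, 18)
loader = [(torch.randn(32, 512), torch.randint(0, 18, (32,))) for _ in range(10)]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # momentum assumed
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=10)

for epoch in range(200):
    epoch_loss = 0.0
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)  # reduce the learning rate once the loss plateaus
```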

The training process was monitored using the TensorBoard application from the TensorFlow repository. Figure 9.4 presents the accuracy per epoch during training, with the 3D-ResNet architectures of 18, 50, and 101 layers shown in blue, orange, and red, respectively. The corresponding training losses are depicted in Fig. 9.5.

Fig. 9.4 Accuracy during training of ResNet-18 (blue), ResNet-50 (orange), and ResNet-101 (red) with respect to the number of epochs

Fig. 9.5 Cross-entropy loss during training of ResNet-18 (blue), ResNet-50 (orange), and ResNet-101 (red) with respect to the number of epochs

The validation set of the ActEV dataset was used for evaluating the proposed activity recognition framework, in order to investigate how the depth of a 3D-ResNet architecture affects its effectiveness. To this end, we applied two different experimental settings: one that considers the 18 activities of the ActEV dataset and one that considers the 6 super-activities. For the super-activities, we apply both late and early fusion: for late fusion, the prediction score of each super-activity is computed at test time by summing the prediction scores of its sub-activities, whereas for early fusion, the super-activities are merged during training (i.e. a single training set is created for each super-activity by merging the training sets of its sub-activities). A sketch of both variants is given below.
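A small sketch of the two fusion variants, with a hypothetical activity-to-super-activity mapping (the actual grouping follows Table 9.2):

```python
import numpy as np

# Hypothetical mapping: activity index (0..17) -> super-activity index (0..5).
SUPER_OF = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]


def late_fusion(probs):
    """Sum the predicted scores of the sub-activities of each super-activity."""
    fused = np.zeros(max(SUPER_OF) + 1)
    for activity, super_idx in enumerate(SUPER_OF):
        fused[super_idx] += probs[activity]
    return fused


def early_fusion_labels(labels):
    """Relabel training samples with super-activity labels before training."""
    return [SUPER_OF[label] for label in labels]
```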

Precision@N is used as the basic evaluation criterion, which allows us to show the accuracy of the framework for different numbers of retrieved activities, where N ∈ {1, …, 18} in the case of ActEV activities and N ∈ {1, …, 6} in the case of super-activities. Precision@1 indicates the percentage of videos where the top prediction of our framework corresponds to the correct activity shown in the video. Hence, Precision@18 for the ActEV activities and Precision@6 for the super-activities are always equal to 1, as the framework is bound to predict correctly if it simply returns all available activities. In addition, confusion matrices are also presented.
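For reference, Precision@N as used here can be computed as follows, assuming per-video prediction scores are available:

```python
import numpy as np


def precision_at_n(scores, labels, n):
    """Fraction of samples whose true label is among the top-n predictions.

    scores -- (num_samples, num_classes) array of prediction scores
    labels -- (num_samples,) array of ground-truth class indices
    """
    top_n = np.argsort(scores, axis=1)[:, ::-1][:, :n]
    hits = [label in row for row, label in zip(top_n, labels)]
    return float(np.mean(hits))
```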

4.3 Results

This section presents the results for the different ResNet architectures, both for the 18 activities and for the 6 super-activities; in the latter case, the results listed below correspond to late fusion, whereas the results for early fusion are presented at the end of this section.

ResNet-50 results. Figure 9.6 presents the Precision@N using the ResNet-50 architecture. Precision@1 equals 28% when all 18 activities are considered and 51% in the case of super-activities. As expected, coarser-grained activities can be identified more easily. Figures 9.7 and 9.8 present the confusion matrices of the predicted activities for the 18 activities and the 6 super-activities, respectively. A detailed examination indicates that the unbalanced nature of the ActEV dataset biases the model towards the activity with the highest occurrence ("activity carrying"). In the super-activities dataset, on the other hand, the numbers of false negatives and false positives are reduced, and the predictions are no longer dominated by a single activity.

Fig. 9.6 Precision@N for the ActEV activities and the super-activities, trained using ResNet-50

Fig. 9.7 Confusion matrix using the ActEV dataset trained on ResNet-50

Fig. 9.8 Confusion matrix using the super-activities dataset trained on ResNet-50

ResNet-18 results. Figure 9.9 presents the Precision@N using the ResNet-18 architecture. Precision@1 decreased to 25%, compared to the 28% achieved by the ResNet-50 architecture for the 18 activities. For the super-activities, Precision@1 also decreased, from 51% to 47%.

Fig. 9.9 Precision@N for the ActEV activities and the super-activities, trained using ResNet-18

ResNet-101 results. Finally, the results of the experiments with the ResNet-101 architecture are depicted in Fig. 9.10. As the results indicate, a higher-capacity neural network can learn the coarser-grained classification problem more accurately: the ResNet-101 architecture outperforms the previous ones when considering the super-activities, whereas its results for the 18 activities are lower than those of the ResNet-50 architecture. A detailed examination indicates that many of the 18 activities are visually similar to each other, so a higher-capacity network that tries to differentiate between them aggressively yields a lower Precision@1, even though Precision@5 remains similar to that of ResNet-50.

Fig. 9.10 Precision@N for the ActEV activities and the super-activities, trained using ResNet-101

Early versus late fusion. In addition to the late fusion experiments presented above, we also carried out early fusion experiments for the super-activities.

To compare the effectiveness of the two approaches, we selected the ResNet-101 architecture, as it achieves the best performance on the super-activities. Figure 9.11 depicts the Precision@N for both early and late fusion: early fusion improves the system's performance for all N except N = 1. Furthermore, Fig. 9.12 compares the confusion matrices for early and late fusion and indicates that, although Precision@1 is lower when applying early fusion, the errors of the misclassified activities are smaller and Precision@N for N > 1 is higher.

Fig. 9.11 Precision@N for both early and late fusion using ResNet-101

Fig. 9.12 Confusion matrices for both early and late fusion using ResNet-101

5 Conclusions

This work presented a framework for recognising activities in video streams. Specifically, the framework makes use of 3D convolutional filters in order to learn spatio-temporal representations of activities. The framework was evaluated using the challenging ActEV dataset, as well as a second dataset that was created from the same data by merging the ActEV activities into super-activities, in order to evaluate the proposed framework in a more general activity-based recognition setting. The experimental results indicate that our framework can capture coarse-level representations, as it performs satisfactorily on the super-activities dataset. Finally, the early fusion approach proved advantageous compared to late fusion when more than one activity was retrieved.