1 Introduction

The massive streams of visual information captured by CCTV surveillance and body-worn cameras cannot be easily monitored by human operators, particularly in the field of law enforcement. To assist law enforcement officers in their daily tasks and to improve their operational and investigation capabilities, several tools have been developed to automatically process and analyse such video streams and alert human operators when events of interest, such as abnormal activities, take place. Abnormalities can be understood broadly as deviations from normal behaviour: unknown states, deviant actions, or outliers. This work focuses on systems that aim to recognise actions of interest performed by humans or vehicles and to assign each action to one of a set of predefined categories. Leveraging the significant advancements in deep learning, state-of-the-art action recognition methods are based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [10, 12]. The architectures of such activity recognition systems typically consist of two parts: a feature extractor and a classifier. To this end, this work proposes an end-to-end activity recognition framework that extracts visual features from video streams and classifies them into predefined activities. The proposed framework is evaluated using the VIRAT [8] dataset and the activities considered in the TRECVID Activities in Extended Video (ActEV) evaluation series [3].

The main contributions of this work are the proposal of a complete end-to-end activity recognition framework based on deep learning neural networks, the investigation of early and late fusion techniques in the context of this framework, and the extensive evaluation experiments using the VIRAT dataset. Moreover, since some of the ActEV activities are fine-grained, we group similar activities together so as to consider coarser-grained activities that are likely to be of more interest to general activity-based recognition systems; we have thus performed evaluation experiments using both the finer- and the coarser-grained activities.

The remainder of the chapter is structured as follows. Section 9.2 discusses related work and relevant datasets, Sect. 9.3 presents the proposed framework, Sect. 9.4 describes the experimental setup and presents the evaluation results, and Sect. 9.5 concludes this work.

2 Related Work

State-of-the-art activity recognition methods are based on deep learning techniques. Simonyan et al. [9] proposed a 2D convolution-based architecture that takes as input the visual frames and stacked optical-flow features, forming a two-stream neural network that learns the motion and the appearance of the input video simultaneously. Ji et al. [5] proposed a 3D convolution-based approach in order to extract spatio-temporal features, while Tran et al. [12] also trained a 3D convolutional neural network. Hara et al. [4] extended previous works that make use of 3D convolutional kernels of size 3×3×3 by employing varied kernel sizes and very deep convolutional neural networks. They also concluded that the Kinetics [6] dataset, consisting of more than 300,000 videos that depict 400 human-related activities, can be widely employed for training and testing activity recognition systems, similarly to the wide use of the ImageNet [2] dataset for training object detection systems.

Apart from Kinetics, several other datasets have been built for the activity recognition problem. HMDB-51 [7] is one such dataset; it consists of 6766 videos, with a mean duration of approximately 3 seconds, categorised into 51 human activities extracted from movies. ActivityNet [1] is another such dataset, consisting of around 20,000 videos categorised into 200 human activities. Finally, both the videos of the VIRAT [8] dataset and their annotations are provided by the National Institute of Standards and Technology (NIST, https://www.nist.gov/) in the context of the TRECVID Activities in Extended Video (ActEV, https://actev.nist.gov/) evaluation series.

3 Activity Recognition Framework

This work follows the supervised learning paradigm for human-related activity recognition, employing a deep neural network architecture, namely the 3D ResNet [4]. This 3D convolution-based architecture achieves fast processing and can thus perform human activity recognition in (near) real time while simultaneously processing frames in batches. In particular, the architectures with 18, 50, and 101 layers, as described in [4], have been deployed.

The 3D-ResNet-18 architecture consists of basic blocks, each comprising two 3D convolutional layers followed by batch normalisation and ReLU (rectified linear unit) activation layers, as depicted on the left part of Fig. 9.1. The other two architectures (3D-ResNet-50 and 3D-ResNet-101) follow the bottleneck approach (see right part of Fig. 9.1), where each bottleneck block consists of three 3D convolutional layers followed by batch normalisation and ReLU activation layers, with 1×1×1 convolution kernels for the first and third layers and 3×3×3 for the middle one. A sketch of both block types is given below.
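As an illustration, the following is a minimal PyTorch sketch of the two block types; strides, downsampling, and the surrounding network of [4] are omitted for brevity, and the ReLU is placed after the residual addition, as is usual in ResNets.

```python
import torch.nn as nn


class BasicBlock3D(nn.Module):
    """Basic block: two 3x3x3 convolutions, each followed by batch norm."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection


class Bottleneck3D(nn.Module):
    """Bottleneck block: 1x1x1 -> 3x3x3 -> 1x1x1 convolutions."""

    expansion = 4

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.conv1 = nn.Conv3d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm3d(mid_channels)
        self.conv2 = nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(mid_channels)
        self.conv3 = nn.Conv3d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the channel count changes.
        self.shortcut = (nn.Identity() if in_channels == out_channels else
                         nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))
```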

Fig. 9.1 3D-ResNet basic and bottleneck blocks (adapted from [4])

Finally, it should be noted that weights pre-trained on the Kinetics dataset [6] were loaded for all architectures. The Kinetics dataset was selected since it covers a large number of human activity classes (400) and contains videos that were not collected from domain-specific sources (e.g. movies or soccer games), but rather from diverse sources uploaded to YouTube.
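As a hedged sketch of this step, the snippet below loads a Kinetics-400 pre-trained 3D ResNet-18 and replaces its 400-class head with one for the 18 ActEV activities; torchvision's r3d_18 is used here only as a stand-in for the models of [4], whose weights are distributed separately.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in for the 3D ResNets of [4]: a comparable 18-layer 3D ResNet
# pre-trained on Kinetics-400.
model = r3d_18(pretrained=True)

# Replace the 400-class Kinetics classifier with an 18-class ActEV head.
model.fc = nn.Linear(model.fc.in_features, 18)
```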

4 Experiments

This section reports on the experimental evaluation of the proposed activity recognition framework by presenting first the datasets used in our experiments (Sect. 9.4.1), then the overall experimental setup (Sect. 9.4.2), and finally the evaluation results of our experiments (Sect. 9.4.3).

4.1 Dataset

In order to evaluate the proposed method, we selected the dataset provided by NIST under the ActEV evaluation series. This dataset was selected since it contains several human activities and vehicle actions that can be considered abnormal in particular contexts. ActEV considers activities where one or more people generate movements or interact with objects (or groups of objects), such as other people (P) and vehicles (V). Specifically, ActEV defines and annotates the 18 human activities and vehicle actions listed in Table 9.1. The ActEV dataset consists of a total of 2466 annotated activities in its training and validation sets, extracted from 118 videos of the VIRAT (release 1.0 and 2.0) dataset (http://viratdata.org/). The training set consists of 64 videos containing 1338 annotated activities, while the validation set consists of 54 videos containing 1128 annotated activities. The test set is not considered, as its annotations are not publicly available. The distribution of the activities for both the training and validation sets is depicted in Fig. 9.2. As can be observed, ActEV is a challenging dataset, as it is highly unbalanced.

Table 9.1 The activities officially defined by ActEV
Fig. 9.2 ActEV dataset activities distribution

As some of the ActEV activities are rather fine-grained, we have also grouped similar activities together, so as to consider coarser-grained activities that are likely to be of interest to more general activity-based recognition systems (e.g. recognition of vehicle-related activities). Table 9.2 lists these so-called super-activities, while Fig. 9.3 depicts the distribution of the super-activities for the training and validation sets, which, as before, is also highly unbalanced.

Table 9.2 ActEV activities grouped into "super-activities"
Fig. 9.3 ActEV dataset super-activities distribution

4.2 Experimental Setup

The aim of the evaluation experiments was to assess the effectiveness of the activity recognition system; they therefore focused on processing and analysing only the parts of the video streams where some form of activity had been observed. To this end, frames were first extracted from all videos; more specifically, one out of every four frames was extracted. Then, only the frames that depict an activity were kept and stored as PNG images, as sketched below.
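A minimal sketch of this preprocessing step, assuming OpenCV is available (file names and directory layout are illustrative):

```python
import os

import cv2


def extract_frames(video_path, out_dir, step=4):
    """Store one out of every `step` frames of a video as PNG images."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.png"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```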

The same training strategy was followed for each experiment. Specifically, the batch size was set to 32, the total number of epochs to 200, and stochastic gradient descent [11] was used as the optimiser with an initial learning rate of 0.1. A "reduce on plateau" schedule with a maximum patience of 10 epochs was applied: this strategy reduces the learning rate by a factor once learning stagnates, i.e. when no improvement is seen for a "patience" number of epochs. Furthermore, five scale factors (1.0, 0.84, 0.70, 0.59, 0.49) were used for data augmentation, and a corner cropping strategy was also applied, i.e. the cropped box is randomly selected from the four corners and the centre of the frame.
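In PyTorch terms, the optimiser and schedule described above might be set up as follows; the momentum value, the stand-in model, and the random data are assumptions made purely for a self-contained sketch.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Stand-in model and data; in the actual experiments the model is a 3D ResNet
# and the loader yields batches of 32 frame clips.
model = nn.Linear(512, 18)
loader = [(torch.randn(32, 512), torch.randint(0, 18, (32,))) for _ in range(10)]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # momentum assumed
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=10)

for epoch in range(200):
    epoch_loss = 0.0
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)  # reduce the learning rate once the loss plateaus
```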

The training process was monitored using the TensorBoard application from the TensorFlow repository. Figure 9.4 presents the accuracy per epoch during training, with the 3D-ResNet architectures of 18, 50, and 101 layers shown in blue, orange, and red, respectively. The corresponding training losses are depicted in Fig. 9.5.

Fig. 9.4 Accuracy during training of ResNet-18 (blue), ResNet-50 (orange), and ResNet-101 (red) with respect to the number of epochs

Fig. 9.5 Cross-entropy loss during training of ResNet-18 (blue), ResNet-50 (orange), and ResNet-101 (red) with respect to the number of epochs

The validation set of the ActEV dataset was used for evaluating the proposed activity recognition framework, in order to investigate how the depth of a 3D-ResNet architecture affects its effectiveness. To this end, we applied two different experimental settings: one that considers the 18 activities of the ActEV dataset and one that considers the 6 super-activities. For the super-activities, we apply both late and early fusion: for late fusion, the prediction score of each super-activity is computed at test time by summing the prediction scores of its sub-activities, whereas for early fusion, the super-activities are merged during training (i.e. a single training set is created for each super-activity by merging the training sets of its sub-activities). A sketch of both variants is given below.
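A small sketch of the two fusion variants, with a hypothetical activity-to-super-activity mapping (the actual grouping follows Table 9.2):

```python
import numpy as np

# Hypothetical mapping: activity index (0..17) -> super-activity index (0..5).
SUPER_OF = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]


def late_fusion(probs):
    """Sum the predicted scores of the sub-activities of each super-activity."""
    fused = np.zeros(max(SUPER_OF) + 1)
    for activity, super_idx in enumerate(SUPER_OF):
        fused[super_idx] += probs[activity]
    return fused


def early_fusion_labels(labels):
    """Relabel training samples with super-activity labels before training."""
    return [SUPER_OF[label] for label in labels]
```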

Precision@N is used as the basic evaluation criterion, which allows us to show the accuracy of the framework for different numbers of retrieved activities, where N ∈ {1, …, 18} in the case of ActEV activities and N ∈ {1, …, 6} in the case of super-activities. Precision@1 indicates the percentage of videos where the top prediction of our framework corresponds to the correct activity shown in the video. Hence, Precision@18 for the ActEV activities and Precision@6 for the super-activities are always equal to 1, as the framework is bound to predict correctly if it simply returns all available activities. In addition, confusion matrices are also presented.
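For reference, Precision@N as used here can be computed as follows, assuming per-video prediction scores are available:

```python
import numpy as np


def precision_at_n(scores, labels, n):
    """Fraction of samples whose true label is among the top-n predictions.

    scores -- (num_samples, num_classes) array of prediction scores
    labels -- (num_samples,) array of ground-truth class indices
    """
    top_n = np.argsort(scores, axis=1)[:, ::-1][:, :n]
    hits = [label in row for row, label in zip(top_n, labels)]
    return float(np.mean(hits))
```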

4.3 Results

This section presents the results for the different ResNet architectures, both for the 18 activities and for the 6 super-activities; in the latter case, the results listed below correspond to late fusion, whereas the results for early fusion are presented at the end of this section.

ResNet-50 results. Figure 9.6 presents the Precision@N using the ResNet-50 architecture. Precision@1 equals 28% when all 18 activities are considered and 51% in the case of super-activities. As expected, coarser-grained activities can be identified more easily. Figures 9.7 and 9.8 present the confusion matrices of the predicted activities for the 18 activities and the 6 super-activities, respectively. A detailed examination indicates that the unbalanced nature of the ActEV dataset biases the model towards the activity with the highest occurrence ("activity carrying"). In the super-activities dataset, on the other hand, the numbers of false negatives and false positives are reduced, and the predictions are no longer dominated by a single activity.

Fig. 9.6 Precision@N for the ActEV activities and the super-activities, trained using ResNet-50

Fig. 9.7 Confusion matrix using the ActEV dataset trained on ResNet-50

Fig. 9.8 Confusion matrix using the super-activities dataset trained on ResNet-50

ResNet-18 results. Figure 9.9 presents the Precision@N using the ResNet-18 architecture. Precision@1 decreased to 25%, compared to the 28% achieved by the ResNet-50 architecture for the 18 activities. For the super-activities, Precision@1 also decreased, from 51% to 47%.

Fig. 9.9 Precision@N for the ActEV activities and the super-activities, trained using ResNet-18

ResNet-101 results. Finally, the results of the experiments with the ResNet-101 architecture are depicted in Fig. 9.10. As the results indicate, a higher-capacity neural network can learn the coarser-grained classification problem more accurately: the ResNet-101 architecture outperforms the previous ones when considering the super-activities, whereas its results for the 18 activities are lower than those of the ResNet-50 architecture. A detailed examination indicates that many of the 18 activities are visually similar to each other, so a higher-capacity network that tries to differentiate between them aggressively yields a lower Precision@1, even though Precision@5 remains similar to that of ResNet-50.

Fig. 9.10 Precision@N for the ActEV activities and the super-activities, trained using ResNet-101

Early versus late fusion. In addition to the late fusion experiments presented above, we also carried out early fusion experiments for the super-activities.

To compare the effectiveness of the two approaches, we selected the ResNet-101 architecture, as it achieves the best performance on the super-activities. Figure 9.11 depicts the Precision@N for both early and late fusion: early fusion improves the system's performance for all N except N = 1. Furthermore, Fig. 9.12 compares the confusion matrices for early and late fusion and indicates that, although Precision@1 is lower when applying early fusion, the errors of the misclassified activities are smaller and Precision@N for N > 1 is higher.

Fig. 9.11 Precision@N for both early and late fusion using ResNet-101

Fig. 9.12 Confusion matrices for both early and late fusion using ResNet-101

5 Conclusions

This work presented a framework for recognising activities in video streams. Specifically, the framework makes use of 3D convolutional filters in order to learn spatio-temporal representations of activities. The framework was evaluated using the challenging ActEV dataset, as well as a second dataset that was created from the same data by merging the ActEV activities into super-activities, in order to evaluate the proposed framework in a more general activity-based recognition setting. The experimental results indicate that our framework can capture coarse-level representations, as it performs satisfactorily on the super-activities dataset. Finally, the early fusion approach proved advantageous compared to late fusion when more than one activity was retrieved.