1 Introduction

A minor security breach in identifying suspicious elements carrying harmful or restricted weapons/objects at sensitive locations might have serious consequences for national security. This becomes more challenging in public spaces such as airports, protests, movie theatres, stadiums, and national parks, where maintaining security standards is necessary to ensure the safety of infrastructure and national assets, including human life. DIPR studies [31, 32] suggest that the availability of violent material such as rifles, churras, swords, sticks, handguns, grenades, bombs, etc. at a location, or even the display of such material, can escalate aggression and violent behavior in people. Early detection of these materials, or of the persons carrying them, can help security forces apply their best crowd management strategies. Although the human visual system has performed admirably in monitoring, humans can be slow, expensive, and corruptible in the long run, and manual monitoring can expose people on the ground to danger. Advances in hardware technology have made it possible to study and monitor situations using videos and live CCTV footage, but the amount of data generated is enormous, and studying each recording is a daunting task for humans. With the growth of computational power and the availability of large datasets, training deep learning networks for computer vision applications has become feasible. In the domain of object detection, CNN-based architectures such as R-CNN (Regions with CNN features), Fast R-CNN, Faster R-CNN, SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), and their variants perform exceptionally well on real-time object detection, which in turn has paved the way for machine-based surveillance systems (Fig. 1).

Fig. 1. Comparison of YOLO with other algorithms [11]

Building a customized weapon object detection system poses major problems: 1) the lack of freely available weapon datasets, 2) the need for a domain expert to decide the classes of harmful objects, and 3) accurate real-time classification and localization of these objects in video surveillance footage. In this work, we consider various markers/harmful objects: cameras, which can compromise the privacy of sensitive locations such as army bases, and hazardous weapons such as sticks, daggers, swords, handguns, and rifles, which can cause significant physical injury. Further, we generated a novel dataset, the "DIAT-Weapon" dataset, along with annotations for threat object detection. The dataset consists of 2,712 images divided into six categories. We customized a one-stage model, YOLO, for the first time to detect weapons, as it outperforms other algorithms in terms of FPS and AP score while being size invariant; real-time performance matters because an event such as a gun being fired in a public place can happen at any moment. We fine-tuned the YOLO algorithm to achieve a better trade-off between accurate detection/localization and real-time performance.

2 Literature Review

Many deep-learning-based object detection techniques have been proposed since 2018. These techniques can be grouped into 1) two-stage models, such as R-CNN [28], Fast R-CNN [25], Faster R-CNN [26], Mask R-CNN [27], etc., which mostly consist of a region proposal network (RPN) that selects the approximate regions of possible objects, followed by an object detection network that classifies the candidate regions with accurate bounding boxes; and 2) one-stage models, such as the YOLO series [7, 8, 11], SSD [13], etc., where object detection is framed as a regression problem, offering faster speed with slightly lower accuracy.

Murugan et al. [1] surveyed object identification, object classification, and object tracking algorithms in the literature and presented methods for video summarization. Hu et al. [2] proposed a novel unified method for recognizing vehicles and their number plates by gathering high-energy frequency portions of images from digital camera imaging sensors. For video surveillance, Raghunandan et al. [3] enhanced algorithms for various detection techniques such as face detection, skin detection, color detection, shape detection, and target detection.

Elhoseny et al. [4] proposed a machine learning model for multi-object recognition and tracking that uses an optimal Kalman filter [5] to track objects. Ahmad et al. [6] developed a framework for monitoring students during virtual tests, which employs YOLOv2 [7] and YOLOv3 [8] to detect objects such as cell phones, laptops, iPads, and notebooks. Thoudoju et al. [9] used YOLOv3 to detect objects in aerial and satellite images. Kumar et al. [10] detected vehicle classes such as automobiles, trucks, two-wheelers, and people using YOLOv3 and YOLOv4 [11]. Jose et al. [12] detected objects such as firearms and knives in suspicious regions using the YOLO architecture to determine the likelihood of domestic violence. Although variants of convolutional neural networks such as SSD [13] and R-FCN [14] perform well in terms of accuracy compared to YOLO, their FPS (frames per second) is a major drawback.

Although there are many object detection algorithms, none has been used to identify threat objects to enforce security standards. This study aims to build a real-time surveillance system that can detect dangerous objects in live CCTV feeds and, in the future, support surveillance robots with these capabilities. The paper is organized as follows: Sect. 3 describes the dataset; Sect. 4 describes the YOLO algorithms used; and Sect. 5 presents the experiments and results.

3 DIAT Weapon Dataset

We chose stick, rifle, sword, handgun, camera, and dagger as our markers based on the severity of the damage these objects may cause in prohibited locations; detecting them gives an institution a view of what countermeasures to take. Many weapons were brought under the umbrella of these classes; the subclass specifications are given in Table 1.

Table 1. Subclasses

The data was gathered from a variety of open sources, including the OIDv4 toolkit [15] and web scraping of images available on the Internet, ensuring diversity of the collected data with respect to color, shape, background, time period, weather conditions, occlusion, perspective, etc. Roboflow software was used to construct bounding boxes for each image to support training the models in Darknet, TensorFlow, and PyTorch. These classes are difficult to collect, as there are few unique images per class in the open domain, yet we gathered 2,712 images in total, divided into six classes, with an average of 1.6 annotations per image. Sample annotated images from the generated DIAT-Weapon dataset are shown in Fig. 2. Those interested in educational access to the DIAT-Weapon dataset may send an e-mail request to "sunitadhavale@diat.ac.in" with the subject "DIAT-Weapon Image Dataset Educational Access Request" from their institutional e-mail ID. The dataset will also be made publicly available at https://www.diat.ac.in/view-profile/?id=98.
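For reference, Darknet/YOLO-format labels, as exported by tools such as Roboflow, store one object per line as "class_id x_center y_center width height", with all coordinates normalized to the image dimensions. Below is a minimal sketch of this conversion; the class ordering and example values are illustrative assumptions, not taken from the actual dataset export:

```python
# Convert a pixel-space bounding box to a Darknet/YOLO label line:
# "class_id x_center y_center width height", all normalized to [0, 1].
# The class ordering and example values below are illustrative assumptions.

CLASSES = ["stick", "rifle", "sword", "handgun", "camera", "dagger"]

def to_yolo_line(class_name, xmin, ymin, xmax, ymax, img_w, img_h):
    cid = CLASSES.index(class_name)
    xc = (xmin + xmax) / 2.0 / img_w
    yc = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{cid} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Example: a handgun occupying (120, 80)-(260, 200) in a 416 x 416 image.
print(to_yolo_line("handgun", 120, 80, 260, 200, 416, 416))
```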

Histogram plots of the number of objects per class and the number of annotations per image are given in Figs. 3 and 4. Data augmentation techniques are used to handle the class imbalance problem; the purpose of augmentation is to make the model more robust to data variation. We primarily focused on photometric distortions such as random noise, hue, and exposure. Image quality statistics were collected using Roboflow: the average size of the collected images is 0.7 MP, ranging from 0.01 to 20.90 MP, and the median image size is 1024 × 685. The aspect ratio histogram is given in Fig. 5, where the majority of images fall into the wider-than-tall category. We conducted the experiments on images resized to 416 × 416, adding random noise of up to 5% and hue and exposure shifts between −25° and +25° as augmentation.
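As an illustration of these photometric augmentations, the sketch below applies the stated settings (416 × 416 resize, up to 5% random noise, hue/exposure shifts of roughly ±25) using PIL and NumPy; the parameter mappings are our assumptions, and Roboflow's internal implementation may differ:

```python
# Sketch of the photometric augmentations: resize to 416x416, add up to 5%
# random noise, and shift exposure and hue by roughly +/-25 percent/degrees.
# Parameter mappings are assumptions; Roboflow's internals may differ.
import numpy as np
from PIL import Image

def augment(img: Image.Image, rng: np.random.Generator) -> Image.Image:
    img = img.convert("RGB").resize((416, 416))
    arr = np.asarray(img).astype(np.float32)

    arr *= rng.uniform(0.75, 1.25)                      # exposure: +/-25%
    arr += rng.uniform(-0.05, 0.05, arr.shape) * 255.0  # noise: up to 5%

    # Hue: rotate the hue channel by up to +/-25 degrees (PIL hue is 0-255).
    hsv = np.asarray(Image.fromarray(
        np.clip(arr, 0, 255).astype(np.uint8)).convert("HSV")).copy()
    shift = int(rng.uniform(-25.0, 25.0) / 360.0 * 255.0)
    hsv[..., 0] = ((hsv[..., 0].astype(int) + shift) % 256).astype(np.uint8)
    return Image.fromarray(hsv, mode="HSV").convert("RGB")
```

Since only photometric distortions are applied here, the bounding box annotations remain unchanged.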

Fig. 2. Sample images

Fig. 3. Histogram of class balance

Fig. 4. Distribution of annotations across classes

Fig. 5. Aspect ratio of images

4 YOLO Architecture

YOLOv1 [29] is a single-stage object detector that localizes and classifies at the same time. The input size of YOLOv1 is 448 × 448, and the image is divided into S × S grid cells, where each cell is responsible for detecting one object; a cell may propose multiple bounding boxes, but only the box with the maximum Intersection-over-Union (IoU) is given as output. The output consists of the position of this bounding box (centre coordinates x, y, width w, height h) and a prediction confidence. YOLOv1 has certain drawbacks, such as a fixed input size and the ability to predict only one object per grid cell; these were addressed in later versions. The architecture of YOLOv1 is given in Fig. 6: it contains 24 convolutional layers followed by 2 fully connected layers. In later versions, these fully connected layers are replaced with anchor boxes for predicting bounding boxes.
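Since IoU drives both box selection here and the evaluation metrics in Sect. 5, a minimal reference implementation is sketched below (boxes given as corner coordinates; the example values are purely illustrative):

```python
# Intersection-over-Union for two axis-aligned boxes (xmin, ymin, xmax, ymax);
# YOLO keeps only the predicted box with the highest IoU per grid cell.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~= 0.143
```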

YOLOv2 [7] added several notable concepts: 1) pre-computed anchor boxes, built using K-means clustering, to solve the imprecise bounding box detections and relatively low recall caused by the fully connected layers in YOLOv1; 2) batch normalization, which improved the mAP score by 2% over YOLOv1; and 3) multi-scale training: to improve the model's stability across image sizes, images were resized to 416 × 416, and every 10 batches the input dimension was varied at random in multiples of 32, from 320 to 608, since the YOLOv2 model downsamples by an overall factor of 32. The total number of detections in YOLOv2 is 13 × 13 × the number of anchor boxes. Darknet-19, consisting of 19 convolutional layers and 5 max-pooling layers, was used as the backbone architecture for faster detection.
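The anchor-box clustering in point 1) can be sketched as K-means over box (width, height) pairs with distance 1 − IoU, comparing boxes as if centred at the origin; the box list below is illustrative:

```python
# K-means clustering of box shapes with d = 1 - IoU (YOLOv2-style anchors).
import numpy as np

def wh_iou(wh, centroids):
    # IoU between one (w, h) box and each centroid, all anchored at the origin.
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU (lowest 1 - IoU).
        assign = np.array([np.argmax(wh_iou(b, centroids)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids

boxes = np.array([[30, 60], [35, 70], [120, 100], [110, 95], [300, 250]], float)
print(kmeans_anchors(boxes, k=3))  # three representative anchor shapes
```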

Fig. 6. YOLOv1 architecture [29]

In YOLOv3 [8], the softmax layer was replaced with independent logistic classifiers for multi-label classification, to address overlapping labels such as "woman" and "person". YOLOv3 predicts at 3 feature scales, like feature pyramid networks [30], to improve predictions for large, medium, and small targets. At each scale it uses 3 boxes, and the shape of the output tensor is N × N × (3 × (4 + 1 + C)), where C is the number of classes, 4 is the number of bounding box offsets, and 1 is the objectness score. Feature maps are upsampled and concatenated with earlier layer outputs. The Darknet-19 backbone was replaced with the Darknet-53 architecture, which is size invariant; convolutional layers of stride 2 are used instead of max-pooling operations. YOLOv3-tiny is a variant of the YOLOv3 architecture whose backbone consists of 7 convolutional layers and 6 max-pooling layers and which predicts at 2 scales; it compromises on accuracy but offers faster detection.
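For the DIAT-Weapon setting (C = 6) and a 416 × 416 input, the three output tensor shapes work out as follows (strides 32, 16, and 8 are the standard YOLOv3 detection strides):

```python
# Output tensor shape N x N x (3 * (4 + 1 + C)) at each YOLOv3 scale,
# for a 416x416 input and the six DIAT-Weapon classes.
C = 6
for stride in (32, 16, 8):      # the three standard YOLOv3 detection strides
    n = 416 // stride           # grid size N at this scale
    print(f"stride {stride:2d}: {n} x {n} x {3 * (4 + 1 + C)}")
# stride 32: 13 x 13 x 33
# stride 16: 26 x 26 x 33
# stride  8: 52 x 52 x 33
```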

Bochkovskiy et al. [11] proposed the YOLOv4 architecture for object detection, shown in Fig. 7, implemented in the Darknet framework. The architecture is divided into four parts: 1) input, which can be images, patches, a video stream, etc.; 2) backbone, a convolutional neural network pre-trained on the ImageNet [18] dataset, where the authors considered CSPDarknet53 [19], CSPResNeXt50 [19], and EfficientNet-B3 [17], finalizing CSPDarknet53 [19] as the backbone network; 3) neck, where features from various levels of the backbone are combined, using SPP [20] and PAN [21]; and 4) head, for which the architecture uses YOLOv3 [8] to detect objects.

Fig. 7. Object detection algorithm architecture [11]

Jiang et al. [16] proposed the YOLOv4-tiny architecture, adapted from the original YOLOv4 algorithm with a few tweaks to enable real-time prediction at a trade-off in accuracy. The major architectural changes are that the number of convolutional layers was reduced to 29 and the YOLO detection layers were reduced from three to two; CSPDarknet53 [19] is used as the backbone architecture. Following the introduction of YOLOv4 [11], Ultralytics announced YOLOv5 with open-source code [22]; many consider YOLOv5 a further modification of YOLOv4, and it is implemented in PyTorch.

Wang et al. [24] proposed the Scaled-YOLOv4 architecture, implemented in the PyTorch framework. They developed a network scaling strategy for the YOLOv4 architecture using the CSP approach that scales the network up and down while maintaining accuracy and optimal speed; this scaling modifies the depth, width, resolution, and structure of the network. YOLOv4-large is one of the scaled YOLOv4 networks, designed for cloud GPUs to achieve high accuracy; it has the variants YOLOv4-P5, YOLOv4-P6, and YOLOv4-P7, which detect objects at 3, 4, and 5 scales respectively.

5 Experimental Results

In this section, we explain the metrics used to evaluate the models and the outcomes obtained. The DIAT-Weapon image dataset is divided into a train set and a test set in the ratio 80:20. All images were manually labelled using Roboflow software and resized to 416 × 416. Because of the small dataset size, we used data augmentations such as random horizontal translation, image flipping, and image distortion; the same transformations are applied to the corresponding bounding boxes. All experiments were carried out on an NVIDIA RTX-6000 GPU-powered high-end Tyrone workstation with 2 Intel Xeon processors, 256 GB RAM, and a 4 TB HDD. The software stack consists of Python 3.7.2, CUDA 10.0, cuDNN 7.6.5, PyTorch, and Darknet.
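A minimal sketch of the 80:20 split over image files is given below; the directory layout and file extension are assumptions, not the actual dataset structure:

```python
# 80:20 train/test split over image files (paths are illustrative).
import random
from pathlib import Path

random.seed(42)
images = sorted(Path("DIAT-Weapon/images").glob("*.jpg"))
random.shuffle(images)

cut = int(0.8 * len(images))
train, test = images[:cut], images[cut:]
print(f"train: {len(train)} images, test: {len(test)} images")
```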

In object detection models, precision is calculated based on the Intersection-over-Union metric [23]. Mean average precision (mAP) is the standard evaluation metric for object detection algorithms; here we use mAP along with precision (P) and recall (R). Precision ranges in [0, 1] and refers to the proportion of correctly predicted 'True' labels among all predicted 'True' labels. Recall ranges in [0, 1] and represents the proportion of correctly predicted 'True' labels among all actual 'True' labels. The F1 score, which comprehensively measures the quality of an algorithm in terms of both P and R, is given in Eq. 1. For a category, average precision (AP) refers to the area under the curve drawn according to P and R, as given in Eq. 2. For multi-class tasks, mAP is calculated as the average AP score over all classes, as given in Eq. 3; a higher mAP means a better model. For real-time applications, frames per second (FPS) is used to measure the real-time performance of the model.

$$ {\text{F}}_1 = 2*\frac{{{\text{PR}}}}{{{\text{P}} + {\text{R}}}} \in \left[ {0,1} \right] $$
(1)
$$ {\text{AP}}_{\text{i}} = \int_0^1 {{\text{P}}_{\text{i}} \left( {{\text{R}}_{\text{i}} } \right){\text{dR}}_{\text{i}} } $$
(2)
$$ {\text{mAP}} = \frac{{\sum_{{\text{i}} = 1}^{\text{n}} {{\text{AP}}_{\text{i}} } }}{{\text{n}}} \in \left[ {0,1} \right] $$
(3)
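As a sketch of how Eq. 2 can be evaluated numerically, the snippet below computes AP from a confidence-ranked list of detections; the IoU-based matching of detections to ground truth is omitted, and full evaluators such as the COCO toolkit differ in interpolation details:

```python
# Average precision as the area under the precision-recall curve (Eq. 2),
# from detections sorted by confidence. `matches` marks, per detection,
# whether it hit a ground-truth object (IoU matching itself omitted here).
import numpy as np

def average_precision(confidences, matches, num_gt):
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp = np.asarray(matches, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_gt
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # Step-wise approximation of the integral of P(R) dR.
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))

ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=4)
print(f"AP = {ap:.3f}")
# mAP (Eq. 3) is then the mean of the per-class AP values.
```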

The metrics of the models are given in Table 2. All YOLO models pre-trained on the MS-COCO benchmark dataset were fine-tuned on our weapon dataset. YOLOv4 outperformed the other models in terms of mAP, precision, and F1-score with 0.63, 0.77, and 0.65 respectively, while YOLOv4-CSP outperformed the other models in terms of recall with 0.67. The per-batch training plots for YOLOv4 and YOLOv4-tiny, trained in the Darknet framework, are given in Fig. 8, where the x-axis represents the number of batches and the y-axis the loss, with the mAP curve overlaid. The plots for YOLOv5 and Scaled-YOLOv4-CSP, trained using PyTorch, are given in Fig. 9, where the x-axis represents epochs and the y-axis the mAP score. Figures 10 and 11 show the results of YOLOv4 detecting single objects and multiple objects per image across the six classes, where each object is identified by a bounding box tagged with its class. For real-time operation, frames are extracted from videos using OpenCV and detection is done on a per-frame basis.
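A minimal sketch of per-frame detection with the OpenCV DNN module is shown below; the config/weights file names are placeholders for the trained YOLOv4 model, and the confidence threshold, box decoding, and NMS steps are abbreviated:

```python
# Per-frame detection on a video stream with OpenCV's DNN module.
# File names are placeholders; full box decoding and NMS are omitted.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov4-weapon.cfg", "yolov4-weapon.weights")
out_names = net.getUnconnectedOutLayersNames()

cap = cv2.VideoCapture("cctv_feed.mp4")  # or 0 for a live camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    for out in net.forward(out_names):   # one array per detection scale
        for det in out:                  # det = [x, y, w, h, obj, class scores...]
            scores = det[5:]
            if scores.max() > 0.5:       # assumed confidence threshold
                print("class", int(np.argmax(scores)), "conf", float(scores.max()))
cap.release()
```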

Table 2. Model results
Fig. 8. Plot of mAP: a) YOLOv4-tiny b) YOLOv4

Fig. 9. Plot of mAP: a) YOLOv5 b) Scaled-YOLOv4-CSP

Fig. 10. Detection of single objects: a) stick b) dagger c) handgun d) sword e) camera f) rifle

Fig. 11. Detection of multiple objects: a) stick b) sword c) camera d) rifle

6 Conclusion and Future Scope

In this paper, we customized the YOLO object detector for real-time weapon detection. We introduced the DIAT-Weapon dataset with six classes of weapons/markers: handgun, sword, camera, rifle, stick, and dagger. This dataset and the trained YOLO models will be useful for harmful object detection to address national security concerns. During experimental analysis, we found that YOLOv4 achieved significant results in real-time demonstrations on test videos, running at more than 30 FPS using the OpenCV DNN module. Although YOLOv5 is more lightweight than YOLOv4, YOLOv4's accuracy was found to be better in real-time performance. In the future, we have two objectives: 1) to add more images and weapon categories to the proposed dataset, expanding it using Generative Adversarial Networks (GANs), and to improve prediction accuracy, generalization capability, and detection speed; and 2) to integrate the trained models with real-time CCTV feeds and surveillance robots.