1 Introduction

Even though vaccines for COVID-19 are now available worldwide, fundamental preventive measures remain essential. Vaccination is an additional step that reduces the severity of the disease and the risk of death, but the extent to which it protects a person from infection and from transmitting the virus to others is still unknown [1]. Social distancing can be described as a public health practice that limits in-person contact by staying at home and away from public spaces to reduce airborne transmission [2, 3]. Ainslie et al. showed that the number of new cases dropped significantly while strict social distancing and movement restrictions were imposed in mainland China and Hong Kong SAR from late January to early February 2020 [4]. Prem et al. studied the effectiveness of physical distancing in Wuhan and found that it decreased the median number of infections by more than 92% in the middle of 2020 and by 24% at the end of 2020 [5]. Fong et al. and Kahalé also verified that social distancing is an effective preventive measure in combatting the pandemic [6, 7].

Deep learning, in turn, is a general-purpose learning approach that can be applied in almost all application domains, including cases where humans do not have to be present in the scene to carry out the task. It can be defined as a subset of machine learning that uses neural networks with many layers, introduced to mimic how the human brain processes data, and it has been applied to data processing [8], object detection [9, 10], and fault detection [11, 12]. It has evolved over the past decades, with improved algorithms that achieve higher accuracy and produce results in step with present-day conditions. A social distancing monitoring model built with deep learning can help slow the transmission of the virus that is affecting public health by identifying social distance violations through person detection.

2 Related Work

Table 1 summarizes similar works that use deep learning for object detection to monitor the practice of preventive measures.

Table 1 Comparison of quantitative analysis data based on different social distancing models

Uddin et al. [13] used ResNet50 as the CNN architecture to develop an intelligent model that categorizes people based on body temperature, achieving a person tracking accuracy of 84%. Saponara et al. [14] applied YOLOv2 to monitor social distance and body temperature through a thermal camera using two different datasets and achieved detection accuracies of 95.6% and 94.5%, respectively. Punn et al. [15] utilized the YOLOv3 framework together with the DeepSORT approach, which tracks the identified people by assigning them unique IDs; the proposed model had 84.6% accuracy. Ahmed et al. [16] proposed a model that detects humans from an overhead perspective by combining YOLOv3 with transfer learning, which achieved 95% accuracy. Rahim et al. [17] developed a social distancing monitoring model for low-light, night-time environments using the YOLOv4 algorithm. Although the proposed model has to focus on the environment for a short time before monitoring begins, it reached an accuracy of 97.84%. Rezaei and Azarmi [18] aimed for a viewpoint-independent human classification algorithm for social distancing monitoring that overcomes the limitations of lighting conditions and challenging environments without needing to consider the angle and position of the camera. Their proposed model was built on the YOLOv4 algorithm and obtained an accuracy of 99.8%.

3 Methodology

3.1 Dataset Preparation

A total of 530 images are collected randomly from various online sources shown in Google Images as well as selectively from raw images published on X. zhangyang’s GitHub [19] and Prajnasb’s GitHub [20]. The images show people of all ages and genders in different situations such as walking, standing, sitting, and other possible body positions, to cover as many simulated conditions as possible for detecting a person with and without a facemask. The dataset consists of both close-up and distant images: 200 images of a single person with a mask, 160 images of a single person without a mask, and 170 images of groups of people with and without masks. The images are pre-processed by resizing and orienting them to a common base size and orientation before they are fed into the framework. This improves the quality and consistency of the data for feature extraction, as shown in Fig. 1.

3.2 Model Training

The model training process is conducted on Google Colab to use its GPU acceleration for extra computational power. The YOLOv4 algorithm is chosen over other deep learning methods because it is designed to be trained and run on a conventional GPU that is easily accessible from home at minimal cost. In addition, its speed and accuracy have been demonstrated, and it suits the real-time application of the proposed model [21]. The network size for model training is 416 × 416. The hyperparameters, which are set before training rather than learned by the model, are configured as follows: momentum is 0.949, weight decay is 0.0005, and the learning rate is 0.001. The classification model is trained to predict three classes, namely person, with mask, and without mask.

3.3 Performance Evaluation

Quantitative metrics measure how robust the model is and provide feedback on which aspects of the model can be improved. Since the proposed model focuses on classification performance, the metrics used for performance evaluation are precision, recall, F1-score, mean average precision (mAP), and intersection over union (IoU).
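
IoU is listed among the metrics but is not written out as an equation here. As a minimal sketch, assuming the standard definition (area of overlap divided by area of union) and boxes given as `(x_min, y_min, x_max, y_max)` tuples, it could be computed as follows. The function name `iou` and the box layout are illustrative assumptions, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is an assumed (x_min, y_min, x_max, y_max) tuple.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate zero-area boxes.
    return intersection / union if union > 0 else 0.0
```

A predicted box is typically counted as a true positive when its IoU with a ground-truth box exceeds a chosen threshold; the threshold used in this work is not stated.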

Precision measures the ratio of true positives (TP) to the total number of predicted positives, that is, TP plus false positives (FP), as expressed in Eq. (1). It indicates how many of the model's predictions are correct.

$$\mathrm{Precision}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(1)

Recall, also known as sensitivity, measures the ratio of TP to the actual number of positives, that is, TP plus false negatives (FN), as expressed in Eq. (2). It indicates how many of the actual positives the model found, and therefore how many it missed.

$$\mathrm{Recall}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(2)

Precision and recall can be combined into a single score called the F1-score, which is the harmonic mean of the two metrics as expressed in Eq. (3).

$${F}_{1}=2\times \frac{\mathrm{Precision }\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(3)
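
Eqs. (1) to (3) can be sketched directly from the TP, FP, and FN counts. The function name and the counts in the usage note are hypothetical and are not results from this work.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) following Eqs. (1)-(3).

    Zero denominators are mapped to 0.0, which is an assumed convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, `precision_recall_f1(80, 20, 10)` uses made-up counts of 80 true positives, 20 false positives, and 10 false negatives, giving a precision of 0.8.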

Meanwhile, average precision (AP) approximates the area under the precision–recall curve. In Eq. (4), it is calculated by averaging the precision at 11 recall levels, which the factor 1/11 reflects. The mean average precision (mAP) is then the average of the AP over all N classes, as shown in Eq. (5).

$$\mathrm{AP}=\frac{1}{11}\sum_{{\mathrm{Recall}}_{i}}\mathrm{Precision}\left({\mathrm{Recall}}_{i}\right)$$
(4)
$$\mathrm{mAP}= \frac{1}{N} \times \sum\limits_{i=1}^{N}{\mathrm{AP}}_{i}$$
(5)
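
A minimal sketch of Eqs. (4) and (5) is given below. It assumes the 11 recall levels are 0, 0.1, ..., 1.0 and that the precision at each level is the interpolated precision (the highest precision at any recall greater than or equal to that level), as in the PASCAL VOC convention. Neither detail is stated in the text, and the function names are illustrative.

```python
def average_precision_11pt(recalls, precisions):
    """11-point AP as in Eq. (4).

    recalls, precisions: parallel lists of points on the precision-recall curve.
    Interpolated precision at each level is an assumption (PASCAL VOC style).
    """
    total = 0.0
    for i in range(11):
        level = i / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Highest precision among points whose recall reaches this level.
        reachable = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11


def mean_average_precision(ap_per_class):
    """mAP as in Eq. (5): the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For the three classes in this work (person, with mask, without mask), `mean_average_precision` would be called with three AP values, one per class.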

3.4 Deployment of Classifier Model

The model is deployed in PyCharm Community Edition 2021.2.1. The overall workflow of the classifier model is shown in Fig. 2. The model begins by reading the input video and converting it into frames. The object detector is then applied to classify each detection into one of the three classes based on its confidence value. If the predicted object is a person, the model proceeds to evaluate the inter-distance measurement. If the predicted object is with mask, a purple bounding box is generated with the text “mask” labelled on top of it. If the predicted object is without mask, a red bounding box is generated with the text “no mask” labelled on top of it. A mini dashboard at the top left corner of the output video is updated to show the monitoring status according to the number of bounding boxes generated per frame.
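
The per-class branching described above can be sketched as follows. The class names, the BGR colour tuples, the tuple layout of a detection, and the confidence threshold of 0.5 are all assumptions made for illustration; the paper states only the colours and the label texts.

```python
# Hypothetical class names; BGR colour tuples are assumed (OpenCV-style).
PURPLE, RED = (128, 0, 128), (0, 0, 255)
MASK_STYLE = {
    "with_mask": (PURPLE, "mask"),
    "without_mask": (RED, "no mask"),
}


def route_detections(detections, conf_threshold=0.5):
    """Split one frame's detections into person boxes and mask annotations.

    detections: iterable of (class_name, confidence, box) tuples (assumed layout).
    Returns (person_boxes, annotations, dashboard).
    """
    person_boxes, annotations = [], []
    for class_name, confidence, box in detections:
        if confidence < conf_threshold:
            continue  # discard low-confidence predictions
        if class_name == "person":
            person_boxes.append(box)  # passed on to the inter-distance step
        elif class_name in MASK_STYLE:
            colour, label = MASK_STYLE[class_name]
            annotations.append((box, colour, label))

    # Counts shown on the mini dashboard for this frame.
    dashboard = {
        "persons": len(person_boxes),
        "mask": sum(1 for _, _, lbl in annotations if lbl == "mask"),
        "no mask": sum(1 for _, _, lbl in annotations if lbl == "no mask"),
    }
    return person_boxes, annotations, dashboard
```

Drawing the boxes and reading the video frames are left out so that the sketch stays independent of any particular video library.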

Fig. 1 Samples of images for dataset preparation

Fig. 2 Flowchart of social distancing monitoring model

The inter-distance calculation measures the distance between the centre points of the bounding boxes of every pair of predicted persons. The centroid coordinate of a bounding box is obtained by adding the lowest and highest values along the same axis and dividing the sum by two, as expressed in Eq. (6). Ci, which is equivalent to (Xi, Yi), represents the centroid coordinate. Xmin and Xmax are the lowest and highest x-coordinates of the bounding box, respectively. Likewise, Ymin and Ymax are the lowest and highest y-coordinates of the bounding box, respectively.

$${C}_{i}=\left({X}_{i}, {Y}_{i}\right)=\left(\frac{{X}_{\mathrm{min}}+{X}_{\mathrm{max}}}{2},\frac{{Y}_{\mathrm{min}}+{Y}_{\mathrm{max}}}{2}\right)$$
(6)
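
Eq. (6) translates into a one-line helper. The box layout `(x_min, y_min, x_max, y_max)` and the function name are assumptions for illustration.

```python
def centroid(box):
    """Centre point (X_i, Y_i) of a bounding box, following Eq. (6).

    box is an assumed (x_min, y_min, x_max, y_max) tuple in pixels.
    """
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)
```

For instance, a box spanning x from 10 to 50 and y from 20 to 100 has its centroid at (30.0, 60.0).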

After that, the Euclidean distance criterion is applied to translate the separation between pixels in the input frame into a metric distance. The Euclidean formula is shown in Eq. (7). The distance between the two centroid points of a pair of bounding boxes is represented as D(C1, C2). In Eq. (7), Xmax and Ymax denote the larger of the two centroids' x- and y-coordinates, and Xmin and Ymin denote the smaller ones; they refer to centroid coordinates, not to the bounding-box edges used in Eq. (6). Because the differences are squared, the order of the two centroids does not affect the result.

$$D\left({C}_{1},{C}_{2}\right)=\sqrt{{\left({X}_{\mathrm{max}}-{X}_{\mathrm{min}}\right)}^{2}+{\left({Y}_{\mathrm{max}}-{Y}_{\mathrm{min}}\right)}^{2}}$$
(7)
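
Eq. (7) can be sketched with the standard library as below. The result is in pixels; how pixels are converted to a physical distance is not specified in the text, so no calibration step is included. The function name is illustrative.

```python
import math


def centroid_distance(c1, c2):
    """Euclidean distance D(C1, C2) between two centroids, following Eq. (7).

    c1 and c2 are (x, y) tuples in pixels; the order of the arguments is irrelevant.
    """
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1])
```

Swapping the two arguments gives the same value, since each coordinate difference is squared.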

The bounding boxes are not drawn as soon as they are detected. Once the inter-distance calculation is complete, the model decides whether each bounding box is drawn in green or red. The violation distance is denoted as the violation threshold value. If D(C1, C2) is greater than or equal to the violation threshold value, the bounding boxes are drawn in green with the text “safe” labelled on top of them. If D(C1, C2) is smaller than the violation threshold value, the bounding boxes are drawn in red with the text “at risk” labelled on top of them. This process is repeated in a loop for every frame of the real-time video.
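
The threshold rule can be sketched as a pairwise check over all person centroids in a frame. The sketch assumes that a person is marked “at risk” (red) if they are closer than the threshold to at least one other person and “safe” (green) otherwise, in line with the green-or-red choice stated above. The BGR colour tuples and the function name are illustrative assumptions.

```python
import math
from itertools import combinations

GREEN, RED = (0, 255, 0), (0, 0, 255)  # assumed BGR colour tuples


def label_persons(centroids, threshold):
    """Return a (label, colour) pair per centroid for one frame.

    centroids: list of (x, y) tuples; threshold: violation distance in the same unit.
    """
    at_risk = set()
    # Compare every unordered pair of persons exactly once.
    for (i, a), (j, b) in combinations(enumerate(centroids), 2):
        if math.dist(a, b) < threshold:
            at_risk.update((i, j))
    return [
        ("at risk", RED) if i in at_risk else ("safe", GREEN)
        for i in range(len(centroids))
    ]
```

A distance exactly equal to the threshold counts as safe, matching the “more than or equal to” condition in the text.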

4 Results and Discussion

4.1 Quantitative Analysis of Deep Learning Methods

In this work, three different deep learning models are trained with the same dataset and hyperparameters for comparison. The labelled images are split into a training set (80%) and a testing set (20%) to measure the robustness of the models.
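
An 80/20 split of the labelled images could be sketched as follows. Shuffling before splitting, the fixed seed, and the file names in the usage note are assumptions; the paper does not state how the split was drawn.

```python
import random


def split_dataset(items, train_percent=80, seed=42):
    """Shuffle items reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * train_percent // 100  # integer arithmetic avoids float rounding
    return shuffled[:n_train], shuffled[n_train:]
```

With the 530 images described in Sect. 3.1, for example `split_dataset([f"img_{i:03d}.jpg" for i in range(530)])` with hypothetical file names, this yields 424 training images and 106 testing images.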

Based on the quantitative metrics tabulated in Table 2, the YOLOv2 model has the lowest performance of the three trained models. In contrast, the overall robustness of the YOLOv4 and YOLOv3 models is quite similar, as each model's strength is the other's weakness. The YOLOv4 model has the upper hand in terms of accuracy and recall, whereas the YOLOv3 model has the upper hand in terms of precision and F1-score. The YOLOv4 model is selected for deployment as the classifier in the proposed social distancing monitoring model because it has the highest detection accuracy, 93.79%, of the three models. It also has the best sensitivity, missing the fewest true positives, with a recall of 0.94 and a fair F1-score of 0.87.

Table 2 Comparison analysis between three different training models

4.2 Performance of Social Distancing Monitoring Model

The experiment is carried out in public areas where COVID-19 could potentially spread widely, and videos are captured for three different cases. Figure 3a, b shows the results at one of the rest stops beside Lebuhraya Utara-Selatan in Perak as an example of an open space. The second case is an enclosed space, Mid Valley Megamall, and the results are shown in Fig. 3c, d. The third case is a public semi-enclosed place, KL Sentral Transit Hub, as shown in Fig. 3e, f.

Fig. 3 Visualization of classification and localization as well as monitoring social distancing

The output frames in Fig. 3 show that object classification and localization work well for objects that are close to the camera. In addition, social distance violations are monitored as expected, and the mini dashboard is updated correctly for every frame according to the number of bounding boxes generated. The results also suggest that camera position should be taken into consideration: monitoring performs better when the camera is positioned at eye level than at a lower angle, such as the level of a seated person. The performance of the social-distance monitoring model, in terms of the number of high risks, number of low risks, number of individuals without mask, and number of individuals with mask, is summarized in Table 3.

Table 3 Performance of social-distance monitoring model at Perak Rest Stop, Mid Valley Megamall, and KL Sentral Transit Hub

5 Conclusion

This paper covers the development of a social distancing monitoring model using deep learning and the analysis of its performance. The effectiveness of social distancing is studied before building the model to better understand it in relation to the objective of the project. The model training process using the YOLOv4 method is discussed so that the proposed model can work with real-time and video detection. The model achieves a detection accuracy of 93.79% and an F1-score of 0.87. In terms of deployment performance, object classification and localization, as well as the evaluation of social distance violations, work well for predicted objects that are close to a camera at eye level. The social distancing monitoring model can be implemented in situations where public health is emphasized, in line with the practice of preventive measures during the COVID-19 pandemic. Facemask detection is included as an additional feature in an effort to reduce the transmission of airborne viruses in public places. Nevertheless, improvements can be made in future work to detect a wider range of the crowd, since the proposed model only works with objects that are close to the camera.