1 Introduction

Report no. 48 of the World Health Organization (WHO) noted that coronavirus disease 2019 (COVID-19) had infected over 58 million people globally and caused over 1.4 million deaths (accessed 9 April 2021). With the outbreak of COVID-19, virtually all countries were obliged to introduce new rules on social distancing and face mask wearing. Governments required hospitals and other organizations to adopt new infection prevention measures to slow the spread of COVID-19, whose transmission rate kept increasing; that rate, however, varies with the measures and policies applied by each government. As COVID-19 is transmitted through air droplets and close contact, governments introduced rules forcing individuals to keep their distance from one another and to wear a face mask in order to reduce the transmission and spreading rate. New mutated variants of the coronavirus (Indian, Nigerian, …) took hold after many countries relaxed their adherence to safety rules, which led the WHO to recommend the use of personal protective equipment (PPE) among people in health care settings.

The spread of COVID-19 has affected people's lives and disrupted the economy; it is considered a major public health and economic problem. The COVID-19 virus is transmitted more easily in close contact and in crowded environments. Countries therefore need guidance and surveillance of people in public and especially crowded areas, to ensure that social distancing and face mask rules are enforced. This can be achieved with video surveillance systems, which provide video sequences, and deep learning models, which detect human faces. However, most social distancing applications and most current research on the topic address the social distancing problem while ignoring face masks. This gap allows the virus to be spread by people who do not wear a face mask.

This study draws on protection motivation theory, which is adaptable to both health-related and technology-related motivations. The concepts of social distancing, risk persons, and the detection of unmasked or incorrectly masked faces are added.

The aim of this study is to develop an approach that can detect and track people at risk of COVID-19 infection or contact. The proposed approach is based on the detection of unmasked or incorrectly masked faces using deep learning (DL) and on social distancing. This approach, together with the health-related data it provides, will improve our understanding of how to protect people from the COVID-19 pandemic and play an important role in prevention. The objective is an automatic, efficient, and fast model that can then support public-oriented strategies and appropriate decisions.

Our contribution consists in combining several methods based on the Social-Scaled-YOLOv4 model into a detection, tracking, and social distancing system intended to prevent the spread of COVID-19. We propose a technique for detecting couples of persons using the DeepSORT tracker, and a new face mask detection model named DSFD&MobileNetv2.

The proposed framework is based on Social-Scaled-YOLOv4&DeepSORT. The remainder of this paper is organized as follows. Section 2 reviews the most relevant recent works. The proposed method is described in Section 3, and the implementation platform and libraries are presented in Section 4. The experimental results and discussion are presented in Section 5. Section 6 discusses the limitations of the approach, Section 7 compares it with the state of the art, and Section 8 concludes the paper.

2 Related works

2.1 Recent DL (Deep Learning) based methods for detection and localization

This paper relies on two surveys of DL approaches for object detection and localization, which can be found in [1, 2]; for face detection, we rely on [3,4,5]. State-of-the-art object detectors use DL and are divided into two categories. The first category comprises two-stage detectors, the best known being R-CNN (Region-based Convolutional Neural Network) [6], Fast R-CNN [7], and Faster R-CNN [8], which first use a Region Proposal Network (RPN) to generate regions of interest and then perform classification and bounding box regression. The second category comprises one-stage detectors, the best known being YOLO (You Only Look Once) [9], YOLOv2 [10], YOLOv3 [11], YOLOv4 [12], Scaled-YOLOv4 with Cross Stage Partial (CSP) scaling [13], the Single Shot MultiBox Detector (SSD) [14], RetinaNet [15], and EfficientDet [16]. The most popular one-stage model is YOLO; Fig. 1 shows the timeline of the YOLO family and compares the performance of its members. Models are usually evaluated on two datasets, Pascal VOC [17] and MS-COCO (Microsoft Common Objects in Context) [18]; the results are given in Table 1.

Fig. 1

Overall structure of proposed monitoring and face mask detection system

Table 1 Object detection models and their accuracies

2.2 Recent DL-based methods for COVID-19 social distancing

An automated framework to monitor social distancing in surveillance video is presented in [19]. It uses the YOLOv3 object detection model to detect people and draw bounding boxes around them, and compares the results with Faster R-CNN and SSD models using metrics such as loss and FPS. Its strengths are a thorough comparative study of the different models and the use of the L2 norm to identify clusters of people not obeying social distancing.

M. Rezaei and M. Azarmi [20] proposed a work entitled "DeepSOCIAL: Social distancing monitoring and infection risk assessment in COVID-19 pandemic". The authors proposed a DL social distancing system that takes a webcam as its source; the DeepSOCIAL detector relies on the YOLOv4 detection model and on the Euclidean distance to measure the social distance between people. The paper also provides rich visualizations using tools such as heat maps and moving trajectories.

Yang, Yurtsever, and Renganathan [21] proposed "A vision-based social distancing and critical density detection system for COVID-19". This paper uses DL-based real-time object detectors to measure social distancing. It uses: i) pre-trained models (YOLOv4 and Faster R-CNN) to detect persons with bounding boxes in each frame, and ii) bird's-eye-view coordinates to transform the detected boxes from the image domain into the real-world domain.

Bharathi and Anandharaj [22] proposed a real-time DL framework for monitoring social distancing using an improved SSD based on an overhead camera position. It efficiently monitors people in real time to detect safe social distancing in public places, using an improved SSD with transfer learning (TL) to detect persons with bounding boxes in each frame.

Meivel, Sindhwani et al. [23] proposed a work entitled "Mask Detection and Social Distance Identification Using Internet of Things and Faster R-CNN Algorithm". This paper uses DL based on the Faster R-CNN detection model to monitor the social distance between people and ensure safety in public areas. The method is integrated into an unmanned aircraft system (drone) built on a Raspberry Pi 4.

Md. Elamin Firoz et al. [46] proposed "Object Detection and Classification from a Real-Time Video Using SSD and YOLO Models". This research introduces an improved real-time object detection and recognition technique for web camera video that detects and recognizes objects such as people, vehicles, and animals. It uses the Single Shot Detector (SSD) and You Only Look Once (YOLO) models, which show promising results in object detection and recognition, together with a convolutional neural network (CNN) for object classification; the system can detect objects even in adverse and uncontrolled environments. The technique provides real-time performance with satisfactory detection and classification results, achieving an accuracy of 63–90% in object detection and classification.

M. L. Mokeddem et al. [47] proposed "COVID-19 risk reduce based YOLOv4-P6-FaceMask detector and DeepSORT tracker". In this research, the authors proposed a new high-performance two-stage face mask detector and tracker using a monocular camera: a DL framework that automates face mask detection with a Scaled-YOLOv4 model (YOLOv4-P6-FaceMask) and tracking with the DeepSORT tracker on video sequences. YOLOv4-P6-FaceMask is a high-accuracy model that achieves 93% mean average precision, 92% mean average recall, and a real-time speed of 35 fps on a single Tesla T4 GPU.

Table 2 summarizes the differences between these models.

Table 2 State-of-the-art social distancing and face mask frameworks based on deep learning (SD: social distancing; Dm: detection model; Tr: tracking model; S_no_MF: masked/no-masked face detection)

2.3 Contributions

A new detection and social distancing system is proposed in this paper to prevent the spread of COVID-19. Several methods based on the Scaled-YOLOv4 model are fully exploited. The main contributions are:

  1. Detection of risk persons with bounding boxes in each frame:

    • Collection of indoor/outdoor sequences

    • Adaptation of a new version of the Scaled-YOLOv4 model (Social-Scaled-YOLOv4) for person detection

    • Application to images and sequences in real time

  2. Localization of risk persons based on perspective transformation:

    • Use of the perspective transformation technique (bird's-eye view)

    • Extraction of 3D coordinates with a monocular camera based on perspective transformation

  3. Social distance computation

  4. Detection of coupled people:

    • Couple detection is based on space and time

    • The distance between two people remains below the permissible social distancing limit (1.8 m) for an approximate period of 10 s

  5. DL-based detection and tracking of persons without face masks:

    • Collection of a dataset with and without face masks

    • Proposal of a new face mask detector named DSFD&MobileNetv2

  6. Creation of a risk-persons database

  7. DL detector for masked/unmasked faces

  8. Creation of a masked/unmasked faces database

  9. Saving of persons breaching social distancing norms (pedestrians and faces) for identification and tracking

3 Proposed approach

The proposed method is depicted in Fig. 1. We propose a four-stage model comprising pedestrian detection, tracking, and inter-person distance estimation as a solution for social distancing monitoring, plus face mask detection. The proposed system can be integrated into CCTV surveillance cameras of any resolution with acceptable real-time performance. Social-Scaled-YOLOv4 (Social-YOLOv4-P6) is trained for pedestrian detection to identify human bodies in real-time video or online cameras; the extraction of 3D coordinates is then ensured by the perspective transformation method. The distance between the centers of the bounding boxes is computed using the Euclidean distance, and DeepSORT is used for person tracking. Finally, faces are detected using the DSFD model trained on WIDER FACE, and a MobileNetv2 classifier performs mask classification. The main reason for using transfer learning networks is that they provide excellent results in terms of accuracy and speed. In addition, the datasets used to train the model, the large MS-COCO and Google-Open-Image datasets, reduce errors and training time and prevent the models from overfitting. The dataset used for DSFD&MobileNetv2 is a collection of 5740 images belonging to two classes, "with mask" and "without mask", from the Real-World Masked Face Dataset (RMFD) and the Simulated Masked Face Dataset (SMFD).
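To make the four-stage data flow concrete, the following minimal sketch chains the stages on one frame. It is an illustrative outline, not our actual code: detector, tracker, and face_pipeline are placeholders standing in for the Social-Scaled-YOLOv4, DeepSORT, and DSFD&MobileNetv2 components, and M is the perspective transformation matrix of Section 3.3.

```python
import cv2
import numpy as np

def process_frame(frame, detector, tracker, face_pipeline, M, dist_thresh_px):
    """One pass of the four-stage pipeline over a single video frame."""
    # Stage 1: pedestrian detection (Social-Scaled-YOLOv4) -> person boxes.
    boxes = detector.detect(frame)                  # [(x1, y1, x2, y2), ...]
    # Stage 2: tracking (DeepSORT) -> boxes with persistent IDs.
    tracks = tracker.update(boxes, frame)           # [(track_id, box), ...]
    if not tracks:
        return [], set()
    # Stage 3: project the ground point of each box to the bird's-eye view
    # and flag pairs closer than the social-distance threshold.
    pts = np.float32([[(x1 + x2) / 2.0, y2] for _, (x1, y1, x2, y2) in tracks])
    bev = cv2.perspectiveTransform(pts.reshape(-1, 1, 2), M).reshape(-1, 2)
    risky = set()
    for i in range(len(bev)):
        for j in range(i + 1, len(bev)):
            if np.linalg.norm(bev[i] - bev[j]) < dist_thresh_px:
                risky.update({i, j})
    # Stage 4: face mask detection (DSFD + MobileNetv2) on risky persons only.
    for i in risky:
        track_id, (x1, y1, x2, y2) = tracks[i]
        face_pipeline(frame[int(y1):int(y2), int(x1):int(x2)], track_id)
    return tracks, risky
```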

Most of the sequences and images come from datasets such as Oxford Town Center (OTC) [24] and the Multiple Object Tracking (MOT) dataset [25], or were downloaded from pixabay (https://pixabay.com/). The remaining sequences are public and were already used in [22].

3.1 Detection and localization

The main objective of the first stage is to develop a robust pedestrian detection model. YOLO belongs to the family of one-stage detectors and is an object detection model widely used in DL applications. In this paper, we do not cover the history or background of the previous YOLO versions (v1, v2, and v3). Figure 2 shows the overall structure of the one-stage YOLOv4 model: a detection network with a backbone, a neck, and a head. CSPDarknet53 [26] is used as the backbone; it is a general feature extractor made of CNNs that turns image information into feature maps. Spatial Pyramid Pooling (SPP) [27] and a Path Aggregation Network (PAN) [28] are used as the neck: SPP eliminates the requirement of a fixed-size (e.g., 512 × 512) input image, and PAN collects multi-level features and connects them with the spatial pyramid network. The YOLOv3 head predicts the bounding boxes (coordinate computation, confidence thresholding, and non-maximum suppression).

Fig. 2

Scaled-YOLOv4 detection Model

3.1.1 YOLOv4 scaling

In traditional detection models, scaling means changing the depth of the model by adding more convolutional layers; for example, VGGNet [29] was scaled to the VGG-11, VGG-13, VGG-16, and VGG-19 architectures. Modern scaling instead modifies the depth, width, resolution, and structure of the network, which finally yields a scaled model such as Scaled-YOLOv4. To demonstrate the superiority of the selected YOLOv4-P6 model in terms of backbone, accuracy, and real-time performance, we compare it with Fast R-CNN, Faster R-CNN, YOLOv3, YOLOv4, YOLOv4-CSP, SSD, RetinaNet, EfficientDet-D0, EfficientDet-D1, YOLOv4-P5, and YOLOv4-P7, which are the state-of-the-art pedestrian detection models. Table 3 shows the training parameters of the Social-YOLOv4-P6 model on the MS-COCO and Google-Open-Image datasets. The comparison of the test results with the state of the art is given in Table 4.

Table 3 Training parameters of proposed Social-Scaled-YOLOv4 on the MS-COCO and Google-Open-Image dataset
Table 4 Comparison of the speed and accuracy of Social-YOLOv4-P6 on the MS-COCO dataset

3.1.2 Social-Scaled-YOLOv4

Training dataset

To obtain a strong and robust pedestrian detector, we need training datasets that cover various image processing difficulties such as blurring, the distance between faces and the camera, and variety of gender and age, with annotations and labels. We chose two datasets: MS-COCO and Google-Open-Image.

Social-Scaled-YOLOv4 model parameters and results

In our proposed approach, illustrated in Fig. 1, we use the Scaled-YOLOv4 detection technique to detect persons in single pictures, real-time video, or online cameras (first stage). We trained a custom YOLOv4-P6 model for person detection and localization on the MS-COCO dataset; the network architecture of Scaled-YOLOv4 is illustrated in Fig. 2.

The training parameters of the proposed Social-Scaled-YOLOv4 are shown in Table 3.

For the pedestrian detection task, fifteen different object detection models from the TensorFlow Object Detection Model Zoo were trained and tested on the MS-COCO dataset and their accuracies compared with that of our proposed model, to confirm the superiority of the Social-Scaled-YOLOv4 model. Table 4 shows the models and their accuracies.

3.2 Persons tracking

The second stage, after people detection, is to track people and assign an ID to each box using the DeepSORT technique (Fig. 3).

Fig. 3

DeepSORT Tracking technique

DeepSORT [30] is an online people-tracking algorithm that considers both the appearance of the tracked objects and the bounding box parameters of the detection results to associate the detections in the frame at time t + 1 with the objects tracked at time t. DeepSORT therefore does not need to process the whole video at once: it only considers information about the present and previous frames to make predictions about the current frame.

In the first frame of the sequence, the algorithm assigns a track to every bounding box that represents a person and has a confidence value above a set threshold. The Hungarian algorithm (a combinatorial optimization algorithm) is then used to assign the detections in a new frame to the existing tracks so that the assignment cost function reaches its global minimum.
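For illustration, this assignment step can be reproduced with SciPy's implementation of the Hungarian algorithm; the cost matrix below is a toy example, not data from our sequences.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = existing tracks, columns = new detections.
# Entry (i, j) is the cost c_ij of matching track i to detection j (see Eq. (3)).
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.8, 0.3]])

track_idx, det_idx = linear_sum_assignment(cost)   # minimizes the total cost
print(list(zip(track_idx.tolist(), det_idx.tolist())))   # [(0, 0), (1, 1), (2, 2)]
```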

The cost function involves the Mahalanobis distance (M-D) (Eq. (1)) between the detected box and the position predicted from the known position of the object at time t, and a visual distance (Eq. (2)) that considers the appearance of the detected object and the appearance history of the tracked object.

The expression of the M-D d(1) is given by:

$${d}^{(1)}(i,j)={({d}_{j}-{y}_{i})}^{T}{S}_{i}^{-1}({d}_{j}-{y}_{i})$$
(1)

where:

yi: the mean of the predicted bounding box distribution for the i-th track;

Si: the covariance matrix of the bounding box observations for the i-th track;

dj: the j-th detected bounding box.

The expression of the visual distance d(2), which relies on appearance feature descriptors, is:

$${d}^{(2)}(i,j)=\mathrm{min}\left\{1-{r}_{j}^{T}{r}_{k}^{\left(i\right)}\left|{r}_{k}^{\left(i\right)}\in \mathfrak{R}\right.\right\}$$
(2)

where:

rj: the appearance descriptor extracted from the image region inside the j-th detected bounding box;

ℛ: the set of the last 100 appearance descriptors \({r}_{k}^{\left(i\right)}\) associated with track i.

The cosine distance used by d(2) is measured between the j-th detection and the i-th track in order to select the track in which the visually most similar detection was previously found.

The cost of assigning a detected object j to a track i is given by:

$${c}_{i,j}=\lambda {d}^{(1)}(i,j)+(1-\lambda ){d}^{(2)}(i,j)$$
(3)

where:

λ: a parameter that sets the relative influence of the visual distance d(2) and the M-D d(1).
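For clarity, a direct NumPy transcription of Eqs. (1)–(3) is sketched below; the track state (yi, Si), the descriptor rj, and the gallery ℛ are assumed to come from the Kalman filter and appearance network of DeepSORT [30].

```python
import numpy as np

def mahalanobis_d1(d_j, y_i, S_i):
    """Eq. (1): motion distance between detection d_j and track state (y_i, S_i)."""
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)

def appearance_d2(r_j, gallery_i):
    """Eq. (2): smallest cosine distance between descriptor r_j and the last
    (up to 100) descriptors r_k stored for track i; descriptors are L2-normalized."""
    return min(1.0 - float(r_j @ r_k) for r_k in gallery_i)

def assignment_cost(d_j, y_i, S_i, r_j, gallery_i, lam=0.5):
    """Eq. (3): weighted combination of the motion and appearance distances."""
    return lam * mahalanobis_d1(d_j, y_i, S_i) \
        + (1.0 - lam) * appearance_d2(r_j, gallery_i)
```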

New track IDs are generated (Fig. 3) whenever there are more detections in a frame than persons already tracked, or when a detection cannot be assigned to any track, either because it is too far from every track or because it is not visually similar to any previous detection.

3.3 Distance computation

The third stage, after people tracking, is the distance estimation between boxes. Binocular stereo vision, which uses two cameras of the same specification in place of human eyes, is a popular technique for distance estimation, but it is not appropriate for our application: surveillance systems typically use a single camera. This prompted us to look for a way to compute the distance between people from a monocular camera, and we solved this problem with a technique called perspective transformation, or bird's-eye view.

3.3.1 Perspective transformation

Projecting a 3D world scene through a monocular camera onto a 2D perspective image plane yields unrealistic pixel distances between objects. Perspective Transformation (PT), or the bird's-eye technique, changes the viewpoint of a given image or video to recover the desired information. In PT, we provide the points of the image from which we want to gather information by changing the viewpoint (a 3 × 3 transformation matrix is needed); straight lines remain straight after the transformation. To find this transformation matrix, the four-point transformation method is used, where the four points are, in order, the top-left, top-right, bottom-right, and bottom-left corners of the region of interest. The PT and warp-perspective methods from cv2 are used, and the Euclidean distance criterion then evaluates the inter-person distance. Figure 4 shows the original image transformed from a perspective view to the vertical bird's-eye view, where distances in the picture have a linear relationship with real dimensions. The relationship between a pixel (x, y) in the bird's-eye picture and a pixel (u, v) in the original picture is defined as:

$$\left[\begin{array}{c}X\\ Y\\ Z\end{array}\right]=M\left[\begin{array}{c}u\\ v\\ z\end{array}\right]$$
(4)
$$\left[\begin{array}{c}X\\ Y\\ Z\end{array}\right]=\left[\begin{array}{ccc}{m}_{11}& {m}_{12}& {m}_{13}\\ {m}_{21}& {m}_{22}& {m}_{23}\\ {m}_{31}& {m}_{32}& {m}_{33}\end{array}\right]\left[\begin{array}{c}u\\ v\\ z\end{array}\right]$$
(5)

where M in Eq. (4) is the transformation matrix. Finally, the distance between each pair of people is measured as the Euclidean distance (L2 norm) between the center points of their bounding boxes in the bird's-eye view.

Fig. 4

Perspective Transformation
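In practice, M can be estimated once per camera from the four chosen ground-plane points with OpenCV; the pixel coordinates below are placeholders for the points selected on the first frame, not the actual calibration of our sequences.

```python
import cv2
import numpy as np

# Four source points (top-left, top-right, bottom-right, bottom-left) chosen
# on the ground plane of the original frame; the values are placeholders.
src = np.float32([[480, 160], [910, 160], [1240, 720], [40, 720]])
# The rectangle they map to in the bird's-eye view.
dst = np.float32([[0, 0], [400, 0], [400, 600], [0, 600]])

M = cv2.getPerspectiveTransform(src, dst)    # the 3 x 3 matrix M of Eq. (4)

# Project one person's ground point (u, v) into the bird's-eye view.
pt = np.float32([[[640, 700]]])              # shape (1, 1, 2), as cv2 expects
bird_eye_pt = cv2.perspectiveTransform(pt, M)[0, 0]

# Warp a whole frame for visualization, as in Fig. 4:
# warped = cv2.warpPerspective(frame, M, (400, 600))
```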

3.3.2 Euclidean distance

The L2-norm distance (Eq. (6)) is the shortest distance between two points (xi, yi) and (xj, yj) in a Euclidean (two-dimensional) space. It is a standard metric for measuring the similarity between two data points and is used in various fields.

$${\left({\left({x}_{i}-{x}_{j}\right)}^{2}+{\left({y}_{i}-{y}_{j}\right)}^{2}\right)}^{1/2}=d$$
(6)

The number of pixels that represents 1.8 m in the real world varies from one dataset to another. For example, in the Oxford Town Center (OTC) dataset (Fig. 5), every 10 pixels in the bird's-eye-view space corresponds to 0.98 m in the real world; therefore, 1.8 m corresponds to about 19 pixels in the bird's-eye-view space.

Fig. 5

Distance Risk (< 1.8 m)
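This calibration reduces to a single scale factor; a minimal sketch using the OTC figures quoted above:

```python
import numpy as np

METERS_PER_PIXEL = 0.98 / 10      # OTC: 10 bird's-eye pixels correspond to 0.98 m
SAFE_DISTANCE_M = 1.8
SAFE_DISTANCE_PX = SAFE_DISTANCE_M / METERS_PER_PIXEL   # ~18.4, rounded up to 19

def is_violation(p_i, p_j, thresh_px=SAFE_DISTANCE_PX):
    """Eq. (6): L2 distance between two bird's-eye centers vs. the threshold."""
    return np.linalg.norm(np.asarray(p_i) - np.asarray(p_j)) < thresh_px
```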

Figure 6 shows the bird's-eye-view blocks and steps:

  1. Input of real-time video sequences

  2. Plot of the four reference points

  3. Output image after applying the perspective transformation

  4. Bird's-eye view after detection of violating persons (red point: violating persons; orange point: coupled persons; green point: safe persons).

Fig. 6

Perspective Transformation (with couple detection). Oxford Town Centre dataset

The output in Fig. 7 shows the result of the proposed method without couple detection: there are no orange points, only red and green ones (red points: violating persons; green points: safe persons).

Fig. 7

Perspective Transformation (without couple detection)

3.4 Detection of coupled persons

How to handle couples and family members in social distancing monitoring is one of the main questions raised by the authorities. Some researchers advise letting couples and family members walk in close proximity without counting it as a breach of social distancing, which encouraged some countries, such as the UK and some countries of the European Union, to establish laws allowing family members to walk together. Current social distancing research, however, treats every individual separately and ignores the possibility for couples and family members to walk together without it being considered a breach of social distancing.

The proposed technique for detecting couples of persons (Fig. 8) is based on the ID number of each person, obtained from DeepSORT tracking over all frames of the sequence. If the distance between two boxes is smaller than the allowed distance of 1.8 m, as shown in Algorithm 1, the couple of the two IDs (IDbox1, IDbox2) is saved in a list of tuples named cpID.

Fig. 8

Proposed technique for detecting and tracking risk/coupled persons

$$cpID=\left[\begin{array}{c}(I{D}_{bo{x}_{1}},I{D}_{bo{x}_{2}}),...,(I{D}_{bo{x}_{i}},I{D}_{bo{x}_{j}})\\ \left|i\in \left[0,length\_of\_risk\_boxes\right]\right.,j\in \left[i+1,length\_of\_risk\_boxes\right]\end{array}\right]$$
(7)

In the next frame, we count the repetitions of every couple of IDs in cpID; the details are given in Algorithm 2.

Algorithm 1:

Green and Risk Boxes

Algorithm 2:

Red and Orange boxes (coupled persons detection)
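A hedged Python sketch of the logic of Algorithms 1 and 2 is given below. The frame count corresponding to 10 s depends on the sequence frame rate; 25 fps is assumed here for illustration.

```python
from collections import Counter

FPS = 25                       # assumed frame rate of the sequence
COUPLE_FRAMES = 10 * FPS       # ~10 s below the 1.8 m limit => treated as a couple

pair_counts = Counter()        # persists across frames, keyed by (id_a, id_b)

def classify_pairs(risk_pairs):
    """risk_pairs: set of (id_a, id_b) tuples (id_a < id_b) whose bird's-eye
    distance is below the 1.8 m threshold in the current frame (cpID, Eq. (7))."""
    couples, violations = set(), set()
    for pair in risk_pairs:
        pair_counts[pair] += 1
        if pair_counts[pair] >= COUPLE_FRAMES:
            couples.add(pair)       # orange boxes: tolerated couple
        else:
            violations.add(pair)    # red boxes: social-distance breach
    return couples, violations
```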

3.5 Face mask detection

This section is divided into: i) face detection with the Dual Shot Face Detector (DSFD) [31]; ii) masked/unmasked face classification with the MobileNetv2 classifier. The structure of the proposed face mask detector is illustrated in Fig. 9.

Fig. 9

Proposed DSFD&MobileNetv2 Face Mask Detector

3.5.1 Dual Shot Face Detector DSFD

In this first step, the DSFD model is used to detect the face in the cropped person image. DSFD inherits the architecture of SSD and introduces a Feature Enhance Module (FEM) that transfers the original feature maps to extend the single-shot detector into a dual-shot detector, together with a Progressive Anchor Loss (PAL). The model is trained on the WIDER FACE dataset, where its accuracy is 0.966 (easy), 0.957 (medium), and 0.904 (hard); on FDDB it reaches 0.991 (discontinuous) and 0.862 (continuous).

WIDER FACE contains 32,203 images and 393,703 faces with a high degree of variability in scale, pose, and occlusion.

The DSFD architecture uses the same extended VGG16 backbone as PyramidBox [32] and S3FD [33], truncated before the classification layers and extended with some auxiliary structures.
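As an illustration, one open-source PyTorch port of DSFD (the face_detection package from the DSFD-Pytorch-Inference project) exposes the detector as follows; whether this matches the exact code path used in our experiments is an assumption.

```python
import cv2
import face_detection   # pip install face-detection (DSFD-Pytorch-Inference)

# Build the DSFD detector with the default thresholds from the package README.
detector = face_detection.build_detector(
    "DSFDDetector", confidence_threshold=0.5, nms_iou_threshold=0.3)

person_crop = cv2.imread("person_crop.jpg")[:, :, ::-1]   # BGR -> RGB
faces = detector.detect(person_crop)   # one [x1, y1, x2, y2, score] row per face
```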

3.5.2 MobileNetv2 Classifier

The second step is face mask classification. For this task, MobileNetv2 classification models were trained and tested on a collected dataset combining the Simulated Masked Face Dataset (SMFD) [34] and the Real-World Masked Face Dataset (RMFD) [35].

Data preprocessing and dataset

Before training the models, an image augmentation step is applied to the collected dataset (SMFD [34] / RMFD [35]). This technique increases the size of the dataset by artificially modifying the images, here with four distinct operations: shearing, Gaussian blur, average blur, and motion blur. The generated dataset is then rescaled to 224 × 224 pixels. An example is shown in Fig. 10.

Fig. 10

Images from SMFD dataset and RMFD dataset
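The four augmentation operations named above map directly onto the imgaug library; the following is a sketch assuming imgaug as the augmentation backend (the text does not fix the library), with illustrative parameter ranges.

```python
import imgaug.augmenters as iaa

# Apply 1 to 3 of the four operations named above, with illustrative ranges.
augmenter = iaa.SomeOf((1, 3), [
    iaa.Affine(shear=(-16, 16)),         # shearing
    iaa.GaussianBlur(sigma=(0.0, 2.0)),  # Gaussian blur
    iaa.AverageBlur(k=(2, 7)),           # average blur
    iaa.MotionBlur(k=(3, 9)),            # motion blur
])
resize = iaa.Resize({"height": 224, "width": 224})   # rescale to 224 x 224

# images: a list of uint8 face crops (H x W x 3 NumPy arrays)
# augmented = resize(images=augmenter(images=images))
```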

Model training

To train the model, we load a pre-trained MobileNetv2 model without its last few layers and freeze all remaining layers; we then define the face mask classifier by adding a few layers on top of the pre-trained MobileNetv2, extract the faces from the dataset, and save them in the specified directory, which must contain two sub-directories corresponding to masked and unmasked faces. The training parameters are shown in Table 5; Fig. 11 shows the progress of the training process, and Fig. 12 shows the DSFD&MobileNetv2 mask detector tested on a set of public images.
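A minimal Keras sketch of this transfer learning setup is shown below; the exact sizes of the added head layers are assumptions, since the text only specifies "a few layers" on top of the frozen backbone and the 224 × 224 input.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))
base.trainable = False                       # freeze all backbone layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),    # assumed head size
    layers.Dropout(0.5),                     # assumed regularization
    layers.Dense(2, activation="softmax"),   # "with mask" / "without mask"
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```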

Table 5 Training parameters of MobileNetv2 Face mask classifier
Fig. 11

MobileNetv2 classifier training results

Fig. 12

Output Results of DSFD&MobileNetv2 Mask Detector

3.5.3 Performance evaluation

To test the performance of the DSFD&MobileNetv2 model, the test split of the collected dataset is used. Table 6 shows that the model achieves a detection accuracy of 99.3% with a loss of 0.01. The obtained detection images confirm that the DSFD&MobileNetv2 model correctly detects all indoor faces in front of the camera. It also detects all the faces in outdoor images, whether close to the camera or far from it, including faces with head orientation, low resolution, or blurring.

Table 6 Accuracy of trained DSFD&MobileNetv2 face mask detector

4 Implementation platform and libraries

To implement the social distancing monitor, we used the Python language on a Google Colab notebook and a DESKTOP-DIPLV8E machine (i5-3230M, 2.60 GHz, NVIDIA GeForce GTX 1080 Ti graphics card). In the first stage, the weights of Social-Scaled-YOLOv4 were converted from the Darknet format to the TensorFlow format. The DSFD&MobileNetv2 face mask detection model was trained and run on a single Tesla T4 GPU on Google Colab. The libraries used in the implementation are Keras, os, OpenCV, NumPy, Matplotlib, and Pillow.

5 Results of proposed social distance and face mask monitor

Figure 13 provides basic statistics, every 100 frames of the Oxford Town Center dataset, on the number of people who break and who respect the social distancing rules, without counting coupled persons as violations.

Fig. 13

Social-Scaled-YOLOv4&DeepSORT social distancing results without couples on the Oxford Town Center dataset

Figure 13 is a 2D record of the number of persons detected in 1000 frames of the Oxford Town Center dataset, together with the number of violations (unsafe persons) and the number of safe persons.

5.1 Results: Social-Scaled-YOLOv4&DeepSORT on single images

Figure 14 shows the output of the proposed approach tested on single images. The large red boxes represent violating persons, i.e., persons breaking the social distancing rules.

Fig. 14

Output of the proposed Social-Scaled-YOLOv4&DeepSORT tested on single images

If the large box is red, we move on to face mask detection: if the face is masked, we draw a green face box; if it is unmasked, we draw a red face box. The results show the performance of the proposed approach both indoors and outdoors.

The obtained detection images confirm that the proposed social distancing and face mask method performs well.

5.2 Results: Social-Scaled-YOLOv4&DeepSORT on video sequences

To evaluate the performance of the proposed solution on real-time sequences, several tests were performed; the series of experiments is shown in Figs. 15 and 16. The performance of the model is explored on outdoor and indoor sequences that contain difficulties and obstacles such as brightness, blurring, faces close to the camera, and congestion (schools, airports, malls, …), to show the effectiveness and accuracy of the proposed social distancing monitor and face mask detector. The results are shown in Fig. 15 a), b) and Fig. 16 a), b), c).

Fig. 15

Social-Scaled-YOLOv4&DeepSORT tested on outdoor sequences: a) Oxford Town Center dataset b) CCTV camera, walking students

Fig. 16

Social-Scaled-YOLOv4&DeepSORT tested indoors: a) airport sequence without couple detection b) social distance detection on the MOT20 dataset with masked/unmasked persons c) students in a Chinese school, from pixabay

5.2.1 Social-Scaled-YOLOv4&DeepSORT outdoors

Figure 15 shows the results of the proposed solution tested on outdoor sequences (Oxford Town Center dataset, CCTV walking students, persons walking on a bridge). The Oxford Town Center camera calibration is available, which helps us extract the perspective transformation matrix; in the other sequences, four points are used to extract it.

We notice that the model gives excellent results in outdoor environments containing difficulties and obstacles such as brightness, blurring, and faces close to the camera, and that the detection of coupled persons can be applied there.

5.2.2 Social-Scaled-YOLOv4&DeepSORT indoors

Figure 16 shows the output of the proposed solution tested on indoor sequences (MOT20 dataset, students in a Chinese school, airport sequences). In all sequences, the 4-point technique is used to extract the perspective transformation matrix.

We notice that in crowded indoor environments the detection of coupled persons is very difficult, so we apply social distancing monitoring without couple detection. We conclude that the best option for indoor environments is social distancing monitoring without coupled person detection.

5.3 Social-Scaled-YOLOv4&DeepSORT databases extraction

After person detection, we crop and save every person breaching the social distancing norms (red boxes) and the faces of persons not wearing a medical mask. Persons and faces are identified by the tracking ID number and the frame number in the sequence. For the faces, we only save the faces of persons who breach the social distancing norms and do not wear a face mask:

  • Figure 17 shows examples of persons cropped from the Oxford Town Center dataset

  • Figure 18 shows examples of unmasked faces cropped from the Oxford Town Center dataset

  • Algorithm 3 illustrates the technique used to crop and save the picture of each person and face only once, using the tracking ID (identification number)

Fig. 17

Persons saved by the proposed Social-Scaled-YOLOv4&DeepSORT from the Oxford Town Center dataset

Fig. 18

Faces saved by the proposed Social-Scaled-YOLOv4&DeepSORT from the Oxford Town Center dataset

Algorithm 3:

Saving violating persons and unmasked faces
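A minimal Python sketch of the save-once logic of Algorithm 3; directory and file naming are placeholders.

```python
import os
import cv2

saved_ids = set()    # tracking IDs already written to disk

def save_once(frame, track_id, frame_no, box, out_dir="risk_persons"):
    """Crop and save a violating person (or face) exactly once per tracking ID."""
    if track_id in saved_ids:
        return
    os.makedirs(out_dir, exist_ok=True)
    x1, y1, x2, y2 = map(int, box)
    crop = frame[y1:y2, x1:x2]
    cv2.imwrite(os.path.join(out_dir, f"id{track_id}_frame{frame_no}.jpg"), crop)
    saved_ids.add(track_id)
```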

6 Approach limitations

The bird's-eye view (perspective transformation) gives a top view of the scene, which results in a close but not exact distance calculation. Figure 19 shows that the detection of coupled persons can only be applied outdoors; in crowded indoor environments we cannot use it and fall back on social distancing monitoring without couple detection. In addition, the use of three models (Social-Scaled-YOLOv4 / DeepSORT / DSFD&MobileNetv2) decreases the real-time performance of the system.

Fig. 19

Limits of Couple Detection (MOT20 Dataset)

7 Comparison with state of the art

Tables 7 and 8 compare the proposed approach with the state of the art. Table 7 compares it with published social distancing monitoring methods; we can notice that the proposed approach offers new options, namely the use of DeepSORT to track persons and a technique to save risk pedestrians and unmasked faces.

Table 7 Comparison of the options of our social distancing and face mask detection approach with the state of the art
Table 8 Comparison of our face mask detection approach with the state of the art

Table 8 details the state-of-the-art face mask detection models published in the last two years. Note that the proposed model:

  • provides acceptable performance compared to state-of-the-art face mask detection models;

  • improves accuracy and real-time performance through DSFD&MobileNetv2;

  • uses the extracted database of violating persons for person identification and for detecting a person's disease risk based on facial expression.

8 Conclusion

This paper developed a four-stage real-time model based on deep learning and a monocular camera to monitor social distancing and detect face masks, in order to limit the spread of the coronavirus and keep people safe during the COVID-19 pandemic. In the first stage, we used the Social-Scaled-YOLOv4 model to detect risk persons. In the second stage, DeepSORT was used to track the people in the scene. Perspective transformation was then used to correct the unrealistic pixel distances between persons, and the pairwise distances between the centroids of the detected bounding boxes were measured with the Euclidean distance. To check social distancing violations, an approximation of the physical distance in pixels is used together with a defined threshold: a distance below the threshold between persons who are not coupled means a breach of the social distancing norms. Finally, face mask detection on these persons is performed with the proposed DSFD&MobileNetv2 face mask detector. A breach results in a red person box being drawn and the person box being saved in a predefined folder; face mask detection then draws a bounding box on the identified face, green for a masked face and red for an unmasked face. The Social-Scaled-YOLOv4 model achieves a detection accuracy of 56.2% on the MS-COCO dataset, and the DSFD&MobileNetv2 model achieves a detection accuracy of 99.3%. A database of masked and unmasked faces is created for identification.

In future work, we aspire:

  • to exploit the extracted database to identify violating or infected persons based on facial expression;

  • to combine this monitoring system with another recent tracker;

  • to integrate this method into an embedded system (Arduino, Raspberry Pi, ArduPilot, …).