Introduction

Coronavirus disease (COVID-19) is a highly contagious disease that has wreaked havoc globally and forced people apart. The pandemic has affected about 220 nations and regions worldwide, with roughly 197,201,247 confirmed cases as of July 2021 [1]. Numerous COVID-19 virus variants have already been identified worldwide, as the virus continuously evolves through mutation. Recently, scientists discovered a novel variant in India, dubbed the Delta variant. The Delta strain spreads several times faster than other strains [2] and is dominating the world health sector.

To prevent the massive spread of COVID-19, the WHO established specific fundamental guidelines, such as maintaining physical distance from others, wearing a face mask, washing hands for 20 s, wearing PPE, and staying at home. However, it is difficult for governments to control individuals in public areas. As a result, governments in many countries altered their policies to minimize the number of people in public places, for example by closing all educational institutions, limiting in-person attendance in the workplace, and requiring garment workers to wear face masks and keep a distance of at least 2 m. Furthermore, several software firms moved their offices online, with staff working from home. Physical distance between people contributes to reducing COVID-19 transmission, and numerous institutions have initiated programs to disperse individuals from densely populated areas. For instance, governments post police officers in public areas to ensure public safety, security, and physical distancing. This traditional approach (having security personnel check individuals) is tedious, time-consuming, and imprecise. The primary purpose of this study is to determine the distance between individuals from real-time video, which helps protect a particular place from the spread of COVID-19. Our system effectively identifies physical distances and improves safety from COVID-19 in specific regions, such as construction sites and garment factories.

The system uses state-of-the-art machine learning together with a building's surveillance cameras to identify whether or not individuals are keeping a safe physical distance from each other based on real-time video feeds [3]. It can also connect with security cameras at various businesses to prevent employees from working too closely together. Three demonstration steps are provided, representing the stages of calibration, detection, and measurement.

Research on the assessment of physical distance during COVID-19 has produced a wide variety of numerical techniques and models for evaluating the evolution of pandemic processes over the last 2 years [4,5,6,7,8]. These studies discuss how people's physical distance is affected by their social environment and raise COVID-19-related concerns. Recently, some commercialized computer vision-based systems [9,10,11] for monitoring physical distance have been established. These methods are intriguing; however, their findings include no statistical analysis, and their implementation is barely discussed. While the discussions are enlightening, they do not provide concrete results for measuring physical distance, leave the issue open, and may be considered undesirable by others.

Nevertheless, none of these approaches provides in-depth explanations of its methods, performance benchmarks, or the rationale for its detection algorithm selection. Some works argue for a theoretical approach to physical distancing but leave out the specific steps that can be taken in real life. In contrast, we propose a real-time automated surveillance system that detects and monitors people, measures the inter-distance between them, assesses risk in real time through warnings in the form of red text ('Unsafe'), and counts people using bounding boxes. Furthermore, our technology is crucial for observing the physical distance between people in dense areas and for controlling entry to a particular location. Listed below are the most important contributions of this paper:

  • We introduced a deep learning-based automated system for monitoring and detecting people to reduce the spread of the coronavirus and its economic costs.

  • A transformer-based TH-YOLOv5 is used to detect and classify pedestrians, while Deepsort [12] is used to track people in this system.

  • We used Transformer Heads (TH) to detect pedestrians in high-density environments. In addition, the CBAM was included in YOLOv5 to assist the network in locating areas of interest in images with extensive region coverage.

  • We utilized pairwise L2 vectorized normalization, which uses the centroid coordinates and dimensions of the bounding box to create a 3D feature space. Using the violation index, we calculated how many people are not following the physical distance regulation.

The following sections comprise the rest of this paper: Section “Related Works” summarizes the relevant literature; Section “Proposed Method” elaborates on the proposed approach; Section “Experiments” explains the data sets and training details; Section “Results and Discussions” presents the specifics of our results and discussion. Finally, we conclude this article in Section “Conclusion”.

Related Works

COVID-19 affects individuals differently, but it mainly spreads via droplet contact, physical touch, and airborne transmission. Physical distance may play an essential role in reducing the spread of COVID-19 [13]. As a result, everyone should be careful and observe the norms of physical distancing, such as keeping a set space (100 cm) between themselves and others; that is why it is also known as “physical distancing”. Several techniques for detecting objects in videos and images have previously been suggested for a variety of applications [14]. They are discussed comprehensively in this section.

Bouhlel et al. [15] developed a video sequence technique that combines two approaches, the macroscopic and microscopic methods, to calculate the real-time distance between individuals. Matching these techniques, they utilized two kinds of data sets, Mayenberg and Mliki. They used a three-level categorization method that substantially improves performance; however, it is not suited for real-time use.

In the COVID-19 scenario, Razavi et al. [16] devised an automated method to monitor construction workers to guarantee their safety. When employees are on duty, the system detects face masks and the physical distance between workers. They utilized Faster R-CNN Inception ResNet V2 for image detection to improve the system's accuracy. However, the accuracy decreases as the quantity of training data decreases, and the system is unable to detect a mask when an employee turns their head. They used Faster R-CNN Inception V2 to detect the distance between people, calculating the actual distance traveled by workers from the image; this established approach did not work effectively because its performance is poor.

Rahim et al. [17] proposed a technique for measuring physical distance in low-light environments. COCO detection metrics were used to assess the trained model's performance. They utilized the YOLO-v4 model for object recognition and physical distance measurement, but the method is not real time. Two-stage detectors achieve better localization and object recognition accuracy, while one-stage detectors achieve faster inference speed [16]. The technology only works in settings with a fixed physical distance and two target items. Some prototypes that use machine learning and sensing technologies for physical distance tracking have been proposed. Landing AI [3] suggested a physical distance estimator that uses a security camera to identify individuals whose physical distance is less than the acceptable value. In a manufacturing facility, another system [18] was used to control and track worker movements and deliver real-time audio warnings. Along with security cameras, systems based on LiDAR and stereo cameras [19] were presented, demonstrating that sensors other than monitoring cameras can also be beneficial.

Using YOLOv3 and Deepsort, a new approach was presented for locating and tracking individuals [20] while monitoring their social distance. The procedure also measures the extent of non-social-distancing activities and then calculates an index of non-social-distancing behaviors. This method seems unique; however, it lacks any statistical analysis.

Rezaei et al. [6] introduced DeepSOCIAL, a DNN-based model that uses cutting-edge deep learning methods to identify and track people and measure social distances, combined with dynamic risk assessment. While this paper did not address pure violation detection, it did present a method to help reduce congestion.

While the methods described above are intriguing, collecting data and issuing invasive warnings may be considered undesirable by some individuals. In contrast, we present an automated system capable of real-time human identification, tracking, and physical distance measurement that indicates whether a person is safe or unsafe.

Proposed Method

Fig. 1 Block diagram of the proposed physical distancing system. The proposed TH-YOLOv5 is used for pedestrian detection and classification; Deepsort and the pairwise L2 norm are used for tracking and physical distancing

This section describes our proposed physical distancing monitoring system, which consists of three stages: detection and tracking of persons, inter-distance calculation, and zone-based infection risk evaluation. The system is designed to operate with all types of CCTV security cameras, independent of video quality, and to depict identified people and their distances in real time. Figure 1 shows the block diagram of the proposed system, which is divided into three parts: detection, tracking, and distance estimation. The following subsections describe the full process of our proposed system.
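To make the data flow between the three stages concrete, the following Python sketch outlines one pass of the pipeline over a video stream. The callables `detector`, `tracker`, and `estimate_distances` are hypothetical stand-ins for the TH-YOLOv5 detector, Deepsort tracker, and pairwise L2 measurement described below; they are not part of any released code.

```python
import cv2

def monitor(video_path, detector, tracker, estimate_distances, min_cm=100):
    """Run the three-stage pipeline frame by frame. `detector`, `tracker`,
    and `estimate_distances` are hypothetical callables standing in for
    TH-YOLOv5, Deepsort, and the pairwise L2 distance measurement."""
    cap = cv2.VideoCapture(video_path)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        boxes = detector(frame)                     # stage 1: detection
        tracks = tracker.update(boxes, frame)       # stage 2: tracking with IDs
        for (i, j), dist in estimate_distances(tracks).items():
            unsafe = dist < min_cm                  # stage 3: distance check
            color = (0, 0, 255) if unsafe else (255, 255, 255)
            # draw bounding boxes and 'Unsafe' labels here ...
    cap.release()
```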

Pedestrian Detection

A real-time human detector is introduced for complex environments that contain various objects, so that people are identified correctly. This detector has a feature extractor and a classification module based on YOLOv5 [21]. A fundamental strategy for improving the feature-extraction accuracy of CNN-based object detectors [22] is to broaden the receptive field and increase the system's complexity with additional layers while identifying tiny objects more effectively. Instead, for easier training, we utilized skip connections. Consequently, to decrease the parameter size, a modified CSPDarknet53 with SPP layers is utilized as the feature extractor backbone, with PANet [23] serving as the neck and YOLO as the detecting head [24]. A collection of freebies and specials [25] is applied to optimize the whole architecture. The neck reprocesses and logically employs the feature maps collected by the backbone at various phases; it is usually made up of multiple bottom-up and top-down routes and is an essential connection in the target detection architecture. A transformer head (TH) is then used as the classification network, extracting feature maps from the backbone to detect the location and class of each person. The transformer encoder is utilized in the head part. Figure 2 depicts the TH-YOLOv5 architecture.

Fig. 2 Architecture of the proposed TH-YOLOv5 for pedestrian detection. It has three main modules: backbone, neck, and head

Transformer Head (TH). Analyzing the CrowdHuman [26] and CityPersons [27] data sets, we discovered a large number of small instances; therefore, we add another prediction head for detecting small objects. Combined with the other three prediction heads, our four-head structure reduces the detrimental effect of drastic object scale variation on the predictions. The added prediction head (head No. 1) is created from a low-level, high-resolution feature map that is more sensitive to tiny objects, as illustrated in Fig. 2. The performance of tiny object detection improves significantly after adding this detection head, even though the compute and memory costs rise.

Transformer Encoder. Inspired by the vision transformer [28], we modified the original version of YOLOv5 to include transformer encoder blocks, replacing several convolutional blocks and CSP bottleneck blocks. Figure 3a depicts the structure. Compared with the original bottleneck blocks in CSPDarknet53, a transformer encoder block can collect more global information and a great deal of contextual information. Each transformer encoder consists of two sub-layers: the first is a multi-head attention layer, and the second is a fully connected MLP, with residual connections employed around each sub-layer. Transformer encoder blocks improve the ability to capture a variety of local objects and perform better on occluded persons with high density on the CityPersons data set. We used multi-head attention, an attention mechanism applied several times in parallel, which may also be used to investigate feature representation possibilities. The outputs of the individual attention heads are concatenated and linearly transformed into the expected dimension. Intuitively, having multiple attention heads enables the model to focus on various sections of the sequence in different ways and to attend jointly to information from different representation subspaces at different positions, which a single averaging attention head would inhibit. The multi-head attention (MH) is computed as follows:

$$\begin{aligned} \text {MH}(Q, K, V)&= \text {Concat}(\text {head}_{1}, \ldots , \text {head}_{n}) P^{O} \end{aligned}$$
(1)
$$\begin{aligned} \text {head}_{i}&= \text {Attention}(Q P_{i}^{Q}, K P_{i}^{K}, V P_{i}^{V}) \end{aligned}$$
(2)

where Q, K, and V denote the query, key, and value, respectively, and \(P_{i}^{Q}\), \(P_{i}^{K}\), and \(P_{i}^{V}\) represent the corresponding parameter matrices. In addition, \(h=2\) parallel attention layers, or heads, are used to reduce the dimension.
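As a concrete illustration of Eqs. (1) and (2), the following PyTorch sketch implements one encoder block with two sub-layers and residual connections. The layer-normalization placement and the MLP expansion ratio are our assumptions for a minimal working example; they are not specified in the text above.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Minimal transformer encoder block: multi-head self-attention
    followed by a fully connected MLP, each with a residual connection."""
    def __init__(self, dim, num_heads=2, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # sub-layer 1: MH attention
        x = x + self.mlp(self.norm2(x))                    # sub-layer 2: MLP
        return x
```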

We used transformer encoder blocks in the head section, on top of the backbone, for human prediction. The Transformer Head (TH) is applied in YOLOv5 at the end of the backbone network, where the feature maps have low resolution; applying TH to low-resolution feature maps reduces computational and memory costs. Furthermore, when we increase the resolution of the input images, we can remove certain TH blocks from the early layers to keep the training process feasible.

Fig. 3 Architecture of the transformer encoder module (a) and the CBAM module (b)

Convolution Block Attention Module (CBAM). CBAM [29] is a simple yet very effective attention module. It is a lightweight module that can be inserted into most well-known CNN architectures and trained in an end-to-end manner. Given a feature map as input, CBAM performs adaptive feature refinement by sequentially inferring attention maps along two distinct dimensions, channel and spatial, and multiplying each attention map with the input feature map. Figure 3b illustrates the structure of the CBAM module. Large coverage regions in CCTV photos usually feature confusing spatial characteristics. The CBAM is used to extract the attention area, which helps TH-YOLOv5 avoid misleading input and focus on valuable target objects.
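The sketch below shows the sequential channel-then-spatial refinement just described, following the structure of the original CBAM paper [29]. The reduction ratio and convolution kernel size are standard defaults from that paper, not values given in this text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention, then spatial attention, each multiplied with
    the feature map (sequential adaptive refinement as in [29])."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```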

Human Tracking

In human tracking, Deepsort [12] is used to track people in any video. It learns appearance patterns from identified humans in images, which are subsequently coupled with temporal information to forecast the subjects' trajectories. It assigns unique IDs to keep track of each object under investigation for statistical analysis. Deepsort can also deal with occlusion, multiple perspectives, and the annotation of training data. In Deepsort, the Kalman filter and the Hungarian algorithm are commonly employed for accurate tracking: for improved association, the Kalman filter is applied recursively and can forecast future locations based on present positions [30]. We subsequently utilize this temporal information to assess the severity of physical distance breaches and the presence of high-risk zones in the scene. The state of each individual in a frame is represented by the following equation:

$$\begin{aligned} F = [x, y, a, b, x', y', a']^{T} \end{aligned}$$
(3)

where (x, y) indicates the target bounding box's horizontal and vertical positions, a signifies the scale (area), and b specifies the bounding box's aspect ratio. \(x'\), \(y'\), and \(a'\) are the values of the horizontal position, vertical position, and scale anticipated by the Kalman filter.
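The NumPy sketch below makes the role of Eq. (3) concrete as one Kalman prediction step under the usual SORT-style constant-velocity assumption: x, y, and a are advanced by their rates \(x'\), \(y'\), \(a'\), while the aspect ratio b is assumed constant between frames. The process-noise matrix Q and the filter initialization are omitted details, not specified here.

```python
import numpy as np

# Constant-velocity state transition for the 7D state of Eq. (3).
F_mat = np.eye(7)
F_mat[0, 4] = F_mat[1, 5] = F_mat[2, 6] = 1.0  # x += x', y += y', a += a'

def kalman_predict(state, P, Q):
    """One Kalman prediction step: propagate the state of Eq. (3)
    and inflate its covariance P by the process noise Q."""
    state = F_mat @ state
    P = F_mat @ P @ F_mat.T + Q
    return state, P
```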

After completing the detection and tracking procedure, for each input frame of size \(w \times h\) at time t, we define the following matrix \(D_t\), which contains the positions \(P^t\) of the n identified persons in the image grid:

$$\begin{aligned} D_t = \{ P^t_{x_n, y_n} \mid x_n \in w,\ y_n \in h \} \end{aligned}$$
(4)

Distance Measurement

Researchers have developed various 2D and 3D depth estimation techniques [31, 32]. We calculated the distance between identified and tracked individuals in each image and video.

The Deepsort model generates a collection of bounding boxes and an ID for each person detected in the previous phase. When working with rectangular bounding boxes, each box is represented by coordinates (x, y) in a 3D (x, y, d) feature space. However, the picture obtained from the camera provides only the reduced 2D space of (x, y); depth (z) is not directly accessible. To better visualize the 3D position of each bounding box, every point in space is given three values (x, y, z). Using Eq. 5, with the 2D pixel coordinates (x, y) as input, the world coordinate points \((X_w, Y_w, Z_w)\) are mapped to points on the screen [33].

$$\begin{aligned} [x, y, 1]^{T} = K R T \, [X_w, Y_w, Z_w, 1]^{T} \end{aligned}$$
(5)

Here, K, R, and T represent the intrinsic, rotation, and translation matrices, respectively. This feature space indicates the coordinates of the centroid, and the value of d describes the depth of each object [34]. We use Eq. 6 to estimate the depth between the camera and an object, which can be acquired by studying the shape of the object in the image [35].

$$\begin{aligned} d = \frac{2\pi \times 180}{(w\times h\times 360)\times 1000+3} \end{aligned}$$
(6)

where w denotes the bounding box's width and h denotes its height. The pairwise L2 norm is calculated for the collection of bounding boxes as given by Eq. 7.

$$\begin{aligned} D = \sqrt{\sum _{i=1}^{n} (q_i - p_i)^2} \end{aligned}$$
(7)

where n = 3 in this equation. Here, D represents the distance, and \(q_i\) and \(p_i\) represent the feature points of two humans. After locating an individual's neighbors using the L2 norm, we assign them based on proximity sensitivity. The proximity threshold is continuously updated through a large number of tests, using a set of values between 90 and 170 pixels depending on the person's position in a particular frame. To use the proximity property, every person in the system is assigned at least one neighbor, or several additional neighbors, to create a group with distinct color-coding. The creation of groups implies a breach of physical distancing practice, which is measured using the following equation:

$$\begin{aligned} v_i = \frac{n_p}{n_g} \end{aligned}$$
(8)

where \(v_i\) represents the violation index, \(n_g\) is the number of separate groups or clusters detected in the video, and \(n_p\) is the total number of individuals belonging to those groups or clusters.
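Putting Eqs. (6)-(8) together, the sketch below computes the depth of each box, the pairwise L2 distances in the (x, y, d) feature space, and the violation index. Treating groups as connected components of the proximity graph is our reading of the neighbor-assignment described above, and the default 130-pixel threshold is just one value from the stated 90-170 range.

```python
import numpy as np

def depth_from_box(w, h):
    """Eq. (6): approximate depth from the bounding-box width and height."""
    return (2 * np.pi * 180) / ((w * h * 360) * 1000 + 3)

def violation_index(boxes, threshold=130):
    """Pairwise L2 distances in the (x, y, d) space (Eq. 7) and the
    violation index v_i = n_p / n_g (Eq. 8). `boxes` holds one
    (cx, cy, w, h) row per tracked person."""
    pts = np.array([[cx, cy, depth_from_box(w, h)] for cx, cy, w, h in boxes])
    n = len(pts)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) < threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # connected components of the proximity graph form the groups
    groups, seen = [], set()
    for i in range(n):
        if i in seen or not neighbors[i]:
            continue  # isolated persons belong to no group
        stack, comp = [i], set()
        while stack:
            k = stack.pop()
            if k in comp:
                continue
            comp.add(k)
            stack.extend(neighbors[k] - comp)
        seen |= comp
        groups.append(comp)
    n_g = len(groups)
    n_p = sum(len(g) for g in groups)
    return n_p / n_g if n_g else 0.0
```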

Experiments

Data sets and Evaluation Metrics

Table 1 Specifics of the data sets utilized in our experiments. The number of images, the train/test split, and the number of pedestrians in each data set are shown here

Pedestrian recognition, tracking, and distance estimation all center on the data sets used: a common database lets researchers compare the effectiveness of various algorithms and provides the data needed to carry out experimental studies. When evaluating the quality of a data set, it is essential to consider both the quantity of data and the accuracy of the labeled information. The robustness of the detector is, in part, determined by the depth of the data set being analyzed, and pedestrian detection possesses distinct properties compared with generic object recognition tasks. The proposed system is trained on the MS COCO object detection data set [36], which has 80 classes and 123k images; each bounding box label is additionally annotated with the matching coordinates. The CrowdHuman [26] and CityPersons [27] data sets are also used for person detection to assess our system. Both data sets include annotations for two types of bounding boxes: the visible human area and the full human body. The CrowdHuman data set is far more challenging than CityPersons, since it has more instances per image, and those instances often overlap strongly. Table 1 gives an overview of the data sets. To compare and evaluate the proposed system, we used another data set, the Oxford Town Centre (OTC) data set [37], an unseen and complex data set with a high frequency of overlapping detections and overcrowded zones. This collection also includes a wide range of human clothing and appearance in a real-world public location. FPS, mAP, and the total loss in identifying individuals, as shown in Fig. 5, are measured continuously throughout the validation period.

Training Details

All training and testing were performed on the same PC, which runs Windows 10 with an Intel Core i9-10850K CPU at 3.6 GHz, 64 GB of RAM, and an NVIDIA GeForce RTX 2070 Super GPU with 8 GB of video RAM. We implemented the system in PyTorch in a CUDA environment using the VS Code editor. After training TH-YOLOv5, we evaluated the data sets and assessed the outcomes by inspecting the failure cases. We determined that TH-YOLOv5 has excellent localization capability but limited classification capability.

The transfer learning concept was used to train the TH-YOLOv5-based model, which was then fine-tuned and optimized on the MS COCO data set. We utilized SGD with warm restarts to adjust the learning rate throughout the training phase, which helped the optimizer jump out of local minima in the solution space and saved training time. The technique starts with a high learning rate, slows it down midway, and then decreases the learning rate for each batch along a slight downward slope.
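A minimal sketch of such a warm-restart schedule, using PyTorch's built-in cosine annealing with restarts; the optimizer hyperparameters and cycle lengths shown are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(10, 2)  # stand-in module; TH-YOLOv5 would go here
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

epochs, batches_per_epoch = 30, 100
for epoch in range(epochs):
    for i in range(batches_per_epoch):
        # ... forward pass, loss, backward, optimizer.step() ...
        # Fractional epochs make the learning rate decay per batch,
        # restarting high at the start of each cycle.
        scheduler.step(epoch + i / batches_per_epoch)
```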

Results and Discussion

Performance of Pedestrian Detection

Table 2 Pedestrian detection performance on three benchmark data sets with two well-known backbones in terms of mAP, NOI, total loss, and FPS

Table 2 demonstrates the pedestrian detection performance of our proposed model, TH-YOLOv5, on three popular benchmark data sets. For every data set, we employed two backbones in TH-YOLOv5: CSPResNext50 and CSPDarkNet53. From Table 2, it can be observed that our proposed TH-YOLOv5 obtains a higher mAP on every data set: when CSPResNext50 and CSPDarkNet53 are utilized on CrowdHuman [26], the mAP is 91.3% and 92.1% with a minimum loss of 0.87 and 1.14, around 16k iterations, and 28 FPS, respectively. Similarly, on CityPersons and OTC, TH-YOLOv5 shows outstanding accuracy, e.g., 93.7%, 92.8%, 88.7%, and 89.5% with the CSPResNext50 and CSPDarkNet53 backbones. With the minimum loss, the number of iterations is also lower (about 12.5k and 13k) on the CityPersons [27] and OTC [37] data sets.

Fig. 4 Visual representation of pedestrian detection. The detection results obtained from the CrowdHuman [26], CityPersons [27], and OTC [37] data sets are shown in the figure's first, second, and third rows, respectively

Figure 4 displays the visual results achieved by the proposed TH-YOLOv5. The pedestrian detections on the CrowdHuman [26], CityPersons [27], and OTC [37] data sets are shown in the first, second, and third rows, respectively. The first and second rows show that our model can adequately detect pedestrians, while the third row shows that the proposed TH-YOLOv5 can identify tiny objects (humans/pedestrians) from the perspective of a surveillance camera.

Performance Analysis

Fig. 5 Minimum vs. average closest physical distance on the Oxford Town Centre data set

We calculated the evaluation metrics of mean absolute error (MAE) and the average closest physical distance \(d_{avg}\) over all frames for measuring physical distance. We calculated \(d_{avg} = \frac{1}{n} \sum ^n_{i=1} d^{min}_i\), where \(d^{min}_i = \min _{j \ne i} (d_{i,j})\), \(j \in \{1,2,\ldots ,n\}\), is the closest physical distance from human i to any other human, and the MAE of the social distancing violation ratio \(r_v = \frac{v}{n}\), where v is the number of humans who break the physical distance and n is the total number of people.
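The following sketch computes both per-frame quantities from an n x n pairwise distance matrix and the MAE across frames. The 100 cm limit matches the threshold stated earlier; assuming the matrix is expressed in centimetres is ours.

```python
import numpy as np

def frame_metrics(D, limit=100.0):
    """For one frame's n x n pairwise distance matrix D (assumed in cm),
    return the average closest distance d_avg and the violation ratio
    r_v = v / n against the 100 cm limit."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)   # exclude self-distances
    d_min = D.min(axis=1)         # d_i^min: closest neighbour per person
    return d_min.mean(), (d_min < limit).mean()

def mae(predicted, ground_truth):
    """Mean absolute error of a metric across all frames."""
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(ground_truth))))
```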

Table 3 Performance of physical distance measurement

The performance of physical distancing (also known as COVID-19 social distancing) measurement is shown in Table 3, which reports the time taken to identify breaches of physical distance. The proposed TH-YOLOv5 with the CSPResNext50 and CSPDarknet53 backbones is evaluated on physical distance measurement and violation detection on the OTC data set. With the CSPResNext50 backbone, the proposed TH-YOLOv5 achieved an MAE of 1.514 on \(d_{avg}\) and 0.167 on \(r_v\). Our proposed method performs best with the CSPDarknet53 backbone, obtaining an MAE of 0.559 on \(d_{avg}\) and 0.114 on \(r_v\) on the OTC data set. Figure 5 shows the minimum and average closest distances: the blue line shows the average closest distance between individuals, while the orange line represents the minimum closest distance in this data set.

Fig. 6 2D histograms that visualize the frequency of social distance violations vs. social density. The histograms show a positive correlation between the two

Figure 6 shows how the frequency of social distance violations correlates with the social density (\({human}/{m^2}\)) in 2D histograms on the OTC data set. Violations rise as social density rises, as can be seen in the graph, where the two variables (social density and violations) are linearly related to one another. A linear regression can therefore be fitted to this relationship.
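A small sketch of such a fit with NumPy; the `density` and `violations` arrays below are hypothetical per-frame values for illustration, not data from the paper.

```python
import numpy as np

# Fit the linear relation between social density (humans per m^2) and
# violation count suggested by Fig. 6 (illustrative data only).
density = np.array([0.02, 0.04, 0.05, 0.07, 0.09])
violations = np.array([1, 2, 3, 4, 6])
slope, intercept = np.polyfit(density, violations, deg=1)
predicted = slope * density + intercept  # fitted violation counts
```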

Fig. 7 Output of the proposed physical distancing system. Here, (a)-(d) represent low risk, lower-medium risk, medium risk, and high risk, respectively

Figure 7 depicts the detection and physical distancing outcomes of the proposed approach, showing the distance between people. If the distance between two people is more than 100 cm, the system shows a white bounding box with the 'Safe' label and the safe count. Otherwise, the bounding box is red with the word 'Unsafe' and includes a count of at-risk people from the input video.
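A minimal OpenCV sketch of this labeling rule, assuming the closest distance for each person has already been computed; colors are BGR, and the helper name is ours.

```python
import cv2

def draw_status(frame, box, distance_cm, limit_cm=100):
    """Draw a white 'Safe' box when the closest distance exceeds the
    100 cm limit, otherwise a red 'Unsafe' box, as in Fig. 7."""
    x1, y1, x2, y2 = box
    safe = distance_cm > limit_cm
    color = (255, 255, 255) if safe else (0, 0, 255)
    label = 'Safe' if safe else 'Unsafe'
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, label, (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
```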

Fig. 8 Analysis of the overlapping activation map using the proposed TH-YOLOv5. Here, the first, second, and third rows show the flow of people, violations, and dangerous areas. In addition, (a)-(c) illustrate the mosaic heatmap, the violation heatmap, and the 3D tracking + violation heatmap, respectively

Figure 8 depicts the analysis of the overlapping activation map. In this figure, the level of violation increases as the color moves from blue toward red. The first, second, and third rows represent the flow of pedestrians, the violation heatmap, and risky areas, respectively. Figure 8a, b presents two examples of situations in which two persons remained stationary at two specific spots on the grid for an extended length of time; as a result, the heat map became reddish at those locations. Because there was more foot traffic on the sidewalks than in the center of the roadway, the heat maps for those areas showed much higher values. In general, more red grid cells may suggest locations that are potentially more dangerous. Beyond people's natural movement and tracking data, it is even more advantageous to analyze the density and position of the individuals who significantly breached the physical distancing measures. In addition, for a better 3D portrayal of safe and dangerous zones, (a) through (c) correspond to people's flow and violation heatmaps.

Fig. 9 Visual investigation of the activation heatmap. In this case, (a)-(d) refer to pedestrian detection and tracking, the mosaic heatmap, the crowd heatmap, and people status statistics, respectively

The visual investigation of the activation heatmap is shown in Fig. 9. In this figure, (a), (b), (c), and (d) demonstrate human detection and tracking (the green and yellow bounding boxes represent the safe and dangerous regions), the mosaic heatmap of violations, the crowd violation heatmap, and the live people status data, respectively. The crowd map (Fig. 9c) displays infractions and risks, where we applied rising and falling decay-level trends. In addition, the visual graph (Fig. 9d) offers basic information about the population inside each frame, the number of persons who fail to observe the physical distance, and the number of social distance violations excluding coupled groupings.

Comparative Analysis and Discussion

Table 4 Five well-known object detectors and the proposed TH-YOLOv5 with two backbones compared for real-time human detection

Table 4 shows the human detection performance of five well-known object detectors and the proposed TH-YOLOv5. TH-YOLOv5 is evaluated in the proposed system using two alternative backbones: CSPResNext50 and CSPDarknet53. The Faster RCNN [38], YOLO-v3 [38], YOLO-v4 [25], EfficientDet [40], and YOLO-v5 [21] models are compared with the proposed TH-YOLOv5 for human detection and tracking. Faster RCNN attained an accuracy of 0.813 mAP at 27 FPS. YOLO-v3 achieved an mAP of 0.839 at 26 FPS; it outperforms Faster RCNN in accuracy while falling slightly short in FPS. The table also reports the performance of YOLO-v4 and YOLO-v5 with the two backbones, and of EfficientDet, compared in terms of mAP, number of iterations (NOI), training loss (TL), and FPS. The results of the TH-YOLOv5 experiment with and without the transformer head (TH) are shown, each with the CSPResNext50 and CSPDarknet53 backbones. In the proposed model, the mAP values obtained with TH were the greatest (0.895 with CSPDarknet53). The proposed model with TH reached its maximum values while suffering only a slight cost (an NOI of 13,847 and a training loss of 1.08 with CSPDarknet53). In addition, CSPResNext50 achieves the highest FPS of 30.

Table 5 Comparison of the proposed system with the state-of-the-art methods to perform real-time physical distancing measures at frames per second (FPS)

To illustrate the efficiency of the suggested approach, we compare our findings with previous techniques in Table 5. Rezaei et al. [6] and Yang et al. [7] presented techniques that obtained 24.1 and 25 FPS with \(512 \times 512\) and \(1920 \times 1080\) input sizes, respectively, indicating that these methods work well when the input size is big. In the suggested system, our input size is just \(640 \times 640\), yet our FPS is 30. Therefore, the proposed approach achieves a higher FPS on the same OTC data set and performs better than the state of the art.

Table 6 Comparison of the proposed TH-YOLOv5 with the recent system in terms of \(AP_{50}\), input size and FPS

Table 6 shows the comparison between our proposed system and the recent work [42]. We can observe from this table that our model (TH-YOLOv5 with CSPDarkNet53) runs at fewer frames per second (33 vs. 300), but its average precision is much better (88.1 vs. 40.2) at the same input size of 600 \(\times\) 600.

Conclusion

This article introduced a deep learning-based automated system for real-time physical distance monitoring, detecting, and tracking people using bounding boxes. We presented TH-YOLOv5 for object detection and classification in this system, and Deepsort was used for tracking humans. We calculated the physical distance between two or more people using pairwise L2 normalization over a 3D feature space. To monitor physical distance, we generate a bounding box with text (safe or unsafe) indicating who follows or breaks the required distance. The number of violations is verified by calculating the number of groups created and the violation index, defined as the ratio of individuals to groups. The suggested approach for detecting people and estimating physical distances is tested on the Oxford Town Centre data set, covering over 7,500 person detections and distance estimates. We also used two pedestrian benchmark data sets, CrowdHuman and CityPersons, to evaluate the proposed system. The system performed well under various conditions, including occlusion, illumination changes, and partial visibility, and demonstrated a significant improvement with an mAP score of 0.895 at 29 FPS, comparatively outperforming well-known recent state-of-the-art object detectors. In the future, this technique will be applied to mobile cameras, such as those mounted on autonomous drones, which are easier to control and more effective at tracking fast-moving objects in all directions.