Introduction

Coronaviruses are a group of enveloped viruses known to infect human hosts and cause respiratory disease [1]. A coronavirus identified in 2019, SARS-CoV-2, triggered a pandemic of respiratory disease, named COVID-19. Preventive measures such as wearing a face mask, sanitizing hands, and maintaining social distancing are important [2] and advised by the World Health Organization (WHO). Rapid transmission is the deadliest trait of the COVID-19 virus, making social distancing the most important of these measures. Social distancing also reduces the burden on healthcare systems and keeps the economy progressing. Commercial establishments must maintain standard operating procedures (SOPs) to control public transmission of the infection.

In overcrowded indoor or outdoor settings, it is challenging to manually check for social distance violations. As a result, computers may be used to monitor busy areas through cameras and ensure that the social distancing guideline is followed properly. For this, it is necessary to first identify the people in the video clip, second, determine the social distance between the identified individuals, and third, mark those who adhere to the one-meter social distancing rule, as well as those who do not, with different colors. In this way, people are encouraged to abide by the social distancing guideline. Only those in the line for payments or entrance/exit at a mall, for example, should be required to keep the social distancing norm. The current work provides an alternative approach for identifying instances of social distancing among people captured in a specific region of a video frame. As a result, only a chosen area is processed, while the remaining areas are left out of the measurement, rather than the complete area visible in the video frame. The social distance between the persons in the image is found and tracked using object recognition techniques. Object recognition is the process of locating and classifying items in an image or video. The first step of this process is to draw a bounding box to separate the object of concern and, depending on the properties of the enclosed region, identify the object's class or type. Deep learning methods for object recognition have gained popularity recently. The three most popular deep learning-based object detection techniques are the Region-based Convolutional Neural Network (R-CNN) [20], the Single Shot Detector (SSD) [3], and You Only Look Once (YOLO) [32].

YOLO belongs to the family of single-stage object detection techniques that use convolutional neural networks in deep learning. The first version of the YOLO model was created by Redmon et al. [32]. In recent times, YOLO object detection models have been used to isolate humans and classify them [4]. To identify individuals and determine their spatial separation from one another, the YOLO v3 algorithm has been combined with the Deepsort technique. That study demonstrated the superiority of the Deepsort algorithm and YOLO v3 over other object detection methods. Deepsort methods are used to track recognized people with the support of bounding boxes and to assign each an ID [5]. Distances between all pairs of people in the frame are computed simultaneously to make sure all the cases in the input frame are covered. The model uses parameters such as confidence and threshold limits as the criteria for classification. The scope of the project ranges from a single-camera classification model, which can later be integrated into CCTV surveillance systems, to deployment on small resource-constrained systems [6]. Faster R-CNN, SSD, and YOLOv3 have been compared for object detection on Google Earth photos, the DOTA dataset [7], and remote sensing images [8] obtained from the GF-1 and GF-2 satellites. According to the results of this study, YOLOv3 has a higher mAP and FPS than the SSD and Faster R-CNN models. When YOLOv3 is compared with Mask R-CNN, YOLOv3 performs three times better than Mask R-CNN [9].

The authors of [10] employed the YOLOv4 algorithm to monitor social distance in low-light environments based on video frames captured by stationary cameras. YOLOv3 produces 95% accuracy when applied to frames captured from a top view to identify persons [11]. The social distance monitoring system in [12] for the COVID-19 period was constructed using the YOLOv4-tiny algorithm. The YOLOv5 algorithm was introduced in 2020 and has been successfully employed for counting applications in settings with sparse crowds. That work introduces the YOLO-PC variant to construct a people-counting system [13] and achieves higher accuracy. In general, the YOLO and SSD algorithms work better for social distance monitoring [14]. The study in [15] examines YOLOv3, YOLOv4, and YOLOv5, trained and tested on the MS COCO dataset. It was discovered that YOLOv5 performs better than YOLOv4 and YOLOv3 with respect to accuracy. When compared with YOLOv4 and YOLOv5, YOLOv3 had a faster detection speed, while YOLOv4 and YOLOv5 had the same speed. According to the latest findings [16], the YOLO v3 model exhibited the most impressive accuracy (87.07%) and mAP (89.91%). Conversely, the YOLO v5s model attained the maximum frame rate of up to 18.71. The authors also compared the superior model YOLOv3 and the faster model YOLOv5 with SSD and RetinaNet.

Moreover, YOLOv3 incurs higher computational costs to produce decent detection performance. To improve the frames per second (fps) of the object detection process, a more reliable method needs to be developed. This motivates the use of a multithreaded YOLO approach on a single graphics processing unit (GPU). The primary contribution is a multi-thread method using YOLOv3 to carry out real-time object recognition on large amounts of video feed, compared against single-thread approaches, together with comparative studies of other YOLOv3 variants. This approach uses the YOLOv3 model, which can form the basis for a variety of applications, not limited to social distancing during the COVID-19 pandemic, but extending to day-to-day settings such as maintaining distance at railway ticket counters and religious places.

The following list explains the flow of this proposed article:

  • It starts with a review of the social distance monitoring system using YOLO models,

  • Shows the system architectures of YOLOv3, Tiny-YOLOv3, and SPP-YOLO and tabulates their configuration settings,

  • Illustrates the flow of the proposed working model to depict the steps to detect social distance,

  • The various results and their comparative analysis graphs are shown in the results and discussion section.

  • Tabulates the latest comparative study on YOLOv3, v4, and v5 and discusses it to motivate the proposed work.

  • Tabulates the recent comparative efficiency analysis based on FPS, CPU usage, GPU power, memory usage, and energy consumption of YOLO models used in intelligent applications running on 5G edge devices.

  • Comprises the detection and classification results of single-thread YOLOv3, illustrated by single-thread serious violations, single-thread abnormal violations, and a single-thread moving camera source, followed by the YOLOv3 single-thread graphs of frames per second versus time and processing time versus time.

  • Similarly, single-thread Tiny-YOLOv3 and single-thread SPP-YOLOv3 are analyzed for static and moving outputs and compared.

  • Likewise, multi-thread YOLOv3, multi-thread Tiny-YOLOv3, and multi-thread SPP-YOLOv3 are demonstrated and analyzed.

  • Tabulates and visualizes the comparative analysis of existing YOLO models.

  • The article ends with the conclusion of YOLO models and future practices.

Henceforth, the proposed model is evaluated based on the number of processors, the threads used, and different sorts of inputs, including still images and moving video streams. Tiny-YOLOv3, SPP-YOLOv3, and YOLOv3 all perform well in terms of frames per second and processing time. The model backs up its claims with evidence based on a variety of measures and characteristics. This work also motivates the development of a mobile application with IoT- and CCTV-based security systems to monitor crowded places.

Literature Review

Social distancing monitoring is an application of object detection, commonly implemented with OpenCV. Earlier works in this domain face general problems such as noisy input, a high number of false positives, and redundancy of classified objects within a box. The research in [11, 17], and [18] focuses on regional convolutional neural networks and compares single-stage and two-stage detection. The proposed models improve multi-task joint optimization and multi-model information fusion for object prediction, detection, and monitoring. A similar approach uses a sliding window algorithm [19] for detection and tracking. CNN architectures are categorized into different techniques based on feature-map exploitation, spatial exploitation, width, depth, multi-path, channel boosting, and attention [6]. Some models use two convolutional layers to identify regions, pooling to resize feature maps, and then perform classification and regression [20].

The algorithm in [21] proposes a GPR hybrid model to enhance localization accuracy. Previous results are retrieved using rHOG [22] to determine stationary objects in the presence of occlusion and abrupt changes. Other methods [23] formulate crowd tracking as Binary Quadratic Programming, encoding each target's motion, spatial proximity, and grouping constraints, and resolve detection and data association jointly. Background modeling and subtraction generate the area of interest, and a matching score is proposed to obtain detection results [24]. This method is compared with generating variable-size bounding boxes with diverse orientations based on the radial distance in the image [25].

The object detection step is followed by calculating the distances between each pair of detected people. This approach to social distancing assessment marks in red the people who come closer than a reasonable cutoff. One approach involves sliding-window-based region proposals in phases, with detection, tracking, and monitoring performed one after the other [19]. Fixed pixels in the captured footage are also used for the calculations. Certain inaccuracies are caused by anomalies in social distance measurement based on the midpoint values of the detection boxes [4]. Faster R-CNN is used for unique, rapid, and precise object detection, while only YOLO can work with live video feeds [26]. Another approach uses a variety of CNNs that differ in feature-map exploitation, spatial exploitation, width, depth, multi-path, channel boosting, and attention. Some problems with CNNs are overfitting, more parameters than thin deep networks, high spatial and time complexity, and relying on residual information to define the weight of every channel [6]. An advanced YOLOv1 has a novel inception model configuration and a focused pooling pyramid layer with enhanced performance. A model that maps high-dimensional data to low-dimensional data inevitably introduces bias.

The classification categories are not compatible with each other, which also makes it inadequate [27]. Since the single shot multi-box detector (SSD) is based on the complete elimination of the stage that generates proposals, it is a straightforward approach that needs no object proposals. Additionally, it does away with the subsequent pixel and feature resampling phases. On the other hand, instead of utilizing anchor boxes to group features and evaluating another classifier, this model produces a value for every object class in every box, which requires a lot of computation [28]. Tiny SSD was created to make real-time embedded object detection easier. It consists of significantly enhanced layers made up of a collection of non-uniform SSD-based auxiliary convolutional feature layers and a non-uniform fire sub-network. The gap here is reflected in long-term performance: since the parameters are stored as half-precision floating-point numbers, the model size of the deep neural network is further reduced, which affects the accuracy of object detection [29]. Apart from models, some classifiers use computer vision to calculate the distance between each person in the frame and generate patterns to classify them as “Safe” or “Unsafe”, displaying the tags as the result of object detection and classification. Such a classifier can be integrated with surveillance systems. The drawback of this method is that YOLO detects a pedestrian as an object even when only parts of the body are visible, generating errors due to overlapping frames [3]. Since the launch of Darknet, YOLOv3 has been among the best enhancements made to an object detection system. Reviewers and other industry experts gave this improved version very favorable reviews. However, it has flaws of its own. YOLOv3 is still regarded as a mature model, although complexity analysis revealed problems and a lack of ideal solutions for the loss function. These were subsequently fixed in an improved version of the same model, which was utilized and evaluated for operational improvements [30].

Some systems have been designed to combine artificial intelligence concepts for tracing people, monitoring social distancing, and checking temperature. This method trains on data from thermal cameras and is implemented on a network of devices monitored from a central system. It obtains promising results for people detection while working with fewer resources, making it more efficient [31]. The system sends alerts about unmasked people to the respective authorities. The model produces feature maps and uses two convolutional layers to find the region; the pooling method is used to resize feature maps and to perform classification and regression [20]. Another approach treats object detection and discrimination as a regression problem, based on input sizes, anchor boxes, and extraction network functions. It differs in that less optimized boxes are used to calculate the final bounding box for each object, rather than selecting the single box with the highest confidence [32]. A template-matching algorithm is used to mark specific areas. The intersection-over-union ratio between the detected person and the area of interest is calculated to make a decision [33].

The YOLOv3-spp model is the most successful in detecting objects compared with YOLOv3 and YOLOv3-tiny. Apart from this, the models yield outcomes in terms of model evaluation and accuracy, and YOLOv3-spp proves more suitable, with great performance on terrestrial videos. To increase accuracy, the authors must work on the training and verification procedures for the dataset; in parallel, the most effective detection could be achieved with targeted research efforts [34]. A dense connection strengthens feature extraction and alleviates the vanishing gradient problem, and pooling and concatenating the features of the multi-scale local region enhances comprehensive learning. The model implements cross-entropy as an alternative to the mean squared error to represent the object classification loss [35]. Another common approach uses multiple branches with varied expansion ratios to create feature maps with diverse receptive fields. Branch outputs are combined to incorporate multi-scale data. The features are extracted from three different scales, and low-resolution features are combined using a top-down pathway and lateral connections [36]. Some models use a spatial pyramid pooling (SPP) component placed at the stem of the detector for size-tuning of the model values. Multi-scale local features are created after the pooling and concatenation of the SPP. In addition, in the DenseNet structure, many feature layers are joined to integrate additional feature information [37].

The two-stage detector approaches RCNN, Fast RCNN, and Faster RCNN are reviewed, while single-stage detectors such as YOLO v1, v2, v3, and SSD are also covered. Two-stage detectors focus on accuracy, while the most important property of single-stage detectors is speed. Although two-stage detectors afford sufficient accuracy, their computation time is high [38]. Derived from Tiny-YOLOv3, Tinier-YOLO achieves better detection accuracy and efficient real-time outcomes while reducing the model size. The SqueezeNet fire module is employed by inspecting the many fire modules along with their placements to reduce the number of model parameters and decrease the size of the model. Tinier-YOLO introduces dense connections to reinforce feature propagation and ensure maximum information flow in the network [39].

The latest introduction, YOLOv5, detects large, small, and tiny objects. It implements a multi-scale method to acquire deep discriminative feature representations at diverse scales and automatically determines the most appropriate scales for detecting objects in a frame. This modification reduces the number of multi-scale parameters compared with the original architecture, improving precision by a large margin [40]. Specifically created to reduce model size while maintaining object detection performance, Tiny-SSD is a single-shot detection deep CNN for present-day embedded object detection. It is composed of a highly optimized, non-uniform fire sub-network stack and a non-uniform stack of highly optimized SSD-based auxiliary convolutional feature layers. For Tiny-SSD, the parameters are stored in a half-precision floating-point format, which promotes deep neural network model size reductions while having an insignificant effect on object detection accuracy [41].

The comparison of real-time deep learning algorithms from the literature review concludes that YOLOv3 outperforms other deep learning algorithms such as Mask R-CNN, Faster R-CNN, and SSD in terms of accuracy and speed. According to the literature, YOLOv4 is likewise more accurate than YOLOv3; however, the reported accuracy of YOLOv4 versus YOLOv5 is still debatable, as some authors contend that YOLOv4 is superior to YOLOv5, while others contend the opposite. YOLOv3 works with greater detection speed than YOLOv4 and YOLOv5. Numerous factors, including the different datasets used and altered hyperparameters, might account for the disparate reported outcomes. These variations result from the specific use cases that other researchers have examined. Hence, the proposed work compares the YOLOv3 model and its variants on static and moving inputs.

Materials and Methods

YOLOv3 System Architecture

The YOLO architecture (Fig. 1) frames object detection as a regression problem over spatially separated bounding boxes and associated class probabilities. A neural network predicts bounding boxes, class probabilities, and confidences directly from complete images in a single evaluation. Since the complete detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Fig. 1 YOLO architecture

The YOLO detection network comprises 24 convolutional layers followed by 2 fully connected layers. Interspersed 1 × 1 convolutional layers reduce the feature space from the preceding layers. For ImageNet classification, the convolutional layers are pretrained at half the resolution (224 × 224 input images), and the resolution is then doubled for detection. The initial convolutional layers of the network extract features from the image, while the fully connected layers predict the output probabilities and coordinates. The fast version of YOLO is designed to push the boundaries of rapid object detection; it uses a neural network with fewer convolutional layers (9 rather than 24) and fewer filters in those layers. The network produces as its final output (Fig. 2) a 7 × 7 × 30 [42] tensor of predictions, corresponding to an S × S grid with S = 7, where each cell predicts B = 2 boxes (5 values each) and 20 class probabilities.

Fig. 2 YOLO test result

This design divides the image into an S × S grid. If the center of an object's bounding box falls within a grid cell, that cell is responsible for detecting the object. Each cell predicts bounding boxes together with confidence scores. Each confidence score reflects how certain the model is that the predicted box contains an object and how accurately the predicted box matches the ground truth. If an object is present in the image, the confidence score equals the IoU between the ground-truth and predicted boxes. The class probability is conditional on the presence of an object in the grid cell. Regardless of the number of boxes, each grid cell predicts only a single set of class probabilities.

Scores are generated by multiplying the conditional class probabilities and the box confidence predictions, producing class-specific confidence scores for every box. These scores denote both the probability of that class appearing in the box and how well the predicted box fits the object [43].
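As a rough numerical sketch (not the authors' code), these class-specific scores can be computed as the outer product of the box confidences and the conditional class probabilities; the values B = 2 boxes and C = 20 classes follow the original YOLO design.

```python
# Sketch: class-specific confidence scores for one grid cell,
# assuming B = 2 predicted boxes and C = 20 classes.
import numpy as np

box_confidence = np.array([0.8, 0.3])   # Pr(object) * IoU for each box
class_probs = np.random.rand(20)        # Pr(class_i | object), one cell
class_probs /= class_probs.sum()        # normalize to a distribution

# Outer product: a (B, C) matrix of class-specific confidence scores
class_specific_scores = box_confidence[:, None] * class_probs[None, :]
print(class_specific_scores.shape)      # (2, 20)
```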

Tiny-YOLOv3 Architecture

The Tiny-YOLOv3 architecture (Fig. 3) is a shortened variant of the existing YOLOv3 model. Tiny-YOLOv3 reduces the number of convolutional layers, using only 7 convolutional layers for feature extraction, and uses pooling layers with a stride of 2 to achieve dimensionality reduction, in place of the additional convolutional layers in YOLOv3 [38]. Compared with YOLOv3, the main advantages are that the network is simple, the computation is small, and it can run on resource-constrained devices. The disadvantage is low accuracy, in both candidate-frame generation and classification.
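As a minimal sketch (the layer sizes are illustrative, not taken from the paper's configuration tables), a stride-2 max-pooling layer performs the downsampling that YOLOv3 would otherwise do with strided convolutions:

```python
# Sketch: stride-2 max pooling halves the spatial resolution, replacing
# the strided convolutions used for downsampling in full YOLOv3.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 416, 416)             # hypothetical feature map
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)                         # torch.Size([1, 16, 208, 208])
```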

Fig. 3 Tiny-YOLOv3 network

Tiny-YOLOv3 uses the same training and loss function (Eq. 1) as YOLOv3, which is composed of the prediction frame position, prediction frame size, prediction class, and prediction confidence losses [39]. The loss is given by:

$$\frac{1}{n}\sum_{i = 1}^{n} {\text{loss}}_{xy} + \frac{1}{n}\sum_{i = 1}^{n} {\text{loss}}_{wh} + \frac{1}{n}\sum_{i = 1}^{n} {\text{loss}}_{{\text{class}}} + \frac{1}{n}\sum_{i = 1}^{n} {\text{loss}}_{{\text{confidence}}}$$
(1)

SPP-YOLO Architecture

This architecture (Table 1) presents an improved spatial pyramid pooling (SPP) module to extract the multi-scale local region features of objects, uses cross-entropy to represent the classification loss, and builds and trains the model to detect objects.

Table 1 YOLO-SPP architecture

The network (Fig. 4) consists of a multi-scale object detection block, a spatial pyramid pooling block with 3 max-pooling layers, a dense connection (DC) block with four dense units, and five stacked convolution-pooling blocks. The SPP block with three max-pooling layers is placed after the DC block so that local region features are extracted and merged by multi-scale pooling [35]. The spatial pyramid pooling approach improves the existing system by optimizing the connection structure of the backbone network and introducing multi-scale local region feature extraction, helping achieve object detection speed close to YOLOv2 and higher than conventional methods such as the De-convolutional Single Shot Detector (DSSD), the Scale-transferable Detection Network (STDN), and YOLOv3 [35].
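A minimal PyTorch sketch of such an SPP block is given below; the three kernel sizes (5, 9, 13) are common choices in YOLO-style SPP blocks and are assumed here rather than taken from Table 1.

```python
# Sketch of an SPP block with three max-pooling layers; stride 1 with
# symmetric padding keeps the spatial size, so the pooled maps can be
# concatenated with the input along the channel axis.
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in (5, 9, 13)                # assumed kernel sizes
        )

    def forward(self, x):
        # multi-scale pooling of local region features, then concatenation
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

features = torch.randn(1, 512, 13, 13)   # hypothetical backbone output
print(SPPBlock()(features).shape)        # torch.Size([1, 2048, 13, 13])
```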

Fig. 4 YOLO-SPP model

A new loss function is constructed to represent the object detection loss: the cross-entropy of object classification combined with the mean squared error of the coordinate regression. Using cross-entropy to describe the object classification loss resolves the vanishing-gradient issue and makes the trained model more robust.
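A hedged sketch of this combined loss is shown below; the tensor shapes are illustrative and do not reflect the paper's exact prediction head.

```python
# Sketch: cross-entropy for object classification plus mean squared
# error for coordinate regression, as described above.
import torch
import torch.nn.functional as F

pred_coords = torch.randn(8, 4)             # predicted (x, y, w, h) per box
true_coords = torch.randn(8, 4)             # ground-truth coordinates
pred_logits = torch.randn(8, 20)            # class logits, C = 20 classes
true_labels = torch.randint(0, 20, (8,))    # ground-truth class indices

coord_loss = F.mse_loss(pred_coords, true_coords)       # regression term
class_loss = F.cross_entropy(pred_logits, true_labels)  # classification term
total_loss = coord_loss + class_loss
print(float(total_loss))
```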

Implementation

The model flowchart (Fig. 5) begins with the input frame received by the model. The input frame is checked for the presence of a person. If there is a person in the frame, the model continues processing; otherwise, the frame is discarded. The frame is then transformed according to the view, making it suitable for generating the results. If more than two people are detected in the frame, the pairwise distances between all the people are calculated from the centroid coordinates of the bounding boxes drawn around them.

Fig. 5 Model flowchart

As the coordinates change with each incoming frame, the distance calculated between people is checked against MIN_DIST, the threshold distance between two people for maintaining social distancing. The color of the bounding box and circle is then determined by this check, as sketched below. The bounding box is colored green if the calculated distance is greater than the threshold value, yellow if the distance is greater than the permitted MIN_DIST but less than the threshold value, and red if it is less than MIN_DIST. Each red box also increments the counter of people violating the norms. After this, the loop continues to the next frame, updating the coordinates continuously. The model repeats this process as long as a person is present in the frame; the indices are updated according to the presence of people, and the process ends when the input frames are finished or stopped.
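A minimal sketch of this check follows; MIN_DIST matches the 70 px safe-distance threshold reported later, while WARN_DIST (the yellow warning threshold) and the example centroids are hypothetical.

```python
# Sketch: pairwise distance check and color assignment per person.
import numpy as np
from scipy.spatial.distance import cdist

MIN_DIST = 70     # serious-violation threshold (px)
WARN_DIST = 100   # hypothetical warning threshold (px)

# hypothetical bounding-box centroids produced by the detector
centroids = np.array([[120, 300], [160, 310], [420, 280]], dtype=float)
dists = cdist(centroids, centroids)   # pairwise Euclidean distances

colors = ["green"] * len(centroids)
violations = 0
for i in range(len(centroids)):
    for j in range(i + 1, len(centroids)):
        if dists[i, j] < MIN_DIST:          # red: serious violation
            colors[i] = colors[j] = "red"
            violations += 1
        elif dists[i, j] < WARN_DIST:       # yellow: close to the limit
            for k in (i, j):
                if colors[k] != "red":
                    colors[k] = "yellow"
print(colors, violations)                   # ['red', 'red', 'green'] 1
```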

Results and Discussion

The results and analysis of all the models are given with respect to two parameters, namely frames per second (FPS) and processing time (Eq. 2). Frames per second is the rate at which the classified output is displayed on the screen, and processing time is the difference between the current time and the time at which the last frame was processed. These two factors significantly influence the results of this real-time classification model.

$${\text{Processing time}} = {\text{Current time}} - {\text{Time at which the last frame was processed}}$$
$${\text{Frames per second}} = 1/\left( {\text{Processing time}} \right)$$
(2)
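A simple sketch of how Eq. 2 can be measured per frame is given below; process_frame is a hypothetical stand-in for the detection and classification work on one frame.

```python
# Sketch: measuring processing time and FPS per frame (Eq. 2).
import time

def process_frame(frame):
    time.sleep(0.05)   # stand-in for detection + classification

last = time.time()
for frame in range(3):                  # stand-in for a video-frame loop
    process_frame(frame)
    now = time.time()
    processing_time = now - last        # Eq. 2, first line
    fps = 1.0 / processing_time         # Eq. 2, second line
    last = now
    print(f"processing time: {processing_time:.3f} s, FPS: {fps:.2f}")
```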

According to the findings of [16], the properties of the YOLO v3, v4, and v5 models and the average performance outcomes on the 3 videos used for the comparative study are given in Table 2. In that implementation, complete IoU was used as the loss function, and ReLU and leaky ReLU as the activation functions in all models. With YOLO v3 and YOLO v5s, videos were scanned at average frame rates of 9.51 and 18.71, respectively. YOLO v3 obtained the best accuracy rate of 87.07% despite having the lowest FPS value. The test videos were processed by the YOLO v3 algorithm in 35.14 s on average, which suggests that it might have a speed-related performance problem. The YOLO v5s model was created to solve this issue: it scanned the test videos in 17.48 s on average at an FPS of 18.71, with an average accuracy of 73.70%. Concerning the mAP(0.5) metric, the YOLO v3 model attained a superior performance of 89.91% compared with YOLOv4 and YOLOv5. Unfortunately, the YOLO v4 results are much poorer than those of v3 and v5 in all comparisons. The YOLO v3 and v5s models are each superior with respect to different metrics among mAP, FPS, and accuracy, so a trade-off between the two models should be established based on the application. While v5s should be utilized to achieve near real-time performance (i.e., high speed), v3 should be used where high precision and mAP are needed.

Table 2 Comparison of YOLO v3, v4, and v5 models
Table 3 Performance of YOLO models on Jetson Nano while resizing window [44]

In 5G intelligent solutions, YOLO models are very prevalent on accelerator-based single-board computers such as the NVIDIA Jetson Nano, Jetson Xavier NX, and Raspberry Pi 4B with the Intel Neural Compute Stick 2. These deliver good performance with minimal power consumption for object detection in AI-based applications. Table 3 notes the performance of YOLOv3 and YOLO-tiny when executed on a Jetson Nano to detect objects in two videos. If the window size (S) in the model is small, a high FPS value is reached; that is, increasing frames per second is proportional to increasing the CPU usage needed to process the frames. The same scenario applied to GPU-powered systems has only a small impact on usage, so the choice between a CPU- and GPU-powered system depends on the level of the task. Likewise, the energy consumption is also proportional to the S value. This suggests paying more attention to the trade-off between inference accuracy and speed when running intelligent AI applications on edge devices [44]. Henceforth, the proposed work employs the superior model YOLOv3 with different processing techniques, namely single-thread and multi-thread, and compares it with Tiny-YOLOv3, SPP-YOLOv3, and SSD.

Detection and Classification Results

Single-Thread YOLOv3

Single-thread detection and classification give smooth output, as more of the frames are visible to the user, improving visibility. The boxes marked green (Fig. 6) indicate people at a safe distance, while the red boxes mark pairs of people who are too close to each other. The output frame also displays the number of serious violations (here, 4), the abnormal violations, the threshold limit, and the safe distance threshold, i.e., 70 px.

Fig. 6 Single thread serious violations

The current test video has a stationary camera, so more frames are processed in the classification. Yellow boxes mark people in the frame (Fig. 7) who are within the warning range and about to violate the norm. For a stationary camera and a single thread, the FPS obtained is 0.628.

Fig. 7 Single thread abnormal violations

In this footage (Fig. 8), the input camera is moving, so more frames are received and need to be processed. This leads to a slight delay in processing because both the source and the people in the frame are moving, but the classification remains consistent, demonstrating the model's robustness. The FPS obtained for this test is 0.610.

Fig. 8 Single thread moving camera source

The graph (Fig. 9) shows single-thread FPS with respect to time. FPS increases with time, indicating that processing speed increases as the run progresses.

Fig. 9 YOLOv3 single thread graph

Single-Thread Tiny-YOLOv3

The figure (Fig. 10) shows the result of classification and detection with Tiny-YOLOv3 using a single thread and a static camera.

Fig. 10 Single thread Tiny-YOLOv3 static output

In this figure (Fig. 11), the source of input is moving, which results in a slight delay in processing. It is implemented using the Tiny-YOLOv3 architecture on a single thread.

Fig. 11 Single thread Tiny-YOLOv3 moving output

The graph (Fig. 12) presents the behavior of Tiny-YOLOv3 over time on both metrics, frames per second (FPS) and processing time.

Fig. 12 Tiny-YOLOv3 single thread graph

Single-Thread SPP-YOLOv3

The figure (Fig. 13) shows the result of classification and detection with SPP-YOLOv3 working on a single thread with static camera input.

Fig. 13 Single thread SPP-YOLOv3 static output

The result of classification and detection with SPP-YOLOv3 working on a single thread with moving camera input is illustrated in Fig. 14. SPP-YOLOv3 is based on the spatial pyramid pooling architecture and works by adding spatial pooling layers before feeding the convolutional layers, which results in increased classification efficiency.

Fig. 14 Single thread SPP-YOLOv3 moving output

The results of detection and classification are summarized in the following graph (Fig. 15). The FPS values range from 1 to 1.5, while the processing time falls from 1.0 to 0.7.

Fig. 15 SPP-YOLOv3 single thread graph

Multi-Thread YOLOv3

The multi-threading model is implemented to get quicker results and use the available processing power more efficiently (Fig. 16). As the frame processing computations increase, more people are classified in quick succession, leading to better results. This test video is captured from a static camera. The FPS obtained for this test video is 0.3559. A sketch of the multi-threaded pipeline is given below.
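The following is a hedged sketch (not the authors' implementation) of such a multi-threaded pipeline: a reader thread feeds frames into a shared queue while worker threads run detection; detect is a placeholder for the YOLOv3 forward pass.

```python
# Sketch: producer-consumer frame processing with Python threads.
import queue
import threading

frames = queue.Queue(maxsize=16)
results = queue.Queue()

def detect(frame):
    return f"detections for frame {frame}"   # stand-in for YOLOv3 inference

def reader():
    for i in range(8):                       # stand-in for video-capture reads
        frames.put(i)
    frames.put(None)                         # sentinel: end of stream

def worker():
    while True:
        frame = frames.get()
        if frame is None:
            frames.put(None)                 # propagate sentinel to peers
            break
        results.put(detect(frame))

threads = [threading.Thread(target=reader)] + [
    threading.Thread(target=worker) for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.qsize(), "frames processed")   # 8 frames processed
```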

Fig. 16 Multi-thread serious violations YOLOv3 static

The test video with a moving frame in multi-thread performs comparably to the single-threaded test video. This test video (Fig. 17) also attests that, as the source of the input frame moves, more people move relatively and hence more frames are generated, yet a comparable amount of processing is still done. The FPS obtained for this case is 0.4129.

Fig. 17 Multi-thread YOLOv3 moving input

The YOLOv3 multi-thread output graph (Fig. 18)  shows the result of frames per second (FPS) and processing time with respect to time.

Fig. 18 YOLOv3 multi-thread output graph

Multi-Thread Tiny-YOLOv3

Tiny-YOLOv3 with multi-threading performs the fastest, as it is a lightweight architecture. The result of detection and classification can be seen in the image (Fig. 19).

Fig. 19 Multi-thread Tiny-YOLOv3 static input

Tiny-YOLOv3 also shows the best results with moving camera sources (Fig. 20), as more classifications can be processed easily with multi-threading and the small architecture. Tiny-YOLOv3 performs best for classification, boosted by multi-tasking, which makes it more stable and efficient [38]. Multi-threading enables efficient resource distribution.

Fig. 20 Multi-thread Tiny-YOLOv3 moving input

The graph (Fig. 21) visualizes the results of Tiny-YOLOv3 using multiple threads.

Fig. 21 Tiny-YOLOv3 multi-thread output graph

Multi-Thread SPP-YOLOv3

SPP-YOLOv3 performs detection and classification with ease, due to the SPP module attached at the beginning of the convolutional layers. The image (Fig. 22) is the result of classification with a multi-thread system and static camera input.

Fig. 22 Multi-thread SPP-YOLOv3 static input

The SPP-YOLOv3 architecture performs well when using multiple threads with a moving camera input. Fig. 23 presents the output of SPP-YOLOv3 with a moving source of input.

Fig. 23 Multi-thread SPP-YOLOv3 moving input

The graph (Fig. 24) shows a linear increase in FPS and a corresponding decrease in processing time, suggesting that the processing time per frame reduces with time, making the model more efficient.

Fig. 24 SPP-YOLOv3 multi-thread output graph

Comparative Analysis

The models are tested on an Intel Core i5 10th-generation processor with integrated GPU, trained and weighted on the MS COCO dataset. The models are compared with various existing state-of-the-art models.

YOLOv3 executes single-thread and multi-thread programs efficiently (Fig. 25). The difference in speeds can be observed in the following graph. The single thread achieves comparatively higher FPS than multi-thread, underlining that the processing time for a single thread is lower than for multi-thread. The FPS ranges from 1.5 to 1.8 for a single thread and from 0.3 to 0.6 for multi-thread (Table 4).

Fig. 25 YOLOv3 comparison

Table 4 Comparison with existing models

Tiny-YOLOv3 has the fastest processing speed among all the YOLOv3 models tested (Fig. 26). Its execution on single and multiple threads is the most efficient compared with the rest. The reduced architecture and parameter count make it feasible for deployment on mobile and embedded devices. As observed from the graph, Tiny-YOLOv3 FPS ranges from 0.2 to 0.6 with multi-threading, compared with 1.0 to 1.4 on a single thread.

Fig. 26 Tiny-YOLOv3 comparisons

SPP-YOLOv3 uses 3 different max-pool sizes on the same feature maps. The model also collects a feature map called the “fixed-length representation” before feeding the fully connected layers. The FPS in single-threaded mode ranges from 1 to 1.4, compared with 0.4 to 0.6 in multi-thread (Fig. 27).

Fig. 27 SPP-YOLOv3 comparisons

SSD and YOLOv3 are both single-stage detectors and perform object detection and classification on the given input frame. Fundamental differences in their parameters, output convolutional layers, receptive fields, aspect ratios, and the grid sizes used to calculate confidence values lead to the difference in results [41]. The FPS of YOLOv3 is observed to be higher than that of SSD at all points in time in the graphs (Fig. 28).

Fig. 28 SSD and YOLOv3 comparison

YOLOv3 is a single-stage detection architecture. On a single thread, all the YOLOv3 models perform comparably to each other. Tiny-YOLOv3 is the best architecture for object detection and classification, as it has the highest FPS on a single thread, followed by YOLOv3 and SPP-YOLOv3, as observed in the graphs (Fig. 29).

Fig. 29 YOLOv3 single thread comparison

The multi-thread models consume more than one thread, making the detection and classification process parallel. This results in more processing time per input frame, as the detections and results have to be rearranged into the final output frame. Tiny-YOLOv3 again beats the other architectures; YOLOv3 and SPP-YOLOv3 have comparatively lower FPS, and hence higher processing time, than Tiny-YOLOv3 [40], as observed from the graph (see Fig. 30).

Fig. 30 YOLOv3 multi-thread comparison

Conclusion

With the waves of coronavirus, the world has adopted strict rules to avoid close gatherings of people in places such as restaurants, industries, transportation, real estate, and agriculture. Hence, the proposed model can help ensure social distancing in public places using object detection and classification. The proposed model is evaluated based on the number of processors, the threads used, and different sorts of inputs, including still images and moving video streams. The model also has the option of multi-threading for more efficient results, using the GPU to reduce the computational overhead. The alert system is handy for personnel monitoring many video streams at a time, as alerts are generated once the threshold limit is reached. Apart from this, the FPS metric portrays the performance of each model on different static and moving video inputs.

As the result of the comparative study, Tiny-YOLOv3 performs with the best frames per second and the least processing time, followed by SPP-YOLOv3 and YOLOv3. The model backs up its claims with various parameters and metrics, demonstrating robust operation. The proposed work also tabulated the reasons for choosing YOLOv3 over the newer YOLOv4 and YOLOv5 models according to the latest research, and strongly recommends adopting YOLOv3 to achieve highly accurate results in a person detection system.

For future work, this model can be implemented inside mobile applications and other resource-constrained IoT devices. Integration with CCTV cameras can provide an easy surveillance solution and make monitoring easier for security authorities. Applications of this model can moreover be used at crowded places such as railway station ticket booking centers, huge public gatherings, and religious ceremonies.