1 Introduction

The population of India has increased significantly in recent years, and with it the number of registered vehicles and, consequently, the road accident rate. Enforcing traffic rules in the Indian context is essential: India recorded 155,622 road accident deaths in 2022. As per the latest report, India had over 21 crore two-wheelers registered as of August 2022. Two-wheelers have the highest share of the vehicular population at approximately 74.7%, followed by cars, jeeps and taxis at around 13.4%, other vehicles at 6.9%, goods vehicles at 4.4% and buses at 0.7%. Fatalities have increased by 16.8%, raising the number of deaths per thousand vehicles from 0.45 to 0.53 in 2022. An analysis of the causes shows that overspeeding is the main cause of road accidents, accounting for 2,40,828 of 4,03,116 accidents (59.7%) and causing 87,050 deaths and 2,28,274 injuries. The impact of the other causes is as follows: dangerous or careless driving and overtaking caused 1,03,629 road accidents (25.7% of the total), with 42,853 casualties and up to 91,893 injuries; poor weather conditions were the reason for merely 2.8% of cases (11,110 out of 4,03,116); and driving under the influence of drugs or alcohol caused 1.9% of the injuries, i.e., 7,235 injuries and 2,935 casualties (https://www.financialexpress.com/express-mobility/as-road-accidents-continue-to-rise-a-look-at-the-key-safety-norms-mandatory-in-india-now/2932079/; https://morth.nic.in/sites/default/files/RA_2021_Compressed.pdf; https://www.thehindubusinessline.com/data-stories/data-focus/overspeeding-accounts-for-60-of-the-road-accidents-in-india/article65852932.ece).

The day-to-day increase in traffic rule violations by two-wheeler riders makes it difficult for a manual traffic monitoring system to ensure safety and rule obedience. Hence, there is a need to automate the continuous monitoring and surveillance system (Kaffash et al. 2021; Chen et al. 2021; Sun and Boukerche 2021). The intelligent transportation system (ITS) can aid in this problem. An ITS is an autonomous surveillance system combining several technologies, such as edge-enabled cameras, sensors, communication, electronics, etc. The ITS provides a fast, intelligent and efficient traffic monitoring solution based on surveillance cameras mounted on the roads to detect traffic rule violations such as traffic signal violations, helmet violations, triple riding, etc.

This paper proposes an automated deep learning-based system to detect two-wheeler riders who violate the traffic rule against triple riding (Goyal et al. 2022). The YOLOv8 model is used to classify two-wheeler riders into violator and non-violator classes. A self-generated dataset collected through the surveillance system is used to evaluate the proposed model's efficiency. The results show that the proposed system achieves an accuracy of 94% for triple rider detection.

The main contributions of the work are summarised as follows:

  a. An automated system is designed to detect two-wheeler traffic violations in the case of triple riding. Edge-enabled surveillance cameras were deployed at essential cross-points inside the university campus.

  b. The YOLO v8 model is used to detect two-wheelers and persons. The results show that the YOLO v8 model effectively detects multiple and closely spaced objects in a single image, and it can also detect overlapping objects with greater accuracy and in less time.

  c. The proposed system can also be deployed for real-time surveillance.

The paper is structured as follows: object detection and recognition approaches in Sect. 2, recent work in Sect. 3, the proposed classification system in Sect. 4, performance evaluation in Sect. 5 and limitations of the proposed method in Sect. 6; Sect. 7 concludes the paper and outlines the future scope.

2 Object detection and recognition approaches

Object detection is a sub-task of computer vision that involves identifying and locating objects in videos or images. It can be applied in numerous applications such as robotics, surveillance and autonomous cars. Object class detection is usually based on a set of features, i.e., every object of a particular class has certain features based on which the object is classified. For example, to distinguish animals from birds, we can use the presence or absence of wings. Object detection methods are generally divided into two categories: machine learning-based and deep learning-based. The machine learning approach takes the defining features as input; different classification techniques (SVM, etc.) are then applied to classify the objects. In contrast, deep learning-based methods provide a single-step solution for object detection and categorisation using convolutional neural networks. Deep learning-based object detection algorithms can be broadly categorised into two types based on the number of times the input image is passed through the network: single-stage and two-stage detectors (Lohia et al. 2021). The different methods under these approaches are shown in Fig. 1.

Fig. 1

Classification of object detection algorithms

The two-stage detection approach uses two passes over an input image to predict the presence and localisation of objects. In the first pass, initial predictions are made regarding an object's existence and localisation; in the second pass, these predictions are refined into final predictions. Examples of two-stage detection models are RCNN, Fast RCNN, Faster RCNN, RFCN and Mask RCNN (Girshick et al. 2014; Dai et al. 2016; Ren et al. 2015; Girshick 2015; Liu et al. 2016). These algorithms are also known as region proposal-based approaches because they build a boundary box around the object of interest to identify its location inside the image.

Single-stage detection approaches pass the input image through the network only once to predict the presence and localisation of objects. Processing the input image in a single pass is computationally efficient. The limitation of single-stage detection algorithms is that they are less accurate in predicting objects; they are mainly suitable for real-time surveillance in resource-limited environments. YOLO is a single-stage detection system that processes an input image using a fully convolutional neural network (CNN). Another single-stage object detection strategy is the Single Shot MultiBox Detector (SSD) (He et al. 2017). This method employs a single forward pass for the object localisation and classification tasks, and a multi-box approach is utilised for boundary box regression. The detector network is divided into two phases: feature map extraction and object detection using convolutional filters. SSD extracts features using VGG 16. The selection of an object detection method thus depends on the type of application for which it is implemented. Many optimisation techniques using machine learning have been proposed in the literature (Xiao et al. 2021a; Xing et al. 2022; Xiao et al. 2021b). In (Xiao et al. 2021a), authors have proposed a federated learning-based model for recognising human activity. In (Xing et al. 2022; Xiao et al. 2021b), authors have proposed classification approaches for time series data.

2.1 You Only Look Once (YOLO) models for object detection and recognition

The proposed model is implemented for real-time surveillance of two-wheeler traffic, so we use a YOLO-based detection and prediction model. You Only Look Once (YOLO) is a neural network-based model that simultaneously predicts boundary boxes and class probabilities. YOLO differs from previous classification-based approaches to object detection. The Faster RCNN approach works by identifying regions of interest through a Region Proposal Network and iterating over these regions to detect objects. In contrast, YOLO identifies and detects objects in a single step through a single convolutional neural network.

2.2 Working of YOLO model

The YOLO model (Xing et al. 2022) is a real-time object identification model that recognises objects in live videos or photos. Joseph Redmon and Ali Farhadi created the first version of the YOLO model; an updated version, YOLOv3, was released in 2018. The YOLO model is made up of a fully convolutional neural network followed by a post-processing step. The model's operation begins with the extraction of a single frame from live video. The frame is then scaled to 416×416 in the following phase. The scaled image is then fed into a deep convolutional neural network, which detects the objects. Figure 2 depicts the YOLO model's backbone design. The detection network has 24 convolutional layers, followed by two fully connected layers. An alternating sequence of 1×1 convolutional layers is used to reduce the feature space from the preceding layers. The first 20 convolutional layers are pre-trained on ImageNet, including temporary average pooling and fully connected layers. This augmented model is used for detection, and the findings show that the augmentation improves detection accuracy. The YOLO model's last fully connected layer predicts the probability of an object belonging to a specific class as well as the coordinates of the boundary boxes. The YOLO model partitions the entire image using an S×S grid. The grid cell in which an object's centre is located is in charge of detecting that object. For each cell, the model predicts B boundary boxes and their confidence scores. Each cell's prediction is thus a C + B×5 vector, where C is the number of object classes and B is the number of boundary boxes per cell. The factor of five arises because each boundary box prediction consists of the following information:

Fig. 2

Backbone architecture of YOLO Model (Xing et al. 2022)

[Box centre x-offset (Bx), Box centre y-offset (By), Box width (Bw), Box height (Bh), Object Score (S)].

Since the grid size is S×S, the predicted tensor will be of size S×S×(C + B×5).
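
As a concrete example, using the configuration of the original YOLO paper (S = 7, B = 2, C = 20; these values are for illustration, not the settings of the proposed system), the output tensor size can be computed as follows:

```python
# Illustrative computation of the YOLO output tensor size.
S = 7   # the image is divided into an S x S grid
B = 2   # boundary boxes predicted per grid cell
C = 20  # number of object classes

# Each grid cell predicts C class probabilities plus 5 values per box:
# (Bx, By, Bw, Bh, Object Score S).
values_per_cell = C + B * 5            # 30
tensor_size = S * S * values_per_cell  # 7 * 7 * 30 = 1470

print(f"Prediction tensor: {S}x{S}x{values_per_cell} = {tensor_size} values")
```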

The boundary box with the highest Intersection over Union (IoU) with respect to the ground truth is selected as the predictor of that object. The YOLO model uses Non-Maximum Suppression (NMS) as a post-processing technique to discard redundant or incorrect boundary boxes and give a single boundary box as output.
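
To make the post-processing step concrete, the following is a minimal sketch of NMS, assuming boxes in [x1, y1, x2, y2] form, each with a confidence score; the helper names are illustrative rather than taken from any particular library:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box; drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of the retained boundary boxes
```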

Several versions of the YOLO model have been released since its inception in 2015, each outperforming the previous one in some way (Xing et al. 2022; Xiao et al. 2021b; https://www.v7labs.com/blog/yolo-object-detection; Redmon et al. 2016; Redmon and Farhadi 2016; Redmon and Farhadi 2018; Bochkovskiy et al. 2020). The timeline of the introduction of the different versions of the YOLO model is shown in Fig. 3:

Fig. 3

Timeline of the inception of YOLO models (Xiao et al. 2021a)

The YOLO v2 model (Xiao et al. 2021b) improves the accuracy of the base YOLO model by incorporating the concept of anchor boxes: pre-defined boundary boxes of varying shapes and sizes. The use of anchor boxes improves detection efficiency. YOLO v2 also adopted a new backbone network, Darknet-19, with 19 convolutional layers, yielding better accuracy in object detection tasks.

The YOLO v3 model (https://www.v7labs.com/blog/yolo-object-detection) improves multi-label classification by replacing the softmax function with independent logistic classifiers for determining the labels of a specific object. This version also updates the loss function to binary cross-entropy instead of squared mean error. Another improvement in this version was the inclusion of a new backbone network for object detection, Darknet-53, consisting of 53 layers. It also used the concept of a feature pyramid network, which builds a pyramid of feature maps to detect objects at different scales.
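
The practical effect of replacing softmax with independent logistic outputs can be seen in a small sketch: softmax forces the class scores to compete for a single label, while per-class sigmoids judge each label independently, enabling multi-label prediction. The logit values below are arbitrary, for illustration only:

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])  # raw scores for three hypothetical classes

softmax = np.exp(logits) / np.exp(logits).sum()  # sums to 1: one label wins
sigmoid = 1.0 / (1.0 + np.exp(-logits))          # each class judged on its own

print(softmax)  # approx. [0.62, 0.38, 0.004] -- mutually exclusive labels
print(sigmoid)  # approx. [0.88, 0.82, 0.05]  -- two labels can both be "on"
```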

The YOLO v4 model (Redmon et al. 2016) was based on a new CNN architecture, the Cross Stage Partial Network (CSPNet), a variation of the ResNet architecture comprising up to 53 layers. CSPNet is an enhanced version of ResNet explicitly designed for object detection tasks.

The next version of the YOLO model is YOLO v5 (Redmon and Farhadi 2016), which comprises a denser backbone network architecture. The training of YOLO v5 differs from previous versions because it can now be trained on objects belonging to many classes. Another improvement in this model was the introduction of spatial pyramid pooling, which allows objects of different shapes to be detected more accurately.

The next version of the YOLO model is YOLO v6, based on a new CNN architecture, EfficientNet-L2. The results show that this version is computationally more efficient and accurate than previous versions.

The next model in the timeline of YOLO models is YOLO v7 (Redmon and Farhadi 2018). In this model, seven anchor boxes are used to detect objects of different shapes and aspect ratios. The second improvement in this model is a new loss function, focal loss, which is used to detect smaller objects efficiently. Compared with the earlier versions, this model is the most efficient in terms of accuracy, speed and resolution.

The latest version in this category is YOLO v8 (Bochkovskiy et al. 2020). The reasons why this version outperforms its predecessors are as follows:

  a. It provides greater accuracy and flexibility in object detection tasks, as measured on the well-known COCO dataset.

  b. It has several developer-friendly features, including a command line interface for model training, detection and prediction, and a structured Python library framework.

  c. It is an anchor-free model, which means it directly predicts the centre of an object instead of predicting offsets relative to anchor boxes, making detection more efficient.

3 Related work

Decision-making, pose detection, autonomous driving, path detection and remote sensing are all applications of digital image processing and computer vision (Jocher 2020; Wang et al. 2022). Current research in this area focuses on deep learning models, which have demonstrated considerable success in various application domains (https://github.com/ultralytics/ultralytics). Several deep learning-based algorithms for detecting helmet violations by bike riders have been proposed in the literature.

In (Ul Haq et al. 2022), authors have proposed a CNN-based helmet violation detection system. The model was trained on a dataset of 493 images. The detection system used four CNN-based models: MobileNet, GoogLeNet, VGG 16 and VGG 19. The results demonstrate the superiority of GoogLeNet, with an accuracy of 85%.

In (Mehmood et al. 2022), authors have proposed a helmet rule violation detection system, which can also capture the license plate of the two-wheeler. The proposed system used HOG features along with CNN models. The simulation results show that the proposed approach achieves 95% accuracy.

In (Boonsirisumpun et al. 2018), authors have used two YOLO-based models, YOLOv3 and YOLO-dense, for helmet violation detection. The performance of the proposed system was evaluated using self-generated and freely available datasets. The simulation results show that the proposed models achieve an mAP of 95% and 98%, respectively.

In (Yang 2022), authors have used the deep learning model RetinaNet for helmet violation detection. The proposed model was trained on a self-generated dataset and achieved an accuracy of approximately 73%.

In (Siebert and Lin 2020), authors have used the deep learning model based on Faster R-CNN to detect bikers without helmets. The proposed model achieves an accuracy of 97%.

In (Raj et al. 2018a), YOLOv2 has been used to identify riders without helmets.

In (Afzal et al. 2021), authors have used the YOLO v3 model to detect motorcycles and number plates. The proposed system consists of two deep learning models for these tasks and achieves 89% and 92% accuracy for motorcycle and number plate detection, respectively.

In (Wu et al. 2019), authors have proposed a CNN-based model to perform multiple tasks like vehicle classification, helmet detection and mask detection.

In (Kathane et al. 2022), a deep transfer learning-based algorithm for vehicle detection is proposed for the KITTI dataset. The results show that the proposed system achieves better efficiency than the state-of-the-art approaches.

In (Rajalakshmi and Saravanan 2022), different deep learning-based approaches were proposed for helmet/no helmet detection. The results show that the proposed approach outperforms the existing approaches and achieves an accuracy of up to 90%.

In (Sridhar et al. 2022), authors have proposed license plate detection and recognition in a constrained environment where multiple license plates are detected.

In (Raj et al. 2018b), authors have proposed a Raspberry Pi and webcam-based Open Automatic License Plate Recognition system, which captures images of license plates from the front and the back. The system can process almost ten frames per second. The OpenCV package for Python is used to detect the license plate, and the Tesseract Optical Character Recognition engine is used to read the characters of the license plate. The limitation of this system is that it can detect only one license plate at a time, and the model is trained only on the license plates of Indian vehicles.

In (Wang et al. 2018), a comprehensive survey of vehicle and pedestrian detection is presented. The survey discusses the different approaches used in the literature for detecting vehicles and pedestrians and concludes that deep learning has proven the most accurate for predicting and detecting traffic objects. In (Desai and Bartakke 2019), authors have used the deep learning models RetinaNet and ResNet 50 to detect motorcycle users without helmets. However, a limitation of the proposed model was the effect of environmental conditions and camera angles on frame collection and image annotation.

The authors of Silva and Jung (2018) proposed an InceptionV3 model for recognising bike riders without helmets. According to the simulation findings, the suggested method achieves an accuracy of 81% for head/helmet classification; for the validation set, the findings reveal an accuracy of 74%. The authors of Siebert and Lin (2019) suggested a curriculum-based learning system for identifying, detecting and counting triple riders and helmet violations in unconstrained road scenarios. An amodal regressor approach is proposed for generating boundary boxes even for occluded riders in the data preprocessing phase. The suggested model has an accuracy of approximately 86% for motorbike and rider detection and 90% for helmet/no-helmet detection.

In (Rohith et al. 2019), authors have proposed a low-cost IoT-based smart parking solution to address the parking problem in extremely crowded cities. Implementing the proposed system effectively reduces the time spent finding and reserving parking slots.

In (Yang et al. 2018), authors have proposed a blockchain-enabled secure federated learning vehicular network system to predict traffic flow in urban areas. Extensive simulation results on the MNIST dataset show that the proposed approach achieves 93% accuracy.

The authors of Goyal et al. (2022) suggested a deep dual neural network with phrase structure and an attention mechanism for sentiment analysis of Chinese text. First, a Chinese short financial text corpus is built. In the second phase, several ablation tests were carried out employing five techniques: Pinyin, segmentation, lexical analysis, phrase structure and attention mechanism. The phrase structure and attention mechanism produce the best effects. The findings suggest that the proposed method outperforms the existing methods.

In (Gopal et al. 2019), authors have compared different machine learning algorithms for big data analytics. The proposed approach consists of multiple phases: in the first phase, data is collected from different social networking sites such as Twitter, Facebook, etc., using APIs; in the second phase, different machine learning algorithms (supervised, unsupervised and reinforcement learning) are implemented and compared.

In (Thirunnavukkarasan et al. 2023), authors have proposed an intelligent anomaly detection framework for detecting anomalous activities in cyber-physical systems. The system consists of two processes: the first is the preprocessing of data through transformation and filtering operations; in the second, a Gaussian Mixture Model (GMM) with a Kalman filter-based deep CNN model is used to detect anomalous activities in cyber-physical systems.

The authors of Rao et al. (2021) presented an ambient intelligence technique for intrusion detection in a smart home setting. The suggested approach is divided into two phases: the first is the learning phase, which employs a reinforcement-based learning algorithm; in the second phase, Deep Q Networks are used for identification and classification. The approach's performance is examined on four publicly available datasets, and the findings reveal that it outperforms previous approaches in terms of accuracy, precision and recall.

The related works in the field of traffic surveillance are summarised in Table 1.

Table 1 Comparison of related work

Based on the literature review, the following research gaps are identified:

  i. Tremendous work has been done on the detection of vehicles and riders and on the detection of riders with/without helmets, but other rule violations should also be addressed, for example, the detection of triple riding and overspeeding vehicles.

  ii. There is a lack of suitable datasets for determining which vehicles violate the traffic rules.

  iii. More work is needed on real-time video surveillance.

  iv. The quality of the dataset also hinders the detection of traffic rule violations.

  v. The objects in the images may be occluded, so the detection accuracy may be low.

To address these issues, we propose an automatic surveillance system that detects two-wheelers carrying three riders. Due to the lack of an existing dataset for this task, a self-generated dataset is used to implement the proposed system. The recent YOLO v8 model is used to train the proposed model, and riders who violate the traffic rule are detected. The results demonstrate the effectiveness of the proposed method.

4 Proposed system

The complete model of the proposed system is shown in Fig. 4. It consists of three sub-systems: (1) identification of the vehicle and rider, (2) classification of the two-wheeler as violator or non-violator, and (3) automatic number plate recognition.

Fig. 4

Flow Chart of the Proposed Approach

The traffic rule violation detection system comprises the first six steps, and the remaining steps are used in the license plate detection and recognition system.

Steps

  1. Video capturing: The first step in the proposed system is to capture real-time video. For this purpose, edge-enabled surveillance cameras are placed at the intersection points at the main gate and the basement parking.

  2. Frame extraction: The second step is extracting frames from the recorded video, which are provided as input to the deep learning model. Frames are captured from the recorded videos at 10 frames per second using the OpenCV package for Python (a condensed code sketch of this capture-and-detection pipeline is given at the end of this section).

  3. Frame preprocessing: After the frames have been extracted from the video stream, preprocessing is done to remove redundant and irrelevant frames. Out of 650 images, 622 were considered relevant.

  4. Image annotation: The next step is to annotate the images with the different classes that we want to detect or track. A labelling package is used to annotate the images, labelling the different objects in each image. In our system, we are interested in three classes of objects: motorcycles, triple riders and license plates. After annotation, the information about each image is stored in YOLO format, a text document with the following fields: [Box centre x-offset (Bx), Box centre y-offset (By), Box width (Bw), Box height (Bh), Object Score (S), Probability of class A (Pa), Probability of class B (Pb), ..., Probability of class N (Pn)].

  5. Model training: After the dataset has been prepared, the classification model is built using YOLOv8 to classify the two-wheeler as either violator or non-violator.

  6. Detection and classification phase: The test set images are used to determine the model's efficiency in this phase. The best weights obtained from the training process are used for detection. The trained model detects objects of three classes: motorcycle, triple rider and license plate. If a two-wheeler carries triple riders, a red boundary box is drawn around them; otherwise, a green boundary box is drawn around the rider and the motorcycle. The detailed process of the detection phase is shown in Fig. 5.

Fig. 5

Detection Phase

  a. Non-max suppression for multiple object detection

The non-max suppression algorithm is used to detect individual objects of the same kind.

  b. Depth estimation algorithm

The depth estimation algorithm is used to distinguish objects that are near the camera from those that are far away. Applying the depth estimation algorithm generates a disparity map, and the disparity values of the pixels determine the objects' depth in the image. The disparity value for a pixel is calculated using Eq. (1):

$$ d = w*s $$
(1)

where d, w and s represent the disparity values, width and scaling factor, respectively.

The depth of the object is calculated using Eq. (2):

$$ \text{Depth} = (\text{focal} * \text{baseline}) / d $$
(2)

where focal represents the distance to the image plane (the focal length) and baseline represents the difference between the camera positions.
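
To make Eqs. (1) and (2) concrete, a toy sketch is given below; the numeric values are illustrative placeholders, not calibration data from the deployed cameras:

```python
def disparity(w, s):
    """Eq. (1): disparity d = w * s."""
    return w * s

def depth(focal, baseline, d):
    """Eq. (2): depth = (focal * baseline) / d."""
    return (focal * baseline) / d

d = disparity(w=64, s=0.5)                     # hypothetical width and scaling factor
print(depth(focal=700.0, baseline=0.12, d=d))  # depth for the example values
```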

  7. Automatic number plate recognition: The two-wheeler's license plate is captured and analysed to record the details of the traffic rule violation. The license plate region is cropped from the input image, and the connectionist temporal classification (CTC) technique is then used to read the text from the cropped license plate image. Figure 6 depicts the process of the CTC algorithm. The CTC RNN model has three layers: the convolutional layer, which extracts relevant features from the input image; the recurrent layer, which consists of multiple deep bidirectional LSTMs and learns the characters of the license plate; and the transcription layer, which predicts the sequence of characters on the license plate.

    Fig. 6

    CTC RNN Algorithm

    The output of the license plate recognition is shown in Fig. 7.

    Fig. 7

    Output of the ANPR
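
Bringing steps 1–6 together, a condensed sketch of the capture-and-detection loop is given below. It assumes the opencv-python and ultralytics packages; the file names and class indices are placeholders rather than the exact values used in this work:

```python
import cv2
from ultralytics import YOLO

# Hypothetical indices for the three annotated classes; the real mapping
# comes from the dataset's YOLO-format label files (step 4).
MOTORCYCLE, TRIPLE_RIDER, LICENSE_PLATE = 0, 1, 2

model = YOLO("best.pt")                     # best weights from training (step 5)
cap = cv2.VideoCapture("surveillance.mp4")  # step 1: recorded or live video

fps = cap.get(cv2.CAP_PROP_FPS) or 30
step = max(1, int(fps // 10))               # step 2: keep about 10 frames per second

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        for r in model.predict(frame, conf=0.25, verbose=False):  # step 6
            for box in r.boxes:
                x1, y1, x2, y2 = map(int, box.xyxy[0])
                cls = int(box.cls[0])
                # Red box for a triple-riding violation, green otherwise.
                color = (0, 0, 255) if cls == TRIPLE_RIDER else (0, 255, 0)
                cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
        cv2.imwrite(f"annotated_{frame_idx}.jpg", frame)
    frame_idx += 1
cap.release()
```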

5 Performance evaluation

The efficacy of the object detection model is determined using the following parameters:

  i. Intersection over Union (IoU): This is the most common performance measure for determining the localisation accuracy and localisation errors of object detection models. Boundary boxes may overlap when multiple objects of the same class are detected, so this metric is especially useful in such cases.

To determine the IoU between the ground truth boundary box and the predicted boundary box, the intersection area of the two boxes for the same object is first determined. Then the total area covered by both boundary boxes, known as the union, is determined, with the overlap area defined as the intersection. The ratio of the intersection to the union gives the ratio of the overlap to the total area, which is a good measure of how accurately the predicted boundary box captures the actual object.

  ii. Average Precision (AP): The weighted average of the precision at each threshold level. Mean Average Precision (mAP) is the average of the Average Precision over all classes. The steps for calculating the mAP are as follows:

  a. Determine the prediction score of an object belonging to a class using the object detection model.

  b. Determine the class labels from the predictions.

  c. Determine the confusion matrix for the model from the counts of True Positives, False Positives, True Negatives and False Negatives.

  d. Determine the Precision and Recall measures for the model.

  e. Determine the area under the Precision vs. Recall curve.

  f. Determine the average precision.

After performing these steps, the mAP is determined by summing the average precision for each class and dividing by the class count. The formula for calculating the mAP is as follows:

$$ mAP = \frac{1}{n}\sum\limits_{i = 1}^{n} {AP_{i} } $$
(3)
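
As a worked illustration of this formula, the sketch below averages hypothetical per-class AP values; these are placeholders, not results from this study:

```python
# Hypothetical average precisions for the three detection classes.
average_precisions = {
    "motorcycle": 0.95,
    "triple_rider": 0.90,
    "license_plate": 0.93,
}

# mAP is the mean of the per-class APs, per Eq. (3).
mAP = sum(average_precisions.values()) / len(average_precisions)
print(f"mAP = {mAP:.3f}")  # 0.927 for these example values
```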

5.1 Simulation settings

The implementation of the proposed model is done as per the details in Table 2.

Table 2 Implementation details

5.2 Results and discussion

To develop the model, we considered a dataset of 622 photos separated into the following subsets: 540 images in the training set, 31 in the test set and 51 in the validation set. Table 3 contains a description of the dataset.

Table 3 Description of dataset

The three object classes considered for the identification and detection tasks are motorcycle, license plate and triple rider. Sample images of a triple rider and a non-triple rider are shown in Fig. 8a and b.

Fig. 8

Sample image a Triple Rider b Non Triple Rider

The model is trained using YOLOv8 for object detection with a batch size of 16 and 100 epochs. The training took 1.082 h. After training the model, the best and last weights are saved. The results of triple rider detection are shown in Fig. 9a and b.

Fig. 9

Results of the detection a Detection of vehicle and b Detection of person
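
For reproducibility, a minimal sketch of a training run with the settings reported above (batch size 16, 100 epochs) is given below, assuming the ultralytics Python API; the checkpoint variant, dataset configuration file and output path are assumptions rather than details reported here:

```python
from ultralytics import YOLO

# Train YOLOv8 with the settings used in this study: batch size 16, 100 epochs.
model = YOLO("yolov8n.pt")  # starting checkpoint (variant assumed)
model.train(data="data.yaml", epochs=100, batch=16, imgsz=640)

# Evaluate the best weights on the validation split; the path below is the
# library's default output location.
metrics = YOLO("runs/detect/train/weights/best.pt").val(data="data.yaml")
print(metrics.box.map50, metrics.box.map)  # mAP50 and mAP50-95
```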

The results of the traffic rule violation are shown in Fig. 10a–d.

Fig. 10

a TP: the rider is safe, and the model also predicted safe/no violation. b FP: the rider violated the rule, but the model predicted safe. c TN: the rider violated the rule, and the model also detected the violation. d FN: the rider is safe, but the model predicted a violation

  • Training results

The proposed model for the detection of triple riders was developed and trained for 100 epochs.

  1. Loss

The binary cross-entropy function determines the loss in the training phase. The regression loss measures the accuracy of the detected boundary box with respect to the ground truth, the class loss represents the accuracy of detecting objects of the different classes, and the distributed focal loss (DFL) represents the optimised distribution of the boundary box boundaries. The losses for the training dataset are shown in Fig. 11a–c. It can be observed that the loss value decreases as the number of epochs increases.

Fig. 11

Loss for the Training Data a. Regression Loss b. Class Loss c. DFL Loss
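
For reference, the binary cross-entropy used for the class loss can be written as a small sketch; this is a generic formulation, not the exact implementation inside YOLOv8:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Generic BCE: -mean(y*log(p) + (1-y)*log(1-p)), with clipping for stability."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example with placeholder labels and predicted probabilities.
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```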

The losses for the validation dataset are shown in Fig. 12a–c. It can be observed that the loss value decreases as the number of epochs increases.

Fig. 12

Loss for the Validation Data a. Regression Loss b. Class Loss c. DFL Loss

The results shown in Fig. 13a–c represent the variation of the learning rate with respect to the epochs for the three classes. The results show that the learning rate decreases as the number of epochs increases.

Fig. 13

Learning Rate a. Class 0 b. Class 1 c. Class 3

  2. Confusion matrix

The confusion matrix represents the performance of the proposed model. It consists of four fields, as shown in Table 4:

Table 4 Confusion matrix

Here, TP means True Positive, FP means False Positive, FN means False Negative and TN means True Negative. TP represents the count of instances in which riders are correctly classified as non-violators; FP represents the count of instances in which violating riders are wrongly classified as non-violators; FN represents the count of instances in which non-violating riders are wrongly classified as violators; and TN represents the count of instances in which violating riders are correctly classified as violators. Figure 14 shows the confusion matrix. The proportion of correctly classified instances is 94%, 96% and 97% for triple rider detection, license plate detection and motorcycle detection, respectively.

Fig. 14

Confusion Matrix for the proposed approach
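
From the four confusion-matrix fields, the precision, recall and F1 metrics reported in the following figures can be derived; the counts below are hypothetical, not the study's actual numbers:

```python
# Derive the headline metrics from confusion-matrix counts.
TP, FP, FN, TN = 94, 3, 6, 97  # placeholder counts for illustration

precision = TP / (TP + FP)  # share of predicted positives that are correct
recall = TP / (TP + FN)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```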

The results in Figs. 15, 16, 17 and 18 show the performance metrics for the proposed model. The results in Fig. 15 show that the precision of the model increases up to 90% as the number of epochs increases. The higher precision value indicates that our model returns more relevant than irrelevant results. The results in Fig. 16 show that the recall of the model increases to above 80% as the number of epochs increases. The higher recall value indicates that most of the relevant results are being returned by the model. Figures 17 and 18 show the Mean Average Precision for the proposed model.

Fig. 15

Precision

Fig. 16

Recall

Fig. 17

mAP-50

Fig. 18

mAP 50–95

  3. Recall-confidence curve

The recall-confidence results are shown in Fig. 19. It can be concluded from the results that the area under the recall-confidence curve is large, meaning that recall stays high across confidence values. The high recall value indicates a low false negative rate, and the high confidence indicates high confidence in the relevance of the results obtained by the proposed model.

Fig. 19

Recall confidence curve

  4. Precision-recall curve

The precision-recall results are shown in Fig. 20. It can be concluded from the results that the area under the curve is large, which means the model has high precision and recall values. The higher precision indicates that our model returns more relevant than irrelevant results.

Fig. 20

Precision and Recall curve

  5. F1-confidence curve

The F1 score and confidence results are shown in Fig. 21. The F1 score is used to determine the optimal point at which the precision and recall values are balanced. An F1 score above 80% is achieved over a wide range of confidence values for all three tasks.

Fig. 21

F1 Score and confidence curve

  6. Precision-confidence curve

The precision-confidence results are shown in Fig. 22. It can be concluded from the results that a higher confidence threshold results in higher precision for all three tasks.

Fig. 22

Precision and confidence curve

After training the model and determining the best weights, the model is validated on the 51 validation images. The model summary is given in Table 5.

Table 5 Summary of the model

The time taken by the model in the prediction and validation phases is summarised in Table 6. The results show that the model takes more simulation time during the training and validation phases, but predictions can be made in less time.

Table 6 Simulation time in different phases

The results for the validation dataset are given in Table 7.

Table 7 The performance measure for the validation dataset

It can be observed from the table that the approach obtains high precision and recall values for all three tasks. The efficiency of the model is evaluated for two cases:

  a. mAP50: This implies that the Intersection over Union (IoU) threshold is considered as 0.50.

  b. mAP50-95: This implies that the average precision is averaged over IoU thresholds from 0.50 to 0.95 in increments of 0.05.

The mean average precision is also high for all the tasks, which shows the superiority of the proposed model.

6 Limitations

Although the proposed system achieves significant performance, it has some limitations:

  • The proposed model is effective mainly in low-traffic areas.

  • The proximity of people to the vehicle may reduce the accuracy.

  • The accuracy and precision of the proposed system depend on the camera placement angle and imaging quality.

  • The system has been implemented on the college premises under the above constraints.

7 Conclusion and future scope

This paper presented a classification model based on YOLOv8 for detecting triple riders on a two-wheeler. The model was trained to determine the best weights for detection and then tested on the validation set of 51 images, achieving accuracy levels of 91%, 94% and 96% for triple rider detection, license plate detection and motorcycle detection, respectively. The system can help enforce traffic rules with more accuracy and less effort than manual monitoring methods.

Detecting overspeeding with the proposed model is part of the future work. Integrating an automatic alert message into the proposed system in case of a traffic rule violation is also part of the future scope.