1 Introduction

Vehicles play a crucial role in our daily lives, and the traffic management system is the most important part of the transportation department. The first traffic control system was created in the 1910s. Through the use of computer vision techniques, traffic management systems have evolved over time to become more intelligent and efficient. Beyond traffic management, large amounts of data are also required for other systems, including surveillance, pollution control, autonomous vehicle detection (AVD), and vision-based vehicle parking management. Most importantly, such a system needs to be reliable and accurate in order to be used in real-world AVD applications. Many datasets [16, 17] have been created in recent years for a variety of object detection purposes, including vehicle detection. Autonomous vehicles locate obstacles using images captured by on-board cameras. The range of the sensor and the size of the target must be considered when detecting vehicles [26]. It should be noted that the growing complexity of newer vehicle makes and models in terms of shape, size, color, and texture is making this research field even more challenging [11].

The pose and viewing angle make it difficult for a model to identify the precise object class. Additionally, it is challenging to create a generalized AVD dataset due to the numerous country- or region-specific issues. The most widely used datasets, such as Pascal VOC2007 [10] and KITTI [13], only include five and seven vehicle classes, respectively, which are frequently seen on urban roads. Only a few datasets are available that were created with Indian traffic conditions in mind. The most popular Indian vehicle dataset, known as IDD [38], is used for segmentation tasks. Its images were taken in Bangalore and Hyderabad, two popular Indian cities. There are a total of 34 classes in this dataset, including 8 classes for vehicles. Other classes include the sky, roadside objects (such as walls, fences, and billboards), distant objects (such as buildings and bridges), living things (such as animals and people), and drivable and non-drivable objects (such as sidewalks and non-drivable fallbacks). Another Indian vehicle detection dataset is NITCAD [28], which includes 7 different vehicle classes; its images were collected throughout Kerala, a southern state of India. Although auto-rickshaws are widely used in both urban and rural areas of India, NITCAD and IDD include them mainly because they are frequently seen in urban areas.

However, other vehicles, including totos, cycle-rickshaws, and motor-rickshaws, are commonly found in rural areas, and none of them is incorporated in any of the aforementioned datasets. Due to inadequate traffic control systems, auto-rickshaws and motorcycles frequently break traffic laws in rural areas, and in congested urban areas, pedestrians do not always use crosswalks. These factors make traffic management in India extremely difficult. The fact that the same kind of vehicle is used for multiple purposes in India presents another challenge for vehicle detection, as it confuses computer vision-based learning models; a motor-rickshaw, for instance, can carry both passengers and goods. This study offers a new dataset for AVD to address these problems, and an ensemble of deep learning models is used to provide baseline results. Vehicle detection from still images is a challenging task due to various factors, such as vehicle orientation, lighting conditions, and scale variations. In the Indian context, there are several additional challenges to consider, such as non-standard vehicles, crowded scenes, and diverse driving behaviors. This article explores the challenges in vehicle detection in the Indian context and proposes solutions to address them. The first step in developing an effective vehicle detection system is to collect a standard dataset that covers a range of scenarios, including different lighting conditions, backgrounds, and vehicle types and orientations. In the Indian context, the dataset should also include vehicles that are uncommon in other countries, such as auto-rickshaws and cycle-rickshaws. Figure 1 shows examples from a dataset that considers these factors.

Fig. 1

Sample images taken from the dataset developed in the present work

One of the significant challenges in vehicle detection is the orientation of vehicles. Vehicles can have different orientations, such as front-facing, side-facing, or rear-facing, which can make it difficult to detect them using traditional methods. Additionally, variations in lighting conditions can affect the accuracy of detection. Vehicles can appear at different scales in images or videos, making it challenging to detect them using existing datasets. Furthermore, vehicles may blend into complex backgrounds, such as trees, buildings, and other vehicles, which can make it difficult to distinguish them from the surroundings. In the Indian context, traffic congestion is another significant issue. The high volume of traffic, particularly in urban areas, can make it problematic to detect vehicles accurately in crowded scenes. Moreover, in rural areas, we can observe a wide range of sometimes erratic driving behaviours, which can further complicate the detection task. Additionally, India has a diverse range of uncommon vehicles, such as auto-rickshaws and cycle-rickshaws, which can be difficult to detect if a model is trained on existing datasets.

Contributions

With the aforementioned information in mind, in this paper, we have developed IRUVD: Indian Rural and Urban Vehicle Detection, a new still-image-based dataset for AVD. Specific contributions of this paper are as follows:

  • A new vehicle detection dataset that includes 13 vehicle classes and 1 pedestrian object has been introduced. Toto, cycle-rickshaw, and motor-rickshaw classes of vehicles that are frequently seen on Indian roads have been considered.

  • To make the dataset as realistic as possible, we have taken into account both urban and rural areas of India during data collection. This aids in capturing a variety of traffic scenarios, including both low-congested rural areas without a traffic system and highly congested urban areas with a well-maintained traffic system.

  • There are several challenges that are considered while preparing the current dataset, such as vehicle orientation, variations in lighting conditions, scale variation of the objects, complex backgrounds, occlusions, and different traffic diversities.

  • This dataset contains 14343 properly annotated bounding boxes for 4000 labeled images. Figure 1 displays some examples. The images were captured from various locations in West Bengal, a state in eastern India. The images were taken during the day, and the objects were in various poses. The resolution of each image is 1920 × 1080.

  • Utilizing the most up-to-date deep learning-based object detection models, we have benchmarked the IRUVD dataset using You Only Look Once version 3 (YOLOv3), YOLO version 4 (YOLOv4), Scaled YOLO version 4 (Scaled-YOLOv4), and YOLO version 5 (YOLOv5). We have used a variety of object detection metrics, including recall, precision, F1-score, mean average precision (mAP) at intersection over union (IOU) threshold 0.5 (mAP@0.5), mAP at IOU threshold 0.75 (mAP@0.75), mAP averaged over IOU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP@[0.5:0.05:0.95]), and mAP averaged over IOU thresholds from 0.75 to 0.95 in steps of 0.05 (mAP@[0.75:0.05:0.95]).

  • On the IRUVD dataset, we have proposed both weighted and non-weighted ensemble approaches to establish the baseline results. In this case, we have observed that the non-weighted ensemble approach performs better than the weighted ensemble approach.

The rest of the paper is organized as follows. A literature review is given in Section 2. The developed dataset is presented in Section 3. In Section 4, we describe the procedure for benchmarking the dataset. To further improve detection results, we introduce an ensemble technique in Section 5. In Section 6, we test the proposed ensemble technique on some additional datasets. Finally, in Section 7, we conclude our paper.

2 Literature survey

Researchers have created various datasets to address a variety of difficult problems in the field of computer vision. The most popular datasets for image classification, localization, and segmentation are ImageNet [6], Microsoft COCO [22], ADE20K [48], and Pascal VOC. However, only Pascal VOC2007 can be used for vehicle detection tasks. Existing vehicle detection datasets fall into two categories, segmentation-based datasets and localization-based datasets, which are briefly discussed below.

2.1 Segmentation-based detection dataset

Segmentation implies finding an exact outline around the object in an image. The most widely used dataset, Pascal VOC2007, has 20 main classes, but only 5 of them can be applied to developing autonomous vehicle and traffic management systems. The most well-known datasets for segmentation-based vehicle detection are Cityscapes [5], Mapillary Vistas [29], and CBCL StreetScenes [2]. CBCL StreetScenes contains 3.5k images with 9 classes (cars, pedestrians, bicycles, buildings, trees, sky, roads, sidewalks, and stores), collected from urban streets in Boston, Massachusetts, in the United States. Another dataset for vehicle detection, with 25k images, is Mapillary Vistas; with 11 different vehicle types among a total of 37 classes, its images were collected from urban streets across 6 continents with the goal of diversifying the detection models. A segmentation-based dataset that took into account both urban and rural areas of the USA is the Berkeley Deep Drive Video dataset, created from a total of 10,000 hours of video. The largest vehicle dataset, BDD100K [44], was collected from New York City. Each of its 100,000 video sequences has a duration of 40 seconds, and the dataset covers ten different object classes. The primary driving force behind its creation was the desire to capture various time and weather conditions, such as sunny, cloudy, and rainy. IDD, which was created taking Indian traffic conditions into consideration, is the segmentation-based vehicle detection dataset that most closely resembles Cityscapes. Only 8 of the 34 object classes in its 45k images are clearly identifiable vehicle classes. Moreover, some vehicles that are frequently seen on Indian rural roads are not taken into account in this dataset.

2.2 Localization-based detection dataset

Localization is a method for determining an object’s precise location, which is typically indicated by a rectangular box. For the sole purpose of detecting pedestrians, datasets like INRIA [7], Daimler Pedestrian [27], TudBrussels [43], and Citypersons [45] were created. However, AVD systems must be able to detect most objects encountered on the road. A 3D car classification dataset with 207 car categories was introduced by Krause et al. [20]. The most well-known vehicle detection dataset, KITTI, was developed using images from Karlsruhe, Germany; it has 200k 3D bounding boxes across 15k images covering 8 object classes. The most recent Iranian vehicle detection dataset, LRVD [19], contains 110k images and 5 classes. The BIT dataset [8] is another vehicle classification dataset consisting of 6 classes - bus, SUV, minivan, truck, microbus, and sedan. It has more than 9.8k images taken during day and night from a camera installed on a highway. The limitation of this dataset is that it only contains a front view of every vehicle, which is not preferable in a real-life scenario. NITCAD is a stereo vision-based autonomous navigation dataset collected from Kerala, India; it has 7.5k distorted images. The main motive behind developing this dataset was to provide more information about Indian roads. However, it introduces only one class, auto-rickshaw, that is new compared to other datasets. Other vehicles seen on Indian roads, like the toto, cycle-rickshaw, and motor-rickshaw, were not taken into account when these datasets were made. In most cases, an object detection model predicts multiple overlapping bounding boxes for a single object. The most accurate bounding boxes can be selected using either Non-Maximum Suppression (NMS) or Soft Non-Maximum Suppression (Soft-NMS).

2.2.1 Non-maximum suppression

Every detection model outputs the object’s location as a bounding (anchor) box, together with the box’s class and a confidence score for the prediction. NMS is a filtering technique used to get rid of overlapping bounding boxes. For each object class, the box with the highest confidence is retained, and any box whose IOU with it exceeds a threshold is suppressed. However, this method can also eliminate partially occluded objects of the same class, which is undesirable. This type of technique is typically applied as a post-processing step of a detection model.
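To make the filtering step concrete, the sketch below implements greedy NMS under common assumptions: boxes are given in corner format [x1, y1, x2, y2] with one confidence score each, and the 0.5 IOU threshold is illustrative; the function names are ours, not from any particular detection library.

```python
# A minimal sketch of greedy NMS (assumptions noted in the text above).
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes (corner format)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box; drop boxes overlapping it above iou_thresh."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Retain only boxes that do not overlap the kept box too much.
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep
```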

2.2.2 Soft non-maximum suppression

To select the most suitable bounding box from multiple candidates, Bodla et al. [4] developed Soft-NMS. In contrast to NMS, Soft-NMS rescales the confidence score based on the IOU value: the higher the overlap with an already selected box, the more the score is decayed, rather than the box being removed outright. This improves detection accuracy. Soft-NMS techniques are, however, not very well suited to ensembling.
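For comparison with plain NMS, the following sketch implements the Gaussian score-decay variant of Soft-NMS described by Bodla et al. [4]. It reuses the iou() helper from the previous sketch; sigma and the final score cutoff are illustrative defaults, not values taken from the paper.

```python
# A sketch of Gaussian Soft-NMS: overlapping boxes have their scores
# decayed by exp(-IOU^2 / sigma) instead of being deleted.
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    scores = np.asarray(scores, dtype=float).copy()
    idxs = list(range(len(scores)))
    keep = []
    while idxs:
        best = max(idxs, key=lambda i: scores[i])  # current top-scoring box
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            ov = iou(boxes[best], boxes[i:i + 1])[0]   # overlap with kept box
            scores[i] *= np.exp(-(ov ** 2) / sigma)    # decay, do not delete
    # Prune boxes whose decayed score fell below the cutoff.
    return [i for i in keep if scores[i] >= score_thresh]
```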

To develop an intelligent traffic control system in the Indian context, the above datasets may not be useful because they lack information about vehicles like the toto, cycle-rickshaw, motor-rickshaw, and tempo, which are frequently seen on Indian roads. To this end, we have created a dataset with 14 classes that includes a few special vehicle classes that, to the best of our knowledge, are not included in any other dataset. Since both urban and rural roads were taken into consideration, this dataset effectively captures both structured and unstructured traffic scenarios. This is a crucial component of the traffic management systems of developing nations like India. Table 1 compares the widely used datasets with the one we have created. YOLOv3, YOLOv4, Scaled-YOLOv4, and YOLOv5, four recent deep learning-based object detection models, are used to benchmark the results on this dataset. Additionally, weighted and non-weighted ensembles of these deep learning models are employed to boost the detection accuracy. The ensemble technique, which is primarily used in the machine learning field, combines the decisions of multiple predictive models to make a final prediction that is more accurate than those made by the base models. It has the following main advantages:

  • One of the main benefits of the ensemble technique is that it typically performs better than the average accuracy of its constituent members.

  • It helps balance the bias and variance of the contributing members.

  • It improves the robustness of the models.

Table 1 Comparison of existing vehicle detection datasets with the IRUVD

3 Developed object detection dataset

The data collection and annotation process, the quality and statistics of the data, and a comparison with other vehicle detection datasets are presented in this section.

3.1 Data collection

Due to the lack of structure in its traffic management system, India accounts for 11% of all road fatalities worldwide [15]. In order to create more reliable systems that can be applied in the real world, we have created a still-image-based AVD dataset that takes into account the unique characteristics of Indian roads and vehicles. We have gathered data from various locations and at various times of the day to adequately represent the variety of traffic conditions. The traffic control system is more organized and effectively managed in urban areas. However, in rural areas, some vehicles and pedestrians disobey traffic laws, which leads to a high number of traffic accidents. When gathering the data, different perspectives, including the front, side, and back views of the same object, were taken into account. As the images were captured from different angles, the size of the objects varies a lot. We used a 16MP Sony IMX519 high-resolution camera to capture all of these images. To create the database, we recorded 1080p footage at 60 frames per second for more than 5 hours (at various times). Some sample images are shown in Fig. 1.

3.2 Annotation process

A non-iconic image is one that contains multiple objects, which makes it more difficult to identify objects accurately in such images. To accurately assess the performance of any existing or newly developed method, the data annotations must be flawless. The majority of the dataset’s images are non-iconic. Although the annotation process is prone to errors, we have made sure that they are kept to a minimum. Obtaining correct bounding boxes was difficult when an object was occluded by another. Objects may also be regarded as noise because of their small size and poor illumination. Figure 2 displays a challenging annotation example; because the object in this example is too small, we have not considered it to be an object. All of the images were annotated using the open-source program LabelImg [36].

Fig. 2

An example of a difficult annotation problem. The green box shows a straightforward annotation and the red box indicates a difficult annotation

3.3 Statistics of the dataset

Because they are the most varied and common on Indian roadways, we selected 13 vehicle categories and one pedestrian class from the images we collected. They are toto, bike, cyclist, auto-rickshaw, motor-rickshaw, van, tempo, car, bus, taxi, truck, jeep, cycle-rickshaw, and pedestrian. A sample of each category found in the developed dataset is shown in Fig. 3. We took into account every vehicle that was included in datasets like KITTI and NITCAD. Additionally, six new vehicle classes have been added: tempo, taxi, motor-rickshaw, jeep, toto, and cycle-rickshaw. We have annotated 4000 images using the aforementioned process, and a total of 14343 bounding boxes have been labeled manually. Each image in our dataset has an average of 3.58 boxes. Figure 4 shows the frequency of each object; from this figure, it is clear that pedestrians are the most frequent and cycle-rickshaws the least frequent objects in the dataset.

Fig. 3

Examples of different classes present in IRUVD dataset. (a) Toto (b) Cyclist (c) Bike (d) Truck (e) Motor-rickshaw (f) Van (g) Tempo (h) Car (i) Bus (j) Taxi (k) Auto-rickshaw (l) Jeep (m) Cycle-rickshaw (n) Pedestrian

Fig. 4

Distribution of different object classes found in the IRUVD dataset. Y-axis denotes the number of occurrences of each class

3.4 Quality of annotation

The open-source annotation software LabelImg [36] made it easier to prepare accurate annotations. We have annotated the images in the YOLO format using this software. As seen in Fig. 2, typical situations such as occlusion and the small size of objects lead to errors in the annotation process. To lessen the uncertainty of the bounding boxes and class labels, we have carefully reviewed the results.
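For readers unfamiliar with the YOLO annotation format used here, the sketch below parses one label line as LabelImg writes it: an integer class index followed by the box center and size, all normalized by the image dimensions. The example values are hypothetical; only the 1920 × 1080 resolution matches the dataset.

```python
# A sketch of parsing one YOLO-format label line:
# "<class_id> <x_center> <y_center> <width> <height>", normalized to [0, 1].
line = "7 0.512 0.634 0.210 0.185"  # hypothetical object near the image center
cls, xc, yc, w, h = line.split()
img_w, img_h = 1920, 1080           # image resolution used in IRUVD

# Convert to pixel corner coordinates for visualization or evaluation.
x1 = (float(xc) - float(w) / 2) * img_w
y1 = (float(yc) - float(h) / 2) * img_h
x2 = (float(xc) + float(w) / 2) * img_w
y2 = (float(yc) + float(h) / 2) * img_h
print(int(cls), round(x1), round(y1), round(x2), round(y2))
```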

3.5 Comparison with other vehicle detection datasets

For AVD, a number of datasets have been made available, including CBCL StreetScenes, KITTI, the dataset by Yu P. et al., BDD100K, Mapillary Vistas, IDD, and NITCAD. There are mainly two types of datasets: segmentation-based and localization-based. The dataset developed under the current work is intended for localization. Our dataset is most comparable to the NITCAD dataset because we have concentrated on Indian vehicle detection and classification. As seen in Table 1, our dataset has 14 classes while NITCAD has only 7. The current dataset has fewer object classes than datasets like IDD and Mapillary Vistas, but it covers more vehicle types, as it adds new classes like tempo, taxi, motor-rickshaw, jeep, toto, and cycle-rickshaw.

4 Dataset benchmarking

Results of the experimentation carried out under the current work are reported in this section. For the performance evaluation, a total of 11 methods have been applied to the IRUVD dataset.

4.1 Deep learning models used to benchmark

Several object detection models, including YOLOv3 [31], YOLOv4 [3], Scaled-YOLOv4 [40], YOLOv5 [18], YOLOX [12], YOLOR [41], YOLOv6 [21], and YOLOv7 [42], have been introduced in recent years. In the current work, we have considered four state-of-the-art object detection models: YOLOv3, YOLOv4, Scaled-YOLOv4 and YOLOv5. These four models are discussed below.

4.1.1 YOLOv3

YOLOv3, proposed by Joseph Redmon and Ali Farhadi [31], is one of the most widely used object detection techniques in the field of computer vision. As shown in Fig. 5, the detection process is divided into three steps: feature extraction, bounding box prediction, and class prediction. For feature extraction, YOLOv3 uses the Darknet-53 backbone, which consists of 53 convolutional layers. YOLOv3 predicts boxes at three different scales to determine the precise size of the object; the box priors are estimated using k-means clustering. For class prediction, YOLOv3 replaces the softmax with independent logistic classifiers, so overlapping class labels are not discarded, which is useful for detection in more complicated domains like AVD.
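As a rough illustration of the anchor estimation step, the sketch below runs a simplified k-means over the (width, height) pairs of training boxes. Note that the YOLOv3 paper clusters with a 1 − IOU distance, whereas this sketch uses plain Euclidean distance for brevity; all names are ours.

```python
# A simplified sketch of box-prior (anchor) estimation with k-means.
import numpy as np

def kmeans_anchors(wh, k=9, iters=50, seed=0):
    """wh: array of (width, height) pairs from training annotations."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest anchor center.
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each center to the mean of its assigned boxes.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by box area

# Usage: anchors = kmeans_anchors(all_box_wh, k=9)
```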

Fig. 5

Schematic diagram of the YOLOv3 object detection model with Darknet-53 backbone

4.1.2 YOLOv4

Bochkovskiy et al. [3] introduced YOLOv4 as a successor to YOLOv3, with better speed and accuracy in terms of mAP@[0.5:0.05:0.95] and mAP@0.5. YOLOv4 consists of four sub-blocks, namely the Backbone, Neck, Dense Prediction block, and Sparse Prediction block, as shown in Fig. 6. A typical object detection model takes images as input and extracts features through convolutional layers. For this purpose, the authors of YOLOv4 examined SpineNet [9], CSPResNeXt50 [39], CSPDarknet53 [39], EfficientNet-B3 [34], and VGG16 [33] as backbone feature extractors. In this paper, we have used CSPDarknet53 pre-trained on ImageNet as the Backbone. The extracted feature maps are mixed in the Neck region to find more generalized characteristics of the objects. YOLOv4 was examined with several Neck configurations, such as the Feature Pyramid Network (FPN) [23], Path Aggregation Network (PANet) [24], Neural Architecture Search FPN (NAS-FPN) [14], Bi-directional FPN (BiFPN) [35], Adaptively Spatial Feature Fusion (ASFF) [25], and Scale-wise Feature Aggregation Module (SFAM) [47]; PANet is the most commonly used Neck structure in YOLOv4. The head generally refers to the Dense Prediction and Sparse Prediction blocks: the Dense Prediction block is used for one-stage detection, while the Sparse Prediction block is used for two-stage detection. One-stage detection estimates the location and the class in a single step, whereas two-stage detection separates localization from classification. The YOLO head has been used in this work. “Bag of freebies” and “Bag of specials” are two new sets of techniques introduced in YOLOv4 that improve the model and boost performance. The authors noted that the Rectified Linear Unit (ReLU) activation function did not optimize the features effectively in YOLOv4, so the Mish activation function was used for improved performance.

Fig. 6

The architecture of YOLOv4. It has four parts—Backbone, Neck, Dense Prediction, and Sparse Prediction

4.1.3 Scaled-YOLOv4

Wang et al. [40] proposed Scaled-YOLOv4 to detect both small and large objects using the same model. Scaled-YOLOv4 includes CSPDarknet53, a CSPUp-sampling block, and a CSPDown-sampling block, as demonstrated in Fig. 7. Besides widening the detection bandwidth, both large and small models can use up- and down-sampling without sacrificing speed or accuracy. This approach led to the development of two model families, YOLOv4-tiny and YOLOv4-large, which offered state-of-the-art results for small and large object detection in terms of mAP scores. Scaled-YOLOv4 uses CSPDarknet53 as its backbone. To lessen the computation at the model’s neck, the PAN architecture in Scaled-YOLOv4 employs the CSP-ize technique, which cuts the computation by about 40%. The neck also makes use of CSPSPP. Data augmentation plays a heavy role in YOLOv4’s training process; in Scaled-YOLOv4, the model is fine-tuned with the augmented training data only after the main training is complete, which also helps accelerate the training process.

Fig. 7

Schematic diagram of the Scaled-YOLOv4 object detection model. Red lines indicate where the CSPUp block is replaced by the CSPSPP block for scaling

4.1.4 YOLOv5

YOLOv5 was proposed by Glenn Jocher et al. [18]. Like other YOLO models, it has three main sub-blocks, namely the Backbone, the PANet Neck, and the Output block, as shown in Fig. 8. YOLOv5’s backbone is a CSP version of Darknet-53, known as CSPDarknet53, organized with BottleneckCSP modules. A large-scale backbone like CSPDarknet53 works well for feature extraction and feature dimension reduction, which improves the model’s detection speed and accuracy. The second component of the model is the PANet neck, which improves the flow of information from the backbone to the head: a bottom-up path passes information through the feature pyramid, allowing the model to recognize more low-level features, and skip connections help propagate features to other layers. Finally, the output section uses the YOLO head. Similar to YOLOv3, YOLOv5 predicts the output at three distinct scales, namely 72 × 72, 36 × 36, and 18 × 18, enabling the model to recognize objects of various sizes. Four models with different numbers of trainable parameters are available in YOLOv5:

  • YOLOv5s (small)

  • YOLOv5m (medium)

  • YOLOv5l (large)

  • YOLOv5x (extra-large)

In this paper, we have used YOLOv5s to benchmark the results on the IRUVD dataset.
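As an illustration of how YOLOv5s can be run, the snippet below uses the Ultralytics PyTorch Hub entry point; the image path is a placeholder, and training on IRUVD would instead go through the repository's training script with a dataset configuration listing the 14 classes.

```python
# A minimal inference sketch with YOLOv5s via PyTorch Hub
# (paths and the test image are placeholders).
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
results = model('street_scene.jpg')   # hypothetical test image
results.print()                       # per-class detection summary
boxes = results.xyxy[0]               # tensor: x1, y1, x2, y2, conf, class
```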

Fig. 8

An illustration of YOLOv5s architecture, with CSP Backbone, PANet Neck, and YOLO head

4.2 Object detection benchmarks

As mentioned earlier, we have reported object detection benchmarks using state-of-the-art deep learning models. Class-wise mAP scores from the various detection models at IOU thresholds ranging from 0.5 to 0.95 in steps of 0.05 are displayed in Table 2. For the evaluation of each model, we have used a 3-fold cross-validation scheme in order to obtain more reliable results. The table clearly shows that the detection models classify toto, truck, bus, and motor-rickshaw with greater accuracy. However, detection accuracy is lower for cars, taxis, bikes, and pedestrians. Since the shapes of a car and a taxi are almost identical and only their colors differ, the detection models occasionally fail to classify them properly. Tables 3 and 4 show the class-wise mAP at IOU thresholds of 0.5 and 0.75, respectively. The average precision, recall, F1-score, and mAP scores at IOU thresholds of 0.5, 0.75, and 0.5 to 0.95 in steps of 0.05 are displayed in Table 5. YOLOv5s yields the highest precision, whereas Scaled-YOLOv4 yields the highest recall. Overall, the Scaled-YOLOv4 model yields the highest scores.

Table 2 Performance comparison on the IRUVD dataset: class-wise mAP scores at IOU thresholds ranging from 0.5 to 0.95 in steps of 0.05 (mAP@[0.5:0.05:0.95])
Table 3 Performance comparison on the IRUVD dataset: class-wise mAP scores at IOU threshold 0.5 (mAP@0.5)
Table 4 Performance comparison on the IRUVD dataset: class-wise mAP scores at IOU threshold 0.75 (mAP@0.75)
Table 5 Performance comparison on the IRUVD dataset using different state-of-the-art object detection models

4.3 Confusion matrix and miss rate

The confusion matrix for the YOLOv5s model on the current AVD dataset is shown in Fig. 9. As can be seen, almost every object is detected correctly. However, because of their similar shapes, as already indicated, taxis are sometimes mistaken for cars. A few erroneous classifications occur between the following class pairs:

  • Car and Taxi

  • Cyclist and Pedestrian

  • Toto and Auto-rickshaw

  • Cycle-rickshaw and Cyclist

  • Cyclist and Motor-Bike

We have shown the total numbers of true positive and false positive objects detected by each model, and we have presented the class-wise log-average miss rate for each model to see how it performs. Figure 10(a) shows the numbers of true positives and false positives detected by the YOLOv3 model for each class, whereas Fig. 10(b) shows the log-average miss rate of the YOLOv3 model for each class. Figures 11, 12 and 13 show the corresponding results produced by the YOLOv4, Scaled-YOLOv4 and YOLOv5 models, respectively.

Fig. 9

Confusion matrix on the current AVD dataset using YOLOv5s

Fig. 10

Vehicle detection results produced by YOLOv3: (a) Number of false positives and true positives for each class, and (b) Log Average miss rate of each class

Fig. 11

Vehicle detection results produced by YOLOv4: (a) Number of false positives and true positives for each class, and (b) Log Average miss rate of each class

Fig. 12

Vehicle detection results produced by Scaled-YOLOv4: (a) Number of false positives and true positives for each class, and (b) Log Average miss rate of each class

Fig. 13

Vehicle detection results produced by YOLOv5: (a) Number of false positives and true positives for each class, and (b) Log Average miss rate of each class

4.4 Precision vs. recall curve

Precision measures the fraction of predicted objects that are true positives, whereas recall measures the fraction of ground-truth objects that are detected. A high precision value indicates a low false positive rate, and a high recall value indicates a low false negative rate. A perfect model needs to score highly on both. The precision-recall trade-off, however, states that as precision increases, recall tends to decrease, and vice versa. The precision vs. recall curve for each of the 14 classes is displayed in Fig. 14. This graph shows that we have achieved encouraging results for the bus, cycle-rickshaw, toto, van, and motor-rickshaw classes. We have obtained weaker results for some classes, including auto-rickshaw, pedestrian, taxi, cyclist and jeep, which indicates a precision-recall trade-off.
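To clarify how the points on such a curve arise, the sketch below computes cumulative precision and recall from confidence-ranked detections. Whether a detection counts as a true positive would be decided by IOU matching against the ground truth (e.g., at IOU ≥ 0.5); the input values shown are illustrative.

```python
# A sketch of the precision-recall pairs behind curves like Fig. 14.
import numpy as np

def pr_curve(confidences, is_tp, num_gt):
    """confidences: detection scores; is_tp: TP/FP flags; num_gt: # ground truths."""
    order = np.argsort(confidences)[::-1]       # rank detections by confidence
    flags = np.asarray(is_tp)[order]
    tp = np.cumsum(flags)                       # cumulative true positives
    fp = np.cumsum(~flags)                      # cumulative false positives
    precision = tp / (tp + fp)                  # fraction of predictions correct
    recall = tp / num_gt                        # fraction of objects found
    return precision, recall

# Illustrative data: 4 detections against 5 ground-truth objects.
p, r = pr_curve([0.9, 0.8, 0.7, 0.6], [True, True, False, True], num_gt=5)
```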

Fig. 14

Precision vs. recall plots of 14 object classes using YOLOv3, YOLOv4, Scaled-YOLOv4 and YOLOv5 object detection models. (a) Auto-rickshaw, (b) Bus, (c) Bike, (d) Car, (e) Cyclist, (f) Cycle-rickshaw, (g) Motor-rickshaw, (h) Pedestrian, (i) Taxi, (j) Tempo, (k) Toto, (l) Truck, (m) Van, and (n) Jeep

5 Ensemble techniques

The proposed ensemble method, which is essentially a box estimation technique used to predict a precise bounding box from the N bounding boxes provided by N detection algorithms, is discussed in this section. Even though a single object detection model locates objects fairly accurately, it occasionally classifies the background as an object or confuses objects that have a similar shape. To address this problem, we have proposed the ensemble technique described below.

  • Consider N detection models used to create the aforementioned ensemble. Each model provides a prediction score for a detected object at a specific IOU threshold T; in our case, we have set T ≥ 0.5. The box returned by the N-th model is designated BN.

  • Each box includes the object’s height and width, as well as the class prediction, the prediction confidence, and the center coordinates (x, y). For a given object, the algorithm counts how many of the N models assigned it to the same class; the class selected by the majority of models is taken as the actual class of the object. If equal numbers of models predict different classes (a tie), the true class is estimated using the prediction confidence scores: the class given by the model with the highest confidence score is taken as the final class. For instance, if four models are used in the ensemble and three of them identify an object as a car and one as a taxi, our method labels the object as a car. If two models predict the object to be a car and the other two predict it to be a taxi, the method labels it a car when the first two models have the highest confidence scores. False predictions can be reduced by using this method.

  • Our method does not remove overlapping anchor boxes like NMS and Soft-NMS; instead, it estimates a new anchor box from the N boxes obtained from the different detection models. We propose to estimate the box using either a weighted or a non-weighted method. Let the N detection models give N predictions for a particular object as [C1,S1,x1,y1,w1,h1], ..., [CN,SN,xN,yN,wN,hN], where CN is the class predicted by the N-th model, SN the confidence score, xN and yN the center coordinates, and wN and hN the width and height given by the N-th model. The parameters are then estimated using (1)-(7). The non-weighted methods take P1,P2,......,PN as inputs, where P is the parameter to be estimated (x coordinate, y coordinate, width, height, etc.) and N is the number of models used in the ensemble technique. The weighted methods take P1,P2,......,PN,W1,W2,.......,WN as inputs, where WN is the confidence score of the N-th model; we use the confidence score as the weight. To estimate the parameters, we have used seven popular functions, described below (a code sketch of the fusion step follows the list):

  • Non-weighted methods:

    1. Mean:

       $$ f(P_{1},P_{2},\ldots,P_{N}) = \frac{{\sum}_{i=1}^{N} P_{i}}{N} $$
       (1)

    2. Harmonic Mean (HM):

       $$ f(P_{1},P_{2},\ldots,P_{N}) = \frac{N}{{\sum}_{i=1}^{N}\frac{1}{P_{i}}} $$
       (2)

    3. Contraharmonic Mean (CM):

       $$ f(P_{1},P_{2},\ldots,P_{N}) = \frac{{\sum}_{i=1}^{N} {P_{i}}^{2}}{{\sum}_{i=1}^{N} P_{i}} $$
       (3)

    4. Root Mean Square (RMS):

       $$ f(P_{1},P_{2},\ldots,P_{N}) = \sqrt{\frac{1}{N}\sum\limits_{i=1}^{N} {P_{i}}^{2}} $$
       (4)

  • Weighted methods:

    1. Weighted Mean (WM):

       $$ f(P_{1},\ldots,P_{N},W_{1},\ldots,W_{N}) = \frac{{\sum}_{i=1}^{N} W_{i} P_{i}}{{\sum}_{i=1}^{N} W_{i}} $$
       (5)

    2. Weighted Harmonic Mean (WHM):

       $$ f(P_{1},\ldots,P_{N},W_{1},\ldots,W_{N}) = \frac{{\sum}_{i=1}^{N} W_{i}}{{\sum}_{i=1}^{N} {\frac{W_{i}}{P_{i}}}} $$
       (6)

    3. Weighted Geometric Mean (WGM):

       $$ f(P_{1},\ldots,P_{N},W_{1},\ldots,W_{N}) = \left( \prod\limits_{i=1}^{N} {P_{i}}^{W_{i}} \right)^{\frac{1}{{\sum}_{i=1}^{N} W_{i}}} $$
       (7)
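A condensed sketch of the fusion step is given below, under our stated assumptions: each of the N models contributes one (class, score, x, y, w, h) tuple for the same object, the majority class wins with ties broken by the single most confident model, and the box parameters are fused with the Mean (1) or the Weighted Mean (5). The fused confidence is taken as the mean score, which is an assumption on our part, and all function names are illustrative.

```python
# A sketch of the proposed box-fusion step (Mean / Weighted Mean variants).
from collections import Counter
import numpy as np

def fuse_boxes(preds, weighted=False):
    """preds: list of (cls, score, x, y, w, h), one tuple per detection model."""
    counts = Counter(p[0] for p in preds).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Tie: take the class of the single most confident model.
        cls = max(preds, key=lambda p: p[1])[0]
    else:
        cls = counts[0][0]                       # majority class
    params = np.array([p[2:] for p in preds], dtype=float)   # x, y, w, h
    scores = np.array([p[1] for p in preds], dtype=float)
    if weighted:
        fused = (scores[:, None] * params).sum(0) / scores.sum()  # Eq. (5)
    else:
        fused = params.mean(0)                                     # Eq. (1)
    return (cls, scores.mean(), *fused)

# e.g., three models, two voting for class 3 (a hypothetical class mapping):
box = fuse_boxes([(3, 0.91, 410, 225, 80, 60),
                  (3, 0.84, 405, 230, 84, 58),
                  (9, 0.77, 408, 228, 82, 59)])
```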

Using (1)–(7), all the said parameters are estimated for the newly obtained bounding boxes. An illustration of our ensemble architecture is shown in Fig. 15. Tables 6 and 7 show the results of ensembles of different models using the various methods in terms of mAP at IOU threshold 0.5, mAP at IOU threshold 0.75, and mAP at IOU thresholds ranging from 0.5 to 0.95 in steps of 0.05. We have combined two or more models to form different ensembles. From Table 6, it can be seen that the combination of YOLOv3 and YOLOv4 using the Mean method achieves a 1% improvement in mAP@[0.5:0.05:0.95] over YOLOv4 and 8.8% over YOLOv3. Popular techniques like NMS and Soft-NMS achieve only 3.8% more than YOLOv3 and 5% less than YOLOv4. The ensemble of YOLOv4 and Scaled-YOLOv4 using the Mean method gives 83.6% mAP@[0.5:0.05:0.95], which is 7.1% more than YOLOv4 and 2.2% more than Scaled-YOLOv4. The other methods give similar performance; however, NMS and Soft-NMS give lower mAP than Scaled-YOLOv4 because of the unnecessary elimination of overlapping boxes. The fusion of YOLOv4 and YOLOv5 provides 82.3% mAP@[0.5:0.05:0.95], which is 1.5% higher than Scaled-YOLOv4 and 4.7% higher than YOLOv5s. For the ensemble of YOLOv5 and Scaled-YOLOv4, NMS and Soft-NMS display better results than the other methods, a 5% improvement over YOLOv5 and a 1.9% improvement over Scaled-YOLOv4.

Not only pairs of models but also larger ensembles are evaluated, as shown in Table 7. The combinations of YOLOv3, YOLOv4, Scaled-YOLOv4 and of YOLOv3, YOLOv4, YOLOv5 provide better results than any single model but not better than the best combination of two models. However, the ensemble of YOLOv4, Scaled-YOLOv4, and YOLOv5 using the Mean method shows higher performance than all other models. A bar plot of class-wise mAP@[0.5:0.05:0.95] for ensembles of different models using different methods is shown in Fig. 16. The numbers of false positives and true positives and the log-average miss rate of each class given by the ensemble of YOLOv4, Scaled-YOLOv4, and YOLOv5 using the Mean method are shown in Fig. 17. In Fig. 18, a comparison of class-wise mAP at thresholds from 0.5 to 0.95 is presented for YOLOv3, YOLOv4, Scaled-YOLOv4, YOLOv5, the ensemble of YOLOv4 and Scaled-YOLOv4 using the Mean method, and the ensemble of YOLOv4, Scaled-YOLOv4 and YOLOv5 using the Mean method. From Fig. 18, it can be clearly seen that YOLOv3 performs worst, whereas the ensemble of YOLOv4, Scaled-YOLOv4 and YOLOv5 using the Mean method outperforms the other models.

Fig. 15

An illustration of the ensemble method used in the present work, which considers N base models

Table 6 Performance comparison of the ensemble of two models using different methods on the IRUVD dataset
Table 7 Performance comparison of the ensemble of three models using different methods on the IRUVD dataset
Fig. 16

A comparison of class-wise mAP@[0.5:0.05:0.95] of detection results using YOLOv3, YOLOv4, Scaled-YOLOv4, YOLOv5 and the top seven ensemble models in terms of mAP@[0.5:0.05:0.95] shown in Tables 6 and 7 on the IRUVD dataset

Fig. 17

Detection results produced by the ensemble of YOLOv4, Scaled-YOLOv4, and YOLOv5 using the Mean method: (a) Number of false positives and true positives of each class, and (b) Log average miss rate of each class

Fig. 18

mAP score plots given by YOLOv3, YOLOv4, Scaled-YOLOv4, YOLOv5 and the ensemble of YOLOv4 and Scaled-YOLOv4 using the Mean method, and the ensemble of YOLOv4, Scaled-YOLOv4 and YOLOv5 using the Mean method for each class

6 Experimentation on other datasets

In this paper, we have presented a dataset for vehicle detection on Indian roads and benchmarked the results using four state-of-the-art deep learning-based detection models. We have also proposed an ensemble technique to improve the benchmark results on the developed IRUVD dataset. To better understand how a model trained on existing datasets fails to detect vehicles in the complex situations found on Indian roads, we have tested the YOLOv5 model trained on two datasets, namely Udacity Self-Driving-Car [37] and Otonomarc [32]. Comparative results on the IRUVD dataset are shown in Table 8. In terms of precision, recall, F1-score, and mAP@0.5, the model trained on the proposed dataset outperforms the models trained on the Udacity Self-Driving-Car and Otonomarc datasets. We have analysed the predictions of the YOLOv5 model trained on the Udacity Self-Driving-Car, Otonomarc, and IRUVD datasets, as shown in Fig. 19. From Fig. 19, it is clear that the model trained on IRUVD gives the best results, whereas the model trained on the Udacity Self-Driving-Car dataset shows the worst performance.

Table 8 Performance comparison of YOLOv5 model trained on IRUVD dataset (our) and separately tested on Udacity Self-Driving-Car, Otonomarc and IRUVD (our) datasets
Fig. 19

Qualitative results of YOLOv5 model trained on Udacity Self-Driving-Car, Otonomarc and IRUVD (our) datasets

We have also evaluated the proposed ensemble of the YOLOv4 and YOLOv5 models on the Udacity Self-Driving-Car and Otonomarc datasets, and we have obtained 31.05% and 13.29% mAP@0.5 improvements, respectively. The results of this testing are shown in Table 9. From the table, it is observed that, in terms of mAP@[0.5:0.05:0.95], the proposed ensemble method outperforms the base models, confirming the effectiveness of the ensemble method for the problem under consideration.

Table 9 Performance comparison of the ensemble of two models using different methods on Otonomarc and Udacity Self-Driving-Car datasets

7 Conclusion and future scope

Autonomous vehicles, intelligent traffic management systems, and other technologies have largely taken over our daily lives. Traffic management is becoming more and more challenging as vehicles proliferate, especially in crowded nations like India. Nowadays, researchers primarily use deep learning models to build effective AVD systems, which require a large amount of data. Several freely available datasets for this purpose have been found in the literature. However, the majority of them only take into account the traffic patterns and vehicles that are frequently seen on urban roads, making them less useful for creating a comprehensive traffic management system. To this end, in this paper, we have developed the 14-class IRUVD dataset. Many vehicle classes that are frequently seen in rural areas were not taken into account in past datasets; new vehicle classes like the toto, motor-rickshaw, tempo, taxi, and cycle-rickshaw have been included in our dataset, which offers 4000 high-quality images with 14343 proper annotations. Four state-of-the-art deep learning-based object detection models, namely YOLOv3, YOLOv4, Scaled-YOLOv4, and YOLOv5, have been used to benchmark the results on the said dataset. Additionally, we have proposed weighted and non-weighted ensemble techniques that improve mAP@[0.5:0.05:0.95] by 1.9%. Despite our best efforts, some classes, like cycle-rickshaw, jeep, taxi, and bus, have a very small number of samples, which may result in overfitting of deep learning models. To address this, we may employ several types of data augmentation techniques in future. However, in our research, we have observed that the current models operate efficiently without any data augmentation because the YOLO models use the focal loss for training, which can deal with imbalanced data [46]. On the other hand, the ensemble approaches surpass the existing models in terms of the performance metrics under consideration. Another gap in our dataset is that we have not considered varied weather conditions, such as wet, hazy, or overcast days, or times of day such as evening or night. We would like to resolve these constraints in future attempts. Our dataset can also be expanded by collecting data for new object types such as animals, signboards, and so on. Research is needed on the development of unique architectures capable of detection and classification in a wide range of settings, including numerous edge cases. Another future plan is to build a video dataset in the Indian context for AVD purposes.