1 Introduction

The need for more effective solutions to traffic congestion has grown in recent years with the increase in vehicle traffic. Automatic vehicle classification (AVC) systems are needed for real-time traffic management and monitoring to cope with this ever-increasing traffic volume, and the rising number of vehicles on the road has driven researchers to work towards improving the road traffic scenario. Many studies on traffic management systems have been published in areas such as vehicle categorization [1], detection [2, 3], make and model recognition [4,5,6], segmentation, lane detection, and pedestrian detection. Autonomous driving will play a huge role in vehicle-related research in the future, so research on AVC systems is much needed in today’s scenario. Such systems can also be used to collect vehicle make and model details that are necessary for security reasons. Working on such problems using real-world traffic scenarios is difficult in terms of training, testing, and model validation, as huge amounts of realistic data are required. Deep learning algorithms [7, 8] deliver competitive real-time performance compared to other machine learning algorithms and conventional approaches in various applications, and deep learning models continue to push the state of the art in a given domain. During the study of this domain, it has been found that few datasets are available for research, and many of them are based on speculative situations. Moreover, well-known AVC datasets are mostly paid, and the datasets with a decent number of images lack appropriate annotations, making them challenging to use for research purposes. On the other hand, sufficient samples are needed to create an effective supervised learning-based model that is accurate as well as capable of functioning in real-life scenarios.

For vehicle localization, a few datasets are available, but the number of datasets available for AVC is limited. Furthermore, most existing AVC datasets do not capture real-life scenarios adequately. For example, images taken in the Indian subcontinent frequently show multiple vehicles overlapping in a single frame due to heavy traffic congestion. This issue makes the classification, localization, detection, and segmentation processes extremely challenging. These challenges are relevant not only in India but also in Bangladesh, Pakistan, Sri Lanka, and many other South Asian nations. Available AVC models trained on well-managed traffic scenarios might not be applicable to datasets collected from these nations. Recently, Bhattacharya et al. [9] developed a dataset, called JUVDsi, for vehicle detection in Indian road scenarios, and Ali et al. [10] published a vehicle detection dataset, called IRUVD, which includes 14 vehicle classes based on typical Indian road scenarios.

1.1 Research motivation and contributions

In this sub-section, the motivation behind this research work and its contributions are discussed. For vehicle detection, a large number of datasets are found in the literature; however, only a few datasets are available for vehicle classification tasks, and not all of them accurately reflect real-world situations. As in much of the Indian subcontinent, images taken in crowded traffic conditions sometimes show many automobiles overlapping in a single image, which makes the classification, localization, detection, and segmentation processes very challenging. Also, there is little room left for research on some AVC datasets because researchers have already achieved almost 100% accuracy on them. To reduce road accidents, it is also necessary to recognize vehicles [11] and pedestrians [12], and to re-identify vehicles [13,14,15]. There are several potential applications of AVC in various sectors, such as smart cities, transportation, law enforcement, and the car industry. Security, traffic management, and customer experience are all significantly enhanced by accurate vehicle identification and categorization. AVC may be used in a wide range of real-world situations, including intelligent parking systems, autonomous cars, toll payment systems, traffic monitoring systems, and vehicle identification at crime scenes. At transportation hubs, robust AVC techniques are frequently employed, so identifying vehicles entering restricted areas can help with security-related difficulties. To train a model for this problem, a large number of real-world vehicle images are needed. Vehicle re-identification, which attempts to match images of the same vehicle captured by many cameras, is vital to video surveillance for public safety since vehicles are an indispensable part of human existence. In this instance, the problem of vehicle re-identification demands the highest level of accuracy from AVC approaches. Considering the aforementioned details, in this paper, we have introduced a new still-image based dataset, called JUIVCDv1, for AVC. The following is a list of this paper’s main contributions.

  1. This dataset offers a realistic image representation of the traffic situation in India, which is very different from that of other developed countries. Vehicle images captured in various scenarios are considered. A total of 6335 vehicle images can be found in this dataset.

  2. Researchers may use this dataset to evaluate the effectiveness of their methods for automated vehicle localization and categorization.

  3. This dataset includes images of vehicles captured at night, which makes the categorization task more challenging. As a result, models developed on it are also capable of handling data collected in low-light conditions.

  4. The vehicle images in the collection are taken in different weather conditions. Therefore, the model is resilient enough to handle data collected in a variety of meteorological scenarios.

  5. Detailed annotations are provided for the performance evaluation of either new or existing methods developed on this dataset.

  6. Initially, we have executed eight CNN models, namely Xception, InceptionV3, DenseNet121, MobileNetV2, VGG16, NasNetMobile, ResNet50, and ResNet152, on this dataset for AVC. We have also applied three ensemble models to this dataset, namely Majority Voting-based Ensemble (MVE), Weighted Average-based Ensemble (WAE), and Sum Rule-based Ensemble (SRE). Finally, we have benchmarked this dataset using the MVE classifier combination approach, which achieves 95% accuracy.

2 Literature review

In this section, we have discussed three aspects of the research problem related to AVC: the AVC datasets that are publicly available for development and validation, a performance comparison of different types of AVC methods, and some recent state-of-the-art approaches already available for AVC. During this study, it was found that the datasets related to AVC are quite expensive and the number of freely available AVC datasets is very small. Moreover, the datasets that are freely available to the research community have been used so frequently over the years that methods attain an accuracy of almost 100% on them. Therefore, there is a need for a new dataset with more challenges portraying real-life scenarios. Also, a country like India presents a different scenario in this domain due to the road conditions and the presence of multiple vehicles within a single frame. If we use other datasets that are not similar to the Indian road scenario, then the model may not work properly in real-life conditions. Table 1 provides a summary of the freely available AVC datasets commonly used by researchers to date, with the state-of-the-art accuracies achieved on these datasets.

Table 1 List of freely available image datasets used for AVC to date

To the best of our knowledge, the existing datasets suffer from the following shortcomings. The BIT car dataset comprises only 6 classes, and its images are captured under a single weather condition, posing a challenge for building a robust AVC model. The online images in the CompCars dataset are gathered from vehicle forums, public websites, and search engines. In contrast, our developed dataset contains images captured from real-world traffic scenarios across 12 distinct vehicle classes. The images in our dataset are sourced from various weather conditions, including rainy, sunny, and cold seasons, as well as nighttime and daytime settings. We have partitioned the dataset into train and test sets using a 70:30 ratio, which helps future researchers evaluate their methods. There are many studies related to vehicle detection, but research articles related to AVC are few. Maity et al. [26] surveyed this topic in 2021, considering all the work done on AVC in the preceding decade. In this section, we also discuss some significant AVC methods. Sun et al. [27] proposed a novel vehicle-type classification system that uses a lightweight CNN with feature optimization and a joint learning strategy. The first step was to create a lightweight CNN with feature optimization, called LWCNN-FO. To minimize the number of network parameters, they employed depth-wise separable convolution. Additionally, the SENet module was included to automatically determine the significance of each feature channel through self-learning. Silva et al. [28] proposed an AVC system with a computer vision solution to the problem of vehicle make and model classification. A camera was set up to authenticate a vehicle by observing its attributes and contrasting them with those in the membership records. They concentrated on constructing a fine-grained AVC system that uses the system’s multi-camera composition to feed a CNN with many views of the vehicle. The evaluations presented indicate that incorporating data from multiple perspectives of a vehicle enhances the classification accuracy of its make and model, particularly in difficult tolling situations. Ni et al. [29] proposed a vehicle attribute recognition system based on appearance. The study covered both coarse-grained (vehicle type) and fine-grained (car manufacturer and model) components in its review of existing vehicle attribute identification methods. It aimed to perform vehicle type recognition by categorizing vehicles into broad classifications based on their sizes or intended usage, such as sedans, buses, and trucks, and also analyzed vehicle make recognition, which involves classifying vehicles by their respective manufacturers, such as Ford, Toyota, and Chevrolet. Silva et al. [30] presented a subscription/membership function for an automated toll-collecting (ATC) system using computer vision. This application system established a one-to-one correspondence between a distinct identifier (ID), a tangible automobile, and a membership. A camera-based system was implemented to authenticate that every transaction aligned with the factual membership data by cross-verifying the vehicle and ID information. The visual system employed various algorithms to extract distinct features of a vehicle, such as the number plate, make, model, color, number of axles, and so on. The system performed a comparison between the extracted characteristics and those present in the membership records. The authors concentrated on addressing the vehicle make classification problem and suggested a detailed vehicle classification approach that leverages the system’s multi-camera configuration. Sahin et al. [31] proposed the utilization of Light Detection and Ranging (LiDAR) sensor data to differentiate between distinct categories of truck trailers, surpassing the capabilities of conventional classification sensors such as inductive loop detectors and piezoelectric sensors. Their study demonstrates the processing of point-cloud data obtained from a 16-beam LiDAR sensor to extract valuable information and features. The outcomes indicate that a Support Vector Machine (SVM) model can effectively differentiate various trailer body types with a remarkably high degree of precision, ranging from 85% to 98%. Liu et al. [32] proposed a new end-to-end CNN architecture that can simultaneously detect and remove adversarial perturbations by utilizing denoising techniques, referred to as Denoising Detection and Denoising Adversarial Perturbations (DDAP). The DDAP denoiser utilized the adversarial examples found by the DDAP detector to eliminate adversarial perturbations. The proposed method can be considered a pre-processing measure: it does not necessitate any alterations to the configuration of the vehicle classification model and has minimal impact on the classification outcomes of clean images. To validate the capabilities of DDAP, they conducted testing on public datasets such as BIT-Vehicle. Butt et al. [33] presented an AVC system that utilized a CNN to enhance the resilience of vehicle classification in real-time scenarios. The authors provided a dataset of 10,000 vehicle images classified into six distinct vehicle classes. The dataset was designed to account for challenging lighting conditions to enhance the reliability of vehicle classification systems in real time. The study involved fine-tuning pre-trained models such as GoogleNet, Inception-v3, VGG, AlexNet, and ResNet on the self-constructed vehicle dataset to assess their accuracy and convergence capabilities. To achieve generalization, the network was fine-tuned on the VeRi dataset, a publicly available collection of 50,000 images classified into six distinct vehicle categories. Guo et al. [34] proposed a semi-supervised approach for vehicle type classification in Intelligent Transportation Systems (ITS) using broad ensemble learning. The methodology comprised two primary components. The initial phase involves training a set of base Broad Learning System (BLS) classifiers using semi-supervised learning techniques to mitigate the growing burden of unlabeled samples and reduce the duration of the training process. In the second phase, a dynamic ensemble architecture is created using trained classifiers that possess distinct characteristics. The authors utilized the publicly available BIT-Vehicle and MIO-TCD datasets to conduct experiments and showed that their proposed method performs better in terms of effectiveness and efficiency when compared to a single BLS classifier and other commonly used methods. Mohine et al. [35] introduced a hybrid deep 1D CNN-bidirectional long short-term memory (CNN-BiLSTM) approach that utilized the acoustic modality for moving vehicle categorization into two-wheeler, light, medium, and heavy weight groups, as well as noise analysis. It was tested on the reference dataset, SITEX02, to validate its performance, achieving an accuracy rate of 96%. A comparative analysis of the 1D CNN-BiLSTM model’s performance was conducted against traditional classifiers such as ANN, CNN, SVM, and CNN-LSTM models. Based on the empirical findings, the CNN-BiLSTM model achieved a superior classification accuracy of 92% in comparison to traditional classifiers.

3 Dataset preparation

In this section, the specifics of creating the JUIVCDv1 dataset have been discussed in detail. Here, we have covered the dataset nomenclature, the methods of collecting vehicle videos, the process of creating images from videos, and the annotation process.

3.1 Dataset nomenclature

We have named our developed dataset JUIVCDv1, where JUIVCD stands for ‘Jadavpur University Indian Vehicle Classification Dataset’. The dataset has 12 different vehicle classes namely, ‘Bicycle’, ‘Van’, ‘Car’, ‘Bus’, ‘Ambassador’, ‘Autorickshaw’, ‘Rickshaw’, ‘Motorized2Wheeler’, ‘Motorvan’, ‘Toto’, ‘Truck’ and ‘Minitruck’. Figure 1 illustrates each of the vehicle classes, class names, and class labels.

Fig. 1

Sample images of the vehicle classes that are considered in the JUIVCDv1 dataset (the class labels are denoted by digits before the names of the vehicle classes)

3.2 Collection and preparation of raw data

Images have been collected from highways in Kolkata, an Indian metropolitan city, and some rural locations around Kolkata. We made every effort to compile as many real-time traffic scenarios as possible. Videos are captured first, and frames are then extracted from them to generate still images, which are subsequently annotated using the labelImg tool [36]. The JUIVCDv1 dataset includes images taken both from fixed camera positions and from a moving vehicle. We collected data during both daytime and nighttime, and we have also provided bounding boxes of the vehicles in our dataset. The most realistic traffic situations on Indian urban streets have been adopted. To capture videos and still images, we mostly used two different camera phones:

  1. Redmi Note 9 Pro (1280x720p)

  2. Honor HRY-AL00 (1080x2340p)

To make it easier for the image processing algorithms to analyze each video, we have created image frames from each video and saved the images in JPEG format. The steps of this procedure are discussed below, and a minimal frame-extraction sketch follows the list:

  1. Specific image frames have been chosen such that they can be easily distinguished from each other in the set of chosen frames, and are not too fuzzy.

  2. All still images that have been transformed from videos to image frames have been divided into a training set and a test set: the first 70% of the image frames are taken into the training set and the remaining 30% into the test set.
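The sketch below illustrates how such frames can be extracted. It is not the exact pipeline used to build JUIVCDv1 but a minimal OpenCV-based example; the sampling step and the Laplacian-variance blur threshold are illustrative assumptions.

```python
import cv2  # OpenCV

def extract_frames(video_path, out_dir, step=15, blur_threshold=100.0):
    """Save every `step`-th frame of a video as JPEG, skipping fuzzy frames.

    The variance of the Laplacian is a common sharpness proxy; the threshold
    used here is an illustrative value, not the one used for JUIVCDv1.
    """
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if cv2.Laplacian(gray, cv2.CV_64F).var() >= blur_threshold:
                cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)  # JPEG output
                saved += 1
        idx += 1
    cap.release()
    return saved
```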

3.3 Annotation of processed data

Accurate annotation is a crucial requirement for any developed dataset if supervised learning algorithms are to be applied, although annotating data properly can be very time-consuming [37]. Having annotations in the test data is also beneficial for researchers evaluating performance while developing a new algorithm. This dataset’s annotations are given in both TXT and XML formats, produced with the standard labelImg tool. Figure 2 represents the annotation format of a sample image taken from the JUIVCDv1 dataset, and Table 2 shows the annotation of a sample image using the said tool. A bounding box is described as (bx, by, bw, bh, c), where bx and by are the x and y coordinates of the box center, bw and bh are the width and height of the box relative to the entire image, and c represents the class of the object. In TXT format, the first value (e.g., ’0’) is the class of the object, and the next values are x_center, y_center, width, and height, respectively. In JSON format, the annotation information is represented in the following order: the image name first, then the class of the vehicle, the x and y coordinates of the bounding box, followed by the width and height of the bounding box.
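As an illustration of the TXT (YOLO-style) format described above, the following hypothetical helper converts one normalized annotation line into pixel coordinates; the field order follows the description in the text.

```python
def yolo_txt_to_pixels(line, img_w, img_h):
    # One line of a YOLO-style TXT annotation:
    # "<class> <x_center> <y_center> <width> <height>" with values in [0, 1]
    c, bx, by, bw, bh = line.split()
    bx, by, bw, bh = (float(v) for v in (bx, by, bw, bh))
    x_min = int((bx - bw / 2) * img_w)
    y_min = int((by - bh / 2) * img_h)
    x_max = int((bx + bw / 2) * img_w)
    y_max = int((by + bh / 2) * img_h)
    return int(c), (x_min, y_min, x_max, y_max)

print(yolo_txt_to_pixels("0 0.5 0.5 0.2 0.4", img_w=1280, img_h=720))
# -> (0, (512, 216, 768, 504))
```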

Fig. 2

Annotation of a sample image taken from JUIVCDv1

4 Details of JUIVCDv1 dataset

The dataset contains images that can be utilized to create a realistic AVC system focusing mainly on typical Indian road conditions and traffic scenarios. The images are taken at various times of the day and night to accommodate every possible diversity of typical Indian road scenes. Each image contains a single object in a single frame. The videos are recorded from the sidewalks on both sides of the road, as well as while riding on a moving vehicle. This provides a diversity of images and will help to strengthen the models developed by researchers. To make models robust and able to operate in varied situations, the dataset is intentionally kept unbalanced.

Table 2 Annotation format of a still image from the JUIVCDv1 dataset

4.1 Train set

There are 12 folders in the train set of the JUIVCDv1 dataset, namely ‘0_Car’, ‘1_Bus’, ‘2_Bicycle’, ‘3_Ambassador’, ‘4_Van’, ‘5_Motorized2wheeler’, ‘6_Rickshaw’, ‘7_Motorvan’, ‘8_Truck’, ‘9_Autorickshaw’, ‘10_Toto’, and ‘11_MiniTruck’. The ‘0_Car’ folder has 560 images, the ‘1_Bus’ folder has 560 images, the ‘2_Bicycle’ folder has 120 images, the ‘3_Ambassador’ folder has 480 images, the ‘4_Van’, ‘5_Motorized2wheeler’, and ‘6_Rickshaw’ folders each have 560 images, ‘7_Motorvan’ has only 33 images, ‘8_Truck’ has 140 images, ‘9_Autorickshaw’ has 564 images, ‘10_Toto’ has 36 images, and ‘11_MiniTruck’ has 181 images. A total of 4300 vehicle images are provided in the training set of the JUIVCDv1 dataset. Sample images are shown in Fig. 1. The number of objects present in each vehicle class in the training data is shown in Fig. 3 using a bar graph, where the Y-axis denotes the number of vehicle images and the X-axis denotes the vehicle classes.

Fig. 3

Bar graph showing the number of images present per vehicle class in the train set of the JUIVCDv1 dataset

4.2 Test set

Similar to the train set, there are 12 folders in the test set of the JUIVCDv1 dataset, namely ‘0_Car’, ‘1_Bus’, ‘2_Bicycle’, ‘3_Ambassador’, ‘4_Van’, ‘5_Motorized2wheeler’, ‘6_Rickshaw’, ‘7_Motorvan’, ‘8_Truck’, ‘9_Autorickshaw’, ‘10_Toto’, and ‘11_MiniTruck’. The ‘0_Car’ folder contains 240 images, the ‘1_Bus’ folder has 240 images, the ‘2_Bicycle’ folder has 80 images, the ‘3_Ambassador’ folder has 320 images, the ‘4_Van’, ‘5_Motorized2wheeler’, and ‘6_Rickshaw’ folders each have 240 images, ‘7_Motorvan’ has only 11 images, ‘8_Truck’ has 59 images, ‘9_Autorickshaw’ has 240 images, ‘10_Toto’ has 23 images, and ‘11_MiniTruck’ has 122 images. A total of 2035 vehicle images are given in the test set of the JUIVCDv1 dataset. Sample images are shown in Fig. 1. The number of objects present in each vehicle class in the test data is shown as a bar graph in Fig. 4, where the Y-axis represents the number of images and the X-axis represents the vehicle classes.

5 Benchmarking JUIVCDv1 dataset

To categorize the vehicles in our dataset, we have considered several state-of-the-art pre-trained deep learning models, described in the following subsections.
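As a minimal sketch of how such pre-trained models can be adapted to the 12 JUIVCDv1 classes, the snippet below fine-tunes an ImageNet-pretrained backbone in Keras; the input size, dropout rate, and optimizer settings are illustrative assumptions rather than the exact training configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 12  # JUIVCDv1 vehicle classes

# Pre-trained backbone without its ImageNet classification head
base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the backbone for the initial fine-tuning stage

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),  # illustrative regularization
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30)  # datasets assumed
```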

5.1 Xception

Chollet [38] proposed a CNN model based on depth-wise separable convolution layers, asserting that the mapping of cross-channel correlations and spatial correlations in the feature maps can be entirely decoupled. This hypothesis is an extreme version of the Inception architecture; for this reason, the author named the architecture Xception, meaning Extreme Inception. The Xception architecture consists of 36 convolutional layers for feature extraction, arranged as a linear stack of depth-wise separable convolution layers with residual connections. Figure 5 shows the architecture of the Xception model.
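A simplified version of this repeating pattern, assuming the Keras functional API (the real entry and exit flows add strided convolutions and pooling), could look as follows:

```python
from tensorflow.keras import layers

def xception_style_block(x, filters):
    # Two depth-wise separable convolutions with a residual (shortcut) connection,
    # the basic pattern repeated throughout Xception's middle flow
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)  # match channel count
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    return layers.Add()([y, shortcut])
```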

Fig. 4

Bar graph showing the number of images present per vehicle class in the test set of the JUIVCDv1 dataset

Fig. 5

Architecture of the Xception model

5.2 InceptionV3

In 2016, Szegedy et al. [39] proposed a novel classification model, called InceptionV3, a CNN belonging to the Inception family. It incorporates various enhancements, such as label smoothing, factorized 7x7 convolutions, and an auxiliary classifier to propagate label information to lower layers of the network; batch normalization is also employed for the layers in the auxiliary head. In Fig. 6, the overall architecture of the InceptionV3 model is shown.
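For clarity, label smoothing mixes each one-hot target with a uniform distribution over the classes; a small sketch with an illustrative smoothing factor of 0.1 is shown below.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    # Each target becomes (1 - epsilon) * one_hot + epsilon / K, so the correct
    # class keeps most of the mass and the rest is spread uniformly
    k = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / k

print(smooth_labels(np.array([0.0, 1.0, 0.0, 0.0])))  # -> [0.025 0.925 0.025 0.025]
```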

Fig. 6

Architecture of the InceptionV3 model [39]

5.3 DenseNet121

DenseNet [40] is a recently presented CNN architecture with an intriguing connection pattern: within a dense block, each layer accesses the feature maps of all its predecessors, promoting feature reuse and reducing overfitting. Direct supervision from the loss function through these shortcut pathways provides a form of implicit deep supervision. The result is a compact model with improved computational and memory efficiency, where concatenation of feature maps and a controlled growth rate keep the channel count manageable. In Fig. 7, the block diagram of the DenseNet model is shown.
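A minimal dense block in Keras, assuming an illustrative growth rate of 32 and omitting DenseNet's bottleneck and transition layers for brevity, might look as follows:

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    # Each layer sees the concatenation of all preceding feature maps
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])  # feature reuse via concatenation
    return x
```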

Fig. 7

Architecture of the DenseNet121 model

5.4 MobileNetV2

An effective model for mobile and embedded vision applications is provided by MobileNet [41]. It is a streamlined architecture that builds lightweight deep CNN models using depth-wise separable convolutions. In a depth-wise convolution, a single filter is applied to each input channel; the depth-wise separable convolution then segregates this into two layers, one for filtering and one for combining. This factorization results in a significant decrease in computation and model size. Modern object detection systems can also use MobileNet as an efficient base network. The schematic diagram of the MobileNetV2 model is shown in Fig. 8.
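The computational saving can be quantified: for a D_K x D_K kernel, M input channels, N output channels, and a D_F x D_F feature map, the depth-wise separable factorization reduces the multiplication count by roughly 1/N + 1/D_K^2. The short sketch below checks this with illustrative layer sizes.

```python
def conv_mults(dk, m, n, df):
    # Multiplication counts for a dk x dk convolution with m input channels,
    # n output channels, and a df x df output feature map
    standard = dk * dk * m * n * df * df
    separable = dk * dk * m * df * df + m * n * df * df  # depthwise + pointwise
    return standard, separable

std, sep = conv_mults(dk=3, m=64, n=128, df=56)
print(sep / std)  # ~0.119, matching 1/n + 1/dk**2 = 1/128 + 1/9
```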

Fig. 8

Architecture of the MobileNetV2 model

5.5 VGG16

In 2014, the Visual Geometry Group (VGG) at the University of Oxford developed VGG16 [42], a popular CNN model. It consists of 16 weight layers: 13 convolutional layers followed by three fully connected layers. The convolutional layers use small 3*3 filters throughout and are interleaved with max-pooling layers with 2*2 filters, which halve the spatial size of the feature maps. Using smaller filters allows a deeper network to be constructed with fewer parameters. The first convolutional block starts with 64 filters, and the number of filters increases up to 512 in the final convolutional block. The first two fully connected layers at the end of the network consist of 4096 neurons each, and a Softmax layer serves as the last layer of the network, producing the class probabilities. The architecture of the VGG16 model is represented in Fig. 9.

Fig. 9

Architecture of the VGG16 model

5.6 ResNet50

ResNet-50 [43], a well-known CNN model and a member of the ResNet (Residual Network) family, was introduced by He and colleagues. This model uses a standard input image size of 224 by 224 pixels. The first layer is a conventional convolutional layer, followed by a max-pooling layer. ResNet-50 is then made up of four stages comprising sixteen residual blocks in total; the number of residual blocks and the number of filters inside each block change from stage to stage. The architecture of the ResNet50 model is presented in Fig. 10.
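The basic unit of ResNet-50 is the bottleneck residual block; a simplified Keras sketch (batch normalization omitted for brevity) is given below.

```python
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    # 1x1 -> 3x3 -> 1x1 bottleneck with an identity (or projection) shortcut;
    # real ResNet-50 adds batch normalization after each convolution
    shortcut = x
    y = layers.Conv2D(filters, 1, strides=stride, activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(4 * filters, 1)(y)
    if stride != 1 or shortcut.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride)(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))
```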

Fig. 10

Architecture of the ResNet50 model

5.7 NasNetMobile

NasNet is a scalable CNN architecture (built via neural architecture search) consisting of fundamental building blocks refined through reinforcement learning, trained on over a million images from the ImageNet database [44]. A cell is made up of only a few operations (a few separable convolutions and pooling layers) and is repeated several times to meet the network’s capacity requirements. NasNetMobile is the mobile version, consisting of 12 cells with 5.3 million parameters and 564 million multiply-accumulates (MACs). NASNet uses element-wise addition, which is far more intuitive than vector-wise operations. When utilizing a feature map as an input, two types of convolutional cells (normal and reduction cells) are employed. The input image size of the network is 224*224. The architecture of the NasNetMobile model is presented in Fig. 11.

Fig. 11

Architecture of the NasNetMobile model [45]

5.8 Majority voting-based ensemble

The MVE is one of the most popular and commonly used classifier combination approaches. Under the majority voting rule [46,47,48], each classifier casts a vote for one class, and the class with the maximum number of votes wins. In statistical terms, the target label predicted by the ensemble is the mode of the distribution of the individual label predictions. For example, suppose three classifiers are used in the ensemble (C1, C2, and C3), and the class labels are A and B. If both C1 and C2 predict A and C3 predicts B, then according to the MVE approach the result is A. The voting procedure is based on a democratic mechanism (optionally weighted) that aggregates the predictions of classification models that have been separately trained, possibly on different analytical sources. The simplest and most natural technique relies on the plain majority rule, which assigns a sample to the most common class prediction (the "loose" method); in the event of a tie, the sample is not categorized. Voting by strict majority, by contrast, requires all of the models under consideration to agree on the prediction. We have opted for the MVE strategy since it generally yields superior results [46]. For easy understanding, a pictorial illustration of the MVE approach is shown in Fig. 12, and a minimal implementation sketch is given below.
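A minimal sketch of the loose majority vote over the hard labels of the base classifiers could be written as follows; note that ties are broken here toward the lowest label, whereas the loose method described above leaves tied samples unclassified.

```python
import numpy as np

def majority_vote(label_preds):
    # label_preds: (n_models, n_samples) array of predicted class labels
    n_classes = label_preds.max() + 1
    return np.apply_along_axis(
        lambda votes: np.bincount(votes, minlength=n_classes).argmax(),
        axis=0, arr=label_preds)

# The example from the text: C1 and C2 predict A (label 0), C3 predicts B (label 1)
print(majority_vote(np.array([[0], [0], [1]])))  # -> [0], i.e., class A wins
```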

Fig. 12

A pictorial illustration of the MVE approach used to benchmark the JUIVCDv1 dataset

5.9 Sum rule-based ensemble

In machine learning [49], one popular ensemble learning technique is the SRE. In ensemble learning, many models are combined to produce a prediction model that is more powerful and reliable than any of the individual models. In particular, the SRE combines several models by summing their individual predictions: for classification tasks, the predicted class scores (or probabilities) of the models are added up, optionally with per-class weights, and the class with the largest sum (or weighted sum) becomes the ensemble prediction.
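A minimal sketch of the sum rule over the softmax outputs of the base models, assuming each model returns an (n_samples, n_classes) probability array, is shown below.

```python
import numpy as np

def sum_rule(prob_list):
    # prob_list: list of (n_samples, n_classes) softmax outputs, one per model
    total = np.sum(prob_list, axis=0)  # element-wise sum of the class scores
    return total.argmax(axis=1)        # class with the largest summed score

p1 = np.array([[0.7, 0.3]]); p2 = np.array([[0.4, 0.6]]); p3 = np.array([[0.6, 0.4]])
print(sum_rule([p1, p2, p3]))  # summed scores [1.7, 1.3] -> [0]
```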

5.10 Weighted average-based ensemble

The WAE [50] is one of the most popular and commonly used classifier combination approaches. It combines the predictions of several models to increase prediction accuracy by giving weights to the various models according to their performance on a validation set and forming a weighted average of their predictions. For easy understanding, a pictorial illustration of the WAE approach is shown in Fig. 13, and a minimal sketch follows.
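A minimal sketch of the weighted average, assuming the weights are derived from each model's validation accuracy, is given below.

```python
import numpy as np

def weighted_average_ensemble(prob_list, weights):
    # weights: e.g., the validation accuracies of the base models;
    # np.average normalizes them internally
    avg = np.average(np.stack(prob_list), axis=0, weights=np.asarray(weights, float))
    return avg.argmax(axis=1)

# Illustrative probabilities and weights resembling the base models' accuracies
print(weighted_average_ensemble(
    [np.array([[0.7, 0.3]]), np.array([[0.4, 0.6]]), np.array([[0.6, 0.4]])],
    weights=[0.94, 0.93, 0.92]))  # -> [0]
```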

Fig. 13

A pictorial illustration of the WAE approach used to benchmark the JUIVCDv1 dataset

6 Results and discussion

The present work includes the development of a still-image dataset called JUIVCDv1 for vehicle classification and the benchmarking of results on the same. We have trained and tested eight different CNN models, namely Xception, InceptionV3, DenseNet121, MobileNetV2, VGG16, NasNetMobile, ResNet50, and ResNet152, on our developed dataset. Finally, three popular classifier combination approaches, namely MVE, SRE, and WAE, are used to enhance the overall classification performance. In the following subsections, we discuss the results obtained along with the corresponding analysis.

6.1 Model evaluation

Some standard evaluation metrics, such as classification accuracy, precision [51], recall [51], F1-score [52], and confusion matrices [53], are used to measure the performance of the CNN-based models on our developed dataset; these metrics are defined in [9]. The Xception model has achieved an accuracy score of 0.94, InceptionV3 has achieved 0.93, the DenseNet121 and MobileNetV2 models have achieved 0.92 and 0.90 respectively, NasNetMobile has achieved 0.88, the VGG16 model has achieved 0.85, ResNet50 has achieved 0.56, and ResNet152 has achieved 0.55. After analyzing the results, three base CNN models, namely Xception, InceptionV3, and DenseNet121, are chosen for the MVE, SRE, and WAE techniques due to their performance on the said task. The MVE method has achieved an accuracy score of 0.95, whereas both the WAE and SRE approaches attained an accuracy score of 0.94.
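These metrics can be computed with scikit-learn; the snippet below is a generic illustration with placeholder labels, not the actual experimental output.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder labels for illustration; in practice y_true comes from the test
# set annotations and y_pred from a trained model's predictions
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

print(accuracy_score(y_true, y_pred))                   # overall accuracy
print(classification_report(y_true, y_pred, digits=2))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))                 # rows: true, columns: predicted
```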

6.2 Results obtained by CNN models

The outcomes of the eight pre-trained CNN models as well as the three popular ensemble approaches are examined in this section. The outcomes of the base learners are displayed graphically. When it comes to identifying certain vehicle classes, some models have been observed to be more accurate than others. A graphical comparison of the test accuracies provided by the eight pre-trained CNN models along with the three ensemble approaches is shown in Fig. 14. In addition, classification reports detailing the per-class scores of the three best-performing CNN models are provided, and the confusion matrices for the three ensemble models are also provided for observing the accurate and inaccurate classifications made by each of the models.

Fig. 14

Performance comparison of test accuracies produced by eight pre-trained CNN models along with three ensemble techniques used for AVC on the proposed JUIVCDv1 dataset

Table 3 Classification report showing the values of Precision, Recall, and F1-score for each of the vehicle classes produced by the Xception model on the JUIVCDv1 dataset

6.3 Performance comparison: classification report

The performance of AVC on the test set of the JUIVCDv1 dataset using the Xception model is presented in Table 3. After evaluating Xception on our dataset, we have observed that the highest precision value of 0.99 is achieved by the classes 4_Van and 5_Motorized2wheeler, and the lowest precision value of 0.69 is achieved by 10_Toto. The class 7_Motorvan has achieved the highest recall value of 1.00, and the class 1_Bus has the lowest recall value of 0.57. The highest F1-score of 0.98 has been achieved by the model for two classes, namely 5_Motorized2wheeler and 8_Truck. The Xception model has achieved an overall accuracy of 0.94.

In Table 4, we show how AVC performs using the InceptionV3 model on the JUIVCDv1 test set. After analyzing InceptionV3 on our dataset, we found that the class 2_Bicycle had the greatest precision value of 1.00, while the class 7_Motorvan had the lowest precision value of 0.58. The model has the maximum recall value of 1.00 for two classes, 0_Car and 7_Motorvan, and 10_Toto has the lowest recall value of 0.48. The model has the highest F1-score of 0.98 for the two classes 3_Ambassador and 5_Motorized2wheeler. The overall accuracy score of the InceptionV3 model is 0.93.

Table 4 Classification report showing the values of Precision, Recall, and F1-score for each of the vehicle classes produced by the InceptionV3 model on the JUIVCDv1 dataset

The effectiveness of AVC utilizing the DenseNet121 model on the JUIVCDv1 test set is shown in Table 5. According to our evaluation of DenseNet121 on our dataset, the model attained the highest precision of 1.00 for the classes 3_Ambassador and 7_Motorvan, and the lowest precision of 0.57 for the class 8_Truck. The recall value of 0.43 for the class 10_Toto is the lowest, while the classes 0_Car and 5_Motorized2wheeler have the maximum recall value of 1.00. The model achieves the highest F1-score of 0.98 for the class 5_Motorized2wheeler. An overall accuracy score of 0.92 is attained with the DenseNet121 model.

Table 5 Classification report representing the results of Precision, Recall, and F1-score for each of the vehicle classes produced by the DenseNet121 model on the JUIVCDv1 dataset

6.4 Results obtained by ensemble approaches

This section discusses the results provided by three popular state-of-the-art ensemble methods: MVE, SRE, and WAE. In Table 6, we show the performance of AVC using the MVE model on the JUIVCDv1 test set. We have chosen three CNN models for the majority voting technique, Xception, InceptionV3, and DenseNet121, since they are more accurate than the other CNN models utilized in this work. After analyzing our dataset, we have found that the model attained the maximum precision value of 1.00 for the class 4_Van and the lowest precision value of 0.48 for the class 9_Autorickshaw. The class 0_Car has the highest recall value of 1.00, while the class 1_Bus has the lowest recall value of 0.61. The class 7_Motorvan has the highest F1-score of 0.99. The final accuracy of the MVE technique is 0.95.

Table 6 Classification report showing the values of Precision, Recall, and F1-score for each of the vehicle classes produced by the MVE approach on the JUIVCDv1 dataset
Table 7 Classification report showing the values of Precision, Recall, and F1-score for each of the vehicle classes produced by the SRE approach on the JUIVCDv1 dataset

In Table 7, we have demonstrated the performance of AVC on the JUIVCDv1 test set using the SRE model. We have chosen the three CNN models Xception, InceptionV3, and DenseNet121 for the SRE approach since they are more accurate than the other CNN models used in this work. We have found that the model achieved the highest precision value of 1.00 for the class 2_Bicycle and the lowest precision value of 0.68 for the class 8_Truck after evaluating our dataset. The model’s best recall value of 1.00 is obtained for the classes 0_Car, 7_Motorvan, and 5_Motorized2wheeler, while the lowest recall value of 0.52 is for the class 10_Toto. The class 5_Motorized2wheeler has the highest F1-score of 0.99. The SRE technique has an overall accuracy of 0.94.

In Table 8, we have shown the performance of AVC using the WAE model on the JUIVCDv1 test set. We have chosen the best three performing CNN models, namely Xception, InceptionV3, and DenseNet121, for the WAE technique since they are more accurate than the other CNN models utilized in this work. After analyzing our dataset, we have found that the model attained the maximum precision value of 1.00 for the class 2_Bicycle and the lowest precision value of 0.68 for the class 8_Truck. The classes 0_Car, 7_Motorvan, and 5_Motorized2wheeler have the highest recall value of 1.00, while the class 10_Toto has the lowest recall value of 0.52. The class 5_Motorized2wheeler has the highest F1-score of 0.99. The overall accuracy of the WAE technique is 0.94.

Table 8 Classification report showing the values of Precision, Recall, and F1-score for each of the vehicle classes produced by the WAE approach on the JUIVCDv1 dataset
Fig. 15

Training accuracy and validation accuracy curves achieved by the Xception model

6.5 Performance comparison: accuracy vs epoch

In this section, curves of the training and validation accuracies of the base models are shown. In Fig. 15, the training accuracy (TA) and validation accuracy (VA) curves of the Xception model are given with respect to the number of epochs, whereas Fig. 16 shows the same for the InceptionV3 model. After analyzing the base CNN models on the JUIVCDv1 dataset, we have found that the Xception model achieves the best accuracy of 0.94 in the Indian road scenario, whereas InceptionV3 and DenseNet121 achieve accuracy scores of 0.93 and 0.92 respectively. Finally, Fig. 17 shows the DenseNet121 model’s TA and VA curves versus the number of epochs.

Fig. 16

Training accuracy and validation accuracy curves achieved by the InceptionV3 model

6.6 Confusion matrices

Fig. 18(a) shows the confusion matrix for the MVE approach, Fig. 18(b) shows the confusion matrix for the SRE approach, and Fig. 18(c) shows the confusion matrix for the WAE approach. In Fig. 19, a sample of the class 10_Toto is wrongly classified as a sample of the class 7_Motorvan, whereas in Fig. 20, a sample of the class 11_Minitruck is wrongly classified as a sample of the class 0_Car. These misclassifications may be attributed to the fact that, for some images, there are inter-class similarities among sample images belonging to different vehicle classes. Moreover, as there are very few still images of 10_Toto in our collected data, and 10_Toto and 6_Rickshaw have a significant amount of similarity in appearance, several samples of 10_Toto have been incorrectly labeled by the models. Adding more samples of 10_Toto to the training set would help the models learn to distinguish 6_Rickshaw from 10_Toto and address this problem. Again, images of the front side of an 11_Minitruck are quite similar in appearance to some 0_Car images, which also leads to the wrong classification of 11_Minitruck images.

6.7 Data visualization

In 2019, Selvaraju et al. [54] proposed a visual explanation algorithm, namely Gradient-Weighted Class Activation Mapping (Grad-CAM), that uses the gradients of any target concept to create a coarse localization map highlighting the image regions significant for prediction/classification. Grad-CAM [55] may be used with several kinds of CNN models, such as VGG-style networks with fully connected layers, visual question-answering CNNs for multimodal tasks, or CNNs for reinforcement learning. Grad-CAM may be seen as one of the first steps in the bigger scheme of interpretable or explainable AI, since the visualizations provide insights into failures and aid in the identification of bias while surpassing previous standards. This extension of the CAM algorithm also avoids the issues that backpropagation-based techniques face when upsampling and downsampling relevance maps to create coarse relevance heatmaps. Figure 21 shows the Grad-CAMs generated on some sample vehicle images by the five best-performing CNN models used in the present work: the first column depicts the original vehicle image, and the subsequent columns depict the Grad-CAMs generated by the Xception, InceptionV3, DenseNet121, MobileNetV2, and NasNetMobile models respectively.
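A compact Grad-CAM sketch in Keras is shown below; `last_conv_layer_name` must be set to the final convolutional layer of the chosen backbone (an assumption that varies per model), and the code follows the standard formulation of gradient-weighted activation maps rather than the authors' exact visualization script.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    # Map the input to the last conv layer's activations and the predictions
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)         # d(score) / d(activations)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))      # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)  # weighted sum of feature maps
    cam = tf.nn.relu(cam)                                # keep only positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()   # coarse map in [0, 1]
```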

Fig. 17

Training accuracy and validation accuracy curves achieved by the DenseNet121 model

Fig. 18

Confusion matrices produced by MVE, SRE, and WAE techniques on our developed JUIVCDv1 dataset

Fig. 19

Misclassified image 1: a sample of the class 10_Toto wrongly classified as 7_Motorvan

Fig. 20

Misclassified image 2: a sample of the class 11_Minitruck wrongly classified as 0_Car

Fig. 21

Grad-CAM-based data visualization using the five best base CNN models on the proposed JUIVCDv1 dataset

6.8 Results on PoribohonBD dataset

Tabassum et al. [56] created the PoribohonBD dataset for vehicle categorization based on vehicle images from Bangladesh. Sample vehicle images of this dataset are accessible at: https://data.mendeley.com/datasets/pwyyg8zmk5/2. We have considered this dataset for experimentation since the two countries, India and Bangladesh, have very similar road scenarios. The vehicle images were collected from two sources: a) social media and b) smartphone cameras. The collection includes 9058 tagged and annotated photos of 15 native vehicle classes commonly found on the roads of Bangladesh, including bus, three-wheeler rickshaw, motorcycle, truck, and wheelbarrow. In this dataset, data augmentation techniques are also used to keep the number of images comparable for each type of vehicle. Initially, we chose eight base CNN models for the primary study. After analyzing the results, we have observed that the DenseNet121 model attains the maximum accuracy value of 0.94, the ResNet152v2 model attains an accuracy of 0.91, the MobileNet model attains an accuracy of 0.90, the Xception model attains an accuracy of 0.88, the InceptionV3 model attains an accuracy of 0.85, the VGG16 model attains an accuracy of 0.67, the ResNet50 model attains an accuracy of 0.65, and the NASNetMobile model attains an accuracy of 0.60. Among them, three base CNN models, namely Xception, InceptionV3, and DenseNet121, are chosen for implementing the MVE, WAE, and SRE techniques. Among the ensemble methods, the MVE approach has achieved the highest accuracy score of 0.96, the SRE method has attained an accuracy score of 0.95, and the WAE method has attained an accuracy score of 0.95. Table 9 presents the performance of AVC given by both the base CNN models and the three ensemble models on the PoribohonBD dataset.

Table 9 Classification report showing the values of Precision, Recall and F1-score for eight base CNN models and three ensemble models applied on the PoribohonBD dataset

6.9 Limitations of the JUIVCDv1 dataset

Further analysis of the results gives some ideas about the complexities of the developed dataset. It would also help future researchers to work on this dataset and develop more advanced methods to deal with associated problems. Some major issues are as follows:

  • The training set of the dataset has a class imbalance, which is a major problem. A large proportion of the still images in this set are of cars and motorcycles, whereas only a small fraction contains images of Totos and bicycles. We have employed data augmentation techniques for the classes where samples are comparatively few.

  • There are only 33 Motorvan images in the training set, which may not be sufficient for the models to accurately learn this vehicle class, and the 11 Motorvan images in the test set are likewise too few for a reliable evaluation of this class. Similarly, adequate data is not available for the CNN models to properly learn bicycle class images. With less inequality across classes in terms of the number of images, we might have observed far better classification accuracy.

  • In our dataset, several Totos are misclassified due to the scarcity of sample images of this class in the dataset.

  • Roads in nations like India, Bangladesh, or Pakistan are not as good as those in the developed nations of Europe or America; traffic congestion as well as the breaking of traffic rules are quite common in these countries. These issues add inherent complexities to the captured images.

  • Sometimes, it becomes difficult to precisely characterize the vehicles in images taken in adverse conditions, such as those taken at night when illumination is compromised or those taken in rainy weather.

7 Conclusion

Nowadays, there is a large number of vehicles on the roads, and hence the need for AVC systems has become more significant for managing real-time traffic, especially in highly populated cities. A realistic image/video dataset portraying such traffic conditions is essential for this purpose. Researchers may utilize this dataset to assess the efficiency of their approaches for both automated localization and classification of vehicles. Though there are plenty of datasets available for vehicle localization, only a few of them can be used for the classification task, and very few of the available datasets accurately reflect real-world situations. For example, images taken in the Indian subcontinent frequently show two or more vehicles overlapping in a single frame due to traffic congestion. Researchers have therefore faced difficulties because of the distinctive features of Indian roads, such as the high volume of traffic, clogged highways, the poor state of the roads, and traffic congestion. In this study, we have created an image dataset suitable for AVC, keeping Indian roads in mind, to fill this research vacuum. We have included the necessary annotations for the assessment of AVC algorithms. This dataset is free to use for research purposes only. We have used eight deep learning models, namely Xception, DenseNet121, InceptionV3, MobileNetV2, VGG16, NasNetMobile, ResNet50, and ResNet152, for benchmarking our dataset. Additionally, we have applied three popular state-of-the-art ensemble approaches, namely MVE, WAE, and SRE, to enhance the classification accuracy on the developed dataset. We have achieved the best accuracy of 0.95 by using the MVE approach, which is satisfactory given the complexity of the images.

7.1 Future scope

  • Version 1 of the dataset has about 6k images, which might not be sufficient to train CNN models properly. Therefore, we would like to continue gathering images and videos to upgrade the dataset.

  • We will take steps in the future to maintain an equivalent number of samples in each vehicle class by collecting more images for classes such as Totos, vans, and rickshaws.

  • We are planning to collect images in various weather conditions including foggy, nighttime, rainy, etc.

  • We will attempt to include more vehicle classes that are commonly found on Indian roads.

  • Multi-view or multimodal datasets are not available for the classification of vehicles, and much research and data are required to make a practical solution for AVC. So, we plan to capture images for a multimodal dataset.