1 Introduction

Visual perception for recognizing objects, obstacles, and pedestrians is a core building block for efficient autonomous driving. Semantic segmentation has emerged as an effective perception method that aims to determine the semantic label of each pixel of an image (Siam et al., 2018). Thanks to the availability of rich scene segmentation datasets (compared in Fig. 2), significant technical progress has been made in this direction. However, several formidable challenges remain on the path to efficient autonomous driving in the wild.

Firstly, existing autonomous driving datasets (Brostow et al., 2009; Caesar et al., 2020; Cordts et al., 2016; Geiger et al., 2012; Sun et al., 2020; Yu et al., 2020) are not generalized; they cover the well-paved urban roads of developed countries, which represent 3.7% of the world's road infrastructure (Schwab, 2019) and barely serve 17% of the world's population (Gaigbe-Togbe et al., 2022). More recently, Segment Anything (Kirillov et al., 2023), the largest segmentation dataset with more than one billion masks for 11 million images, has been released to perform general-purpose segmentation tasks. However, despite being the largest in size, it only covers 0.9% of data samples from low-income countries. Consequently, these datasets have scant coverage of unstructured roadways containing hazardous road patches (i.e., distress, earthen, gravel) that are common in the developing world, as shown in Fig. 1. Such ambiguous road regions pose an enormous hazard to human drivers and lead to severe road accidents and fatalities. According to the World Health Organization (WHO), 1.3 million people die every year due to road accidents (WHO, 2020), with 93% of casualties occurring in low- and middle-income countries. The global road safety report points out that non-standard road infrastructure is a key reason for the higher road accident rates in these countries (WHO, 2019). Therefore, the under-representation of such challenging data in existing datasets is a critical omission for research on autonomous driving and indicates the need for a benchmark to improve autonomous driving in such challenging road scenarios.

Secondly, pixel-level annotation of images is excessively expensive; for Cityscapes, labeling an image took an hour on average (Cordts et al., 2016). This leads to smaller segmentation datasets than in other domains (Deng et al., 2009; Lin et al., 2014), consequently limiting the generalizability of the trained models. Although semi-supervised learning methods (Abdalla et al., 2019; He et al., 2019; Huang et al., 2018; Yu et al., 2022) have been proposed that leverage unlabeled data to improve learning, these methods suffer limitations because (i) segmentation datasets are often highly imbalanced, both in the pixel counts corresponding to each class (Rezaei et al., 2020) and in the physical scenarios in which the data are collected; the resulting models therefore perform significantly worse in uncommon physical scenarios (e.g., rare weather conditions and unstructured roads), which can be lethal in autonomous driving; (ii) biased predictions caused by data imbalance in the early semi-supervised training phase (He et al., 2019) lead to a higher misclassification rate during inference; and (iii) self-training segmentation models is computationally very expensive due to the large number of pseudo-labels (Wei et al., 2018). In this regard, there is a need for an efficient method to improve performance while considering accuracy-energy trade-offs. To address these challenges, we make the following contributions:

Fig. 1

Examples of our dataset images covering a wide array of roadways, varying across different lighting and weather conditions. Instead of considering the whole paved road region as one class, we distinguish the safe asphalt road region from its associated atypical classes found on unstructured roads, such as distress, wet surface, gravel, boggy, vegetation misc., crag-stone, road grime, drainage grate, earthen, water puddle, misc., speed breakers, and concrete road patches

  1.

    We introduce the Road Region Segmentation (R2S100K) dataset for autonomous driving, comprising a diverse set of 100K road images covering 1000+ KMs of challenging roadways, as shown in Fig. 1. R2S100K covers more challenging road categories and scenarios than existing datasets. Moreover, it serves as an initial step toward representing the unstructured roads prevalent in low-income countries, allowing more comprehensive stress-testing of foundational segmentation models for autonomous driving.

  2.

    We propose an unsupervised Efficient Data Sampling (EDS) method to sample a subset from the unlabeled training data, which offers three benefits: (i) EDS notably alleviates the data imbalance across physical scenarios, (ii) it improves the performance of supervised (0.71–6.72% MIoU) and semi-supervised (0.26–1.84% MIoU) models, and (iii) it significantly reduces annotation and training costs (75% fewer pseudo-labels and a 79% decrease in training time).

  3.

    The EDS is compatible with multiple learning frameworks (supervised, semi-supervised) and model architectures. It can be integrated with datasets such as Cityscapes, CamVid, and BDD100K due to a similar labeling schema.

The rest of the paper is organized as follows. Section 2 reviews related work on autonomous driving benchmarks and datasets for 2D visual object detection and scene segmentation, along with semi-supervised methods for these tasks. Section 3 presents our proposed R2S100K and the Efficient Data Sampling (EDS) enabled self-training method for drivable road region segmentation, which distinguishes safe and hazardous road regions. In Sect. 4, several state-of-the-art segmentation methods are evaluated to present a comprehensive benchmark study alongside the effectiveness of our EDS-based self-training settings. Section 5 discusses broader implications and future directions, and concluding remarks are summarized in Sect. 6.

Fig. 2

Comparison of dataset statistics with existing driving datasets, i.e., KITTI (Geiger et al., 2012), CamVid (Brostow et al., 2009), CARL-D (Butt & Riaz, 2022), IDD (Varma et al., 2019), Cityscapes (Cordts et al., 2016), A2D2 (Geyer et al., 2020), BDD100K (Yu et al., 2020), nuScenes (Caesar et al., 2020), Waymo (Sun et al., 2020), MVD (Neuhold et al., 2017), and Wilddash (Zendel et al., 2018). Our R2S100K covers more diverse road infrastructure and more challenging scenarios than the existing benchmarks. Therefore, our dataset can be used to develop more robust and generalized road segmentation methods for autonomous driving

2 Background

2.1 Autonomous Driving Datasets

In recent years, several datasets have been released to accelerate the development of visual perception algorithms. These datasets fall into two major groups: (i) object detection, which focuses on 2D/3D objects (Caesar et al., 2020; Dollár et al., 2009; Geiger et al., 2012; Huang et al., 2019; Sun et al., 2020; Xiao et al., 2021; Zhang et al., 2017); and (ii) scene segmentation, which focuses on semantic segmentation for scene understanding. We present a detailed comparison of these state-of-the-art datasets in Fig. 2, highlighting key attributes such as image resolution, the number of images, and the diversity of regions and road types. The comparison also emphasizes the differences between these datasets and the comprehensive nature of our R2S100K dataset in terms of diversity and applicability to unstructured roadways. Here we discuss some important characteristics of these datasets.

Object Detection Datasets: KITTI (Geiger et al., 2012) is one of the most widely used vision benchmark suites for object detection on urban roads and highways; it contains 15k images along with 200k annotations. Later, the Waymo open dataset (Sun et al., 2020) presented more than 23 million 2D and 3D bounding box annotations of 1150 inter-city urban scene segments. The nuScenes (Caesar et al., 2020) dataset presented 1.4 million 3D bounding box annotations of 1000 urban and suburban road scenes for 23 classes. In 2019, the ApolloScape dataset (Huang et al., 2019) was released, comprising 70k 3D annotations along with 160k semantic mask annotations of urban roads and highways under varying weather conditions. Similarly, Pandaset (Xiao et al., 2021) presented 1 million 3D bounding box annotations for object detection in urban traffic scenarios. Beyond these, various other datasets (Caesar et al., 2020; Dollár et al., 2009; Zhang et al., 2017) have been proposed and have played an important role in developing efficient object detection and recognition algorithms.

Semantic Segmentation Datasets: CamVid (Brostow et al., 2009) is considered among the pioneering scene segmentation datasets, comprising 700 fine annotations for 32 classes. In 2016, Cityscapes (Cordts et al., 2016) was released, containing 5000 fine and 20,000 coarse annotations of urban roads. In 2017, the Mapillary Vistas Dataset (MVD) (Neuhold et al., 2017), comprising 25K fine annotations of inter-continental yet urban scenes, was presented. Later, BDD100K (Yu et al., 2020) was released in 2020, providing 10K fine annotations of urban roadways.

Though these datasets provide enriched information on urban scenes for scene segmentation tasks, they do not cover the unstructured road conditions and hazardous road patches commonly encountered in developing countries. Therefore, models trained on these datasets cannot generalize to such challenging roadways. Besides urban driving, a few datasets have been released for visual perception in off-road driving scenarios. The OFFSEG (Viswanath et al., 2021) framework covers RELLIS-3D (Jiang et al., 2021), containing 6235 images, and RUGD (Wigness et al., 2019), comprising 7546 images of outdoor off-road driving scenes. Wilddash-v2 (Zendel et al., 2018) contains 4256 images and covers unstructured road classes like distress and gravel patches; however, it labels these classes under a single Road class rather than distinguishing them as safe and hazardous regions. Recently, the CARL-D (Butt & Riaz, 2022; Rasib et al., 2021) and IDD (Varma et al., 2019) datasets have also been released, providing annotations of urban and rural roads; however, they still lack the aforementioned hazardous road patches that can strongly influence the performance of autonomous driving models.

2.2 Scene Segmentation Methods

Fully Supervised Learning: Since the pioneering work of FCN (Long et al., 2015), significant progress has been made in developing deeper neural networks for semantic segmentation. A semantic segmentation model aims to predict the semantic category of each pixel from a given label set and segment the input image according to this semantic information (Long et al., 2015). The FCN outperformed conventional approaches by 20% on the Pascal VOC dataset. U-Net, proposed by Ronneberger et al. (2015) for segmenting biomedical images, combines a contracting path that captures context with a symmetric expanding path that enables precise localization.

Later, various supervised methods (Badrinarayanan et al., 2017; Chen et al., 2014, 2017a, b, 2018; Noh et al., 2015; Romera et al., 2017; Yu et al., 2018; Zhao et al., 2017, 2018; Zhang et al., 2022) were proposed to perform segmentation tasks efficiently. However, these methods employ deep CNNs as backbone networks and depend on large-scale annotated data, which takes an immense amount of time to produce, limiting the models' capacity to adapt and further improve segmentation performance.

Semi-Supervised Learning: Recently, semi-supervised learning methods have demonstrated better applicability in several segmentation domains, achieving state-of-the-art performance on several segmentation tasks by leveraging huge amounts of unlabeled data. In the literature, techniques such as video label propagation (Budvytis et al., 2017; Luc et al., 2017; Mustikovela et al., 2016), knowledge distillation (Xie et al., 2018; Liu et al., 2020), adversarial learning (Huang et al., 2018; Souly et al., 2017), and consistency regularization (Mittal et al., 2019) are employed to perform semi-supervised segmentation.

Recently, Chen et al. (2021a) proposed a consistency regularization method named Cross Pseudo Supervision, which enforces consistency between two differently initialized perturbed networks, effectively expanding the training data by pairing unlabeled data with pseudo-labels. In another work, Ouali et al. (2020) proposed a cross-consistency-based semi-supervised training method that enforces consistency between the main decoder's predictions and the auxiliary decoders' outputs, which enhances the encoder's representations and leads to improved results on standard benchmarks.

3 Methodology

In this section, we introduce a benchmark suite for our proposed Road Region Segmentation dataset (R2S100K) and describe our efficient self-training method for semantic segmentation tasks; Fig. 2 compares our dataset with existing datasets. Firstly, we describe R2S100K in terms of the methodology adopted for data collection, frame selection, labeling, and distribution. Secondly, we discuss the categorization of supervised and semi-supervised learning methods used to develop a benchmark suite for our dataset. Lastly, we present our EDS-enabled, teacher-student-based efficient self-training approach for addressing the data imbalance problem in semantic segmentation tasks.

Fig. 3

Examples of road types covered in existing autonomous driving datasets for visual scene segmentation. R2S100K covers more challenging/hazardous roads in both urban and rural areas, whereas most existing datasets focus on the well-paved road infrastructure of urban areas and do not distinguish between safe and hazardous road regions

Fig. 4

Statistical analysis demonstrating the diversity of the R2S100K dataset. (Left) Google Maps view of the routes covered for data collection. (Right) Different environmental and infrastructural characteristics: (1) timestamp, (2) weather conditions, and (3) road hierarchy. We cover over 1000 KMs of Pakistan's roadways, carefully including motorways, highways, general inter-city and intra-city roads, as well as rural and hilly areas, under different illumination and weather conditions

3.1 R2S100K

We present a large-scale R2S100K dataset to train and evaluate supervised/semi-supervised methods in challenging road scenarios. Our dataset can be distinguished from existing datasets in the following three major aspects:

Distribution Shift: The R2S100K dataset covers unique and adverse urban and rural road conditions (described in Table 1) that are commonly encountered while driving, especially in developing countries. In contrast, existing datasets such as KITTI (Geiger et al., 2012), CamVid (Brostow et al., 2009), Cityscapes (Cordts et al., 2016), A2D2 (Geyer et al., 2020), MVD (Neuhold et al., 2017), BDD100K (Yu et al., 2020), nuScenes (Caesar et al., 2020), and Waymo (Sun et al., 2020) represent well-developed urban roadways, as depicted in Fig. 3. Although IDD covers distressed and muddy road regions, it only distinguishes the mud class from the road and groups damaged road patches under a single road class. Moreover, the OFFSEG (Viswanath et al., 2021) framework primarily covers off-road driving scenes, whose representation differs significantly from unstructured roadways. Similarly, Wilddash (Zendel et al., 2018) covers distress and gravel patches under a single Road class rather than distinguishing them as safe and hazardous regions.

Diversity: R2S100K is constructed from road sequences captured over 1000+ KMs of Pakistan's roadways, considering diverse terrain, infrastructural features, and environmental attributes, as shown in Fig. 4. To ensure diversity, we primarily include motorways, highways, and urban traffic roads from Punjab, the largest province of Pakistan by population (approximately 127.474 million). Additionally, we extend our coverage to the rural and hilly areas of Khyber-Pakhtunkhwa, the second-largest province by population (approximately 35.53 million), under diverse illumination and weather conditions.

Generalizability: R2S100K covers a diverse range of road infrastructure, including well-paved asphalt roads along with the associated, uniquely hazardous road regions that we categorize as atypical classes, listed in Table 1. We assign distinct labels to our anomalous road classes while using the same labeling schema for the asphalt class as Cityscapes and BDD100K, enabling dataset integration for domain adaptation and semi-supervised learning.

Fig. 5

Distribution of road classes in R2S100K. Asphalt and concrete regions represent the safe drivable road regions and have higher representation than the hazardous road patches

3.1.1 Data Acquisition

Driving Platform Setup: A camera is mounted on the dashboard of a standard van at a height of 1.4 m from the ground and configured to a 16:9 aspect ratio to capture the full width of the road. A camera stabilizer is also installed to reduce the vibration effects of the vehicle.

Table 1 List of classes along with their definitions

Road Video Collection: We carefully followed the travel advisory issued by the government to identify diverse roadways. Based on the analysis, we defined a route plan to cover diverse infrastructure for data collection (as shown in Fig. 4) to ensure the inclusion of highways, expressways, and general roads of urban cities, rural and hilly areas.

Data Quality Control: We performed pre- and post-collection quality control (QC) to ensure high-quality data. In pre-collection QC, a data engineer sets up and monitors the camera's data stream while recording. In post-collection QC, our data engineers meticulously apply multi-step checks to manually identify and exclude distorted, over-exposed, blurry, and unclear video sequences from our dataset. Our quality check criteria encompass various factors, including but not limited to:

Ensuring Clarity: Firstly, we quantitatively assess the sharpness of individual frames within each sequence using the Structural Similarity Index and gradient magnitude. Frames with low sharpness scores, indicative of blurriness, are flagged for manual visual evaluation (see the sketch after this list).

Detecting Distortion and Blur: Secondly, we analyze the contrast and exposure levels of frames to identify instances of distortion or motion blur. Histogram analysis is utilized to evaluate the distribution of pixel intensities and detect anomalies related to over-exposure or under-exposure.

Assessing Relevance and Consistency: We prioritize frames that best represent diverse road conditions and scenarios to ensure the relevance and representativeness of our dataset while striving for uniformity among the images to maintain consistency across the dataset. Our team conducted rigorous visual inspections of each video sequence. Trained evaluators assessed the overall clarity, distortion, and visual fidelity of the frames, considering factors such as motion blur, lens aberrations, and compression artifacts.
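To make the first two automated checks concrete, the following is a minimal sketch assuming OpenCV BGR frames; the Sobel-based score stands in for our gradient-magnitude measure, and the thresholds are illustrative rather than the values used in our pipeline.

```python
import cv2
import numpy as np

def sharpness_score(frame_bgr):
    # Mean Sobel gradient magnitude; low values indicate blur.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return float(np.mean(np.sqrt(gx ** 2 + gy ** 2)))

def exposure_anomaly(frame_bgr, tail_fraction=0.4):
    # Histogram analysis: flag frames whose intensity mass piles up in the
    # darkest or brightest bins (under-/over-exposure).
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    hist /= hist.sum()
    return hist[:16].sum() > tail_fraction or hist[240:].sum() > tail_fraction

def flag_for_review(frame_bgr, min_sharpness=10.0):
    # Frames failing either check are routed to manual evaluation.
    return sharpness_score(frame_bgr) < min_sharpness or exposure_anomaly(frame_bgr)
```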

Data Distribution: After data collection under different illumination and weather conditions over 1000+ KMs of roadways, distorted/blurry/unclear sequences are excluded, and frames are selected from the remaining videos at 10 s intervals to avoid redundancy. The vehicle moves at varying speeds (120 km/h on motorways, 60–100 km/h on highways, 20–60 km/h within cities); speed variation, the exclusion of blurry sequences, and the 10 s sampling interval are therefore key to avoiding data redundancy. Lastly, EDS further minimizes the chance of sequential frames in the data. We aligned the video sequences when extracting frames so that diverse road scenarios are distributed equally. In this way, the 100K images of the R2S100K dataset are sampled from roughly 10 million captured frames.
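As an illustration of the interval-based frame selection described above, the sketch below extracts one frame every 10 s from a video file; the exact sequence-alignment logic used for R2S100K is more involved, so treat this as a simplified stand-in.

```python
import cv2

def sample_frames(video_path, interval_s=10.0):
    """Yield (frame_index, frame) roughly every `interval_s` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unreported
    step = max(1, int(round(fps * interval_s)))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield idx, frame
        idx += 1
    cap.release()

# Usage (hypothetical file name): frames = list(sample_frames("drive_0001.mp4"))
```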

3.1.2 Data Statistics

Labeled Data: The labeled set consists of 14,700 images with fine, layered polygonal annotations, realized in-house to ensure the highest level of quality. Firstly, annotators received extensive training sessions to familiarize them with the data categorization, classes, and annotation tool, ensuring consistency and accuracy. During training, similar data samples were distributed among the annotators to allow cross-verification, and the labeling strategy was refined iteratively based on inter-annotator agreement with respect to the road class definitions. Secondly, to avoid void spacing and erroneous class overlapping, images are labeled back to front so that no class boundary is dual-labeled. Owing to the diversity of the data, we categorized road regions into 14 distinct classes, as described in Table 1. Additionally, to further facilitate the annotators, we use SuperAnnotate (SuperAnnotate AI Inc., 2024) for labeling, a user-friendly tool well suited to autonomous driving tasks. In the post-annotation phase, random sampling and expert validation were performed to cross-evaluate the quality of the annotations and to identify and address errors, ensuring the correctness and reliability of the R2S100K dataset.

Unlabeled Data: The unlabeled set of our dataset contains 86,000 images, covering diverse road infrastructure. As shown in Fig. 4, our unlabeled set is collected under varying weather conditions and time periods to ensure diversity regarding downstream autonomous driving tasks.

Fig. 6

Our Efficient Data Sampling (EDS) based self-training framework. Firstly, raw data samples are clustered based on similarity in road classes among image encodings (shown in Fig. 7)—generated by an encoder. Then, a small subset is uniformly formed from all clusters for annotation to train the teacher model. After training, pseudo-labels of the unlabeled set are generated using the teacher model, and the student model is trained on real and pseudo-labeled sets to achieve better generalization

3.2 Training Fully Supervised Baseline Models

To analyze the effectiveness of R2S100K, we fine-tuned SoTA segmentation networks to leverage the representations of pre-trained weights learned from large-scale datasets, enhancing the generalizability of the models on our road region segmentation tasks. These models include FCN (Long et al., 2015), PSPNet (Zhao et al., 2017), FPN (Lin et al., 2017), LinkNet (Chaurasia & Culurciello, 2017), Deeplabv3+ (Chen et al., 2018), LRASPP (Howard et al., 2019), MaskFormer (Cheng et al., 2021b), and SegFormer (Xie et al., 2021), along with various backbone networks. These methods are trained using a set of human-labeled images (x, y), where \(x \in R^{H \times W \times 3}\) is a 3-channel RGB image and \(y \in R^{H \times W \times C}\) is the corresponding segmentation mask, with H and W the height and width of the mask and C the classes present in that mask. Following common practice (Zhu et al., 2019), a model M is trained using the cross-entropy loss, and IoU is used as the performance metric.
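A minimal PyTorch sketch of this supervised setup is given below, using torchvision's DeepLabv3 as a stand-in for the employed models. The optimizer, polynomial schedule, and hyper-parameter values follow Sect. 4.1; the class count and ignore index are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet101

NUM_CLASSES = 15  # e.g., 14 road classes + background (assumption)

model = deeplabv3_resnet101(num_classes=NUM_CLASSES)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 = void pixels (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=1e-4)

def poly_lr(base_lr, step, max_steps, power=0.9):
    # Polynomial decay for smooth learning (Liu et al., 2015).
    return base_lr * (1.0 - step / max_steps) ** power

def train_step(images, masks, step, max_steps):
    # images: (B, 3, H, W) float tensor; masks: (B, H, W) long class indices
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(1e-4, step, max_steps)
    logits = model(images)["out"]  # torchvision heads return a dict
    loss = criterion(logits, masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```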

Fig. 7

Visualizing examples of clusters (twelve clusters, three images each) produced by our EDS. EDS efficiently clusters images with respect to similarities in road texture, lighting conditions, and road scenarios

3.3 Improving Self-Training Using Unlabeled Data

Recently, a surge of interest has been observed in utilizing unlabeled data to scale up the adaptation of deep models in various segmentation tasks. Leveraging the large unlabeled set of our R2S100K, we carefully employ semi-supervised training methods to study the generalizability of these models. We take inspiration from Zhu et al. (2019) and employ a teacher-student-based self-training framework for road segmentation. The student-teacher framework offers a structured approach to transferring rich representations and intricate spatial relationships from the teacher to the student. This guidance is particularly beneficial in tasks like road region segmentation, where precise delineation of spatial boundaries is crucial. Unlike directly using a pre-trained CNN/transformer model, which may overlook the nuanced insights captured by the teacher, the teacher-student framework facilitates focused knowledge transfer, leading to improved performance and more accurate segmentation in complex real-world road scenarios.

Teacher-student-based self-training refers to an approach in which a large DL model (the teacher) is trained on real labeled data. A set of unlabeled images is then given as input to the trained teacher model, and the teacher's outputs are taken as pseudo-labels for the corresponding inputs. Finally, the real and pseudo-labeled data are combined to train a smaller or different DL model (the student) to learn representations from the whole dataset. Training the teacher model on real data guarantees its performance in generating pseudo-labels. We thus utilize a small labeled set along with a large unlabeled set to increase the accuracy of the trained model while mitigating the human effort of producing labels at scale. As in our supervised experiments, we fine-tune these models to leverage representations already learned from large-scale datasets for faster convergence.
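In code, the pseudo-labeling step reduces to a per-pixel argmax over the teacher's predictions. A sketch, assuming a torchvision-style model that returns a dict with an "out" key and a loader yielding (images, sample ids):

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, device="cuda"):
    # Hard pseudo-labels: per-pixel argmax over the teacher's softmax output.
    teacher.eval()
    for images, ids in unlabeled_loader:
        logits = teacher(images.to(device))["out"]  # (B, C, H, W)
        pseudo = logits.argmax(dim=1).cpu()         # (B, H, W) class indices
        yield ids, pseudo
```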

3.3.1 Efficient Data Sampling (EDS)

In semi-supervised segmentation, dealing with the data imbalance problem is highly challenging. In street scene segmentation, two key factors cause data imbalance: (i) class imbalance, which includes class-wise pixel imbalance (a typical image is largely occupied by sky and road, while classes like humans and bicycles account for far fewer pixels) and class-object confusion (some classes, e.g., bicycles, are more challenging to segment due to their complex shapes, occlusions, and faded representations) (Brostow et al., 2009; Cordts et al., 2016); and (ii) imbalance in physical scenarios, as highlighted in Fig. 4. Although both imbalances are equally important to address, class imbalance is a post-annotation issue that mainly depends on the underlying task and is generally easy to detect, e.g., by computing the confusion matrix of each class. On the contrary, imbalance in physical scenarios is a pre-annotation issue inherent to the (unlabeled) images. Further, physical scenarios under-represented in the training set are usually equally under-represented in the test set. Thus, detecting imbalances in physical scenarios is significantly challenging, let alone alleviating them. We identify a dire need for an efficient method to detect and alleviate data imbalance in physical scenarios at the pre-annotation stage to produce more balanced models for semantic segmentation tasks.

Figure 8 shows the KL-divergence from the uniform distribution of physical scenarios for two subsets of the original dataset: (i) a randomly sampled subset and (ii) an EDS-sampled subset. Ideally, a sampled subset should represent the different physical scenarios equally, resulting in a uniform distribution; for example, all times of day (Morning, Noon, Afternoon, and Evening) should be uniformly represented. Therefore, a lower KL-divergence of the sampled subset from the uniform distribution indicates a better sampling strategy. EDS notably reduces the imbalance across physical scenarios, as illustrated by the reduced KL-divergence in Fig. 8 and the consistently better performance of the models in Fig. 11.
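For reference, the KL-divergence from the uniform distribution plotted in Fig. 8 can be computed as below; the scenario counts are hypothetical and only illustrate the calculation.

```python
import numpy as np

def kl_from_uniform(counts):
    # KL(P || U): P is the empirical scenario distribution in a sampled
    # subset, U is uniform over the same categories; 0 means perfectly balanced.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = np.clip(p, 1e-12, None)  # guard against log(0) for empty categories
    return float(np.sum(p * np.log(p * len(p))))

# Hypothetical counts of [Morning, Noon, Afternoon, Evening] frames:
print(kl_from_uniform([120, 125, 130, 125]))  # ~0.0  -> balanced subset
print(kl_from_uniform([400, 60, 25, 15]))     # large -> imbalanced subset
```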

To address these issues, we propose EDS, as depicted in Fig. 6. We aim to equally represent different physical scenarios in the training data. In this regard, our EDS approach has two main stages: (i) data categorization, and (ii) data selection.

Data Categorization: Firstly, given an unlabeled dataset \({\mathcal {D}}_x\), for each \(x \in {\mathcal {D}}_x\) we extract a region-of-interest (ROI) mainly comprising salient road features, sidewalks, and pedestrians, while ignoring background such as sky. The extracted image \(\text {ROI}(x)\) is then processed through an off-the-shelf encoder network \(e(\cdot )\) to obtain encodings \(e(\text {ROI}(x))\). We use the encoder of a U-Net model built upon a VGG-16 ImageNet backbone, \(e: {\mathcal {R}}^{512\times 512\times 3} \rightarrow {\mathcal {R}}^{32\times 32\times 512}\). Note that, due to the prevalent data imbalance in segmentation datasets, the inherent biases of those datasets are also reflected in models trained on them, whereas models trained on ImageNet learn more generic features spanning 1000 classes and can be used for multiple downstream tasks. Using a biased encoder (trained on a street scene dataset) in EDS to mitigate biases in R2S100K would therefore be counter-intuitive.
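The sketch below approximates the encoder \(e(\cdot )\) with an ImageNet-pretrained VGG-16 truncated before its final pooling stage, which maps a 512\(\times \)512 ROI to 32\(\times \)32\(\times \)512 features as stated above; the preprocessing details are assumptions.

```python
import torch
from torchvision.models import vgg16

# VGG-16 features truncated before the last pooling stage:
# (B, 3, 512, 512) -> (B, 512, 32, 32), matching e(.) above.
encoder = vgg16(weights="IMAGENET1K_V1").features[:30].eval()

@torch.no_grad()
def encode_roi(roi_batch):
    # roi_batch: (B, 3, 512, 512) ROI crops, ImageNet-normalized (assumption)
    feats = encoder(roi_batch)
    return feats.flatten(1)  # (B, 512*32*32) vectors for clustering
```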

Fig. 8

KL divergence from the uniform distribution for both the EDS- and random-sampling-based data distributions

Data Selection: Secondly, the encodings \(e(\text {ROI}(x))\) of the unlabeled train set are fed to k-means to obtain k data clusters \(\{C_i\}_{i=1}^{k}\) based on similarities in the road surface. Finally, to maintain an equal distribution over all types of road representations, we uniformly sample n data instances from each cluster \(C_i\), so that our final dataset \({\mathcal {D}}^*_x\) has \(n \times k\) data samples. In typical settings, we choose \(n \times k = 3000\) to have a dataset size comparable to Cityscapes. Formally,

$$\begin{aligned} {\mathcal {D}}^*_{x} = \cup _{i=1}^k \{ x_j \sim C_i \}_{j=1}^n \end{aligned}$$
(1)

We choose k = 15 \(\times \) 20 = 300. Although k is a hyper-parameter, our rationale for k = 300 is to allow each of the 15 classes to be captured in 2 (sun/no sun) \(\times \) 2 (rain/no rain) \(\times \) 5 (road areas) = 20 clusters representing different scenarios. In Fig. 9, we analyze the influence of changing k on the average sampling; the change in k does not noticeably alter the sampling pattern. Therefore, setting k = 300 ensures balanced sampling of physical road scenarios while keeping weather conditions and road classes in view. To compare EDS with random sampling, we sample 500 images from the original dataset using each method and compute the probability density of each physical scenario based on the two sampled subsets. Ideally, all labels should have a uniform density, signifying equal representation in the dataset. We therefore compute the KL-divergence between each probability density and the uniform distribution in Fig. 8. The results show that EDS significantly reduces data imbalance compared to random sampling.
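A compact sketch of the two EDS stages, clustering followed by uniform per-cluster sampling (Eq. 1), assuming pre-computed encodings; the handling of clusters with fewer than n members is our own choice here.

```python
import numpy as np
from sklearn.cluster import KMeans

def eds_sample(encodings, n_per_cluster=10, k=300, seed=0):
    # encodings: (N, D) array of flattened e(ROI(x)) features.
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(encodings)
    rng = np.random.default_rng(seed)
    picked = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        take = min(n_per_cluster, len(members))  # small clusters give fewer samples
        picked.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(picked)  # indices of ~n*k images to send for annotation

# e.g., n_per_cluster=10 with k=300 yields ~3000 images, matching our setting.
```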

Fig. 9

Demonstrating the effect of change in k on data imbalance regarding physical scenarios

3.3.2 Student–Teacher Method for Segmentation Task

Our self-training framework is illustrated in Fig. 6. Based on its performance in supervised learning, the best-performing model is selected as the teacher model T, which is used to generate pseudo-labels y for our unlabeled set of images x. The teacher is first trained on the labeled set using the cross-entropy loss in Eq. 2; hard pseudo-labels are then obtained as one-hot encodings of the class labels derived from the teacher's predictions \(p_{{\mathcal {T}}}(x)\).

$$\begin{aligned} L_{{\mathcal {T}}} = -\sum _{i=1}^{N} {\textbf{y}}_i \log (p_{{\mathcal {T}}}({\textbf{x}}_i)), \end{aligned}$$
(2)

where N denotes the number of labeled samples. \(y_i\) is the one-hot encoding of class labels, while \(p_{{\mathcal {T}}}\) represents softmax predictions from the teacher model containing class probabilities.

We demonstrate various examples of our teacher-generated pseudo-labels in Fig. 10. Thanks to our well-performing teacher model, the quality of the teacher-generated pseudo-labels over the unlabeled set is close to that of human-annotated labels despite a large domain gap. We therefore combine the pseudo- and real-labeled sets to train the student model S. Thanks to the generalizability of our proposed self-training pipeline, any DL-based segmentation model can be used as a student model irrespective of its network architecture (explained in Sect. 4.6). Following the practice adopted in supervised learning, the objective is to minimize the cross-entropy given in Eq. 3.

$$\begin{aligned} L_{S} = - \sum _{i=1}^{N} \mathbf {y_i} \log (p_{{\mathcal {S}}}(\mathbf {x_i})) - \sum _{j=1}^{M} \mathbf {y'_j} \log (p_{{\mathcal {S}}}(\mathbf {x'_j})) \end{aligned}$$
(3)

Here, M denotes the number of unlabeled samples and \(p_S\) represents the softmax predictions of the student model containing the class probabilities. By training on the hard pseudo-labels generated by the teacher model, the predicted class probabilities of the student model become near one-hot; the entropy on unlabeled data is thus minimized through the cross-entropy loss.
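Eq. 3 translates directly into code. A sketch assuming torchvision-style student models and hard pseudo-labels from the teacher:

```python
import torch.nn.functional as F

def student_loss(student, x_real, y_real, x_unl, y_pseudo):
    # First term of Eq. 3: cross-entropy on human-labeled data.
    # Second term: cross-entropy on teacher-generated hard pseudo-labels.
    loss_real = F.cross_entropy(student(x_real)["out"], y_real)
    loss_pseudo = F.cross_entropy(student(x_unl)["out"], y_pseudo)
    return loss_real + loss_pseudo
```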

Fig. 10

Demonstration of our teacher-generated pseudo-labels over diverse roads. Our teacher model provides reasonable segmentation predictions

It is worth noting that the teacher model may generate noisy or incorrect pseudo-labels for rare or challenging scenes, which can significantly impact the training of student models and ultimately hinder overall performance. Therefore, we adopted a feedback-based training and evaluation approach to achieve maximum accuracy. The student model is first trained with real and pseudo-labeled data and then evaluated on the real validation set, which is common to both the teacher and student models. In the second step, the data engineers perform an error analysis based on IoU and confidence thresholding on the validation set to identify the sources of misclassification. After completing the error analysis, the training set is regularized using EDS, where data samples of the identified classes are augmented to improve convergence.

Table 2 Evaluation of baseline segmentation methods trained on randomly sampled subsets of different sizes from the actual train set of the R2S100K dataset for supervised learning

4 Experiments and Results

Firstly, we briefly describe the implementation details regarding hyper-parameter selection for training and evaluating supervised and semi-supervised learning methods (Sect. 4.1). We then organize our experiments as follows. In Sect. 4.2, we analyze the performance of supervised learning methods and compare random data sampling with our proposed EDS method. Section 4.3 evaluates the performance of standard student-teacher self-training leveraging our unlabeled data. In Sect. 4.4, we select the best-performing model as the teacher and evaluate the efficacy of EDS-based self-training for the student model with different ratios of unlabeled data samples. In Sect. 4.5, we compare our approach with related self-training methods and evaluate cross-domain generalization with the same categories on state-of-the-art autonomous driving datasets, including Cityscapes, CamVid, IDD, and CARL-D. Lastly, in Sect. 4.6, we analyze the generalization to other student models irrespective of their network architectures.

4.1 Basic Settings

Following the training practices of Cityscapes and BDD100K, the learning rate is set to 0.0001 for fine-tuning with SGD as the optimizer. As per conventional practice (Liu et al., 2015), a polynomial learning-rate schedule is used to smooth learning, and the batch size, momentum, and weight decay are set to 8, 0.9, and 0.0001, respectively. An Nvidia RTX 3060 GPU is used to perform the experiments. The number of training epochs is set to 200 with a validation patience of 10 epochs. In R2S100K, each class is divided into three portions, 60% for training, 30% for validation, and 10% for testing, to ensure a balanced representation of classes across the training, validation, and test sets, facilitating fair model evaluation and performance comparison. Evaluation is done using pixel accuracy, precision, recall, F1-score, and the standard Jaccard index (Eq. 4), where TP, FP, and FN refer to the numbers of true positive, false positive, and false negative pixels, determined over the test set.

$$\begin{aligned} \text {IoU} = \frac{TP}{(TP+FP+FN)} \end{aligned}$$
(4)
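For completeness, per-class IoU can be computed from a confusion matrix as sketched below; mIoU is the mean of the per-class values over the classes present in the test set.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    # pred, gt: flat integer arrays of predicted / ground-truth class ids.
    valid = gt < num_classes  # drop void/ignored pixels
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def per_class_iou(cm):
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1)  # IoU = TP / (TP + FP + FN)
```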

4.2 Performance of Supervised Learning with EDS

Fig. 11

Comparative analysis of baseline segmentation methods using standard data sampling (STDS) and our EDS. Our efficient data sampling method significantly improves supervised learning for semantic segmentation tasks

We employed FCN (Long et al., 2015), PSPNet (Zhao et al., 2017), FPN (Lin et al., 2017), LinkNet (Chaurasia & Culurciello, 2017), Deeplabv3+ (Chen et al., 2018), LRASPP (Howard et al., 2019), MaskFormer (Cheng et al., 2021b), and SegFormer (Xie et al., 2021) with various backbones on R2S100K. We analyze baseline segmentation methods over several labeled subsets (1k, 3k, 5k, 7k, and 9k), randomly sampled from the actual 9000 train images. The primary motivation for adopting this training scheme is that in autonomous driving, cameras are mounted in a center-straight orientation at the front of the vehicle (as in Cityscapes, KITTI, BDD100K, IDD, and CARL-D) to capture the frontal view. As a result, the data, whether video or frames, is sequential in nature with a static frame structure, i.e., the road in the lower-center, buildings or trees on the left and right, and the sky covering the top-center of the image. Secondly, the major part of the road is asphalt and concrete, with chunks of the other hazardous classes defined in Table 1 and demonstrated in Fig. 1. As Fig. 5 shows, asphalt and concrete road regions cover 39% of the labeled pixels in the dataset, which leads to highly unbalanced data. Considering these factors, we hypothesize that using such unbalanced data may cause overfitting, ultimately hindering the model's generalization performance. From Table 2, it can be seen that the employed models trained on 1k images experience the worst performance due to under-fitting. However, their performance improves significantly with the 3k train set. Interestingly, the employed models start saturating when trained on the larger train sets, i.e., 5k, 7k, and 9k samples, and do not improve further because of the similarity in road pavement across training samples.

We further analyze the performance of the employed models using two data sampling methods: the standard training data selection (STDS) method, in which data samples are randomly selected based on their occurrence, and our proposed EDS method. It is clear from Table 2 that segmentation methods perform well with a 3k training set. Therefore, in STDS we randomly select 3000 labeled images from the train set based on frequency of occurrence. In contrast, using our EDS, we first cluster all images based on their representation similarities, as shown in Fig. 6, and then uniformly sample 3000 labeled images from all clusters to form a representative sub-training set.

The results illustrated in Fig. 11 show that our EDS method significantly improves learning in segmentation tasks. For instance, Deeplabv3+ with ResNet-101 achieved the highest mIoU, 62.86%, using our EDS method, which is 6.72% higher than its baseline trained with the STDS method. A major reason for this performance increase is that many informative data samples are ignored during random selection, so the training data becomes highly unbalanced, ultimately leading to inefficient training and poor generalization; consequently, the resulting model does not perform well on test data. In our EDS method, training samples are uniformly selected based on their class representations. Therefore, the network learns an equal distribution of features from each class, boosting the performance of trained models on test data. A class-wise comparison of state-of-the-art segmentation models is shown in Table 3.

Table 3 Segmentation results (in percentage) of baseline fully-supervised models using EDS on our R2S100K dataset

4.3 Effectiveness of Student–Teacher Self-Training

Based on its higher performance in supervised learning, we select DeepLabV3+ with ResNet101 as the teacher model to initiate self-training. Firstly, we generate pseudo-labels for several subsets of the unlabeled set, as shown in Table 4. Then, a student model, i.e., PSPNet, is trained on the real and pseudo-labeled sets. From Table 4, it can be observed that utilizing pseudo-labels significantly improves segmentation models, indicating that they can be improved without large-scale labeled data.

Table 4 Evaluation of EDS-ST on R2S100K with different subsets of real and pseudo-labeled data

4.4 Effectiveness of EDS-Based Self-Training

Following supervised learning, we used STDS and EDS to analyze efficient training and its impact on the inference of student models. The results are summarized in Table 4, and we make several observations. Firstly, EDS significantly improves student models, with an average increase of 4% MIoU; using EDS for training segmentation models is therefore preferable. Secondly, EDS can be used as a generic approach to train teacher models efficiently; from Fig. 11, it is clear that EDS improves the teacher model by 4%. Thirdly, EDS is necessary to achieve better results when pseudo-labels dominate the training set, such as the 16k/32k sets; otherwise, the performance of the models starts declining. For instance, student models trained without EDS on the 16k and 32k pseudo-labeled sets dropped by 0.8% because redundant training samples contribute bias toward classes with more pixels at the expense of classes with fewer. EDS efficiently handles this data imbalance and thus improves the performance of student models compared with the STDS approach, as shown in Fig. 12.

Fig. 12

Visualizing the comparison of best-performing student methods on R2S100K. Results demonstrate that EDS-based self-training is a better approach to effectively handle class confusion in complex road scenarios. The labels are the same as in Fig. 1

Table 5 Evaluation of self-training methods on R2S100K
Table 6 Analyzing semi-supervised methods on R2S100K
Table 7 Generalizability of student methods irrespective of different backbone network architectures on R2S100K

In addition, student models trained with more pseudo-labels (16K, 32K) improve only marginally compared to models with fewer pseudo-labels (2K/4K). With fewer pseudo-labels, the model learns more informative features, as variable data samples are clustered based on similar representations by EDS. However, with more pseudo-labels, a vast range of sequential data samples is selected from each cluster, which causes the model to saturate instead of learning new information. On the other hand, EDS ensures the selection of distinct samples and helps the model refine mask boundaries, ultimately benefiting dense tasks.

4.5 Comparison with Related Self-Training Methods on R2S100K, Cityscapes, and CamVid

Here we describe a comparative analysis of existing self-training methods. As shown in Table 5, our EDS outperforms other self-training methods (Abdalla et al., 2019; Lee et al., 2013; Wang et al., 2022; Zhao et al., 2023; Zou et al., 2019, 2018) on R2S100K, as well as on Cityscapes and CamVid. On R2S100K, consistency regularization achieved 53.70% mIoU, considerably worse than all of the self-training methods, as the model learns from inaccurate predictions in the first stage of training, leading to inaccurate inference on test data. Similarly, in the case of teacher fine-tuning, we observe that the model gets stuck in a minimum at an early stage of fine-tuning; as a result, it starts overfitting instead of learning new information. We also notice that Wang et al. (2022) struggles to distinguish hazardous road regions in R2S100K due to high textural similarities among classes, leading to a higher misclassification rate. In contrast, we first select training data samples using the EDS approach to train a teacher model with considerable accuracy and use it to produce pseudo-labels for our unlabeled data; its performance therefore improves consistently throughout the training process. Our framework is purely generic: using our approach, a teacher model can train any student model irrespective of architectural differences, showing its generalization capability. The performance of EDS is shown in Table 6.

Fig. 13

Comparison between our method, the Segment Anything Model, and DINOv2. The colors of labels in the R2S100K examples are the same as in Fig. 1

4.6 Generalization to Other Student Methods

Another benefit of EDS-based self-training is that the teacher and student models do not need the same architecture. Our framework is a generic pipeline that clusters data based on representations; data samples are then uniformly selected to ensure data balance for training a teacher model, which generates pseudo-labels that improve the accuracy of the student model. In particular, we used DeepLabV3+ with ResNet101 as the teacher model and trained several student models, including BiSeNet, PSPNet, LRASPP, LinkNet, FeedFormer, SegNeXt, and U-MixFormer, with different backbone networks. These models were selected for their wide adoption in segmentation tasks. The results in Table 7 demonstrate that EDS-based self-training can significantly improve student models irrespective of their architectures. Comparatively, PSPNet with ResNet101 outperformed the other segmentation networks using the EDS approach.

5 Discussion and Future Direction

The annual report of the World Economic Forum (WEF, 2019) indicates that the road infrastructure of Pakistan achieved a road quality index of 4, which is similar or comparable to that of many high-, middle-, and low-income countries, including but not limited to India, Russia, Kuwait, Romania, Indonesia, Bulgaria, Brazil, Malta, Iceland, Hungary, Czechia, Slovakia, Ukraine, Moldova, Georgia, Jordan, and Venezuela. These statistics indicate that models trained on R2S100K can generalize to the road infrastructure of these and other countries with similar road quality indexes. According to the Global Status Report on Road Safety 2023 (WHO, 2023), the reporting countries collectively account for nearly 68 million km of roads, of which 4.5 million km are paved expressways, 47 million km are paved interurban roads, and 10 million km are unpaved inter-urban roads. Among these, 80% of the roads of the reporting countries do not meet a minimum 3-star rating for user safety due to non-standard road infrastructure. Consequently, 92% of road traffic deaths occur in low- and middle-income countries, which share a similar socio-economic status.

The R2S100K dataset provides a diverse set of road images covering challenging roadways, including hazardous road patches that are common in developing countries. This diversity enhances the utility of the dataset for training and evaluating autonomous driving perception systems. By distinguishing safe asphalt road regions from hazardous road regions, the dataset offers finer-grained semantic segmentation labels, which can improve the accuracy of perception models in distinguishing between different types of road surfaces. Moreover, R2S100K addresses the under-representation of challenging road scenarios in existing datasets, thereby improving the efficiency of autonomous driving research by providing a more comprehensive benchmark for evaluating perception models.

R2S100K is a first attempt at providing labeling for semantic road region segmentation tasks. The current version of R2S100K contains one hundred thousand images, including 15K labeled and 85K unlabeled images. Pixel-by-pixel image labeling is well known to be costly and time-consuming; we address this research gap with our EDS-based self-training method. With recent advances in computer vision, particularly the release of foundation segmentation models such as Segment Anything (Kirillov et al., 2023), DINOv2 (Oquab et al., 2023), and InternImage (Wang et al., 2023), such models are being adopted by the community for auto-labeling. Though these models have achieved significant performance on well-known classes, they under-perform on R2S100K and cannot segment its classes, as shown in Fig. 13, due to the under-representation of such data in the SA dataset, which covers only 0.9% of data samples from low-income countries (Kirillov et al., 2023). Therefore, we manually labeled the data utilizing our data engineers' expertise and proposed an EDS-based self-training method to efficiently utilize the unlabeled data in improving the model. In the future, we aim to integrate additional information modalities, such as odometry and Lidar point clouds, into the dataset.

6 Conclusions

In this paper, we presented R2S100K for drivable road region segmentation on unstructured roadways, together with a self-training framework that improves semi-supervised learning for segmentation tasks. Results demonstrate that our proposed method improves supervised and semi-supervised learning for semantic segmentation through its effective handling of class confusion in complex road environments. Our training framework will facilitate research in various ML applications where generating labeled data is a bottleneck. In the future, we will extend the annotations to encompass lane markings, the surrounding environment and infrastructure, and vehicles.