1 Introduction

Autonomous driving has become one of the most appealing areas of the automobile industry over the last few decades. Computer-based cognitive methods such as artificial intelligence (AI), computer vision (CV), and machine learning (ML) have had an immense impact on various industries, and implementing these technologies in the automobile sector may lead to revolutionary change in the coming years. Many pioneers such as BMW, Volvo, Google, and Stanford University are working on prototypes and are expected to launch such vehicles in the near future [1]. Setting aside the legal, social, and technical aspects of this domain, full deployment of autonomous vehicles in countries like India still seems far away due to their unpredictable and challenging road environments. According to the World Health Organization (WHO), the road traffic death rate per 100,000 population for India was estimated at 22.6 in 2016, 36% higher than the 2013 estimate [2]. The causes of these deaths vary, but they primarily include careless driving, poor road structure, poor lighting conditions, and the inability to comprehend the driving environment effectively. Employing AI-enabled vehicles may help decrease the death rate from road accidents, because a technologically equipped system that is trained and tested in real-time scenarios can drive with high precision and sound decision-making. Such decision-making demands that more significant features be extracted, and many applications [3,4,5] already exploit the expertise of neural networks and their variants. Moreover, the potential of various neural networks can be further enhanced using recent [6,7,8,9,10] and benchmark evolutionary approaches, which can in turn be applied to autonomous driving systems. Similarly, texture and standard feature-mapping techniques are gaining popularity in pattern recognition [11, 12] across different image modalities and can be fused with vision sensor-based autonomous driving. Likewise, reinforcement learning (RL) [13] can enable an intelligent vehicle, acting as an agent, to learn control strategies for accomplishing its short- and long-term goals.

The objective of an autonomous vehicle system is to provide effective assistance to manual driving while maintaining a safe driving environment. Safety precautions enabled by these technologies not only save human lives but also raise society's living standards and add economic value to the nation. State-of-the-art methods from various researchers have enabled many functionalities to be embedded in the design of intelligent vehicle systems. Most are sensor based, and among the sensors used, vision sensors are gaining popularity. An intelligent vehicle system is primarily designed for specific, designated settings, such as roads, traffic, signals, and lanes, in which it must drive under standard conditions. The objective of this article is to investigate the practical challenges and highlight the future trends and possibilities in this domain. The major phases are as follows:

  A. Road Detection: In a standard road traffic system, an intelligent vehicle has to follow the drivable area. The purpose of this phase is to recognize the path to proceed in the driving direction.

  B. Lane Detection: Though the road is the core drivable area for a vehicle, lanes define the portions of the road on which different categories of vehicles drive. The division of the road into lanes is based on speed limits through a lane-marking scheme.

  C. Pothole Detection: Potholes are damaged areas or unstructured holes on road surfaces caused by material deterioration, poor construction materials, heavy traffic, etc. Such unstructured potholes must be identified to avoid hazards.

  D. Pedestrian Detection: This phase deals with the most precious element of society: humans. A pedestrian is a person who walks on the pathway near the road. Sometimes a pedestrian has to cross the road to reach the other side; an intelligent vehicle will face this situation and is expected to achieve zero collisions between the two participants.

  E. Vehicle Detection: Different categories of vehicles appear in real road scenarios. Their accurate detection is important for making runtime decisions to avoid collisions, particularly during lane changes.

  F. Traffic Light/Signal/Gesture Detection: In most countries, traffic lights and sign boards are the key components directing traffic. In their absence, traffic police control the traffic. In both cases, an intelligent vehicle system must detect these directing components.

1.1 Motivation

Light detection and ranging (LiDAR) has shown solid performance, especially in land surveying, mining, and transportation development. Autonomous vehicles use LiDAR as a major sensor because its 3D mapping capability can guide the vehicle. However, the collected 3D point-cloud data are complex, and identifying patterns in them is challenging. Moreover, the device is rather expensive, so pursuing driving intelligence with it may not succeed and leads to high operational costs. In contrast, the less expensive camera/vision sensor offers favorable operational behavior in autonomous vehicles and is becoming more popular than other sensors in various applications [14]. Tesla's CEO, Elon Musk, has already pushed the adoption of computer vision in the company's autonomous vehicles and stated that “LiDAR does not make any sense” for the future [15, 16].

Many researchers have focused on scene perception using LiDAR and similar ranging sensors. However, obtaining an optimal trade-off among operational scheme, cost, and adaptability for the commercialization of these vehicles using LiDAR is highly costly. While the importance of having multiple sensors is not disputed, camera sensors have the potential to achieve comparable performance. Thus, in this paper, beyond the traditional phases of intelligent vehicle systems (IVS), we aim to explore several other significant phases from the perspective of camera sensors.

1.2 Contributions

Throughout this survey, a large number of studies on different vision sensor-based feature extraction schemes have been assembled and organized to provide insights into unexplored areas of intelligent vehicle systems. Moreover, the challenges and shortcomings of the reviewed studies and methods suggest upcoming research directions toward more transparent, richer, and more practical designs for forthcoming generations of autonomous vehicles. Our contributions include:

  • A systematic survey of the different phases of intelligent vehicle systems, elaborated in terms of

    • the potential of vision sensor-based approaches,

    • standard feature extraction schemes using intelligent techniques, and their performance, and

    • unexplored phases of autonomous driving.

  • Recognition of the operational limitations of different methodologies and of possible challenges in existing studies on autonomous vehicles. The limitations and challenges identified in the different phases can be incorporated into the future development of autonomous vehicles.

  • Identification of optimal and meta-heuristic approaches and perspectives for the design and decision-making processes of intelligent vehicle systems, which may ensure fewer road traffic deaths compared with human driving.

The rest of this paper is organized as follows: Sect. 2 contains a phase-wise review of research studies on intelligent vehicles. Section 3 describes the different datasets used in the literature. Section 4 presents the discussion and conclusion, including challenges identified in the literature and future directions.

2 Phases of intelligent vehicle system

An intelligent vehicle on the road must perform several challenging tasks to reach its destination. These tasks are categorized as different phases of an intelligent vehicle system. The generalized phases of the system are shown in Fig. 1, which covers basic and state-of-the-art methods from many authors in two categories, i.e., deep architecture-based and mathematical model-based approaches. The major phases of this article are discussed as follows.

Fig. 1 Major phases and methods of an intelligent vehicle

2.1 Road detection

A defined road structure makes traveling easy for vehicles. The vehicle has to drive on a defined area of the road to avoid any hazardous situation. However, road structure becomes poor over time, and detection of the drivable areas and their boundaries becomes challenging for an intelligent vehicle.

The primary phase of the intelligent vehicle is to understand the scene and detect the drivable region on the road. Several methods in road detection are discussed as follows:

2.1.1 Deep architecture-based approach

A well-defined road has clear boundaries and a uniform structure, and several characteristics of the road must be known for an intelligent vehicle to detect it. Geng et al. proposed an approach to achieve robust and accurate road detection using a convolutional neural network (CNN) and a Markov random field (MRF) [17]. In this work, local features were extracted from road images through the simple linear iterative clustering (SLIC) algorithm in the segmentation phase. The image was divided into a number of clusters, and an optimal number of clusters was chosen: too few clusters lack accuracy, while too many require complex calculations. Thus, 320×240 images were divided into 300 superpixels. The exterior frame of each superpixel was obtained and resized to N×N. With the depth information of these superpixels, a CNN consisting of convolution, down-sampling, and fully connected layers was applied to learn significant features and determine the road object. The CNN's classification result was fed into the MRF to capture the features of neighboring pixels and optimize road detection accuracy.
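As a concrete illustration of this superpixel step, the following is a minimal sketch using scikit-image's SLIC implementation; the 300-superpixel setting follows the paper, while the patch size N, the compactness value, and the helper name are assumptions for illustration.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.transform import resize
from skimage.util import img_as_float

def extract_superpixel_patches(image, n_segments=300, patch_size=32):
    """Segment an RGB road image into superpixels and crop a fixed-size
    patch around each one, ready to feed a patch-level CNN classifier."""
    img = img_as_float(image)
    labels = slic(img, n_segments=n_segments, compactness=10)
    patches = []
    for lbl in np.unique(labels):
        ys, xs = np.nonzero(labels == lbl)
        # Exterior frame (bounding box) of the superpixel, resized to N x N.
        patch = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        patches.append(resize(patch, (patch_size, patch_size)))
    return labels, np.stack(patches)
```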

In most research, CNN-based architectures perform well but suffer from the higher-level inconsistencies that commonly occur in segmentation. To deal with this issue, Zhu et al. proposed DeepLabv3 as a segmentation model and Pix2Pix as a generative adversarial network (GAN) model for road detection [18]. In the GAN's combination of generator and discriminator networks, the generator was built as an encoder-decoder network operating convolution and deconvolution layers, with the objective of iteratively discarding less significant features up to a certain threshold. The discriminator network then acted as a supervisor to the generator in the optimization phase to classify the road image. The learning from the GAN was preserved in an auxiliary loss module to fine-tune the semantic model. The second segmentation phase uses convolution filters with updated learning rates to deal with objects of multiple scales. In CNN operations, successive pooling and subsampling significantly reduce the spatial resolution of feature maps and cause a coarse-to-fine feature problem. Though deconvolution and interpolation layers upsample the feature maps, the refinement of significant features remains inadequate at this stage. To solve this issue, Ning et al. proposed an approach for fast semantic image segmentation [19]. The initial phase, called HDBlock, contains multilevel 3×3 convolutions with a dilation scheme that produces precise localization of target objects. The subsequent phase includes an integrated coarse-to-fine block, 'CFBlock', that upsamples and refines the obtained features simultaneously. The deconvolution layer was used in the upsampling procedure, with collaborative phases retaining the significant features. The authors validated this concept on the public datasets CamVid [20, 21], Cityscapes [22, 23], and KITTI [24].
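The dilation idea behind such a block can be sketched briefly. The following is a minimal PyTorch illustration of stacked dilated 3×3 convolutions, not the authors' exact HDBlock; the channel widths and dilation rates are assumptions.

```python
import torch.nn as nn

# Increasing dilation widens the receptive field without pooling, so the
# feature maps keep their full spatial resolution (padding = dilation).
hdblock_like = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
)
```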

Semantic segmentation's contribution to scene understanding and to inferring relationships among objects is gaining popularity, but its outputs often appear hazy. The reason for such vagueness is the improper use of max-pooling and subsampling layers in deep architectures, which reduces the resolution of the significant feature maps. To deal with this low-resolution issue, Badrinarayanan et al. proposed SegNet, a deep encoder-decoder model for scene segmentation [25]. The encoder network contains 13 convolution layers, followed by a decoder network in which feature maps are formed by convolving the encoder outputs. To minimize loss during feature map formation, the encoder network obtains and stores boundary-region attributes before the subsampling phase. In each encoder map, only the indices of the maximum feature values were stored, for optimal memory utilization, while the decoder network upsamples these feature maps and convolves the outcomes to generate dense feature maps. The final output of the decoder network is then processed by a softmax classifier to identify the class with the maximum probability score.
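The index-reuse mechanism at the heart of SegNet can be shown compactly. Below is a minimal PyTorch sketch of a single encoder-decoder stage with illustrative channel widths, not the paper's full 13-layer VGG-style encoder.

```python
import torch
import torch.nn as nn

class TinySegNetStage(nn.Module):
    """One encoder/decoder stage: the encoder stores max-pooling indices and
    the decoder reuses them for non-parametric, boundary-preserving upsampling."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU())
        self.pool = nn.MaxPool2d(2, 2, return_indices=True)   # keep indices
        self.unpool = nn.MaxUnpool2d(2, 2)                    # reuse indices
        self.dec = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(),
                                 nn.Conv2d(64, num_classes, 3, padding=1))

    def forward(self, x):
        f = self.enc(x)
        p, idx = self.pool(f)          # boundary detail kept as sparse indices
        u = self.unpool(p, idx, output_size=f.size())
        return self.dec(u)             # per-pixel class scores (softmax at loss)
```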

Segmentation techniques detect well-defined objects but may fail when an object lacks a definite shape and boundaries. To handle these issues, Yadav et al. proposed a pre-trained SegNet model that deals with unstructured road input using a color-lines model [26]. SegNet was applied to develop two distinct color-line models: the first represents the road as the foreground object, while the second represents the remaining area as background. The mean and variance of the RGB colors were computed for probability scores and used in a conditional random field (CRF) that creates a graph node for each pixel. This work used images from the publicly available KITTI [24] and CamVid [20, 21] datasets.

Another CNN-based approach is presented in [27], where the authors introduced 'RoadNet' to detect the road. This architecture was implemented on an 80 mW hardware accelerator for road scene detection. Typical CNN architectures employ a bottom-up assembly in the final layers, which holds the spatial information for the entire processing. To enhance detection performance, 'RoadNet' adopted a top-to-bottom structure with the idea of keeping the positional information of road objects. Spatial information was incorporated by applying a heat map to the publicly available KITTI [24] and Cityscapes [22, 23] datasets, where participating pixels were identified as road pixels. The position of each pixel was labeled and concatenated to the input image. Dewangan et al. employed the classical segmentation models SegNet, fully convolutional network (FCN), and U-Net for road segmentation and detection [28]. Among the adopted schemes, U-Net obtained the highest accuracy of 94% under the presented scenario, outperforming the other techniques on the standard CamVid dataset [20, 21].

A deep learning-based autoencoder model learns significant features from samples using unsupervised learning [29]. The authors reconstructed this model by adding a supervised layer to learn significant features for image segmentation. The objective function of the modified model was set to minimize the error against the supervised labels. After encoding, the decoder weights were trained, propagating through the weights from the input to the hidden layer(s) and from the hidden layer(s) to the output layer. The sigmoid function was used as the activation function to map the feature representation, and the network weights were updated by gradient descent. For feature learning, each of the associated encoder and decoder weights was updated using the back-propagation algorithm. This procedure grows the network into a stacking model, and since the objective was to add a supervised layer to the traditional autoencoder, it was quite difficult to adopt.

To retain and handle this stacking strategy, the entire training procedure was divided into three interrelated groups, each responsible for learning features by adjusting its weights and updating its learning. This modified model learns the significant features and rejects details that are less significant to the segmentation phase; the schematic of the adopted method is shown in Fig. 2.

Fig. 2 Supervised autoencoder deep model [29]
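A minimal sketch of this kind of supervised autoencoder is given below, assuming a joint reconstruction-plus-classification loss; the layer sizes, equal loss weighting, and learning rate are illustrative assumptions, not the values of [29].

```python
import torch
import torch.nn as nn

class SupervisedAE(nn.Module):
    """Autoencoder with an added supervised head on the hidden code."""
    def __init__(self, in_dim=784, hid=128, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hid, in_dim), nn.Sigmoid())
        self.head = nn.Linear(hid, num_classes)   # the added supervised layer

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), self.head(h)

model = SupervisedAE()
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # plain gradient descent
x, y = torch.rand(8, 784), torch.randint(0, 2, (8,))  # dummy batch
opt.zero_grad()
recon, logits = model(x)
# Joint objective: reconstruct the input and match the supervised labels.
loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(logits, y)
loss.backward()   # back-propagation through encoder, decoder, and head
opt.step()
```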

The road detection phase employs various supervised learning methods but suffers from weak manual annotation or insufficient training data. To deal with this issue, Han et al. presented semi-supervised and weakly supervised methods using a GAN [30]. The idea was to reduce manual annotation: unlabeled data were trained on directly without overfitting problems. A ResNet-101 model was used as the backbone network, modified by substituting three transposed convolution layers for the dense layer. In the discriminator, max-pooling layers were replaced by convolutional layers to downsample the features at various scales. The network's predictions indicated whether outcomes came from labeled or unlabeled images. Both the generator and discriminator configurations were applied to semi-supervised and weakly supervised learning: in the semi-supervised method the generator network performed segmentation, while the weakly supervised method included road shape prediction.

Image segmentation depends on the global and local information of the input image, and its accuracy is mostly subject to strong label consistency among the associated features. Features obtained from region-based approaches represent precise local information but fail to cover global information, whereas hierarchical features can portray global information as well. To deal with these multiple modes of feature extraction, Geng et al. presented a CNN-based technique that combines region-based and hierarchical-feature approaches to improve feature representation [31]. The authors used the 21-layer ImageNet VGG network to extract strong hierarchical information. For the region-based mechanism, input images were divided into many coherent classes, followed by extraction of significant local features. Though semantic segmentation aims to label every pixel, noise may lead to false predictions and difficulty in preserving object boundaries. To solve the issues associated with superpixel segmentation, SLIC was applied so that the label counts within each superpixel could be determined and the majority label selected as the superpixel's tag. In another work, Dewangan et al. developed a CNN-based road classification network called RCNet to classify road surfaces into five categories, namely dry, curvy, ice, rough, and wet [32]. The authors focused on the design of the proposed model and on feature extraction using seven different optimizers.

Another work using a GAN as the segmentation model is presented in [33], with the objective of detecting the road using multi-scale context aggregation. The proposed network had two parts, a generator and a discriminator. The generator network produced fake class samples, whereas the discriminator's role was to distinguish the segmentation map produced by the generator. A U-Net-based encoder and decoder were employed to produce the segmentation outcomes. The generator network consisted of an analysis and a synthesis phase: the analysis phase received the input image and extracted features to categorize every pixel using pooling and convolution layers, whereas the synthesis phase stored the spatial details of precise localization, thus producing a strong segmentation map. The discriminator network acted as a classification CNN that has access to the ground truth and processed the segmentation map to produce class labels.

A deep model implementation used a 112-layer residual network with pyramid pooling for segmentation and detection of the road portion [34]. The architecture of this fully convolutional network had four residual sets, one pyramid pooling set, and one deconvolution layer. The authors considered only the lower portion of the image (containing the road region), which kept the computational complexity minimal. In the subsequent phase, the pyramid pooling set contained a pyramid pooling layer, a convolution layer with 1×1 kernels, and an interpolation layer used to resize the feature map. Features extracted with 3×3 kernels in this set described drivable, non-drivable, and road edge as three different road classes. These extracted features were afterward deconvolved to provide fine feature mapping, followed by a softmax function over the different road labels. The sample output of the adopted scheme is represented in Fig. 3.

Fig. 3 Blue (small) and red (large) blocks indicate the crops in the left and right images of the first row. The second row shows the small crop, while the third row shows the test outcome from the large crop [34]
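The pyramid pooling set described above can be sketched as follows. This is a PSPNet-style illustration under assumed channel widths and pooling scales, not the exact configuration of [34]: the input is pooled at several grid sizes, compressed with 1×1 convolutions, resized back by interpolation, and concatenated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool at several scales, 1x1-convolve, upsample back, and concatenate."""
    def __init__(self, in_ch=256, scales=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_ch, in_ch // len(scales), 1))
            for s in scales)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                     align_corners=False)
                       for stage in self.stages]
        return torch.cat(feats, dim=1)   # fused multi-scale feature map
```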

2.1.2 Mathematical model-based approach

The idea of combining road detection with hardware implementation is presented in [35], where a road segmentation technique was proposed using the disparity histogram feature. However, U-disparity, which extracts road feature information by analyzing all vertical pixels of the image, is computationally complex for hardware implementation; it delays the road segmentation process until the entire disparity image has been computed. To deal with this issue, the authors proposed a vertically local disparity histogram (VLDH), calculated from the differences within the local vertical region of a reference pixel. It operates on the N resident pixels in the vertical direction, and the entire procedure was pipelined over the N-line calculation, which made it well suited to hardware implementation in the given scenario.
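A rough software sketch of such a vertically local histogram is shown below. This is our interpretation of the idea under assumed window and disparity ranges, written in NumPy rather than the authors' pipelined hardware design.

```python
import numpy as np

def vldh(disparity, n_local=20, max_disp=64):
    """disparity: HxW integer disparity map. Returns an HxWxmax_disp volume
    where each cell counts disparity d over the n_local pixels above (v, u)."""
    H, W = disparity.shape
    hist = np.zeros((H, W, max_disp), dtype=np.int32)
    for v in range(H):
        v0 = max(0, v - n_local)
        window = disparity[v0:v + 1, :]            # local vertical region only
        for d in range(max_disp):
            hist[v, :, d] = (window == d).sum(axis=0)
    return hist
```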

Usually, road boundaries are parallel to each other and hence behave as parallel lines converging at some specific point. Using this characteristic, Valente et al. applied a vanishing point detection algorithm to identify road boundaries, along with road segmentation using the seeded region growing method [36]. To detect the vanishing point, the authors used a texture-based method with Gabor filters to determine the orientation of every pixel, thereby identifying the primary pixels. The dominant orientation path, which describes robust local structure, was also taken into account; it was computed by combining Gabor filters at orientations of 0°, 45°, 90°, and 135° with the optimal local dominant orientation method (OLDOM). In the subsequent stage, vanishing point voting was carried out, in which a weight was assigned to each dominant pixel computed from the Gabor orientation filters. Each weighted pixel voted through a distance function that favors candidates closer to the voting pixel over those farther away. After the vanishing point was found, the seeded region growing algorithm was applied: it measures the dissimilarity of pixels within the connected region, using the difference between a pixel's color and the region's mean color to delineate the accurate region. The presented scheme was stated to be appropriate and efficient for both structured and unstructured roads.
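The per-pixel dominant orientation step can be sketched with OpenCV's Gabor kernels. Only the four orientations follow the paper; the remaining filter parameters and the function name below are assumptions for illustration.

```python
import cv2
import numpy as np

def dominant_orientation(gray):
    """Return a per-pixel dominant orientation map from a 4-filter Gabor bank."""
    angles = [0, 45, 90, 135]
    responses = []
    for a in angles:
        kern = cv2.getGaborKernel(ksize=(17, 17), sigma=4.0,
                                  theta=np.deg2rad(a), lambd=10.0,
                                  gamma=0.5, psi=0)
        responses.append(cv2.filter2D(gray.astype(np.float32), -1, kern))
    stack = np.stack(responses)                   # 4 x H x W response volume
    idx = np.argmax(np.abs(stack), axis=0)        # strongest filter per pixel
    return np.take(np.array(angles), idx)         # dominant angle map (degrees)
```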

As road structures are not uniform, detecting roads in challenging environments is quite complex for a monocular vision camera. One possible solution is presented in [37], where the road detection system was equipped with vanishing point detection and the authors developed a road model for segmentation. Initially, a Gabor filter with texture orientation was used for vanishing point estimation. Achieving accurate orientation by applying a large number of oriented filters in all directions is computationally expensive; hence, only four Gabor filters were used to estimate the orientation. The next phase consisted of the segmentation model in which the vanishing point was calculated. A road model measured the angles between neighboring pixels: if their orientation difference was between 5° and 20°, the pixel was included; otherwise it was rejected. Further, the segmentation of the road image was formulated through Bayesian posterior estimation (the fit of the segmentation model over the points calculated in the current and previous phases).

In real-world scenarios, structured and unstructured roads may suffer from shadows in the daytime, which makes the detection process more complex for a vision-based approach in intelligent vehicles. The authors of [38] proposed a solution to this challenging issue. The procedure started with conversion from RGB to HSI (hue, saturation, intensity) space, which is robust against shadows in road scenes. Since road boundaries follow parallel-line characteristics and converge in some part of the image, the next step calculated vanishing points using normalized cross-correlation. False pixel values, however, may distort the boundaries of the road image and degrade detection accuracy; to handle this, bilateral and median filters were used to remove noise in the intensity component. Though the road boundaries were not always straight, the Hough transform mapped them within a relative diffusion area, predicting the lines that converge to the vanishing point.
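A sketch of this preprocessing chain is shown below, using OpenCV's HSV space as a stand-in for the HSI space used by the authors (OpenCV has no built-in HSI conversion); the file name and filter parameters are assumptions.

```python
import cv2
import numpy as np

bgr = cv2.imread("road.jpg")                          # hypothetical input frame
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)            # HSV as an HSI stand-in
intensity = cv2.medianBlur(hsv[:, :, 2], 5)           # suppress impulse noise
intensity = cv2.bilateralFilter(intensity, 9, 75, 75) # edge-preserving smoothing
edges = cv2.Canny(intensity, 50, 150)
lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=120)
# 'lines' holds road-boundary candidates; their intersection approximates
# the vanishing point.
```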

Occlusion makes road-boundary detection through a line-fitting mechanism difficult; hence, a curb-based method helps to detect boundaries in the presence of occlusion [39]. The presented method follows a non-parametric model in which curb-point information, together with the vanishing point, is processed through a Dijkstra road-boundary procedure to determine the road area. First, vanishing point candidates were extracted from a disparity map computed with efficient large-scale stereo (ELAS), and the resulting linear structure in the v-disparity map was fitted with the random sample consensus (RANSAC) algorithm; the pixel with the largest value was selected as the vanishing point. In the Dijkstra graph model, every pixel node was connected to its surrounding pixels and the cost of every participating element was calculated. The road border with the lowest cost was taken as the shortest path conforming to the road boundary in this model.

Using parallel-line characteristics, road boundaries enable fast and accurate vanishing point estimation, as described in [40]. After acquiring the texture information, a voting scheme was employed for vanishing point detection. Texture information was calculated from the variance between average pixel values and individual pixel values. For the orientation component, Gabor filters were used to determine the local dominant direction precisely at every pixel. In the vanishing point voting, rather than taking all pixels as voters, only selected values were collected, which minimized noise effects. The voting procedure began with an accumulator over all valid pixels of the road image, initialized to zero. Based on the orientation and position of each voter, if a candidate vanishing point lay on the voter's line, the corresponding accumulator cell was incremented by 1, and the largest accumulator cell was designated the vanishing point.

Figure 4 displays sample outputs of the predicted vanishing points under different road environments. The red marking designates the vanishing point estimated by the authors' proposed approach, compared with other schemes represented in different color codings. Detecting a defined, structured road is comparatively easier than detecting an undefined road, because a defined road presents more reliable features in terms of boundaries, color, uniform texture, etc., from which precise detection can be achieved. The work presented in [41] uses a spatial fuzzy clustering approach for accurate detection of unpaved roads. The first phase started with vanishing point detection, ignoring the upper half of the image, which does not contain the road. The Bayesian posterior was used to spot the vanishing point with a probability score and to identify the vanishing line. In the succeeding phase, Otsu segmentation was applied to separate foreground and background objects in the input image: a threshold is selected from the gray-image histogram so as to maximize the inter-class variance between background and foreground. This thresholding mechanism was satisfactory for the initial setup but not for subsequent stages, owing to the small intensity differences between the road and neighboring scenes. To handle this, the authors proposed segmentation using a double Otsu threshold, which achieved more robust results than simple Otsu segmentation. A support vector machine (SVM) was then employed to classify road and non-road regions, normalizing three types of attributes, namely gray scale, position, and shape, in the fusion procedure. Finally, a fuzzy clustering method was used to increase road recognition accuracy by mapping the membership matrix to the maximum association.
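The double-Otsu idea can be sketched as below. This is our reading of a two-pass scheme, assuming the second pass re-thresholds the darker sub-population; it is not necessarily the authors' exact formulation.

```python
import cv2
import numpy as np

gray = cv2.imread("road.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input
# First Otsu pass over the whole gray-level histogram.
t1, _ = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Second Otsu pass restricted to the darker sub-population.
low = gray[gray <= t1].reshape(1, -1)
t2, _ = cv2.threshold(low, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Pixels between the two thresholds form the candidate road mask.
road_mask = ((gray > t2) & (gray <= t1)).astype(np.uint8) * 255
```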

Fig. 4 Sample outputs of vanishing point estimation under different road environments [40]

Another way to detect roads is presented in [42] and consists of a series of procedures. It started with horizon detection, which restricts processing to a specific region of the image, color-space conversion from RGB to HSV, and noise removal using a median filter that does not smooth the boundary edges. Afterward, Otsu thresholding was employed to binarize the images. A graph-based method then treated every pixel as a graph node connected to its neighboring elements, with each edge carrying a weight indicating the dissimilarity between its vertices, obtained from differences in location, color, or pixel intensity. Edges between two nodes within the same component had comparatively lower weights than edges between nodes of different components. The quick-shift procedure examined local neighbor relations for the highest density; as a result, it associated each pixel with its adjacent neighbor of higher density and thereby increased the association density.
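The quick-shift step can be reproduced directly with scikit-image; the parameter values below are assumptions, not those of [42].

```python
from skimage import io
from skimage.segmentation import quickshift

img = io.imread("road.jpg")                 # hypothetical RGB input
# Each pixel is linked to its nearest higher-density neighbour in the joint
# colour-position space, producing compact segments.
segments = quickshift(img, kernel_size=3, max_dist=6, ratio=0.5)
# 'segments' labels each pixel; region adjacency then gives the graph edges.
```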

Some notable practices and model-based procedures for road detection, with their techniques, datasets, and achieved performances, are presented in Table 1.

Table 1 Road detection

2.1.3 Discussion

The study of the road detection phase reveals that detection of the road is mostly based on deep learning, machine learning techniques, and mathematical modeling.

The work proposed in [18] and [30] implemented GANs to extract the significant features for road detection. The work in [17] involved superpixel segmentation using a CNN and an MRF to determine a proposal map representing the per-pixel class category. The semantic segmentation in [19] and [25] used significant features in the training phase while discarding non-significant ones, and hence performs better in terms of time and memory. The approach in [31] showed some false predictions when detecting small-scale objects.

Most of the mathematical modeling approaches involve the computation of vanishing points. The vanishing point detection in [36, 37] calculates the pixel orientations along the participating road boundaries, whereas [39] calculated vanishing points using Dijkstra's graph cost model. The implementation in [38] finds the vanishing point using the HSI color model, focusing on the shadow region; the Hough transform was then used to identify the road boundary for the vanishing point. Graph-based segmentation followed by a contour extraction procedure is given in [42]. The review of existing road detection methods shows that the task remains difficult and challenging, owing to the changing road environment and the presence of pedestrians, vehicles, etc. Additionally, the approaches employed by various authors considered test setups under standard conditions and have yet to be validated in more heterogeneous environments. It is also essential to integrate various deep learning models with machine learning techniques for robust detection and classification of roads. Optimization tools and mathematical modeling can likewise be employed to develop effective road detection systems.

2.2 Lane detection

According to the lane-marking scheme, a lane is the part of the road region that defines the portion on which a vehicle is expected to drive. Lane marking helps the vehicle identify the driving area based on the speed limit so that, if followed properly, each vehicle keeps itself safe from other moving vehicles.

2.2.1 Deep architecture-based approach

Using a CNN, [43] presented the idea of using UV-disparity for accurate segmentation accomplished through semi-global matching (SGM). Inverse perspective mapping (IPM) and a Sobel filter were then utilized to acquire the actual and relative positions of lane markings. Afterward, this information was processed in Hough transform space because, on bumpy roads, nominally parallel lines are not parallel in the image [44] and IPM cannot handle them efficiently. Finally, host-lane classification was done using the adopted CNN architecture.
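A minimal IPM sketch is given below: four road-plane points (the coordinates are assumptions, in practice obtained from camera calibration) are warped to a rectangle, producing the bird's-eye view in which lane lines stay roughly parallel.

```python
import cv2
import numpy as np

frame = cv2.imread("road.jpg")              # hypothetical input frame
h, w = frame.shape[:2]
src = np.float32([[w * 0.42, h * 0.65], [w * 0.58, h * 0.65],   # far corners
                  [w * 0.95, h * 0.95], [w * 0.05, h * 0.95]])  # near corners
dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
M = cv2.getPerspectiveTransform(src, dst)
birds_eye = cv2.warpPerspective(frame, M, (w, h))
# Vertical gradients highlight the near-vertical lane edges in the top view.
grad = cv2.Sobel(cv2.cvtColor(birds_eye, cv2.COLOR_BGR2GRAY),
                 cv2.CV_32F, 1, 0)
```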

When a small loss is propagated back through the network, it is sometimes difficult to regulate and train the weights of the first few layers, which may lead to the vanishing gradient problem. To resolve this and detect the lane region through segmentation, Dewangan et al. utilized a hybrid of the ResNet and U-Net networks [14]. This fused model, VLDNet, achieved an improved accuracy of 98.87% and was validated on the KITTI dataset [24].

Identifying lane markings in the road scene is one of the primary objectives of an intelligent vehicle, but in many scenarios the markings are not clear and visible in a real-time environment, whether due to shadows of vehicles or trees, poor lane markings, bad weather, or poor road conditions. In this context, Ye et al. presented a CNN implementation that handles the non-lane portion of the image while extracting lane features [45]. First, IPM was applied to get a bird's-eye view, eliminating the background portions of the image that do not contribute to identifying the lane. In the next step, a CNN was implemented to obtain the lane feature maps, followed by subsampling. In the last stage, classification indicated the absence or presence of a lane marking.

2.2.2 Mathematical model-based approach

Under heavy traffic conditions, occlusion is very common and makes accurate detection of lane markings difficult. To deal with this issue, Chen et al. proposed a method for finding vehicle-free lanes in a multi-lane scenario using a Gaussian mixture model (GMM) [46]. It utilized the progressive probabilistic Hough transform to identify line segments, and based on the identified segments, lanes were classified through K-means clustering. The authors performed experiments on both single and multi-lane lines. Generally, lane markings are parallel in nature and follow the property of parallel lines converging at a particular point, the vanishing point, in the 2D image plane. This parallel-line property was applied to detect lane lines using a probabilistic voting procedure [47]. In a subsequent phase, a Gaussian model was applied to score participating pixel points as vanishing point candidates.
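The segment-then-cluster idea can be sketched with OpenCV and scikit-learn; the two-lane setting, the midpoint descriptor, and the thresholds are assumptions for illustration.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

gray = cv2.imread("road.jpg", cv2.IMREAD_GRAYSCALE)    # hypothetical frame
edges = cv2.Canny(gray, 50, 150)
# Progressive probabilistic Hough transform: returns finite line segments.
segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                       minLineLength=30, maxLineGap=10)
if segs is not None:
    # Describe each segment by its midpoint, then group segments into lanes.
    mids = np.array([[(x1 + x2) / 2, (y1 + y2) / 2]
                     for x1, y1, x2, y2 in segs[:, 0]])
    lane_ids = KMeans(n_clusters=2, n_init=10).fit_predict(mids)
```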

There are many varieties of lane-marking schemes designed for specific purposes, and it is essential for an intelligent vehicle to distinguish among them to avoid unexpected collisions. The various categories of lane markings, such as single solid, dashed, double solid, and merged, are defined in [48]. In that study, the author used a Bayesian classifier to classify these lane-marking schemes based on Gaussian distributions. In the same direction, the authors of [49] proposed a v-disparity map and visual odometry scheme, where the former reduced the search area for vanishing point detection and the latter helped discover the vanishing point in straight and curved lanes. Dijkstra graph search was then applied to produce a low-cost mapping identifying the vanishing point along the optimal track.

To find lane lines, a model was implemented that identifies the pixels participating in a smooth lane line and calls the connected series of pixels a 'Super Particle' [50], as represented in Fig. 5. A vertical line was identified in the image after IPM was applied, thus detecting the lane. More generally, a model can recognize lane lines using feature-based or model-based approaches. A feature-based mechanism extracts local or low-level proposals but mostly suffers from noise that may affect the computation, whereas a model-based approach expresses the lane structure as a mathematical equation that fits all lane types, straight and curvy, on the road. In this direction, the authors of [51] proposed a method with two-stage feature extraction (LDTFE) to identify the lane. It extracts lane-line segments using a modified Hough transform and then divides them into clusters using density-based spatial clustering of applications with noise (DBSCAN); a clustering sketch follows below. However, detection of occluded lane markings remains restricted due to scene complexity.
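The DBSCAN stage can be sketched as follows. The segment descriptors (midpoint plus scaled angle) and the clustering parameters are assumptions, not necessarily the features used in [51]; the useful property shown is that noisy segments fall out as outliers (label -1).

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_segments(segments):
    """segments: iterable of (x1, y1, x2, y2) from a Hough-style detector."""
    feats = []
    for x1, y1, x2, y2 in segments:
        angle = np.arctan2(y2 - y1, x2 - x1)
        # Scale the angle so it is commensurate with pixel distances.
        feats.append([(x1 + x2) / 2, (y1 + y2) / 2, 50 * angle])
    labels = DBSCAN(eps=25, min_samples=3).fit_predict(np.array(feats))
    return labels          # -1 marks segments rejected as noise
```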

Fig. 5 Lane model with parameters pc, αl, αr, βl, and βr; thick markings indicate the lane boundary [50]

Detection of a lane line using its parallel characteristics can employ the RANSAC algorithm to compute the vanishing point [52]. This algorithm randomly selects the participating elements, after which an evolutionary algorithm was applied. In this study, edges were identified from a set of input images, and vanishing point candidates were estimated from the straight-line components. Out of the many candidates, the optimal point was designated as the vanishing point using a harmony search algorithm. Using standard image processing techniques, the authors of [53] first focused on the lower portion of the image, targeting its region of interest (ROI) to minimize the computational complexity associated with lane-line detection. The Canny edge detection algorithm was then applied, and the proposed approach was validated on the public KITTI dataset [24].

A lane line is essentially a series of pixel values and can be considered an edge. Edge detection alone may not suffice, however, because the captured image in an actual scenario may contain other objects with the same characteristics that are not part of the road. The ROI is the key criterion by which a system targets a particular image region to identify its features. In [54], the authors applied this idea to lane identification, using the Hough transform to detect the edges for lane-line detection. The ROI was taken as the bottom one-third of the frame to minimize computational complexity.

The work in [55] utilized a parabolic lane model suitable for fitting straight and curvy lane lines in the road structure. A feature extraction approach using anisotropic steerable filters was applied, extracting lane-marking features under dynamic scenarios. The parameters of the curved and straight lane lines were estimated by minimizing an objective function using an optimization technique. Multi-lane detection faces many challenges, such as partial or full occlusion, low lane visibility, and perspective issues. With this in mind, the authors implemented IPM to exploit the parallel-line properties of a given lane line; IPM generates a top view of the image by removing the perspective effect. Vanishing point detection via a generalized Laplacian of Gaussian filter was then used to estimate the convergence of the lane lines [56].
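The parabolic model itself reduces to a small least-squares fit. The sketch below assumes candidate lane pixels have already been produced by an earlier feature-extraction step; it fits x = a·y² + b·y + c, which covers both straight (a ≈ 0) and curved lanes.

```python
import numpy as np

def fit_parabola_lane(xs, ys):
    """xs, ys: pixel coordinates of detected lane-marking points.
    Returns a function giving the lane's x-position at image row y."""
    a, b, c = np.polyfit(ys, xs, deg=2)        # minimise squared residuals
    return lambda y: a * y**2 + b * y + c
```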

Afterward, blob filtering and verification were used to increase the robustness of lane detection. A similar approach, considering the parallel-line characteristics and their convergence point estimation, is implemented in [57], where the vanishing point was computed with the Hough transform. The angle and length properties of the lane lines were also combined with vanishing point detection to improve the robustness of the approach. As lanes in real-world road scenes can be straight or curvy, with different shapes and representations, the model-based approach in [58] employed the Catmull-Rom spline model. Owing to its ability to form arbitrary shapes with smoothness and continuity across pixel and control points, this model can identify lane structures that are parallel or curved. At this stage, an X×Y input image was reduced to an (X/2)×(Y/2) image by a Laplacian pyramid, the objective being to find the global optimum in the reduced search space until the best point is found.

Estimating the vanishing point from parallel lane characteristics is sufficiently accurate only when the extracted lane features contain no non-participating candidates. For this reason, Wang et al. adopted the Metropolis algorithm, which eliminates most undesirable proposals and thereby reduces the search space [59]. It minimized the number of local maxima in the neighborhood of the search space over successive samples of the distribution; the Metropolis algorithm was thus designed to discover the maximum point within the surrounding search boundaries. For the vanishing point, the image was partitioned horizontally and each partition labeled; by finding vanishing points for each partition, lane detection proved satisfactory for straight and curvy lane lines in standard scenarios. Another model-based approach for lane detection [60] is built upon a collection of fuzzy points, in which lines constructed from fuzzy sets, allied to fuzzy point, line, and spatial associations, correspond to an R² linear fuzzy space. The initial phase of this procedure focused on recognizing the collection of points likely to form a lane line; the approach examined the connected-points component and treated it as a lane line of the road, while candidates not belonging to the connected component were rejected. The road-proposal extraction phase used Canny and Sobel edge detectors to form new clusters of fuzzy collinear points. The methodology is limited in cases that demand more proposal points.

To overcome the difficulties associated with vanishing points, Dewangan et al. used IPM to find the lane lines with a monocular camera [61]. The authors applied standard image processing techniques to identify the left lane, right lane, and lane center, and using this information, the proposed intelligent vehicle prototype changes its driving behavior to keep itself in the middle of the lane, as represented in Fig. 6. Likewise, Son et al. proposed a lane clustering model that works under various illumination conditions and reduces computational complexity via an adaptive ROI [62]. Initially, the edge recognition stage preserved the structural characteristics and considerably decreased the amount of non-significant data to examine, minimizing computational complexity; the Canny edge detection method was then applied for its robustness against noise. The vanishing point was computed at the intersection of the perceived lane lines, and a voting map was generated to determine the optimal vanishing point.

Fig. 6 The top frame shows the computation of distances to the left and right lanes. The second frame divides the image into small regions one pixel wide. The third frame calculates the lane and frame centers from the detected left and right lane boundaries, and the last frame shows the lane center in green and the detected lane lines in blue [61]

In another real scenario, when traffic is heavy, vehicles occupy most of the road and hence cover the lane markings partially or completely. However, if the front vehicle changes lanes when a free lane becomes available, it usually frees the current lane positions for approaching vehicles. This mechanism is discussed in [63], where smartphone sensors were responsible for evaluating the position of the vehicle to detect the lane. To deal with unclear circumstances and vehicle repositioning, Markov localization was used to maintain a probability density over the set of possible positions. The primary intention here was to keep the expected error as low as possible rather than to select lane points by a probability measure.

Apart from the above literature studies, Table 2 highlights major research contributions in lane detection.

Table 2 Lane detection

2.2.3 Discussion

Studies on lane detection have been explored and categorized into two parts, namely deep architecture-based and mathematical model-based approaches.

IPM, which provides a bird's-eye view, is implemented in [43, 45] to detect the lane portion of the road; [43] made use of semi-global matching and the Hough transform to match parallel lines correctly. The work in [47, 49, 56] uses the characteristics of parallel lines to find the vanishing point from the lane lines on the road. Reference [49] uses a Dijkstra graph map to produce a low-cost mapping of candidate pixel points toward the discovery of the vanishing point, whereas [56] uses IPM to follow the parallel lane lines. In these works, the models fit straight lane lines well. Another distinctive model [52] determines the vanishing point of the lane lines using a meta-heuristic harmony search scheme.

Most of these works are well suited to straight-lane detection in standard scenarios and could not maintain a good trade-off between optimal model training and high accuracy. Moreover, beyond the standard scenario, there is a strong opportunity to detect straight and curvy lanes in dynamic environments where occlusion and illumination variations are challenging.

2.3 Pothole detection

A well-defined road structure and proper lane-marking schemes help the driver stay safe and on the right track. But road structure degrades, and undulations and potholes caused by many natural and human factors make driving worse for an intelligent vehicle. Since intelligent vehicles are trained to drive on structured roads and guided lanes, encountering damaged structures may mislead them entirely and lead to serious hazards. Potholes are the most unpredictable elements found on roads, as they have no uniformity in their structure, and passing through them without precaution can damage the assembly of an intelligent vehicle. Some works on pothole detection are discussed here, particularly from the perspective of autonomous driving.

2.3.1 Deep architecture-based approach

Various sensors that can extract the necessary pothole information, from vibration sensors to 3D mapping, are not appropriate for every point of a scene, and most have limited performance, especially at night. All objects emit some heat energy, which is difficult for a vision camera to capture; a thermal camera, by contrast, is good at obtaining the heat signatures of such objects. Exploiting this thermal capability, Aparna et al. proposed a CNN architecture that utilized thermal imaging to detect potholes on the road [64]. After the data were acquired through a FLIR ONE thermal camera, the input data were preprocessed (cropped and resized) to remove inconsistencies. Zooming, rotation, mirroring, blurring, and contrast enhancement were applied as data augmentation techniques to maximize the number of samples. A sequential CNN was then proposed, utilizing a pool of BatchNormalization, Conv2D, MaxPooling2D, Dense, Dropout, and Activation layers; ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152 were also employed. In another work, Dewangan et al. developed a CNN-based model called PotNet to identify potholes [65]. The work involved a small-scale setup, and with a hybrid approach of vision sensor and deep learning framework, PotNet obtained an accuracy of 99.02%.
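A hedged sketch of a sequential CNN of this kind is given below in Keras; the layer types match the pool named in the text, while the depths, input resolution, and optimizer are assumptions rather than the configuration of [64].

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                                     Flatten, Dense, Dropout, Activation)

model = Sequential([
    # Single-channel thermal frames; the 120x160 size is an assumption.
    Conv2D(32, (3, 3), input_shape=(120, 160, 1)),
    BatchNormalization(), Activation("relu"), MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3)),
    BatchNormalization(), Activation("relu"), MaxPooling2D((2, 2)),
    Flatten(), Dense(64, activation="relu"), Dropout(0.5),
    Dense(1, activation="sigmoid"),            # pothole vs. no pothole
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```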

Varona et al. proposed a deep learning approach whose objective was to categorize roads according to surface structure for pothole detection [66]. It starts with multivariate time-series classification, to which the adopted CNN scheme, a long short-term memory (LSTM) network, and reservoir computing (RC) models were applied. The CNN model extracted visual information through convolutional, pooling, and fully connected layers, while the LSTM, a type of RNN, was used to discard less significant information.

Finally, RC was applied to minimize computational cost, since its internal weights are not trained. It achieved this objective with a PCA-based dimensionality reduction approach, synchronizing the input with the state sequence, followed by a linear model for classification.

2.3.2 Mathematical model-based approach

Potholes on the road are mostly non-uniform in structure, but their shape is approximately elliptical when viewed in perspective, and the texture inside a pothole is much grainier than the surrounding road surface, another distinctive feature. These distinctive features are used to detect potholes [67]. The shape extraction approach can be visualized in Fig. 7. After color-space conversion and preprocessing, a shape-based thresholding algorithm was used in the image segmentation phase. In the shape extraction phase, linear regions such as joints and long cracks were predicted not to be potholes and eliminated; to this end, region features such as the major axis, centroid location, and orientation angle were determined.

Fig. 7 Shape extraction procedure for pothole detection [67]
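The shape-filtering step can be sketched with scikit-image region properties; the area and elongation thresholds below are assumptions, the point being to reject elongated, crack-like regions and keep roughly elliptical candidates.

```python
from skimage.measure import label, regionprops

def pothole_candidates(binary_mask):
    """Return (centroid, orientation) for regions that look pothole-like."""
    candidates = []
    for region in regionprops(label(binary_mask)):
        if region.area < 200:                        # too small to matter
            continue
        elongation = (region.major_axis_length /
                      max(region.minor_axis_length, 1))
        if elongation < 3.0:                         # reject linear cracks/joints
            candidates.append((region.centroid, region.orientation))
    return candidates
```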

In a similar approach, Murthy et al. presented a sequential procedure to detect potholes [68]. In general, the basic segmentation of an object in an image involves bi-level thresholding, but with multiple objects of varying intensities, bi-level thresholding may not be applicable; hence, the author used multilevel thresholding to detect the multiple objects. In the subsequent phase, edge detection was performed to satisfy the criterion of approximating the mathematical gradient operator, using various patterns in different directions and fitting native intensities to model parameters. Of the mechanisms tried, the Roberts edge detector performed suitably. At the final stage, a blob selection technique was used to examine the pothole features using density, outline, characteristic proportion, and area concavity.

In another work, the authors considered texture and low-intensity area features for pothole detection [69]. For a moving vehicle, relying on only two features was not sufficient, so the shadow of a pothole was considered as an extra feature, since potholes cast darker shadows than the neighboring roadway.

Several research studies detect potholes using vibration sensors, 3D mapping, or 2D vision-based approaches. However, many of these sensors cannot extract exact information in a dynamic and heterogeneous environment. The work in [70] used a 2D vision-based technique with a Raspberry Pi 2. In the image enhancement phase, a bit-selection procedure converted 24-bit images to 8 bits by choosing only the blue channel of the image. In the same phase, median filtering removed noise by replacing each pixel with the median of its neighborhood. Weighted mean-based adaptive thresholding was computed for segmentation because it had to accommodate variations in lighting conditions across the image. The image texture and the structure of distress features were determined in subsequent steps for accurate pothole detection. The number of potholes present in the image was determined, along with their approximate area, perimeter, and circularity.
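This enhancement chain can be sketched directly with OpenCV; the block size and offset of the mean-based adaptive threshold are assumptions for illustration.

```python
import cv2

bgr = cv2.imread("road.jpg")                       # hypothetical input frame
blue = bgr[:, :, 0]                                # OpenCV stores B, G, R
blue = cv2.medianBlur(blue, 5)                     # median of each neighbourhood
# Mean-based adaptive threshold copes with uneven lighting across the image.
mask = cv2.adaptiveThreshold(blue, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                             cv2.THRESH_BINARY_INV, blockSize=25, C=5)
# Connected regions of 'mask' are then scored by area, perimeter, circularity.
```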

The summary of major studies in pothole detection is presented in Table 3.

Table 3 Pothole detection

2.3.3 Discussion

The comprehensive study of pothole detection is described under the above-mentioned categories, i.e., deep architecture-based and mathematical model-based methods. The authors in [64, 66] used deep learning approaches for pothole detection. In [64], the details of pothole structures were acquired through thermal imaging, and training was performed using ResNets. The approach in [66] implements LSTM and RC to minimize computational cost without dropping significant features during training. The work proposed in [67] characterized the shape of cracks or joints using a shape-based thresholding algorithm and identified the pothole texture, which is mostly uniform across pothole types, assuming daylight visibility. Texture structure with adaptive thresholding segmentation was also utilized, but very high vehicle speeds or night conditions may degrade the accuracy of pothole detection [70].

Dynamic environments involving rain, dirt, and low illumination, especially at night, are more challenging for pothole detection and have not been considered in the referred studies. As potholes are not uniform in structure, shape-based approaches are limited to detecting specific types. Texture extraction is a distinctive approach in this direction but is restricted when various road materials are used; hence, texture may not serve as the sole criterion, owing to its non-uniformity.

2.4 Pedestrian detection

A pedestrian is a person walking or running on a road or in a nearby region. Detecting pedestrians and avoiding collisions or accidents is one of the most significant phases of an intelligent vehicle system. Detecting pedestrians through intelligent techniques assists decision-making while driving and thereby helps decrease the death rate from road accidents. For this phase, authors mostly follow vision- and sensor-based approaches, discussed as follows:

2.4.1 Deep architecture-based approach

The vision-based approach follows a stepwise procedure for object identification, of which pedestrian detection is a subproblem. Usually, such detection schemes operate on color images under standard illumination conditions; though traditional methods have a good recognition rate, they show limited performance under low illumination, noise, and similar conditions. Li et al. proposed an illumination-aware scheme for pedestrian detection, in which the deep architecture combined color and thermal images [71]. The architecture used R-CNN as the backbone network and consisted of three phases: a multispectral Faster R-CNN produced separate detections from the thermal and color images; an illumination assessment unit then produced a measurement of the illumination conditions; and the final phase performed precise, dynamic recognition, in which a gated layer fused the thermal and color detection outcomes.

In real situations, there may be unusual objects that an intelligent vehicle has never been trained to recognize. It must also distinguish pedestrian from non-pedestrian objects; such non-participating objects in this context are anomalies, which may have complex structures and vary in size. For this scenario, the idea presented in [72] involves anomaly detection on pedestrian pathways. A region-based scalable convolutional neural network (RS-CNN) architecture was applied to identify such anomalies. This architecture contained two phases: a region proposal network (RPN) with scalable proposals, which generates the regions, and a Fast R-CNN phase to spot anomalies.

Pedestrian detection for an intelligent vehicle works sufficiently well in daylight, but recognition at night is difficult owing to low illumination and noise in the capturing environment. One possible direction is to adopt a thermal camera for nighttime detection, but this is an expensive option. The work in [73] attempts to provide a solution for this scenario. The Faster R-CNN architecture is modified in this approach, followed by a data augmentation procedure to provide an adequate amount of training data. Apart from basic augmentation techniques, it also applied noise addition and brightness dropout to daytime data. In the proposed architecture, the last-layer feature map combined the maps of successive frames.

The CNN-based method is computationally expensive but is effective in feature extraction and provides precise detection. One idea is to cascade an aggregate channel features (ACF) detector with a deep CNN [74]. The ACF detector took the given training dataset as input and provided proposals with confidence scores; these scores were used to define bounding boxes for detecting pedestrians in visual scenes. The architecture, built upon VGG16, then classified the extracted proposals, with fine-tuning and transfer learning employed to ensure high detection accuracy. In recent practice, the direction and speed of pedestrians are also taken into consideration. The authors of [75] used a deep learning technique to accomplish consistent recognition of pedestrians moving in specific directions: right, left, and front. Initially, gradients of the motion history image (MHI) and dense optical flow were calculated to detect variations in the image, followed by histogram of oriented gradients (HOG) features and a linear SVM. After subtracting consecutive frames, the pixel-level sum was obtained. These outputs were used as input to modified CNN architectures, including GoogLeNet, ResNet, and AlexNet.

Obscure elements in a static scene that are not identifiable to an observer can occasionally become noticeable when motion is introduced; motion proposals are therefore important for recognizing an object, and a deep learning-based approach is capable of fetching proposals for such scenarios. One such approach is presented in [76], where the objective is to use still as well as motion proposals. In this work, two-stream convolutional features were used. The first stream, for spatial proposals, followed the framework of convolutional channel features (CCF) and involved the lower part of VGGNet-16, with three convolutional layers and three max-pooling layers. The second stream, for temporal proposals, built convolutional features specialized in motion feature extraction. It is worth mentioning that training this two-stream detector involved proposal maps of large dimension. The detector in this architecture is a sliding window that classifies densely sampled windows over the proposal maps at evaluation time.

When the vehicle moves forward, an object close to the vehicle appears at large scale, whereas a far object appears at small scale. To deal with this multiscale issue, Yang et al. employed a VGG16-based fully convolutional network and modified several convolutional layers [77]. Proposals fetched from the lower and higher layers were responsible for small-scale and large-scale pedestrians, respectively. Estimation boxes of different scales were built according to the different layers. Given m proposal maps, the scale of the estimation boxes for the k-th proposal map was defined in Eq. (1):

$${S}_{k}={S}_{min}+\left(\frac{{S}_{max}-{S}_{min}}{m-1}\right)\left(k-1\right),\quad k\in [1,m]$$
(1)

where \({S}_{min}=0.1\) (bottom layer) and \({S}_{max}=0.9\) (uppermost layer).
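As a quick illustration, Eq. (1) can be computed directly; the choice m = 6 below is an assumed number of proposal maps, not a value taken from [77]:

```python
def layer_scales(m, s_min=0.1, s_max=0.9):
    """Scale S_k for each of the m proposal maps, per Eq. (1)."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1)
            for k in range(1, m + 1)]

print(layer_scales(6))  # approximately [0.1, 0.26, 0.42, 0.58, 0.74, 0.9]
```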

Dealing with pedestrian detection involves processing the human body in terms of feature extraction. However, a pedestrian may be partially hidden or occluded by another vehicle in the real world. Detection of such occluded pedestrians is considered in [78], where a deep learning-based probabilistic framework was presented to learn the features of human body parts.

If any body part is identified in the scene, even in the case of occlusion, a pedestrian is considered present (refer to Fig. 8). A restricted Boltzmann machine (RBM) was used in the proposed framework, defining a probability distribution over binary hidden variables and binary observed variables. In subsequent steps, the part model was defined, in which parts were allocated according to their sizes and relationships across layers: lower-layer parts were smaller, and top-layer parts were larger. As the proposed model is a general graphical model, it is computationally expensive both in processing time and in training the network; therefore, a DBN-style model was selected for its easy interpretation and fast training algorithm.
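For reference, a standard RBM over binary visible units \(v\) and hidden units \(h\) (the general formulation, not necessarily the exact parameterization of [78]) defines the joint distribution

$$p\left(v,h\right)=\frac{1}{Z}{e}^{-E\left(v,h\right)},\quad E\left(v,h\right)=-{a}^{T}v-{b}^{T}h-{v}^{T}Wh$$

where \(W\) holds the pairwise weights, \(a\) and \(b\) are bias vectors, and \(Z\) is the normalizing partition function.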

Fig. 8 Parts-based model [78]

In developing visual proposals, many steps have been introduced to minimize computational complexity. With this idea, the work presented in [79] optimized the stages of pedestrian recognition to improve detection accuracy. The optimization was achieved by choosing better proposals and rejecting weaker ones with the help of strategies such as selective search, sliding windows, and locally decorrelated channel features (LDCF). Fine-tuning facilitated the training module by adjusting the weights of the network and classifier toward an optimal selection of parameters, and data augmentation was the next step. A region proposal technique was employed to estimate the probability that a given region belongs to a pedestrian.

2.4.2 Mathematical model-based approach

Most pedestrian detection studies focus primarily on front-view detection. Departing from traditional practice, Suhr et al. proposed a rear-view camera-based approach in which the existence of a pedestrian is confirmed after detection, as represented in Fig. 9 [80]. The adopted strategy covered two general pedestrian body postures: standing and sitting. For a standing pedestrian, the system examines the lower portion of the body to find the leg structure, whereas for a sitting pedestrian it identifies the head. Based on the HOF feature and total error rate (TER), hypothesis-generation and verification-based classifiers were employed.

Fig. 9 Working flow of pedestrian detection [80]

Identifying a pedestrian is relatively easy when the environment is steady or the background uniform, but this is not always the case owing to the randomness and complexity of real-world environments. A concrete discussion of pedestrian detection against complex backgrounds is provided in [81], where the authors combined static and dynamic proposals to increase the frequency of proposal extraction while achieving high recognition accuracy. For a fixed-size window, scaling of image data was carried out during proposal pyramid design, and a non-maximum suppression scheme was used to precisely spot the pedestrian. Additionally, a pedestrian can be identified by tracking positional information in successive frames, whereas the position of a stationary or non-pedestrian object does not change across frames. To reduce false detections, the Lucas-Kanade sparse optical flow method was used to capture this motion. For time and speed accuracy, local corner information was given priority: since corners are the junctions of dissimilar edges in the image, corner locations were calculated from the extreme eigenvalues of the autocorrelation matrix.
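A minimal sketch of this corner-plus-sparse-flow idea with OpenCV follows (all parameter values are assumptions): Shi-Tomasi corners are scored by the minimum eigenvalue of the local autocorrelation matrix, and Lucas-Kanade flow then checks whether each corner actually moves between frames.

```python
import cv2
import numpy as np

def moving_corners(prev_gray, curr_gray, motion_thresh=1.0):
    # Corners scored by the minimal eigenvalue of the autocorrelation matrix
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                      qualityLevel=0.01, minDistance=7)
    if corners is None:
        return np.empty((0, 2))

    # Pyramidal Lucas-Kanade sparse optical flow between the two frames
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   corners, None)
    ok = status.ravel() == 1
    displacement = np.linalg.norm(next_pts[ok] - corners[ok], axis=2).ravel()

    # Keep only corners whose motion exceeds the threshold:
    # stationary (non-pedestrian) points are suppressed
    return next_pts[ok].reshape(-1, 2)[displacement > motion_thresh]
```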

Usually, the entire image does not represent the intended object; the scene may also contain other background and foreground elements. Here, the approach is to locate the key object in the given scene through an ROI, thus reducing the computational cost of the system. One such idea is presented in [82], where pedestrian detection was performed using ROI formation. It started with a depth map and a HOG descriptor on the color image, both largely robust to scaling and illumination. Since the depth map records, for each pixel, the distance of the corresponding scene point from the camera, it gives a faithful projection of the scene, making pedestrian detection more accurate. The depth image was fragmented into many small sub-images, each containing objects of similar depth; K-means clustering on the depth image segmented it into k clusters. Afterward, a linear SVM with the HOG descriptor was used to categorize the ROIs into two classes: pedestrian and non-pedestrian.
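A condensed sketch of this depth-clustering idea follows (illustrative; the value of k, the HOG settings, and the pre-trained classifier are assumptions, as [82] publishes no code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from skimage.feature import hog

def depth_rois(depth_map, k=6):
    """Cluster pixels by depth so each ROI holds objects of similar range."""
    h, w = depth_map.shape
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(
        depth_map.reshape(-1, 1))
    return labels.reshape(h, w)  # per-pixel cluster index

def classify_roi(gray_patch, svm: LinearSVC):
    """HOG + pre-trained linear SVM decides pedestrian vs. non-pedestrian.

    gray_patch should be resized to a fixed window (e.g., 64x128) so the
    HOG feature vector has a consistent length.
    """
    feat = hog(gray_patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
    return svm.predict(feat.reshape(1, -1))[0]  # 1 = pedestrian (assumed)
```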

Based on shape symmetry and appearance, features were fetched and pedestrians were then identified [83]. Neighboring proposals were also designed to incorporate size, location, aspect ratio, and similar proposal types for robust detection results. Appearance reliability for a pedestrian was described through the head, upper body, and legs, while shape symmetry indicated that the pedestrian shape is roughly symmetric in the horizontal direction. Neighboring and non-neighboring proposals were used to generate candidate proposals, and soft-cascade AdaBoost was utilized for learning from them.

Vision-based pedestrian detection has made solid progress. It is worth mentioning that the success of thermal imaging sensors in medical and military applications provides a greater number of features that can be included in this area as well. One such work using thermal images for pedestrian detection is presented in [84]. Although thermal imaging suffers from higher noise and appearance variation and is considered less effective than natural daylight imaging, a local steering kernel (LSK) was used here to reduce noise in the image, and a framework was presented to acquire a tensor descriptor compared via matrix cosine similarity. A key function of LSKs is to process the gradient distribution in a small region (5*5), producing a steering matrix that captures the principal direction of local texture. Detection in this work used frequency-domain and image-based normalization, yielding a sophisticated framework for effective and fast pedestrian detection.

The proposal extraction in [85] used a spatial-temporal histogram of oriented gradients (STHOG) that focused not only on human shapes but also on their motion information. Further, a geometric ground constraint was applied to process the desired region, thereby reducing the cost of false predictions and improving pedestrian recognition performance. The motion complexity of the background and target object was handled by an image stabilization phase, which suppressed the background motion between consecutive frames and extracted the pedestrian motion. To deal with the multiscale issues associated with pedestrian size, multiple normalization windows were used according to the scale disparity. Finally, the STHOG pedestrian detector obtained from AdaBoost was applied for object classification.

Sometimes only a portion of a pedestrian is visible owing to the complexity of the scene. One possible solution is to use multiple cameras on different parts of an intelligent vehicle to accurately recognize pedestrians, but handling, processing, and recognizing multiple data sources is itself a complex task. Another solution, apart from the vision-based approach, is the 3D laser technology mentioned in [86], where the authors used 3D information for pedestrian detection. However, 2D classification cannot be directly applied to 3D data because the height of data points reflected from different objects may differ.

The simulated and real results of pedestrian detection along with the laser point cloud data are represented in Fig. 10a, which contains the data frames. The red cube represents the identified pedestrians, while the yellow cube portrays the vehicles, showing the strong performance of the laser point cloud detection method. Similarly, Fig. 10b represents the final output containing the fusion of image data and laser point cloud data. In another approach, Zhang et al. proposed obtaining rectangular features using a geometric shape model consisting of three rectangles associated with the head, upper body, and lower body [87]. A pool of templates was designed to represent the shape information using binary and ternary models. The binary multimodal has the limitation of not representing corner information accurately, whereas the ternary multimodal represents pedestrian shape through corner regions even in more complex geometric arrangements. The method started with a small window scanning the entire image; when one or more similar templates already existed in the pool, the candidate template was rejected. In the last stage, a multichannel cell descriptor was used, where every template was processed via multiple channels to produce a multichannel feature pool.

Fig. 10 a Pedestrian detection result using point cloud data [86]. b Fusion result of pedestrian detection [86]

Table 4 shows a summary of some noteworthy work in pedestrian detection consisting of deep learning and mathematical model approaches.

Table 4 Pedestrian detection

Abbreviations: TP true positive, FP false positive, FN false negative, MR miss ratio, R recall, P precision, FPS frames per second, TWR true warning rate, TSR true suppression rate, TWA total warning accuracy, DPM deformable part model, AMR average miss rate, DR detection rate, FPPI false positives per image, SS selective search, SW sliding window, LDCF locally decorrelated channel features, AlexNetP AlexNet for pedestrians, ASD average short distance, AMD average middle distance, ALD average long distance, HOG histogram of oriented gradients.

2.4.3 Discussion

The most essential and interesting problem for an intelligent vehicle is to detect pedestrians while driving on the road. In [71,72,73], pedestrians were detected using R-CNN architectures. To deal with low illumination, [71, 73] used R-CNN with thermal images to accurately identify pedestrians, whereas [72] focused on determining anomalies in the given scene. The research in [75] established a deep learning-based architecture to handle pedestrian orientation according to left, right, or frontal movement. Pedestrians at far distances suffer from scaling issues, and [77] implemented a deep learning approach to solve them. An improved deep learning model in [78] offered a unique approach to detecting a pedestrian by identifying the upper and lower body independently, especially to handle occlusion. The model-based approach in [80] detected pedestrians according to their sitting and standing positions, whereas [81, 82] used pyramid-based feature extraction and ROI with HOG, respectively.

The review of these studies reveals that detecting pedestrians is difficult and challenging, mostly owing to scaling and occlusion issues. Most authors have focused on daylight pedestrian detection; however, night-mode operation and heterogeneous road environments still need improvement in upcoming works.

2.5 Vehicle detection

Another challenge for an intelligent vehicle is a varying environment in which many vehicles drive randomly on the same or neighboring tracks. Recognizing such vehicles and deciding whether to change or remain on the same track is a crucial task. To handle such situations, identifying vehicles and keeping a safe distance while reaching the destination is also a key objective of an intelligent vehicle system. Some studies on detecting and recognizing vehicles using state-of-the-art techniques are as follows:

2.5.1 Deep architecture-based approach

It is obvious that the nearest front vehicle that an intelligent vehicle approaches on the road can be detected using size and shape parameters. However, obtaining accurate information from such parameters is difficult, being subject to the position and varying speed of the vehicle. This challenging issue of scaling between far and near vehicles is addressed in [88], where the authors presented a deep learning architecture-based model, SINet, for fast vehicle detection. Initially, the proposal map was formed using bounding boxes for object detection along with a region proposal network (RPN). In the next step, context-aware ROI pooling was used to adjust the proposals to the extracted feature dimensions without losing significant background information; it used deconvolution with bilinear kernels to enlarge small feature proposals specific to the target object. This pooling mechanism was processed through multiple layers, and the extracted features were then merged with low- and high-level information to identify the vehicle object.

Many approaches have been proposed for vehicle detection, but classifier performance is restricted when there is a large dissimilarity between the source scene and the target scene owing to scene complexity. To handle this issue, a deep learning-based approach is presented in [89], where a deep autoencoder was merged with a basic deep CNN and a voting scheme for vehicle detection was presented. This modified DCNN structure had one input layer, two convolutional sub-sampling hidden layers, and one feature-vector output layer. The input layer was configured for 32*32 images, and the convolutional kernels of the hidden layers were set to 5*5. In the confidence map, subsamples produced at different levels of the classifier were combined by averaging, and the computed confidence score served as the voting value used for vehicle detection.
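The described topology resembles a LeNet-style network; a hedged PyTorch sketch follows (channel counts and the sub-sampling choice are assumptions, since [89] specifies only the 32*32 input and 5*5 kernels):

```python
import torch
import torch.nn as nn

class VotingDCNN(nn.Module):
    """32x32 grayscale input, two 5x5 convolution + sub-sampling stages,
    and one feature-vector output layer (channel counts are assumed)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 10x10 -> 5x5
        )
        self.output = nn.Linear(16 * 5 * 5, 2)  # vehicle vs. background

    def forward(self, x):
        return self.output(self.features(x).flatten(1))

def vote(confidences):
    """Average confidence scores from several classifier levels,
    mimicking the paper's voting on subsampled outputs."""
    return torch.stack(confidences).mean(dim=0)
```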

Scaling problems, low accuracy, and slow detection rates are major issues that degrade system performance. To deal with these, Zhang et al. presented a deep learning-based approach for better vehicle detection performance [90]. For this purpose, a deep CNN structure based on Faster R-CNN, built upon VGG-16 and an inception model and called DCN, was used. In this phase, a precise vehicle region network (AVRN) followed by a vehicle attribute learning network (VALN) was developed to identify vehicle-like boundaries. Both networks were trained using stochastic gradient descent (SGD). AVRN rejected estimated boundaries outside the image region and assigned each remaining region a binary tag representing object or background.

Classification of different vehicle categories in the real world is challenging owing to vehicle sizes and the availability of driving space. Such a classification technique is proposed in [91], where the authors classified major vehicle categories, namely cars, minivans, trucks, and buses. Faster R-CNN was employed to achieve this objective: a modified RPN was first used for vehicle detection, and then the bounding-box regression loss and softmax function were implemented for the final classification decision. The approach was validated through standard precision and recall curves against the shared and unshared layers of the convolutional process on the validation dataset, as shown in Fig. 11. From the figure, it is evident that the learning curve for classifying minivans and buses is somewhat lower than that for cars and trucks.

Fig. 11 Precision-recall curve analysis for vehicle type classification [91]

In many deep learning approaches, one-stage network architectures are suitable for fast detection but suffer from low detection accuracy, whereas two- or multi-stage architectures are best in accuracy but suffer in computation time and are considered slow. Combining both capabilities may enhance system performance in real-time scenarios; such an idea is presented in [92], where the authors proposed HybridNet, a fast deep-CNN-based architecture for vehicle detection. VGG16 was adopted as the base architecture, with the first 13 convolutional layers applied to process the input data and produce the features required for precise detection. Based on tentative evaluation of the obtained features, bounding-box formation and the associated classifications were predicted. An ROI mapping layer was added to the network at an intermediate stage to receive high-resolution features. The authors used the counterweight matrix [− 0.25 0 0.25] and extracted the box coordinates to produce the features.

2.5.2 Mathematical model-based approach

A study on the scaling issues of vehicle detection is presented in [93], where a multiscale AND-OR graph (AOG) model was proposed. The multiscale AOG graph structure described local features (license plate and window) and global features (vehicle contour and flat regions). The graph consisted of three node types: AND, OR, and terminal nodes. An AND node \({A}_{e}^{f}\) described a vehicle or a portion of a vehicle, where e was its layer index among all AND nodes and f the sequential index. An OR node \({O}_{q}^{i}\) described a vehicle or a portion of a vehicle with an alternate assembly, where q was the layer index among all OR nodes and i the sequential index. This probability model and graph structure constituted the multiscale AOG model that determines the presence of vehicles on the road, as shown in Fig. 12.

Fig. 12 a AOG structure. b Templates for terminal nodes [93]

The amount of pollution emitted by motor vehicles is increasing and is unsafe for the environment. Building on this unique idea of smoke emission, Tao et al. presented a model for vehicle detection [94]. ViBe background subtraction was used to fetch and recognize foreground objects, and the matching degree between foreground and background objects was calculated. Afterward, detection of the rear portion of the vehicle used a modified integral projection (IP) method, known for extracting significant features and locating objects. Compared with the road background, the gray value of black smoke was relatively low, and during feature extraction the black smoke element connected to the vehicle object was detected as a foreground object. A gray-level co-occurrence matrix was used as a statistical feature to determine how often a specific pixel-value pair occurs, describing the spatial relationship of the pixels.
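A small sketch of GLCM texture statistics with scikit-image is shown below (the distances and angles are assumptions; [94] does not specify them):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_stats(gray_patch):
    """Count how often pixel-value pairs co-occur at the given offsets.

    gray_patch must be a uint8 image.
    """
    glcm = graycomatrix(gray_patch, distances=[1],
                        angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return {prop: graycoprops(glcm, prop).mean()
            for prop in ("contrast", "homogeneity", "energy", "correlation")}
```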

Most studies on multiclass vehicle detection and classification operate under standard environments, but an intelligent vehicle may face challenging conditions, including nighttime. In this direction, multiclass vehicle detection was built upon feature selection using tensor decomposition and object proposals, especially for nighttime [95]. For feature extraction, HOG, local binary pattern (LBP), and four-direction features (FDF) were employed, and training images were divided into 64*64-pixel blocks. Different block sizes produced varying detection performance, and the authors found that the performance of 64 blocks of 8*8 pixels was close to that of multiscale blocks. In the tensor construction, the features were organized along three indices: vertical, horizontal, and the features themselves; each block had 4 FDF, 36 HOG, and 58 LBP features. In the subsequent object-proposal phase, EdgeBoxes was applied, as it offers a decent tradeoff between computation time and detection rate. As the neighboring environment may contain local features similar to vehicle taillights, which can be misleading, the original images were converted to HSV space before local contrast features were extracted.

Another study in this category is presented in [96], where the authors used millimeter-wave (MMW) radar and a monocular camera to detect and track vehicles. Initially, the radar identified the target vehicle and conveyed the location and size of the ROI to the camera unit, which applied active contour and symmetry detection to recognize the vehicle inside the ROI provided by the radar. In the symmetry detection method, the Sobel operator was employed to determine edges, and for every line of the image the numbers of symmetrical and non-symmetrical points were calculated. The shadow under the vehicle was considered a notable feature and was used within the ROI by identifying its location. In the tracking phase, the target vehicle was defined as a bounding box with five components: target boundary, appearance representation, motion and position representation, and process and model updating. Subsequently, using the bounding box, a locality-sensitive histogram was used for appearance representation, and a uniform motion model was applied for motion and position representation.
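The per-line symmetry check can be sketched as follows (a simplification that assumes the symmetry axis is the vertical center line of the ROI):

```python
import cv2
import numpy as np

def row_symmetry_scores(roi_gray, edge_thresh=100):
    # Sobel gradients, combined into an edge-magnitude map
    gx = cv2.Sobel(roi_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(roi_gray, cv2.CV_32F, 0, 1)
    edges = cv2.magnitude(gx, gy) > edge_thresh

    # For every image line, count edge points mirrored about the center
    flipped = edges[:, ::-1]
    symmetric = np.logical_and(edges, flipped).sum(axis=1)
    total = edges.sum(axis=1)
    return symmetric / np.maximum(total, 1)  # 1.0 = perfectly symmetric row
```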

Considering the dynamic and challenging real-time scenario on the road, a vehicle has blind-spot areas that cannot be seen directly or indirectly while driving. In the context of intelligent vehicles, some portions of the road cannot be captured by a camera, which increases the chance of accidents. This is addressed in [97], where blind-zone vehicle detection was proposed using a fisheye camera on the rear side of the vehicle. In the vehicle detection phase, AdaBoost was applied to identify the front side of a vehicle using the rear-side fisheye camera. In this novel method of rear vehicle detection for the blind zone, the authors used vehicle wheels as a notable feature, as they appear less irregular and more circular. This detection phase used the Hough circle detection process together with the Canny edge detection algorithm. In addition, wheel-arch contour detection was performed using Douglas-Peucker polygon approximation to translate the arc shape into a triangular shape, which helped confirm the presence of a wheel inside this triangular boundary and improved detection accuracy.
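A hedged sketch of the wheel-candidate step is given below (the Hough parameters are placeholders, not the values tuned in [97]):

```python
import cv2
import numpy as np

def wheel_candidates(gray):
    blurred = cv2.GaussianBlur(gray, (9, 9), 2)
    # HOUGH_GRADIENT uses Canny internally; param1 is its upper threshold
    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1.2,
                               minDist=40, param1=120, param2=40,
                               minRadius=10, maxRadius=80)
    if circles is None:
        return []
    return np.round(circles[0]).astype(int)  # rows of (x, y, radius)
```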

Detection of occluded vehicles or objects is challenging, and work dealing with this issue is presented in [98]. It proposed part-based vehicle recognition focused on probabilistic inference over partial occlusion and viewpoint variations of the vehicle. The part-based vehicle was modeled for every viewpoint with discriminative geometry and appearance parameters, termed the viewpoint discriminative part appearance model (VDPAM) and the viewpoint discriminative part-based geometric model (VDPGM). For the given dataset with viewpoints k = 1,…,K, a geometric model \({M}_{g}^{k}\) is trained and formulated as Eq. (2):

$${M}_{g}^{k}=\left\{{\zeta }_{i}^{k} \mid i=1,\dots ,{n}^{k}\right\},\quad {\zeta }_{i}^{k}\sim N\left({\mu }_{i}^{k},{\Sigma }_{i}^{k}\right)$$
(2)

where the geometric model \({M}_{g}^{k}\) consists of \({n}^{k}\) parts. For each part \({\zeta }_{i}^{k}\), its position relative to the vehicle center \({r}^{k}\) is defined by a 2D Gaussian with mean \({\mu }_{i}^{k}\) and covariance \({\Sigma }_{i}^{k}\). This viewpoint model simulated occluded parts and filtered out low-quality part detections.

An alternative approach to detecting vehicles in complex scenes is presented in [99]. In the preprocessing stage, images were converted to grayscale, and averaging and Kalman filtering were applied to remove noise. In the vehicle detection stage, a mixture of Gaussians was employed for background subtraction to identify motion, and a Kalman filter was then used to track vehicle movement. Active shape modeling (ASM) was engaged for structural content similarity and provided shape information, while principal component analysis (PCA) was implemented to obtain the geometric information used to establish pose in the ASM training phase. During feature extraction, scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) mechanisms were applied to generate feature maps. Finally, an adaptive neuro-fuzzy inference system (ANFIS) was implemented to categorize moving vehicles.
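A minimal sketch of the mixture-of-Gaussians motion step with OpenCV follows (MOG2 stands in for the paper's Gaussian mixture model, and the area threshold is an assumption):

```python
import cv2

def moving_vehicle_boxes(video_path, min_area=400):
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        mask = subtractor.apply(gray)   # foreground = moving pixels
        mask = cv2.medianBlur(mask, 5)  # light denoising
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours
                 if cv2.contourArea(c) > min_area]
        yield frame, boxes  # boxes could feed a per-object Kalman tracker
    cap.release()
```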

Work presented in [100] involved a series of stages to detect and recognize vehicles for traffic surveillance systems. The feature extraction phase included Haar-like features, efficient at describing the local appearance of objects. Next, AdaBoost was applied to select the best weak classifiers and combine them into a robust classifier, and a Gabor wavelet transform was employed in the vehicle recognition phase for sequential feature extraction. The authors of [101] presented a color-fusion deformable part model (DPM) to detect vehicles. This model was divided into a root model, which defined the complete features of the vehicle, and a part model, which described its local feature details. The model captured vehicle features across surrounding conditions such as lighting, ground appearance, and color variation. As the HSI color space is stable under these conditions, color images were first converted into HSI space. Using the HOG features of the H, S, and I channels, the DPMs trained on the individual color channels were tagged DPMh, DPMs, and the equivalent for the I channel. Afterward, the color features of the test images were computed, and the model was evaluated with a fused feature score using a sliding window; if the obtained score exceeded the designated threshold, the region was considered a vehicle area.

Detection of a single category of vehicle is comparatively easy, but this is not the case with multiple vehicles, as they belong to a number of categories including trucks, cars, trolley vehicles, bikes, etc. Detection of multiple vehicles is presented in [102], in which the authors proposed feature extraction through HOG and Haar-based methods in a two-step procedure. The main objective of the first stage was to reduce the search area to approach the region of interest; finer and stronger features were then extracted from the reduced ROI using HOG. In the last stage, a linear SVM classifier was employed for target vehicle detection and classification.

Real-time detection and tracking of vehicles is proposed in [103]. The authors proposed a solution using a Raspberry Pi board that detected, tracked, and counted the vehicles on the road. As this approach extracted only the color features of vehicles, it is computationally time-effective. Initially, to distinguish vehicles from other parts of the image, RGB images were converted to the HSV color model owing to its wide color-space range. Next, an 8-connectivity chain code was employed to find the active contours. Since each object can be defined by a polygonal shape, the authors used this information to compute the convex hull. With the help of data association, the centroid points of the extracted objects were assigned to a Kalman filter to handle track preservation and target finding for single and multiple targets.

A rear-camera-based blind-spot detection (BSD) system was built to notice both vehicles and motorcycles and thus deal with accidents occurring in blind spots [104]. The effort focuses on blind spots in which objects are not effectively recognized. The method concentrated on tyre detection, as tyres show little shape variation. To achieve this, HOG was used for feature extraction, and an SVM was applied to classify tyre and non-tyre objects. In the tracking phase, a Kalman filter was used to track the center of the tyre. The same procedure was applied to detect and track motorcycles, as they share tyre features with other vehicle categories; the upper part of the motorcycle was the least significant criterion, as it has no noteworthy features.

Night and daylight scenarios are equally important for intelligent vehicle systems, and work covering both is presented in [105]. The authors used a Gaussian mixture model (GMM) to develop the spatial relationships between the components of interest. Most vehicle components have distinct texture, color, and shape features; therefore, elements of the vehicle such as lamps and license plates were first observed with basic techniques under daytime scenarios. GMM was then applied to model the associations among these elements, and these relationships improved correct detection. For the nighttime scenario, regions of lights were first detected using a frame-difference technique, and headlight pairing was then processed using symmetry and shape features.

In the direction of vehicle classification, the work in [106] contributed a method for classifying six categories of vehicles, operating on both visible-light and thermal image processing. The traditional foreground subtraction method, comprising background modeling and subtraction (BGS), was applied to separate foreground objects from the image. In subsequent steps, Sobel edges were detected first, and a median filter was then applied before extracting features from the grill and headlight regions. Texture information was processed by a gray-level co-occurrence matrix (GLCM), which models the relationship between pixels; elements such as momentum, homogeneity, contrast, and entropy were calculated, and similar approaches were applied to the thermal images. However, it was noted that the daytime accuracy of thermal images is considerably lower than that of visual cameras.

Detecting a vehicle is one task and attaching annotations is another; the work in [107] proposed both recognition and annotation for vehicles, called DAVE. It consisted of a fully convolutional fast vehicle proposal network (FVPN) to extract vehicle positions and a deep attribute learning network (ALN) to extract vehicle features such as color, pose, and other significant information. The FVPN is a narrow fully convolutional network in which foreground and background objects are differentiated through binary classification, followed by the creation of bounding boxes normalized to the range 0 to 1. Next, GoogLeNet was adopted and modified for the ALN owing to its lower computation and memory requirements. This network was multi-task, with four softmax loss layers to annotate the vehicles; four labels were used for each training image: vehicle (V), pose (P), color (C), and type (T).

Vehicle detection on the road has been implemented and analyzed in much of the literature; here we highlight some key studies and their result statistics in Table 5.

Table 5 Vehicle detection

Abbreviations: A accuracy, P precision, R recall, ROI region of interest, RPN region proposal network, SINet scale-insensitive convolutional neural network, MMW millimeter-wave, WACD wheel arch contour detection, TPR true positive rate, FDR false detection rate, FP false positive, DPM deformable part model, VDPAM viewpoint discriminative part appearance model, VDPGM viewpoint discriminative part-based geometric model, RSPVM road structure-based probabilistic viewpoint map, ANFIS adaptive neuro-fuzzy inference system, HOG histogram of oriented gradients, SVM support vector machine, TP true positive, OT operating time.

2.5.3 Discussion

Size and shape information is highlighted for vehicle detection in [88] to handle the scaling issue, using SINet and a VGG implementation along with non-maximum suppression. The deep learning model in [91] used R-CNN for vehicle classification into four categories to determine the availability of driving space while changing lanes. Detection accuracy is addressed in [90], where the authors proposed a deep model to extract and learn significant features, thereby enhancing detection performance.

A unique approach in which smoke emitted by vehicles was used to detect them on the road appears in [94]. Detecting vehicles in the blind zone is among the most challenging tasks and is addressed in [97, 104]. The authors in [97] detect vehicles by analyzing the features of vehicle wheels, especially in the blind zone of a rear camera, while [104] applies a similar approach along with Kalman-filter tracking to identify the vehicle.

The literature discussed in this section has contributed various schemes for vehicle detection, most of which operate in a standard working environment. However, detection performance still needs improvement in diverse environments, and precise classification across a greater number of vehicle categories remains demanding. Detection of occluded vehicles is the most critical part and requires the highest level of accuracy.

2.6 Traffic light/signal/gesture detection

To manage the stream of traffic smoothly in urban and rural areas, traffic signals are expected to regulate vehicle traffic. If followed properly, they help reduce hazardous situations and can minimize the chances of accidents. However, identifying traffic signals and making decisions based on the surrounding scenario is not easy for an intelligent vehicle system; it must be trained so that it can operate with minimal or zero hazard. It is worth mentioning that traffic signal systems sometimes fail for technical reasons, in which case traffic police must direct the traffic, and an intelligent vehicle must recognize their series of actions in order to proceed on the road. Here, some important works are mentioned that present diverse views on traffic signal and traffic police gesture recognition.

2.6.1 Deep architecture-based approach

The intensities of traffic lights are an important aspect of controlling traffic from a distance. It is quite easy to identify the red, green, or yellow lights of the traffic system, but for an intelligent vehicle system it can be confusing, as there are numerous lighting variations, including vehicle rear lights, lights from buildings, and hoarding reflections, especially at night. This increases the complexity of the objective in a dynamic environment, because an intelligent vehicle must reliably differentiate traffic signal lights from neighboring lighting variations.

Recognition of traffic lights under such an environment is presented in [108], where the authors' objective is to detect and recognize traffic lights using the dark and bright channels, respectively, since the two channels have different appearances under day and night lighting conditions. To reduce processing time, an ROI was applied to the given data, mostly in the upper portion of the image, because the traffic light system appears in the top-left or top-right portion of the image. The deep learning architecture CaffeNet, well known for fast processing, was used in this work and modified to recognize 12 light-class variations: horizontally and vertically aligned red and green lights, right and left vehicle lights, green and red arrow lights, green and red pedestrian lights, and other fake red lights. To improve detection accuracy, temporal-spatial analysis was used to observe preceding frames and decide whether a candidate appears in the same area in the next frame.

Finding objects in the provided image data may become challenging if the data contain many non-candidate objects, sometimes referred to as noise. One way to handle this issue is to define a robust ROI, through which the candidate is bounded within a region and other non-candidate objects are rejected. An approach for finding such participating candidates through ROI extraction, built upon the current position in 3D maps, is discussed in [109]. Two stages were defined under this 3D mapping: a 3D point-cloud map was responsible for 3D coordinate values and characterized the shape statistics of the scene (refer to Fig. 13), while a 3D feature map was responsible for extracting proposals for the participating objects. ROI extraction was the next stage, where 3D mapping and feature extraction established an association between the camera and the light detection and ranging (LIDAR) sensor to estimate the camera position. In the morphological operation, the RGB color space was converted to HSV in the color-thresholding procedure. For the deep learning architecture, to speed up the detection rate and achieve high recognition accuracy, the authors used SSD, modified to identify the color state of traffic lights.

Fig. 13 ROI extraction in [109]

Traffic lights are not large objects compared with other objects present in a road scene; thus, detecting small light objects surrounded by a complex background is relatively challenging. The work in [110] applied a CNN to identify and classify traffic-light objects in given image data. In this direction, an attention proposal modeler (APM) offered areas likely to hold targets using Faster R-CNN and performed regression on the bounding-box region. The APM instructs the accurate locator and recognizer (ALR) about the specific portion of the image where significant features should be examined; the ALR then focuses on and categorizes the targets in those regions and performs regression on the bounding box of the actual item. The authors mentioned that detection of such small targets is effective even when the target area is smaller than 32*32 pixels. The outcome of the proposed approach is given in Fig. 14, which represents the detection of traffic lights under various scenarios.

Fig. 14 Traffic light detection results under a heterogeneous environment, including daylight, under a bridge, extremely dark regions, and bright sunlight [110]

One more proposal extraction architecture is used in [111], where the authors proposed a deep neural network consisting of an encoder-decoder with a focal regression loss to recognize traffic lights. An encoder network, ResNet-101, extracts proposals from the input image, and its output feeds a decoder network that generates a refined proposal map. Traffic lights appear in the upper half of the image, and the authors consider this the target area. The decoder network generates confidences, bounding boxes, and class probabilities. The loss function used in the proposed model is based on that of the YOLO v2 architecture. Unlike the anchor boxes used in other object detection techniques, a freestyle anchor mechanism was deployed for accurate identification of the object.

In real-world scenarios, traffic signboards are mostly of two types: text-based and symbol-based. Most work is built upon symbol-based sign detection and recognition; as color- and regular-shape-based traffic signs are much easier to identify, text-based sign recognition is comparatively challenging. The authors of [112] designed their work around a CNN to recognize both text- and symbol-based traffic signs. The initial phase started with the extraction of ROIs of traffic signs, obtaining a large amount of contextual information; to detect them, maximally stable extremal regions (MSER) were adopted. The output of this phase was then filtered to refine the ROIs using HOG proposals and an SVM classifier. The subsequent phase extracted each traffic-sign ROI and fed it to a custom CNN to obtain the classification result. This procedure was applied repeatedly to every frame of the test input video. Although its processing speed was quite slow, improving it along with geometric considerations was proposed as future work.

Similarly, VGG16 and Inception v2 were employed as base networks to effectively produce a proposal map for traffic signs. A set of default boxes of varied dimensions was allocated to each element of the proposal map. Traffic-sign templates for boundary corners were used as references from the shape classification to examine the test-case template [113], as visualized in Fig. 15.

Fig. 15 Boundary valuation using 2D pose and shape label [113]

Extracting and recognizing text-based road signs against non-text backgrounds while solving multiscale issues is presented in [114]. A cascaded segmentation-detection framework was employed for this purpose. The work, inherited from TextBoxes and using a CRNN, was modified for multiscale proposal maps: extraction of proposal maps for small objects used lower layers, whereas large objects used deeper layers. Further, to reduce the difficulty of matching default boxes, TextBoxes added denser boxes in the vertical direction. All these procedures were applied to consecutive frames of the video data, and the future work mentioned in the study involved recognition of text-based signs in multiple languages.

Diverse situations such as varying illumination, occlusion, and motion blur are challenging for traffic sign detection, while high accuracy with fast computation is also required. To deal with these, a hybrid CNN implementation consisting of R-CNN and MobileNets was employed to make the recognition process more competent. Region proposal networks were used to discriminate foreground from background information. The classifier, which had a box classification layer and a box regression layer, fed into a softmax layer to determine the probability score. However, small traffic signs could not be localized accurately, which made classification more problematic. Refining the object localization was the subsequent phase, accomplished by calculating the center of the regressed bounding box and extending its borders to cover the complete traffic sign [115].

To deal with small traffic signs, Yuan et al. offered a vertical spatial sequence attention network (VSSA-NET) for traffic sign detection consisting of two phases [116]: first, a multi-resolution proposal learning unit, in collaboration with MobileNet as the backbone, to minimize processing time; second, a vertical sequence attention unit to translate the vertical spatial context for precise classification of a traffic sign. An LSTM decoder layer output the classification and recognition results simultaneously. Detecting small traffic-sign objects with a precise level of recognition remains a challenging issue. In this direction, Liu et al. offered a multiscale region-based CNN (MR-CNN) built upon VGG-16 [117]. Some layers were effective for localization but led to poor recognition, which the authors examined with different kernel sizes, padding, and strides in the convolution layers. They designed a fused proposal map that increased the resolution of small traffic signs while containing more in-depth semantic detail, thus improving the efficiency of the region proposals. In the subsequent procedure, an RPN produced region proposals from the combined feature map. Finally, based on the estimated probability score, classification using the softmax function was executed, while another layer implemented a bounding box to localize each target object.

Though the solution accuracies reported in these studies are quite promising, plenty of problems remain to be solved. In the direction of the manual traffic control system, various human actions are recorded and captured through vision-based sensors. Understanding posture or motion is a demanding process for a system, as it serves as a tool for communication; traffic police gestures fall under this category, as they direct human-controlled vehicles. Future intelligent vehicle systems comprising vision-based operation must deal with identifying the gestures of traffic police, especially in Asian countries like India, where most urban and rural traffic is controlled by traffic police. It thus becomes necessary for an autonomous vehicle to identify such gestures in order to make accurate decisions according to the traffic police's instructions. The following studies illustrate perceptions in this context for intelligent vehicles.

The approach to posture recognition of traffic police in [118] used Faster R-CNN, a deep learning-based architecture. Faster R-CNN has low recognition accuracy against complex backgrounds; to overcome this problem and accurately recognize gestures, a two-stream Faster R-CNN using depth and color data was presented. An RPN was used to generate proposal maps, with the softmax function determining the anchor proposals, followed by a bounding box to refine the resulting output. In a subsequent phase, the depth and color proposal maps were used to determine the category of each proposal. In the end, the position of the recognized frame was refined by re-applying box regression.

Work in [119] employed a multistage CNN for the detection of traffic poses. Specifically, a two-branch multistage CNN and a backpropagation neural network were employed for classification. Initially, VGGNet was used with ten convolutional layers and three pooling layers. The network was divided into two phases: phase 1 forecast the part confidence map, whereas phase 2 aimed at part affinity field prediction. The part confidence map provided a 2D illustration of the probability density of human joints, describing their range and location, while part affinity fields represented features using location and orientation information. The proposals extracted in each layer correspond to human joints, describing pose behavior.

A deep learning implementation of a convolutional 3D neural network, called the C3D network model, was employed in [120]; it can extract spatio-temporal proposals for action recognition along with a frame-dropping process. The custom C3D model used in this work had 8 convolution layers, 5 max-pooling layers, and 2 fully connected layers, followed by a layer governed by the softmax function for classification. Experiments were performed on both static and dynamic data; based on the different motion patterns, the weight model trained on static data was used to refine the dynamic data.

2.6.2 Mathematical model-based approach

The traffic light system follows a uniform structure in color and shape geometry. Color analysis was the primary step in this approach, performed through color segmentation specified by a threshold value. However, color analysis alone was not effective owing to the presence of other similarly colored objects in the scene, so a morphological shape filter was used to fetch information about the geometric statistics. Geometric information was extracted and trained through the SURF proposal extraction method, followed by a nearest-neighbor proposal matching procedure [121].
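A hedged sketch of this color-then-shape filtering follows (the HSV bounds below are rough red-light values, not the thresholds of [121]):

```python
import cv2
import numpy as np

def red_light_candidates(bgr):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so combine two bands (assumed bounds)
    m1 = cv2.inRange(hsv, (0, 120, 120), (10, 255, 255))
    m2 = cv2.inRange(hsv, (170, 120, 120), (180, 255, 255))
    mask = cv2.bitwise_or(m1, m2)

    # Morphological opening keeps compact blobs and drops stray pixels
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # Shape filter: a lit lamp should be roughly circular
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    out = []
    for c in contours:
        area, per = cv2.contourArea(c), cv2.arcLength(c, True)
        if per > 0 and 4 * np.pi * area / per**2 > 0.6:  # circularity gate
            out.append(cv2.boundingRect(c))
    return out
```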

A perceptible discussion on traffic light recognition is provided in [123], where the authors presented a modular architecture for the detection of traffic lights in four stages: ROI generation, recognition, classification, and tracking. The traffic light convention of South Korea was considered in this study, where the system uses two-, three-, or four-bulb configurations arranged mostly horizontally and occasionally vertically. ROI generation for traffic lights was not robust in the sense that image projection assumed a flat road environment, so the proposed approach included both flat and sloped roads for ROI generation. Instead of the traditional recognition method, perception-based approaches focusing on all traffic lights were used to enhance recognition rates. In the detection module, an AdaBoost classifier built upon Haar-like proposals was applied, reducing false positives. The classification module classified objects in two phases: the first discriminated traffic lights from the background, and the second determined the precise state of the active light; the action plan for these phases involved HOG features trained with an SVM. The perspective effect of the camera caused positional discrepancies between two consecutive frames at short and long distances; for this, an adaptive multi-object-tracking Kalman filter was employed with distance information.

There are many circumstances in which these signs are partially or fully occluded by other objects and are difficult to recognize. One solution to this problem is mentioned in [124], where the objective is to develop an occlusion proposal model to differentiate between occluded and non-occluded samples under HOG and SVM schemes. The proposal maps were obtained through a voting phase on individual occlusion maps. To minimize computation cost, negative samples with probability scores below 0.6 were considered occluded and passed to another phase of analysis in the experiment. With the same objective, the authors of [125] proposed a framework showing that the performance of an offline pre-trained detector can be enhanced through an online restructured detector. In offline detection, considering color and shape information, the authors designed a detector with aggregated channel proposals followed by AdaBoost. With two stages, namely prediction and update, a Kalman filter was applied to track the object using a motion model. To improve robustness and accuracy, it was combined with an online detector that fetched proposals using SIFT. The object detection at this stage can recognize objects with multiscale characteristics using a scale-based fusion procedure.

Another approach, refining the localization of candidate traffic signs via optimization, is proposed in [126]. In this work, localization was expressed as a segmentation problem integrating the shape and color information of traffic signs. The segmentation procedure was formulated as an energy minimization problem based on a Markov random field, with the shape, color, and smoothness terms all expressed mathematically in this context. As the transformation between two shapes is a homography, the transformation between the two point sets was processed with RANSAC.

Similarly, an operational scheme based on the cumulative block intensity vector (CBIV) to identify the hand gestures of traffic police is mentioned in [127]. Pixel-wise variances between current and consecutive frames were used to obtain abstract information about moving objects through n-frame differencing, producing the region of maximum intensity to identify. After extracting the foreground image, an ROI in the form of a bounding box was applied in the subsequent phase. To minimize computation cost, the extracted ROI was further divided into three blocks, B1, B2, and B3; average pixel intensities were calculated for each block, and the block depicting the maximum movement was considered for further examination. In the implementation, SVM, random forest (RF), and decision tree (DT) classifiers were employed to differentiate the patterns of traffic police hand gestures.
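A simplified sketch of the frame-differencing and block-intensity idea follows (the three-block split follows the paper's description; everything else, including the intensity averaging, is an assumption):

```python
import numpy as np

def dominant_block(frames):
    """n-frame differencing, then average intensity over three ROI blocks."""
    diffs = [np.abs(frames[i + 1].astype(np.int16) - frames[i].astype(np.int16))
             for i in range(len(frames) - 1)]
    motion = np.mean(diffs, axis=0)  # accumulated motion map

    # Split the ROI vertically into blocks B1, B2, and B3
    blocks = np.array_split(motion, 3, axis=1)
    means = [b.mean() for b in blocks]
    return int(np.argmax(means)), means  # block with maximum movement
```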

In many countries, the attire of traffic police is highly standardized, so identifying their outfits is not complex. Exploiting this standardized attire, Guo et al. proposed a traffic police gesture recognition method [122]. As per the Chinese traffic mandate, traffic police wear identical reflective outfits, and distinctive features were obtained from the color of the outfit, as shown in Fig. 16. The identified color region was extracted using a color segmentation method (a toy sketch of such a step is given after Fig. 16). For this purpose, a 5-part body model comprising the torso and the left and right upper and lower arms was used to extract traffic police gestures. A kernel density estimator (KDE) was applied to separate the regions of the officer's torso and arms from the background. To locate the arm portion of the body, a max-covering scheme was implemented, which evaluates the region inside the foreground outline obtained from a morphological procedure.

Fig. 16

The 5-part body model and corresponding tree structure used to identify the gestures of traffic police [122]
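A toy version of the color segmentation step described above might threshold in HSV space; the color bounds below are placeholder assumptions for a typical high-visibility vest, not the values used in [122]:

```python
import cv2
import numpy as np

def segment_reflective_outfit(bgr_image):
    """Isolate a fluorescent yellow-green vest by HSV thresholding,
    then clean the binary mask with morphological opening/closing."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    lower = np.array([25, 80, 80])     # assumed lower HSV bound
    upper = np.array([45, 255, 255])   # assumed upper HSV bound
    mask = cv2.inRange(hsv, lower, upper)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

# Usage: mask = segment_reflective_outfit(cv2.imread("officer.jpg"))
```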

Kinect from Microsoft and OpenNI provide real-time user interaction using gestures. These interactive gesture-recording facilities were adopted in [128], which presented an approach to recognize the actions of air marshals and traffic police. The system first extracted body features and produced a joint map to refine the skeletal representation of the user. An algorithmic procedure was deployed to verify each small segment associated with a gesture. It was noted that every single gesture contains three components: an initial, a movement, and a final image. Using these image series, the complexity of the entire gesture recognition process was slightly reduced. A toy segmentation of a skeletal sequence into these three components is sketched below.
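In this sketch, per-frame joint motion is thresholded to find the movement phase; the array shape and threshold are hypothetical, not taken from [128]:

```python
import numpy as np

def split_gesture_phases(joints, motion_thresh=0.02):
    """Split a (T, J, 3) array of J tracked joints over T frames into
    initial, movement, and final segments by mean joint displacement."""
    speed = np.linalg.norm(np.diff(joints, axis=0), axis=2).mean(axis=1)
    moving = speed > motion_thresh
    if not moving.any():
        return None                       # no movement phase detected
    start = int(np.argmax(moving))        # first moving frame
    end = len(moving) - int(np.argmax(moving[::-1]))  # one past last
    return {"initial": joints[:start + 1],
            "movement": joints[start:end + 1],
            "final": joints[end:]}
```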

Similarly, static and dynamic feature descriptors were used for Kinect-based gesture detection of traffic police in [129]. For static features, a simplified 3D surface representation, a 2.5D depth image, was used; with this approach, the static descriptor of each gesture was rebuilt from the related frames. For dynamic features, a motion history image (MHI) was used to describe arm movement, where every pixel encodes how recently motion occurred at that location (a minimal MHI update is sketched after Fig. 17). MSSIM was used to measure the similarity between test and sample images. Sample output is shown in Fig. 17, where the first row indicates the body-part segmentation through pixel annotations: the torso is shown in red, the upper arms in green, and the lower arms and head in blue. The second row portrays the pose attained by fitting segments to the first row: the lower arms are shown in yellow, the upper arms in green, and the head in purple for clear visibility.

Fig. 17

Traffic police gesture recognition through the scheme used in [129]
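A minimal motion history image update, as a sketch of the dynamic descriptor idea (our own illustrative code, not that of [129]):

```python
import numpy as np

def update_mhi(mhi, motion_mask, timestamp, duration=1.0):
    """Stamp the current time wherever motion occurred and zero out
    entries older than `duration`, so brighter pixels mean more
    recent motion."""
    mhi = np.where(motion_mask, float(timestamp), mhi)
    mhi[mhi < timestamp - duration] = 0.0
    return mhi

# Usage: threshold consecutive-frame differences into a boolean motion
# mask, then call update_mhi once per frame with an increasing timestamp.
mhi = np.zeros((90, 60), dtype=np.float32)
mask = np.zeros((90, 60), dtype=bool)
mask[10:20, 10:20] = True                 # pretend motion occurred here
mhi = update_mhi(mhi, mask, timestamp=0.5)
```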

Some of the key research studies on traffic light/signal/gesture detection are summarized in Table 6.

Table 6 Traffic light, signal and gesture detection

2.6.3 Discussion

The literature reviewed in this section covers the detection of traffic lights, signboards, and traffic police gestures, all of which an intelligent vehicle requires in order to synchronize with traffic rules.

The work in [108] detected and classified traffic lights into twelve categories using a deep learning approach. A study in [110] considered complex backgrounds and detected small light objects of the traffic signal. The authors in [114] used a unique approach based on cascaded segmentation. Model-based approaches, including [121], used an exclusive method that analyzed the color and shape geometry of traffic lights; some false predictions occurred when vehicles or background regions shared a similar shape and color. A SIFT feature extractor was used to correctly identify the traffic elements. These studies are helpful for traffic light or sign detection, but when traffic police direct vehicle movement, their instructions may not be easy for an intelligent vehicle to understand. Identifying traffic police actions and proceeding accordingly will be a challenging area for such vehicles in the near future. Some studies, including [118,119,120], used deep learning architectures to detect the gestures of traffic police, whereas [129] uses a 2.5D representation model to recognize the gesture.

3 Datasets and simulators

3.1 Datasets

In this article, different phases have been discussed, and the review reveals that authors have mostly used public datasets for their adopted methods. However, in some cases, private data have also been prepared by authors for the application of their proposed techniques. In this section, a summary of the various kinds of datasets used in the several phases of intelligent vehicle systems is presented.

In the road detection phase, approximately ten public datasets have been used, of which the KITTI datasets are the most commonly accessed. For the lane detection phase, about eight public datasets have been used, with the Caltech dataset being the most common. About 13 datasets have been mentioned in the pedestrian detection phase, where Caltech is the most preferred; INRIA and DAIMLER rank as the second most used. The PASCAL VOC and KITTI datasets have also been used in the vehicle detection phase. For the traffic light/signal/gesture detection phase, several datasets have been used according to the practices and regulations of different nations. All these datasets have been found suitable and have been successfully accessed and validated in various articles. In several studies, authors prepared their own private datasets and demonstrated their applicability in various phases of intelligent vehicle systems. However, for phases like pothole detection and gesture detection, standard or benchmark public datasets are still lacking and are needed for validating methods in these phases.

Table 7 lists the public datasets used in different phases of intelligent vehicle systems by various authors, along with details of size, number of images or sequences, and available object classes. Further, a graphical representation is given to illustrate the size and number of data samples of each studied dataset in each phase.

Table 7 List of public datasets accessed in the literature

A visualization of all the datasets used in different phases, against their size and number of image/video sequences, is shown in Fig. 18. Most of the data descriptions in this direction are taken from the corresponding articles. However, for some articles, the exact information on data size, number of images or sequences, and object classes could not be included here because it was not available in the concerned literature.

Fig. 18

Datasets: a road detection, b lane detection, c pothole detection, d pedestrian detection, e vehicle detection, f traffic light/signal/gesture detection

3.2 Simulators

To validate the approaches used in autonomous driving, several studies have utilized modeling and simulators [159,160,161,162,163,164,165,166,167,168,169,170,171,172] to investigate, design, and exercise learning in the automotive realm. Regardless of the diversity of domains adopted towards the progress of an IVS, various simulators and tools cover the complete development. The following discussion covers those varieties, focusing on scene observation during the feature extraction phase. A simulated environment contains approximate motion models and, in some cases, portrays the physics of the environment, such as mass and friction. Though these tools are not the core objective of the present study, some of the important ones have been included as illustrations. Table 8 compares some of the significant features of the simulators along with the adopted sensors; simulators that implement the state-of-the-art sensors used in existing autonomous driving systems have been considered. Similarly, some of these simulators have also been compared on the basis of other key features, as given in Table 9. The key focus is how well they perform across various real-time driving environments.

Table 8 Summarized details of simulators for autonomous driving containing the key sensors
Table 9 Brief comparison of different simulators considering various features

Though simulation tools have now become a foundation in the development of autonomous driving, uniform standards to assess their outcomes are still missing and challenging to define. According to one report of the Department of Motor Vehicles, California, the distances covered by automobile companies, including Cruise, Waymo, and Tesla, through their simulation tools do not capture the complexity and variety of real driving [173]. Therefore, it would be optimal and informative to have standard simulation evaluation measures for a fair comparison between various simulators.

4 Discussion

4.1 Challenges

In a standard driving environment, it is quite simple for a driverless vehicle to drive from a source to its destination. However, heterogeneous driving scenarios offer many challenges, most of which involve human life. Moreover, the instantaneous coordination required of autonomous vehicles in dynamic scenes may produce difficult situations for human-driven vehicles. Apart from this, these vehicles are expected to be adopted by society as a transportation tool, either rented or purchased. Both modes require a focused mechanism and close observation, since a vehicle must not be hacked or misused in this digital era. In this direction, the possible social and technical challenges are discussed here to support the effective implementation of autonomous vehicles and to provide insights into unexplored issues that remain to be solved.

4.1.1 Human intervention

Keeping aside the commercialization scheme of autonomous driving, these vehicles will also be available as an individual service to society. A person will be able to hire this mode of transportation using a smartphone. Hence, autonomous vehicle driving must involve an effective design collaboration of software interfaces, hardware infrastructure, and internet communication links that can handle concurrent requests from multiple users. Similarly, the positional information of where a service request is initiated needs to be processed using GPS sensors, wireless networks, or similar localization sensors. Sometimes a user cancels or ignores the hired service due to an unavoidable emergency; in such cases, the system should be capable of freeing itself if no service is taken up after a particular time duration.

4.1.2 Vehicle-to-vehicle (V2V) communication

In the case of multiple requests from different users, a system can transmit the requests immediately to the available autonomous vehicles in that area. This scenario requires a dedicated communication link between such vehicles. Likewise, to avoid traffic congestion, these systems can communicate and reroute themselves using some kind of city map. At the same time, following all the traffic rules while implementing V2V communication is also a challenging scenario for these vehicles. In this direction, intelligent vehicles will also have to observe the traffic environment and transmit large volumes of data and their learning outcomes to the connected vehicles.

4.1.3 Path planning

Even from the viewpoint of a single autonomous vehicle, deciding which path should be chosen in a heterogeneous traffic environment is a challenging task. There may be multiple paths, and choosing the best one depends upon the requirement and situation. Defining an optimal path involves the perception of the user and the scenario, which also includes geometric restrictions, traffic routes, experience from historical data, and incoming service requests. Any changes in the physical city road map should also be incorporated for the system to plan paths efficiently. In another scenario, these vehicles must have information and a decision-making strategy regarding fuel or charging stations, without which a vehicle cannot drive.

4.1.4 Driving behavior on the road

Driving on the street is not an easy task. A vehicle has to overtake or change its driving pattern when it identifies a particular object in its pathway. Safe trajectory planning is a challenging and important problem for an intelligent vehicle system, as it decides how effectively the vehicle drives and how well it assures safety for other vehicles and pedestrians. Apart from the driving zone, an intelligent vehicle may also face the parking scenario, where precise and safe trajectory planning is required, which is another significant and challenging task. Similarly, at road intersections, an autonomous vehicle is expected to drive intelligently without disturbing other road elements such as vehicles, pedestrians, and cyclists, and without violating traffic rules.

4.1.5 Scene perception

In the functioning of an intelligent vehicle system, a number of different sensors are attached to understand the driving environment. Depending upon the application, the system examines the extracted data, processes it, and uses it to change the driving behavior of the vehicle. To precisely understand the driving atmosphere, these vehicle systems are trained with a large number of data samples and their performance is confirmed in test mode. This involves sensing the environment, representing the sensory data in a suitable format, and then taking appropriate decisions accordingly. Here, localization indicates the present location of the vehicle in the context of the road map, while application-specific perception involves various indicators such as traffic lights and symbols. However, challenging or unseen objects on which the system has not previously been trained may make recognition difficult and may also lead to unavoidable hazards.

4.1.6 Safety obstacles

Autonomous vehicles benefit society and may contribute significantly to reducing the road traffic death rate. They assist the life safety of passengers, drivers, pedestrians, and other road users. However, several incidents have slowed down the development of autonomous vehicles. One such road death happened with a Tesla Model S [174]. Similarly, algorithmic bias has been studied, in which pedestrian detection suffers and pedestrians with dark skin color are missed [175]. Due to malfunctions and system failures, these machines have not gained complete trust, because putting life safety entirely in the hands of an autonomous machine remains risky.

Recent approaches, technical aspects, and various methods have been identified with their strengths and limitations, and are summarized below by highlighting the research challenges in each phase of the intelligent vehicle system:

  1. a.

    The detection of roads is challenging due to the complex scene environment. The presence of vehicles, pedestrians, and other road scene objects makes road detection difficult. Even leaving road objects aside, detection of the road boundary and its structure alone is a difficult task. Most of the existing literature considers well-defined and standard road structures; however, runtime conditions like varying categories of roads, unclear road boundaries, etc. must also be considered for the validation of an adopted method. Further, optimization techniques may be hybridized with these road detection methods to obtain better performance under unfavorable and heterogeneous road conditions.

  2. b.

    Notable performance has been reported for lane detection using various mathematical models and other techniques. Based on the category of road, different lane-marking schemes are used. The selection of, and driving on, specific lanes by intelligent vehicles depends on the set speed range and surrounding conditions such as the presence of vehicles, traffic congestion, etc. Additionally, approaches for detecting occluded and curved lane lines have limited performance. Understanding these conditions, making decisions accordingly to synchronize with the traffic, and at the same time following the rules and regulations, is a challenging and therefore important aspect of an intelligent vehicle system.

  3. c.

    Various methods using texture and other parameters have been applied to detect potholes. However, unstructured potholes in dynamic scenes demand higher detection rates and validated accuracy. The depth of these potholes must also be determined, because the chance of damaging the vehicle is high if they are not identified properly during high-speed driving. Detection of potholes and speed breakers is still underexplored, and the domain requires an effective mechanism for making the best possible decision within a minimum time frame in the runtime environment.

  4. d.

    Due to the varying speed of autonomous vehicles, detection and recognition of pedestrians in a complex environment is difficult. In some cases, pedestrians are occluded by other objects, making the situation more challenging. Similarly, pedestrian orientation must also be considered, as pedestrians may approach any part of the road from any direction. Knowledge of pedestrian direction and relative speed will assist the intelligent vehicle in taking prior action to avoid a collision.

  5. e.

    The presence of heterogeneous vehicles (of varying size, shape, and type) on the road makes their detection difficult. This phase has attracted various authors; based on the category of vehicle, an intelligent vehicle is expected to decide whether to overtake the leading vehicle or to follow it while maintaining a safe distance.

  6. f.

    Traffic sign/signal/gesture recognition is another demanding area. In the case of sudden traffic congestion or signal failure, traffic police play a significant role and can direct or instruct vehicles to manage the road traffic. Understanding these gestures and acting upon them is a decisive task for an intelligent vehicle.

5 Conclusion

From the research studies reviewed in this article, it has been observed that various state-of-the-art techniques focus mostly on feature-learning-based schemes, including mathematical models, machine learning, and deep learning techniques. These studies are significant in terms of their effective methods and substantial performance. However, several issues still need to be addressed in the various phases of an intelligent vehicle system. The development of the intelligent vehicle is at an advanced stage and is expanding the potential of the vision sensor [14]. But it has undergone road accidents with the death of a pedestrian [174] and has faced issues like algorithmic bias [175]. Hence, high driving precision is required, as the system has to deal with a challenging environment where appropriate decisions must be made under various conditions and driving circumstances. In this article, more than 170 relevant studies covering the various phases of an autonomous vehicle have been reviewed.

Automation being a demanding area in the automobile sector, this study focused on the issues and challenges in computer vision-based methods for the various phases of intelligent vehicle systems. It provides opportunities for improving the performance of intelligent vehicle systems in all possible dimensions. Integrating technologies like computer vision, machine learning, and deep learning with real-time systems is a necessity in the development of intelligent vehicle systems. The identified challenges can be used to examine and build a more competent approach for an autonomous vehicle. For developing state-of-the-art techniques, more comprehensive details and testing with real-world parameters will be required in this domain in the near future.

5.1 Future directions

The approaches reviewed in this article have been validated through a number of experiments. However, the study suffers from a few limitations that are common to most related studies in the literature. Although the paper covers a diversity of methods and challenges, a fair comparison of methods would only be possible if the same datasets and image standards were used. The non-availability of standard datasets, especially for the pothole detection phase, limits the operational strategy. The study undertaken in this work has exposed a number of possibilities for future research, which are summarized as follows:

  1. a.

    Apart from the decision-making strategy based on the front camera mounted on top of the vehicle, future studies may equip the vehicle with a rear camera or a 360° rotating camera to make decisions about situations such as blind spots.

  2. b.

    Future developments may focus on unseen data covering a wider range of road objects and consider the fusion of supervised and unsupervised schemes to increase decision-making capability and thus reduce the chances of road accidents.

  3. c.

    The design and performance of road scene object detection using deep learning schemes can be further investigated through hyperparameter optimization and may involve cognitive-based approaches.

  4. d.

    The study and detection of speed-bump-type road objects, along with other kinds of bump and hump objects, is also identified as one of the future scopes of this work.

  5. e.

    This work is entirely focused on vision sensors due to their operational simplicity. However, the fusion compatibility of vision sensors with other sensor types may be investigated in upcoming works.