1 INTRODUCTION

Technological developments, particularly the remarkable improvements in computing technologies, artificial intelligence, and hardware capabilities, have broadened the scope of what can be created or enhanced. As a result, an application that was once thought to be a dream, autonomous driving, now appears to be on the verge of becoming a reality. The ability to travel quickly, safely, and comfortably in driverless cars under various situations and weather conditions has become a major research subject in recent years, owing to anticipated benefits such as ease of use, fewer accidents and traffic jams, and more efficient energy management [1].

In general, each autonomous driving system comprises three main stages (Fig. 1). The first is the perception and localization stage, which derives a description of the driving situation so that the autonomous vehicle understands its surrounding environment. In this stage, data from different sensors and digital maps are combined and processed, and the driving situation is then interpreted. The second and third stages are responsible for decision making. The second stage is motion planning, which decides on a reference trajectory that represents a driving behavior and sends it to the trajectory controller. The third stage is the trajectory controller, which keeps the vehicle as close as possible to the reference trajectory chosen by the previous stage while taking into account comfort, safety, and efficiency criteria. The trajectory controller computes acceleration and steering commands and sends them to the actuators [2].

Fig. 1. Autonomous driving system general architecture (adapted from [2]).

This paper provides a survey of advanced methods for perception and localization in autonomous driving, covering:

• Sensors used for autonomous driving.

• Current trends in the research of perception and related methodologies.

• State-of-the-art methods used for localization.

In Section 2, we list the sensors and other resources used in autonomous driving research. In Section 3, perception for autonomous driving is discussed in detail, with a focus on object detection and semantic segmentation. In Section 4, we discuss localization and current research on simultaneous localization and mapping. Finally, we conclude in Section 5.

2 SENSORS AND OTHER RESOURCES USED FOR AUTONOMOUS DRIVING

A summary of the various types of sensor that are used in modern autonomous driving research is shown in Table 1.

Table 1. A summary of sensor usage in autonomous driving research

Each type of sensor is used in different circumstances, or in combination with others, to execute a task as efficiently as possible. Cameras are critical in the context of autonomous driving. One or more cameras are mounted on the ego vehicle to capture images of the surrounding environment, which are then forwarded to the perception system responsible for situation interpretation. Researchers therefore employ a variety of cameras to ensure that their self-driving cars can visually comprehend their surroundings. A stereo camera has two or more lenses, each with its own image sensor, and is used to capture 3D images of the surrounding area. Because 3D recognition is critical for autonomous driving, a growing number of researchers are working to make the ego vehicle perceive its environment in 3D using this technology.

However, cameras struggle to capture high-quality images at night and in adverse weather such as rain, fog, and snow. Radar can ensure that the surrounding environment is monitored at all times. Radars emit pulses of radio waves and measure the positions and speeds of nearby objects from the reflections. LiDARs transmit laser pulses rather than radio pulses: by firing invisible laser beams and measuring the reflected signals, a LiDAR builds a detailed 3D map of the vehicle’s surroundings in the form of “point clouds”. In the dark, LiDARs perform significantly better than cameras. Ultrasonic sensors, in turn, emit ultrasonic pulses that are reflected by the surrounding objects; the echoes are then received and analyzed. They are mostly used for tasks that need very short-range perception, such as parking.

An inertial measurement unit (IMU) is an electronic device that uses a combination of accelerometers, gyroscopes, and magnetometers to measure a vehicle’s specific force, angular rate, and orientation. Recent advancements have made it possible to manufacture IMU-enabled GPS devices that can be used in autonomous cars, in combination with other sources (such as wheel odometry, LiDAR, and cameras), to measure the ego vehicle’s position and estimate its velocity.

The main usage conditions of various types of sensor are summarized in Table 2.

Table 2. Main usage conditions of the various sensor types in autonomous driving

3 PERCEPTION IN AUTONOMOUS DRIVING

The accuracy and efficiency of nearly all perception tasks have increased as a result of recent research breakthroughs, particularly those linked to machine learning.

3.1 Detection, Classification, and Recognition

Object detection is critical for autonomous driving in general. Deep learning technologies [17] such as convolutional neural networks (CNNs) are increasingly used in modern research to produce state-of-the-art results [18, 19]. CNNs are increasingly deployed in object detection applications in combination with other deep learning techniques such as autoencoders, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. 3D object detection is particularly important for autonomous driving perception. 3D scene information can be acquired via stereo imaging. A CNN is supplied with context and depth information and, using the acquired high-quality 3D object proposals, produces the 3D bounding box coordinates and poses of the detected objects. The input of the CNN consists of a variety of proposal cues, including object size priors, object location on the ground plane, and several depth features that reflect free space, point cloud density, and distance to the ground [20]. Hybrid deep learning architectures are also increasingly used for object detection in autonomous driving. A hybrid CNN-RNN architecture can be used to create a model that understands the 3D traffic scene as a whole, rather than just certain elements of it, and makes driving strategy recommendations based on multimodal data [21]. Here, a CNN is employed together with an LSTM (Fig. 2). In terms of general object detection, feature fusion, and low-level semantic understanding, the proposed model is said to outperform the current state-of-the-art.

Fig. 2. The overall system architecture in [21].
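
To make the 3D detection stage described above more concrete, the following is a minimal PyTorch sketch of a proposal-scoring and 3D-box-regression head in the spirit of [20]: pooled appearance and depth features for each 3D proposal pass through shared fully connected layers that output class scores and refined box parameters (x, y, z, w, h, l, yaw). All layer names and sizes are illustrative assumptions, not the published implementation.

```python
# Hypothetical proposal-scoring / 3D-box-regression head (illustrative sizes only).
import torch
import torch.nn as nn

class Proposal3DHead(nn.Module):
    def __init__(self, feat_dim=512, num_classes=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 256), nn.ReLU(inplace=True),
        )
        self.cls_head = nn.Linear(256, num_classes)   # object class scores per proposal
        self.box_head = nn.Linear(256, 7)             # (x, y, z, w, h, l, yaw) refinements

    def forward(self, proposal_feats):                # (N_proposals, feat_dim)
        h = self.shared(proposal_feats)
        return self.cls_head(h), self.box_head(h)

# Example: score 100 hypothetical proposals whose fused image/depth features
# have already been pooled into a 512-d vector each.
scores, boxes = Proposal3DHead()(torch.randn(100, 512))
print(scores.shape, boxes.shape)   # torch.Size([100, 4]) torch.Size([100, 7])
```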

It can be difficult to interpret and comprehend the collected data in particular traffic scenarios. Some researchers have developed a framework to address these traffic-related issues [22]. The framework can detect and distinguish objects in a driving scene, as well as anticipate the intentions of pedestrians in that scene. They employed YOLOv4 [23] to recognize ten different object classes and used Part Affinity Fields [24] to determine the pedestrians' poses. They leverage Explainable Artificial Intelligence (XAI) technology [25] to explain and support the outcomes of the risk assessment task. Finally, they built an end-to-end system that allowed them to combine multiple models with the greatest precision. [26] introduced YOLOv4-5D, a deep learning-based one-stage object detector (no region proposal stage required) geared toward autonomous driving. Their model is a modified version of YOLOv4. Its accuracy is comparable to that of highly accurate two-stage models [27, 28], while it is also able to run in real-time. The backbone and feature fusion stages of YOLOv4 are modified in YOLOv4-5D to meet precision and real-time demands, and thanks to these changes the technique runs accurately and in real-time even at higher image resolutions. For instance, the feature fusion module of the new model handles five detection scales, and the proposed network pruning method based on a sparse scaling factor removes redundant convolutional channels, significantly reducing model complexity. To determine the location and orientation of objects, costly components such as 3D LiDARs and stereo vision are frequently employed. Recent works, such as [29], proposed a less expensive method for object recognition that relies only on images from a single RGB camera. The depth information acquired from the contact points between the respective objects and the ground plane is used to estimate the 3D positions of obstacles on the road, and these calculated positions are leveraged by deep neural network approaches for monocular 3D object detection and monocular depth prediction.

CNNs usually struggle when there is a large variation in object scale. To overcome this problem, some studies suggest adding deconvolutional and fusion layers to retrieve context and depth information. Furthermore, some difficulties, such as object occlusion and poor lighting, can be overcome by employing non-maximum suppression (NMS), which can be applied across object proposals at several feature scales [30]. In order to conduct both single-class and multi-class classification on imbalanced categories, researchers recommended employing a multi-level cost function (Fig. 3). A deep data integration technique can also be utilized for hard or rare samples [31]. Atrous spatial pyramid pooling (ASPP) [32, 33] was used by some researchers to address the poor performance of object detection systems when there are small objects, low illumination, or blurred outlines in the received camera images. This approach extracts features at multiple scales so that object detection accuracy is higher with the same number of parameters [34]. CNNs can further be used for a variety of other applications, including pedestrian movement detection with recognition of the pedestrian's movement direction [35].

Fig. 3. Proposed network architecture in [31].
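
As a concrete reference for the NMS step mentioned above, the following is a standard greedy non-maximum suppression routine in plain NumPy (a generic sketch, not tied to any particular paper): the highest-scoring box is kept, and any remaining box whose IoU with it exceeds a threshold is discarded.

```python
# Greedy non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2).
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                     # [0, 2]: the near-duplicate box is suppressed
```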

Lane detection and recognition, as well as the detection and recognition of road surface markings, are essential tasks for autonomous driving. They should be performed in real-time while taking into account the various angles and sizes of the detected objects. Unsupervised learning techniques based on a live feed from a camera mounted on the dashboard of a moving car have been proposed to satisfy these constraints. A spatio-temporal incremental clustering algorithm is used together with curve fitting to detect lanes and road surface markings at the same time. This method of learning is appropriate for a variety of object types and scales [36]. Others [19] created a hybrid approach that integrates CNN and RNN principles and is said to outperform previous approaches in lane detection, particularly in difficult scenarios.

As illustrated in Fig. 4, the proposed technique extracts information from multiple image frames (instead of using a single frame, as most previous research did) using a fully convolutional network that produces low-dimensional feature maps, then feeds the derived features from the successive frames to an RNN (an LSTM model is used), and finally passes the resulting features to another CNN used for lane prediction. This way, if a frame cannot provide sufficient information for lanes to be accurately detected, for example owing to heavy shadows or severe road mark degradation, the information gathered from prior frames may be used to estimate the lane information. Deep learning was also applied to lane detection and fitting in more recent works such as [37]. However, as long as learning relies on input data, the data should be diverse enough to cover target objects of a variety of sizes and shapes.

Fig. 4. Architecture of the proposed network in [19].
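
The encoder-LSTM-decoder pattern described above can be sketched as follows. This is a hypothetical, heavily simplified PyTorch illustration of the idea (a small convolutional encoder per frame, an LSTM over the flattened frame features, and a small decoder that produces a coarse lane map), not the configuration published in [19]; all layer sizes are made up.

```python
# Toy sequence model: per-frame CNN encoder -> LSTM over frames -> coarse lane map.
import torch
import torch.nn as nn

class LaneSeqNet(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(                      # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((8, 16)),
        )
        self.lstm = nn.LSTM(32 * 8 * 16, hidden, batch_first=True)
        self.decoder = nn.Sequential(                      # coarse per-pixel lane map
            nn.Linear(hidden, 32 * 8 * 16), nn.ReLU(inplace=True),
            nn.Unflatten(1, (32, 8, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, frames):                             # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))         # encode every frame
        seq = feats.flatten(1).view(b, t, -1)
        out, _ = self.lstm(seq)                            # aggregate over the T frames
        return self.decoder(out[:, -1])                    # predict from the last time step

lanes = LaneSeqNet()(torch.randn(2, 5, 3, 128, 256))       # 5 consecutive frames
print(lanes.shape)                                         # torch.Size([2, 1, 32, 64])
```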

Other detection issues, such as curb detection, require further investigation. Curb detection is a challenging task, especially at intersections where the curbs become fragmented, and the computational cost of the detection is significant, making it difficult to perform in real-time. A technique for detecting curb areas has been proposed in [38]. The point cloud data that describes the driving scene is initially collected using a 3D LiDAR sensor. The point cloud data is then analyzed to distinguish between on-road and off-road regions. Following that, a sliding beam [39] method is applied to choose the road area using off-road data. Finally, a curb detection algorithm is used to pick curb locations based on the chosen road area. Instead of performing online curb detection by taking advantage of video cameras or 3D LiDARs that may yield erroneous information in poor lighting or bad weather, [40] developed an offline curb detection approach that employs aerial images to fulfil these tasks. Their approach is built on cutting-edge learning techniques including deep learning and imitation learning.

The ability to interpret a situation can be improved by combining vision and LiDAR data. To identify the drivable area more accurately, several researchers proposed combining data fusion with an optimal selection approach [4].

Then, depending on the automated classification of the optimal drivable area, lane detection is performed selectively (Fig. 5). The suggested system is stated to be effective and dependable in both structured and unstructured traffic situations, with no human switching required. A higher level of data fusion, among vision, LiDAR, and radar data, has also been suggested to conduct perception in all-weather situations, including difficult illumination conditions such as dim light or total darkness. The fused data is uniformly aligned and projected onto the image plane, and the output is fed into a probabilistic motion planning model (Fig. 6). The proposed approach is said to operate well in high-traffic areas and in adverse weather conditions [41]. In [42] the Multi-Sensor Fusion Perception technique (MSFP) is proposed. MSFP is based on AVOD [43], a 3D object proposal system that fuses multimodal information derived from the aggregation of RGB and point cloud data. The MSFP architecture first analyzes camera and LiDAR data to produce the respective features. The features are then fed into a Region Proposal Network (RPN), which creates 3D region proposals.

Fig. 5. System architecture in [4].

Fig. 6. The architecture of the probabilistic motion planning network (PMP-net) in [41].
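
The alignment-and-projection step mentioned above boils down to transforming LiDAR points into the camera frame and applying the pinhole projection. The NumPy sketch below illustrates this generic camera-LiDAR alignment under assumed, made-up calibration values; it is not the calibration or pipeline of [41].

```python
# Project 3D LiDAR points onto the image plane (hypothetical calibration values).
import numpy as np

K = np.array([[720.0,   0.0, 640.0],           # intrinsics: fx, fy and principal point
              [  0.0, 720.0, 360.0],
              [  0.0,   0.0,   1.0]])
T_cam_lidar = np.eye(4)                        # extrinsics: LiDAR frame -> camera frame
T_cam_lidar[:3, :3] = [[0, -1, 0],             # LiDAR x-forward / y-left / z-up
                       [0,  0, -1],            # to camera x-right / y-down / z-forward
                       [1,  0, 0]]
T_cam_lidar[:3, 3] = [0.0, -0.1, -0.2]         # small mounting offset (example values)

def project_to_image(points_lidar):
    """points_lidar: (N, 3) xyz in the LiDAR frame -> pixel coordinates and depths."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]     # keep only points in front of the camera
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]              # perspective division
    return uv, pts_cam[:, 2]

uv, depth = project_to_image(np.array([[10.0, 1.0, 0.5], [5.0, -2.0, 0.2]]))
print(uv, depth)                               # pixel positions and corresponding depths
```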

Although images from video streams contain a wealth of spatial and temporal information on object poses and scales, 3D multi-object tracking (MOT) in complicated driving settings with significant occlusions among numerous tracked objects is difficult. To address this issue, a variety of approaches have been developed. Some of these solutions [44, 45] assume that all frames (past, current, and future) are accessible and process all of them to generate object tracking information (global tracking). Those methods are accurate, but they do not meet real-time requirements. Other techniques [46, 47] incorporate information only from current and previous frames, with no knowledge of future frames (online tracking). Real-time tracking is possible using online tracking methods; however, further effort is required to provide very precise tracking in real-time. For 3D MOT, [48] presented a real-time online technique.

They employed deep learning to jointly perform object detection and tracking, taking into account the spatio-temporal characteristics of the driving scene, which results in more accurate MOT than traditional detection approaches that rely on bounding boxes and association algorithms [49, 50]. Awareness of the driving behavior is derived from learned object characteristics, adopting Deep Layer Aggregation (DLA) [51] as the network backbone and considering the 2D center point, depth, rotation, and translation in parallel.

Road sign recognition is a crucial element of the perception of autonomous driving systems. It enables the ego car to gain a comprehensive understanding of the driving environment and situations. Thanks to the rapid development of deep learning techniques for object detection, many recent studies have presented effective ways to deal with the difficulties of road sign recognition. A CNN design was developed by several researchers to detect speed limit road signs, which are among the most challenging objects to recognize in the US traffic sign set; they claim that their method improved the area under the precision-recall curve (AUC) for detecting speed limit signs by more than 5% [52]. Others applied advanced CNN techniques such as depthwise separable convolution (DSC) [53] to reduce the complexity of the recognition process and speed it up. They increased accuracy further by breaking the problem of traffic sign identification into two parts, traffic sign classification (TSC) and traffic sign detection (TSD), and proposing an efficient network for each (ENet for TSC and EmdNet for TSD) [54]. Others developed a framework for the detection, refinement, and classification of both symbol- and text-based road signs, using the Mask R-CNN architecture [55] for detection and refinement and a CNN for classification to obtain high accuracy [56]. For the classification of degraded traffic signs, a multi-scale CNN with dimensionality reduction has also been implemented [57]. Furthermore, some studies claim to achieve higher accuracy by focusing on certain regions of interest: every traffic image is preprocessed to obtain these regions, and a CNN is then applied to recognize the signs [58]. Many advanced road traffic sign recognition research ideas are listed in Table 3.
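
The depthwise separable convolution mentioned above factorizes a standard convolution into a per-channel spatial convolution followed by a 1x1 pointwise convolution, which sharply reduces parameters and computation. The following minimal PyTorch sketch (with illustrative channel sizes, unrelated to any specific paper) shows the construction and the parameter savings.

```python
# Depthwise separable convolution vs. a standard convolution (illustrative sizes).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def n_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(64, 128, 3, padding=1)
separable = DepthwiseSeparableConv(64, 128)
x = torch.randn(1, 64, 32, 32)
assert standard(x).shape == separable(x).shape   # same output shape
print(n_params(standard), n_params(separable))   # 73856 vs. 8960 parameters
```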

Table 3. Advanced road traffic sign recognition research ideas

3.2 Semantic and Motion Segmentation

In order to accomplish a scene understanding task, a self-driving car needs to know the segment label assigned to each point of the received sensor signal (e.g., road, car, building, or pedestrian). This problem is known as semantic segmentation. Many tasks that were formerly difficult or impossible have become achievable or even easy in the era of deep learning, and research in a variety of fields has flourished as a result. Semantic segmentation is one of the areas that had limited success until recently, when a great deal of research was carried out and significant results were obtained. The development of efficient point-wise classification is expected to assist several industries that need precise semantic segmentation, such as autonomous driving [67, 68].

CNNs and autoencoders [17] are the most common deep learning architectures used for semantic segmentation. Convolutional autoencoders are hybrid models that combine CNNs and autoencoders and achieve far higher accuracy and efficiency than earlier approaches. A convolutional autoencoder is an autoencoder with convolutional and deconvolutional layers as the encoder and decoder, respectively. In other words, CNN models serve as the “backbone” structures for the autoencoders that are employed. The majority of contemporary studies have employed some form of convolutional autoencoder architecture to construct their models. One of the first attempts in this direction was FCN [69]. It employed a fully convolutional autoencoder architecture and accomplished semantic segmentation with a huge number of parameters, making it unsuitable for real-time usage; it was also one of the first attempts to eliminate fully connected layers. SegNet [70] and SegNet-Basic [71] used the VGG architecture [72] as a backbone for the encoder and the decoder, and used the pooling indices of the encoder to upsample data in the decoder. Other designs, such as UNet [73], improved segmentation accuracy by employing skip connections between the encoder and decoder, as well as other techniques such as data augmentation. Although the above-mentioned models and other architectures such as PSPNet [74], Dilated [75], and DeepLab [76] increased the accuracy of semantic segmentation models, there was still a need for a less complex architecture for real-time semantic segmentation, because some fields of application (e.g., autonomous driving and robotics) require very accurate semantic segmentation with a minimum amount of processing time. This was not a simple task: capturing the complexity of the received images and point clouds requires a large number of trainable parameters, and developing lightweight segmentation models without reducing segmentation accuracy is challenging. Some models were designed with a smaller number of parameters. FPN [77], LinkNet [78], and other very lightweight models such as ApesNet [79], ENet [80], ESPNet [81], ESCNet [82], and EDANet [83] minimize the number of parameters so that semantic segmentation can be performed in real-time or implemented on embedded systems. Although these models provide practical solutions that satisfy the real-time condition, crucial applications such as road scene understanding in autonomous vehicles require much higher segmentation accuracy.
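
To illustrate the encoder-decoder pattern these models share, here is a deliberately tiny PyTorch sketch: a convolutional encoder that downsamples, a transposed-convolution decoder that upsamples, and a UNet-style skip connection feeding encoder features to the prediction head. It is a toy illustration with made-up layer sizes, orders of magnitude smaller than SegNet or UNet themselves.

```python
# Minimal convolutional encoder-decoder with one skip connection (toy example).
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
                                  nn.ReLU(inplace=True))
        self.head = nn.Conv2d(64, num_classes, 1)    # 64 = 32 decoder + 32 skip channels

    def forward(self, x):
        e1 = self.enc1(x)                            # full-resolution features
        e2 = self.enc2(e1)                           # half-resolution features
        d1 = self.dec1(e2)                           # upsample back to full resolution
        return self.head(torch.cat([d1, e1], dim=1)) # skip connection, then per-pixel logits

logits = TinySegNet()(torch.randn(1, 3, 128, 256))
print(logits.shape)                                  # torch.Size([1, 19, 128, 256])
```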

Road detection and segmentation is a very important task in autonomous driving. The live feed from the camera can be processed to extract valuable information. From the driving scene, a saliency image map [84] can be obtained to represent visual perception by extracting spatial, spectral, and temporal information from the live stream and then applying entropy-driven image-context-feature data fusion. The fusion output of the last step includes high-level descriptors for both stationary and non-stationary objects. After segmentation and object detection, road surface regions are selected by an adaptive maximum likelihood classifier [85]. To allow the ego vehicle to recognize the road area, some researchers employ supervised learning approaches. However, several recent studies recommend semi-supervised learning methods to reduce the huge quantity of labeled data required for training. Advanced deep learning approaches such as generative adversarial networks (GANs) [86] and conditional GANs [87] are said to produce excellent outcomes [88]. Neural networks may also be utilized to acquire static road information in real-time in dynamic situations. For this purpose, several studies suggest lightweight neural networks. These efficient multi-task networks can perform occlusion-free road segmentation, dense road height estimation, and road topology recognition all at the same time. For these networks to perform successfully, the loss function must be chosen carefully [89].

A semi-supervised learning approach using a CNN architecture can also be used for the semantic segmentation of 3D LiDAR data: a CNN model can be trained using both manually labeled supervised samples and pairwise constraints [8]. However, recent research tends to benefit from the significant improvement of deep learning technology. RFNet [90] is a fusion network designed for real-time RGB-D semantic segmentation, combining a CNN and an autoencoder. Two distinct branches are employed in the encoder component of the proposed model to extract features from the RGB and depth images independently, and the semantic segmentation is completed by a simple decoder with upsampling modules and skip connections. [91] developed a deep learning-based end-to-end model that applies a sensor fusion approach: visual information from the camera is fused with depth information from the LiDAR so that semantic segmentation can be completed effectively, allowing for accurate scene interpretation and vehicle control. FSFnet [92] is a deep neural network proposed as an accurate and efficient semantic segmentation model for autonomous driving. Its general architecture is made of convolutional and deconvolutional layers with the lightweight ResNet-18 [93] as a backbone (Fig. 7). The authors created the feature selective fusion module (FSFM) for multiscale and multilevel feature fusion, which collects valuable information from the respective feature maps both spatially and channel-wise, employing adaptive pooling layers to strike a balance between accuracy and efficiency. To build a comprehensive understanding of the driving scene while also considering objects of various scales and sizes, a context aggregation module (CAM) was introduced; CAM combines global and multiscale context information efficiently using the lightweight ANNN [94]. PointMoSeg [95] is another advanced segmentation model designed to distinguish moving obstacles in the driving scene. PointMoSeg is an end-to-end model in which a convolutional autoencoder inspired by ResUNet [96] is adopted to acquire point cloud data online from a 3D LiDAR and process it into a segmented current frame whose segments represent moving obstacles. The design of the proposed architecture benefits from modern deep learning technologies such as sparse tensors [97] and sparse convolutions [98], and introduces two novel modules, a temporal module and a spatial module, that are used to extract the features and context information of the moving obstacles efficiently and effectively. There is no need for a flat-road assumption or ego-motion estimation, since the suggested end-to-end architecture ensures that high performance generalizes to scenarios where the road has some slope or the driving environment is urban. Some other recent works, such as OmniDet [99], propose training the deep learning model jointly on a variety of perception tasks, including semantic and motion segmentation, object detection, and depth estimation; OmniDet is claimed to perform multi-task surround-view fisheye camera perception in real-time.

Fig. 7. The architecture of FSFnet [92].

Advanced semantic segmentation research ideas are listed in Table 4.

Table 4. Advanced semantic segmentation research ideas

3.3 Other Perception Related Research

Many aspects of perception have improved as a result of recently developed technologies. Deep learning algorithms, for example, can be used in data acquisition to overcome the problem of the high transmission bandwidth required for delivering high-quality image frames in autonomous driving applications. Using deep learning algorithms, only the most essential information bits are selected, and the bandwidth required to send this information is considerably lower than when all the data are transmitted. These deep learning algorithms are incorporated into the camera, allowing it to record only the task-critical data required for the autonomous car’s correct operation. Designing such a smart camera system that also works in real-time is difficult; to do this properly, several researchers [110] recommended using a 3D-stacked sensor architecture with a Digital Pixel Sensor (DPS) [111, 112]. Furthermore, the sensor data produced by an autonomous car can be extremely large, and compressing, storing, and transmitting LiDAR point cloud data in real-time is challenging; noise is also a major problem. Deep learning provides a solution to these data-volume issues. For point cloud data compression, a real-time deep learning technique based on the UNet architecture [73] has been suggested to reduce the temporal redundancy of the point cloud data stream. The LiDAR point cloud data is converted into a sequence of video frames; certain frames are then used as reference frames, and the other frames are interpolated using a UNet architecture, significantly reducing temporal redundancy. A padding technique is also employed to mitigate the harmful effects of noise [113]. ReViewNet [114] is another example of deep learning in action. ReViewNet is described as an efficient deep learning-based solution for dehazing driving environment images, which is a critical prerequisite for perception. The algorithm uses 12 channels representing a quadruple color space (RGB, HSV, YCrCb, and Lab). It also has a multi-look architecture that provides the ability to further control the dehazing process and to add more learning structures, such as multiple outputs (two MSE losses are used) and a spatial pooling block (to extract features in shallow layers without going deeper). This end-to-end architecture is reported to outperform other state-of-the-art works, even models that use advanced deep learning architectures such as GANs [115–117], allowing the ego car to remove haze from images so that they are more realistic and detailed.

An autonomous system’s sensors can be subjected to a variety of attacks. Because reliability is so important in autonomous driving, any autonomous vehicle design must be able to withstand many types of attack. Some researchers recommend, for example, a 3D data hiding approach based on 3D-QIM (quantization index modulation) [118–121] applied to the point cloud data collected by 3D LiDARs. Data integrity is guaranteed at the decision-making level by introducing a binary watermark into the point cloud data at the sensor level. Advanced methods can also aid in the analysis of some LiDAR parameters that were previously unavailable for analysis. Spot pattern, waist, and divergence, in particular, may be investigated using data obtained from spot reflections on flat surfaces, which is useful for analyzing the LiDAR’s performance [122].
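
As a rough illustration of how QIM-style data hiding works (a generic sketch, not the scheme of [118–121]): each watermark bit selects one of two dithered quantizers applied to a coordinate, and the decoder recovers the bit by checking which quantizer reconstructs the coordinate more closely. The step size below is an arbitrary assumption.

```python
# Generic quantization index modulation (QIM) embedding/extraction on coordinates.
import numpy as np

DELTA = 0.02                                   # quantization step (meters), example value
DITHER = np.array([-DELTA / 4, +DELTA / 4])    # one dither offset per bit value

def qim_embed(coords, bits):
    d = DITHER[bits]
    return DELTA * np.round((coords - d) / DELTA) + d

def qim_extract(coords):
    # Distance to each quantizer's nearest reconstruction point
    dist = [np.abs(coords - (DELTA * np.round((coords - d) / DELTA) + d)) for d in DITHER]
    return (dist[1] < dist[0]).astype(int)

rng = np.random.default_rng(0)
z = rng.uniform(-2.0, 2.0, size=8)             # e.g. z-coordinates of LiDAR points
bits = rng.integers(0, 2, size=8)              # binary watermark to hide
marked = qim_embed(z, bits)
assert np.array_equal(qim_extract(marked), bits)  # watermark recovered without error
print(np.max(np.abs(marked - z)))                 # embedding distortion stays below DELTA
```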

4 LOCALIZATION IN AUTONOMOUS DRIVING

Localization is the process of determining the precise location of an autonomous vehicle by employing systems such as GPS, dead reckoning, and road maps. The autonomous vehicle’s ability to perform well in the localization process is critical, since it keeps the vehicle moving along the correct planned path. In the vast majority of situations, centimeter-level precision is necessary. Localization is challenging, particularly in urban areas where the signal is constantly disrupted by nearby buildings and moving objects. Mapping is the process of producing accurate maps of driving regions that may be used for navigation. In most cases, conventional map resources are insufficient for safe operation; as a result, high-definition (HD) maps of the environment are necessary. The simultaneous localization and mapping (SLAM) principle has recently become more popular, and most practical localization methods are now derived from it. SLAM combines signals from various sensors to build a complete understanding of the vehicle’s location within the surrounding environment, and it is therefore more accurate than traditional localization methods that depend only on GPS and fixed anchors in the external environment [12, 123]. SLAM refers to the method of constructing and updating a map of the driving environment while also identifying the precise location of the autonomous vehicle inside that map. Conventional SLAM methods exploit on-board sensors to accomplish localization; visual SLAM and LiDAR SLAM are the most common approaches [124].

Traditional localization methods use a GNSS receiver in collaboration with other on-board sensors such as an IMU. The information received from those sources is then processed by methods such as the Kalman filter [125, 126], Hidden Markov Models [127, 128], and Bayesian networks [129]. In [9], an interacting multiple model (IMM)-filter-based information system [130, 131] is used for localization. GPS data is combined with data from other sources (wheel speed sensor, steering angle sensor, and yaw rate sensor), in line with current research trends. This integration is necessary because GPS data is influenced by a variety of factors, causing localization to drift. The use of the IMM filter allows the localization algorithm to adapt to changing driving circumstances, and a GPS-bias technique is also included to increase the accuracy and reliability of the localization process. In [11], a low-cost localization method is proposed. The approach is said to be more accurate than earlier efforts since it uses lane detection, dead reckoning, map matching, and data fusion techniques. A lane marking detector with a short range is used to detect the lanes, and the precise vehicle location is established by building a buffer called the Back Lane Markings Registry (BLMR) online and integrating this registry with information received from gyroscopes, odometry, and a prior road map.
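
To make the filtering idea above concrete, the following is a minimal linear Kalman filter in NumPy that fuses noisy GPS position fixes with a constant-velocity motion model standing in for dead reckoning. The noise levels and the simulated trajectory are illustrative assumptions, not taken from the cited works.

```python
# Linear Kalman filter fusing GPS fixes with a constant-velocity model.
import numpy as np

dt = 0.1
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], float)          # constant-velocity state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)           # GPS observes position only
Q = np.diag([0.01, 0.01, 0.1, 0.1])           # process noise (model uncertainty)
R = np.diag([4.0, 4.0])                       # GPS noise, roughly 2 m standard deviation

x = np.zeros(4)                               # state estimate [x, y, vx, vy]
P = np.eye(4) * 10.0                          # initial covariance

def kf_step(x, P, z_gps):
    # Predict with the motion model
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Correct with the GPS measurement
    y = z_gps - H @ x_pred                    # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new

for t in range(50):                           # simulate a car moving at 10 m/s along x
    true_pos = np.array([10.0 * dt * t, 0.0])
    z = true_pos + np.random.normal(0, 2.0, size=2)
    x, P = kf_step(x, P, z)
print(x[:2], x[2:])                           # filtered position and velocity estimate
```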

As previously noted, buildings and other structures can create severe perturbations in the signals required for localization, affecting conventional localization systems such as GPS and dead reckoning. In general, these systems are very accurate; however, owing to the noise of the dead reckoning sensors and the inaccuracy of the dead reckoning integration, reliable localization cannot be accomplished in metropolitan areas during long periods of GPS outage. To address these issues, some researchers proposed a precise Monte-Carlo localization technique based on a precise digital map and the fusion of perceptual data from motion sensors, a low-cost GPS, and cameras [10]. Particle filtering [132] is used to combine the probabilistic distributions of the various sensors (as these distributions are non-Gaussian). A method for modelling sensor noise is also provided in order to improve localization performance.
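
A bare-bones Monte-Carlo localization loop looks roughly as follows. It is a generic NumPy sketch (propagate particles through a noisy motion model, weight them by the likelihood of a measurement, resample), with made-up noise levels and a simple position-fix measurement standing in for the map- and camera-based likelihoods used in [10].

```python
# Generic particle filter (Monte-Carlo localization) sketch.
import numpy as np

rng = np.random.default_rng(1)
N = 500
particles = rng.uniform([-5, -5], [5, 5], size=(N, 2))   # initial (x, y) hypotheses
weights = np.full(N, 1.0 / N)

def step(particles, weights, control, measurement):
    # 1) Motion update: apply the odometry increment plus noise to every particle
    particles = particles + control + rng.normal(0, 0.2, size=particles.shape)
    # 2) Measurement update: weight by the likelihood of the observed position fix
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    weights = weights * np.exp(-d2 / (2 * 1.0 ** 2))
    weights = weights / np.sum(weights)
    # 3) Resample particles in proportion to their weights
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

true_pose = np.array([0.0, 0.0])
for t in range(30):
    control = np.array([0.5, 0.0])                        # ego vehicle moves 0.5 m forward
    true_pose = true_pose + control
    z = true_pose + rng.normal(0, 1.0, size=2)            # noisy position measurement
    particles, weights = step(particles, weights, control, z)
print(np.mean(particles, axis=0), true_pose)              # particle mean should be near the true pose
```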

Although numerous visual SLAM algorithms have achieved accurate and robust localization [133, 134], real-time constraints still need to be investigated. [135] demonstrated an effective visual SLAM capability that eliminates the problem of frame loss associated with real-time performance. A four-thread architecture was implemented. A front-end thread estimates the pose of the camera in real-time. The mapping thread then analyzes additional keyframes in order to construct 3D map points while keeping drift to a minimum. The keyframe poses and 3D map point placements are refined by a state optimization thread, and an online bag-of-words-based loop closure thread [136, 138] is used specifically to allow long-term localization in large-scale maps. [139] extended ORB-SLAM 2, presented in [134], with a map-saving mechanism, reducing the localization error. [140] proposed an innovative end-to-end framework that takes input from a GPS receiver, OpenStreetMap (OSM) datasets [141], an IMU, and a front camera, and then takes full advantage of probabilistic modeling (specifically a Hidden Markov Model (HMM) and a Bayesian Network (BN)) to conduct all levels of localization (road-level, lane-level, and ego-lane-level) (Fig. 8).

Fig. 8. The localization system architecture proposed in [140].

Many approaches have been proposed for LiDAR SLAM. For example, LOL [142] and HDL-Graph-SLAM [143] use the Iterative Closest Point (ICP) algorithm [144], which minimizes the Euclidean distance between matched point pairs until the transformation matrix converges. The computational cost of these approaches is significant, and they are sensitive to environmental factors. Other models, such as LOAM [145] and LeGO-LOAM [146], use extracted geometric features to achieve convergence.
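
The core ICP loop referenced above can be sketched in a few lines: alternately match each source point to its nearest target point, then compute the rigid transform that minimizes the Euclidean distance between the matched pairs (closed form via SVD). The 2D NumPy toy below is for illustration only; production LiDAR pipelines add subsampling, outlier rejection, and fast nearest-neighbour structures.

```python
# Toy 2D point-to-point ICP with closed-form (SVD) rigid alignment.
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dst_c - R @ src_c

def icp(src, dst, iters=30):
    R_total, t_total = np.eye(2), np.zeros(2)
    cur = src.copy()
    for _ in range(iters):
        # Nearest-neighbour association (brute force for clarity)
        d = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=2)
        matched = dst[np.argmin(d, axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Toy example: recover a known small rotation + translation between two "scans"
rng = np.random.default_rng(0)
theta = np.deg2rad(5.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
scan = rng.random((200, 2)) * 10
moved = scan @ R_true.T + np.array([0.3, -0.2])
R_est, t_est = icp(scan, moved)
print(R_est, t_est)          # should approximately recover R_true and the applied translation
```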

Other conditions, such as snowy weather and wet road surfaces, reduce the quality of the received signals and lower localization accuracy. For example, when LiDAR sensors are used, reflectivity drops when the road surface is wet, and snow lines may deform the expected road context in LiDAR images. These issues can be overcome by employing advanced techniques such as principal component analysis (PCA) [17] to gain a better understanding of the surrounding environment, and then using the leading PCA components to support an accumulation approach aimed at increasing the density of the LiDAR data. In addition, for snow situations, an edge matching technique that matches previously recorded map images with online LiDAR images was developed to ensure correct localization [5].

An occupancy grid is an important way to represent the driving environment, where each cell of the grid stores information characterizing the objects inside it [147]. The ground reflectivity inference grid is a variant of the occupancy grid used in autonomous driving, in which the ground's reflection is linked with a position in the grid proportional to the projection of its range measurement into the grid's reference frame [148]. The authors of [149] recommended adopting a novel grid representation that incorporates reflectivity edges generated by combining distinct reflectivity gradient grids from fixed laser views. This representation is claimed to provide laser-perspective and vehicle-motion invariance, eliminating the need for further post-factory laser reflectivity calibration and facilitating many processes, including localization.
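
For reference, the basic occupancy-grid bookkeeping works as in the following NumPy sketch: each cell holds a log-odds value that is increased when a range return ends in the cell and decreased for cells the beam passes through. The grid size, resolution, and increments are arbitrary illustrative choices.

```python
# Minimal log-odds occupancy grid update for a single range return.
import numpy as np

RES = 0.2                                    # meters per cell
grid = np.zeros((200, 200))                  # log-odds, 40 m x 40 m around the vehicle
L_OCC, L_FREE = 0.85, -0.4                   # log-odds increments for hit / pass-through

def to_cell(p):
    return (int(p[0] / RES) + 100, int(p[1] / RES) + 100)   # vehicle at the grid center

def update(sensor_origin, hit_point, n_steps=50):
    # Mark cells along the beam as free, and the end cell as occupied
    for s in np.linspace(0.0, 1.0, n_steps, endpoint=False):
        i, j = to_cell(sensor_origin + s * (hit_point - sensor_origin))
        grid[i, j] += L_FREE
    i, j = to_cell(hit_point)
    grid[i, j] += L_OCC

origin = np.array([0.0, 0.0])
update(origin, np.array([6.0, 2.0]))          # a single return 6 m ahead, 2 m to the left
prob = 1.0 / (1.0 + np.exp(-grid))            # convert log-odds back to probabilities
print(prob[to_cell([6.0, 2.0])], prob[to_cell([3.0, 1.0])])   # occupied vs. freed cell
```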

Particle filtering was also used in [150], where it was shown that accurate localization in large-scale environments can be achieved with simple inputs, including lightweight satellite and road maps from the internet, car odometry, and a 3D LiDAR. [151] created a method to deal with the challenging 3D localization of LiDAR point cloud data, using data accumulation in their solution. Owing to the high complexity of the problem, processing the 3D point cloud data of successive frames may require a large amount of computation. By employing data accumulation, they maintained shared data structures across those frames, which saved a considerable amount of computation at each time step. Information about recently occupied voxels is collected by filtering the stream of 3D point cloud data through a proposed Dynamic Voxel Grid (DVG). The first phase's output is then sent to an incremental region growing segmentation module [152], which combines it with the already clustered data (Fig. 9).

Fig. 9. System architecture in [151].

Even though many research works have taken advantage of intensity measurements to extract intensity features and used them for accurate SLAM [153, 154], most of these proposals require a huge data collection effort, are not consistent when the surrounding environment changes, or are applicable only to 2D localization. [124] combined intensity features with geometric features in order to perform LiDAR SLAM accurately and reliably. Intensity-based loop closure detection (front-end odometry estimation) and factor graph optimization (back-end optimization) are used together to localize the autonomous car (Fig. 10). By using both geometric and intensity features, the performance of the SLAM system is claimed to improve.

Fig. 10. System architecture in [124].

Occupancy grid map (OGM) based 2D LiDAR SLAM methods [155, 156] have been implemented and perform well in indoor environments; however, they do not generalize well to outdoor environments. In addition, localization methods that employ 3D LiDAR data in the form of point clouds [5], voxels [151], or extracted features [124] are very computationally expensive, and in many scenarios they cannot be implemented for real-time LiDAR SLAM. Therefore, many researchers [157, 158] suggest using 2.5D heightmaps for SLAM, claiming improved performance over conventional 2D and 3D methods and relief from the problems of local maxima, drift, and complexity.
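
The 2.5D heightmap representation mentioned above can be built very simply: flatten the 3D point cloud onto a horizontal grid and keep one height statistic per cell. The NumPy sketch below keeps the maximum height per cell; the resolution, extent, and synthetic point cloud are illustrative assumptions, not taken from [157, 158].

```python
# Build a simple 2.5D heightmap (max height per cell) from a point cloud.
import numpy as np

RES, HALF = 0.5, 50.0                                    # 0.5 m cells over a 100 m x 100 m area
SIZE = int(2 * HALF / RES)

def heightmap(points):
    """points: (N, 3) xyz in the vehicle frame -> (SIZE, SIZE) array of max heights."""
    hm = np.full((SIZE, SIZE), -np.inf)
    ij = ((points[:, :2] + HALF) / RES).astype(int)
    inside = np.all((ij >= 0) & (ij < SIZE), axis=1)
    for (i, j), z in zip(ij[inside], points[inside, 2]):
        hm[i, j] = max(hm[i, j], z)                      # keep the highest return per cell
    return hm

rng = np.random.default_rng(2)
cloud = np.column_stack([rng.uniform(-40, 40, 10000),    # synthetic LiDAR-like points
                         rng.uniform(-40, 40, 10000),
                         rng.uniform(0.0, 3.0, 10000)])
hm = heightmap(cloud)
print(hm.shape, hm[np.isfinite(hm)].max())               # grid size and tallest observed cell
```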

Precise SLAM is required for autonomous driving. However, SLAM remains a difficult problem for a variety of reasons. For example, SLAM (like other localization methods) tends to drift over long driving distances, so its accuracy degrades as the driving distance increases: when an autonomous vehicle is given a true trajectory to follow, a common SLAM approach causes the followed trajectory to deviate from the true one over time. Another issue is that, although maps are necessary for SLAM, they are not always feasible under all driving situations; very accurate maps covering a variety of weather conditions are needed to support the localization process, and such maps are not always available.

5 CONCLUSIONS

Autonomous driving is a rapidly evolving area with a plethora of anticipated advantages. For self-driving cars to make the correct decision at the appropriate moment, they need accurate and fast perception and localization. A well-chosen sensor type and arrangement assists the perception and localization system in generating a complete picture of the surrounding region, allowing for a sound comprehension of the driving scenario. Thanks to recently developed technologies such as modern deep learning methods, the autonomous driving field has made outstanding progress in all aspects, including object detection, semantic segmentation, and simultaneous localization and mapping. These accomplishments have enabled the ego vehicle to attain very high accuracy in perception and localization tasks while also addressing real-time concerns. However, inclement weather and congested metropolitan areas can cause even state-of-the-art systems to fail. As a result, further research is needed to improve the performance of perception and localization systems in all driving environments and weather conditions.