1 Introduction

Robots navigating an unfamiliar environment can localize themselves and build a map of their surroundings using simultaneous localization and mapping (SLAM). Early SLAM systems relied heavily on sonar and LiDAR sensors, which offer high accuracy but are heavy, expensive, and fragile. Visual sensors have since emerged as an alternative because they are small, inexpensive, and simple to use. Visual simultaneous localization and mapping (VSLAM) uses visual sensors that, much like human eyes, perceive the surrounding environment and gather rich environmental information to support localization and navigation in challenging real-world settings. VSLAM technology is important for applications such as military rovers, drones, unmanned vessels, intelligent robotics, autonomous cars, augmented reality (AR), and virtual reality (VR) [1]. Furthermore, recent advances in AR and VR technology interweave the physical world with virtual cyberspace. Beyond keeping overlaid virtual items geometrically coherent with the physical world, the 3D map reconstructed by VSLAM can capture geometric details of the scene, enhancing the realism of the virtual environment. Demand for VSLAM technology continues to grow, driving the emergence of new techniques and technologies and making it a popular research topic.

Over the past few decades, visual odometry (VO), one of the most important approaches for pose estimation and robot localization, has attracted great interest from both the computer vision and robotics communities [2]. It has been widely used to complement GPS, inertial navigation systems (INS), wheel odometry, and other systems on a variety of robots. Wireless networks are often valued only for their communication capabilities, overlooking their inherent benefits for localization and sensing. With its enormous antenna arrays, high carrier frequencies, and large bandwidth, the 5G NR access interface offers excellent prospects for precise localization and sensing. Furthermore, 6G systems will continue the trend of operating at ever higher frequencies, such as the millimeter-wave (mmWave) and THz bands, and with even larger bandwidths. The THz frequency range presents excellent prospects for frequency spectroscopy, high-definition imaging, and precise localization. The authors of [3] summarise wireless communications and the intended uses of 6G networks operating above 100 GHz, and then discuss the potential of mmWave- and THz-enabled localization and sensing solutions. Similarly, [4] discusses the paths the cellular industry may take to develop future 6G systems.

With the emergence of 6G frameworks, the line between communication and localization is becoming increasingly blurred, necessitating seamless integration solutions. This convergence improves usefulness and efficiency, enabling applications ranging from augmented reality to driverless vehicles. Although very effective, current localization technologies such as LiDAR are constrained by a number of issues, including high operating costs, large physical bulk, and comparatively short operational lifetimes. These disadvantages limit their scalability and widespread adoption, especially in consumer-grade applications. Because of their affordability, portability, and adaptability, visual sensors offer a viable substitute. Advances in cameras and image processing have greatly increased their potential, making them well suited to integration into mobile and ubiquitous computing environments. This research addresses the following questions:

  • How can the integration of deep learning with visual odometry be optimized to take full advantage of 6G capabilities?

  • What are the specific advantages and challenges of using visual sensors over traditional localization technologies like LiDAR in the context of 6G?

  • Can an end-to-end deep learning model effectively replace traditional multi-stage VO systems without compromising accuracy and reliability?

Deep learning (DL) has shown encouraging results and now dominates numerous computer vision tasks. Unfortunately, this has not yet happened for the VO problem, and little DL work addresses VO or the related 3D geometry issues. This is likely because most existing trained models and DL architectures are built primarily for recognition and classification tasks, which drives deep convolutional neural networks (CNNs) toward extracting high-level appearance information from images. A reliance on appearance representations severely impedes VO's generalisation and restricts its use to controlled situations; for this reason, VO algorithms rely mostly on geometric properties rather than visual ones. Moreover, rather than processing a single image, a VO algorithm should ideally describe motion dynamics by examining the changes and relationships across a sequence of images. This implies that sequential learning is required, which CNNs alone cannot provide. To satisfy these needs, this paper contributes the following:

  • This paper proposes a novel end-to-end monocular VO approach using a convolutional LSTM (C-LSTM) in 6G wireless communication systems.

  • This study uses DL approaches to solve the monocular VO problem in an end-to-end manner (directly estimating poses from RGB images).

  • The captured video sequence of RGB images is preprocessed to remove noise. The geometric features of the RGB images are then mapped and extracted using global channel attention (G-CA) and CNN methodologies.

  • Long short-term memory (LSTM) networks intuitively capture and automatically learn the sequential dependencies and complicated motion dynamics of an image sequence, which are important to VO but cannot be explicitly or simply modelled by humans.

  • The developed model is evaluated on the KITTI dataset, and its efficiency is discussed.

The remainder of the paper is structured as follows: Sect. 2 reviews related work. Section 3 describes the end-to-end monocular-VO method with preprocessing, feature mapping and extraction, and sequence modelling. Section 4 presents the experimental findings. Section 5 concludes.

2 Related Work

The authors of [5] provide a comprehensive overview of the various SLAM technologies implemented for AV perception and localisation. They also review various VSLAM schemes, their strengths and weaknesses, the challenges of deploying VSLAM, and future research directions. The authors of [6] highlight important technological enablers for convergent 6G communication, localisation, and sensing systems, review their underlying difficulties and implementation concerns, and suggest possible solutions. They also review the new prospects for integrated localisation and sensing applications, which will upend conventional design ideas and fundamentally alter how we live, interact with our surroundings, and conduct business. In terms of enabling technologies, 6G will advance toward even higher frequency ranges, broader bandwidths, and massive antenna arrays. This will allow sensing systems with high Doppler, angle, and range resolutions, as well as localisation accurate down to the centimetre level.

The authors of [7] examine the potential uses and applications of localization in upcoming 6G wireless systems and explore the effects of the key technological enablers. System models considering line-of-sight (LOS) and non-LOS channels are presented for millimetre-wave, terahertz, and visible-light positioning, and mathematical definitions and a review of the key localization performance indicators are provided. A thorough analysis of state-of-the-art conventional and learning-based localisation approaches is also carried out. In addition, the wireless system design is taken into account, the localisation problem is stated, and its optimisation is examined. The authors of [1] thoroughly analyse deep learning-based VSLAM techniques. They describe the basic ideas and framework of VSLAM and briefly review its development. They then concentrate on the three parts of deep learning and VSLAM integration: mapping, loop closure detection, and visual odometry (VO), providing a detailed summary and analysis of each algorithm's strengths and weaknesses. Furthermore, they give an overview of commonly used datasets and assessment metrics and, lastly, review the open issues and potential paths for merging deep learning and VSLAM.

The authors of [8] first provide a detailed overview of research on visual SLAM, divided into three categories: static SLAM, dynamic SLAM, and deep learning-enhanced SLAM. To organise the fundamental technologies for using 5G ultra-dense networks to offload complex computing tasks from visual SLAM systems to edge computing servers, they then introduce a comparison between mobile edge computing (MEC) and mobile cloud computing, along with sections on 5G ultra-dense networking (UDN) technology and MEC-UDN integration technology. The authors of [9] present OTE-SLAM, an object-tracking enhanced visual SLAM system that follows both the movements of dynamic objects and the motion of the camera. Moreover, they jointly optimise the 3D position of the object and the camera pose, allowing object tracking and visual SLAM to benefit from each other. Experimental findings show that the suggested method enhances the SLAM system's accuracy in difficult dynamic situations.

The Extended Kalman Filter (EKF) is a valuable tool, especially when tackling nonlinear systems, as it linearizes them around the current estimate [10, 11]. Multisensor integrated navigation refers to the fusion of data from multiple sensors to determine the position, orientation, or trajectory of a vehicle or device [12, 13]. This process often involves specific metrics or measures to evaluate the effectiveness of privacy-preserving techniques [14, 15]. Accurate passenger counting holds significance across various applications like public transportation, ride-sharing services, and traffic management [16, 17].

Urban heat prediction is crucial for understanding and mitigating the effects of heat islands, areas with significantly higher temperatures due to human activities and infrastructure [18, 19]. Light field image depth estimation tasks involve estimating depth information captured in a scene [20, 21], particularly essential for applications like 3D reconstruction, autonomous driving, and augmented reality, where precise depth information is pivotal [22, 23].

Transformers represent a specific architecture widely used for sequence modeling tasks such as natural language processing or image recognition [24, 25]. Detecting glass surfaces finds utility in diverse applications such as robotics, augmented reality, or autonomous driving, where accurate scene understanding is indispensable [26, 27]. IoT environments encompass various applications, including smart homes, industrial automation, healthcare, and smart cities [28, 29]. Image feature extraction plays a pivotal role in many computer vision tasks like object detection, classification, and segmentation [30, 31].

Adapting a traffic object detection model from one domain to another typically involves gradually refining the adaptation process from coarse to fine adjustments [32, 33]. Reported enhancements include maximum reductions of 22% and 33% in absolute and relative trajectory error, respectively.

3 Methodologies

This section provides a detailed description of the deep recurrent convolutional neural network (RCNN) framework that realises monocular VO in an end-to-end manner. It mainly consists of CNN-based feature extraction, G-CA-based feature mapping, and LSTM-based sequential modelling. The overall flow of the proposed architecture is shown in Fig. 1. A monocular image sequence from the video clip is taken as input. In the preprocessing stage, each input image is denoised, resized, and smoothed. A feature map is then computed from the preprocessed image using the global channel attention (G-CA) method, features are extracted from this feature map using a CNN, and sequence learning is performed with an LSTM. From each image pair, the network estimates the pose at time step t; the process is repeated at time step t + 1 to estimate the subsequent poses.

Fig. 1 Overview of the proposed VO modelling system
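To make this data flow concrete, the following minimal Python sketch (with hypothetical function names, not the authors' released code) traces the stages of Fig. 1 in order; the individual components are sketched in the subsections below.

```python
# Hypothetical pipeline sketch of Fig. 1 (names are illustrative only).
def estimate_trajectory(frames, preprocess, gca, cnn, lstm, pose_head):
    """frames: list of consecutive RGB images from the monocular video."""
    poses = []
    state = None                                # LSTM hidden/cell state carried across time steps
    for t in range(1, len(frames)):
        pair = (preprocess(frames[t - 1]), preprocess(frames[t]))  # denoise, resize, smooth
        feat_map = gca(pair)                    # global channel attention feature mapping
        feat = cnn(feat_map)                    # CNN feature extraction
        out, state = lstm(feat, state)          # sequential modelling across time
        poses.append(pose_head(out))            # pose estimate at time step t
    return poses
```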

The hard challenges involved are overcome by the following 6G capabilities:

  • Terahertz (THz) frequency ranges are anticipated to be used by 6G, which may enable more accurate localization with possibly centimeter-level precision. Massive MIMO (Multiple Input Multiple Output) technology, which can improve the capacity and dependability of wireless communications and location, is also made easier by these higher frequencies.

  • In contrast to earlier generations, 6G seeks to lower latency and improve the effectiveness of both services by combining communication and location into a single framework.

  • Signals can be directed more precisely via beamforming, which enhances localization accuracy and lowers interference. This is especially useful in densely populated urban areas.

  • With 6G, artificial intelligence (AI) and machine learning are predicted to play major roles in allowing the network to dynamically adapt to the surroundings and user needs.

3.1 Preprocessing

The input RGB image is preprocessed by subtracting the mean RGB value and resizing it so that each dimension is a multiple of 64. Variations arising in real-time applications introduce a partial volume effect that degrades the input images; a bias field detection and correction approach is used to mitigate this. The bias field, the intensity difference between pixels of comparable regions, is treated as a multiplicative component of the image. Recent studies on RGB images have shown that smoothing improves results compared with non-smoothed inputs. Therefore, this paper pretreats the RGB images with bias field reduction and smoothing to improve the feature extraction and sequencing results.

The observed image \({x}_{t}\) is modelled in terms of the true image \({x}_{0}\), the bias field B, and additive noise N, as in Eq. (1)

$${x}_{t}={Bx}_{0}+N$$
(1)

Once the bias field is identified, it is corrected using the N4ITK method [34]. To smooth the image, a Gaussian filter with a kernel size of 5 × 5 is used, as in Eq. (2)

$${I}_{smooth}\left(G\left(x,y\right)\right)=\frac{1}{2\pi {\sigma }^{2}}{e}^{-\frac{{x}^{2}+{y}^{2}}{2{\sigma }^{2}}}$$
(2)

where \(\sigma \) denotes the standard deviation.
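A minimal OpenCV/NumPy sketch of these preprocessing steps (mean subtraction, resizing to a multiple of 64, and 5 × 5 Gaussian smoothing) is shown below; the bias-field correction of Eq. (1) is only noted as a comment, since it is typically performed with an external N4ITK implementation, and the interpolation mode and sigma value are assumptions.

```python
import cv2
import numpy as np

def preprocess(img_bgr, sigma=1.0):
    """Sketch of the preprocessing stage (assumed details: interpolation mode, sigma)."""
    h, w = img_bgr.shape[:2]
    # Resize so that each spatial dimension is a multiple of 64.
    new_h, new_w = (h // 64) * 64, (w // 64) * 64
    img = cv2.resize(img_bgr, (new_w, new_h), interpolation=cv2.INTER_AREA).astype(np.float32)
    # Subtract the mean RGB value of the image.
    img -= img.mean(axis=(0, 1), keepdims=True)
    # Bias-field correction (Eq. 1) would be applied here, e.g. via an N4ITK tool [34]; omitted in this sketch.
    # 5 x 5 Gaussian smoothing (Eq. 2).
    img = cv2.GaussianBlur(img, (5, 5), sigma)
    return img
```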

3.2 Feature Mapping Using G-CA

The G-CA process builds on the non-local neural network [36] and the 1-dimensional convolution of ECA-Net [35]. As illustrated in Fig. 2, a feature tensor F of size \(b\times h\times w\) is taken from the backbone network. To obtain the 1 × b query \({Q}_{b}\) and the key \({k}_{b}\), global average pooling (GAP) over the spatial dimensions is applied, followed by a 1D convolution with kernel size k and a sigmoid activation function. The outer product of \({Q}_{b}\) and \({k}_{b}\) is passed through a softmax over the channels to form the b \(\times \) b G-CA map,

Fig. 2 G-CA

$${A}_{b}^{g}=softmax({{k}_{b}}^{T}{Q}_{b})$$
(3)

Finally, the attention map is applied to the value \({V}_{b}\) as \({V}_{b}\times {A}_{b}^{g}\), which is reshaped back to \(b\times h\times w\) to produce the G-CA map \({G}_{b}\). The channel attention is denoted as in Eq. (4)

$${G}_{b}=\sigma \left(Fully\_Connected\left({Max}_{pool}\left(X\right)\right)+Fully\_Connected\left(Averag{e}_{pool}\left(X\right)\right)\right)$$
(4)
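A minimal PyTorch sketch of the G-CA block described by Eq. (3) is given below, assuming an ECA-style kernel size k = 3; the exact layer choices of the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalChannelAttention(nn.Module):
    """Sketch of the G-CA block (Eq. 3); the kernel size k = 3 is an assumption."""
    def __init__(self, k=3):
        super().__init__()
        # Two ECA-style 1D convolutions produce the channel query Q_b and key K_b.
        self.conv_q = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_k = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                        # x: (batch, b, h, w) feature tensor F
        n, b, h, w = x.shape
        gap = x.mean(dim=(2, 3)).unsqueeze(1)    # global average pooling -> (n, 1, b)
        q = torch.sigmoid(self.conv_q(gap))      # query Q_b: (n, 1, b)
        k = torch.sigmoid(self.conv_k(gap))      # key   K_b: (n, 1, b)
        attn = torch.bmm(k.transpose(1, 2), q)   # outer product K_b^T Q_b -> (n, b, b)
        attn = F.softmax(attn, dim=1)            # softmax over channels (Eq. 3)
        v = x.view(n, b, h * w)                  # value V_b
        out = torch.bmm(attn, v).view(n, b, h, w)  # attention applied and reshaped to b x h x w
        return out
```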

3.3 End to End VO Using C-LSTM

Several well-known and powerful DNN architectures, such as VGGNet [37] and GoogleNet [38], were created for computer vision applications and have demonstrated exceptional performance. Most of them are built to solve recognition, classification, and detection problems and are therefore trained to derive knowledge from appearance and visual context. However, as mentioned previously, VO, which is rooted in geometry, should not be strongly tied to appearance. As such, directly applying the widely used DNN architectures to the VO problem is not feasible. Addressing VO and other geometric problems requires a framework that learns geometric feature representations. Moreover, because VO systems operate on image sequences obtained during motion, inferring relationships between successive image frames, such as motion models, is crucial, and these relationships evolve over time. The proposed C-LSTM takes these needs into account. The proposed end-to-end VO system is shown in Fig. 3.

Fig. 3 Proposed end-to-end VO using DL

As seen in the diagram above, the C-LSTM (Convolutional Long Short-Term Memory) architecture is a novel strategy created to address the particular difficulties of visual odometry. Convolutional neural networks (CNNs) and long short-term memory (LSTM) units are combined in this architecture to provide a system that can interpret spatial data and account for the temporal sequence of images in order to infer motion.

  • The C-LSTM framework is specifically designed to learn geometric feature representations, which are essential for effectively modelling the motion between consecutive image frames. In contrast to appearance-focused architectures, C-LSTM places more emphasis on the scene's geometry and the relative motion of the camera or objects.

  • The LSTM component of the C-LSTM is designed to capture the dependencies and temporal relationships across a series of frames. This is crucial for VO, as trajectory estimation accuracy is strongly correlated with understanding continuity and changes in position over time.

  • Thanks to the end-to-end training methodology, the system can learn directly from raw RGB images without manual feature extraction or pre-processing. This potentially makes the system more versatile and allows it to automatically determine which features are most relevant for VO tasks, as sketched below.
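As referenced in the list above, a compact PyTorch sketch of the end-to-end C-LSTM idea follows; the layer sizes, the lightweight convolutional encoder, and the 6-DoF pose output are illustrative assumptions rather than the exact configuration reported later.

```python
import torch
import torch.nn as nn

class CLSTMVO(nn.Module):
    """Sketch of an end-to-end C-LSTM VO model: per-pair CNN features, LSTM over time,
    one 6-DoF pose per step. Layer sizes are assumptions, not the paper's exact setup."""
    def __init__(self, hidden=225, layers=2):
        super().__init__()
        self.cnn = nn.Sequential(                          # stand-in for the G-CA/AlexNet encoder
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(256, hidden, num_layers=layers, batch_first=True)
        self.pose = nn.Linear(hidden, 6)                   # translation (3) + rotation (3) per step

    def forward(self, pairs):                              # pairs: (batch, T, 6, H, W) stacked frame pairs
        n, t = pairs.shape[:2]
        feats = self.cnn(pairs.flatten(0, 1)).flatten(1)   # (n*T, 256)
        feats = feats.view(n, t, -1)
        out, _ = self.lstm(feats)                          # temporal modelling
        return self.pose(out)                              # (n, T, 6) relative poses
```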

3.3.1 CNN (AlexNet) Based Feature Extraction

CNNs are popular DL models for analysing visual images [39, 40]. Generally speaking, a CNN takes an image as input and classifies it into one of many categories. Its structure comprises input neurons, a sequence of convolutional layers, pooling layers, fully connected layers, and normalisation layers [41]. Neurons in a convolution layer are connected to a small region of the preceding layer, whereas the activation neurons of the fully connected layers are connected to all neurons of the preceding layer. Equations (5) and (6) represent a fully connected layer's forward and backward propagation.

$${x}_{j}^{l+1}=\sum_{i}{w}_{j,i}^{l+1}{x}_{i}^{l}$$
(5)
$${g}_{i}^{l}=\sum_{j}{w}_{j,i}^{l+1}{g}_{j}^{l+1}$$
(6)

where \({x}_{i}^{l}\) and \({g}_{i}^{l}\) are the activation and gradient of the ith neuron at the lth layer, and \({w}_{j,i}^{l+1}\) denotes the weight connecting neuron i at the lth layer to neuron j at the (l + 1)th layer. Different CNN architectures have emerged in recent research; AlexNet is used in this work. Introduced for the 2012 ImageNet competition, it lowered the image classification error from 26 to 15.3%. It is a highly capable and well-organised architecture. Its eight learning layers comprise five convolution layers and three fully connected layers, and the output of the last layer is fed into a softmax activation function to construct the class labels. Owing to GPU sharing, the kernels of the second, fourth, and fifth layers are connected only to the kernel maps of their preceding layers on the same GPU, whereas the third-layer kernels are fully connected to all kernel maps of the second layer. Max-pooling layers follow the normalisation layers placed after the first and second convolution layers, and each learning layer uses a ReLU activation function. The network architecture details are shown in Table 1. The number of neurons in the last learning layers is set to 22 to balance the features, and the resulting feature representation, together with the G-CA map, is passed to the LSTM-based sequence modelling stage described in Sect. 3.3.2.

Table 1 CNN architecture details
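As one possible way to obtain an AlexNet-style feature extractor (the exact layer configuration of Table 1 is not reproduced here), the pre-trained torchvision model can be truncated to its convolutional stack, as in this sketch; it assumes torchvision ≥ 0.13 for the weights API.

```python
import torch
from torchvision import models

# Sketch: use AlexNet's five convolutional layers (with ReLU and max pooling) as a feature extractor.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
feature_extractor = alexnet.features             # conv1-conv5 stack
feature_extractor.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)          # one preprocessed RGB frame
    feats = feature_extractor(frame)             # (1, 256, 6, 6) feature map
print(feats.shape)
```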

3.3.2 VO Sequencing Using LSTM

The backbone network in this work is a dense-layer LSTM. Figure 4 displays the layers of the dense LSTM. It consists of two fully connected (FC) layers, the first with 160 neurons and the second with 90, followed by batch normalisation and dropout layers. The final layer is an FC layer with three neurons that produces the output for each time step; these dense layers follow the LSTM to regress the output from the sequence features. The LSTM layer is fed with the features produced by the G-CA and CNN stages. The sequence length is set to 30, the maximum number of frames per segment in the dataset. The first and third LSTM layers have 225 hidden units, and the second and fourth layers have 200. As seen in Fig. 5, each layer is made up of LSTM units with four gates: input (i), forget (f), cell (c), and output (o).

Fig. 4 Dense LSTM structure

Fig. 5 LSTM block

In Fig. 5, the variables X, C and H denote the input, cell and hidden states, respectively. Each LSTM block uses three sets of parameters, namely the input weights \({I}_{\omega }\), the recurrent weights \({R}_{\omega }\), and the bias b, as in Eq. (7)

$${I}_{\omega }=\left[\begin{array}{c}{I}_{\omega i}\\ {I}_{\omega f}\\ {I}_{\omega c}\\ {I}_{\omega o}\end{array}\right], {R}_{\omega }=\left[\begin{array}{c}{R}_{\omega i}\\ {R}_{\omega f}\\ {R}_{\omega c}\\ {R}_{\omega o}\end{array}\right], and b=\left[\begin{array}{c}{b}_{i}\\ {b}_{f}\\ {b}_{c}\\ {b}_{o}\end{array}\right]$$
(7)

The cell state at a given time step t is defined as follows,

$${C}_{t}={F}_{t}\odot {C}_{t-1}+{i}_{t}\odot {c}_{t}$$
(8)

where \(\odot \) denotes the Hadamard product. The hidden state \({H}_{t}\) at time t is given as,

$${H}_{t}={o}_{t}\odot \text{tanh}({C}_{t})$$
(9)
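For illustration, a NumPy sketch of a single LSTM update following Eqs. (7)–(9) is given below; the row-wise stacking order of the i, f, c, o blocks follows Eq. (7) and is otherwise an implementation choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Iw, Rw, b):
    """One LSTM step per Eqs. (7)-(9); Iw, Rw, b stack the i, f, c, o blocks row-wise."""
    n = h_prev.shape[0]
    z = Iw @ x + Rw @ h_prev + b                  # pre-activations for the four gates
    i = sigmoid(z[0:n])                            # input gate
    f = sigmoid(z[n:2 * n])                        # forget gate
    g = np.tanh(z[2 * n:3 * n])                    # candidate cell value c_t
    o = sigmoid(z[3 * n:4 * n])                    # output gate
    c = f * c_prev + i * g                         # Eq. (8): C_t = f ⊙ C_{t-1} + i ⊙ c_t
    h = o * np.tanh(c)                             # Eq. (9): H_t = o ⊙ tanh(C_t)
    return h, c
```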

3.4 Cost Function Optimization

The proposed C-LSTM-based VO system computes the conditional probability of the poses \({y}_{t}=({y}_{1},{y}_{2},\dots {y}_{t})\) given the sequence of monocular RGB images \({x}_{t}=({x}_{1},{x}_{2},\dots {x}_{t})\) up to time t, in probabilistic form:

$$p\left({y}_{t}|{x}_{t}\right)=p({y}_{1},{y}_{2},\dots {y}_{t}|{x}_{1},{x}_{2},\dots {x}_{t})$$
(10)

The C-LSTM is used for both modelling and probabilistic inference. To determine the optimal parameters \({\theta }^{*}\) of the VO, the DNN maximizes:

$${\theta }^{*}=\underset{\theta }{\text{argmax}}p\left({y}_{t}|{x}_{t};\theta \right)$$
(11)
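Under a Gaussian error assumption, maximising Eq. (11) is commonly implemented as minimising a mean squared error between estimated and ground-truth poses; the following sketch shows one such surrogate, where the weighting factor kappa and the 6-DoF pose parameterisation are assumptions.

```python
import torch

def vo_loss(pred, gt, kappa=100.0):
    """MSE surrogate for Eq. (11): pred and gt are (batch, T, 6) tensors holding translation
    (first 3) and rotation (last 3) per time step; kappa balances the two terms (assumed value)."""
    t_err = torch.mean((pred[..., :3] - gt[..., :3]) ** 2)
    r_err = torch.mean((pred[..., 3:] - gt[..., 3:]) ** 2)
    return t_err + kappa * r_err
```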

4 Results and Discussion

This section assesses the suggested end-to-end monocular VO approach on the popular KITTI VO/SLAM benchmark [42]. Most currently available monocular VO techniques do not recover an absolute scale, so their localization results must be manually aligned with the ground truth. As a result, the open-source VO library LIBVISO2 [43] is used for comparison. Its monocular version recovers the scale using a fixed camera height, and its stereo version, which can obtain absolutely scaled positions directly, is also used.

4.1 Dataset

There are 22 image sequences in the KITTI VO/SLAM benchmark [42], 11 of which (Sequences 00–10) provide ground truth. The remaining eleven sequences (Sequences 11–21) contain only raw sensor data. This dataset was captured at a relatively low frame rate (10 frames per second) while driving through crowded, dynamic urban areas at speeds of up to 90 km/h, which makes it extremely challenging for monocular VO algorithms.

Two different experiments were carried out to assess the suggested approach. The first is based on Sequences 00–10 to analyse performance statistically, as ground truth is only available for these sequences. Only the relatively long Sequences 00, 02, 08, and 09 are used for training, leaving a separate set for testing. The trajectories are divided into segments of varying lengths to produce a large amount of training data, 7410 samples altogether. The trained models are evaluated on Sequences 03, 04, 05, 06, 07, and 10. As the capacity to generalise effectively to real data is crucial for deep learning methods, the second experiment examines how the proposed technique and the trained VO models behave in entirely new settings, which, as explained previously, is also necessary for the VO problem. Accordingly, models trained on all of Sequences 00–10 are tested on Sequences 11–21, which lack ground truth.

The network is trained on an NVIDIA Tesla K40 GPU using the well-known DL framework Theano. It is trained with the Adagrad optimiser for a maximum of 200 epochs at a learning rate of 0.001. Techniques such as dropout and early stopping are used to prevent the models from overfitting. The CNN is based on a pre-trained FlowNet model to reduce both the training time and the data required to converge [44].
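A rough PyTorch counterpart of this training configuration (the original implementation used Theano) might look as follows; the data loader, the early-stopping patience, and the use of the loss sketched in Sect. 3.4 are assumptions.

```python
import torch

def train(model, loader, epochs=200, lr=0.001, patience=10):
    """Sketch of the training loop; `loader` yields (image_pairs, gt_poses) batches."""
    opt = torch.optim.Adagrad(model.parameters(), lr=lr)
    best, wait = float("inf"), 0
    for epoch in range(epochs):
        total = 0.0
        for pairs, gt in loader:
            opt.zero_grad()
            loss = vo_loss(model(pairs), gt)      # MSE surrogate sketched in Sect. 3.4
            loss.backward()
            opt.step()
            total += loss.item()
        # Early stopping on the epoch loss (a validation loss would normally be used).
        if total < best:
            best, wait = total, 0
        else:
            wait += 1
            if wait >= patience:
                break
```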

4.2 Experimental Results of VO

The performance of the trained VO models is analysed using the KITTI VO/SLAM evaluation metrics, which compute the average root mean square errors (RMSEs) of the translational and rotational errors over all subsequences of lengths between 100 and 800 m and at various speeds (the range of speeds varies across sequences).
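A simplified sketch of this metric is shown below (the official KITTI devkit should be used for reported numbers); it assumes ground-truth and estimated poses are given as aligned lists of 4 × 4 homogeneous matrices and reports RMS drift per subsequence, with rotation expressed in degrees per metre rather than per 100 m.

```python
import numpy as np

def trajectory_distances(poses):
    """Cumulative path length along the ground-truth poses (list of 4x4 matrices)."""
    dist = [0.0]
    for i in range(1, len(poses)):
        dist.append(dist[-1] + np.linalg.norm(poses[i][:3, 3] - poses[i - 1][:3, 3]))
    return np.array(dist)

def relative_errors(gt, pred, lengths=(100, 200, 300, 400, 500, 600, 700, 800)):
    """Simplified KITTI-style drift: translation (%) and rotation (deg/m) per subsequence."""
    dist = trajectory_distances(gt)
    t_errs, r_errs = [], []
    for first in range(0, len(gt), 10):                     # step of 10 frames, as in the devkit
        for length in lengths:
            idx = np.searchsorted(dist, dist[first] + length)
            if idx >= len(gt):
                continue
            # Relative motion error over the subsequence.
            gt_rel = np.linalg.inv(gt[first]) @ gt[idx]
            pr_rel = np.linalg.inv(pred[first]) @ pred[idx]
            err = np.linalg.inv(pr_rel) @ gt_rel
            t_errs.append(np.linalg.norm(err[:3, 3]) / length * 100.0)
            cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
            r_errs.append(np.degrees(np.arccos(cos_angle)) / length)
    return np.sqrt(np.mean(np.square(t_errs))), np.sqrt(np.mean(np.square(r_errs)))
```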

Sequences 00, 02, 08, and 09 are used to train the initial DL-based model, and Sequences 03 to 07 and 10 are used for testing. Figure 6 shows the average RMSEs of the estimated VO on the test sequences, with translation and rotation errors plotted against various path lengths and speeds. Owing to the 6G network implementation, large drifts are avoided and the proposed model achieves better results than the stereo VISO2-S, the monocular VISO2-M, and DeepVO [45]. The rotational errors are smaller than the translational errors because the KITTI dataset is recorded from a car that tends to drive at high speed and rotate slowly, with varying velocity. As seen in Fig. 6a, b, as the trajectory length increases, the translation and rotation errors remain lower than those of the compared approaches (stereo, monocular, and DeepVO). Similarly, in Fig. 7a, b, the translation and rotation errors decrease as speed increases.

Fig. 6 Error calculation for fixed-length path segments: a translation error against path length; b rotation error against path length

Fig. 7 Error calculation at various speeds of path travel: a translation error against speed; b rotation error against speed

Table 2 summarises the detailed performance of the algorithms on the testing sequences. It suggests that, compared with the examined VO systems, the C-LSTM produces more reliable results. While the previous experiment assessed the performance of the proposed model quantitatively, the network is next tested on the KITTI VO benchmark testing dataset to explore its generalisation to entirely new settings with distinct motion patterns and scenes. For this, the C-LSTM model is trained on all 11 training sequences of the KITTI VO benchmark (Sequences 00–10), which provides additional data to minimise overfitting and optimise the network's performance. No quantitative analysis of the VO results is possible because no ground truth is available for these testing sequences.

  • Trel: average translational RMSE (%) over subsequence lengths of 100–800 m.

  • Rrel: average rotational RMSE (\(^\circ \)/100 m) over subsequence lengths of 100–800 m.

Table 2 Testing sequence results

The C-LSTM VO produces results that are substantially superior to those of the monocular VISO2 and somewhat comparable to those of the stereo VISO2. The larger training dataset appears to improve the performance of the proposed model. That a monocular method achieves this attractive performance against the stereo VISO2, which benefits from stereo features, demonstrates that the trained model can generalise effectively to new settings. One possible exception is the Sequence 10 test, which shows quite large localisation errors despite a trajectory shape similar to that of the stereo VISO2. There are multiple causes. Firstly, there is insufficient high-speed data in the training set: of the 11 training sequences, only Sequence 01 exhibits velocities greater than 60 km/h, whereas Sequence 10's top speeds range from 50 to around 90 km/h. Furthermore, the images are collected at only 10 Hz, which makes VO estimation during rapid movement more difficult.

5 Conclusion

This work presents an innovative deep learning-based end-to-end monocular VO algorithm. The new paradigm combines CNNs with LSTMs to achieve simultaneous representation learning and sequential monocular VO modelling, leveraging the power of the GCA-CNN and the LSTM. Because the system is trained end-to-end, there is no need to carefully tune the VO system's parameters, and it does not rely on any module of traditional VO algorithms, not even camera calibration, for pose estimation. The KITTI VO benchmark confirms that it can produce accurate VO results with absolute scale and perform well in new environments. The analysed results and the comparison among the considered VO approaches show the efficiency of the proposed model, with reduced error rates on both training and testing video sequences. This does not mean that the approach replaces traditional geometric VO; on the contrary, by combining geometry with the representations, knowledge, and models learned by the DNNs, it can serve as a useful supplement, further enhancing the VO's accuracy and, more importantly, its robustness.