
1 Introduction

Over the past 30 years, advances in computer vision, machine learning and related technologies, together with improvements in hardware manufacturing, have brought visual SLAM, which uses the camera as its main sensor, to the attention of industry. With its rich information content and relatively low cost, it has shown great application potential in fields such as robotics, UAVs and AR/VR [1, 2]. According to Cadena's division of SLAM research into eras, after the classical era and the algorithm-analysis era, SLAM research has now entered the robust-perception era [3]. At this stage, the main characteristics and research focus of SLAM are system robustness, high-level understanding, resource awareness and task-driven perception.

Clearly, SLAM based on a single sensor can hardly meet these demands, either in practice or in current theoretical research, and the SLAM problem has to a certain extent driven the idea of multi-sensor information fusion. In particular, by jointly optimizing sparse visual feature observations and IMU measurements, visual-inertial SLAM (also called a Visual-Inertial System, VINS) can effectively handle motion blur from a fast-moving camera and correct the accumulated error of the IMU, thereby further improving system robustness. It offers an effective path toward miniaturized, low-cost SLAM and is a research hotspot in multi-sensor fusion SLAM [4].

However, numerous studies show that although the camera can correct the cumulative error of the IMU, visual-inertial odometry (VIO) still accumulates error over long-term operation because of the limitations of the odometric measurement principle and the lack of explicit loop closure. In large-scale outdoor environments, GNSS remains an easy-to-use and reliable global positioning method [5]. GNSS can provide accurate global position estimates for VIO, helping to correct VIO drift and improving the accuracy and continuity of navigation and positioning. When GNSS is denied, VIO can still keep the navigation solution running.

Based on this, this paper proposes a GNSS-assisted visual-inertial SLAM scheme. Taking an optimization-based tightly coupled VIO as the main body, the IMU angular velocity and acceleration are modeled and optimized jointly with the visual features, and the position information obtained from GNSS observations is fused with the VIO solution, ensuring local pose accuracy while enhancing global positioning accuracy. The scheme can achieve long-term, high-precision positioning in both indoor and outdoor environments.

2 Visual-Inertial Odometry

2.1 Visual Front End Based on Optical Flow Tracking

There are two main approaches to the front end of traditional visual SLAM. The first is the feature-point method, which computes keypoints and descriptors for feature extraction and matching and optimizes camera motion by minimizing the reprojection error. This method is stable and relatively insensitive to illumination changes, but descriptor computation is time-consuming and the method performs poorly when features are scarce. The second is the direct method, which only requires intensity variations in the scene and estimates camera motion by minimizing the photometric error. The direct method avoids computing keypoints and descriptors and is not affected by feature scarcity. However, it relies entirely on gradient-based search, suffers from non-convexity, and its brightness-constancy assumption is difficult to satisfy in real environments.

Considering the shortcomings of the above two methods, we adopt a compromise scheme of keypoint extraction plus optical flow tracking. This method keeps the feature points but extracts only the keypoints, replacing descriptor matching with optical flow tracking, and thus balances accuracy and robustness to a certain extent.

The keypoint extractor used here is the Shi-Tomasi corner detector [6], an improvement on Harris corner extraction [7]. The basic idea of the Harris detector is to center a window on the target pixel, measure how the grayscale intensity changes as the window shifts, and select the points with the largest change as feature points.

On this basis, KLT optical flow is used to track the feature points. The KLT algorithm is an improvement on LK optical flow tracking and tracks the same feature points across two consecutive images [8]. It relies on three assumptions: first, brightness constancy between images; second, small or continuous motion; third, spatial coherence, i.e. neighboring points move similarly. These assumptions are usually satisfied under normal conditions.
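
As an illustration of this front end, the following minimal Python sketch uses OpenCV's Shi-Tomasi detector (cv2.goodFeaturesToTrack) and the pyramidal Lucas-Kanade tracker (cv2.calcOpticalFlowPyrLK). The parameter values are illustrative assumptions, not those of the actual system.

```python
import cv2

def track_features(prev_gray, curr_gray, prev_pts=None):
    """Shi-Tomasi corner extraction + KLT optical flow tracking between two frames."""
    # Extract Shi-Tomasi corners when there are too few tracked points
    if prev_pts is None or len(prev_pts) < 50:
        prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                           qualityLevel=0.01, minDistance=20)
    # KLT: track the same corners into the current frame (no descriptors needed)
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                      prev_pts, None,
                                                      winSize=(21, 21), maxLevel=3)
    good = status.reshape(-1) == 1          # keep only successfully tracked points
    return prev_pts[good], curr_pts[good]
```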

2.2 IMU Preintegration

Because the sampling frequencies of the IMU and the camera differ, data synchronization must be considered before the two can be fused. IMU preintegration [9] solves this problem well.

IMU preintegration takes the relative IMU measurements between two keyframes at different times as a constraint to estimate the IMU state at the next time step (Fig. 1).

Fig. 1. IMU preintegration

In general, a six-axis IMU outputs three-dimensional acceleration and three-dimensional angular velocity, and its measurement model is:

$$ \begin{aligned} \hat{\boldsymbol{a}}_{\mathrm{B}} &= \boldsymbol{R}_{\mathrm{B}}^{\mathrm{W}} \left( \boldsymbol{a}_{\mathrm{W}} - \boldsymbol{g}_{\mathrm{W}} \right) + \boldsymbol{b}_{\mathrm{a}} + \boldsymbol{\eta}_{\mathrm{a}} \\ \hat{\boldsymbol{\omega}}_{\mathrm{B}} &= \boldsymbol{\omega}_{\mathrm{B}} + \boldsymbol{b}_{\mathrm{g}} + \boldsymbol{\eta}_{\mathrm{g}} \end{aligned} $$
(1)

Here the subscripts B and W denote the IMU (body) coordinate system and the world coordinate system, and the subscripts a and g denote the accelerometer and the gyroscope, respectively. \(\boldsymbol{a}_{\mathrm{W}}\) and \(\boldsymbol{g}_{\mathrm{W}}\) are the true motion acceleration and the gravitational acceleration in the world frame, and \(\boldsymbol{R}_{\mathrm{B}}^{\mathrm{W}}\) is the rotation matrix between the IMU frame and the world frame; \(\boldsymbol{\omega}_{\mathrm{B}}\) is the angular velocity in the IMU frame; \(\boldsymbol{b}_{\mathrm{a}}\) and \(\boldsymbol{b}_{\mathrm{g}}\) are the biases of the accelerometer and the gyroscope; \(\boldsymbol{\eta}_{\mathrm{a}}\) and \(\boldsymbol{\eta}_{\mathrm{g}}\) are the measurement noise of the two sensors.
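
As a toy illustration of Eq. (1), the sketch below synthesizes a biased, noisy IMU reading from a true world-frame acceleration and body angular velocity; R_WB denotes the body-to-world rotation, and the bias and noise magnitudes are assumed values chosen only for illustration.

```python
import numpy as np

def simulate_imu(R_WB, a_W, omega_B, b_a, b_g, sigma_a=0.02, sigma_g=0.001):
    """Synthesize accelerometer/gyroscope readings per the measurement model of Eq. (1)."""
    g_W = np.array([0.0, 0.0, -9.81])                                    # gravity in the world frame
    a_hat = R_WB.T @ (a_W - g_W) + b_a + sigma_a * np.random.randn(3)    # specific force in the body frame
    w_hat = omega_B + b_g + sigma_g * np.random.randn(3)                 # angular rate in the body frame
    return a_hat, w_hat
```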

To recover motion information from the IMU measurements, the following kinematic model is introduced:

$$ \begin{aligned} \dot{\boldsymbol{p}}_{\mathrm{W}}^{\mathrm{B}} &= \boldsymbol{v}_{\mathrm{W}}^{\mathrm{B}} \\ \dot{\boldsymbol{v}}_{\mathrm{W}}^{\mathrm{B}} &= \boldsymbol{a}_{\mathrm{W}} \\ \dot{\boldsymbol{R}}_{\mathrm{W}}^{\mathrm{B}} &= \boldsymbol{R}_{\mathrm{W}}^{\mathrm{B}}\,\boldsymbol{\omega}_{\mathrm{B}}^{\wedge} \end{aligned} $$
(2)

Here \(\boldsymbol{p}_{\mathrm{W}}^{\mathrm{B}}\) and \(\boldsymbol{v}_{\mathrm{W}}^{\mathrm{B}}\) are the three-dimensional position and velocity in the world coordinate system, and \((\cdot)^{\wedge}\) denotes the antisymmetric (skew-symmetric) matrix operator.

Based on the IMU measurement model and the kinematic model, the IMU preintegration formulas can be obtained:

$$ \begin{aligned} \boldsymbol{p}_{\mathrm{W}}^{\mathrm{B}}(t+\Delta t) &= \boldsymbol{p}_{\mathrm{W}}^{\mathrm{B}}(t) + \boldsymbol{v}_{\mathrm{W}}^{\mathrm{B}}(t)\,\Delta t + \frac{1}{2}\boldsymbol{g}_{\mathrm{W}}\,\Delta t^{2} + \frac{1}{2}\boldsymbol{R}_{\mathrm{W}}^{\mathrm{B}}(t)\left(\hat{\boldsymbol{a}}_{\mathrm{B}}(t) - \boldsymbol{b}_{\mathrm{a}}(t) - \boldsymbol{\eta}_{\mathrm{ad}}(t)\right)\Delta t^{2} \\ \boldsymbol{v}_{\mathrm{W}}^{\mathrm{B}}(t+\Delta t) &= \boldsymbol{v}_{\mathrm{W}}^{\mathrm{B}}(t) + \boldsymbol{g}_{\mathrm{W}}\,\Delta t + \boldsymbol{R}_{\mathrm{W}}^{\mathrm{B}}(t)\left(\hat{\boldsymbol{a}}_{\mathrm{B}}(t) - \boldsymbol{b}_{\mathrm{a}}(t) - \boldsymbol{\eta}_{\mathrm{ad}}(t)\right)\Delta t \\ \boldsymbol{R}_{\mathrm{W}}^{\mathrm{B}}(t+\Delta t) &= \boldsymbol{R}_{\mathrm{W}}^{\mathrm{B}}(t)\,\mathrm{Exp}\!\left(\left(\hat{\boldsymbol{\omega}}_{\mathrm{B}}(t) - \boldsymbol{b}_{\mathrm{g}}(t) - \boldsymbol{\eta}_{\mathrm{gd}}(t)\right)\Delta t\right) \end{aligned} $$
(3)

The three formulas above constitute the IMU preintegration computation. Preintegration means integrating the raw IMU outputs directly to obtain an integral of the measured values. This integration can therefore be performed as soon as the IMU measurements arrive, i.e. it is "pre"-integrated before the optimization. During optimization, changes to the position p, velocity v or rotation matrix R do not affect the preintegrated values, so the IMU data never need to be re-integrated.
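
The discrete propagation of Eq. (3) can be sketched in a few lines of numpy, as below; Exp() is the SO(3) exponential map (Rodrigues formula), and the biases are treated as known constants here purely for simplicity.

```python
import numpy as np

def skew(v):
    # Antisymmetric (hat) matrix of a 3-vector, the ^ operator in Eq. (2)
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def so3_exp(phi):
    # Rodrigues formula: rotation matrix from a rotation vector phi
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3) + skew(phi)
    a = phi / theta
    return (np.cos(theta) * np.eye(3)
            + (1.0 - np.cos(theta)) * np.outer(a, a)
            + np.sin(theta) * skew(a))

def propagate(p, v, R, a_hat, w_hat, b_a, b_g, dt, g_W=np.array([0.0, 0.0, -9.81])):
    """One IMU sample advances position, velocity and rotation as in Eq. (3)."""
    a = a_hat - b_a                      # bias-corrected accelerometer reading
    w = w_hat - b_g                      # bias-corrected gyroscope reading
    p_new = p + v * dt + 0.5 * g_W * dt**2 + 0.5 * (R @ a) * dt**2
    v_new = v + g_W * dt + (R @ a) * dt
    R_new = R @ so3_exp(w * dt)
    return p_new, v_new, R_new
```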

2.3 Nonlinear Optimization Based on Sliding Window

The essence of the SLAM problem is state estimation. To describe it in general form, a motion equation and an observation equation are commonly used:

$$ \left\{ \begin{aligned} \boldsymbol{x}_{k} &= f\left( \boldsymbol{x}_{k-1}, \boldsymbol{u}_{k}, \boldsymbol{w}_{k} \right) \\ \boldsymbol{z}_{k,j} &= h\left( \boldsymbol{y}_{j}, \boldsymbol{x}_{k}, \boldsymbol{v}_{k,j} \right) \end{aligned} \right. $$
(4)

In the first (motion) equation, \(\boldsymbol{x}_{k}\) is the camera pose at time \(k\), \(\boldsymbol{u}_{k}\) is the input or reading of the motion sensor (here the IMU), and \(\boldsymbol{w}_{k}\) is the process noise; in the second (observation) equation, \(\boldsymbol{z}_{k,j}\) is the observation of landmark \(\boldsymbol{y}_{j}\) from pose \(\boldsymbol{x}_{k}\), and \(\boldsymbol{v}_{k,j}\) is the observation noise.

The SLAM problem, then, is to estimate the camera poses and landmark positions given the motion sensor inputs and the landmark observations, i.e. the problems of "localization" and "mapping". Mathematically, this is a maximum a posteriori (MAP) estimation. Assuming that all measurements are mutually independent and that each measurement noise follows a zero-mean Gaussian distribution, the MAP problem can be transformed into minimizing a sum of cost functions:

$$ \begin{aligned} \mathcal{X}^{*} &= \mathop{\arg\max}\limits_{\mathcal{X}} p\left(\mathcal{X}\,|\,\boldsymbol{z}\right) \\ &= \mathop{\arg\max}\limits_{\mathcal{X}} p\left(\mathcal{X}\right) p\left(\boldsymbol{z}\,|\,\mathcal{X}\right) \\ &= \mathop{\arg\max}\limits_{\mathcal{X}} p\left(\mathcal{X}\right) \prod_{i=1}^{n} p\left(\boldsymbol{z}_{i}\,|\,\mathcal{X}\right) \\ &= \mathop{\arg\min}\limits_{\mathcal{X}} \left\{ \left\| \boldsymbol{r}_{p} - \boldsymbol{H}_{p}\mathcal{X} \right\|^{2} + \sum_{i=1}^{n} \left\| \boldsymbol{r}\left(\boldsymbol{z}_{i}, \mathcal{X}\right) \right\|_{\boldsymbol{P}_{i}}^{2} \right\} \end{aligned} $$
(5)

where \(\boldsymbol{z}\) is the set of independent sensor measurements, \(\left\{ \boldsymbol{r}_{p}, \boldsymbol{H}_{p} \right\}\) is the prior information of the system, and \(\boldsymbol{P}_{i}\) is the covariance of the \(i\)-th measurement.
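
In practice the Mahalanobis-weighted terms \(\left\| \boldsymbol{r}(\boldsymbol{z}_{i}, \mathcal{X}) \right\|_{\boldsymbol{P}_{i}}^{2}\) in Eq. (5) are usually handled by whitening each residual with a square root of the inverse covariance, after which a standard nonlinear least-squares solver can be applied. The residual and covariance values in the sketch below are assumed examples.

```python
import numpy as np

def whiten(residual, P):
    # L is the Cholesky factor of P^{-1}; ||L^T r||^2 equals r^T P^{-1} r
    L = np.linalg.cholesky(np.linalg.inv(P))
    return L.T @ residual

r = np.array([0.5, -0.2])            # example reprojection residual (pixels)
P = np.diag([1.5**2, 1.5**2])        # assumed measurement covariance
wr = whiten(r, P)
cost_term = float(wr @ wr)           # contribution of this measurement to the total cost
```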

In the VIO back end of this paper, a sliding window is used to maintain optimization efficiency and save computing resources. The back end tightly couples the visual and IMU measurements and uses Bundle Adjustment (BA) to minimize the feature reprojection errors of all points in the sliding window together with the IMU observation errors. On this basis, nonlinear optimization is used to iteratively refine the keyframe poses, the IMU states and the three-dimensional coordinates of the landmarks.

After all variables in the sliding window have been optimized, old keyframes must be removed to keep the number of frames in the window constant, and the corresponding keyframe pose variables, IMU state variables and some associated landmark variables are removed with them. Simply discarding these variables loses information, reduces accuracy and can even make the system unobservable. Marginalization is therefore used when variables are removed: their information is converted into constraints between the remaining variables so that as much information as possible is retained.
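
A common way to implement this marginalization, sketched below under the assumption of a symmetric information matrix H and vector b from the linearized problem, is the Schur complement: the block of variables to be removed (indices m_idx) is folded into a prior on the remaining variables (indices r_idx).

```python
import numpy as np

def marginalize(H, b, m_idx, r_idx):
    """Fold the marginalized block into a prior on the remaining variables."""
    Hmm = H[np.ix_(m_idx, m_idx)]
    Hmr = H[np.ix_(m_idx, r_idx)]
    Hrr = H[np.ix_(r_idx, r_idx)]
    bm, br = b[m_idx], b[r_idx]
    Hmm_inv = np.linalg.inv(Hmm)            # assumes the marginalized block is invertible
    H_prior = Hrr - Hmr.T @ Hmm_inv @ Hmr   # Schur complement: prior information matrix
    b_prior = br - Hmr.T @ Hmm_inv @ bm     # prior information vector
    return H_prior, b_prior
```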

3 VIO/GNSS Fusion Algorithm

For the fusion of VIO and GNSS, our strategy is to take the tightly coupled VIO as the main body and loosely couple the GNSS global positioning results with the VIO pose results, so that GNSS assists VIO.

3.1 Spatiotemporal Datum Unification

To fuse the sensor data, the data from the different sensors must be transformed in time and space so that they share a common time reference and coordinate frame.

Time unification is achieved with the PPS signal of the GNSS receiver. Spatial unification mainly involves coordinate transformation: after collecting data, the GNSS receiver obtains the current longitude, latitude and height through its software solution, and this position in the geodetic (LLA) coordinate system is then converted to the Earth-Centered Earth-Fixed (ECEF) coordinate system.
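
The geodetic-to-ECEF conversion mentioned above follows the standard WGS-84 formulas; a minimal sketch:

```python
import numpy as np

def lla_to_ecef(lat_deg, lon_deg, h):
    """Convert latitude/longitude (degrees) and ellipsoidal height (m) to ECEF (m)."""
    a = 6378137.0                 # WGS-84 semi-major axis (m)
    e2 = 6.69437999014e-3         # WGS-84 first eccentricity squared
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    N = a / np.sqrt(1.0 - e2 * np.sin(lat)**2)   # prime vertical radius of curvature
    x = (N + h) * np.cos(lat) * np.cos(lon)
    y = (N + h) * np.cos(lat) * np.sin(lon)
    z = (N * (1.0 - e2) + h) * np.sin(lat)
    return np.array([x, y, z])
```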

3.2 Fusion Method Based on Kalman Filter

The Kalman Filter (KF) optimally estimates the system state from a linear state equation and input/output observations. In the loosely coupled system, the GNSS and VIO subsystems run independently: VIO outputs position and attitude, GNSS outputs position, the KF optimally estimates the system error, and the system finally outputs the corrected VIO parameters (Fig. 2).

Fig. 2. Flow chart of the VIO/GNSS fusion algorithm

The error state vector is selected as:

$$ \tilde{\boldsymbol{x}} = \left[ \tilde{P}_{x}, \tilde{P}_{y}, \tilde{P}_{z}, \tilde{V}_{x}, \tilde{V}_{y}, \tilde{V}_{z} \right]^{\mathrm{T}} $$
(6)

The error-state transition equation of the combined system is:

$$ {\boldsymbol{\dot{\tilde{x}}}}\left( t \right) = {\boldsymbol{F}}\left( t \right) \cdot {\tilde{\boldsymbol{x}}}\left( t \right) + {\boldsymbol{G}}\left( t \right) \cdot {\boldsymbol{w}}\left( t \right) $$
(7)

where the state transition matrix is \({\boldsymbol{F}}\left( t \right) = \left[ {\begin{array}{*{20}c} {\boldsymbol{0}_{3 \times 3}} & {{\boldsymbol{I}}_{3 \times 3}} \\ {\boldsymbol{0}_{3 \times 3}} & {\boldsymbol{0}_{3 \times 3}} \\ \end{array} } \right]\), the noise driving matrix is \({\boldsymbol{G}}\left( t \right) = \left[ {\begin{array}{*{20}c} {{\boldsymbol{I}}_{3 \times 3}} & {\boldsymbol{0}_{3 \times 3}} \\ {\boldsymbol{0}_{3 \times 3}} & {{\boldsymbol{I}}_{3 \times 3}} \\ \end{array} } \right]\), and \({\boldsymbol{w}}\left( t \right)\) is the system noise.

The system measurement model is:

$$ {\boldsymbol{Z}}_{k}^{{}} = {\boldsymbol{H}}_{k}^{{}} {\boldsymbol{x}}_{k}^{{}} + {\boldsymbol{V}}_{k}^{{}} $$
(8)

Here the measurement vector is \({\boldsymbol{Z}}_{k} = \left[ P_{x} - \hat{P}_{x},\; P_{y} - \hat{P}_{y},\; P_{z} - \hat{P}_{z},\; V_{x} - \hat{V}_{x},\; V_{y} - \hat{V}_{y},\; V_{z} - \hat{V}_{z} \right]^{\mathrm{T}}\), the measurement matrix is \({\boldsymbol{H}}_{k} = {\boldsymbol{I}}_{6 \times 6}\), and \({\boldsymbol{V}}_{k}\) is the measurement noise.

Based on the above model, the optimal unbiased estimate of the combined system is obtained through the Kalman filter's two steps of "prediction" and "update".
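
A minimal sketch of one prediction/update cycle for the six-state error model of Eqs. (6)-(8) is given below; the process and measurement noise levels q and r are assumed values that would in practice be tuned to the VIO drift rate and GNSS accuracy.

```python
import numpy as np

def kf_step(x, P, z, dt, q=1e-3, r=1.0):
    """One predict/update cycle of the loosely coupled VIO/GNSS error-state KF."""
    # Discretized transition matrix Phi ~ I + F*dt for F = [[0, I], [0, 0]]
    Phi = np.eye(6)
    Phi[0:3, 3:6] = dt * np.eye(3)
    Q = q * np.eye(6)                         # process noise covariance (assumed)
    H = np.eye(6)                             # measurement matrix of Eq. (8)
    R = r * np.eye(6)                         # measurement noise covariance (assumed)

    # Prediction
    x_pred = Phi @ x
    P_pred = Phi @ P @ Phi.T + Q

    # Update with z = differences between VIO and GNSS position/velocity, Eq. (8)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(6) - K @ H) @ P_pred
    return x_new, P_new                       # estimated error, used to correct the VIO output
```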

4 Simulation and Discussion

4.1 Simulation Dataset

The KITTI dataset [10], released by the Karlsruhe Institute of Technology in Germany, is the world's largest evaluation dataset for computer vision algorithms. It can be used to evaluate stereo vision, visual odometry, 3D object detection, 3D tracking and more, and it contains real image data of urban, rural and highway scenes. Its recording platform includes two grayscale cameras, two color cameras and a GPS navigation system.

The 2011_10_03_drive_0027_Synced sequence is selected as test data. It covers a total length of 3700 m and lasts 430 s. The trajectory produced by the proposed algorithm is shown below (Fig. 3).

Fig. 3. The camera's real-time left and right images and the trajectory on the dataset

Compared with the official ground-truth trajectory, the early part of the trajectory is accurate while the later part drifts, which confirms that the long-distance accuracy of VIO is poor without explicit loop closure.

4.2 Analysis of Simulation Results

For quantitative analysis, EVO is used. EVO is a Python tool for evaluating the measurements and estimated outputs of SLAM systems, whose core function is plotting camera trajectories. In this paper, EVO is used to plot and compare the motion trajectories, compute the Absolute Pose Error (APE) and align the trajectories for comparison. To make the comparison between the algorithm with and without GNSS assistance clear, the error maps of the two are plotted directly, as shown below (Figs. 4, 5).

Fig. 4. VIO trajectory error map and APE without GNSS assistance

Fig. 5. VIO trajectory error map and APE with GNSS assistance

Relevant data are shown in the table below (Table 1).

Table 1. Results of Metric Absolute Pose Error

The APE comparison shows that, with GNSS assistance in the outdoor environment, the VIO trajectory error is markedly reduced: the root mean square error drops from 13.385 m to 2.056 m, the mean error from 11.896 m to 1.687 m, and the standard deviation from 6.136 m to 1.176 m. Applying GNSS corrections to the long-running VIO system therefore improves its navigation accuracy and robustness. In trajectory segments where GNSS performs poorly, pure VIO can still provide usable pose estimates, which shows that the GNSS-assisted VIO algorithm of this paper is effective.

5 Conclusion

This paper proposes a GNSS-assisted visual-inertial SLAM solution that takes an optimization-based tightly coupled VIO as the main body, integrates the global positioning results of GNSS, and is evaluated in simulation on the KITTI dataset. The results show that, for long-term positioning, the GNSS-assisted VIO achieves a mean error of 1.687 m, a standard deviation of 1.176 m and a root mean square error of 2.056 m, an improvement of nearly 80% over the unaided system. Even indoors, where GNSS is unavailable, the system degrades gracefully to pure VIO and can still provide stable 3D navigation and mapping. The proposed algorithm thus effectively mitigates the long-term drift of VIO while remaining usable when GNSS is denied.

Owing to practical constraints, only simulation experiments have been carried out, and the behaviour of VIO under GNSS outages has not been evaluated. Moreover, since only pseudorange-based Single-Point Positioning (SPP) and KF loose coupling are used, the algorithm does not fully exploit GNSS. In future work we will study the fusion of Real-Time Kinematic (RTK) and Precise Point Positioning (PPP) with VIO, and a GVIO that tightly couples the raw GNSS observations will be our key research focus.