1 Introduction

Robot localization has been gaining importance in the field of robotics, spanning robot navigation, three-dimensional reconstruction, and simultaneous localization and mapping (SLAM). Visual inertial odometry (VIO) is a common approach to robot localization. By fusing the measurements captured by a camera and an IMU, VIO makes the metric scale, together with the pitch and roll angles, observable, which (especially the scale) underlies tasks such as SLAM and navigation. In addition, the small size of the VIO sensor suite makes it easy to deploy on mobile robots, unmanned aerial vehicles, and handheld devices. Despite these advantages, VIO requires generic three-dimensional motion along different directions, which is hard to satisfy in practice because the robot usually moves horizontally. Even worse, Wu et al. (2017) have proved that the pitch and roll angles and the scale become unobservable when the robot moves along straight lines or circular arcs on the ground. This motivates us to seek an additional sensing modality to address this problem.

In this paper, we propose an accurate and robust odometry system that jointly uses VIO and the wheel encoder (VIWO), expecting to obtain the benefits of both: VIO provides accurate relative translation and rotation information, while the wheel encoder provides absolute scale from long-term, high-frequency, and stable translation and rotation measurements. However, fusing VIO and the wheel encoder is not easy. First, the wheel sometimes slips, which introduces measurement errors in the scale. Second, the system requires good initial values as input, which is vital to accurate localization. Third, the VIO system itself is highly nonlinear; when the wheel encoder is taken into account, the degree of nonlinearity grows further, which makes pose estimation harder and the computational overhead heavier.

VIWO overcomes these challenges with the following three steps. First, we propose a kinematic motion scheme that deals with the accumulated slippage errors by using a preintegration model and a noise propagation model. Second, we obtain good initial values by loosely aligning the IMU and odometer preintegrations with the vision-only structure, and we simplify pose estimation by considering the constraints from the odometer and planar motion in the back-end optimization. Third, we improve the sliding window update strategy and reduce the computational overhead by removing unnecessary measurements. In summary, our contributions are threefold. (1) We propose an accurate and robust odometry system called VIWO that combines visual, IMU, and wheel encoder information. (2) We address the key issues of slippage errors and a high degree of nonlinearity by jointly using a kinematic motion scheme and an improved sliding window update strategy. (3) We implement a prototype of VIWO and conduct several experiments showing that VIWO achieves higher pose estimation accuracy than the baselines.

2 Related work

In recent years, there have been many excellent works on SLAM, including monocular visual SLAM and visual–inertial odometry (VIO). Although monocular visual SLAM, such as ORB-SLAM (Mur-Artal et al. 2015) and DSO (Engel et al. 2018), can generate compact and trackable maps, it cannot recover accurate pitch, roll, and absolute scale. In contrast, VIO is capable of making the metric scale, together with the pitch and roll angles, observable. VIO methods can usually be divided into filtering-based methods (Mourikis and Roumeliotis 2007) and nonlinear optimization-based methods (Leutenegger et al. 2014; Mur-Artal and Tardós 2016; Qin et al. 2018). As a representative of the former, MSCKF (Mourikis and Roumeliotis 2007) is an Extended Kalman Filter (EKF)-based method that constrains the IMU and camera poses simultaneously, as well as the multiple camera poses that share the same feature observation. The latter introduces nonlinear optimization based on a sliding window. OKVIS (Leutenegger et al. 2014) supports monocular and stereo cameras, integrates the inertial measurements in advance, and detects features using the BRISK algorithm (Leutenegger et al. 2011). However, OKVIS (Leutenegger et al. 2014) re-preintegrates the inertial measurements whenever the linearization point changes and has no loop closing implementation. ORB-VISLAM (Mur-Artal and Tardós 2016) introduces inertial measurements into ORB-SLAM (Mur-Artal et al. 2015), optimizing the inertial error term between two frames and achieving zero-drift localization in mapped areas. VINS-Mono (Qin et al. 2018) is another VIO framework that provides a low-computation relocalization module, and its improvement (Qin et al. 2019) can be conveniently extended with other sensors such as GPS.

Several works study visual–inertial–wheel encoder odometry. As proven in Wu et al. (2017), VIO suffers from unobservabilities when the robot moves without generic 3D motion, such as along straight lines or circular arcs. To solve this problem, Wu et al. (2017) make the scale, pitch and roll observable by incorporating odometer measurements and planar motion constraints. Li et al. (2017) present a gyro-aided camera-odometer online calibration and localization method, which is based on stereo vision and has no scale estimation or initial calibration. Furthermore, Liu et al. (2019) consider both gyroscope and accelerometer measurements in the preintegration and optimization, but do not account for the angular velocity of the wheel encoder or the planar motion constraint. DRE-SLAM (Yang et al. 2019) fuses the information of an RGB-D camera and wheel encoder, and constructs an OctoMap in both dynamic and static environments.

Initialization methods are widely adopted in SLAM, and the initial values have a great influence on the accuracy of the system. In earlier studies (Yang and Shen 2017; Shen et al. 2016; Martinelli 2014), the initialization only utilizes relative rotation from the IMU, without considering the gyroscope bias and image noise. Kaiser et al. (2017) introduce gyroscope bias calibration, but require double integration of the IMU measurements. ORB-VISLAM (Mur-Artal and Tardós 2016) proposes an IMU initialization method that computes the scale, gravity direction, velocity, and gyroscope and accelerometer biases within a few seconds with high accuracy. In order to improve efficiency, VINS-Mono (Qin et al. 2018) loosely aligns the IMU preintegration with the vision-only structure without considering the accelerometer bias.

However, visual–inertial odometry suffers from additional unobservabilities when the robot moves along straight lines or circular arcs, so the metric scale and other values cannot be initialized accurately. In this paper, we propose an accurate and robust odometry system, called VIWO, that jointly uses VIO and the wheel encoder, expecting to obtain accurate relative translation and rotation information as well as the absolute scale.

3 System model

3.1 Notations

We begin by defining the notations used throughout this paper. \({\mathbf{R}}_{M}^{N}\) denotes the rotation matrix from frame M to frame N, and \({\mathbf{q}}_{M}^{N}\) is its quaternion form. \({\mathbf{p}}_{M}^{N}\) denotes the translation vector from frame M to frame N. w is the world frame. \(b_k\), \(c_k\) and \(o_k\) are the IMU frame, camera frame and odometer frame, respectively, when we obtain the kth image.

Besides, the extrinsic parameters between the IMU and odometer are denoted \({\mathbf{R}}_{B}^{O}\), \({\mathbf{q}}_{B}^{O}\), \({\mathbf{p}}_{B}^{O}\) and \({\mathbf{T}}_{B}^{O}\), which indicate the rotation matrix, quaternion form, translation vector and transformation matrix from the IMU frame to the odometer frame, respectively. Similarly, the extrinsic parameters between the IMU and camera are denoted \({\mathbf{R}}_{B}^{C}\), \({\mathbf{q}}_{B}^{C}\), \({\mathbf{p}}_{B}^{C}\) and \({\mathbf{T}}_{B}^{C}\), which indicate the corresponding quantities from the IMU frame to the camera frame. The extrinsic parameters between the IMU and odometer are calibrated manually, and the extrinsic parameters between the IMU and camera are estimated by the tightly coupled nonlinear optimization in Sect. 6. In addition, \((n)_x\) is the vector \([n,0,0]^T\), \((n)_z\) is the vector \([0,0,n]^T\), and \(({\hat{.}})\) refers to a noisy measurement.

3.2 System definition

The input to our system is a stream of monocular camera frames, IMU measurements and odometer measurements. The raw measurements from the IMU are composed of the angular velocity vector \({\hat{\varvec{\omega }}}_t\) and the linear acceleration vector \({\hat{\mathbf{{a}}}}_t\). The measurements from the odometer are composed of \({\hat{\omega}}_t\) and \(\hat{v}_t\), which denote the angular velocity and linear velocity, respectively. The monocular camera captures a series of grayscale images.

The output of our system is a state vector in the sliding window including robot poses and velocities, 3D feature locations, accelerometer bias, gyroscope bias, and extrinsic parameters.

The structure of the proposed VIWO system is shown in Fig. 1 and the details are presented in the following sections.

Fig. 1
figure 1

The structure of the proposed VIWO system

4 Measurement preprocessing

This section presents the preprocessing methods for the monocular visual, IMU and odometer measurements. Monocular visual preprocessing is responsible for extracting features and tracking the relative transformation between two consecutive frames. IMU and odometer preprocessing are responsible for calculating the preintegration between two selected keyframes. For the monocular visual and IMU measurements, we adopt existing preprocessing methods. For the odometer measurements, we propose a novel preprocessing method that reduces the accumulated slippage errors.

4.1 Visual measurements

The preprocessing of visual measurements follows the visual front-end of VINS-Mono (Qin et al. 2018). We detect features in each image using the GFTT corner detection algorithm (Jianbo and Tomasi 1994), an improved corner detector based on Harris (Mikolajczyk and Schmid 2004), and adopt the KLT sparse optical flow algorithm (Lucas and Kanade 1997) for feature tracking. After obtaining multiple sets of matched features, we use the RANSAC algorithm (Fischler and Bolles 1981) to eliminate outliers and refine feature locations. Finally, we obtain the relative visual rotation and translation of the camera, as well as the detected feature locations.
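To make this pipeline concrete, the sketch below (our illustration, not the authors' implementation) chains the three steps with OpenCV: GFTT corner detection, KLT sparse optical flow tracking, and RANSAC-based outlier rejection, here applied to the fundamental matrix. Parameter choices such as `max_corners` and the threshold values are assumptions.

```python
import cv2
import numpy as np

def track_features(prev_img, cur_img, prev_pts, max_corners=150):
    """Track GFTT corners from prev_img to cur_img and reject outliers with RANSAC."""
    if prev_pts is None or len(prev_pts) < max_corners // 2:
        # goodFeaturesToTrack implements the Shi-Tomasi (GFTT) detector.
        new = cv2.goodFeaturesToTrack(prev_img, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=30)
        prev_pts = new if prev_pts is None else np.vstack([prev_pts, new])
    # KLT sparse optical flow between the two grayscale images.
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, cur_img, prev_pts, None)
    ok = status.ravel() == 1
    p0, p1 = prev_pts[ok], cur_pts[ok]
    # RANSAC on the fundamental matrix removes tracks inconsistent with epipolar geometry.
    _, inliers = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.99)
    if inliers is None:
        return p0, p1
    keep = inliers.ravel() == 1
    return p0[keep], p1[keep]
```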

4.2 Preintegration of IMU measurements

We employ the VINS-Mono algorithm (Qin et al. 2018) to calculate the IMU preintegration. The preintegration of translation \({\varvec{\alpha} }_{b_k}^{b_{k+1}}\), velocity \({\varvec{\beta} }_{b_k}^{b_{k+1}}\) and rotation \({\mathbf{R}}_{b_k}^{b_{k+1}}\) between two consecutive frames \(b_k\) and \(b_{k+1}\) can be written as

$$\begin{aligned} {\varvec{\alpha}}_{b_k}^{b_{k+1}}&= \int \int _{t\in [k,k+1]} [{\mathbf{R}}_{b_t}^{b_k} ((\hat{\mathbf{a}}_t-{\mathbf{b}}_t^a-{\mathbf{n}}^a)-{\mathbf{R}}_w^{b_t} {\mathbf{g}} ^w)]\delta t^2,\\ \varvec{\beta }_{b_k}^{b_{k+1}}&= \int _{t\in [k,k+1]}[{\mathbf{R}}_{b_t}^{b_k}((\hat{\mathbf{a}}_t-{\mathbf{b}}_t^a-{\mathbf{n}} ^a)-{\mathbf{R}}_w^{b_t} {\mathbf{g}} ^w)]\delta t,\\ {\mathbf{R}}_{b_k}^{b_{k+1}}&= \int _{t\in [k,k+1]} {\mathbf{R}}_{b_t}^{b_k} \text{exp}((\hat{\varvec{\omega }}_t - {\mathbf{b}}_t^g - {\mathbf{n}}^g)^{\wedge }) \delta t, \end{aligned}$$
(1)

where \(\hat{\mathbf{a}}_t\), \({\mathbf{b}}_t^a\) and \({\mathbf{n}}^a\) are the measurement, bias and Gaussian white noise of the accelerometer, respectively, and \(\hat{\varvec{\omega }}_t\), \({\mathbf{b}}_t^g\) and \({\mathbf{n}}^g\) are the corresponding terms of the gyroscope.
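As a rough illustration of Eq. (1), the following sketch performs the discrete preintegration recursion with Euler integration, with the biases held fixed over the interval and the gravity term omitted for simplicity (in practice it is compensated in the residual). All names are our assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate_imu(acc, gyro, dt, ba, bg):
    """acc, gyro: (N, 3) raw samples between frames b_k and b_{k+1}; dt: sample period (s)."""
    alpha = np.zeros(3)   # preintegrated translation
    beta = np.zeros(3)    # preintegrated velocity
    R = np.eye(3)         # accumulated rotation R_{b_t}^{b_k}
    for a_t, w_t in zip(acc, gyro):
        a = R @ (a_t - ba)                    # bias-corrected acceleration in frame b_k
        alpha += beta * dt + 0.5 * a * dt**2
        beta += a * dt
        # exponential map of (omega - b_g) * dt, i.e. rotation vector -> rotation matrix
        R = R @ Rotation.from_rotvec((w_t - bg) * dt).as_matrix()
    return alpha, beta, R
```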

4.3 Preintegration of odometer measurements

Wheel slippage is one of the main sources of scale measurement error. To solve this problem, we propose a novel kinematic motion scheme that deals with the accumulated slippage errors by using a preintegration model and a noise propagation model between two keyframes. The kinematic motion scheme consists of three steps: constructing the motion model, constructing the preintegration model, and constructing the noise propagation model.

4.3.1 Constructing motion model

According to the motion characteristics of the kinematic mobile robot, we construct the motion model. The raw linear velocity \(\hat{v}\) in the forward direction of the robot and the yaw angular velocity \(\hat{w}\) measured at time t are given by:

$$\begin{aligned} \hat{v}_{t}&= v_{t}+\frac{n_r+n_l}{2}, \\ \hat{w}_{t}&= w_{t}+\frac{n_r-n_l}{l}. \end{aligned}$$
(2)

We assume additive Gaussian white noise in both the left and right wheels, where \(n_{l} \sim {\mathcal {N}}(0,\sigma _{l}^2)\) and \(n_{r} \sim {\mathcal {N}}(0,\sigma _{r}^2)\). Here, l is the distance between the left and right wheel centers. To simplify the description, we use \(n_w=\frac{n_r-n_l}{l}\) and \(n_v=\frac{n_r+n_l}{2}\) to denote the rotation and translation noises in the following sections.
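For intuition, the differential-drive relations behind Eq. (2) amount to a two-line helper; the track width l = 0.35 m below is an arbitrary example value.

```python
def wheel_to_body_velocity(v_left, v_right, l=0.35):
    """Forward linear velocity and yaw rate of the robot body from wheel speeds (m/s)."""
    v = 0.5 * (v_right + v_left)   # forward velocity: mean of the two wheel speeds
    w = (v_right - v_left) / l     # yaw rate: speed difference divided by the track width
    return v, w
```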

4.3.2 Constructing preintegration model

Since the sampling frequency of the odometer is much higher than that of the camera, we integrate the odometer measurements between two consecutive frames \(b_k\) and \(b_{k+1}\), which can be given as follows:

$$\begin{aligned} {\mathbf{R}}_{b_k}^{b_{k+1}}&= \int _{t\in [k,k+1]} {\mathbf{R}}_{b_t}^{b_k}\text{exp}[({\mathbf{R}}_O^B (\hat{w}_{t} - n_w)_z )^{\wedge }] \delta t ,\\ {\mathbf{p}}_{b_k}^{b_{k+1}}&= \int _{t\in [k,k+1]} {\mathbf{R}}_{b_t}^{b_k} [{\mathbf{R}}_O^B (\hat{v}_{t} - n_v)_x \delta t + {\mathbf{p}}_O^B - {\mathbf{R}}_{b_{t-1}}^{b_t} {\mathbf{p}}_O^B], \end{aligned}$$
(3)

where \((.)^\wedge\) denotes transformation to the skew-symmetric matrix form, and \(\text{exp}(\xi ^\wedge )\) is the exponential map from the Lie algebra \({\mathfrak{so}}(3)\) to the Lie group \({\mathbf{SO}}(3)\), which physically means the transformation from a rotation vector to a rotation matrix. Furthermore, we can see that the preintegration model contains the Gaussian white noises at each moment between the two consecutive frames.
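A minimal sketch of the discretized odometer preintegration in Eq. (3) is given below, assuming a constant sample period dt and known IMU-odometer extrinsics (R_ob, p_ob); the per-step incremental rotation is used as an approximation of \(R_{b_{t-1}}^{b_t}\) in the lever-arm term, so this is an illustration rather than the exact implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate_odometer(v_meas, w_meas, dt, R_ob, p_ob):
    """v_meas, w_meas: (N,) forward and yaw velocities between frames b_k and b_{k+1}."""
    R = np.eye(3)        # accumulated rotation since frame b_k
    p = np.zeros(3)      # accumulated, lever-arm compensated translation
    for v, w in zip(v_meas, w_meas):
        # incremental rotation of this step, mapped through the extrinsic rotation
        dR = Rotation.from_rotvec(R_ob @ np.array([0.0, 0.0, w * dt])).as_matrix()
        # translation increment: forward motion plus the lever-arm correction
        p += R @ (R_ob @ np.array([v * dt, 0.0, 0.0]) + p_ob - dR @ p_ob)
        R = R @ dR
    return R, p
```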

4.3.3 Constructing noise propagation model

We observe that the noises from the odometer accumulate and propagate with the preintegration processing. To decrease the effects of the accumulated odometer noises, we construct a noise propagation model to separate the noises from the preintegration model.

The odometer preintegration between two consecutive frames \(b_k\) and \(b_{k+1}\) contains the rotation term \({\mathbf{R}}_{b_k}^{b_{k+1}}\) and the translation term \({\mathbf{p}}_{b_k}^{b_{k+1}}\). According to the first-order approximation of the Taylor expansion, we can split the preintegration into measurements and noises as follows:

$$\begin{aligned} {\mathbf{R}}_{b_k}^{b_{k+1}}&= \int _{t\in [k,k+1]} {\mathbf{R}}_{b_t}^{b_k}\text{exp}[({\mathbf{R}}_O^B (\hat{w}_{t} - n_w)_z )^{\wedge }] \delta t\\ &=\hat{{\mathbf{R}}}_{b_k}^{b_{k+1}} \int _{t\in [k,k+1]} \text{exp} (-{\mathbf{J}}_{rt} {\mathbf{R}}_O^B (n_w)_z )^{\wedge }\delta t\\&\triangleq \hat{{\mathbf{R}}}_{b_k}^{b_{k+1}} \text{exp}((-\delta \varphi _{o_k}^{o_{k+1}})^{\wedge }), \end{aligned}$$
(4)

and

$$\begin{aligned} {\mathbf{p}}_{b_k}^{b_{k+1}}&= \int _{t\in [k,k+1]} \hat{{\mathbf{R}}}_{b_k}^{b_{k+1}} \text{exp}((-\delta \varphi _{o_k}^{o_{k+1}})^{\wedge })\\&\quad \times [{\mathbf{V}}_t - {\mathbf{R}}_O^B (n_v)_x \delta t + {\mathbf{p}}_O^B - {\mathbf{R}}_{b_{t-1}}^{b_t} {\mathbf{p}}_O^B]\\&= \hat{{\mathbf{p}}}_{b_k}^{b_{k+1}} + \int _{t\in [k,k+1]} \hat{{\mathbf{R}}}_{b_k}^{b_{k+1}} \delta \varphi _{o_k}^{o_{k+1}}\\&\quad \times [{\mathbf{V}}_t - {\mathbf{R}}_O^B (n_v)_x \delta t + {\mathbf{p}}_O^B - {\mathbf{R}}_{b_{t-1}}^{b_t} {\mathbf{p}}_O^B]^\wedge \\&\triangleq \hat{{\mathbf{p}}}_{b_k}^{b_{k+1}} + \delta p_{o_k}^{o_{k+1}}, \end{aligned}$$
(5)

where \({\mathbf{V}}_t = {\mathbf{R}}_O^B (\hat{v}_t)_x \varDelta t\) and \({\mathbf{J}}_{rt} = {\mathbf{J}}_r({\mathbf{R}}_O^B(\hat{w}_t - n_w)_z \varDelta t)\) is the right jacobian on Lie group \({\mathbf{SO}} (3)\).

As a result, we can get the odometer rotation error term \(\delta \varphi _{o_k}^{o_{k+1}}\) and the translation error term \(\delta p_{o_k}^{o_{k+1}}\) from Eqs. (4) and (5), which can be given:

$$\begin{aligned} \delta \varphi _{o_k}^{o_{k+1}}&= \int _{t\in [k,k+1]} \text{exp} (- {\mathbf{J}}_{rt} {\mathbf{R}}_O^B (n_w)_z )^{\wedge }\delta t ,\\ \delta p_{o_k}^{o_{k+1}}&= \int _{t\in [k,k+1]} \hat{{\mathbf{R}}}_{b_k}^{b_{k+1}} \delta \varphi _{o_k}^{o_{k+1}}\\&\quad \times [{\mathbf{V}}_t - {\mathbf{R}}_O^B (n_v)_x \delta t + {\mathbf{p}}_O^B - {\mathbf{R}}_{b_{t-1}}^{b_t} {\mathbf{p}}_O^B]^\wedge . \end{aligned}$$
(6)

Furthermore, we can get the accumulated error terms from the 0th to the kth frame, \(\delta \varphi _{o_0}^{o_k}\) and \(\delta p_{o_0}^{o_k}\), and the accumulated error terms from the 0th to the \((k+1)\)th frame, \(\delta \varphi _{o_0}^{o_{k+1}}\) and \(\delta p_{o_0}^{o_{k+1}}\). The relationship between them can be written in matrix form:

$$\begin{aligned} \begin{bmatrix} \delta \varphi _{o_0}^{o_{k+1}}\\ \delta p_{o_0}^{o_{k+1}} \end{bmatrix}&= \begin{bmatrix} {\hat{{{\mathbf{R}}}}}_{b_{k+1}}^{b_k} &{} {\mathbf{0}}_{3\times 3}\\ {\hat{{{\mathbf{R}}}}}_{b_0}^{b_k} ({{\mathbf{V}}}_k - {{\mathbf{R}}}_O^B (\frac{n_r+n_l}{2})_x \varDelta t + {{\mathbf{p}}}_{O}^B - {{\mathbf{R}}}_{b_k}^{b_{k+1}} {{\mathbf{p}}}_{O}^B)^\wedge &{} {\mathbf{I}}_{3\times 3} \end{bmatrix}\begin{bmatrix} \delta \varphi _{o_0}^{o_k}\\ \delta p_{o_0}^{o_k} \end{bmatrix} \\&\quad + \begin{bmatrix} \frac{{{\mathbf{J}}}_{rk} {{\mathbf{R}}}_O^B \varDelta t}{l} &{} -\frac{{{\mathbf{J}}}_{rk} {{\mathbf{R}}}_O^B \varDelta t}{l}\\ \frac{{{\mathbf{R}}}_{b_0}^{b_{k-1}} {{\mathbf{R}}}_O^B \varDelta t}{2} &{} \frac{{{\mathbf{R}}}_{b_0}^{b_{k-1}} {{\mathbf{R}}}_O^B \varDelta t}{2} \end{bmatrix} \begin{bmatrix} ({\mathbf{n}}_r)_{3\times 1}\\ ({\mathbf{n}}_l)_{3\times 1} \end{bmatrix}\\&\triangleq {\mathbf{A}} \begin{bmatrix} \delta \varphi _{o_0}^{o_k}\\ \delta p_{o_0}^{o_k} \end{bmatrix} + \begin{bmatrix} {\mathbf{B}}&{\mathbf{C}} \end{bmatrix} \begin{bmatrix} ({\mathbf{n}}_r)_{3\times 1}\\ ({\mathbf{n}}_l)_{3\times 1} \end{bmatrix}. \end{aligned}$$
(7)

According to the forward propagation of covariance, the error term \(\begin{bmatrix} \delta \varphi _{o_i}^{o_j}\\ \delta p_{o_i}^{o_j} \end{bmatrix}\) satisfies the Gaussian distribution \({\mathcal {N}}({\mathbf{0}} _{6\times 1},{\varvec{\Sigma }}_{{\mathbf{o}}_{\mathbf{i}}}^{{\mathbf{o}}_{\mathbf{j}}})\). Therefore, the covariance matrix can be written as:

$$\begin{aligned} {\varvec{\Sigma }}_{{\mathbf{o}}_{\mathbf{0}}}^{{\mathbf{o}}_{{\mathbf{k}}+{\mathbf{1}}}} = {\mathbf{A}} {\varvec{\Sigma }}_{{\mathbf{o}}_{\mathbf{0}}}^{{\mathbf{o}}_{\mathbf{k}}} {\mathbf{A}}^{\mathbf{T}} + {\mathbf{B}} {\varvec{\Sigma }}_{{{\mathbf{n}}}_{\mathbf{r}}} {\mathbf{B}} ^{\mathbf{T}} + {\mathbf{C}} {\varvec{\Sigma }}_{{\mathbf{n}}_{\mathbf{l}}} {\mathbf{C}}^{\mathbf{T}}. \end{aligned}$$
(8)
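Once the blocks A, B and C of Eq. (7) are formed, the covariance recursion of Eq. (8) reduces to a few matrix products; a hedged sketch:

```python
import numpy as np

def propagate_odometer_covariance(Sigma_k, A, B, C, sigma_r, sigma_l):
    """Sigma_k: 6x6 covariance of [delta_phi; delta_p] up to frame k; sigma_*: wheel noise std."""
    Sigma_nr = (sigma_r ** 2) * np.eye(3)   # right-wheel noise covariance
    Sigma_nl = (sigma_l ** 2) * np.eye(3)   # left-wheel noise covariance
    # Eq. (8): Sigma_{k+1} = A Sigma_k A^T + B Sigma_nr B^T + C Sigma_nl C^T
    return A @ Sigma_k @ A.T + B @ Sigma_nr @ B.T + C @ Sigma_nl @ C.T
```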

5 Robust initialization

Robust initialization plays a significant role in nonlinear optimization, which requires a well-performed initial guess at the beginning. However, limited to the visual and IMU information, the initialization in VINS-Mono (Qin et al. 2018) suffers from poor scale and orientation results when moving along straight lines or circular arcs on the ground. Therefore, we propose a new method that takes the odometer preintegration into consideration, providing excellent initial values. We first adopt the sliding window vision-only SfM strategy to obtain the feature observations and relative rotations among different frames. Then, we take advantage of the IMU and odometer preintegrations and the rotation results to provide well-performed initial values including velocity, gyroscope bias, scale, gravity vector, robot poses and feature locations.

5.1 Sliding window vision-only SfM

First, we calculate the relative rotation and translation of the frames using an SfM algorithm (Wu 2013). Specifically, the five-point algorithm (Nister 2004) is used to compute the essential matrix, and the perspective-n-point (PnP) method (Lepetit et al. 2009) estimates the poses of all frames. If there is stable feature tracking and sufficient parallax with respect to other frames in the sliding window, we obtain the feature observations by minimizing the reprojection errors using global bundle adjustment (Triggs et al. 2000).

5.2 Gyroscope bias calibration

Upon obtaining the monocular camera rotation \({\mathbf{q}}_{c_k}^{c_{k+1}}\) between the kth and \((k+1)\)th frames in the sliding window, the gyroscope bias can be calibrated by minimizing:

$$\begin{aligned} \mathop {\min }_{\delta b_g} \sum _{k\in {\mathcal {F}}} \Vert {\mathbf{q}}_C^B \otimes {\mathbf{q}}_{c_k}^{c_{k+1}} \otimes {\mathbf{q}}_B^C \otimes {\mathbf{q}}_{b_{k+1}}^{b_k} \Vert ^2, \end{aligned}$$
(9)

where \({\mathbf{q}}_{b_{k+1}}^{b_k} \approx \hat{{\mathbf{q}}}_{b_{k+1}}^{b_k} \otimes \begin{bmatrix} 1\\ \frac{1}{2} {\mathbf{J}}_{b_g}^{q} \delta {\mathbf{b}}_{g} \end{bmatrix}\). Similarly, the odometer rotation \({\mathbf{q}}_{o_k}^{o_{k+1}}\) can also be used to calculate the gyroscope bias:

$$\begin{aligned} \mathop {\min }_{\delta b_g} \sum _{k\in {\mathcal {F}}} \Vert {\mathbf{q}}_O^B \otimes {\mathbf{q}}_{o_k}^{o_{k+1}} \otimes {\mathbf{q}}_B^O \otimes {\mathbf{q}}_{b_{k+1}}^{b_k} \Vert ^2. \end{aligned}$$
(10)

The gyroscope biases obtained from Eqs. (9) and (10) are denoted \(b_{g_1}\) and \(b_{g_2}\), respectively, and the final estimate is \(\frac{1}{2}(b_{g_1} + b_{g_2})\). Finally, we repropagate all IMU preintegrations with the new \(b_g\).
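After linearization, both Eqs. (9) and (10) reduce to small linear least-squares problems in the bias correction; a sketch of the normal-equation solve (the exact residual and Jacobian construction, and all names, are our assumptions):

```python
import numpy as np

def solve_gyro_bias(residuals, jacobians):
    """residuals: list of (3,) quaternion-error vector parts; jacobians: list of (3, 3) matrices."""
    H = np.zeros((3, 3))
    b = np.zeros(3)
    for r_k, J_k in zip(residuals, jacobians):
        H += J_k.T @ J_k   # accumulate the normal equations over all frame pairs
        b += J_k.T @ r_k
    return np.linalg.solve(H, b)   # bias correction delta_bg

# The two estimates from Eqs. (9) and (10) are then averaged, b_g = 0.5 * (b_g1 + b_g2),
# and all IMU preintegrations are repropagated with the updated bias.
```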

5.3 Velocity and metric scale initialization

The scale is an important quantity in the initialization procedure, so we optimize the metric scale, the velocities and the gravity simultaneously. The state vector to be estimated is \({\mathcal {X}}_I = [{\mathbf{v}}_0^b, {\mathbf{v}}_1^b,\ldots ,{\mathbf{v}}_n^b, s]\), where \({\mathbf{v}}_k^b\) is the linear velocity in the IMU frame while taking the kth image and s is the metric scale. Here, we assume that the direction and magnitude of gravity are known, \({\mathbf{g}} ^{c_0} = [0, g, 0]^T\), as the robot moves on the planar ground. Thus, we can get \(\hat{\mathbf{z }}_{b_{k+1}}^{b_k}\) and \({\mathbf{H}}_{b_{k+1}}^{b_k}\) between two consecutive frames as:

$$\begin{aligned} \hat{\mathbf{z }}_{b_{k+1}}^{b_k} = \begin{bmatrix} \hat{{\mathbf{p}}}_{b_{k+1}}^{b_k} - {\mathbf{p}}_C^B + \hat{{\mathbf{R}}}_{b_{k+1}}^{b_k}{\mathbf{p}}_C^B - \frac{1}{2}{\mathbf{R}}_{b_0}^{b_k}{\mathbf{R}}_C^B{\mathbf{g}} ^{c_0}\varDelta t_k^2\\ \hat{{\mathbf{v}}}_{b_{k+1}}^{b_k} - {\mathbf{R}}_{b_0}^{b_k}{\mathbf{R}}_C^B {\mathbf{g}} ^{c_0} \varDelta t_k\\ {\mathbf{R}}_{b_0}^{b_k}{\mathbf{R}}_O^B \hat{{\mathbf{p}}}_{o_{k+1}}^{o_k} + ({\mathbf{p}}_O^B-{\mathbf{p}}_C^B) - \hat{{\mathbf{R}}}_{b_{k+1}}^{b_k}({\mathbf{p}}_O^B-{\mathbf{p}}_C^B) \end{bmatrix}, \end{aligned}$$
(11)
$$\begin{aligned} {\mathbf{H}}_{b_{k+1}}^{b_k} {\mathcal {X}}_I= \begin{bmatrix} -\mathbf{I} _{3 \times 3}\varDelta t_k &{} {\mathbf{0}} _{3 \times 3} &{} {\mathbf{R}}_{b_0}^{b_k}{\mathbf{R}}_C^B \bar{{\mathbf{p}}}_{c_{k+1}}^{c_k}\\ -\mathbf{I} _{3 \times 3} &{} {\mathbf{R}}_C^B \bar{{\mathbf{R}}}_{c_{k+1}}^{c_k}{\mathbf{R}}_B^C &{} 0\\ {\mathbf{0}} _{3 \times 3} &{} {\mathbf{0}} _{3 \times 3} &{} {\mathbf{R}}_{b_0}^{b_k}{\mathbf{R}}_C^B \bar{{\mathbf{p}}}_{c_{k+1}}^{c_k} \end{bmatrix} \begin{bmatrix} {\mathbf{v}}_k^b\\ {\mathbf{v}}_{k+1}^b\\ s \end{bmatrix}, \end{aligned}$$
(12)

where \({\mathbf{R}}_{b_0}^{b_k}\) is the preintegration term from the 0th to kth image in IMU frame, and \(\bar{{\mathbf{R}}}_{c_{k+1}}^{c_k}\), \(\bar{{\mathbf{p}}}_{c_{k+1}}^{c_k}\) are visual rotation and translation after SfM algorithm.

According to the constraint that the translations obtained from the camera, IMU and odometer in the \(b_k\) frame should be the same, we can obtain the following least squares problem:

$$\begin{aligned}&\min _{{\mathcal {X}}_I} \sum _{k\in {\mathcal {F}}} \Vert \hat{\mathbf{z }}_{b_{k+1}}^{b_k} - {\mathbf{H}}_{b_{k+1}}^{b_k} {\mathcal {X}}_I \Vert ^2, \end{aligned}$$
(13)

which can be solved by SVD (Golub and Reinsch 1970) to obtain the vector \({\mathcal {X}}_I\). If the metric scale in the vector \({\mathcal {X}}_I\) is positive, the velocities in the sliding window and the metric scale are initialized successfully. Furthermore, gravity refinement is performed to correct the direction and magnitude again using the method in VINS-Mono (Qin et al. 2018). Finally, all the variables are transformed into the world frame, and the robot poses and feature locations are recovered in the absolute scale.
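A sketch of solving Eq. (13): the per-frame blocks of Eqs. (11)-(12), assumed here to be already expanded to the full dimension of \({\mathcal {X}}_I\), are stacked into one linear system and solved in the least-squares sense (internally via SVD).

```python
import numpy as np

def solve_velocity_and_scale(z_blocks, H_blocks):
    """z_blocks: list of (9,) measurement vectors; H_blocks: list of (9, dim(X_I)) matrices."""
    z = np.concatenate(z_blocks)
    H = np.vstack(H_blocks)
    X_I, *_ = np.linalg.lstsq(H, z, rcond=None)   # SVD-based least-squares solve
    scale = X_I[-1]                               # the last entry of X_I is the metric scale s
    return X_I, scale

# Initialization is accepted only if the recovered scale is positive.
```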

6 Tightly coupled nonlinear optimization

This section tightly couples all known measurements and the state vector to be estimated in the sliding window based on bundle adjustment (Triggs et al. 2000), which is presented in factor map form. We also consider the constraints from the odometer and planar motion to simplify the pose estimation.

The full state vector to be estimated in the back-end optimization sliding window is defined as:

$$\begin{aligned} {\mathcal {X}}&= [\mathbf{x }_0, \mathbf{x }_1, \ldots \mathbf{x }_n, \lambda _0, \lambda _1, \ldots \lambda _m, {\mathbf{p}}_C^B, {\mathbf{q}}_C^B], \\ \mathbf{x }_k&= [{\mathbf{p}}_{b_k}^w, {\mathbf{v}}_{b_k}^w, {\mathbf{q}}_{b_k}^w, {\mathbf{b}}_a, {\mathbf{b}}_g], k \in [0,n], \end{aligned}$$
(14)

where n is the number of keyframes and m is the number of features in the sliding window. \(\mathbf{x }_k\) denotes the IMU state when the system gets the kth keyframe, which contains the position \({\mathbf{p}}_{b_k}^w\), velocity \({\mathbf{v}}_{b_k}^w\), orientation \({\mathbf{q}}_{b_k}^w\), accelerometer bias \({\mathbf{b}}_a\) and gyroscope bias \({\mathbf{b}}_g\). \(\lambda _l\) is the inverse depth of the lth feature from its first observation. \({\mathbf{p}}_C^B\) and \({\mathbf{q}}_C^B\) are the extrinsic parameters to be estimated.

6.1 Bundle adjustment

The state estimation problem refers to estimating the inner state from noisy data. We obtain the visual, IMU and odometer measurements from Sect. 4, and the state vector to be estimated is shown in Eq. (14). However, the state vector calculated only by preprocessing and initialization is not optimal and still contains accumulated errors. To address this problem, we optimize the robot poses and the other variables in the state vector by adopting bundle adjustment, which is a tightly coupled nonlinear optimization approach.

Furthermore, in order to implement online optimization, we use a sliding window to keep a certain number of states as input to the bundle adjustment model. As new preprocessed measurements arrive, the states are constantly updated according to the marginalization strategy.

In visual–inertial–wheel encoder state estimation system, the bundle adjustment is achieved by minimizing the sum of prior and the Mahalanobis norm of all measurement residuals, which can be given as follows:

$$\begin{aligned}&\mathop {\min }_{{\mathcal {X}}} \left\{ \Vert {\mathbf{r}}_p-{\mathbf{H}}_p {\mathcal {X}}\Vert ^2 + \sum _{k\in {\mathcal {F}}}\Vert {\mathbf{r}}_{\mathcal {B}}(\hat{\mathbf{z }}_{b_{k+1}}^{b_k},{\mathcal {X}}) \Vert ^2_{\sum _{b_{k+1}}^{b_k}}\right. \\&\quad + \sum _{k\in {\mathcal {F}}}\Vert {\mathbf{r}}_{{\mathcal {O}}}(\hat{\mathbf{z }}_{o_{k+1}}^{o_k},{\mathcal {X}}) \Vert ^2_{\sum _{o_{k+1}}^{o_k}} \\&\quad \left. + \sum _{l\in {\mathcal {L}}} \sum _{j \in {\mathcal {F}}_l} \rho (\Vert {\mathbf{r}}_{\mathcal {C}}(\hat{\mathbf{z }}_{c_j,l},{\mathcal {X}}) \Vert ^2_{\sum _{c_j,l}} )+ \sum _{k\in {\mathcal {F}}}\Vert {\mathbf{r}}_{pl} \Vert ^2_{\sum _{pl}} \right\} , \end{aligned}$$
(15)

where \(({\mathbf{r}}_p-{\mathbf{H}}_p {\mathcal {X}})\) is the prior information after marginalization, \({\mathcal {F}}\) denotes all the frames in the sliding window, \({\mathbf{r}}_{\mathcal {B}}(\hat{\mathbf{z }}_{b_{k+1}}^{b_k},{\mathcal {X}})\) is the IMU measurement residual, \({\mathbf{r}}_{{\mathcal {O}}}(\hat{\mathbf{z }}_{o_{k+1}}^{o_k},{\mathcal {X}})\) is the odometer measurement residual, \({\mathcal {L}}\) and \({\mathcal {F}}_l\) denote all the features and the frames in which the feature l appears, \({\mathbf{r}}_{\mathcal {C}}(\hat{\mathbf{z }}_{c_j,l}, {\mathcal {X}})\) is the visual measurement residual, \(\rho\) is the Huber cost function (Huber 1964), \({\mathbf{r}}_{pl}\) is the planar constraint residual, and \(\sum _{b_{k+1}}^{b_k}\), \(\sum _{o_{k+1}}^{o_k}\), \(\sum _{c_j,l}\) and \(\sum _{pl}\) are the covariances corresponding to the residuals.
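The Mahalanobis norms in Eq. (15) can be implemented by whitening each residual with the Cholesky factor of its covariance; a schematic sketch follows (the list of residual functions and the use of scipy's robust loss are our assumptions, not the authors' solver):

```python
import numpy as np
from scipy.optimize import least_squares

def whiten(r, cov):
    """Return L^{-1} r with cov = L L^T, so that ||whiten(r)||^2 = r^T cov^{-1} r."""
    L = np.linalg.cholesky(cov)
    return np.linalg.solve(L, r)

def total_residual(x, residual_blocks):
    """residual_blocks: list of (residual_fn, covariance) pairs evaluated at the state x."""
    return np.concatenate([whiten(fn(x), cov) for fn, cov in residual_blocks])

# Minimized, e.g., with a robustified nonlinear least-squares solver:
# result = least_squares(total_residual, x0, args=(residual_blocks,), loss="huber")
```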

6.2 Measurement residuals

In the sliding window, we have the state vector composed of the rotation \({\mathbf{R}}_{b_k}^w\), translation \({\mathbf{p}}_{b_k}^w\) and velocity \({\mathbf{v}}_{b_k}^w\) of each frame. At the same time, we have the rotation and translation measurements of each sensor, \(\hat{{\mathbf{p}}}_{o_{k+1}}^{o_k}\), \(\hat{{\mathbf{R}}}_{o_{k+1}}^{o_k}\), \(\hat{{\mathbf{p}}}_{b_{k+1}}^{b_k}\), \(\hat{{\mathbf{R}}}_{b_{k+1}}^{b_k}\), \(\hat{{\mathbf{p}}}_{c_{k+1}}^{c_k}\), \(\hat{{\mathbf{R}}}_{c_{k+1}}^{c_k}\). In this way, there are residuals between the state vector and the measurements, which can be used to eliminate the accumulated errors caused by measurement preprocessing.

Each sensor has its own residual term, which is associated with the variables to be estimated and the measurements obtained in the preprocessing model. The visual and IMU residuals are the same as in VINS-Mono (Qin et al. 2018), and we analyze the odometer residuals and planar residuals in detail. According to the constraints, we can construct the factor map shown in Fig. 2. The visual factor constrains multiple robot poses, feature locations, velocities, IMU biases and the extrinsic parameters between the IMU and camera. The IMU factor constrains two consecutive robot poses, velocities and IMU biases. The odometer factor also constrains two consecutive robot poses. The planar factor constrains a single pose individually.

Fig. 2
figure 2

The factor map that describes the optimization. The circles represent states to be estimated, and the squares represent the constraint edges derived from measurements. Each square constrains one or more states

6.2.1 Visual residuals

For each detected feature in the sliding window, we calculate the reprojection errors by projecting the feature pixel coordinates onto a unit sphere. First, we find the first frame i and the other frames in which the lth feature appears. Then we reproject the lth feature pixel coordinates of both the first frame i and frame j onto the unit sphere of frame j to calculate the residual term of the lth feature in frame j.

$$\begin{aligned} {\mathbf{r}}_{\mathcal {C}}(\hat{\mathbf{z }}_{c_j,l},{\mathcal {X}}) = \begin{bmatrix} \hat{x}_{c_j} - x_{c_j}\\ \hat{y}_{c_j} - y_{c_j}\\ \hat{z}_{c_j} - z_{c_j} \end{bmatrix}, \end{aligned}$$
(16)
$$\begin{aligned} \begin{bmatrix} \hat{x}_{c_j}\\ \hat{y}_{c_j}\\ \hat{z}_{c_j} \end{bmatrix}&= \frac{1}{\lambda _l} \pi _c^{-1} \begin{bmatrix} \hat{u}_l^{c_j}\\ \hat{v}_l^{c_j} \end{bmatrix}, \quad \begin{bmatrix} x_{c_j}\\ y_{c_j}\\ z_{c_j}\\ 1 \end{bmatrix}\\&= {\mathbf{T}}_B^C{\mathbf{T}}_w^{b_j}{\mathbf{T}}_{b_i}^{w}{\mathbf{T}}_C^B \left( \frac{1}{\lambda _l} \pi _c^{-1} \begin{bmatrix} \hat{u}_l^{c_i}\\ \hat{v}_l^{c_i} \end{bmatrix} \right) _H, \end{aligned}$$
(17)

where \([\hat{x}_{c_j}, \hat{y}_{c_j}, \hat{z}_{c_j}]^T\) is the reprojection result of the lth feature from frame j, and \([x_{c_j}, y_{c_j}, z_{c_j}, 1]^T\) is the corresponding reprojection result of the lth feature from frame i. \(\pi _c^{-1}\) denotes the back-projection of pixel coordinates onto the unit sphere using the intrinsic parameters, \(\lambda _l\) is the inverse depth and \(\frac{1}{\lambda _l}\) is the real depth of the lth feature, which helps transform the normalized coordinates into three-dimensional space. In order to simplify the computation, we also use the transformation matrix \({\mathbf{T}}\) to represent the rotation and translation, and use \((.)_H\) to expand three-dimensional coordinates into homogeneous coordinates.
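A hedged sketch of Eqs. (16)-(17): `pi_inv` is an assumed back-projection using the camera intrinsics, and the 4x4 homogeneous transforms are assumed to be built from the current state estimates.

```python
import numpy as np

def visual_residual(uv_i, uv_j, inv_depth, pi_inv, T_cb, T_wbj, T_biw, T_bc):
    """Residual of feature l, first observed in frame i and re-observed in frame j.
    T_cb = T_B^C, T_wbj = T_w^{b_j}, T_biw = T_{b_i}^w, T_bc = T_C^B (4x4 homogeneous)."""
    obs_j = pi_inv(uv_j) / inv_depth                  # observed point in camera frame j
    p_i = np.append(pi_inv(uv_i) / inv_depth, 1.0)    # homogeneous point in camera frame i
    # chain T_B^C T_w^{b_j} T_{b_i}^w T_C^B as in Eq. (17)
    pred_j = (T_cb @ T_wbj @ T_biw @ T_bc @ p_i)[:3]
    return obs_j - pred_j
```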

6.2.2 IMU residuals

The residual of IMU measurements in the sliding window can be defined as:

$$\begin{aligned} {\mathbf{r}}_{\mathcal {B}}(\hat{\mathbf{z }}_{b{k+1}}^{b_k}, {\mathcal {X}})= & {} \begin{bmatrix} \delta {\varvec{\alpha}}_{b_{k+1}}^{b_k}\\ \delta \varvec{\beta }_{b_{k+1}}^{b_k}\\ \delta {\mathbf{R}}_{b_{k+1}}^{b_k}\\ \delta {\mathbf{b}}_{a}\\ \delta {\mathbf{b}}_{g} \end{bmatrix}\nonumber \\= & {} \begin{bmatrix} {\mathbf{R}}_{w}^{b_k}({\mathbf{p}}_{b_{k+1}}^{w}\!-\!{\mathbf{p}}_{b_k}^w + \frac{1}{2}{} {\mathbf{g}} ^w \varDelta t_k^2 - {\mathbf{v}}_{b_k}^w\varDelta t_k)-\hat{{\varvec{\alpha}}}_{b_{k+1}}^{b_k}\\ {\mathbf{R}}_{w}^{b_k}({\mathbf{v}}_{b_{k+1}}^w + {\mathbf{g}} ^w \varDelta t_k - {\mathbf{v}}_{b_k}^w )-\hat{\varvec{\beta }}_{b_{k+1}}^{b_k}\\ \text{log}({\mathbf{R}}_w^{b_k}{\mathbf{R}}_{b_{k+1}}^w\hat{{\mathbf{R}}}_{b_{k+1}}^{b_k})^\vee \\ {\mathbf{b}}_{a b_{k+1}} - {\mathbf{b}}_{a b_k}\\ {\mathbf{b}}_{g b_{k+1}} - {\mathbf{b}}_{g b_k} \end{bmatrix}, \end{aligned}$$
(18)

where \(\delta {\varvec{\alpha}}_{b_{k+1}}^{b_k}\), \(\delta \varvec{\beta }_{b_{k+1}}^{b_k}\) and \(\delta {\mathbf{R}}_{b_{k+1}}^{b_k}\) are the translation, velocity and rotation error terms between the measurements and the state vector in the two consecutive frames \(b_k\) and \(b_{k+1}\). The accelerometer and gyroscope biases in two adjacent frames should be the same. \(\text{log}(.)^{\vee }\) is the logarithmic map from the Lie group \({\mathbf{SO}} (3)\) to the Lie algebra \({\mathfrak{so}}(3)\), i.e., the transformation from a rotation matrix to a rotation vector.

6.2.3 Odometer residuals

The odometer residual is associated with the preintegration obtained after preprocessing and with the rotation and translation terms in the state vector.

$$\begin{aligned} {\mathbf{r}}_{{\mathcal {O}}}(\hat{\mathbf{z }}_{o{k+1}}^{o_k}, {\mathcal {X}})= & {} \begin{bmatrix} \delta {\mathbf{p}}_{o_{k+1}}^{o_k}\\ \delta {\mathbf{R}}_{o_{k+1}}^{o_k} \end{bmatrix}\nonumber \\= & {} \begin{bmatrix} {\mathbf{R}}_{w}^{b_k}({\mathbf{p}}_{b_{k+1}}^{w}\!-\!{\mathbf{p}}_{b_k}^w)-{\mathbf{p}}_O^B+{\mathbf{R}}_w^{b_k}{\mathbf{R}}_{b_{k+1}}^w{\mathbf{p}}_O^B-\hat{{\mathbf{p}}}_{o_{k+1}}^{o_k}\\ \text{log}({\mathbf{R}}_B^O {\mathbf{R}}_w^{b_k} {\mathbf{R}}_{b_{k+1}}^w {\mathbf{R}}_O^B \hat{{\mathbf{R}}}_{o_k}^{o_{k+1}})^{\vee } \end{bmatrix}, \end{aligned}$$
(19)

where \(\hat{{\mathbf{p}}}_{o_{k+1}}^{o_k}\) and \(\hat{{\mathbf{R}}}_{o_k}^{o_{k+1}}\) are the translation and rotation term of the odometer preintegration between two adjacent frames \(b_k\) and \(b_{k+1}\) in the sliding window.
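A sketch of Eq. (19), with rotation matrices for the states \(({\mathbf{R}}_{b_k}^w, {\mathbf{p}}_{b_k}^w)\) and the SO(3) log-map taken via scipy; variable names are our own.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def odometer_residual(R_bk_w, p_bk_w, R_bk1_w, p_bk1_w, R_ob, p_ob, p_hat, R_hat):
    """States of frames b_k, b_{k+1}; extrinsics (R_O^B, p_O^B); odometer preintegration (p_hat, R_hat)."""
    # translation row of Eq. (19)
    dp = R_bk_w.T @ (p_bk1_w - p_bk_w) - p_ob + R_bk_w.T @ R_bk1_w @ p_ob - p_hat
    # rotation row: log(R_B^O R_w^{b_k} R_{b_{k+1}}^w R_O^B R_hat)^vee as a rotation vector
    dphi = Rotation.from_matrix(R_ob.T @ R_bk_w.T @ R_bk1_w @ R_ob @ R_hat).as_rotvec()
    return np.concatenate([dp, dphi])
```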

6.2.4 Planar residuals

Since the robot moves on a planar ground, there is almost no pitch or roll angle, and the translation along the vertical dimension is almost zero. Therefore, the planar residuals can be presented as:

$$\begin{aligned} {\mathbf{r}}_{pl} = \begin{bmatrix} ({\mathbf{R}}_w^{b_r} {\mathbf{R}}_{b_k}^w)_{roll}\\ ({\mathbf{R}}_w^{b_r} {\mathbf{R}}_{b_k}^w)_{pitch}\\ (-{\mathbf{R}}_w^{b_r} {\mathbf{R}}_{b_k}^w {\mathbf{p}}_w^{b_k} + {\mathbf{p}}_w^{b_r})_{z} \end{bmatrix}, \end{aligned}$$
(20)

where \(b_r\) denotes a randomly chosen frame in the sliding window, \((.)_{roll}\) and \((.)_{pitch}\) mean first transforming the rotation matrix into Euler angles and then selecting only the roll or pitch angle, respectively, and \((.)_{z}\) means choosing only the translation in the vertical direction.
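A sketch of Eq. (20); the roll/pitch extraction uses an assumed intrinsic ZYX (yaw-pitch-roll) Euler convention, and the vertical term uses the algebraically equivalent relative translation expressed in the reference frame \(b_r\).

```python
import numpy as np
from scipy.spatial.transform import Rotation

def planar_residual(R_br_w, p_br_w, R_bk_w, p_bk_w):
    """Planar-motion residual of frame b_k with respect to the reference frame b_r."""
    R_rel = R_br_w.T @ R_bk_w                              # R_w^{b_r} R_{b_k}^w
    yaw, pitch, roll = Rotation.from_matrix(R_rel).as_euler("ZYX")
    # vertical component of the relative translation, expressed in frame b_r
    t_z = (R_br_w.T @ (p_bk_w - p_br_w))[2]
    return np.array([roll, pitch, t_z])
```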

6.3 Marginalization

To meet the limited computational resources and the real-time requirement, the frames in the sliding window need to be updated continuously. First, we always keep the latest frame in the sliding window, whether it is a keyframe or not. Keyframe selection follows the algorithm in VINS-Mono (Qin et al. 2018). Then we update the other frames by determining whether the second latest frame is a keyframe. If it is, we keep this frame in the sliding window and remove the oldest keyframe, including its visual, IMU and odometer measurements as well as its planar constraint. Otherwise, we remove the visual measurements and planar constraint of the second latest frame, and keep its IMU and odometer measurements. The process is shown in Fig. 3. Our marginalization strategy keeps the IMU and odometer measurements in the sliding window to provide continuous motion information. This module is realized by the Schur complement (Sibley et al. 2010): the prior is constructed from the marginalized measurements, and then the prior and the new measurements are combined to construct a new information matrix.
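The sliding-window bookkeeping described above can be summarized as follows; `Frame`, the measurement containers and `max_size` are our own schematic placeholders for the attached factors, not the authors' data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    is_keyframe: bool
    visual_planar: list = field(default_factory=list)   # visual measurements + planar constraint
    imu_odom: list = field(default_factory=list)        # IMU and odometer preintegration data

def update_window(window, new_frame, max_size=10):
    """Always keep the newest frame; marginalize the oldest keyframe or the second latest frame."""
    window.append(new_frame)
    if len(window) <= max_size:
        return window
    second_latest = window[-2]
    if second_latest.is_keyframe:
        # the second latest frame stays; the oldest keyframe and all of its measurements
        # are marginalized into the prior
        window.pop(0)
    else:
        # drop only the visual measurements and planar constraint of the second latest frame;
        # its IMU and odometer measurements are carried over to keep motion continuity
        window[-1].imu_odom = second_latest.imu_odom + window[-1].imu_odom
        window.remove(second_latest)
    return window
```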

Fig. 3
figure 3

The marginalization strategy. The sliding window is composed of several old keyframes and the latest two frames. A different scheme is adopted depending on whether the second latest frame is a keyframe. The gray region marks the measurements to be removed. If the oldest frame is removed, the corresponding measurements are marginalized. Otherwise, if the second latest frame is removed, we keep the IMU and odometer measurements and marginalize the other information

7 Experiments

In this section, we evaluate our system VIWO from two aspects: trajectory accuracy and initialization robustness. Our experiments are based on the mobile robot platform TurtleBot2, as shown in Fig. 4. The TurtleBot2 is equipped with a 2D scanning LiDAR (SICK LMS100), a wheel encoder and a forward-looking global shutter camera integrated with an IMU (MYNTEYE D1000-IR-120/Color). The onboard computation resource is an Intel i5-7300H CPU at 2.40 GHz. The camera provides \(640\times 480\) gray images at 10 Hz and IMU measurements at 200 Hz, while the wheel encoder provides odometer measurements at 20 Hz.

Fig. 4
figure 4

The mobile robot platform

We conduct experiments in several indoor low-texture scenes such as a corridor and a lobby, and collect six sequences: sequence01 to sequence04, in which the robot mostly moves along straight lines with pure rotations; sequence05, in which the robot moves along circular arcs; and sequence06, in which the robot moves freely. The linear velocity is 0.15 m/s in all sequences and the yaw angular velocity is 0.15 rad/s in sequence01 to sequence05. Besides, the trajectory ground truth is obtained by the GMapping algorithm (Grisetti et al. 2005, 2007) based on the 2D scanning LiDAR and wheel encoder measurements, and the resulting grid map is shown in Fig. 5.

We design two groups of experiments to evaluate the trajectory accuracy and the initialization performance, respectively. The first experiment focuses on trajectory accuracy: we compare our system with a state-of-the-art algorithm and with odometer-only measurements. The second experiment focuses on the effect of the initialization method: we perform our initialization in multiple time periods and calculate the average time and average trajectory accuracy of our method and VINS-Mono (Qin et al. 2018).

Fig. 5
figure 5

The grid map from GMapping algorithm

7.1 Evaluation of trajectory accuracy

We compare our system with the open source method VINS-Mono (Qin et al. 2018) and with the wheel encoder (odometer-only measurements), and calculate their trajectory accuracy. Table 1 shows the root mean square error (RMSE) of the ATE (Sturm et al. 2012) from sequence01 to sequence06. In addition, we compare the trajectory accuracy in the x–y, z and x–y–z directions. We find that the x–y plane trajectory accuracy of our approach is improved dramatically over the VIO and odometer-only methods. Furthermore, the results show that the planar motion constraints also reduce the errors in the z direction.
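For reference, the reported RMSE ATE numbers correspond to the following computation, assuming the estimated and ground-truth trajectories have already been time-associated and aligned:

```python
import numpy as np

def rmse_ate(est_xyz, gt_xyz, dims=(0, 1, 2)):
    """est_xyz, gt_xyz: (N, 3) positions; dims selects e.g. (0, 1) for x-y, (2,) for z, (0, 1, 2) for x-y-z."""
    err = est_xyz[:, list(dims)] - gt_xyz[:, list(dims)]
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```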

Table 1 RMSE (in meters) for trajectory accuracy of sequences

As shown in Fig. 6, the whole trajectories from the different approaches are displayed intuitively. We can see that although the odometer-only method has accumulated errors, it still achieves the absolute metric scale. In contrast, VINS-Mono (Qin et al. 2018) suffers from serious scale inaccuracy, but its trajectory shape is generally good. The results demonstrate that our system has higher trajectory accuracy as a result of combining the advantages of the existing approaches.

Fig. 6
figure 6

The trajectory of the proposed method, odometer-only measurements, VINS-Mono and ground truth from GMapping algorithm

7.2 Initialization

In order to illustrate the initialization results, we use two criteria: the time required to complete the initialization and the trajectory accuracy during a period of time after initialization. The back-end optimization and marginalization are kept consistent by tightly coupling the visual, IMU and odometer measurements as in this paper. For each sequence, we set the beginning time to multiples of 30 s and count the time required to finish the initialization. After the initialization, we record the trajectory data of the next 30 s.

We compare the initialization approach proposed in this paper with the method in VINS-Mono (Qin et al. 2018). The average results for each sequence are shown in Fig. 7. We observe that both the average time required for initialization and the trajectory accuracy in the 30 s after initialization are improved.

Fig. 7
figure 7

The average time (s) required for initialization and the average RMSE (m) in x–y–z direction in the 30 s after initialization

For each sequence, as shown in Table 2, the average initialization time of our method is less than that of VINS-Mono (Qin et al. 2018). Moreover, at some moments, such as 60 and 120 s for sequence05, the initialization time required in Qin et al. (2018) is longer than 10 s, whereas all the initialization times of our method are less than 10 s, which means that our method is more stable and faster.

Table 2 Time (in seconds) required to finish initialization

Table 3 shows the trajectory accuracy in the x–y–z direction during the 30 s after initialization. Three results are empty because those trajectories fail to converge, since the initialization method in VINS-Mono (Qin et al. 2018) requires generic 3D motion. In contrast, all the trajectories after our initialization method converge and have lower errors, which indicates higher robustness.

Table 3 RMSE (in meters) for trajectory accuracy in the 30 s after initialization

8 Conclusion

In this paper, we propose a VIWO system that tightly couples the measurements of the camera, IMU and wheel encoder to provide robust and accurate robot poses. We first deal with the accumulated slippage errors using the kinematic motion scheme with the preintegration model and noise propagation model. Then, we propose a robust and novel initialization method, improve the sliding window update strategy and reduce the computational overhead. We implement a prototype and conduct several experiments demonstrating that our system achieves robust and accurate pose estimation, in terms of scale, orientation and location, compared with the state-of-the-art.