1 Introduction

VI-SLAM (Visual-Inertial Simultaneous Localization and Mapping) plays an important role in giving autonomous robots the ability to build a map of their surroundings and estimate their own states. A visual camera and an inertial measurement unit (IMU) are an ideal sensor combination for SLAM because they complement each other. On the one hand, the rich representation of the environment captured by a camera helps to build a map and to estimate the trajectory of the robot up to scale. On the other hand, the gyroscope and accelerometer of an IMU measure the angular velocity and linear acceleration of the sensor suite, which helps to recover the absolute scale and makes gravity and the pitch and roll angles of the robot observable; however, the measurements are affected by noise and drift over time [1]. Their small size, low weight and low energy consumption make these sensors widely used in robot navigation [2], UAVs [3], etc.

In recent years, several visual-inertial techniques have been presented, such as EKF-based VI-SLAM [4, 5] and nonlinear optimization methods [6, 7]. However, all VI-SLAM algorithms depend heavily on accurate system initialization and on a precise prior extrinsic calibration of the 6DoF (Degree-of-Freedom) transformation between the camera and the IMU. The extrinsic parameters act as a bridge for state transformation between the camera reference frame and the IMU reference frame.

At present, there are two main calibration approaches for monocular camera-IMU extrinsic parameters: offline calibration and automatic calibration during system initialization. Offline calibration requires a technician to carefully move a calibration checkerboard in front of the sensor suite [8], which is complex and time-consuming. Automatic calibration during system initialization jointly estimates the initial values and the extrinsic parameters. Li [9] incorporates the camera-IMU transformation into the state vector and estimates it with an extended Kalman filter (EKF); the convergence of this algorithm depends on the accuracy of the state estimate at initialization, and the literature offers no systematic analysis of the results. Dong-Si proposed a geometric method to calibrate the camera-IMU extrinsic parameters in [10]; however, this method does not consider sensor noise, and the tracking accuracy of the system is affected by the accumulation of IMU bias. Yang and Shen [11], building on Lupton [12] and Martinelli [13], calibrate the parameters (except the IMU biases) with an optimization-based linear estimator; the IMU biases are estimated as state variables in the sliding-window nonlinear estimator of their subsequent VI-SLAM system [14]. Huang [15], based on the work of Mur-Artal [16], establishes a linear equation system to estimate the camera-IMU extrinsic parameters together with the other initialization parameters. This method achieves high initialization accuracy, but the extrinsic parameters remain fixed after initialization without online estimation.

In this paper, we realize a VI-SLAM algorithm with automatic monocular camera-IMU extrinsic calibration and online estimation. Without prior knowledge of the mechanical configuration of the sensor suite, the scale factor, gravity, IMU biases and extrinsic parameters are jointly estimated during initialization, and the camera-IMU extrinsic parameters are further estimated online during motion.

The rest of this paper is organized as follows. Section 2 describes the preliminaries of this algorithm. Then the initialization process with camera-IMU extrinsic automatic calibration is proposed in Sect. 3. Section 4 describes online estimation algorithm of camera-IMU extrinsic parameters. Experimental results are shown in Sect. 5. Finally, conclusions are given in Sect. 6.

2 Preliminaries

This section provides the necessary explanations of the notation and geometric concepts used in this article. In addition, the relationships between reference frames and the IMU preintegration model on the manifold are described.

2.1 Notation

The matrices and vectors used here are denoted in bold uppercase and bold lowercase respectively. The letter in the upper right corner of a vector indicates its reference frame, e.g. \( \varvec{v}^{\varvec{W}} \) for the vector \( \varvec{v} \) expressed in frame \( W \). Incorporating geometric meaning, \( \varvec{p}_{\varvec{B}}^{\varvec{C}} \) and \( \varvec{v}_{\varvec{B}}^{\varvec{W}} \) represent the point \( \varvec{p}_{\varvec{B}} \) and the velocity vector \( \varvec{v}_{\varvec{B}} \) expressed in the reference frames \( C \) and \( W \) respectively. The rotation matrix and transformation matrix of frame \( B \) with respect to frame \( C \) are denoted \( \varvec{R}_{{\varvec{CB}}} \) and \( \varvec{T}_{{\varvec{CB}}} \) respectively.

2.2 Reference Frames

The VI-SLAM system mainly involves four frames: the camera frame \( C \), the IMU body frame \( B \), the world frame \( W \) and the inertial frame \( E \), as shown in Fig. 1.

Fig. 1. The reference frame transformation.

The monocular camera and the IMU are rigidly fixed by external devices, and the transformation matrix \( \varvec{T}_{CB} = \left\{ {\varvec{R}_{CB} \left| {\varvec{p}_{B}^{C} } \right.} \right\} \) between them needs to be calibrated. Since the VI-SLAM system measures only relative motion, the absolute attitude in the earth’s inertial frame \( E \) cannot be determined; therefore, the world frame \( W \) is generally taken to coincide with the coordinate system of the first keyframe determined by the VI-SLAM system. The goal is to calibrate the rotation matrix \( \varvec{R}_{{\varvec{CB}}} \; \in \;SO(3) \) and the translation vector \( \varvec{p}_{\varvec{B}}^{\varvec{C}} \in {\mathbb{R}}^{3} \) between the camera frame \( C \) and the IMU body frame \( B \), and to estimate the gravitational acceleration in the world frame \( W \). By aligning the gravity \( \varvec{g}^{\varvec{W}} \) in the system’s world frame with the gravity \( \varvec{g}^{\varvec{E}} \) in the earth’s inertial frame, the absolute pose of the system in the inertial frame is determined. Considering the scale factor \( s \), the transformation between the camera frame \( C \) and the IMU body frame \( B \) is

$$ \varvec{R}_{{\varvec{WB}}} = \varvec{R}_{{\varvec{WC}}} \cdot \varvec{R}_{{\varvec{CB}}} $$
(1)
$$ \varvec{p}_{\varvec{B}}^{\varvec{W}} = s \cdot \varvec{p}_{\varvec{C}}^{\varvec{W}} + \varvec{R}_{{\varvec{WC}}} \cdot \varvec{p}_{\varvec{B}}^{\varvec{C}} $$
(2)
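As a quick sanity check of Eqs. (1)-(2), the IMU body pose can be composed from a camera pose and the extrinsics in a few lines. The following NumPy sketch (function and variable names are ours) illustrates the composition:

```python
import numpy as np

def imu_pose_from_camera(R_WC, p_C_W, R_CB, p_B_C, s):
    """Compose the IMU body pose from a camera pose via Eqs. (1)-(2).

    R_WC, p_C_W : camera rotation and (up-to-scale) position in world frame W
    R_CB, p_B_C : camera-IMU extrinsic rotation and translation
    s           : metric scale factor of the monocular trajectory
    """
    R_WB = R_WC @ R_CB                  # Eq. (1)
    p_B_W = s * p_C_W + R_WC @ p_B_C    # Eq. (2)
    return R_WB, p_B_W

# With identity extrinsics and unit scale the pose is unchanged
R_WB, p_B_W = imu_pose_from_camera(np.eye(3), np.array([1.0, 2.0, 3.0]),
                                   np.eye(3), np.zeros(3), 1.0)
```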

2.3 Preintegration

The IMU acquires the angular velocity and acceleration of the sensor w.r.t. the IMU body frame \( B \) at a fixed frequency. However, the gyroscope and accelerometer are subject to white measurement noises \( {\varvec{\upeta}}_{{\mathbf{a}}} \) and \( {\varvec{\upeta}}_{{\mathbf{g}}} \), as well as slowly drifting biases \( \varvec{b}_{\varvec{a}} \) and \( \varvec{b}_{\varvec{g}} \). When the initial state of the system is known, the state estimate can be propagated by integrating the IMU measurements. However, this propagation depends on the initial state: whenever nonlinear optimization adjusts the initial state, the integration must be repeated.

To avoid repeated integration, the concept of preintegrated IMU measurements on the manifold was proposed by Forster et al. [17]. Assume two consecutive keyframes at times \( i \) and \( j \), and that the IMU measurements are constant over each sampling interval. The pose and velocity relationships can then be computed by numerical integration of all measurements within this period

$$ \varvec{R}_{{\varvec{WB}_{\varvec{j}} }} = \varvec{R}_{{\varvec{WB}_{\varvec{i}} }} \prod\limits_{k = i}^{j - 1} {Exp\left( {\left( {\varvec{\omega}^{{\varvec{B}_{\varvec{k}} }} - \varvec{b}_{\varvec{g}}^{\varvec{k}} -\varvec{\eta}_{\varvec{g}}^{\varvec{k}} } \right)\Delta t} \right)} $$
$$ \varvec{v}_{{\varvec{B}_{\varvec{j}} }}^{\varvec{W}} = \varvec{v}_{{\varvec{B}_{i} }}^{\varvec{W}} + \varvec{g}^{\varvec{W}} \Delta t_{ij} + \sum\limits_{k = i}^{j - 1} {\varvec{R}_{{\varvec{WB}_{\varvec{k}} }} \left( {\varvec{a}^{{\varvec{B}_{\varvec{k}} }} - \varvec{b}_{\varvec{a}}^{\varvec{k}} -\varvec{\eta}_{\varvec{a}}^{\varvec{k}} } \right)\Delta t} $$
(3)
$$ \varvec{p}_{{\varvec{B}_{\varvec{j}} }}^{\varvec{W}} = \varvec{p}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} + \sum\limits_{k = i}^{j - 1} {\left( {\varvec{v}_{{\varvec{B}_{\varvec{k}} }}^{\varvec{W}} \Delta t + \frac{1}{2}\varvec{g}^{\varvec{W}} \Delta t^{2} + } \right.} \left. {\frac{1}{2}\varvec{R}_{{\varvec{WB}_{\varvec{k}} }} \left( {\varvec{a}^{{\varvec{B}_{\varvec{k}} }} - \varvec{b}_{\varvec{a}}^{\varvec{k}} -\varvec{\eta}_{\varvec{a}}^{\varvec{k}} } \right)\Delta t^{2} } \right) $$

where \( \Delta t \) denotes the IMU sampling interval and \( \Delta t_{ij} = \sum\limits_{i}^{j - 1} {\Delta t} \) represents the time interval between the two consecutive keyframes. Following the definition in [17], \( Exp\left( \cdot \right) \) maps the Lie algebra \( {\sf{so}}\left( 3 \right) \) to the Lie group \( SO\left( 3 \right) \). Assuming the bias remains constant between two image acquisition moments, a small bias correction \( \delta \varvec{b}_{\left( \cdot \right)}^{\varvec{i}} \) w.r.t. the previous estimate \( \overline{\varvec{b}}_{\left( \cdot \right)}^{\varvec{i}} \) can be introduced, and Eq. (3) can be rewritten as

$$ \varvec{R}_{{\varvec{WB}_{\varvec{j}} }} = \varvec{R}_{{\varvec{WB}_{\varvec{i}} }}\Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{\varvec{j}} }} Exp\left( {\varvec{J}_{{\varvec{\varDelta}\overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{\varvec{j}} }} }}^{\varvec{g}} \cdot \delta \varvec{b}_{\varvec{g}}^{\varvec{i}} } \right) $$
$$ \varvec{v}_{{\varvec{B}_{\varvec{j}} }}^{\varvec{W}} = \varvec{v}_{{\varvec{B}_{i} }}^{\varvec{W}} + \varvec{g}^{\varvec{W}} \Delta t_{ij} + \varvec{R}_{{\varvec{WB}_{i} }} \left( {\Delta \overline{\varvec{v}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} + \varvec{J}_{{\varvec{\varDelta}\overline{\varvec{v}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} }}^{\varvec{g}} \cdot \delta \varvec{b}_{\varvec{g}}^{\varvec{i}} + \varvec{J}_{{\varvec{\varDelta}\overline{\varvec{v}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} }}^{a} \cdot \delta \varvec{b}_{a}^{\varvec{i}} } \right) $$
(4)
$$ \varvec{p}_{{\varvec{B}_{\varvec{j}} }}^{\varvec{W}} = \varvec{p}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} + \varvec{v}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} \Delta t_{ij} + \frac{1}{2}\varvec{g}^{\varvec{W}} \Delta t_{ij}^{2} + \varvec{R}_{{\varvec{WB}_{\varvec{i}} }} \left( {\Delta \overline{\varvec{p}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} + \varvec{J}_{{\Delta \overline{\varvec{p}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} }}^{\varvec{g}} \cdot \delta \varvec{b}_{\varvec{g}}^{\varvec{i}} + \varvec{J}_{{\Delta \overline{\varvec{p}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} }}^{a} \cdot \delta \varvec{b}_{a}^{\varvec{i}} } \right) $$

\( \varvec{J}_{\left( \cdot \right)}^{\varvec{g}} \) and \( \varvec{J}_{\left( \cdot \right)}^{a} \) are the Jacobian matrices of the preintegrated measurements with respect to the bias estimates, derived in the appendix of [17]. \( \Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{\varvec{j}} }} ,\,\Delta \overline{\varvec{v}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} \) and \( \Delta \overline{\varvec{p}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} \) are the preintegration terms, which are independent of the state at time \( i \) and of gravity, and describe the relative motion between the two frames

$$ \Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{\varvec{j}} }} = \prod\limits_{k = i}^{j - 1} {Exp\left( {\left( {\varvec{\omega}^{{\varvec{B}_{\varvec{k}} }} - \overline{\varvec{b}}_{\varvec{g}}^{\varvec{i}} } \right)\Delta t} \right)} $$
$$ \Delta \overline{\varvec{v}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} = \sum\limits_{k = i}^{j - 1} {\Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{\varvec{k}} }} \left( {\varvec{a}^{{\varvec{B}_{\varvec{k}} }} - \overline{\varvec{b}}_{\varvec{a}}^{\varvec{i}} } \right)\Delta t} $$
(5)
$$ \Delta \overline{\varvec{p}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} = \sum\limits_{k = i}^{j - 1} {\left( {\Delta \overline{\varvec{v}}_{{\varvec{B}_{k} }}^{{\varvec{B}_{\varvec{i}} }} \Delta t + \frac{1}{2}\Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{k} }} \left( {\varvec{a}^{{\varvec{B}_{\varvec{k}} }} - \overline{\varvec{b}}_{\varvec{a}}^{\varvec{i}} } \right)\Delta t^{2} } \right)} $$
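To make the accumulation concrete, the preintegration terms of Eq. (5) can be built up sample by sample. A minimal NumPy sketch under the constant-measurement-per-interval assumption (function names are ours; \( Exp \) is implemented with Rodrigues' formula):

```python
import numpy as np

def so3_exp(phi):
    """Exp: so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(phi)
    if theta < 1e-10:
        return np.eye(3)
    a = phi / theta
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, acc, bg, ba, dt):
    """Accumulate the preintegration terms of Eq. (5) between keyframes i and j.

    gyro, acc : (n, 3) raw IMU samples in the body frame
    bg, ba    : gyroscope / accelerometer bias estimates at keyframe i
    dt        : IMU sampling interval
    """
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, acc):
        # use the state at step k (before updating dR), as in Eq. (5)
        dp = dp + dv * dt + 0.5 * dR @ (a - ba) * dt**2
        dv = dv + dR @ (a - ba) * dt
        dR = dR @ so3_exp((w - bg) * dt)
    return dR, dv, dp
```

For a stationary gyroscope and constant unit acceleration along z over one second, the accumulated terms reduce to the familiar constant-acceleration kinematics.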

3 Initialization with Camera-IMU Extrinsic Automatic Calibration

This section elaborates the initialization method with automatic camera-IMU extrinsic calibration. The method jointly calibrates the camera-IMU extrinsic parameters \( \varvec{T}_{{\varvec{CB}}} \) and estimates the scale factor \( s \), the gravitational acceleration in the world frame \( \varvec{g}^{\varvec{W}} \), and the gyroscope and accelerometer biases \( \varvec{b}_{\varvec{g}} \) and \( \varvec{b}_{\varvec{a}} \).

3.1 Camera-IMU Extrinsic Orientation Calibration

The extrinsic rotation between the monocular camera and the IMU is critical for the robustness of the VI-SLAM system; an excessive deviation can cause the system initialization to fail. The hand-eye calibration method is used to align the camera rotations with the integrated IMU rotations. Since the monocular camera tracks the pose of the system, the relative rotation \( \varvec{R}_{{\varvec{C}_{\varvec{i}} \varvec{C}_{{\varvec{i + 1}}} }} \) between consecutive frames can be obtained. In addition, the angular velocity measured by the gyroscope can be integrated to obtain the relative rotation \( \varvec{R}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} \) in the IMU body frame. This leads to

$$ \varvec{R}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} \cdot \varvec{R}_{{\varvec{BC}}} = \varvec{R}_{{\varvec{BC}}} \cdot \varvec{R}_{{\varvec{C}_{\varvec{i}} \varvec{C}_{{\varvec{i + 1}}} }} $$
(6)

With the quaternion representation, (6) can be described as

$$ \begin{aligned} & \varvec{q}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} \otimes \varvec{q}_{{\varvec{BC}}} = \varvec{q}_{{\varvec{BC}}} \otimes \varvec{q}_{{\varvec{C}_{\varvec{i}} \varvec{C}_{{\varvec{i + 1}}} }} \\ & \Rightarrow \left[ {\left[ {\varvec{q}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} } \right]_{L} - \left[ {\varvec{q}_{{\varvec{C}_{\varvec{i}} \varvec{C}_{{\varvec{i + 1}}} }} } \right]_{R} } \right]\varvec{q}_{{\varvec{BC}}} = \varvec{Q}_{{\varvec{i,i + 1}}} \cdot \varvec{q}_{{\varvec{BC}}} = {\mathbf{0}}_{{{\mathbf{4 \times 1}}}} \\ \end{aligned} $$
(7)

A linear over-determined system can be established over temporally consecutive frames

$$ \left[ {\begin{array}{*{20}c} {\alpha_{0,1} \cdot \varvec{Q}_{{\varvec{0,1}}} } \\ {\alpha_{1,2} \cdot \varvec{Q}_{{\varvec{1,2}}} } \\ \vdots \\ {\alpha_{N - 1,N} \cdot \varvec{Q}_{{\varvec{N - 1,N}}} } \\ \end{array} } \right]\varvec{q}_{{\varvec{BC}}} \;{ = }\;\varvec{Q}_{\varvec{N}} \cdot \varvec{q}_{{\varvec{BC}}} \; = \;{\mathbf{0}} $$
(8)

\( N \) indicates the number of frames used when the extrinsic rotation converges, and \( \alpha_{N - 1,N} \) is a weight for outlier handling. As the extrinsic rotation calibration runs with incoming measurements, the previously estimated result \( \widehat{\varvec{R}}_{{\varvec{BC}}} \) can be used to weight the residual

$$ r_{i,i + 1} = \arccos \left( {\left( {tr\left( {\widehat{\varvec{R}}_{{\varvec{BC}}}^{ - 1} \varvec{R}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }}^{ - 1} \widehat{\varvec{R}}_{{\varvec{BC}}} \varvec{R}_{{\varvec{C}_{\varvec{i}} \varvec{C}_{{\varvec{i + 1}}} }} } \right) - 1} \right)/2} \right) $$
(9)

The weight is a function of the residual

$$ \alpha_{i,i + 1} = \left\{ {\begin{array}{*{20}c} {1,} & {r_{i,i + 1} \; < \;t_{0} } \\ {\frac{{t_{0} }}{{r_{i,i + 1} }},} & {otherwise} \\ \end{array} } \right. $$
(10)

\( t_{0} \) is the threshold. The solution to (8) can be found as the right unit singular vector corresponding to the smallest singular value of \( \varvec{Q}_{\varvec{N}} \).
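The construction of \( \varvec{Q}_{\varvec{N}} \) from Eqs. (7)-(8) and its SVD solution can be sketched as follows (quaternions in \([w, x, y, z]\) order, Hamilton convention; function names are ours):

```python
import numpy as np

def quat_L(q):  # left-multiplication matrix [q]_L, q = [w, x, y, z]
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w, -z,  y],
                     [y,  z,  w, -x],
                     [z, -y,  x,  w]])

def quat_R(q):  # right-multiplication matrix [q]_R
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w,  z, -y],
                     [y, -z,  w,  x],
                     [z,  y, -x,  w]])

def calibrate_extrinsic_rotation(q_B_pairs, q_C_pairs, weights=None):
    """Stack the weighted Q_{i,i+1} blocks of Eq. (8) and return the right
    singular vector of Q_N associated with its smallest singular value."""
    if weights is None:
        weights = np.ones(len(q_B_pairs))
    Q_N = np.vstack([a * (quat_L(qb) - quat_R(qc))
                     for a, qb, qc in zip(weights, q_B_pairs, q_C_pairs)])
    _, _, Vt = np.linalg.svd(Q_N)
    q_BC = Vt[-1]
    return q_BC / np.linalg.norm(q_BC)
```

The recovered quaternion is defined up to sign, as usual for unit-quaternion rotation representations.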

3.2 Gyroscope Bias Estimation

The gyroscope bias can be estimated from the rotation relationship between consecutive keyframes. This paper assumes that the gyroscope bias remains constant during the initialization stage, and the initial gyroscope bias \( \overline{{\varvec{b}_{\varvec{g}} }} \) is set to 0.

Substituting the extrinsic rotation matrix \( \widehat{\varvec{R}}_{{\varvec{BC}}} \) estimated in Sect. 3.1 into Eq. (4) gives

$$ \left( {\varvec{R}_{{\varvec{WC}_{\varvec{i}} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} } \right)^{T} \left( {\varvec{R}_{{\varvec{WC}_{{\varvec{i + 1}}} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} } \right) =\Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} Exp\left( {\varvec{J}_{{\Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} }}^{\varvec{g}} \cdot \delta \varvec{b}_{\varvec{g}}^{{}} } \right) $$
(11)

For all keyframes during initialization, the bias correction is estimated by minimizing

$$ \delta \varvec{b}_{\varvec{g}}^{\varvec{*}} \;{ = }\;\mathop {\arg \hbox{min} }\limits_{{\delta \varvec{b}_{\varvec{g}}^{{}} }} \sum\limits_{i = 1}^{N - 1} {\left\| {Log\left( {\left( {\Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} Exp\left( {\varvec{J}_{{\Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} }}^{\varvec{g}} \cdot \delta \varvec{b}_{\varvec{g}}^{{}} } \right)} \right)^{T} \widehat{\varvec{R}}_{{\varvec{BC}}} \varvec{R}_{{\varvec{C}_{\varvec{i}} \varvec{W}}} \varvec{R}_{{\varvec{WC}_{{\varvec{i + 1}}} }} \widehat{\varvec{R}}_{{\varvec{CB}}} } \right)} \right\|}^{2} $$
(12)

where \( \left\| \cdot \right\| \) is the L2-norm and \( Log\left( \cdot \right) \) is the inverse of \( Exp\left( \cdot \right) \). \( \varvec{R}_{{\varvec{WC}_{\varvec{i}} }} \) and \( \Delta \overline{\varvec{R}}_{{\varvec{B}_{\varvec{i}} \varvec{B}_{{\varvec{i + 1}}} }} \) are obtained from monocular camera pose tracking and from the preintegration of gyroscope measurements, respectively. This problem can be solved with the Gauss-Newton algorithm. The final estimated gyroscope bias is \( \hat{\varvec{b}}_{\varvec{g}} = \overline{{\varvec{b}_{\varvec{g}} }} + \delta \varvec{b}_{\varvec{g}}^{\varvec{*}} = \delta \varvec{b}_{\varvec{g}}^{\varvec{*}} \).
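Under a small-angle linearization of the residual, \( r_i(\delta b_g) \approx r_i(0) - J_i \,\delta b_g \), one Gauss-Newton step for Eq. (12) reduces to solving the normal equations. A sketch under that assumption (function names are ours; the residuals are evaluated at zero bias correction and the \( J_i \) are the preintegration Jacobians from [17]):

```python
import numpy as np

def gyro_bias_step(residuals, jacobians):
    """One Gauss-Newton step for the gyroscope-bias problem of Eq. (12).

    residuals : list of 3-vectors r_i evaluated at delta_bg = 0
    jacobians : list of 3x3 Jacobians J_i of r_i w.r.t. delta_bg
    Solves the normal equations (sum J^T J) * delta_bg = sum J^T r.
    """
    H = sum(J.T @ J for J in jacobians)
    g = sum(J.T @ r for J, r in zip(jacobians, residuals))
    return np.linalg.solve(H, g)
```

With exactly linear residuals the step recovers the true correction in one iteration; in practice a few iterations are run, re-evaluating the residuals each time.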

3.3 Scale, Gravity and Translation Approximation Without Accelerometer Bias

Once the gyroscope bias has been estimated, the preintegration terms can be rectified via Eq. (5), and the scale factor \( s \), gravity \( \varvec{g}^{\varvec{W}} \) and extrinsic translation \( \varvec{p}_{\varvec{B}}^{\varvec{C}} \) can be estimated approximately. Since gravity and the accelerometer bias are difficult to distinguish, the accelerometer bias is not considered at this stage, so \( \varvec{J}_{{\Delta \overline{\varvec{v}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} }}^{a} \) and \( \varvec{J}_{{\Delta \overline{\varvec{p}}_{{\varvec{B}_{\varvec{j}} }}^{{\varvec{B}_{\varvec{i}} }} }}^{a} \) are set to zero. Substituting Eq. (2) into the third equation of Eq. (4), the relationship between two consecutive keyframes is obtained,

$$ \begin{aligned} s \cdot \varvec{p}_{{\varvec{C}_{{\varvec{i + 1}}} }}^{\varvec{W}} \; = \; & s \cdot \varvec{p}_{{\varvec{C}_{\varvec{i}} }}^{\varvec{W}} + \varvec{v}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} \cdot \Delta t_{i,i + 1} + \frac{1}{2}\varvec{g}^{\varvec{W}} \cdot \Delta t_{i,i + 1}^{2} \\ \, & + \varvec{R}_{{\varvec{WC}_{\varvec{i}} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \Delta \overline{\varvec{p}}_{{\varvec{B}_{i + 1} }}^{{\varvec{B}_{\varvec{i}} }} \varvec{ + }\left( {\varvec{R}_{{\varvec{WC}_{\varvec{i}} }} - \varvec{R}_{{\varvec{WC}_{{\varvec{i + 1}}} }} } \right)\varvec{p}_{\varvec{B}}^{\varvec{C}} \\ \end{aligned} $$
(13)

The goal is to estimate \( s \), \( \varvec{g}^{\varvec{W}} \) and \( \varvec{p}_{\varvec{B}}^{\varvec{C}} \). Using the two relations between three consecutive keyframes to eliminate the velocities \( \varvec{v}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} \) leads to the following expression:

$$ \left[ {\begin{array}{*{20}c} {\varvec{\lambda}\left( i \right)} & {\varvec{\beta}\left( i \right)} & {\varvec{\varphi }\left( i \right)} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} s \\ {\varvec{g}^{\varvec{W}} } \\ {\varvec{p}_{\varvec{B}}^{\varvec{C}} } \\ \end{array} } \right] =\varvec{\gamma}\left( i \right) $$
(14)

Writing the subscripts \( i,\,i + 1,\;i + 2 \) as 1, 2, 3, we have:

$$ \varvec{\lambda}\left( i \right) = \left( {\varvec{p}_{{\varvec{C}_{3} }}^{\varvec{W}} - \varvec{p}_{{\varvec{C}_{2} }}^{\varvec{W}} } \right)\Delta t_{12} + \left( {\varvec{p}_{{\varvec{C}_{1} }}^{\varvec{W}} - \varvec{p}_{{\varvec{C}_{2} }}^{\varvec{W}} } \right)\Delta t_{23} $$
$$ \varvec{\beta}\left( i \right) = - \frac{1}{2}\left( {\Delta t_{12}^{2} \Delta t_{23} + \Delta t_{23}^{2} \Delta t_{12}^{{}} } \right){\mathbf{I}}_{3 \times 3} $$
$$ \varvec{\varphi }\left( i \right)\;{ = }\;\left( {\varvec{R}_{{\varvec{WC}_{3} }} { - }\varvec{R}_{{\varvec{WC}_{2} }} } \right)\Delta t_{12} + \left( {\varvec{R}_{{\varvec{WC}_{1} }} { - }\varvec{R}_{{\varvec{WC}_{2} }} } \right)\Delta t_{23} $$
$$ \begin{aligned}\varvec{\gamma}\left( i \right)\; = \; & \varvec{R}_{{\varvec{WC}_{1} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \Delta \overline{\varvec{v}}_{{\varvec{B}_{2} }}^{{\varvec{B}_{1} }} \Delta t_{12} \Delta t_{23} + \varvec{R}_{{\varvec{WC}_{2} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \Delta \overline{\varvec{p}}_{{\varvec{B}_{3} }}^{{\varvec{B}_{2} }} \Delta t_{12} \\ \, & - \varvec{R}_{{\varvec{WC}_{1} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \Delta \overline{\varvec{p}}_{{\varvec{B}_{2} }}^{{\varvec{B}_{1} }} \Delta t_{23} \\ \end{aligned} $$
(15)

For \( N \) consecutive keyframes, a linear over-determined system \( \varvec{A}_{3(N - 2) \times 7} \cdot \varvec{x}_{7 \times 1} = \varvec{B}_{3(N - 2) \times 1} \) can be stacked, and \( \hat{s},\,\hat{\varvec{g}}^{\varvec{W}} ,\,\hat{\varvec{p}}_{\varvec{B}}^{\varvec{C}} \) can be solved by SVD. Note that there are \( 3(N - 2) \) linear constraints and 7 unknowns, so at least 5 keyframes are required to solve the system.
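Stacking Eq. (14) over keyframe triplets and solving in the least-squares sense can be sketched as follows (here `np.linalg.lstsq`, which is itself SVD-based, stands in for an explicit SVD solve; function names are ours):

```python
import numpy as np

def solve_scale_gravity_translation(lams, betas, phis, gammas):
    """Stack Eq. (14) over keyframe triplets and solve A x = B by least squares.

    lams   : list of 3-vectors  lambda(i)  (coefficient of the scalar s)
    betas  : list of 3x3 blocks beta(i)    (coefficient of g^W)
    phis   : list of 3x3 blocks phi(i)     (coefficient of p_B^C)
    gammas : list of 3-vectors  gamma(i)   (right-hand sides)
    Returns (s, g_W, p_B_C). With 7 unknowns and 3 rows per triplet,
    at least 3 triplets (i.e. 5 keyframes) are needed.
    """
    A = np.vstack([np.hstack([l.reshape(3, 1), b, f])
                   for l, b, f in zip(lams, betas, phis)])
    B = np.concatenate(gammas)
    x, *_ = np.linalg.lstsq(A, B, rcond=None)
    return x[0], x[1:4], x[4:7]
```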

3.4 Accelerometer Bias Estimation, and Scale, Gravity and Translation Refinement

Assume the gravity \( \varvec{g}^{\varvec{E}} \) in the inertial frame \( E \) is known, with magnitude \( G = 9.8 \) and direction \( \bar{\varvec{g}}^{\varvec{E}} = \left( {0,0, - 1} \right) \). Stipulating that the earth’s inertial frame \( E \) shares its origin with the world frame \( W \), the rotation \( \varvec{R}_{{\varvec{WE}}} \) between them can be computed as follows (see Fig. 2):

Fig. 2. Diagram of the gravity acceleration direction angle.

$$ \varvec{R}_{{\varvec{WE}}} = Exp\left( {\theta\varvec{\beta}} \right) $$
$$ \varvec{\beta = }\frac{{\bar{\varvec{g}}^{\varvec{E}} \times \bar{\varvec{g}}^{\varvec{W}} }}{{\left\| {\bar{\varvec{g}}^{\varvec{E}} \times \bar{\varvec{g}}^{\varvec{W}} } \right\|}},\theta = \text{atan} 2\left( {\left\| {\bar{\varvec{g}}^{\varvec{E}} \times \bar{\varvec{g}}^{\varvec{W}} } \right\|,\bar{\varvec{g}}^{\varvec{E}} \cdot \bar{\varvec{g}}^{\varvec{W}} } \right) $$
(16)

where \( \bar{\varvec{g}}^{\varvec{W}} = \hat{\varvec{g}}^{\varvec{W}} /\left\| {\hat{\varvec{g}}^{\varvec{W}} } \right\| \) is the gravity direction in the world frame \( W \) estimated in Sect. 3.3. This rotation can be refined by appending a perturbation \( \delta\varvec{\theta}\in {\mathbb{R}}^{3 \times 1} \):

$$ \varvec{g}^{\varvec{W}} = \varvec{R}_{{\varvec{WE}}} \cdot Exp(\delta\varvec{\theta})\varvec{g}^{\varvec{E}} \approx \varvec{R}_{{\varvec{WE}}} \cdot \varvec{g}^{\varvec{E}} - \varvec{R}_{{\varvec{WE}}} (\varvec{g}^{\varvec{E}} )^{ \wedge } \delta\varvec{\theta} $$
$$ \delta \varvec{\theta = }\left[ {\delta\varvec{\theta}_{{\varvec{xy}}}^{T} ,0} \right]^{T} ,\delta\varvec{\theta}_{{\varvec{xy}}}^{{}} = \left[ {\delta\varvec{\theta}_{\varvec{x}}^{{}} ,\delta\varvec{\theta}_{\varvec{y}}^{{}} } \right]^{T} $$
(17)
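The rotation of Eq. (16) can be computed directly from the two gravity directions; a NumPy sketch (function name is ours; the degenerate anti-parallel case, where the axis is undefined, is flagged rather than handled):

```python
import numpy as np

def rotation_world_to_inertial(g_W_hat):
    """Eq. (16): rotation R_WE aligning the inertial gravity direction
    (0, 0, -1) with the estimated world-frame gravity direction."""
    g_E = np.array([0.0, 0.0, -1.0])
    g_W = np.asarray(g_W_hat, dtype=float)
    g_W = g_W / np.linalg.norm(g_W)
    axis = np.cross(g_E, g_W)                 # beta (unnormalized) in Eq. (16)
    n = np.linalg.norm(axis)
    theta = np.arctan2(n, g_E @ g_W)          # atan2(|u x v|, u . v)
    if n < 1e-10:
        if g_E @ g_W > 0:
            return np.eye(3)                  # already aligned
        raise ValueError("anti-parallel gravity directions: axis undefined")
    a = axis / n
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    # Exp(theta * beta) via Rodrigues' formula
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
```

By construction, \( \varvec{R}_{{\varvec{WE}}} \) maps \( \bar{\varvec{g}}^{\varvec{E}} \) onto \( \bar{\varvec{g}}^{\varvec{W}} \).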

Substituting Eq. (17) into Eq. (13) and including the effect of accelerometer bias, we have:

$$ \begin{aligned} s \cdot \varvec{p}_{{\varvec{C}_{{\varvec{i + 1}}} }}^{\varvec{W}} = & s \cdot \varvec{p}_{{\varvec{C}_{\varvec{i}} }}^{\varvec{W}} + \varvec{v}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} \Delta t_{i,i + 1} - \frac{1}{2}\varvec{R}_{{\varvec{WE}}} (\varvec{g}^{\varvec{E}} )^{ \wedge } \delta\varvec{\theta}\cdot \Delta t_{i,i + 1}^{2} \\ \, & + \varvec{R}_{{\varvec{WC}_{\varvec{i}} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \left( {\Delta \overline{\varvec{p}}_{{\varvec{B}_{i + 1} }}^{{\varvec{B}_{\varvec{i}} }} { + }\varvec{J}_{{\Delta \overline{\varvec{p}}_{{B_{i + 1} }}^{{B_{i} }} }}^{a} \cdot \delta \varvec{b}_{a} } \right)\varvec{ + }\left( {\varvec{R}_{{\varvec{WC}_{\varvec{i}} }} - \varvec{R}_{{\varvec{WC}_{{\varvec{i + 1}}} }} } \right)\varvec{p}_{\varvec{B}}^{\varvec{C}} \\ \, & + \frac{1}{2}\varvec{R}_{{\varvec{WE}}} \cdot \varvec{g}^{\varvec{E}} \cdot \Delta t_{i,i + 1}^{2} \\ \end{aligned} $$
(18)

Considering the constraints between three consecutive keyframes as well, we can construct the following linear equations:

$$ \left[ {\begin{array}{*{20}c} {\varvec{\lambda}\left( i \right)} & {\varvec{\alpha}\left( i \right)} & {\varvec{\phi}\left( i \right)} & {\varvec{\varphi }\left( i \right)} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} s \\ {\delta\varvec{\theta}_{{\varvec{xy}}}^{{}} } \\ {\delta \varvec{b}_{a}^{{}} } \\ {\varvec{p}_{\varvec{B}}^{\varvec{C}} } \\ \end{array} } \right] =\varvec{\chi}\left( i \right) $$
(19)

where \( \varvec{\lambda}\left( i \right) \) and \( \varvec{\varphi }\left( i \right) \) remain the same as in Eq. (15), and \( \varvec{\alpha}\left( i \right),\varvec{\phi}\left( i \right),\varvec{\chi}\left( i \right) \) are computed as follows:

$$ \varvec{\alpha}\left( i \right)\;{ = }\;\left[ {\frac{1}{2}\varvec{R}_{{\varvec{WE}}} (\varvec{g}^{\varvec{E}} )^{ \wedge } \left( {\Delta t_{12}^{2} \Delta t_{23} + \Delta t_{23}^{2} \Delta t_{12}^{{}} } \right)} \right]_{{\left( {:,1:2} \right)}} $$
$$ \begin{aligned}\varvec{\phi}\left( i \right) & = \varvec{R}_{{\varvec{WC}_{1} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \varvec{J}_{{\Delta \overline{\varvec{p}}_{{B_{2} }}^{{B_{1} }} }}^{a} \Delta t_{23} - \varvec{R}_{{\varvec{WC}_{2} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \varvec{J}_{{\Delta \overline{\varvec{p}}_{{B_{3} }}^{{B_{2} }} }}^{a} \Delta t_{12} \\ & - \varvec{R}_{{\varvec{WC}_{1} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \varvec{J}_{{\Delta \overline{v}_{{B_{2} }}^{{B_{1} }} }}^{a} \Delta t_{12} \Delta t_{23} \\ \end{aligned} $$
$$ \begin{aligned}\varvec{\chi}\left( i \right) & \;{ = }\;\varvec{R}_{{\varvec{WC}_{2} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \Delta \overline{\varvec{p}}_{{\varvec{B}_{3} }}^{{\varvec{B}_{2} }} \Delta t_{12} { + }\varvec{R}_{{\varvec{WC}_{1} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \Delta \overline{\varvec{v}}_{{\varvec{B}_{2} }}^{{\varvec{B}_{1} }} \Delta t_{12} \Delta t_{23} \\ & \,{ + }\frac{1}{2}\varvec{R}_{{\varvec{WE}}} \cdot \varvec{g}^{\varvec{E}} \left( {\Delta t_{12}^{2} \Delta t_{23}^{{}} { + }\Delta t_{12}^{{}} \Delta t_{23}^{2} } \right) - \varvec{R}_{{\varvec{WC}_{1} }} \cdot \widehat{\varvec{R}}_{{\varvec{CB}}} \cdot \Delta \overline{\varvec{p}}_{{\varvec{B}_{2} }}^{{\varvec{B}_{1} }} \Delta t_{23} \\ \end{aligned} $$
(20)

\( \left[ \cdot \right]_{{\left( {:,1:2} \right)}} \) in \( \varvec{\alpha}\left( i \right) \) denotes the first two columns of the matrix.

Similar to the above, a linear over-determined system \( \varvec{A}_{3(N - 2) \times 9} \cdot \varvec{x}_{9 \times 1} = \varvec{B}_{3(N - 2) \times 1} \) can be constructed to compute \( \delta \varvec{b}_{a}^{ *} ,\;s^{ *} ,\;\delta\varvec{\theta}_{{\varvec{xy}}}^{ *} \) and \( \varvec{p}_{\varvec{B}}^{{\varvec{C} *}} \). Since the initial accelerometer bias is also set to zero, the final estimated accelerometer bias is \( \hat{\varvec{b}}_{a} = \overline{{\varvec{b}_{a} }} + \delta \varvec{b}_{a}^{\varvec{*}} = \delta \varvec{b}_{a}^{\varvec{*}} \). Moreover, the gravity in the world frame is adjusted by incorporating the perturbation, i.e. \( \varvec{g}^{{\varvec{W} *}} = \varvec{R}_{{\varvec{WE}}} \cdot Exp(\delta\varvec{\theta}^{ *} )\varvec{g}^{\varvec{E}} \).

4 Camera-IMU Extrinsic Online Estimation

Through the initialization process with camera-IMU extrinsic calibration, accurate extrinsic parameters can be estimated. However, during the motion of the system, the mechanical configuration of the sensor suite may change slightly. Fixed camera-IMU extrinsic parameters can hardly track this change, which introduces systematic error and degrades the tracking accuracy and robustness of the system. We therefore put the camera-IMU extrinsic parameters into the state vector for online estimation.

4.1 States and Factor Graph Representation

During the motion of the VI-SLAM system, the states to be estimated at each frame include the pose, velocity and IMU biases of the sensor suite. On this basis, the camera-IMU extrinsic parameters are also put into the state vector for online estimation. Defining the IMU body frame as the frame to be estimated, the states estimated at each image frame are \( \left\{ {\varvec{p}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} ,\varvec{R}_{{\varvec{WB}_{i} }} ,\varvec{v}_{{\varvec{B}_{i} }}^{\varvec{W}} ,\varvec{b}_{\varvec{g}}^{i} ,\varvec{b}_{\varvec{a}}^{i} ,\varvec{R}_{{\varvec{CB}}} ,\varvec{p}_{\varvec{B}}^{\varvec{C}} } \right\} \). In addition, the location of the \( k \)th landmark point \( \varvec{l}_{k}^{W} \in {\mathbb{R}}^{3} \) is also included in the states. A factor graph is used to describe the constraint relationships between these states; the resulting representation of the VI-SLAM system with online camera-IMU extrinsic estimation is shown in Fig. 3.

As shown in Fig. 3, the circles represent the variables to be estimated, and the squares are the factors. There are three kinds of constraints in the VI-SLAM system:

Fig. 3. Factor graph of the VI-SLAM system.

  1. Each image pose, the camera-IMU extrinsic parameters and each landmark point are linked by a graph feature position observation constraint;

  2. The poses, velocities and IMU biases of consecutive frames are linked by the preintegration constraint of the IMU measurements;

  3. The IMU biases of consecutive frames are linked by the IMU bias random-walk constraint.

Therefore, the camera-IMU extrinsic parameters are constrained by the graph feature position observation constraint. To estimate them online, it is necessary to construct a nonlinear estimation function for this constraint.

4.2 Graph Feature Constraint and Its Jacobian

Assume the states of frame \( i \) are \( \left\{ {\varvec{p}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} ,\varvec{R}_{{\varvec{WB}_{i} }} ,\varvec{v}_{{\varvec{B}_{i} }}^{\varvec{W}} ,\varvec{b}_{\varvec{g}}^{i} ,\varvec{b}_{\varvec{a}}^{i} ,\varvec{R}_{{\varvec{CB}}} ,\varvec{p}_{\varvec{B}}^{\varvec{C}} } \right\} \), and the position of landmark point \( k \) in the world frame is \( \varvec{l}_{k}^{\varvec{W}} \). The feature position of landmark point \( k \) observed in frame \( i \) is \( \hat{\varvec{p}}_{i,k} \), with an uncertainty of one pixel. The pinhole projection model projects the landmark point \( \varvec{l}_{k}^{\varvec{W}} \) to the pixel \( \varvec{p}_{i,k} \).

So the reprojection error \( \varvec{e}_{i,k}^{{}} \) of the graph feature position constraint is:

$$ \begin{aligned} \varvec{e}_{i,k} & = \hat{\varvec{p}}_{i,k} - \frac{1}{{z_{C} }}\varvec{K}\left[ {\varvec{I}_{3} \; 0_{3 \times 1} } \right]\left[ {\begin{array}{*{20}c} {\varvec{l}_{k}^{C} } \\ 1 \\ \end{array} } \right] \\ & = \hat{\varvec{p}}_{i,k} - \frac{1}{{z_{C} }}\varvec{K}\left[ {\varvec{I}_{3} \; 0_{3 \times 1} } \right]\varvec{T}_{CB} \left[ {\begin{array}{*{20}c} {\varvec{R}_{{\varvec{WB}_{i} }} } & {\varvec{p}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} } \\ 0 & 1 \\ \end{array} } \right]^{ - 1} \left[ {\begin{array}{*{20}c} {\varvec{l}_{k}^{W} } \\ 1 \\ \end{array} } \right] \\ \end{aligned} $$
(21)

\( \varvec{K}^{2 \times 3} \) is the camera intrinsic matrix, and \( x_{C}, y_{C}, z_{C} \) are the three components of \( \varvec{l}_{k}^{C} \) (landmark \( k \)'s position in the camera frame \( C \)). \( \varvec{l}_{k}^{C} \) is computed as:

$$ \varvec{l}_{k}^{C} = \varvec{R}_{CB} \varvec{R}_{{WB_{i} }}^{ - 1} \left( {\varvec{l}_{k}^{W} - \varvec{p}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} } \right) + \varvec{p}_{\varvec{B}}^{\varvec{C}} $$
(22)
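A minimal numeric sketch of Eqs. (21)-(22), using placeholder intrinsics and poses (not the EuRoC calibration): transform the landmark into the camera frame, then project it with the pinhole model.

```python
import numpy as np

def landmark_in_camera(R_CB, p_B_C, R_WB, p_B_W, l_W):
    """Eq. (22): l^C = R_CB * R_WB^{-1} * (l^W - p_B^W) + p_B^C."""
    return R_CB @ R_WB.T @ (l_W - p_B_W) + p_B_C

def project(K, l_C):
    """Pinhole projection of Eq. (21): divide by the depth z_C, apply K."""
    x, y, z = l_C
    return np.array([K[0, 0] * x / z + K[0, 2],
                     K[1, 1] * y / z + K[1, 2]])

# With identity rotations, zero translations and made-up intrinsics:
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0]])   # K written as a 2x3 matrix, as in the text
l_C = landmark_in_camera(np.eye(3), np.zeros(3), np.eye(3), np.zeros(3),
                         np.array([1.0, 2.0, 5.0]))
print(project(K, l_C))               # → [420. 440.]
```

The reprojection error of Eq. (21) is then simply the observed pixel minus this projection.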

According to Eq. (21), the reprojection error \( \varvec{e}_{i,k}^{{}} \)’s Jacobian w.r.t. \( \varvec{l}_{k}^{C} \) is:

$$ \frac{{\partial \varvec{e}_{i,k}^{{}} }}{{\partial \varvec{l}_{k}^{C} }}{ = } - \left[ {\begin{array}{*{20}c} {\frac{{f_{x} }}{{z_{C} }}} & 0 & { - \frac{{f_{x} x_{C} }}{{z_{C}^{2} }}} \\ 0 & {\frac{{f_{y} }}{{z_{C} }}} & { - \frac{{f_{y} y_{C} }}{{z_{C}^{2} }}} \\ \end{array} } \right] $$
(23)

\( f_{x}, f_{y}, c_{x}, c_{y} \) are the parameters of the intrinsic matrix \( \varvec{K} \).
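As a sanity check (placeholder values, helper names our own), Eq. (23) can be verified against central finite differences of the projection; the observed pixel is constant, so the derivative of the error equals the derivative of the negated projection.

```python
import numpy as np

def jac_e_wrt_lC(K, l_C):
    """Eq. (23): Jacobian of the reprojection error e_{i,k} w.r.t. l^C."""
    fx, fy = K[0, 0], K[1, 1]
    x, y, z = l_C
    return -np.array([[fx / z, 0.0, -fx * x / z**2],
                      [0.0, fy / z, -fy * y / z**2]])

def neg_project(K, l_C):
    # e = p_hat - project(l_C); p_hat is constant, so de/dl = d(-project)/dl.
    x, y, z = l_C
    return -np.array([K[0, 0] * x / z + K[0, 2],
                      K[1, 1] * y / z + K[1, 2]])

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0]])
l_C = np.array([1.0, 2.0, 5.0])
eps = 1e-6
J_num = np.column_stack([
    (neg_project(K, l_C + eps * e) - neg_project(K, l_C - eps * e)) / (2 * eps)
    for e in np.eye(3)])
assert np.allclose(jac_e_wrt_lC(K, l_C), J_num, atol=1e-4)
```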

Applying the chain rule, the Jacobian of the reprojection error \( \varvec{e}_{i,k}^{{}} \) w.r.t. the camera-IMU extrinsic translation is:

$$ \begin{aligned} \frac{{\partial \varvec{e}_{i,k}^{{}} }}{{\partial \delta \varvec{p}}} & = \frac{{\partial \varvec{e}_{i,k}^{{}} }}{{\partial \varvec{l}_{k}^{C} }} \cdot \frac{{\partial \varvec{l}_{k}^{C} }}{{\partial \delta \varvec{p}}} \\ & = - \left[ {\begin{array}{*{20}c} {\frac{{f_{x} }}{{z_{C} }}} & 0 & { - \frac{{f_{x} x_{C} }}{{z_{C}^{2} }}} \\ 0 & {\frac{{f_{y} }}{{z_{C} }}} & { - \frac{{f_{y} y_{C} }}{{z_{C}^{2} }}} \\ \end{array} } \right]\varvec{R}_{CB} \\ \end{aligned} $$
(24)

Similarly, the Jacobian of the reprojection error \( \varvec{e}_{i,k}^{{}} \) w.r.t. the camera-IMU extrinsic rotation is:

$$ \begin{aligned} \frac{{\partial \varvec{e}_{i,k}^{{}} }}{{\partial \delta\varvec{\theta}}} & = \frac{{\partial \varvec{e}_{i,k}^{{}} }}{{\partial \varvec{l}_{k}^{C} }} \cdot \frac{{\partial \varvec{l}_{k}^{C} }}{{\partial \delta\varvec{\theta}}} \\ & = \left[ {\begin{array}{*{20}c} {\frac{{f_{x} }}{{z_{C} }}} & 0 & { - \frac{{f_{x} x_{C} }}{{z_{C}^{2} }}} \\ 0 & {\frac{{f_{y} }}{{z_{C} }}} & { - \frac{{f_{y} y_{C} }}{{z_{C}^{2} }}} \\ \end{array} } \right]\left[ {\varvec{R}_{CB} \varvec{R}_{{WB_{i} }}^{ - 1} \left( {\varvec{l}_{k}^{W} - \varvec{p}_{{\varvec{B}_{\varvec{i}} }}^{\varvec{W}} } \right)} \right]^{ \wedge } \varvec{R}_{CB} \\ \end{aligned} $$
(25)
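Equation (25) can likewise be checked numerically. The sketch below assumes a right multiplicative perturbation \( \varvec{R}_{CB} \leftarrow \varvec{R}_{CB} \exp (\delta\varvec{\theta}^{\wedge}) \), which is our reading of the paper's convention (it reproduces Eq. (25) exactly); all numeric values are placeholders.

```python
import numpy as np

def skew(v):
    """The ^ operator of Eq. (25): 3-vector -> skew-symmetric matrix."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(theta):
    """Rodrigues' formula: axis-angle 3-vector -> rotation matrix."""
    angle = np.linalg.norm(theta)
    if angle < 1e-12:
        return np.eye(3) + skew(theta)
    a = theta / angle
    return (np.cos(angle) * np.eye(3) + np.sin(angle) * skew(a)
            + (1.0 - np.cos(angle)) * np.outer(a, a))

def jac_e_wrt_dtheta(K, R_CB, R_WB, p_B_W, p_B_C, l_W):
    """Eq. (25), assuming the right perturbation R_CB * Exp(dtheta^)."""
    a = R_WB.T @ (l_W - p_B_W)              # landmark in the body frame
    x, y, z = R_CB @ a + p_B_C              # components of l^C, Eq. (22)
    fx, fy = K[0, 0], K[1, 1]
    J_proj = np.array([[fx / z, 0.0, -fx * x / z**2],
                       [0.0, fy / z, -fy * y / z**2]])
    return J_proj @ skew(R_CB @ a) @ R_CB
```

A finite-difference check of this Jacobian (perturbing `R_CB` by `exp_so3(dtheta)` and differencing the projection) agrees with the analytic form, which is how we confirmed the perturbation convention.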

With this residual and its Jacobians, the camera-IMU extrinsic parameters can be estimated online.

5 Experimental Evaluation

In this section, the performance of our VI-SLAM algorithm with camera-IMU extrinsic calibration and online estimation is evaluated on the EuRoC dataset, which provides accurate position ground-truth and reference camera-IMU extrinsic parameters. Its eleven sequences, recorded with a Micro Aerial Vehicle (MAV), are divided into three difficulty levels (easy, medium and difficult) according to MAV speed, illumination, image blur and environment texture. All experiments are carried out on a laptop with an Intel i7-5500U CPU (3.0 GHz) and 4 GB RAM.

5.1 Implementation Details

The extrinsic rotation calibration is placed in the Tracking thread of ORB-SLAM, because the rotation is easily excited by motion. The rest of the initialization method is implemented in the Local Mapping thread, between the Local BA module and the Local Keyframes Culling module. Since there are not enough observations to constrain the camera-IMU extrinsic parameters in the Tracking thread, the online extrinsic estimation is executed in the Local BA module. Because a convergence criterion is set only for the extrinsic rotation calibration, the duration of the remaining initialization steps is fixed to 23 s.

5.2 Initialization Results

Under the experimental conditions of this paper, the initialization and camera-IMU extrinsic estimation results are evaluated on the V2_01_easy sequence of the EuRoC dataset. Figures 4 and 5 show the process of the camera-IMU extrinsic rotation calibration. The singular values of Eq. (8) grow over time (Fig. 4); when the second smallest singular value \( \sigma_{2} \) reaches the threshold \( \sigma_{thr}^{{}} \;{ = }\;0.25 \), the calibration is considered converged. Figure 5 shows the convergence of the yaw, pitch and roll angles to their benchmark values \( [89.147953^{ \circ } ,\;1.476930^{ \circ } ,\;0.215286^{ \circ } ] \).
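The convergence test can be sketched as follows (a hypothetical sketch: Eq. (8) itself is not reproduced here, and `A` stands for the stacked linear system of the rotation calibration):

```python
import numpy as np

def rotation_calib_converged(A, sigma_thr=0.25):
    """Stop the rotation calibration once the second smallest singular
    value of the stacked system A exceeds the threshold sigma_thr."""
    sigma = np.linalg.svd(A, compute_uv=False)  # sorted in descending order
    return sigma[-2] > sigma_thr
```

With more motion the stacked system gains rank, so \( \sigma_{2} \) grows with time and eventually triggers convergence, matching the behavior seen in Fig. 4.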

Fig. 4.

Singular values over time.

Fig. 5.

Process of extrinsic rotation calibration.

The convergence of the camera-IMU extrinsic translation, gyroscope bias, accelerometer bias, scale factor and gravity in the world frame is shown in Figs. 6, 7 and 8. Each state quantity begins to converge between 5 s and 10 s after the start. The calibrated extrinsic translation deviates by a few centimeters on each axis from its reference value \( [ - 0.021, - 0.064,0.009] \)m.

Fig. 6.

Process of extrinsic translation calibration.

Fig. 7.

Process of gyroscope bias and accelerometer bias calibration.

Fig. 8.

Process of gravity and scale factor calibration.

5.3 Extrinsic Estimation Results

Figures 9 and 10 illustrate the online estimation process of the camera-IMU extrinsic parameters. During the online estimation process, the deviation of the extrinsic rotation in three axial directions fluctuates within 0.5° and the deviation of the extrinsic translation in three axial directions fluctuates within 0.02 m.

Fig. 9.

Process of extrinsic rotation online estimation.

Fig. 10.

Process of extrinsic translation online estimation.

Using the V2_01_easy sequence, ten trials of our method and of the linear and nonlinear stages of VINS-Mono are compared in Fig. 11. The extrinsic rotation calibrated by our method and by VINS-Mono's linear stage deviates considerably from the benchmark, because neither takes the gyroscope bias into account at this stage. However, our calibrated extrinsic translation is more consistent and accurate than VINS-Mono's linear estimate, since we consider the influence of the IMU bias. After our online estimation of the extrinsic parameters and VINS-Mono's nonlinear optimization, both the camera-IMU extrinsic rotation and translation reach high precision and consistency. The estimation errors of the camera-IMU extrinsic rotation and translation obtained by our method are within 0.5° and 0.02 m, respectively.

Fig. 11.

Comparison of extrinsic calibration results over ten trials.

6 Conclusion

In this paper, we propose a VI-SLAM algorithm with automatic camera-IMU extrinsic calibration and online estimation that requires no prior knowledge of the mechanical configuration of the sensor suite. Compared to VIORB, which needs precise prior extrinsic calibration, our method is a plug-and-play solution for mobile robots. Experiments on the EuRoC dataset verify the performance of the proposed algorithm in camera-IMU extrinsic calibration and online estimation: the errors of the estimated extrinsic rotation and translation are within 0.5° and 0.02 m, respectively. Compared to VINS-Mono, our method achieves comparable or higher precision and consistency. A limitation of our method is the long time (about 35 s) needed for the extrinsic calibration to converge; to overcome this, we plan to introduce convergence criteria for the remaining initialization parameters in future work.