1 Introduction

The visual simultaneous localization and mapping (V-SLAM) system adopts cameras (usually with inertial measurement units (IMUs)) to simultaneously infer its own state (e.g. pose) and build a consistent map of its surroundings [1]. V-SLAM systems are widely used by autonomous mobile robots, especially resource-limited agile micro drones [2], augmented/mixed/virtual reality devices [3], as well as spatial AI applications [4, 5]. Recently, driven by the surge of deep learning (DL) techniques [6, 7], some end-to-end DL-based V-SLAM or visual odometry (VO) systems have been proposed [8,9,10,11]. In this paper, we focus on the traditional de-facto framework of V-SLAM systems with standard cameras, as shown in Fig. 1, which contains two modules: a front-end and a back-end. The front-end is responsible for processing visual and IMU data, including feature extraction, short/long-term data association with outlier rejection, as well as initial estimation of the current pose and newly detected landmark positions. The back-end takes the initialization information from the front-end and performs a maximum-a-posteriori (MAP) estimation based on a factor graph to refine both poses and landmark positions [12]. This paper focuses on the new trend in front-end techniques of V-SLAM systems.
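As a toy illustration of the back-end's role (refining variables by local least-squares optimization; here a plain Gauss-Newton loop rather than a full factor graph), the sketch below refines a single 2D landmark position from noisy range measurements taken at known poses. All poses and measurement values are invented for illustration only.

```python
import numpy as np

# Known camera centers (the front-end would supply these as initial poses).
poses = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
landmark_true = np.array([2.0, 2.0])

# Simulated measurements: ranges from each pose to the landmark, with noise.
rng = np.random.default_rng(0)
z = np.linalg.norm(landmark_true - poses, axis=1) + rng.normal(0, 0.01, 3)

x = np.array([1.0, 1.0])  # initial guess (from the front-end, in practice)
for _ in range(10):
    diff = x - poses                      # (3, 2) offsets to each pose
    pred = np.linalg.norm(diff, axis=1)   # predicted ranges
    r = pred - z                          # residuals
    J = diff / pred[:, None]              # Jacobian of range w.r.t. x
    # Gauss-Newton step: solve the normal equations J^T J dx = -J^T r.
    dx = np.linalg.solve(J.T @ J, -J.T @ r)
    x += dx

print(x)  # converges close to landmark_true
```

A real back-end solves the same kind of nonlinear least-squares problem, but jointly over all poses and landmarks connected in the factor graph.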

Fig. 1.

The de-facto framework of a V-SLAM system.

2 Hand-Engineered Features

Traditionally, visual features have mainly been restricted to salient and repeatable points (called keypoints), due to the unreliable extraction of high-level geometric features (e.g. lines or edges) with non-learning methods. Besides keypoints, dense methods using all pixel information (e.g. optical flow [13] or correspondence-free methods [14]) have also been adopted to infer ego-motion under a small-motion assumption. In the early research on stereo VO for Mars rovers [15,16,17,18], keypoints were tracked among successive images from nearby viewpoints. In his landmark paper [19], Nister proposed extracting keypoints independently in all images and then matching them. This method then became the dominant approach because it works reliably even under large motion/viewpoint changes [20].

Keypoint-based feature extraction includes two stages: keypoint detection and keypoint description. The keypoint detectors can be divided into two categories: corner detectors and blob detectors [21]. Traditional keypoint extractors are hand-engineered: corner detectors include Moravec [22], Forstner [23], Harris [24], Shi-Tomasi [25], and FAST/FASTER [26, 27]; blob detectors include SIFT [28], SURF [29], CenSurE [30], RootSIFT [31], and KAZE [32, 33]; keypoint descriptors include SSD/NCC [34], census transform [35], SIFT [28], GLOH [36], SURF [29], DAISY [37], BRIEF [38], ORB [39], BRISK [40], and KAZE [32, 33].
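As a concrete example of a hand-engineered corner detector, the sketch below computes the Harris response [24] with plain NumPy on a synthetic image. The 3x3 smoothing window and k = 0.04 are common textbook choices, not values prescribed by this survey; production systems use optimized implementations of such detectors.

```python
import numpy as np

def harris_response(img, k=0.04):
    # Image gradients via central differences (axis 0 = rows = y).
    Iy, Ix = np.gradient(img.astype(float))

    # Structure tensor entries, smoothed with a 3x3 box filter.
    def box(a):
        p = np.pad(a, 1, mode="edge")
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0

    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    # High positive response at corners, negative along edges.
    return det - k * trace ** 2

# Synthetic image: a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
corner = np.unravel_index(np.argmax(R), R.shape)
print(corner)  # lands near one of the square's four corners
```

The response is positive only where the structure tensor has two large eigenvalues, which is exactly the "cornerness" criterion shared by the corner detectors listed above.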

Data association (also called feature matching) is commonly conducted by comparing similarity measurements between keypoint descriptors along with a mutual consistency check procedure [21]. With prior knowledge of motion constraints (e.g. from IMU sensors or constant velocity assumptions) or stereo epipolar line constraints, the time used for data association can be shortened by restricting the search space [17, 41, 42]. Due to visual aliasing, wrong data associations (called outliers) are unavoidable in both short-term feature matching and long-term loop closure. Therefore, consensus set search (e.g. RANSAC) [43,44,45,46,47,48] and geometric information (i.e. pose and map estimates) from the back-end are usually adopted to remove outliers.
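The descriptor matching with a mutual consistency check described above can be sketched as follows. The random 256-bit binary descriptors stand in for e.g. BRIEF/ORB outputs, and the 5% bit-noise level is an arbitrary illustrative choice.

```python
import numpy as np

def hamming(A, B):
    # Pairwise Hamming distances between two sets of binary descriptors.
    return (A[:, None, :] != B[None, :, :]).sum(axis=2)

rng = np.random.default_rng(1)

# Image 1 descriptors; image 2 reuses them with bit noise and a shuffle,
# simulating the same keypoints seen from a slightly different view.
desc1 = rng.integers(0, 2, size=(20, 256), dtype=np.uint8)
noisy = np.where(rng.random(desc1.shape) < 0.05, 1 - desc1, desc1)
perm = rng.permutation(20)
desc2 = noisy[perm]

D = hamming(desc1, desc2)
fwd = D.argmin(axis=1)  # best match in image 2 for each image-1 descriptor
bwd = D.argmin(axis=0)  # best match in image 1 for each image-2 descriptor

# Mutual consistency check: keep (i, j) only if i and j choose each other.
matches = [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
print(len(matches))  # all 20 true pairs survive despite the bit noise
```

In a real pipeline the surviving matches would still be filtered geometrically (e.g. by RANSAC) to remove outliers caused by visual aliasing.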

Since the traditional back-end of V-SLAM systems relies on local iterative optimization algorithms (e.g. Gauss-Newton) [12], a fairly good initialization of the variables (e.g. 6-DoF poses and 3D coordinates of keypoints) is required. The 6-DoF poses can be estimated from keypoints via multiview geometry [18, 34, 49]. The 3D coordinates of keypoints are obtained by triangulation with careful keyframe selection [50, 51].
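The triangulation step can be sketched with the standard linear (DLT) method: each view contributes two linear constraints on the homogeneous 3D point, and the stacked system is solved by SVD. The intrinsics, poses, and point below are arbitrary illustrative values.

```python
import numpy as np

K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])  # pinhole intrinsics (illustrative)

# Camera 1 at the origin; camera 2 translated 1 m along x (stereo-like).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])

X_true = np.array([0.3, -0.2, 4.0, 1.0])  # homogeneous 3D point

def project(P, X):
    x = P @ X
    return x[:2] / x[2]

u1, u2 = project(P1, X_true), project(P2, X_true)

# DLT: stack two constraints per view (u * P_row3 - P_row1/2) and
# take the null vector of A as the triangulated point.
A = np.vstack([
    u1[0] * P1[2] - P1[0],
    u1[1] * P1[2] - P1[1],
    u2[0] * P2[2] - P2[0],
    u2[1] * P2[2] - P2[1],
])
X = np.linalg.svd(A)[2][-1]
X = X / X[3]
print(X[:3])  # recovers [0.3, -0.2, 4.0] in the noise-free case
```

With noisy observations the DLT solution is only an algebraic estimate; it serves as the initialization that the back-end then refines by MAP optimization.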

3 Deep-Learned Features

Traditional hand-engineered keypoint extractors and matchers are not robust to viewpoint/illumination/seasonal changes, yet such robustness is crucial for the long-term autonomy of mobile robots. Deep learning, especially the convolutional neural network (CNN) [52], excels at feature extraction and processing. Therefore, DL-based keypoint extractors and matchers have been proposed as a new trend to enhance the robustness of V-SLAM systems even under challenging conditions.

The DL-based keypoint extractors can be divided into three categories: detect-then-describe, jointly detect-and-describe, and describe-to-detect. For the detect-then-describe method, keypoint detectors include TILDE [53], covariant feature detector [54], TCDet [55], MagicPoint [56], Quad-Networks [57], texture feature detector [58], and Key.Net [59]; keypoint descriptors include convex optimized descriptor [60], Deepdesc [61], TFeat [62], UCN [63], L2-Net [64], HardNet [65], average precision ranking [66], GeoDesc [67], RDRL [68], LogPolarDesc [69], ContextDesc [70], SOSNet [71], GIFT [72], and CAPS descriptor [73]. Jointly detect-and-describe methods include LIFT [74], DELF [75], LF-Net [76], SuperPoint [77], UnsuperPoint [78], GCNv2 [79], D2-Net [80], RF-Net [81], R2D2 [82], ASLFeat [83], and UR2KiD [84]. Different from the above two categories, Tian et al. proposed a describe-to-detect (D2D) method [85], which selects keypoints based on the dense descriptor information.

After feature extraction, robust keypoint matching methods include context networks (CNe) [86], deep fundamental matrix estimation [87], NG-RANSAC [88], NM-Net [89], Order-Aware Network [90], ACNe [91], and SuperGlue [92].

Besides the above extract-then-match methods, some end-to-end keypoint extract-and-match methods have also been proposed, including NCNet [93], KP2D [94], Sparse-NCNet [95], LoFTR [96], and Patch2Pix [97].

Some datasets and benchmarks have been released to train and evaluate different DL networks, such as Brown and Lowe’s dataset [98], HPatches [99], the matching in the dark (MID) dataset [100], the image matching challenge [101], and the long-term visual localization benchmark [102, 103].

4 Conclusions

In this paper, we have given a very concise survey of the front-end techniques of V-SLAM systems, emphasizing DL-based keypoint extraction and matching techniques. Future research directions include the extraction and association of high-level geometric features (e.g. lines [104, 105], symmetry [106], and holistic 3D structures [107]) as well as semantic object-level features [108,109,110].