
1 Introduction

Augmented Reality (AR) is the technology of mixing real scenes with virtual scenes, an emerging field with huge application potential. The technology makes use of computer-generated virtual information within the real world to enhance human perception of the world. As defined by Azuma, it is an integration of the virtual world and the real world with real-time interaction via three-dimensional registration [2]. With the recent rapid development of software and hardware technologies in virtual reality and computer vision, AR has found a wide range of applications in medicine, the military, entertainment and other fields [4, 15]. Virtual registration, however, remains a challenge in AR research. Initially, Simultaneous Localization and Mapping (SLAM), as a probabilistic algorithm, was mainly used for positioning robots in unknown environments [3, 14]. More recently, researchers have started to exploit the accuracy and real-time performance of SLAM for virtual registration in AR. Davison et al. [5, 6] used a monocular camera to achieve fast 3D modeling and camera positioning in unknown environments, which demonstrated many practical uses of the algorithm. Klein [9] applied a SLAM algorithm to the creation of three-dimensional point clouds, while Reitmayr [12] demonstrated the use of SLAM and sensor fusion techniques for accurate, markerless virtual registration and tracking.

The method of computing a homography matrix for three-dimensional registration in AR systems [7, 11] is simple and efficient. This algorithm requires the detection of four point coordinates on a plane in order to determine the translation and rotation of the camera relative to the world coordinate system. In spite of its simplicity and efficiency, since it is based on 2D plane registration, the four-point detection algorithm is prone to misplacement errors in the virtual object registration, resulting in virtual objects that are unstable and show distracting visual effects (e.g. flashing visual artifacts). Previous approaches [9, 12] have attempted to make use of the three-dimensional map information generated by SLAM for this process. In this paper, we present a method that improves the registration and tracking of virtual objects by using the map information generated by VSLAM [12] technology. The three-dimensional scene information generated by VSLAM cannot be used directly, due to interference points and the large error of the point clouds. Therefore, a robust Maximum Consistency with Minimum Distance and Robust Z-score (MCMD_Z) [1] algorithm is used to detect the 2D plane more accurately. Our improved MCMD_Z method computes the plane point matrix using the plane normal vector found by Singular Value Decomposition (SVD). A Lie group method is then used to convert the normal vector into the rotation matrix that registers the virtual object using the plane information. We use the precise positioning function of VSLAM to transform the camera poses into the rendering coordinate system under the camera perspective for the three-dimensional registration of the virtual object.

The main contribution of this paper is a method that effectively produces stable, high-accuracy registration for virtual-real fusion.

2 AR System Overview

The AR system consists of two software modules, a VSLAM module and a registration module, as shown in the system overview in Fig. 1. Tracking in the VSLAM module locates the camera position by processing each image frame and decides when to insert a new keyframe. First, feature matching is initialized with the previous frame and Bundle Adjustment (BA) [16] is used to optimize the camera poses. Once the 3D map has been initialized and successfully created by the VSLAM module, the registration module is called. As soon as the point cloud of the scene is generated, the plane detection is started, and the center and the normal vector of the plane are calculated. The center of the plane determines the exact position of the virtual object and the normal vector determines its orientation. Camera poses obtained by VSLAM are then converted into the OpenGL modelview matrix, which transforms the three-dimensional virtual object to the center of the plane to achieve the virtual augmentation.
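
The overall flow can be summarized by the following per-frame loop. This is only an illustrative sketch in Python; the module interfaces (vslam, registrar, renderer and their methods) are hypothetical names, not the actual implementation of our system.

```python
def process_frame(frame, vslam, registrar, renderer):
    """One iteration of the AR loop sketched in Fig. 1 (hypothetical API)."""
    # VSLAM module: track the camera and, when needed, insert a keyframe.
    pose = vslam.track(frame)                      # 4x4 world-to-camera pose
    if vslam.map_ready() and not registrar.plane_ready():
        # Registration module: detect a plane in the sparse point cloud and
        # derive the object's position (plane center) and orientation (normal).
        center, normal = registrar.detect_plane(vslam.map_points())
        registrar.place_object(center, normal)
    if registrar.plane_ready():
        # Convert the VSLAM pose into an OpenGL modelview matrix and render
        # the virtual object at the plane center.
        renderer.draw(frame, pose @ registrar.object_transform())
```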

2.1 Tracking and 3D Map Building

Our system is based on a visual simultaneous mapping and tracking approach that extracts and matches Oriented FAST and Rotated BRIEF (ORB) [13] feature points, where FAST denotes Features from Accelerated Segment Test and BRIEF denotes Binary Robust Independent Elementary Features. Two models are computed: a homography matrix, used for a planar scene, and a fundamental matrix, used for a non-planar scene. Both matrices are calculated for each frame, and a score \(S_M\) (\(M=H\) for the homography matrix, \(M=F\) for the fundamental matrix) is also calculated, as shown in Eq. 1. The score is used to determine which model is more suitable for recovering the camera posture.

Fig. 1. System overview

$$\begin{aligned} S_{M}=\sum _{i}\left( \rho _{M}\left( d_{cr,M}^{2}\left( x_{c}^{i},x_{r}^{i} \right) \right) + \rho _{M}\left( d_{rc,M}^{2}\left( x_{c}^{i},x_{r}^{i} \right) \right) \right) \end{aligned}$$
(1)
$$ \rho _{M}\left( d^{2} \right) = \begin{cases} \varGamma -d^{2} & \text{if } d^{2}<T_{M} \\ 0 & \text{if } d^{2}\ge T_{M} \end{cases} $$

where \(d_{rc}\) and \(d_{cr}\) are the symmetric transfer errors [8], \(T_{M}\) is the outlier rejection threshold based on the \(\chi ^{2}\) test, \(\varGamma \) is equal to \(T_{M}\), \(x_{c}\) are the features of the current frame, and \(x_{r}\) are the features of the reference frame. BA is used to optimize the camera poses, which gives a more accurate camera position, as in the following equation.
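
As an illustration only, the score of Eq. 1 can be computed as below. This is a minimal sketch assuming the squared transfer errors have already been evaluated for every matched feature pair; the function name and argument layout are ours, not part of the original system.

```python
import numpy as np

def model_score(d2_cr, d2_rc, T_M):
    """Score S_M of Eq. 1 from the squared symmetric transfer errors.

    d2_cr, d2_rc: arrays of squared transfer errors (current->reference and
    reference->current) for the matched feature pairs; T_M: chi-square based
    outlier rejection threshold.
    """
    gamma = T_M                                    # Gamma is set equal to T_M
    rho = lambda d2: np.where(d2 < T_M, gamma - d2, 0.0)
    return float(np.sum(rho(np.asarray(d2_cr)) + rho(np.asarray(d2_rc))))

# The model (homography H or fundamental matrix F) with the higher score S_M
# is selected to recover the camera posture.
```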

$$\begin{aligned} \left\{ R,t \right\} = \arg \min _{R,t}\sum _{i\in \chi } \rho \left( \left\| x^{i}-\pi \left( RX^{i}+t \right) \right\| _{\varSigma }^{2} \right) \end{aligned}$$
(2)

where \(R\in SO(3)\) is the rotation matrix, \(t\in \mathbb {R}^{3}\) is the translation vector, \(X^{i}\in \mathbb {R}^{3}\) is a three-dimensional point in space, \(x^{i}\in \mathbb {R}^{2}\) is the corresponding key point, \(\rho \) is the Huber cost function, \(\varSigma \) is the covariance matrix associated with the key point, and \(\pi \) is the projection function.
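
The cost that BA minimizes in Eq. 2 can be sketched as follows for a single camera. This is a simplified illustration that ignores the per-key-point covariance \(\varSigma \) and applies a plain Huber loss to the reprojection error; the Huber threshold and the function signature are our own assumptions.

```python
import numpy as np

def reprojection_cost(R, t, X, x, K, delta=2.0):
    """Huber-robustified reprojection cost in the spirit of Eq. 2.

    R (3x3), t (3,): camera pose; X (Nx3): map points; x (Nx2): key points;
    K (3x3): camera intrinsics; delta: Huber threshold (assumed value).
    """
    Xc = X @ R.T + t                              # points in the camera frame
    p = Xc @ K.T
    p = p[:, :2] / p[:, 2:3]                      # pi(RX + t): projection
    r = np.linalg.norm(x - p, axis=1)             # reprojection error
    huber = np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))
    return float(huber.sum())
```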

After an accurate position estimate of the camera is obtained, the three-dimensional map point cloud is built by triangulating key frames using the camera poses, and finally a local BA is used to optimize the map. A detailed description of the approach is given in [10].

3 Plane Detection and Calculation of the Normal Vector

The map created in Sect. 2.1 is composed of a sparse point cloud. Because the point cloud data contains errors and a large number of abnormal values, MCMD_Z is used for plane detection. The MCMD_Z algorithm fits the data according to a searched model. The idea of the algorithm is to use Principal Component Analysis (PCA) for a reliable selection of the registration plane and to use the Robust Z-score to remove invalid points in a single pass. The method not only effectively avoids threshold setting, but also runs fast. The MCMD_Z algorithm detects the plane by computing a Robust Z-score \(Rz_{i}\) for each candidate point:

$$\begin{aligned} Rz_{i}=\frac{\left| od_{i}-\underset{j}{\mathrm {median}} \left( od_{j} \right) \right| }{a\cdot \underset{i}{\mathrm {median}}\left| od_{i}-\underset{j}{\mathrm {median}} \left( od_{j} \right) \right| } \end{aligned}$$
(3)
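
A compact sketch of this outlier-removal step is given below. It assumes \(od_{i}\) is the orthogonal distance of point \(i\) to the current candidate plane and that \(a\) is the usual median-absolute-deviation consistency constant (\(a \approx 1.4826\)); both are our assumptions, since the symbols are defined in the omitted algorithm listing, and the rejection threshold is illustrative.

```python
import numpy as np

def robust_zscores(points, center, normal, a=1.4826):
    """Robust Z-scores of Eq. 3 for the candidate plane (center, normal)."""
    od = np.abs((points - center) @ normal)        # orthogonal distances od_i
    dev = np.abs(od - np.median(od))               # |od_i - median_j(od_j)|
    return dev / (a * np.median(dev) + 1e-12)      # Eq. 3

# Points with a Robust Z-score above a cut-off (e.g. 2.5) are discarded,
# and the plane is re-estimated from the remaining inliers via PCA/SVD.
```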

The detected plane determines the location at which a virtual object can be superimposed. Although the location of the virtual object is thus determined, the virtual object will in general not appear parallel to the plane but at a certain angle to it. To solve this problem, we need to calculate the normal vector of the plane and the corresponding rotation matrix.

The SVD of the matrix of plane inlier points is computed, and the right singular vector corresponding to the smallest singular value is the normal vector of the plane. Because the plane admits two opposite normal vectors, it is important that the chosen normal points outward. Specifically, the vector from the camera to the plane is found from the camera's posture. From this vector and its relationship to the plane normal, we can determine the direction of the normal vector. The rotation matrix is obtained from the known normal vector via the Lie group exponential map using the following equation:

$$\begin{aligned} R_{3\times 3}=\exp \left( \hat{w} \right) = I + \sin \left( \left\| w \right\| \right) \cdot \frac{\hat{w}}{\left\| w \right\| } + \left( 1-\cos \left( \left\| w \right\| \right) \right) \cdot \frac{\hat{w}^{2}}{\left\| w \right\| ^{2}} \end{aligned}$$
(4)
$$ w = \frac{n_{y} \times n_{p}}{\left\| n_{y} \times n_{p} \right\| } \cdot \arctan \frac{\left\| n_{y} \times n_{p} \right\| }{n_{y} \cdot n_{p}} $$

where \(n_y\) is the unit vector of the y-axis, \(n_p\) is the normal vector of the plane, \(w\) is the axis-angle (rotation) vector, and \(\hat{w}\) is the anti-symmetric matrix of the vector \(w\). Finally, the OpenGL transformation matrix is composed of a translation vector and a rotation matrix: the rotation matrix is obtained as above, and the translation vector is the center of the plane.
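
Equation 4 and the normal estimation can be sketched as follows with NumPy. The y-axis reference vector and the sign convention for the normal (pointing outward, towards the camera) follow the description above; the function names and the handling of the degenerate parallel case are our own choices.

```python
import numpy as np

def plane_normal(inliers, camera_center):
    """Plane center and normal from the SVD of the centered inlier points."""
    center = inliers.mean(axis=0)
    _, _, Vt = np.linalg.svd(inliers - center)
    n = Vt[-1]                                   # smallest singular value
    if np.dot(camera_center - center, n) < 0:    # make the normal point
        n = -n                                   # outward, towards the camera
    return center, n

def rotation_from_normal(n_p, n_y=np.array([0.0, 1.0, 0.0])):
    """Rotation of Eq. 4 aligning the y-axis unit vector n_y with n_p."""
    c = np.cross(n_y, n_p)
    s, d = np.linalg.norm(c), np.dot(n_y, n_p)
    if s < 1e-8:                                 # n_p (anti-)parallel to n_y
        return np.eye(3) if d > 0 else np.diag([1.0, -1.0, -1.0])
    w = c / s * np.arctan2(s, d)                 # axis-angle vector w
    theta = np.linalg.norm(w)
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    return (np.eye(3) + np.sin(theta) * W / theta
            + (1 - np.cos(theta)) * (W @ W) / theta**2)
```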

3.1 Virtual Registration

The virtual object is finally registered in the real world, which requires a chain of coordinate-system transformations (from the world coordinate system to the camera coordinate system, to the clip coordinate system, and finally to the screen coordinate system). The transformation sequence can be described by Eq. 5: the world coordinate system is transformed into the camera coordinate system by a rotation matrix \(R_{3\times 3}\) and a translation vector \(T_{3\times 1}\), which are composed from the camera's pose and the detected plane information. The camera coordinate system is then transformed into the screen coordinate system \((u,v)\) by the focal lengths \((f_x,f_y)\) and the principal point \((d_x,d_y)\), which are obtained by camera calibration. Finally, the virtual object is registered on the screen into the real world.

$$\begin{aligned} \begin{bmatrix} u\\ v\\ 1 \end{bmatrix} = \begin{bmatrix} f_{x}&0&d_{x}&0 \\ 0&f_{y}&d_{y}&0\\ 0&0&1&0 \end{bmatrix} \begin{bmatrix} R_{3\times 3}&T_{3\times 1}\\ 0_{1\times 3}&1 \end{bmatrix} \begin{bmatrix} X\\ Y\\ Z\\ 1 \end{bmatrix} \end{aligned}$$
(5)
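
A direct transcription of Eq. 5 is shown below; it maps a world-coordinate point to pixel coordinates given the pose recovered by VSLAM and the calibrated intrinsics. The function name and argument layout are illustrative.

```python
import numpy as np

def project_point(Xw, R, T, fx, fy, dx, dy):
    """Pinhole projection of Eq. 5: world point Xw -> pixel (u, v).

    R (3x3), T (3,): world-to-camera transform from the VSLAM pose and the
    detected plane; (fx, fy), (dx, dy): focal lengths and principal point
    from camera calibration.
    """
    Xc = R @ Xw + T                       # world -> camera coordinates
    u = fx * Xc[0] / Xc[2] + dx           # perspective division followed by
    v = fy * Xc[1] / Xc[2] + dy           # the intrinsic mapping to pixels
    return np.array([u, v])
```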

4 Experiment and Evaluation

Our experiments were run under Ubuntu 14.04 on a machine with a 2.3 GHz CPU, 8 GB of memory and an NVIDIA GeForce GTX 960M graphics card. The camera resolution is 640 by 480 pixels at 30 Hz. The experimental scene is indoors and the image sequence is 1857 frames long. Figure 2(a)–(b) show the indoor scene under AR tracking and registration, illustrating the tracking and registration effect. Figure 2(c) shows the correct orientation of the virtual object.

Fig. 2. AR tracking and registration (left to right (a)–(c))

4.1 Plane Detection Analysis

Our method is based on MCMD_Z, which achieves better results than Random Sample Consensus (RANSAC). To contrast the two algorithms, we use Gaussian distributions to produce 1000 points with outlier percentages of 10 and 20, using the same input parameters as before. The inliers have means (15.0, 15.0, 10.0) and variances (10.0, 2.0, 0.5). The outliers have means (15.0, 15.0, 10.0) and variances (10.0, 2.0, 0.5). The program was run 1000 times, and we compared the Correct Identification Rate (CIR) and the Swamping Rate (SR). RANSAC was set to 50 iterations (Table 1).
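
For reproducibility, the synthetic test data described above can be generated along the following lines. The generator simply samples inliers and outliers from the Gaussian parameters quoted in the text; the way CIR and SR are computed (true outliers correctly flagged versus inliers wrongly flagged) reflects our reading of these metrics and should be treated as an assumption.

```python
import numpy as np

def synthetic_cloud(n=1000, outlier_ratio=0.10,
                    in_mean=(15.0, 15.0, 10.0), in_var=(10.0, 2.0, 0.5),
                    out_mean=(15.0, 15.0, 10.0), out_var=(10.0, 2.0, 0.5)):
    """Gaussian point cloud with a given outlier percentage."""
    n_out = int(n * outlier_ratio)
    inl = np.random.normal(in_mean, np.sqrt(in_var), size=(n - n_out, 3))
    out = np.random.normal(out_mean, np.sqrt(out_var), size=(n_out, 3))
    labels = np.r_[np.zeros(n - n_out), np.ones(n_out)]    # 1 = outlier
    return np.vstack([inl, out]), labels

def cir_sr(flagged, labels):
    """Correct Identification Rate and Swamping Rate for one run."""
    cir = np.mean(flagged[labels == 1])    # true outliers that were flagged
    sr = np.mean(flagged[labels == 0])     # inliers wrongly flagged (swamped)
    return cir, sr
```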

Table 1. Correct Identification Rate (CIR), Swamping Rate (SR) and Time
Fig. 3. Registration error (Color figure online)

4.2 Registration Error Analysis

A comparison with fixed camera positions is used to evaluate the robustness of the method. The three-dimensional registration of the virtual object is carried out both with the described method and with the standard homography matrix method, and six components of the registration result are analyzed. The difference between a component of the transformation matrix of the current frame and the corresponding component of the transformation matrix of the previous frame is used as the basis for comparison. The results are shown in Fig. 3, where Translate x, Translate y and Translate z are the errors of the translation components, and Rotate x, Rotate y and Rotate z are the errors of the rotation components about the x, y and z axes, each obtained by subtracting the previous frame from the current frame. The rotation components are obtained by dividing the respective components by the dot product with the corresponding coordinate axis, and the translation components are normalized.
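
One plausible implementation of this frame-to-frame jitter measure is sketched below; since the exact normalization is only summarized in the text, the angle computation used here (angle between corresponding rotation axes of consecutive frames) is our interpretation rather than a verbatim reproduction of the paper's metric.

```python
import numpy as np

def frame_to_frame_errors(poses):
    """Per-frame differences of the six registration components.

    poses: list of 4x4 modelview matrices, one per frame (fixed camera).
    Returns an (N-1) x 6 array: [dTx, dTy, dTz, dRx, dRy, dRz] per frame.
    """
    rows = []
    for prev, cur in zip(poses[:-1], poses[1:]):
        dt = cur[:3, 3] - prev[:3, 3]                 # translation jitter
        dr = [np.degrees(np.arccos(np.clip(
                  np.dot(cur[:3, i], prev[:3, i]), -1.0, 1.0)))
              for i in range(3)]                      # rotation jitter (deg)
        rows.append(np.r_[dt, dr])
    return np.array(rows)
```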

In Fig. 3, the red curves are the results of using only the homography matrix, and the blue curves are the results of the new registration method used in this paper. As can be seen from Fig. 3, registering the virtual objects with the homography matrix method produces large fluctuations of the registration error, which correspond to unstable virtual object registration. The new method, in contrast, keeps the error of each rotation component within a small range below 0.5\(^{\circ }\). The errors of Translate x, Translate y and Translate z are similarly small.

The experimental results show that the new method produces stable virtual registration and eliminates the flickering phenomenon in virtual-real registration, and hence improves the stability of the AR system.

5 Conclusions and Future Work

This paper presents a stable and realistic tracking method based on the three-dimensional map information generated by VSLAM to register virtual objects while ensuring the stability and real-time performance of the registration. The proposed method is fast and achieves accurate registration results. The experimental results show that it effectively suppresses virtual object jitter and has high tracking accuracy with good tracking performance. The three-dimensional map currently used in this paper is a sparse point cloud, which provides only limited information about the spatial configuration of the scene.

While this work has proposed and prototyped the approach and demonstrated its effectiveness experimentally, future work will consider the use of dense point clouds within the proposed method.