1 Introduction

Augmented Reality (AR) is an important topic with applications in various fields such as entertainment [1,2,3], medicine [4, 5], industry [6], education [7, 8] and tourism [9, 10]. The key component of most AR systems is pose estimation: keeping track of camera rotation and translation so that the virtual object (VO) can be embedded in the scene as the viewpoint changes. Here, we offer an alternative approach that removes the need for pose estimation or for building a 3D map of the environment.

The main idea of our proposed method is to find the virtual object’s location directly, without computing camera poses. To this end, we use fundamental matrices between each new frame and previous keyframes. Unlike SLAM-based methods, we do not need to maintain a point cloud. Moreover, our algorithm works when the intrinsic camera parameters change over time, and it remains effective despite operating with more degrees of freedom (11 for a projective camera versus 6 for a calibrated pose). Our major contributions can be summarized as follows:

  1. A new AR algorithm that works without pose estimation,

  2. A robust algorithm to compute virtual object’s locations from the fundamental matrices between each keyframe and a new frame,

  3. A novel approach for depth-buffering that only uses the fundamental matrices and does not require the 3D points or their depths.

In Section 2, we provide an overview of the related work. This is followed by Section 3, detailing our proposed method. Section 4 presents our results. Finally, Section 5 concludes the paper.

2 Related work

Recent AR systems utilize SLAM algorithms for camera pose estimation. SLAM systems typically operate in two threads: mapping and tracking [11,12,13,14,15,16,17]. The mapping thread is responsible for extracting the 3D locations of the tracked features. This process is primarily performed for keyframes, where ample time is available for precise refinement of the scene structure. Frames not selected as keyframes are processed in the tracking thread. One fundamental step in SLAM systems is feature tracking, which is carried out in the tracking thread. Subsequently, the camera pose is computed for all frames. Several factors are considered for keyframe selection, including tracking quality [18], the distance from the previous keyframe [13, 18], the minimum distance between the camera and a key-point in the 3D point cloud [18], the number of tracked features in the new frame [13], and the overlap between the current frame and the previous keyframe [13, 18].

Klein and Murray first introduced the concept of keyframes, and the use of tracking and mapping threads [12, 19, 20]. A SLAM system utilizing ORB feature points (ORB-SLAM) was proposed by Mur-Artal et al. [13, 21, 22]. The ORB (Oriented FAST and Rotated BRIEF) features are suitable for online systems [23], enabling real-time performance even without the need for GPUs. Additionally, this system incorporates another thread called loop closing.

AR algorithms typically require the location of objects in a reference frame. To achieve this, they often employ SLAM to establish an initial 3D model upon which the 3D objects are placed. A common approach is to use two non-consecutive frames and to initialize camera parameters. The detected features in the first frame are tracked throughout the sequence up to the second frame. The relative camera pose between the two frames is then calculated using the corresponding features.

Table 1 A comparison between different stages of typical AR algorithms and our algorithm

Most SLAM methods rely on camera intrinsic parameters. Nonetheless, to address uncalibrated environments, a viable approach is to utilize auto-calibration methods. Chawla et al. [24] add an auto-calibration method to ORB-SLAM so that their algorithm works in uncalibrated settings. Ling et al. [25] proposed a method for augmented reality when the intrinsic camera parameters are unknown. They utilized Kruppa’s equations to determine the intrinsic camera matrix. In their approach, the intrinsic parameters are unknown but remain constant throughout the video. In contrast, our method functions even when the intrinsic parameters change during the video sequence.

Kutulakos and Vallino [26] do not use any camera intrinsic parameters. They model the camera-to-image transformation using a weak-perspective camera and show that the projections of four non-coplanar fiducial points are enough to find the locations of all other points in a frame. They obtain the location of a point as a linear combination of the known points. In contrast, our method works for fully projective cameras.

Seo et al. [27, 28] propose an approach that does not need intrinsic camera parameters as input. First, they build a projective reconstruction for a pair of frames, and extend it to the rest of the frames using camera resectioning. Using a set of 2D control points annotated by the user in a pair of frames, they remove the projective ambiguity, obtaining a metric reconstruction. Having the metric reconstruction, they can project the virtual object (VO) points and also perform depth-buffering. Wang et al. [29] merge geometric constraints with a deep learning method that uses a probabilistic model to predict the intrinsic and extrinsic parameters of images. In contrast, our method bypasses the 3D reconstruction step and performs the rendering completely in the 2D domain.

3 The proposed method

Table 1 contrasts the different stages of a generic AR algorithm and our method. Stages 1, 2, and 5 are typically implemented using a SLAM system, providing the camera pose and 3D object locations in each frame. Our main challenge is to render the virtual object (VO) with no access to the camera pose or the 3D object locations in every view.

Figure 1 shows the flow of our algorithm. Initialization is performed once at the beginning of our algorithm (Sec. 3.1). Then, for each frame, 2D local features are extracted and/or tracked (Sec. 3.2). Using these, the fundamental matrices between the current frame and the 6 previous keyframes are computed (Sec. 3.2). Given the 2D object locations in the keyframes, the 2D locations in the new frame are computed as the intersection of the epipolar lines (Sec. 3.3). Finally, we perform depth-buffering to find the relative depths of the virtual object points and render the object in the 2D frame (Sec. 3.4). In the next subsections we explain the different stages of our method.

Fig. 1 The outline of the proposed AR system

3.1 Initialization

In the proposed method, initialization involves the assignment of the 2D virtual point positions and their relative depths in the initial keyframes. These are the only requirements of our algorithm. To accomplish this, various methods can be employed. While we could use any SfM method for the initial 3D reconstruction, in our experiments we simply utilize the ORB-SLAM method up until the first four keyframes. This aligns the initialization step of our algorithm with that of ORB-SLAM, providing a baseline for comparison. This choice has the downside of requiring an initial camera calibration; however, after initialization, the internal parameters can vary. Notice that the 2D locations of the VO points in the subsequent keyframes are computed in the same way as for ordinary frames, using the method of Sec. 3.3. Due to the scale ambiguity, the depths computed by ORB-SLAM are relative, in the sense that they relate to the actual depths by a common positive scale factor.

3.2 Frame processing

We extract Shi-Tomasi features [30] (improved Harris) in the keyframes and track them in the subsequent frames using the Lucas-Kanade method with a backward check. The fundamental matrices (\(\texttt{F}\)) between the keyframes and the current frame are then computed from the tracked points. The fundamental matrix \(\texttt{F}\) relates corresponding points (\(x_1,x_2\)) in two frames through the equation \(x_2^T \texttt{F}x_1=0\). The accuracy of the estimated fundamental matrix degrades when a significant number of points lie on a single plane, which is the case in many AR scenarios, including the dataset we use. We overcome this as follows. First, we try to establish a homography (\(\texttt{H}\)) relation between the two frames that includes as many 2D point correspondences as possible. This can be done simply using RANSAC (Random Sample Consensus). Employing the equation \(\texttt{F}^T \texttt{H}+ \texttt{H}^T \texttt{F}= 0\), the obtained homography matrix gives 6 linear equations representing 5 independent constraints on the fundamental matrix [31, 32]. Each non-planar correspondence also gives one linear equation, resulting in (say) n further equations. We compute the fundamental matrix by solving all \(6{+}n\) linear equations (again applying RANSAC).
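For reference, the following is a minimal OpenCV/NumPy sketch of this stage under simplifying assumptions: it performs Shi-Tomasi detection, Lucas-Kanade tracking with a forward-backward check, and a plain RANSAC fundamental-matrix fit. The additional homography-derived linear constraints described above for near-planar scenes are not included, and the parameter values (feature count, thresholds) are illustrative rather than those of our implementation.

```python
# Minimal sketch of the frame-processing step, assuming grayscale uint8
# images `kf_img` (keyframe) and `cur_img` (current frame).
import cv2
import numpy as np

def track_and_estimate_F(kf_img, cur_img, fb_thresh=1.0):
    # Shi-Tomasi (improved Harris) corners in the keyframe
    pts_kf = cv2.goodFeaturesToTrack(kf_img, maxCorners=1000,
                                     qualityLevel=0.01, minDistance=7)
    # Forward Lucas-Kanade tracking into the current frame
    pts_cur, st, _ = cv2.calcOpticalFlowPyrLK(kf_img, cur_img, pts_kf, None)
    # Backward check: track back and keep points that return close to the start
    pts_back, st_b, _ = cv2.calcOpticalFlowPyrLK(cur_img, kf_img, pts_cur, None)
    fb_err = np.linalg.norm(pts_kf - pts_back, axis=2).ravel()
    good = (st.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < fb_thresh)
    p1, p2 = pts_kf[good].reshape(-1, 2), pts_cur[good].reshape(-1, 2)
    # Fundamental matrix with RANSAC (convention x2^T F x1 = 0)
    F, inliers = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.999)
    return F, p1, p2, inliers
```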

3.3 Finding 2D virtual object’s locations

Here, we do not have access to the 3D locations of the VO points, but rather to their 2D locations in the keyframes. To find the 2D locations in the current frame, we first compute the fundamental matrices between the current frame and each of the past 6 keyframes using the tracked features (see Sec. 3.2). The number 6 was chosen empirically: using more keyframes significantly increases the computational cost without a substantial accuracy improvement, while using fewer keyframes notably decreases accuracy. Let \(\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_6\) be the locations of a VO point in the six most recent keyframes, and \(\texttt{F}_i\) be the fundamental matrix between the corresponding keyframe and the current frame. The 2D location \(\textbf{x}\) of the VO point in the current frame can be computed as the intersection of the epipolar lines \(\textbf{l}_i = \texttt{F}_i^T\textbf{x}_{i}\) for \(i=1,2,\ldots ,6\).

Here, we face two problems. First, the estimated epipolar lines do not exactly intersect at one point. Second, some of the fundamental matrices may be totally incorrect. We propose a robust solution by minimizing the sum of the distances of \(\textbf{x}\) to the epipolar lines:

$$\begin{aligned} \textbf{x}_{\text {opt}} = \underset{\textbf{x}}{\text {argmin}} \sum _{i=1}^{n} \textrm{dist}(\textbf{x}, \textbf{l}_i). \end{aligned}$$
(1)

Here, \(\textrm{dist}(\textbf{x}, \textbf{l})\) is the Euclidean distance of the point \(\textbf{x}\) from the line \(\textbf{l}\). Notice that minimizing the sum of distances, rather than the sum of squared distances, gives a solution that is robust to outlier epipolar lines resulting from erroneous fundamental matrices. To minimize the above, we use a modification of Weiszfeld’s algorithm proposed in [33] (a sketch follows the steps below):

  1. Set \(w_i=1\) for \(1 \le i \le n\),

  2. Repeat until convergence:

     1. Find \(\textbf{x}_r = \text {argmin}_{\textbf{x}} \sum _{i=1}^{n} w_i ~\textrm{dist}(\textbf{x}, \textbf{l}_i)^2\), in closed form,

     2. Update \(w_i = 1/\textrm{dist}(\textbf{x}_r, \textbf{l}_i)\).
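Below is a minimal NumPy sketch of this iteration (an iteratively-reweighted least-squares reading of the modified Weiszfeld scheme), assuming `lines` is an \(n \times 3\) array of epipolar lines in homogeneous form \((a, b, c)\) with \(ax + by + c = 0\); the fixed iteration count and the division guard are illustrative choices.

```python
import numpy as np

def robust_intersection(lines, n_iter=20, eps=1e-9):
    # Normalize so (a, b) is a unit normal; then |a*x + b*y + c| is the
    # Euclidean point-to-line distance.
    lines = lines / np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    A, c = lines[:, :2], lines[:, 2]
    w = np.ones(len(lines))                  # step 1: w_i = 1
    x = np.zeros(2)
    for _ in range(n_iter):                  # step 2: iterate
        # 2.1: closed-form weighted least squares,
        #      minimizing sum_i w_i * (a_i x + b_i y + c_i)^2
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, -A.T @ W @ c)
        # 2.2: reweight by inverse distance (guarded against division by zero)
        w = 1.0 / np.maximum(np.abs(A @ x + c), eps)
    return x
```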

3.4 Depth-buffering

Depth-buffering or Z-buffering is perhaps the most challenging obstacle in using our method for rendering 3D objects. How can we tell which object point is behind which without direct access to the 3D object points? Additionally, we must identify the VO points situated behind the camera and exclude them from rendering. In this section, we provide a novel solution solely using epipolar relations.

Let \(\texttt{P}_1, \texttt{P}_2, \ldots , \texttt{P}_m\) be the projection matrices for frames 1 to m, and \(\textbf{X}_1, \textbf{X}_2, \ldots , \textbf{X}_n\) be the 3D VO points. The j-th point \(\textbf{X}_j=(X_j,Y_j,Z_j,1)^T\) is projected to the 2D point \(\textbf{x}_{ij} = (x_{ij}, y_{ij}, 1)^T\) in frame i according to

$$\begin{aligned} \lambda _{ij} \textbf{x}_{ij} = \texttt{P}_i \textbf{X}_j. \end{aligned}$$
(2)

The scalar \(\lambda _{ij}\) is usually called the projective depth. For depth-buffering, however, we are interested in the actual depth of \(\textbf{X}_j\) in view i, which we denote by \(d_{ij}\). These two quantities are related by [34, Sec. 6.2.3]

$$\begin{aligned} d_{ij} = \frac{\textrm{sign}(\det \texttt{M}_i)}{\left\Vert \textbf{m}_{i}\right\Vert } \, \lambda _{ij} \end{aligned}$$
(3)

or

$$\begin{aligned} \lambda _{ij} = \frac{\left\Vert \textbf{m}_{i}\right\Vert }{\textrm{sign}(\det \texttt{M}_i)} \, d_{ij} = \gamma _i \, d_{ij}, \end{aligned}$$
(4)

where the matrix \(\texttt{M}_i\) comprises the first three columns of \(\texttt{P}_i\), and \(\textbf{m}_{i}\) is the third row of \(\texttt{M}_i\). Notice that the above is only valid if \(\texttt{P}_i\) and \(\textbf{X}_j\) are the true (not reconstructed) entities, and the final coordinate of \(\textbf{X}_j\) is chosen equal to 1. The relation (4) shows that for a particular view i the projective depths are related to the actual depths by a common scale factor \(\gamma _i = \left\Vert \textbf{m}_{i}\right\Vert /\textrm{sign}(\det \texttt{M}_i)\).
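As a small sanity check of (2)-(4), the following NumPy sketch recovers the actual depth from a true projection matrix; it is purely illustrative, since in our uncalibrated setting the true \(\texttt{P}_i\) is never available.

```python
import numpy as np

def depth_from_projection(P, X):
    # P: true 3x4 projection matrix, X: 3D point (X, Y, Z, 1)^T
    x_h = P @ X                         # lambda * (x, y, 1)^T, Eq. (2)
    lam = x_h[2]                        # projective depth lambda
    M = P[:, :3]                        # first three columns of P
    m3 = M[2, :]                        # third row of M
    gamma = np.linalg.norm(m3) / np.sign(np.linalg.det(M))
    return lam / gamma                  # actual depth d, Eq. (3)
```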

In uncalibrated scenarios, we do not have access to the true projective depths \(\lambda _{ij}\), but rather the reconstructed depths \(\hat{\lambda }_{ij}\) coming from a projective reconstruction \(\hat{\lambda }_{ij} \textbf{x}_{ij} = \hat{\texttt{P}}_i \hat{\textbf{X}}_j\). The reconstructed entities \(\{\hat{\texttt{P}}_i\}\) and \(\{\hat{\textbf{X}}_j\}\) are related to true \(\texttt{P}_i\)-s and \(\textbf{X}_j\)-s by \(\hat{\texttt{P}}_i = \alpha _i \texttt{P}_i \texttt{H}\) and \(\hat{\textbf{X}}_j = \beta _j \texttt{H}^{-1} \textbf{X}_j\) for some homography \(\texttt{H}\) and scalars \(\alpha _1, \alpha _2, \ldots , \alpha _m\) and \(\beta _1, \beta _2, \ldots , \beta _n\). Consequently, the reconstructed and true projective depths are related by [34, Sec. 18.4]

$$\begin{aligned} \hat{\lambda }_{ij} = \alpha _i \, \beta _j \, \lambda _{ij}, \end{aligned}$$
(5)

for some scalars \(\alpha _1, \alpha _2, \ldots , \alpha _m\) and \(\beta _1, \beta _2, \ldots , \beta _n\).

Fortunately, we do not have to perform a projective reconstruction to find the \(\hat{\lambda }_{ij}\)-s. There is a method for recovering the reconstructed projective depths of a new frame from those of an old frame directly using the epipolar relations [35, 36]. Let \(\textbf{x}_{kj}\) be the projection of the j-th point into the k-th frame (which is a keyframe), and \(\textbf{x}_{ij}\) be the corresponding point in a new frame i. The reconstructed projective depths are then related by

$$\begin{aligned} \hat{\lambda }_{ij} = \frac{(\textbf{e}_{ik}\times \textbf{x}_{ij})\cdot (\texttt{F}_{ik} \textbf{x}_{kj})}{\left\Vert \textbf{e}_{ik}\times \textbf{x}_{ij}\right\Vert ^2} \, \hat{\lambda }_{kj} \end{aligned}$$
(6)

where \(\texttt{F}_{ik}\) is the fundamental matrix between frames k and i, and \(\textbf{e}_{ik}\) is the epipole in frame i. The epipole \(\textbf{e}_{ik}\) can be derived from \(\texttt{F}_{ik}\) as its left null vector, since every epipolar line in frame i passes through it.
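A minimal NumPy sketch of this transfer is given below; it assumes the convention above (\(\texttt{F}_{ik}\) maps points of keyframe k to epipolar lines in frame i) and homogeneous 2D points with last coordinate 1.

```python
import numpy as np

def epipole_from_F(F_ik):
    # The epipole in frame i is the left null vector of F_ik: e^T F_ik = 0.
    _, _, Vt = np.linalg.svd(F_ik.T)
    return Vt[-1]

def transfer_depth(F_ik, x_k, x_i, lam_k):
    # Implements Eq. (6): propagate the reconstructed projective depth
    # of point j from keyframe k to the new frame i.
    e_i = epipole_from_F(F_ik)
    cross = np.cross(e_i, x_i)          # e_ik x x_ij
    line = F_ik @ x_k                   # F_ik x_kj (epipolar line in frame i)
    return (cross @ line) / (cross @ cross) * lam_k
```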

From the initialization step (Sec. 3.1) we have the depths \(d_{ij}\) (up to a positive scale factor) for the first four keyframes. Hence, one can easily perform depth-buffering for these frames. We can also obtain the projective depths \(\lambda _{ij}\) (up to scale) for these four views directly from the initialization step, or by using (4). We set \(\hat{\lambda }_{ij} = \lambda _{ij}\) for the first four keyframes. This removes the ambiguity caused by \(\beta _1, \beta _2, \ldots , \beta _n\), reducing (5) to \(\hat{\lambda }_{ij} = \alpha _i \, \lambda _{ij}\). Now, from (4) we get

$$\begin{aligned} \hat{\lambda }_{ij} = \alpha _i\gamma _i d_{ij}. \end{aligned}$$
(7)

Our goal is to use the \(\hat{\lambda }_{ij}\)-s in lieu of the \(d_{ij}\)-s to perform depth buffering. To do this, we need to resolve the sign ambiguity arising from \(\alpha _i\gamma _i\). There are two possibilities: if \(\alpha _i\gamma _i > 0\), then from (7) the points for which \(\hat{\lambda }_{ij} > 0\) are in front of the camera and we can perform depth buffering using the \(\hat{\lambda }_{ij}\)-s. But if \(\alpha _i\gamma _i < 0\), the points for which \(\hat{\lambda }_{ij} < 0\) are in front of the camera. In this case, we need to negate all \(\hat{\lambda }_{ij}\)-s for frame i before performing depth buffering. This introduces an ambiguity since \(\alpha _i\gamma _i\) is unknown. To resolve it, we simply assume that the majority of points keep their state (behind or in front of the camera) between two consecutive frames. In a new frame, we choose the configuration that is more consistent with the previous one.
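The following small sketch illustrates this majority check, assuming `lam_prev` and `lam_cur` hold the \(\hat{\lambda }\) values of the same VO points in the previous and current frames, with `lam_prev` already sign-corrected.

```python
import numpy as np

def fix_depth_signs(lam_prev, lam_cur):
    # Count how many points keep their side of the camera under each choice
    # of the unknown sign of alpha_i * gamma_i.
    keep_as_is = np.sum(np.sign(lam_cur) == np.sign(lam_prev))
    keep_flipped = np.sum(np.sign(-lam_cur) == np.sign(lam_prev))
    # Choose the configuration more consistent with the previous frame.
    return lam_cur if keep_as_is >= keep_flipped else -lam_cur
```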

In (6), we need to choose a keyframe k to obtain the projective depths \(\hat{\lambda }_{ij}\) for frame i. Among the past 6 keyframes, each 2D point (computed using Weiszfeld’s algorithm) votes for the keyframe with the closest epipolar line. We then choose the keyframe with the highest number of votes to be used in (6).
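A minimal sketch of this voting scheme, assuming `lines[k][j]` holds the epipolar line in the current frame induced by keyframe k for VO point j, and `points[j]` is the corresponding 2D location from the Weiszfeld iteration (homogeneous, last coordinate 1):

```python
import numpy as np

def vote_for_keyframe(lines, points):
    votes = np.zeros(len(lines), dtype=int)
    for j, x in enumerate(points):
        # Distance of point j to its epipolar line from each keyframe
        d = [abs(l[j] @ x) / np.linalg.norm(l[j][:2]) for l in lines]
        votes[int(np.argmin(d))] += 1       # vote for the closest line
    return int(np.argmax(votes))            # keyframe index used in Eq. (6)
```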

Figure 2 shows the outcome of rendering the object within the 2D frame.

Fig. 2 Using the computed \(\lambda \), it is possible to determine which point is visible and which point is occluded by other points

3.5 Keyframe processing

We choose the current frame as a keyframe if

  1. The current frame is more than 20 frames away from the last keyframe, or

  2. There are fewer than 110 feature points tracked from the latest keyframe and more than 4 frames have passed since the latest keyframe.

The first condition ensures that the keyframes are not excessively distant from one another. As for the second condition, when the number of tracked feature points falls below 110, it indicates a decrease in tracking accuracy, which consequently leads to a reduction in the precision of the estimated fundamental matrix. In the new keyframe, we extract new features and add them to the feature list.
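These two rules amount to the simple predicate sketched below; the thresholds are those stated above.

```python
def is_new_keyframe(frame_idx, last_kf_idx, num_tracked,
                    max_gap=20, min_tracked=110, min_gap=4):
    gap = frame_idx - last_kf_idx
    far_from_last_kf = gap > max_gap                                 # condition 1
    tracking_degraded = num_tracked < min_tracked and gap > min_gap  # condition 2
    return far_from_last_kf or tracking_degraded
```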

4 Results

4.1 Datasets

We tested our method on three sequences: a synthetically generated sequence from ICL-NUIM [37], the Freiburg2-desk from TUM RGB-D dataset [38], and a sequence we captured using a mobile camera (IUST-DESK).

The ICL-NUIM dataset has two scenes: a living room and an office room. Each scene consists of 3D models of different objects that together form the scene. The POV-Ray software can render a video from these scenes; besides the 3D models, it needs the intrinsic and extrinsic camera parameters. We selected the office room scene from this dataset to test the proposed system. A new camera path was defined in this scene, and a new sequence was rendered with POV-Ray (Fig. 3). Some 3D points of the scene were selected as features and projected into each frame using the camera parameters. In the generated path, the camera rotates around a desk in the scene.

Fig. 3 A sample frame of the new ICL-NUIM video

Freiburg2-desk is a sequence of images taken by a single camera. The scene is a typical office with two desks; objects such as a phone, a book, and a cup are placed on the desks. The camera rotation and translation for each frame are given (Fig. 4).

Fig. 4 A sample frame of the Freiburg2-desk dataset

The IUST-DESK sequence was captured of a desk with a single camera. Objects such as a phone, a monitor, a laptop, two books, a keyboard, and a mouse are placed on the desk. No ground-truth camera pose (rotation and translation) is available for the frames (Fig. 5).

Fig. 5 A sample frame of the IUST-DESK dataset

Fig. 6 Results of augmenting a cube using error-free feature points

4.2 Error criterion

Here, we measure the accuracy of the virtual object’s augmentation using the root mean square error (RMSE) in the image domain:

$$\begin{aligned} e=\sqrt{\frac{\sum _{i=1}^{m} \sum _{j\in \mathcal {V}_i} \left\Vert \textbf{x}_{ij}-\hat{\textbf{x}}_{ij}\right\Vert ^2}{\sum _{i=1}^{m} |\mathcal {V}_i|}}, \end{aligned}$$
(8)

where m is the number of frames, \(\mathcal {V}_i\) is the set of visible points in frame i, \(\hat{\textbf{x}}_{ij}\) is the estimated 2D location of the j-th VO point in frame i, and \(\textbf{x}_{ij}\) is its true location obtained from the true camera pose given in the dataset.
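A minimal NumPy sketch of this criterion, assuming `gt[i]` and `est[i]` are \((|\mathcal{V}_i|, 2)\) arrays of the true and estimated 2D locations of the visible VO points in frame i:

```python
import numpy as np

def augmentation_rmse(gt, est):
    # Sum of squared 2D errors over all visible VO points in all frames
    sq_err = sum(float(np.sum((g - e) ** 2)) for g, e in zip(gt, est))
    # Total number of visible points
    n_pts = sum(len(g) for g in gt)
    return np.sqrt(sq_err / n_pts)
```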

4.3 Results of ICL-NUIM sequence

Two tests were done using the synthetically generated sequence. The first test checks the correctness of the proposed method: when there is no error in the tracking data, the result should have no error, and the virtual object should be placed in the correct location. In this test, the proposed method uses synthesized feature points and augments a cube into the video. The result indeed has no error, and the augmented cube is in the correct location (Fig. 6).

In the next experiment, the effect of errors in the tracked features is tested. Zero-mean Gaussian noise with different variances is added to the feature points, and the system’s error is calculated. The system’s error for the different noise levels is shown in Fig. 7.

Fig. 7 Effect of errors in the tracked features on the system’s result

Fig. 8 Our AR system’s result. A cube was augmented into the Freiburg2-desk video sequence

4.4 Results of the Freiburg2-desk sequence

In the experiment conducted on the Freiburg2-desk dataset (Fig. 8), a comparative evaluation was performed between the proposed algorithm and ORB-SLAM. First, we ran the ORB-SLAM algorithm and obtained the camera pose in all frames. Having the camera intrinsics, the 2D locations of the VO points can then be computed. To ensure a fair and accurate comparison, the first four keyframes of the proposed algorithm were initialized with the identical data employed by the ORB-SLAM method (see Sec. 3.1). Consequently, the initial values of the proposed algorithm precisely matched those of the ORB-SLAM algorithm, ensuring that the initial errors of both approaches were equivalent. The proposed algorithm was then employed to compute the positions of the virtual object points for the subsequent frames, enabling a comprehensive evaluation of its performance. The RMSE of the augmented points in each frame is shown in Fig. 9 for our system and for ORB-SLAM. The average root mean square error per frame is 11.6 pixels for our system and 9.1 pixels for ORB-SLAM. Notice that the proposed method implicitly relies on a projective reconstruction with 11 degrees of freedom and does not assume that the camera intrinsics are fixed across frames. In contrast, the ORB-SLAM method takes the camera’s internal parameters as input and operates with 6 degrees of freedom for the camera pose. Despite this disparity, the accuracy achieved by the proposed method is on par with that of ORB-SLAM.

Fig. 9 The root mean square error of the augmented VO points is shown for every frame within the Freiburg2-desk sequence using both the proposed method (blue line) and the ORB-SLAM method (red line)

Figures 10 and 11 illustrate the results of our algorithm applied to augment different objects within this dataset.

Fig. 10 Our AR system’s result. An icosahedron was augmented into the Freiburg2-desk video sequence

Fig. 11 The frames illustrate a teapot with the calculated VO points, augmented into the Freiburg2-desk dataset

4.5 Varying camera intrinsics

In order to evaluate the performance of the proposed method under varying camera intrinsic parameters, specifically the focal length, the Freiburg2-desk dataset was synthetically modified. Subsequently, the modified dataset was used to compare the outputs of the proposed method and ORB-SLAM.

When the camera’s internal parameters undergo changes, the ORB-SLAM method encounters difficulties in maintaining its functionality. To address this limitation, a novel approach called posest-ORB-SLAM was developed by integrating ORB-SLAM with the posest algorithm [39], which is a PnPf (Perspective-n-Point with unknown focal length) technique. This integration enables the posest-ORB-SLAM method to effectively handle inputs with varying focal lengths. The combination works as follows: in each frame, the 3D points and their corresponding 2D points, which are computed by ORB-SLAM in the previous frame, are passed to the posest algorithm to estimate the focal length. Subsequently, the effects of focal length variations on the input images are eliminated, ensuring that the images provided to ORB-SLAM have a consistent and fixed focal length.
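The normalization step of this pipeline can be sketched as follows, assuming the current frame’s focal length `f_est` has already been estimated by the PnPf step (posest in the paper) and that `f_ref`, `cx`, and `cy` are the reference focal length and principal point; this is an illustrative reconstruction of the described procedure, not the actual posest-ORB-SLAM code.

```python
import cv2
import numpy as np

def normalize_focal(img, f_est, f_ref, cx, cy):
    # Rescaling the image about the principal point changes the effective
    # focal length from f_est to f_ref, so ORB-SLAM always sees a frame
    # with a fixed focal length.
    s = f_ref / f_est
    M = np.float32([[s, 0, (1 - s) * cx],
                    [0, s, (1 - s) * cy]])
    return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
```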

To compare the proposed method with posest-ORB-SLAM, an approach similar to that of the previous section was followed. Specifically, the output of posest-ORB-SLAM was used to initialize the proposed method. This ensured a consistent starting point for both methods, enabling a fair evaluation of their respective performances. The average root mean square error per frame is 16.4 pixels for our system and 18.6 pixels for posest-ORB-SLAM. The root mean square error (RMSE) of the augmented points in each frame is depicted in Fig. 12.

Fig. 12 The upper chart illustrates the variations of the focal length in each frame, and the lower chart represents the root mean square error of the augmented VO points for every frame within the uncalibrated sequence using both the proposed method (blue line) and the posest-ORB-SLAM method (red line)

4.6 Results of the IUST-DESK

We applied our method to augment a cube into the captured sequence. Since there is no ground truth available for this dataset, we visually assessed the results. The cube was correctly added to the video (Fig. 13), and the depth-buffering step functioned effectively (Fig. 14).

Fig. 13 The proposed algorithm augmented a cube into the IUST-DESK video sequence

Fig. 14 The depth-buffering step of the proposed method functioned well on the IUST-DESK dataset

5 Conclusion

We have presented a novel augmented reality system that eliminates the need for camera pose estimation. We have addressed the major challenges, including depth-buffering and handling erroneous fundamental matrices. Although our approach handles 11 degrees of freedom compared to the 6 of SLAM, it still provides adequate accuracy, and it performs better in scenarios where the camera intrinsics vary over time. One of the key components improving the accuracy of SLAM-based methods is bundle adjustment, particularly in conjunction with loop closing; our method currently lacks such a component. An equivalent stage in our method could be fine-tuning the locations of the 2D points in the keyframes by enforcing consistency among the epipolar relations between the keyframes. The current method requires the relative depths of the VO points in the first few keyframes to perform depth-buffering. It remains an open question whether depth-buffering is possible under less restrictive assumptions. One could also combine our method with SLAM-based methods, for instance by using our method only for non-keyframes.