1 Introduction

Augmented Reality (AR) is an important topic with applications in various fields such as entertainment [1,2,3], medicine [4, 5], industry [6], education [7, 8] and tourism [9, 10]. The key component of most AR systems is pose estimation: keeping track of camera rotation and translation so that the virtual object (VO) can be embedded in the scene as the viewpoint changes. Here, we offer an alternative approach that removes the need for pose estimation or for building a 3D map of the environment.

The main idea of our proposed method is to find the virtual object’s location directly, without computing camera poses. To this end, we use fundamental matrices between each new frame and previous keyframes. Unlike SLAM-based methods, we do not need to maintain a point cloud. Moreover, our algorithm works when the intrinsic camera parameters change over time, and it remains effective despite operating with more degrees of freedom (11 for a projective camera versus 6 for a calibrated pose). Our major contributions can be summarized as follows:

  1. A new AR algorithm that works without pose estimation,

  2. A robust algorithm to compute virtual object’s locations from the fundamental matrices between each keyframe and a new frame,

  3. A novel approach for depth-buffering that only uses the fundamental matrices and does not require the 3D points or their depths.

In Section 2, we provide an overview of the related work. This is followed by Section 3, detailing our proposed method. Section 4 presents our results. Finally, Section 5 concludes the paper.

2 Related work

Recent AR systems utilize SLAM algorithms for camera pose estimation. SLAM systems typically operate in two threads: mapping and tracking [11,12,13,14,15,16,17]. The mapping thread is responsible for extracting the 3D locations of the tracked features. This process is primarily performed for keyframes, where ample time is available for precise refinement of the scene structure. Frames not selected as keyframes are processed in the tracking thread. One fundamental step in SLAM systems is feature tracking, which is carried out in the tracking thread. Subsequently, the camera pose is computed for all frames. Several factors are considered for keyframe selection, including tracking quality [18], the distance from the previous keyframe [13, 18], the minimum distance between the camera and a key-point in the 3D point cloud [18], the number of tracked features in the new frame [13], and the overlap between the current frame and the previous keyframe [13, 18].

Klein and Murray first introduced the concept of keyframes, and the use of tracking and mapping threads [12, 19, 20]. A SLAM system utilizing ORB feature points (ORB-SLAM) was proposed by Mur-Artal et al. [13, 21, 22]. The ORB (Oriented FAST and Rotated BRIEF) features are suitable for online systems [23], enabling real-time performance even without the need for GPUs. Additionally, this system incorporates another thread called loop closing.

AR algorithms typically require the location of objects in a reference frame. To achieve this, they often employ SLAM to establish an initial 3D model upon which the 3D objects are placed. A common approach is to use two non-consecutive frames and to initialize camera parameters. The detected features in the first frame are tracked throughout the sequence up to the second frame. The relative camera pose between the two frames is then calculated using the corresponding features.

Table 1 A comparison between different stages of typical AR algorithms and our algorithm

Most SLAM methods rely on camera intrinsic parameters. Nonetheless, to address uncalibrated environments, a viable approach is to utilize auto-calibration methods. Chawla et al. [24] add an auto-calibration method to ORB-SLAM so that their algorithm works in uncalibrated settings. Ling et al. [25] proposed a method for augmented reality when the intrinsic camera parameters are unknown. They utilized Kruppa’s equations to determine the intrinsic camera matrix. In their approach, the intrinsic parameters are unknown but remain constant throughout the video. In contrast, our method functions even when the intrinsic parameters change during the video sequence.

Kutulakos and Vallino [26] do not use any camera intrinsic parameters. They model the camera-to-image transformation using a weak-perspective camera and show that the projections of four non-coplanar fiducial points are enough to find the locations of all other points in a frame. They obtain the location of a point as a linear combination of the known points. In contrast, our method works for fully projective cameras.

Seo et al. [27, 28] propose an approach that does not need intrinsic camera parameters as input. First, they build a projective reconstruction for a pair of frames, and extend it to the rest of the frames using camera resectioning. Using a set of 2D control points annotated by the user in a pair of frames, they remove the projective ambiguity, obtaining a metric reconstruction. Having the metric reconstruction, they can project the virtual object (VO) points and also perform depth-buffering. Wang et al. [29] merge geometric constraints with a deep learning method that uses a probabilistic model to predict the intrinsic and extrinsic parameters of images. In contrast, our method bypasses the 3D reconstruction step and performs the rendering completely in the 2D domain.

3 The proposed method

Table 1 contrasts the different stages of a generic AR algorithm and our method. Stages 1, 2, and 5 are typically implemented using a SLAM system, providing the camera pose and 3D object locations in each frame. Our main challenge is to render the virtual object (VO) with no access to the camera pose or the 3D object locations in every view.

Figure 1 shows the flow of our algorithm. Initialization is performed once at the beginning of our algorithm (Sec. 3.1). Then, for each frame, 2D local features are extracted and/or tracked (Sec. 3.2). Using these, the fundamental matrices between the current frame and the 6 previous keyframes are computed (Sec. 3.2). Given the 2D object locations in the keyframes, the 2D locations in the new frame are computed as the intersection of the epipolar lines (Sec. 3.3). Finally, we perform depth-buffering to find the relative depths of the virtual object points and render the object in the 2D frame (Sec. 3.4). In the next subsections we explain the different stages of our method.

Fig. 1 The outline of the proposed AR system

3.1 Initialization

In the proposed method, initialization involves the assignment of the 2D virtual point positions and their relative depths in the initial keyframes. These are the only requirements of our algorithm. To accomplish this, various methods can be employed. While we could use any SfM method for the initial 3D reconstruction, in our experiments we simply utilize the ORB-SLAM method up until the first four keyframes. This aligns the initialization step of our algorithm with that of ORB-SLAM, providing a baseline for comparison. This choice has the downside of requiring an initial camera calibration; however, after initialization, the internal parameters can vary. Notice that the 2D locations of the VO points in the subsequent keyframes are computed in the same way as for ordinary frames, using the method of Sec. 3.3. Due to the scale ambiguity, the depths computed by ORB-SLAM are relative, in the sense that they relate to the actual depths by a common positive scale factor.

3.2 Frame processing

We extract Shi-Tomasi features [30] (improved Harris) in the keyframes and track them in the subsequent frames using the Lucas-Kanade method with a backward check. The fundamental matrices (\(\texttt{F}\)) between the keyframes and the current frame are then computed from the tracked points. The fundamental matrix \(\texttt{F}\) relates corresponding points (\(x_1,x_2\)) in two frames through the equation \(x_2^T \texttt{F}x_1=0\). The accuracy of the estimated fundamental matrix degrades when a significant number of points lie on a single plane, which is the case in many AR scenarios, including the dataset we use. We overcome this as follows. First, we try to establish a homography (\(\texttt{H}\)) relation between the two frames that includes as many 2D point correspondences as possible. This can be done simply using RANSAC (Random Sample Consensus). Employing the equation \(\texttt{F}^T \texttt{H}+ \texttt{H}^T \texttt{F}= 0\), the obtained homography matrix gives 6 linear equations representing 5 independent constraints on the fundamental matrix [31, 32]. Each non-planar correspondence also gives one linear equation, resulting in (say) n further equations. We compute the fundamental matrix by solving all \(6{+}n\) linear equations (again applying RANSAC).
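For reference, the following is a minimal OpenCV/NumPy sketch of this stage under simplifying assumptions: it performs Shi-Tomasi detection, Lucas-Kanade tracking with a forward-backward check, and a plain RANSAC fundamental-matrix fit. The additional homography-derived linear constraints described above for near-planar scenes are not included, and the parameter values (feature count, thresholds) are illustrative rather than those of our implementation.

```python
# Minimal sketch of the frame-processing step, assuming grayscale uint8
# images `kf_img` (keyframe) and `cur_img` (current frame).
import cv2
import numpy as np

def track_and_estimate_F(kf_img, cur_img, fb_thresh=1.0):
    # Shi-Tomasi (improved Harris) corners in the keyframe
    pts_kf = cv2.goodFeaturesToTrack(kf_img, maxCorners=1000,
                                     qualityLevel=0.01, minDistance=7)
    # Forward Lucas-Kanade tracking into the current frame
    pts_cur, st, _ = cv2.calcOpticalFlowPyrLK(kf_img, cur_img, pts_kf, None)
    # Backward check: track back and keep points that return close to the start
    pts_back, st_b, _ = cv2.calcOpticalFlowPyrLK(cur_img, kf_img, pts_cur, None)
    fb_err = np.linalg.norm(pts_kf - pts_back, axis=2).ravel()
    good = (st.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < fb_thresh)
    p1, p2 = pts_kf[good].reshape(-1, 2), pts_cur[good].reshape(-1, 2)
    # Fundamental matrix with RANSAC (convention x2^T F x1 = 0)
    F, inliers = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.999)
    return F, p1, p2, inliers
```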

3.3 Finding 2D virtual object’s locations

Here, we do not have access to the 3D locations of the VO points, but rather to their 2D locations in the keyframes. To find the 2D locations in the current frame, we first compute the fundamental matrices between the current frame and each of the past 6 keyframes using the tracked features (see Sec. 3.2). The number 6 was chosen empirically: using more keyframes significantly increases the computational cost without a substantial accuracy improvement, while using fewer keyframes notably decreases accuracy. Let \(\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_6\) be the locations of a VO point in the six most recent keyframes, and \(\texttt{F}_i\) be the fundamental matrix between the corresponding keyframe and the current frame. The 2D location \(\textbf{x}\) of the VO point in the current frame can be computed as the intersection of the epipolar lines \(\textbf{l}_i = \texttt{F}_i^T\textbf{x}_{i}\) for \(i=1,2,\ldots ,6\).

Here, we face two problems. First, the estimated epipolar lines do not exactly intersect at one point. Second, some of the fundamental matrices may be totally incorrect. We propose a robust solution by minimizing the sum of the distances of \(\textbf{x}\) to the epipolar lines:

$$\begin{aligned} \textbf{x}_{\text {opt}} = \underset{\textbf{x}}{\text {argmin}} \sum _{i=1}^{n} \textrm{dist}(\textbf{x}, \textbf{l}_i). \end{aligned}$$
(1)

Here, \(\textrm{dist}(\textbf{x}, \textbf{l})\) is the Euclidean distance of the point \(\textbf{x}\) from the line \(\textbf{l}\). Notice that minimizing the sum of distances, rather than the sum of squared distances, gives a solution that is robust to outlier epipolar lines resulting from erroneous fundamental matrices. To minimize the above, we use a modification of Weiszfeld’s algorithm proposed in [33] (a sketch follows the steps below):

  1. Set \(w_i=1\) for \(1 \le i \le n\),

  2. Repeat until convergence:

     1. Find \(\textbf{x}_r = \text {argmin}_{\textbf{x}} \sum _{i=1}^{n} w_i ~\textrm{dist}(\textbf{x}, \textbf{l}_i)^2\), in closed form,

     2. Update \(w_i = 1/\textrm{dist}(\textbf{x}_r, \textbf{l}_i)\).
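Below is a minimal NumPy sketch of this iteration (an iteratively-reweighted least-squares reading of the modified Weiszfeld scheme), assuming `lines` is an \(n \times 3\) array of epipolar lines in homogeneous form \((a, b, c)\) with \(ax + by + c = 0\); the fixed iteration count and the division guard are illustrative choices.

```python
import numpy as np

def robust_intersection(lines, n_iter=20, eps=1e-9):
    # Normalize so (a, b) is a unit normal; then |a*x + b*y + c| is the
    # Euclidean point-to-line distance.
    lines = lines / np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    A, c = lines[:, :2], lines[:, 2]
    w = np.ones(len(lines))                  # step 1: w_i = 1
    x = np.zeros(2)
    for _ in range(n_iter):                  # step 2: iterate
        # 2.1: closed-form weighted least squares,
        #      minimizing sum_i w_i * (a_i x + b_i y + c_i)^2
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, -A.T @ W @ c)
        # 2.2: reweight by inverse distance (guarded against division by zero)
        w = 1.0 / np.maximum(np.abs(A @ x + c), eps)
    return x
```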

3.4 Depth-buffering

Depth-buffering or Z-buffering is perhaps the most challenging obstacle in using our method for rendering 3D objects. How can we tell which object point is behind which without direct access to the 3D object points? Additionally, we must identify the VO points situated behind the camera and exclude them from rendering. In this section, we provide a novel solution solely using epipolar relations.

Let \(\texttt{P}_1, \texttt{P}_2, \ldots , \texttt{P}_m\) be the projection matrices for frames 1 to m, and \(\textbf{X}_1, \textbf{X}_2, \ldots , \textbf{X}_n\) be the 3D VO points. The j-th point \(\textbf{X}_j=(X_j,Y_j,Z_j,1)^T\) is projected to the 2D point \(\textbf{x}_{ij} = (x_{ij}, y_{ij}, 1)^T\) in frame i according to

$$\begin{aligned} \lambda _{ij} \textbf{x}_{ij} = \texttt{P}_i \textbf{X}_j. \end{aligned}$$
(2)

The scalar \(\lambda _{ij}\) is usually called the projective depth. For depth-buffering, however, we are interested in the actual depth of \(\textbf{X}_j\) in view i, which we denote by \(d_{ij}\). These two quantities are related by [34, Sec. 6.2.3]

$$\begin{aligned} d_{ij} = \frac{\textrm{sign}(\det \texttt{M}_i)}{\left\Vert \textbf{m}_{i}\right\Vert } \, \lambda _{ij} \end{aligned}$$
(3)

or

$$\begin{aligned} \lambda _{ij} = \frac{\left\Vert \textbf{m}_{i}\right\Vert }{\textrm{sign}(\det \texttt{M}_i)} \, d_{ij} = \gamma _i \, d_{ij}, \end{aligned}$$
(4)

where the matrix \(\texttt{M}_i\) comprises the first three columns of \(\texttt{P}_i\), and \(\textbf{m}_{i}\) is the third row of \(\texttt{M}_i\). Notice that the above is only valid if \(\texttt{P}_i\) and \(\textbf{X}_j\) are the true (not reconstructed) entities, and the final coordinate of \(\textbf{X}_j\) is chosen equal to 1. The relation (4) shows that for a particular view i the projective depths are related to the actual depths by a common scale factor \(\gamma _i = \left\Vert \textbf{m}_{i}\right\Vert /\textrm{sign}(\det \texttt{M}_i)\).
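As a small sanity check of (2)-(4), the following NumPy sketch recovers the actual depth from a true projection matrix; it is purely illustrative, since in our uncalibrated setting the true \(\texttt{P}_i\) is never available.

```python
import numpy as np

def depth_from_projection(P, X):
    # P: true 3x4 projection matrix, X: 3D point (X, Y, Z, 1)^T
    x_h = P @ X                         # lambda * (x, y, 1)^T, Eq. (2)
    lam = x_h[2]                        # projective depth lambda
    M = P[:, :3]                        # first three columns of P
    m3 = M[2, :]                        # third row of M
    gamma = np.linalg.norm(m3) / np.sign(np.linalg.det(M))
    return lam / gamma                  # actual depth d, Eq. (3)
```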

In uncalibrated scenarios, we do not have access to the true projective depths \(\lambda _{ij}\), but rather the reconstructed depths \(\hat{\lambda }_{ij}\) coming from a projective reconstruction \(\hat{\lambda }_{ij} \textbf{x}_{ij} = \hat{\texttt{P}}_i \hat{\textbf{X}}_j\). The reconstructed entities \(\{\hat{\texttt{P}}_i\}\) and \(\{\hat{\textbf{X}}_j\}\) are related to true \(\texttt{P}_i\)-s and \(\textbf{X}_j\)-s by \(\hat{\texttt{P}}_i = \alpha _i \texttt{P}_i \texttt{H}\) and \(\hat{\textbf{X}}_j = \beta _j \texttt{H}^{-1} \textbf{X}_j\) for some homography \(\texttt{H}\) and scalars \(\alpha _1, \alpha _2, \ldots , \alpha _m\) and \(\beta _1, \beta _2, \ldots , \beta _n\). Consequently, the reconstructed and true projective depths are related by [34, Sec. 18.4]

$$\begin{aligned} \hat{\lambda }_{ij} = \alpha _i \, \beta _j \, \lambda _{ij}, \end{aligned}$$
(5)

for some scalars \(\alpha _1, \alpha _2, \ldots , \alpha _m\) and \(\beta _1, \beta _2, \ldots , \beta _n\).

Fortunately, we do not have to perform a projective reconstruction to find the \(\hat{\lambda }_{ij}\)-s. There is a method for recovering the reconstructed projective depths of a new frame from those of an old frame directly using the epipolar relations [35, 36]. Let \(\textbf{x}_{kj}\) be the projection of the j-th point into the k-th frame (which is a keyframe), and \(\textbf{x}_{ij}\) be the corresponding point in a new frame i. The reconstructed projective depths are then related by

$$\begin{aligned} \hat{\lambda }_{ij} = \frac{(\textbf{e}_{ik}\times \textbf{x}_{ij})\cdot (\texttt{F}_{ik} \textbf{x}_{kj})}{\left\Vert \textbf{e}_{ik}\times \textbf{x}_{ij}\right\Vert ^2} \, \hat{\lambda }_{kj} \end{aligned}$$
(6)

where \(\texttt{F}_{ik}\) is the fundamental matrix between frames k and i, and \(\textbf{e}_{ik}\) is the epipole in frame i. The epipole \(\textbf{e}_{ik}\) can be derived from \(\texttt{F}_{ik}\) as its left null vector, since every epipolar line in frame i passes through it.
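A minimal NumPy sketch of this transfer is given below; it assumes the convention above (\(\texttt{F}_{ik}\) maps points of keyframe k to epipolar lines in frame i) and homogeneous 2D points with last coordinate 1.

```python
import numpy as np

def epipole_from_F(F_ik):
    # The epipole in frame i is the left null vector of F_ik: e^T F_ik = 0.
    _, _, Vt = np.linalg.svd(F_ik.T)
    return Vt[-1]

def transfer_depth(F_ik, x_k, x_i, lam_k):
    # Implements Eq. (6): propagate the reconstructed projective depth
    # of point j from keyframe k to the new frame i.
    e_i = epipole_from_F(F_ik)
    cross = np.cross(e_i, x_i)          # e_ik x x_ij
    line = F_ik @ x_k                   # F_ik x_kj (epipolar line in frame i)
    return (cross @ line) / (cross @ cross) * lam_k
```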

From the initialization step (Sec. 3.1) we have the depths \(d_{ij}\) (up to a positive scale factor) for the first four keyframes. Hence, one can easily perform depth-buffering for these frames. We can also obtain the projective depths \(\lambda _{ij}\) (up to scale) for these four views directly from the initialization step, or by using (4). We set \(\hat{\lambda }_{ij} = \lambda _{ij}\) for the first four keyframes. This removes the ambiguity caused by \(\beta _1, \beta _2, \ldots , \beta _n\), reducing (5) to \(\hat{\lambda }_{ij} = \alpha _i \, \lambda _{ij}\). Now, from (4) we get

$$\begin{aligned} \hat{\lambda }_{ij} = \alpha _i\gamma _i d_{ij}. \end{aligned}$$
(7)

Our goal is to use the \(\hat{\lambda }_{ij}\)-s in lieu of the \(d_{ij}\)-s to perform depth buffering. To do this, we need to resolve the sign ambiguity arising from \(\alpha _i\gamma _i\). There are two possibilities: if \(\alpha _i\gamma _i > 0\), then from (7) the points for which \(\hat{\lambda }_{ij} > 0\) are in front of the camera and we can perform depth buffering using the \(\hat{\lambda }_{ij}\)-s. But if \(\alpha _i\gamma _i < 0\), the points for which \(\hat{\lambda }_{ij} < 0\) are in front of the camera. In this case, we need to negate all \(\hat{\lambda }_{ij}\)-s for frame i before performing depth buffering. This introduces an ambiguity since \(\alpha _i\gamma _i\) is unknown. To resolve it, we simply assume that the majority of points keep their state (behind or in front of the camera) between two consecutive frames. In a new frame, we choose the configuration that is more consistent with the previous one.
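The following small sketch illustrates this majority check, assuming `lam_prev` and `lam_cur` hold the \(\hat{\lambda }\) values of the same VO points in the previous and current frames, with `lam_prev` already sign-corrected.

```python
import numpy as np

def fix_depth_signs(lam_prev, lam_cur):
    # Count how many points keep their side of the camera under each choice
    # of the unknown sign of alpha_i * gamma_i.
    keep_as_is = np.sum(np.sign(lam_cur) == np.sign(lam_prev))
    keep_flipped = np.sum(np.sign(-lam_cur) == np.sign(lam_prev))
    # Choose the configuration more consistent with the previous frame.
    return lam_cur if keep_as_is >= keep_flipped else -lam_cur
```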

In (6), we need to choose a keyframe k to obtain the projective depths \(\hat{\lambda }_{ij}\) for frame i. Among the past 6 keyframes, each 2D point (computed using Weiszfeld’s algorithm) votes for the keyframe with the closest epipolar line. We then choose the keyframe with the highest number of votes to be used in (6).
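A minimal sketch of this voting scheme, assuming `lines[k][j]` holds the epipolar line in the current frame induced by keyframe k for VO point j, and `points[j]` is the corresponding 2D location from the Weiszfeld iteration (homogeneous, last coordinate 1):

```python
import numpy as np

def vote_for_keyframe(lines, points):
    votes = np.zeros(len(lines), dtype=int)
    for j, x in enumerate(points):
        # Distance of point j to its epipolar line from each keyframe
        d = [abs(l[j] @ x) / np.linalg.norm(l[j][:2]) for l in lines]
        votes[int(np.argmin(d))] += 1       # vote for the closest line
    return int(np.argmax(votes))            # keyframe index used in Eq. (6)
```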

Figure 2 shows the outcome of rendering the object within the 2D frame.

Fig. 2 Using the computed \(\lambda \), it is possible to determine which point is visible and which point is occluded by other points

3.5 Keyframe processing

We choose the current frame as a keyframe if

  1. The current frame is more than 20 frames away from the last keyframe, or

  2. There are fewer than 110 feature points tracked from the latest keyframe and more than 4 frames have passed since the latest keyframe.

The first condition ensures that the keyframes are not excessively distant from one another. As for the second condition, when the number of tracked feature points falls below 110, it indicates a decrease in tracking accuracy, which consequently leads to a reduction in the precision of the estimated fundamental matrix. In the new keyframe, we extract new features and add them to the feature list.
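These two rules amount to the simple predicate sketched below; the thresholds are those stated above.

```python
def is_new_keyframe(frame_idx, last_kf_idx, num_tracked,
                    max_gap=20, min_tracked=110, min_gap=4):
    gap = frame_idx - last_kf_idx
    far_from_last_kf = gap > max_gap                                 # condition 1
    tracking_degraded = num_tracked < min_tracked and gap > min_gap  # condition 2
    return far_from_last_kf or tracking_degraded
```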

4 Results

4.1 Datasets

We tested our method on three sequences: a synthetically generated sequence from ICL-NUIM [37], the Freiburg2-desk from TUM RGB-D dataset [38], and a sequence we captured using a mobile camera (IUST-DESK).

The ICL-NUIM dataset has two scenes: a living room and an office room. Each scene consists of 3D models of different objects that together form the scene. The POV-Ray software can render a video from these scenes; besides the 3D models, it needs the intrinsic and extrinsic camera parameters. We selected the office room scene from this dataset to test the proposed system. A new camera path was defined in this scene, and a new sequence was rendered with POV-Ray (Fig. 3). Some 3D points of the scene were selected as features and projected into each frame using the camera parameters. In the generated path, the camera rotates around a desk in the scene.

Fig. 3 A sample frame of the new ICL-NUIM video

Freiburg2-desk is a sequence of images taken by a single camera. The scene is a typical office with two desks; objects such as a phone, a book, and a cup are placed on the desks. The camera rotation and translation for each frame are given (Fig. 4).

Fig. 4 A sample frame of the Freiburg2-desk dataset

The IUST-DESK sequence was captured of a desk with a single camera. Objects such as a phone, a monitor, a laptop, two books, a keyboard, and a mouse are placed on the desk. No ground-truth camera pose (rotation and translation) is available for the frames (Fig. 5).

Fig. 5 A sample frame of the IUST-DESK dataset

Fig. 6 Results of augmenting a cube using error-free feature points

4.2 Error criterion

Here, we measure the accuracy of the virtual object’s augmentation using the root mean square error (RMSE) in the image domain:

$$\begin{aligned} e=\sqrt{\frac{\sum _{i=1}^{m} \sum _{j\in \mathcal {V}_i} \left\Vert \textbf{x}_{ij}-\hat{\textbf{x}}_{ij}\right\Vert ^2}{\sum _{i=1}^{m} |\mathcal {V}_i|}}, \end{aligned}$$
(8)

where m is the number of frames, \(\mathcal {V}_i\) is the set of visible points in frame i, \(\hat{\textbf{x}}_{ij}\) is the estimated 2D location of the j-th VO point in frame i, and \(\textbf{x}_{ij}\) is its true location obtained from the true camera pose given in the dataset.
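A minimal NumPy sketch of this criterion, assuming `gt[i]` and `est[i]` are \((|\mathcal{V}_i|, 2)\) arrays of the true and estimated 2D locations of the visible VO points in frame i:

```python
import numpy as np

def augmentation_rmse(gt, est):
    # Sum of squared 2D errors over all visible VO points in all frames
    sq_err = sum(float(np.sum((g - e) ** 2)) for g, e in zip(gt, est))
    # Total number of visible points
    n_pts = sum(len(g) for g in gt)
    return np.sqrt(sq_err / n_pts)
```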

4.3 Results of ICL-NUIM sequence

Two tests were done using the synthetically generated sequence. The first test checks the correctness of the proposed method: when there is no error in the tracking data, the result should have no error, and the virtual object should be placed in the correct location. In this test, the proposed method uses synthesized feature points and augments a cube into the video. The result indeed has no error, and the augmented cube is in the correct location (Fig. 6).

In the next experiment, the effect of errors in the tracked features is tested. Zero-mean Gaussian noise with different variances is added to the feature points, and the system’s error is calculated. The system’s error for the different noise levels is shown in Fig. 7.

Fig. 7 Effect of errors in the tracked features on the system’s result

Fig. 8 Our AR system’s result. A cube was augmented into the Freiburg2-desk video sequence

4.4 Results of the Freiburg2-desk sequence

In the experiment conducted on the Freiburg2-desk dataset (Fig. 8), a comparative evaluation was performed between the proposed algorithm and ORB-SLAM. First, we ran the ORB-SLAM algorithm and obtained the camera pose in all frames. Having the camera intrinsics, the 2D locations of the VO points can then be computed. To ensure a fair and accurate comparison, the first four keyframes of the proposed algorithm were initialized with the identical data employed by the ORB-SLAM method (see Sec. 3.1). Consequently, the initial values of the proposed algorithm precisely matched those of the ORB-SLAM algorithm, ensuring that the initial errors of both approaches were equivalent. The proposed algorithm was then employed to compute the positions of the virtual object points for the subsequent frames, enabling a comprehensive evaluation of its performance. The RMSE of the augmented points in each frame is shown in Fig. 9 for our system and for ORB-SLAM. The average root mean square error per frame is 11.6 pixels for our system and 9.1 pixels for ORB-SLAM. Notice that the proposed method implicitly relies on a projective reconstruction with 11 degrees of freedom and does not assume that the camera intrinsics are fixed across frames. In contrast, the ORB-SLAM method takes the camera’s internal parameters as input and operates with 6 degrees of freedom for the camera pose. Despite this disparity, the accuracy achieved by the proposed method is on par with that of ORB-SLAM.

Fig. 9 The root mean square error of the augmented VO points is shown for every frame within the Freiburg2-desk sequence using both the proposed method (blue line) and the ORB-SLAM method (red line)

Figures 10 and 11 illustrate the results of our algorithm applied to augment different objects within this dataset.

Fig. 10 Our AR system’s result. An icosahedron was augmented into the Freiburg2-desk video sequence

Fig. 11 The frames illustrate a teapot with the calculated VO points, augmented into the Freiburg2-desk dataset

4.5 Varying camera intrinsics

In order to evaluate the performance of the proposed method under varying camera intrinsic parameters, specifically the focal length, the Freiburg2-desk dataset was synthetically modified. Subsequently, the modified dataset was used to compare the outputs of the proposed method and ORB-SLAM.

When the camera’s internal parameters undergo changes, the ORB-SLAM method encounters difficulties in maintaining its functionality. To address this limitation, a novel approach called posest-ORB-SLAM was developed by integrating ORB-SLAM with the posest algorithm [39], which is a PnPf (Perspective-n-Point with unknown focal length) technique. This integration enables the posest-ORB-SLAM method to effectively handle inputs with varying focal lengths. The combination works as follows: in each frame, the 3D points and their corresponding 2D points, which are computed by ORB-SLAM in the previous frame, are passed to the posest algorithm to estimate the focal length. Subsequently, the effects of focal length variations on the input images are eliminated, ensuring that the images provided to ORB-SLAM have a consistent and fixed focal length.
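The normalization step of this pipeline can be sketched as follows, assuming the current frame’s focal length `f_est` has already been estimated by the PnPf step (posest in the paper) and that `f_ref`, `cx`, and `cy` are the reference focal length and principal point; this is an illustrative reconstruction of the described procedure, not the actual posest-ORB-SLAM code.

```python
import cv2
import numpy as np

def normalize_focal(img, f_est, f_ref, cx, cy):
    # Rescaling the image about the principal point changes the effective
    # focal length from f_est to f_ref, so ORB-SLAM always sees a frame
    # with a fixed focal length.
    s = f_ref / f_est
    M = np.float32([[s, 0, (1 - s) * cx],
                    [0, s, (1 - s) * cy]])
    return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
```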

To compare the proposed method with posest-ORB-SLAM, an approach similar to that of the previous section was followed. Specifically, the output of posest-ORB-SLAM was used to initialize the proposed method. This ensured a consistent starting point for both methods, enabling a fair evaluation of their respective performances. The average root mean square error per frame is 16.4 pixels for our system and 18.6 pixels for posest-ORB-SLAM. The root mean square error (RMSE) of the augmented points in each frame is depicted in Fig. 12.

Fig. 12 The upper chart illustrates the variations of the focal length in each frame, and the lower chart represents the root mean square error of the augmented VO points for every frame within the uncalibrated sequence using both the proposed method (blue line) and the posest-ORB-SLAM method (red line)

4.6 Results of the IUST-DESK

We applied our method to augment a cube into the captured sequence. Since there is no ground truth available for this dataset, we visually assessed the results. The cube was correctly added to the video (Fig. 13), and the depth-buffering step functioned effectively (Fig. 14).

Fig. 13 The proposed algorithm augmented a cube into the IUST-DESK video sequence

Fig. 14 The depth-buffering step of the proposed method functioned well on the IUST-DESK dataset

5 Conclusion

We have presented a novel augmented reality system that eliminates the need for camera pose estimation. We have addressed the major challenges, including depth-buffering and handling erroneous fundamental matrices. Although our approach handles 11 degrees of freedom compared to the 6 of SLAM, it still provides adequate accuracy, and it performs better in scenarios where the camera intrinsics vary over time. One of the key components improving the accuracy of SLAM-based methods is bundle adjustment, particularly in conjunction with loop closing; our method currently lacks such a component. An equivalent stage in our method could be fine-tuning the locations of the 2D points in the keyframes by enforcing consistency among the epipolar relations between the keyframes. The current method requires the relative depths of the VO points in the first few keyframes to perform depth-buffering. It remains an open question whether depth-buffering is possible under less restrictive assumptions. One could also combine our method with SLAM-based methods, for instance by using our method only for non-keyframes.