1 Introduction

Extended Reality (XR) applications, including Augmented Reality (AR), Mixed Reality (MR) and Virtual Reality (VR), were launched from science fiction and have already landed in “reality”. These applications can reshape our society by enabling novel services, new ways of interaction and immersive experiences in many fields. Online gaming [1], Industry 4.0 [2], healthcare [3] and architecture [4] are just a few prominent examples with special importance and business potential. The features and requirements of the use cases vary widely; however, the systems under the hood and the applied concepts share common components and properties.

One of the key components mainly affecting the user experience is the AR device, such as AR glasses (e.g., Microsoft HoloLens 2, Magic Leap 1), head-mounted VR displays (e.g., Oculus Quest 2, Sony PlayStation VR, HTC Vive Cosmos) or AR-capable mobile phones (e.g., iPhone 12 Pro, Samsung S10). The other essential constituents are the special-purpose, compute-intensive functions running in the background, which enable real-time operation. For example, 3D rendering, simulation of the virtual 3D environment, continuous localization of the devices, and detection and tracking of objects, users and gestures are crucial tasks, and the performance characteristics of the respective implementations significantly affect the quality of the immersive experience. The heterogeneity of these services implies different hardware requirements [5], and there are multiple options for where to run the corresponding tasks. A straightforward strategy is to implement all functions in the device itself. For example, HoloLens 2 follows this approach and encompasses a “gamer PC” in the device with the needed GPU/CPU capacity and a versatile software stack. As a result, these glasses are very expensive. Another option is to split the rendering function from the headset and deploy it to a local PC. VR headsets usually follow this approach and require a dedicated connection to a local rendering engine.

But why should we stop there? Today, several providers operate cloud-based remote rendering platforms, and customers can rent the resources on demand instead of buying expensive hardware. Of course, this setup requires stable, high-speed and low-latency network connections. These services are often combined with online gaming offerings, such as NVIDIA GeForce Now or PlayStation Now. Furthermore, we can go further in utilizing cloud platforms: other functions can also be offloaded to the cloud if the end-to-end latency is controlled carefully and kept within a predefined bound. Edge computing is an emerging concept bringing compute resources closer to the users and end devices. Thus, it provides a suitable platform and execution environment for the backend components of latency-sensitive and compute-intensive applications, and it can significantly reduce the energy consumption of mobile AR devices. If we manage to offload XR functions, the battery life can be multiplied. Fortunately, the networking requirements are (or more precisely, will be) fulfilled by 5G systems or a new generation of WiFi networks [6].

In an earlier version of this work [7] we presented a novel edge cloud based AR platform for running distributed, multi-user AR applications. We investigated the benefits and also the limits of the approach and evaluated the performance of a proof-of-concept prototype which offloads the Simultaneous Localization And Mapping (SLAM) function.

In this paper, we extend that work with a study of the collaboration of multiple AR devices, energy consumption measurements and a feasibility analysis. Furthermore, we improved the DNN-based SLAM selector module (SIA), which selects the best result from different SLAMs according to the current conditions.

The rest of the paper is organized as follows. In Sect. 2, the related works are presented. Section 3 is devoted to the architecture including the main design goals and concepts. Section 4 describes the relevant implementation details and the revealed issues. Section 5 presents our measurement methodology, while in Sect. 6, we evaluate the performance of the overall system and summarize our main findings. Finally, Sect. 7 concludes the paper.

2 Related Work

In this section, the relevant SLAM algorithms and the challenges of collaboration among the participating devices are summarized. In addition, research on offloading SLAM functions to cloud or edge infrastructures is also presented.

2.1 SLAM Algorithms

Image-based camera localization is a key task in many of today’s hot research fields, such as robotics [8], autonomous vehicles [8] and also virtual and augmented reality [9]. During the last decades, several methods and approaches have been proposed by researchers to determine the pose, i.e., position and orientation, of the camera in real time (called visual odometry) and to build a map of the discovered environment. Such algorithms, called Simultaneous Localization and Mapping (SLAM), aim at creating a map of the surrounding environment and locating the device within it. The performance of visual SLAM algorithms heavily relies on the sensors exploited for tracking the devices. The most important one is the basic camera providing RGB color information for image processing purposes, and the advanced RGB-D camera extending the data with per-pixel depth information [10, 11]. Camera state acquisition can be refined in other steps of the algorithms based on additional sensors. The Inertial Measurement Unit (IMU), consisting of a gyroscope measuring angular velocity and orientation and an accelerometer tracking changes in linear acceleration, is the source of additional data for visual-inertial SLAMs [12].

The majority of existing techniques rely on visual geometric models and are called model-based SLAM methods [13]. On the one hand, monocular SLAM algorithms, using a single camera, implicitly estimate camera ego-motion while incrementally building the map of the environment. In [9], a novel benchmarking method is defined for this type of algorithm and several proposals are evaluated quantitatively. On the other hand, multi-ocular SLAM methods use two or more cameras in order to acquire color as well as depth information [14]. Model-based solutions can be divided into i) feature-based algorithms that aim to find certain features or key points of the environment, making use of e.g. filtering techniques, and compute the camera pose information and the map based on these observations, and ii) direct SLAM methods that use image intensities to estimate the location and surroundings [15]. For example, ROVIOLI [16] is a feature-based algorithm, while LSD-SLAM [17] is a direct method. Furthermore, keyframe-based SLAM approaches separate the localization and mapping steps [18, 19]: camera localization takes place on regular frames over a subset of the map, while optimization takes place on certain keyframes. Different techniques can also be combined in hybrid algorithms to achieve better performance (see e.g., ORB-SLAM3 [20,21,22,23,24]). Besides the model-based SLAM techniques, data-driven, deep learning based methods have also been proposed in recent years [25,26,27,28,29,30].

2.2 Collaboration

Other challenges arise when multiple AR devices collaborate [31], and the research community has only recently targeted this field [32,33,34,35]. For example, when multiple AR devices collaborate in the same physical environment, a joint coordinate system for the virtual 3D world has to be shared among the users. There are known methods for coordinate system synchronization, such as mechanisms based on well-known points (e.g., QR codes), anchor-based synchronization (e.g., ARAnchor) and distributed SLAM algorithms [35,36,37]. Well-known points and anchor-based methods are typically capable only of offline synchronization; during online operation, the devices’ local SLAMs and local maps are used. Distributed SLAM algorithms follow a distributed approach to build a global map where the map data is shared among the devices and the server side. However, this operation can raise severe security issues because each AR device can access data recorded by the camera of any other AR device.

In [38], a novel collaborative SLAM system is proposed where autonomous agents run visual-inertial odometry algorithms locally on the devices while sharing map information with a central server, which is responsible for merging and optimizing the global map. The solution works only with keyframe-based VIO systems. Another client–server based collaborative SLAM framework is proposed for service robots in [39]. Tracking and local mapping take place on the devices (clients) and, after processing, filtered data (keyframes and landmarks) are exchanged with the server, which is in charge of loop detection and map merging. Based on the client’s movement, a selected part of the global map is sent to the client on demand. Both solutions [38, 39] require processing on the devices and lack a full remote SLAM at the server side. In contrast, our proposal offloads the full SLAM stack to the edge infrastructure, and it supports arbitrary SLAM algorithms and additional AR services as the camera and IMU data are available at the edge cloud.

2.3 Offloading to Edge/Cloud

Computation-intensive algorithms in SLAM systems require proper equipment and fast, reliable processing solutions. Embedded systems, such as smartphones and smart glasses, are poorly equipped for this in the long run as they are constrained by their batteries. Edge and Cloud Computing offer the possibility to offload excessive computational tasks, e.g., the SLAM algorithm, from the devices to a server in the cloud or at the edge of the network, close to the user in terms of latency [40,41,42,43]. Thus, future augmented and mixed reality (AR/MR) applications can leverage the additional resources and dedicated hardware support (GPU) provided by cloud/edge servers for improved SLAM services, such as enhanced localization scalability and performance [44], improved response time based on parallel processing [45], collaborative localization by a globally synchronized map [46] or multi-user support [47], along with reduced power consumption of the AR device. Besides the SLAM service, other AR functions can also be offloaded to the cloud/edge infrastructure, including e.g. the rendering [48] or object tracking [49] functions. Of course, the capabilities and the characteristics of the underlying networks significantly impact the feasibility of these techniques.

There are two options for offloading the SLAM algorithms from the devices: partial or full offloading. For example, CloudSLAM, proposed in [50], partitions the different workflows of ORB-SLAM; as a result, tracking and local mapping are executed on the device (vehicle) and loop closure is executed on the edge. Similarly, the authors of [51] propose a functional split of the ORB-SLAM2 architecture between the edge and the mobile device. Their solution keeps the tracking computation on the device and moves the rest, i.e., local mapping and loop closure, to the edge. Both proposals are tightly coupled to ORB-SLAM, and modifications of the original algorithms are needed. On the other hand, our framework supports arbitrary off-the-shelf SLAM algorithms which can be incorporated into the system and combined during the operation. In [52], the impact of the processing power of different edge cloud systems on odometry and map generation is analyzed in detail. In these scenarios, the device executes only tasks related to camera data processing. The authors of [53] address fully remote SLAMs and present a novel buffering method to mitigate the impact of data losses in unreliable networks. ORBBuf optimizes the performance of the SLAM module by discarding the frames with the least impact on SLAM quality.

2.4 Edge/Cloud Platforms and 5G/6G

Different cloud platforms with several relevant services are available today, delivering the computing infrastructure for future XR applications. On the one hand, the three giants operating the leading public cloud platforms, i.e., Amazon Web Services [54], Google Cloud Platform [55] and Microsoft Azure [56], provide a wide range of services. At the end of the day, developers and application providers can compose and run virtual machines or software containers implementing the business logic, while the burden of operational tasks, including resource management, on-demand resource scaling or fault management, is delegated to the cloud providers. On the other hand, open-source technologies are also available to establish private cloud or edge platforms provisioning an “arbitrary” amount of virtual resources on demand. For example, OpenStack [57] and Kubernetes [58] are de facto industry standards for virtual machine and container management, respectively. These platforms and services, together with emerging network technologies such as beyond-5G and 6G systems, enable a novel architecture supporting both cloud and edge based deployments of AR applications where the computation-intensive functions can be offloaded from the devices.

3 Proposed Architecture

This section is devoted to the main goals driving our architecture design and to the details of the relevant components of the proposed system. In addition, the feasibility of the concept is also discussed.

3.1 Design Goals

We target a novel edge cloud based AR platform supporting the coordination of multiple users in a common geographical space. The user experience requires precise pose (position and orientation) information to be calculated for each AR device in real time. Moreover, due to the joint space, the coordinate systems of different users/devices need to be synchronized continuously in order to display the virtual objects in the right place on each device during the whole game. To meet these requirements, we propose an edge cloud based solution, where the camera images and sensor data are streamed towards a dedicated coordination service. The SLAM function is offloaded from the device and arbitrary algorithms can be invoked in the remote environment (even in parallel). By these means, we can prolong the battery life of the AR devices. In order to enable coordination among the users, the per-device coordinate systems have to be synchronized, which requires building and using a joint global map. This is a core task of the coordination service and the details are hidden from the users. For example, they cannot access images sent from other devices. Built-in local SLAM algorithms of current AR devices (e.g., ARKit [59], ARCore [60]) operate with proprietary map formats which cannot be shared among different devices. Making use of open-source SLAM libraries, we can provide cross-platform solutions. Another important goal of our design is extensibility. As we stream raw data to the edge cloud system, other AR services can easily be added as distinct software components. Real-time detection and tracking of objects, users and gestures are important examples which can be implemented atop the platform.

3.2 Main Components

The architecture of the proposed system is depicted in Fig. 1. Currently, we assume mobile phones as AR devices, but the concept is valid for AR glasses as well. On the devices, camera images and IMU data are collected and sent to the coordination service. Since the user experience is greatly affected by the response time of the SLAM function, it is crucial to have a proper network connection between the devices and the edge cloud infrastructure. It is worth noting that local SLAM algorithms can optionally be run on the devices as well. In this scenario, for example, the faster but uncoordinated local operation can be improved periodically based on the remote service.

At the cloud side, each device has its own service instance including one or multiple SLAM modules and a selector component. SLAM modules with different approaches can achieve different results regarding accuracy and robustness. Moreover, the performance is also affected by the surrounding environment and the user behavior in a diverse way. Therefore, we run several SLAM algorithms side by side, so that we can choose the best regarding mapping and tracking results. This is done by the SLAM selector module, which decides which algorithm to use based on the camera image data and the availability of pose estimations. Analyzing different SLAM algorithms in different environments, we can construct policies to choose the appropriate algorithm dynamically during the operation.

Fig. 1
figure 1

Proposed architecture

The core component of the coordination service is the global map which is constructed on-the-fly based on the received information from the connected devices. Making use of the global map, novel AR applications can be realized that are not possible with the current technologies. Furthermore, the platform can be extended with additional AR services (e.g., object detection and tracking) enriching the feature set provided to developers and customers.

3.3 Feasibility

The feasibility of the design concept is mainly determined by the capabilities of the underlying network infrastructure and the specific bandwidth and latency requirements of the proposed platform. On the one hand, the delay bound stems from the characteristics of the applications and the used AR devices. Assuming mobile (handheld) AR, the video content on the display is typically delayed by a few (e.g., three) frames, which yields a well-defined, manageable upper bound. Head-mounted displays realizing video pass-through technology (a similar concept to mobile AR) also allow a slight offset in projection, but the head pose cannot be delayed. In contrast, optical see-through devices pose more stringent latency limits and call for additional mechanisms, such as precise prediction of the movements. On the other hand, the bandwidth requirement of a client primarily depends on the parameters of the camera stream, such as the resolution, FPS and the encoding bitrate. In our framework, the targeted bitrate of the video encoder is a configuration parameter that controls the ultimate bandwidth usage. According to our preliminary analysis, HD video with a 30 FPS frame rate encoded into 3.5 Mbps or 7 Mbps video streams results in similar and acceptable SLAM accuracy. The architecture also supports automatic rate adaptation, adjusting the targeted bitrate according to the varying network characteristics, which is an essential feature in bandwidth-constrained radio network environments.

The available bandwidth of a cell in the Radio Access Network of a 5G (later 6G) network depends on multiple parameters, configuration settings and geographical properties. Assuming \(3\text {--}7\) Mbps upstream load per user, current systems can support several (but of course not hundreds of) AR clients, which is a good starting point e.g. for multi-player gaming scenarios. However, we can expect much more uplink capacity from future radio networks [6]. We believe that besides cloud-based gaming scenarios, cloud-based multi-user AR applications will also be important ones in the “5G timescale” and will become a killer application during the era of 6G. Today’s online gaming platforms typically require a minimum of \(2\text {--}3\) Mbps downlink and \(0.5\text {--}2\) Mbps uplink speeds, but with remote rendering the downstream easily jumps up to \(10\text {--}15\) Mbps. We argue that it is crucial to understand the specific requirements (uplink/downlink bandwidth, delay, jitter) of different AR related services, and based on these assessments we can provide input for beyond-5G and 6G design and standardization activities.
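As a rough, purely illustrative calculation (the assumed cell capacity is hypothetical and not taken from our measurements): if \(100\) Mbps of usable uplink capacity were available in a cell, at most \(\lfloor 100/3.5 \rfloor = 28\) clients could stream at 3.5 Mbps, and \(\lfloor 100/7 \rfloor = 14\) at 7 Mbps, ignoring all other traffic and radio overheads.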

Fig. 2
figure 2

Proof-of-concept prototype. Clients implemented on Android and iOS (left); Coordination platform (right)

4 Implementation

To validate our approach and the proposed architecture, we have created a proof-of-concept prototype including the relevant components. This section presents the details of the implemented system in two steps: first, the client side is discussed and second, the main parts of the platform are reviewed. The overall system is depicted in Fig. 2.

4.1 Client Side: AR Application

We have implemented a multi-player AR application (game) making use of our edge cloud based coordination service. As shown in Fig. 2, the game is built on the Unity 3D engine [61], which is a widely used game engine for developing AR applications. In addition, we have implemented a native library in C++ i) to stream the camera images, ii) to stream the IMU data, and iii) to render the camera image to a Unity texture. On the one hand, this solution makes it easier to reuse our program code in other projects, and on the other hand, it is an efficient solution providing fast operation due to the C++ implementation. We have addressed two different mobile platforms, namely Android and iOS, with slightly different implementation details.

On the one hand, the streaming of the camera images is implemented in the same way on the two platforms. Invoking a listener, we get the images continuously in a callback method. The images are then passed to an H.264 encoder running in a separate thread. The targeted bitrate of the H.264 encoder is a configuration parameter, and in our experiments we used 3.5 Mbps and 7 Mbps, respectively. When encoding is complete, the original camera image is forwarded for rendering, and the encoded image is sent in a separate thread to the remote service. On the other hand, IMU data is sent differently on the two platforms. Both cases are shown in Fig. 2, where the components corresponding to the iOS implementation are indicated by purple boxes, while the Android-related elements are shown by green boxes. In the case of the Android implementation, a dedicated module, called Rosbridge [62], is used for data transmission over a websocket. In contrast, on iOS, the IMU data is sent in a raw UDP stream. (This difference stems from a dependency issue of Rosbridge.) We use the same client time for timestamping both the IMU data and the video frames, and they are synchronized at the server side. The IMU data is sent at a higher frequency (100 or 200 Hz in our experiments, depending on the device capability) and each image frame is combined with the closest IMU sample that has been received.
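The following minimal Python sketch illustrates the closest-timestamp pairing described above; the actual client code is written in C++ and structured differently, so the data layout and sampling rates here are only illustrative assumptions.

```python
# Illustrative sketch of pairing a video frame with the closest received IMU
# sample by client timestamp (not the actual C++ client implementation).
from bisect import bisect_left

def closest_imu_index(imu_timestamps, frame_ts):
    """imu_timestamps: sorted list of IMU timestamps (seconds, client clock)."""
    i = bisect_left(imu_timestamps, frame_ts)
    # Compare the neighbours around the insertion point and keep the closer one.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(imu_timestamps)]
    return min(candidates, key=lambda j: abs(imu_timestamps[j] - frame_ts))

# Example: IMU sampled at 200 Hz, frames arriving at 30 FPS.
imu_ts = [k / 200.0 for k in range(400)]
print(imu_ts[closest_imu_index(imu_ts, 1 / 30.0)])  # -> 0.035
```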

Our main focus was on an application which uses only the remote SLAM service for positioning, but we have also created a hybrid solution on both platforms. In this scenario, a local SLAM function also runs on the device (ARCore on Android and ARKit on iOS in our case), and the final pose information is calculated based on the local and the remote data. This solution requires a transformation between the local coordinate system and the one used by the global map in the platform. Following this approach, the positioning of the application is not fully lost during network errors because the local service can provide data continuously, and the coordination service can be used to improve the accuracy.

4.2 Platform Side: Coordination Service

Our coordination service has been implemented on top of the Robot Operating System (ROS) [63] and the components are packaged into distinct software containers. Building on ROS is beneficial for several reasons. First, most open-source SLAM implementations run with ROS, so these modules can be changed easily. Second, the topic-based communication in ROS, following the publish/subscribe model, provides an efficient messaging bus to which extra components implementing additional AR features (e.g., object detection) can be connected. The required input data, such as the image stream or IMU data, can easily be broadcast to the involved entities. A dedicated stream receiver is in charge of receiving the UDP stream from the phone, decoding the camera images, and publishing them to the corresponding ROS topic. For each AR device, a new module is created which is responsible for receiving and sending the raw data, the calculated pose estimations and the map. With ROS and Docker containers, this can be managed in a flexible way. As we have made use of available open-source components, the global map is currently built on ROVIOLI and maplab [16], and the map is constructed from explicitly merged dedicated missions. Later, additional algorithms and map handlers can be added.
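As an illustration of the ROS-based messaging described above, the following minimal rospy sketch shows the role of the stream receiver: publishing decoded camera frames, stamped with the client time, on a per-device image topic. The topic name, node name and data source are illustrative assumptions rather than the exact implementation.

```python
# Minimal rospy sketch of the stream-receiver role (topic/node names are assumptions).
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

def publish_frames(decoded_frames, device_id="device0"):
    """decoded_frames: iterable of (client_timestamp, BGR image) pairs,
    e.g. produced by the UDP receiver and the H.264 decoder."""
    pub = rospy.Publisher("/%s/camera/image_raw" % device_id, Image, queue_size=10)
    bridge = CvBridge()
    for ts, frame in decoded_frames:
        msg = bridge.cv2_to_imgmsg(frame, encoding="bgr8")
        msg.header.stamp = rospy.Time.from_sec(ts)  # keep the client timestamp
        pub.publish(msg)

if __name__ == "__main__":
    rospy.init_node("stream_receiver")
    publish_frames([])  # frames would come from the UDP receiver and decoder
```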

As described in Sect. 3, multiple SLAM instances are running in parallel. Each module is subscribed to the incoming camera image messages and the IMU messages (if the latter is also required by the algorithm). In the current configuration, we have incorporated two different open-source SLAM libraries: ROVIOLI [16] and LSD-SLAM [17]. Additional implementations can easily be added later. The output of the SLAMs is monitored by the SLAM selector component and the “better” (or faster) pose response is selected for transmission towards the device. The coordinate systems are transformed accordingly. In our proof-of-concept prototype, a Deep Neural Network (DNN) model was constructed and trained with the publicly available EuRoC benchmark dataset [64], which includes the ground truth values as baselines. The model learned the accuracy of the selected SLAM modules for different environments and movement sequences, and it is able to predict which algorithm will provide the more accurate pose value for the subsequent series of frames. The module is called SLAM Image Analyzer (SIA) and it is able to switch between SLAM outputs making use of different heuristics (e.g., it periodically checks the accuracy and switches if the predicted deviation exceeds a given threshold). For example, environments with varying light conditions or motion blur typically impact the accuracy of tracking, and particular techniques (e.g., feature-based and direct SLAM algorithms) react differently, which we can exploit during online operation.
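A minimal sketch of such a threshold-based switching heuristic is shown below; the threshold, margin and interface are illustrative assumptions, not the exact SIA implementation.

```python
# Illustrative switching heuristic: stay on the current SLAM unless its predicted
# RPE exceeds a threshold and another SLAM is clearly better (values are assumptions).
def select_slam(predicted_rpe, current, threshold=0.05, margin=0.2):
    """predicted_rpe: dict mapping SLAM name -> predicted RPE [m] for the horizon."""
    best = min(predicted_rpe, key=predicted_rpe.get)
    if current is None:
        return best
    if (predicted_rpe[current] > threshold
            and predicted_rpe[best] < (1.0 - margin) * predicted_rpe[current]):
        return best
    return current  # avoid frequent switches that cause sudden pose jumps

print(select_slam({"rovioli": 0.03, "lsd": 0.02}, current="rovioli"))  # -> rovioli
print(select_slam({"rovioli": 0.09, "lsd": 0.02}, current="rovioli"))  # -> lsd
```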

For the SIA module, different types of neural networks, prediction horizon parameters, and target error metrics were systematically examined for each SLAM implementation to find the best-performing model configuration and hyperparameters. The underlying problem of SLAM accuracy prediction has been formulated as a regression that relies on a specific blur metric of the received image frames, combined with the synchronized (closest received) IMU values, as the main input. The Variance of Laplacian (VoL) [65] value is leveraged as the blur metric; it can be calculated quickly and efficiently characterizes the quality of an image frame regarding either the number of feasible feature points or the distribution of pixel intensities [66]. As the model’s target, the relative pose error (RPE) has been chosen, calculated between the pose of the current frame and that of a later frame determined by the prediction horizon, i.e., the model looks ahead and infers a prediction for the upcoming frames. This local error metric suits our problem since it can be calculated for each frame in a flexible manner and does not depend on previously calculated error values, as is the case with cumulative metrics such as the absolute trajectory error (ATE). Moreover, the predicted RPE values can easily be utilized to quantify the magnitude of the expected drift suffered in the trajectories of the different SLAMs, despite the absence of ground truth data. Besides the standard deep neural network, recurrent neural models were also trained and evaluated to capture hidden correlations between the pose errors of consecutive frames by taking the time dimension into account. Two different recurrent structures were examined: a stacked Long Short-Term Memory (LSTM) [67] based model with two stacked layers, and an encoder-decoder model [68] that applies an LSTM layer for both the encoding and decoding parts along with a single hidden layer for the output calculation.
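For reference, the VoL blur metric can be computed with a few lines of OpenCV; the sketch below is a generic illustration (the grayscale conversion is an assumed preprocessing step, not necessarily how our pipeline implements it).

```python
# Variance-of-Laplacian blur metric; low values indicate a blurry frame.
import cv2

def variance_of_laplacian(bgr_frame):
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()
```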

For the model training and evaluation steps, prerecorded measurements of the EuRoC MH 01-04 missions were used with randomized shuffling and cross-validation, while the evo [69] tool was used for calculating the pose errors. Our automated training and tuning process ultimately showed that the simpler DNN model outperforms the tested recurrent models, using a 10-frame prediction horizon. The best result is provided by a two-layer DNN that we have incorporated into our prototype implementation. It consists of two hidden layers, each with 512 neurons and the rectified linear unit (ReLU) activation function. To avoid overfitting, L2 bias regularizers (\(l2=0.2\)) are defined for each layer along with separate dropout layers with a 30% drop rate. For the model training, the Adam optimizer [70] is used with a reduced learning rate (\(1e-4\)) and the Huber loss [71], which robustly handles target values with frequent changes by behaving quadratically for small residuals and linearly for large ones.
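A Keras sketch of the described network is given below; TensorFlow/Keras as the framework, the input feature size and the exact layer wiring are illustrative assumptions, while the layer sizes, regularization, dropout, optimizer and loss follow the description above.

```python
# Sketch of the two-layer DNN used by SIA (framework and feature size are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_sia_model(n_features):
    model = tf.keras.Sequential([
        layers.Input(shape=(n_features,)),    # VoL blur metric + synchronized IMU values
        layers.Dense(512, activation="relu", bias_regularizer=regularizers.l2(0.2)),
        layers.Dropout(0.3),
        layers.Dense(512, activation="relu", bias_regularizer=regularizers.l2(0.2)),
        layers.Dropout(0.3),
        layers.Dense(1),                       # predicted RPE over the 10-frame horizon
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss=tf.keras.losses.Huber())
    return model
```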

5 Measurement and Evaluation Methodology

In a regular AR application, the pose information is provided by the local SLAM (e.g., ARCore or ARKit) periodically at certain time instants, estimating how far the device has moved from its position at application start. The quality of the user experience is significantly affected by the accuracy of this estimation. This holds for remote SLAM solutions as well. In order to measure the accuracy of a SLAM algorithm, we need to know the real physical position at each time instant, which is called the ground truth in the literature [72,73,74]. This is the baseline trajectory for measuring the accuracy of a SLAM. There are some open-source databases [64, 75, 76] that contain the camera images for each timestamp and also include the ground truth for the datasets. However, this tool set cannot be used to characterize our own application in our physical environment, including the network infrastructure. Therefore, a novel measurement methodology is required to be able to analyze the impact of all components and mechanisms.

5.1 Baseline Trajectory

We have elaborated an appropriate measurement methodology which can leverage a wide range of baseline trajectories. We use a robotic arm to move the AR device along predefined 3D trajectories. The real physical position is captured from the robot software and can be used as the ground truth value.

Fig. 3
figure 3

Measurement setup with a UR5e robotic arm

This methodology is capable of providing high-precision ground truth data series; however, the size of the trajectories is limited by the moving range of the robotic arm. In our measurements, we used a Universal Robots UR5e robotic arm (shown in Fig. 3) with a mounted mobile device. Within this range, arbitrary trajectories can be configured with programmable velocity and acceleration. Recording the ground truth and the estimated positions provided by the SLAM algorithms (of course, both with timestamps), we can easily compare how close the estimated position is to reality.

5.2 Evaluation Method

A crucial step of the evaluation is the synchronization of the coordinate systems of the robotic arm and the AR device. SLAMs measure the distance and the rotation from the AR application’s starting point, which is the origin of that coordinate system. In contrast, the ground truth is defined in the robotic arm’s coordinate system, whose origin is at the base of the robot stand. Furthermore, some SLAM algorithms use right-handed coordinate systems (e.g., ARCore), while others assume left-handed ones. Therefore, we need to apply a shift and a matrix multiplication to synchronize the two worlds. The shift is calculated at the starting position and applied throughout the measurement. The positions obtained in this way and the ground truth values are shown together in the plots in the next section, allowing a visual comparison. Besides, we use objective metrics to characterize the targeted SLAM algorithms, such as the Absolute Trajectory Error (ATE) and the Relative Pose Error (RPE) [75, 77]. The RPE shows the drift of the trajectory over a fixed time interval, while the ATE measures the global difference compared to the baseline trajectory.
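The alignment step can be summarized by the following numpy sketch; the concrete axis flip and the assumption that only a constant shift (no rotation estimate) is needed at the start are simplifications for illustration, not the exact evaluation code.

```python
# Illustrative alignment of SLAM positions to the robot's coordinate system:
# a handedness/axis fix (matrix multiplication) followed by a constant shift
# computed at the starting position. The concrete axis mapping is an assumption.
import numpy as np

AXIS_FIX = np.diag([1.0, 1.0, -1.0])  # example: flip one axis to change handedness

def align_to_ground_truth(slam_positions, gt_start):
    """slam_positions: (N, 3) array in the SLAM frame; gt_start: (3,) ground truth
    position at the start of the measurement."""
    pts = np.asarray(slam_positions) @ AXIS_FIX.T
    shift = gt_start - pts[0]            # computed once, applied throughout
    return pts + shift
```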

Another important aspect which should be considered, especially in the case of a remote SLAM, is the response time of the service. If the latency is too large, the provided pose is outdated and unusable. During the measurements, we put a timestamp on each image as soon as we get it from the camera of the mobile phone, and we also put a timestamp when we get back the estimated pose for that image from the SLAM. Hence, we specifically measure end-to-end latency, which includes image encoding, network latency, image decoding, SLAM runtime, and all platform overheads.
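Conceptually, the bookkeeping is as simple as the following sketch; the names and data structures are illustrative assumptions, as the real measurement hooks live in the client and platform code.

```python
# Illustrative end-to-end latency bookkeeping: timestamp at camera grab,
# latency computed when the corresponding pose arrives from the remote SLAM.
import time

_grab_times = {}  # frame_id -> monotonic timestamp at camera grab

def on_camera_frame(frame_id):
    _grab_times[frame_id] = time.monotonic()

def on_pose_received(frame_id):
    # Includes encoding, network, decoding, SLAM runtime and platform overheads.
    return (time.monotonic() - _grab_times.pop(frame_id)) * 1000.0  # milliseconds
```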

Besides the robotic tests, we also conducted human motion experiments, where the trajectory is not limited. Inside our office building, we have recorded several missions of different complexity. The recorded datasets can be replayed against the platform running with chosen configurations. Here, the exact ground truth is not available; however, the results of the local SLAM algorithms can be captured and used as an approximation of the baseline. This approach is validated by our measurements: ARCore and ARKit follow the real trajectories with acceptable precision.

Fig. 4
figure 4

Performance of the coordination service with Rovioli running in AWS

Fig. 5
figure 5

Performance of the coordination service with LSD running at the edge site

6 Evaluation

This section is devoted to the evaluation of our proof-of-concept prototype following the methodology described in the previous section. In our experiments, the coordination service platform was established both in an AWS cloud region (Frankfurt, eu-central-1) and in the local premises (edge setup), while the robotic arm and the AR devices were operated from the local premises connected via a WiFi network. The average round-trip time between the device and the coordination platform was 26 ms and 2 ms, respectively. We targeted experiments where the network was not a bottleneck of the system and the number of clients was limited. Our testbed environment encompassed WiFi and wired networks connecting the AR devices to the edge/cloud services, while the interference in the radio channels was minimized. By these means, we analyzed the limits of the approach and revealed the performance characteristics under unconstrained network capacity.

6.1 Latency Characteristics

We conduct experiments to measure the runtime of each component in order to analyze the usability of the system and to reveal potential bottlenecks. Both Android (Huawei P30 Pro) and iOS (iPhone 12 Pro) phones are tested with the coordination service. Detailed analysis has been carried out for a wide range of scenarios; as an illustration, a selected cloud setup and a selected edge scenario are presented here. In the chosen cloud experiment, the coordination service uses ROVIOLI as the SLAM module (shown in Fig. 4), while the edge scenario is equipped with LSD-SLAM (presented in Fig. 5). We use HD videos (Huawei P30 Pro: 1280x960, iPhone 12 Pro: 1280x720 resolution) at 30 FPS encoded into streams targeting a 3.5 Mbps bitrate. Four separate operation phases are measured: the camera image encoding time and the streaming time on the phone, the processing time of the given SLAM module (including queueing delay if it appears), and the network delay together with all platform overheads (e.g., ROS based communication, virtualization overhead).

Fig. 6
figure 6

End-to-end latency statistics

Experiments have shown that the streaming time of a frame on a phone is most significantly affected by the phone’s encoder. In the case of Android, transcoding a frame (1280x960 pixels) takes around 72 ms, yielding an average total transmission time of 76 ms (including the encoder and streamer threads). This large encoding time cannot be reduced due to the limitations of the available libraries (and the underlying drivers), which do not allow enforcing individual frame processing (instead, three subsequent frames are handled together). For iOS devices, the encoding function shows much better performance due to per-frame processing and needs around 7.27 ms to encode an image (1280x720 pixels). An additional approximately 7.5 ms is required by the streamer thread, which yields around 15 ms as the total streaming time. The response times of the SLAM modules show the same characteristics for both devices. The average runtime of ROVIOLI, including the queuing delays, is above 80 ms with large variance, while LSD-SLAM exhibits much better performance, calculating the pose within 7.5 ms in all scenarios. The network delays and the overheads differ between the cloud and edge scenarios depending on the underlying hardware architecture, the used network technologies and the physical distance between the platform and the AR devices. The average delay is tolerable for our test scenarios; however, the results show significant jitter which can have an impact on the user experience. We expect that moving to a 5G (and 6G) environment will result in reduced jitter and more deterministic delays in the network part. However, in order to mitigate the jitter introduced by the edge/cloud platforms, additional mechanisms are needed.

We also present the total end-to-end (E2E) latency in Figs. 4 and 5, which measures the time between grabbing the camera image and the arrival of the pose. Besides, Fig. 6 summarizes the E2E statistics for all scenarios in a violin plot. On the one hand, in the case of Android, the E2E delay is above 200 ms with ROVIOLI even in edge scenarios, which unfortunately has a noticeable effect on the user experience. With LSD-SLAM, slightly better operation is achieved (above 100 ms but with large deviation), which also has an impact on the experience. On the other hand, the results with iOS (iPhone 12 Pro) are very promising, especially together with LSD-SLAM. In the case of the edge setup, the average E2E latency is kept around 50 ms, which means that the pose information can be updated within a 2-frame delay for most of the frames (assuming a 30 FPS frame rate).

6.2 Accuracy

Fig. 7
figure 7

The performance of different SLAMs using simple trajectory (top view)

Fig. 8
figure 8

The performance of different SLAMs using complex trajectory (top view)

Besides the strict latency characteristics, which are essential for real-time operation, accuracy is the other dimension that directly impacts the user experience. The accuracy of our coordination service is affected by the incorporated SLAM libraries and also by the distributed operation of the overall system. We have conducted several experiments to assess the performance characteristics; here we highlight some relevant results. The results of the robotic accuracy measurements are shown in Fig. 7 and Fig. 8 for two different trajectories. The first one is simple: lifting the AR device and following an arc there and back. In Fig. 7, the top view of the trajectory is plotted, presenting the results for different SLAMs. The black curve indicates the ground truth, and as baselines, the results of ARCore and ARKit are also depicted. Besides single SLAMs, we show the results of our ensemble module (SIA) combining ROVIOLI and LSD-SLAM. ARCore and ARKit outperform the open-source methods; however, the errors are not extreme and in certain applications this performance can be accepted (position coordinates are in millimeters).

The second experiment applies a complex trajectory, where the robotic arm makes several changes of direction and draws the shape of a house after a synchronization phase (the bottom center part of Fig. 8, which shows the top view of the movements). Here, ARCore and ARKit strictly follow the ground truth values, but the accuracy of the remote SLAMs shows non-negligible divergence. They follow straight movements very well, but the turning angles are often incorrect. The remote SLAMs sometimes make corrections, which cause minor jumps in the estimated positions. Moreover, at the corners of the “house”, the altitude values also show deviation.

The third type of experiment addresses human movements. The result of a selected mission using ROVIOLI is shown in Fig. 9. Here, we use the trajectory of ARCore or ARKit as the ground truth. For this complex human motion, the accuracy of the remote service is very close to the performance of the local solutions and it can provide an acceptable user experience. As quantitative metrics, ATE and RPE values are also calculated and shown in Table 1 for selected representative experiments from each test group. LSD-SLAM shows better performance than ROVIOLI, while the ensemble solution is able to outperform both SLAMs in terms of the absolute error (ATE). However, the relative drift (RPE) indicates performance issues for SIA, which can be the result of sudden changes in pose values at switching events. Identifying the trade-off between sudden corrections and slowly accumulated drift when optimizing for the user experience is an important topic for future research.

Fig. 9
figure 9

The performance using ROVIOLI in a human motion experiment inside the office building

Table 1 ATE and RPE for different SLAMs [m]

To sum up, the observed accuracy is acceptable for simple games; however, it is currently not sufficient for critical applications, such as a remote surgery support system. If we target high-quality AR experiences, the coordination service has to be combined with local enhancements (e.g., sophisticated prediction or filtering mechanisms). We believe that the platform can be extended in this direction.

6.3 Energy Consumption

It is also important to consider the power consumption of the client devices used in the system because, as already mentioned, the battery life of a mobile AR device can be multiplied by offloading certain XR functions [6]. We have conducted several experiments to compare the consumption of a baseline application invoking local AR libraries with that of a novel application using our coordination service. The energy consumption measured while running each application for 15 minutes is shown in Table 2. In the case of the iPhone 12 Pro, the energy consumption is reduced by 30% when the coordination service is invoked, which is an important advantage of this approach. However, in the case of the Android platform, the results do not show this gain. This stems from an implementation issue: currently, our rendering code is not optimized and some tasks are executed on the CPU instead of the GPU.

Table 2 Energy consumption for different SLAMs

7 Conclusion

In this paper, we have proposed a novel edge cloud based coordination platform for multi-user AR applications. An extensible architecture has been described and a proof-of-concept prototype was developed. We have focused on the latency-sensitive and computation-intensive Simultaneous Localization And Mapping function, which was offloaded from the device to the edge cloud infrastructure. Our solution has been built on open-source SLAM libraries and the Robot Operating System (ROS), and it inherently supports further extensions, including additional AR services, such as object/face detection and tracking, semantic understanding of the environment or remote rendering. In order to be able to evaluate the concept, we have defined a dedicated measurement methodology providing the necessary ground truth information. Following the methodology, we analyzed the latency characteristics and the accuracy of the platform via real experiments. The current version of the proof-of-concept prototype provides sufficient services for simple AR games or applications; however, further improvements and additional components are required to enable high-quality AR experiences.

Challenging future work has also been identified. We are working on an extension to the encoder/streamer part which supports automatic rate adaptation. It is crucial in bandwidth-constrained radio network environments to be able to adjust the targeted bitrate of the encoder dynamically to follow the varying characteristics of the radio network. Besides, the trade-off between the encoder’s bitrate and SLAM accuracy for different SLAM implementations is also an interesting research question. Our future work addresses the migration to an internal experimental 5G testbed where the full system is under our control and dedicated experiments with our platform can be carried out. Finally, we target performance enhancements along two parallel tracks. On the one hand, more recent SLAM libraries (e.g., ORB-SLAM3) will be added to the framework. On the other hand, novel mechanisms, such as precise movement prediction and filtering algorithms, will be integrated into the platform in order to mitigate delay and jitter issues.