
1 Introduction

Globalization has boosted demand for immersive telepresence solutions, technologies that allow physically separated individuals to collaborate as if they were in the same room [1]. By offering an alternative to travelling to in-person meetings, this technology can save time and money while also decreasing environmental impact [2]. It enables users to present at a remote location through a virtual projection. The idea of telepresence traces back to Marvin Minsky, one of the first to coin the term "telepresence" for the feeling of remote existence when using a teleoperator [3]. Since Minsky stated this principle, several systems have been developed to accomplish telepresence [4].

Early telepresence systems, limited by the technology available at the time, were unable to acquire and transmit high-quality real-time three-dimensional (3D) reconstructions to distant users [5]. However, after inexpensive commodity depth sensors such as the Microsoft Kinect became widely available, many room-scale 3D reconstruction systems were developed, such as Room2Room life-size telepresence [6]. The ability of 3D reconstruction to capture and rebuild real-world elements has driven significant advances in telepresence [7].

However, implementing 3D reconstruction for telepresence requires real-time rendering of moving objects, since telepresence captures user movement in real-time. Real-time 3D reconstruction of moving objects is challenging in computer vision and computer graphics [7, 8]. The majority of current techniques target real-time 3D reconstruction for single-view systems, using a single depth sensor rather than multiple views from multiple depth sensors.

With the use of a head-mounted display (HMD), a 3D reconstruction can be viewed with a wider field of view (FoV) than on a standard phone or monitor [9]. This allows users to interact with virtual environments in a manner similar to being physically present in them [10]. Mixed Reality (MR) refers to a new category of experiences in which a layer of computer-graphics-based interaction tied to a certain activity is superimposed on top of the real environment [11]. It allows users to experience the actual world as a more integrated and enhanced environment. The user of an MR interface sees the actual world through a portable HMD that overlays graphics on the surroundings [12]. The authors of [13] were the first to demonstrate an HMD, which gave rise to the concept of MR. According to [14], an MR system brings together physical and digital objects in the same space, allows for two-way communication between them, and keeps track of their location and state in real-time.

In this chapter, a depth sensor is proposed to capture real-world objects and reconstruct the data in 3D. However, the FoV of a single depth sensor is insufficient to capture a full 3D object, and it is physically impossible to gather all the input data for the object or scene at once [15]. Therefore, to produce a full 3D reconstruction, most researchers use the KinectFusion approach, which moves the depth sensor around the real-world object to capture the data frame by frame. This method, however, is very computationally intensive and can only be run in an offline mode [16]. Furthermore, more than half of the scene is occluded at any given time when a single depth sensor is employed to reconstruct non-rigid motion, and continuous motion creates large frame-to-frame fluctuations, which may lead to inconsistency in the scene's topological structures [17].

As a result, multiple depth sensors positioned at different angles are required to obtain unobstructed data from all sides of the object and complete a full 3D reconstruction in real-time [18]. According to [19], multiple depth sensors enable users to obtain visual information that may not be available from a single perspective with one depth sensor. In previous studies, multiple-depth-sensor setups have often consisted of more than two devices. However, recent studies [20,21,22] show that multi-sensor and multi-camera capture can be achieved with only two depth sensors or two cameras. Therefore, this chapter proposes a real-time 3D reconstruction method with multiple depth sensors for MR telepresence. Related work is discussed in the next section, the methodology of the proposed method is described in Sect. 3, and the results are presented in Sect. 4.

2 Related Work

Real-world 3D reconstruction has long been a goal of computer vision. Numerous instruments, including stereo cameras, laser range finders, monocular cameras, and depth sensors, have been used to perceive the 3D environment accurately [23]. With the advancement of depth sensor devices such as the Microsoft Kinect and Intel RealSense, many new and exciting applications have been developed. However, due to issues such as noisy depth data, poor registration, camera tracking drift, and a lack of precise surface details, 3D models reconstructed by consumer depth cameras are not yet widely employed in applications. Table 1 summarizes related work on 3D reconstruction for MR telepresence that utilizes depth sensor devices to enable 3D reconstruction.

Table 1. Related work on 3D reconstruction for MR telepresence using a depth sensor

Sun et al. [25] presented a real-time 3D reconstruction in MR by discretizing the world space into 10 cm voxels grouped into 8 × 8 × 8 chunks. Voxels containing points that fall within their boundaries are marked as occupied and rendered as cubes, as in Minecraft. While engaging with a virtual environment, the user is free to move through wide areas in the actual world without running into any barriers.

Meanwhile, Joachimczak et al. [10] highlighted real-time 3D reconstruction for MR telepresence in their study. They utilized a HoloLens with Kinect devices as depth sensors for 3D reconstruction, as shown in Fig. 1. However, their study provides limited information, as they did not evaluate the proposed method in further detail. In the research carried out by Fadzli et al. [26], the authors presented an MR framework that can transmit a full-body human representation from a nearby location to a more distant one. They introduced a user telepresence capability for 3D reconstruction, allowing people to interact with an accurate 3D model of a remote user. The depth data from the depth sensor is captured and processed in real-time for 3D reconstruction with a point cloud rendering process.

Fig. 1. Real-time 3D reconstruction using HoloLens and Kinect [10].

3 Real-Time 3D Reconstruction Method

This section elaborates on the real-time 3D reconstruction method, which consists of the multiple depth sensor setup, surface reconstruction, and merging of 3D reconstruction data for MR telepresence. Each step is explained in the following subsections.

3.1 Multiple Depth Sensor Setup

In order to produce a 3D reconstruction, depth information is required. A conventional camera is only capable of producing two-dimensional (2D) red, green and blue (RGB) data, such as photos and videos. Because a conventional camera cannot collect depth data, its output is insufficient for capturing an object as 3D data. Therefore, a depth sensor is required in this chapter for capturing and converting real-world objects into 3D data. The depth sensor proposed in this chapter is the Microsoft Kinect sensor.

The Microsoft Kinect sensor is a low-cost depth sensor, widely known as a skeleton tracker, that is used with the Microsoft software development kit (SDK) [30]. It allows users to communicate with computers and game consoles intuitively via gestures and spoken commands without any additional peripherals [31]. It can also extract data about the subject's 3D joint locations [32].

Microsoft has released two Kinect models. The first generation of Kinect sensors (Kinect v1), which debuted in 2010, was based on structured-light coding technology. Microsoft later made it Windows-compatible with an SDK and a power conversion adaptor. The second generation of Kinect sensors (Kinect v2), based on time-of-flight (ToF) technology, was released in 2014 with improved specifications compared to the first Kinect in terms of speed, accuracy, and field of view.

However, the FoV of a single Kinect sensor is constrained, making it physically impossible to gather all the point cloud data for the 3D object at once. Therefore, this chapter proposes using two Kinect sensors to acquire data from the front and back of the user simultaneously. The setup for multiple depth sensors is shown in Fig. 2: the distance between each Kinect sensor and the user is fixed at 2.1 m, while the height of each Kinect sensor from the floor is fixed at 1.1 m. This setup for the multiple Kinect sensors is based on the research in [33] and ensures that each Kinect sensor is able to track the user's full body for the 3D reconstruction process; a rough coverage check is sketched after Fig. 2.

Fig. 2. Setup for multiple depth sensors
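
As a quick sanity check of this placement, the short sketch below estimates the vertical band a sensor can see at the user's position. The 60° vertical field of view is our assumption for the Kinect v2 depth camera; only the 2.1 m distance and 1.1 m mounting height come from the setup above.

```python
import math

# Rough coverage check for the fixed placement in Fig. 2 (sketch only).
DISTANCE_M = 2.1         # sensor-to-user distance (from the chapter)
MOUNT_HEIGHT_M = 1.1     # sensor height above the floor (from the chapter)
VERTICAL_FOV_DEG = 60.0  # assumed Kinect v2 depth-camera vertical FoV

half_span = DISTANCE_M * math.tan(math.radians(VERTICAL_FOV_DEG / 2.0))
lowest = MOUNT_HEIGHT_M - half_span   # lowest visible height at the user plane
highest = MOUNT_HEIGHT_M + half_span  # highest visible height at the user plane

print(f"Visible band at {DISTANCE_M} m: {lowest:.2f} m to {highest:.2f} m")
```

With these assumed values the visible band spans roughly -0.11 m to 2.31 m above the floor, which comfortably covers a standing adult's full body.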

3.2 Surface Reconstruction

Surface reconstruction is one of the steps in this chapter's real-time 3D reconstruction technique. It is the process of acquiring 3D real-world object input data from a depth sensor in order to produce a mesh from that data. Two families of surface reconstruction methods are available: explicit and implicit. The implicit method is proposed in this chapter to reconstruct the surface: the surface is defined implicitly as an isocontour of a scalar function fitted near the input data, for example the zero level set of a signed distance function. To depict the reconstructed surface, the implicit technique employs a triangulation method.

The most intuitive form of triangulated surface reconstruction uses the k-nearest neighbours of a point to generate the connections between linked triangles built from the input points [34]. To generate the triangulated surface from the implicit representation, the Marching Squares algorithm is by far the most well-known technique.

Marching Squares is a computer graphics approach for recovering 2D isocontours from a grid of sample points. The idea is to analyze each grid cell individually and to calculate the contours that pass through it depending on the values at its corners. The initial stage in this procedure is to identify the edges that are linked by contours, depending on whether the corners are above or below the contour value. To keep things simple, the evaluation only considers contours at the value 0, focusing on whether corners are positive or negative. The Marching Squares lookup table is shown in Fig. 3; since each cell has four corners, there is a total of 2^4 = 16 distinct cases, Case 0 through Case 15.

Fig. 3. Marching Squares lookup table

The pseudocode for the Marching Squares algorithm is shown in Table 2, as given in [35]. Based on the Marching Squares algorithm, the triangulation process is implemented by searching for vertex candidates along the contour according to the Marching Squares lookup table. The triangulation is then formed from the list of vertex candidates; a simplified sketch of this cell-by-cell lookup follows Table 2.

Table 2. Marching Squares algorithm
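
To make the cell-by-cell lookup concrete, the following Python sketch implements a minimal version of the Marching Squares step described above. It is not the pseudocode of Table 2 reproduced from [35]: the corner-bit ordering, edge numbering and the edge lookup table are our own conventions, and the segments it returns correspond to the vertex candidates discussed in the text.

```python
import numpy as np

# Edge indices of a grid cell: 0 = bottom, 1 = right, 2 = top, 3 = left.
# Case index bits: 1 = bottom-left, 2 = bottom-right, 4 = top-right, 8 = top-left
# (a bit is set when that corner value is below the contour value 0).
# Each case lists the pairs of edges joined by a contour segment.
EDGE_TABLE = {
    0: [], 1: [(3, 0)], 2: [(0, 1)], 3: [(3, 1)],
    4: [(1, 2)], 5: [(3, 2), (0, 1)], 6: [(0, 2)], 7: [(3, 2)],
    8: [(2, 3)], 9: [(2, 0)], 10: [(2, 1), (0, 3)], 11: [(2, 1)],
    12: [(1, 3)], 13: [(1, 0)], 14: [(0, 3)], 15: [],
}

def edge_point(x, y, edge, cell):
    """Vertex candidate: zero crossing on one edge of the cell at (x, y)."""
    bl, br, tr, tl = cell
    def lerp(a, b, va, vb):
        return a + (b - a) * (va / (va - vb))   # linear zero crossing
    if edge == 0:   # bottom edge, between BL and BR
        return (lerp(x, x + 1, bl, br), y)
    if edge == 1:   # right edge, between BR and TR
        return (x + 1, lerp(y, y + 1, br, tr))
    if edge == 2:   # top edge, between TR and TL
        return (lerp(x + 1, x, tr, tl), y + 1)
    return (x, lerp(y + 1, y, tl, bl))          # left edge, between TL and BL

def marching_squares(grid):
    """Collect contour segments (pairs of vertex candidates) for the 0-isocontour."""
    segments = []
    rows, cols = grid.shape
    for y in range(rows - 1):
        for x in range(cols - 1):
            cell = (grid[y, x], grid[y, x + 1], grid[y + 1, x + 1], grid[y + 1, x])
            case = sum(bit for bit, v in zip((1, 2, 4, 8), cell) if v < 0)
            for e1, e2 in EDGE_TABLE[case]:
                segments.append((edge_point(x, y, e1, cell),
                                 edge_point(x, y, e2, cell)))
    return segments

# Example: the 0-isocontour of a circle of radius 4 sampled on a 16 x 16 grid.
ys, xs = np.mgrid[0:16, 0:16]
field = np.hypot(xs - 8.0, ys - 8.0) - 4.0
print(len(marching_squares(field)), "contour segments found")
```

In the full pipeline, these per-cell crossings would feed the triangulation stage as vertex candidates rather than being collected as loose segments.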

This triangulation using the Marching Squares algorithm is performed after the point cloud data is acquired from the depth sensor, as shown in Fig. 4. After the triangulation has been completed and has successfully produced the mesh, the mesh is mapped with a texture. Texture mapping is the final stage that completes the 3D reconstruction's appearance. The texture for the mesh is generated by composing the segmented images from the depth sensor, as shown in Fig. 4. The images contain vertices with a single colour per vertex; these colours are used to produce the texture, which is then mapped onto the mesh. After the texture has been mapped to the mesh, the final 3D reconstruction output is ready to be displayed as a 3D representation for MR telepresence; a simplified sketch of this per-vertex colour baking follows Fig. 4.

Fig. 4. Surface reconstruction process
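
The sketch below illustrates one simple way to realize the per-vertex colour mapping described above: each vertex's colour is baked into its own block of a small texture, and UV coordinates pointing at that block are returned alongside it. This is a hypothetical stand-in for the chapter's texture-composition step, not its actual implementation; the grid packing and the 256-texel texture size are our own choices.

```python
import numpy as np

def bake_vertex_colors(colors, tex_size=256):
    """Pack one colour block per vertex into an RGB texture and return matching UVs."""
    n = len(colors)
    side = int(np.ceil(np.sqrt(n)))            # blocks per row
    step = tex_size // side                    # block size in texels
    texture = np.zeros((tex_size, tex_size, 3), dtype=np.uint8)
    uvs = np.zeros((n, 2), dtype=np.float32)
    for i, rgb in enumerate(colors):
        row, col = divmod(i, side)
        y0, x0 = row * step, col * step
        texture[y0:y0 + step, x0:x0 + step] = rgb   # fill this vertex's block
        # UV at the centre of the block (V flipped because image origin is top-left)
        uvs[i] = ((x0 + step / 2) / tex_size, 1.0 - (y0 + step / 2) / tex_size)
    return texture, uvs

# Example: three vertices coloured red, green and blue.
tex, uv = bake_vertex_colors([(255, 0, 0), (0, 255, 0), (0, 0, 255)])
print(tex.shape, uv)
```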

3.3 Merging of 3D Reconstruction Data for MR Telepresence

After the textured mesh has been completed, the next phase is to merge the 3D reconstruction data from the multiple depth sensors for MR telepresence. A drawback of the Kinect sensor is that one personal computer (PC) can only connect to one Kinect sensor. Therefore, a network is required to merge the data between the multiple Kinect sensors. The network used in this chapter is a local network, which transmits the 3D reconstruction data from each Kinect at the remote user site to the receiver site, i.e. the local user. Figure 5 shows the process of merging 3D reconstruction data for the MR telepresence system.

Fig. 5. Merging process at the receiver

Based on Fig. 5, before the 3D reconstruction data is transmitted through the network, the data, which consists of vertices, triangles, colours and UVs, undergoes compression to reduce its size; reducing the data size reduces the bandwidth required for transmission. After the receiver PC has received the 3D reconstruction data from both sender PCs, the data is decompressed and the 3D data is rebuilt. The merging of the 3D reconstruction data is then executed at the receiver site to be displayed in MR telepresence. We manually adjust the position and rotation of the 3D reconstruction data from sender 1 and sender 2 using the Unity editor tools so that they align into a full 3D reconstruction; a simplified sketch of this pipeline is given below.
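
The sketch below outlines the sender/receiver pipeline under our own assumptions: each frame of vertices, triangles, colours and UVs is serialized, compressed with zlib, length-prefixed and pushed over a TCP socket on the local network, then decompressed at the receiver, where a manually tuned rigid transform (standing in for the adjustment done in the Unity editor) aligns the second sender's mesh with the first. The host address, port and JSON encoding are illustrative placeholders, not details reported in the chapter.

```python
import json
import socket
import struct
import zlib

import numpy as np

def pack_mesh(vertices, triangles, colors, uvs):
    """Serialize and compress one mesh frame (vertices, triangles, colours, UVs)."""
    payload = json.dumps({
        "vertices": np.asarray(vertices, dtype=np.float32).ravel().tolist(),
        "triangles": np.asarray(triangles, dtype=np.int32).ravel().tolist(),
        "colors": np.asarray(colors, dtype=np.uint8).ravel().tolist(),
        "uvs": np.asarray(uvs, dtype=np.float32).ravel().tolist(),
    }).encode("utf-8")
    compressed = zlib.compress(payload, level=6)
    return struct.pack("!I", len(compressed)) + compressed  # length-prefixed frame

def send_frame(frame, host="192.168.0.10", port=9000):
    """Send one compressed frame to the receiver PC (host/port are placeholders)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(frame)

def unpack_mesh(frame):
    """Receiver side: strip the length prefix, decompress and rebuild the arrays."""
    (length,) = struct.unpack("!I", frame[:4])
    data = json.loads(zlib.decompress(frame[4:4 + length]).decode("utf-8"))
    return {k: np.asarray(v) for k, v in data.items()}

def apply_manual_alignment(vertices, translation, yaw_degrees):
    """Manually tuned rigid transform (as done in Unity) to align sender 2 with sender 1."""
    theta = np.radians(yaw_degrees)
    rot = np.array([[np.cos(theta), 0, np.sin(theta)],
                    [0, 1, 0],
                    [-np.sin(theta), 0, np.cos(theta)]])
    return vertices.reshape(-1, 3) @ rot.T + np.asarray(translation)

# Local round trip (no network needed): pack, unpack and align a toy mesh.
frame = pack_mesh(vertices=[[0, 0, 0], [1, 0, 0], [0, 1, 0]],
                  triangles=[[0, 1, 2]],
                  colors=[[255, 0, 0], [0, 255, 0], [0, 0, 255]],
                  uvs=[[0, 0], [1, 0], [0, 1]])
mesh = unpack_mesh(frame)
aligned = apply_manual_alignment(mesh["vertices"], translation=(0, 0, 4.2), yaw_degrees=180)
```

A binary encoding (packing the float arrays directly) would compress better than JSON; the choice here is only for readability.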

4 Results

This section discusses the results of the proposed method using multiple depth sensors for real-time 3D reconstruction for MR telepresence. Figure 6 shows the final setup for the remote user workspace. As shown in Fig. 6, the remote user is placed between the Kinect sensors, and the setup of the Kinect sensors is as discussed in the previous section.

Fig. 6. Final setup for the remote user using multiple depth sensors

Next, Fig. 7 shows the results after the 3D reconstruction data has been merged from the multiple depth sensors. In Fig. 7, we present a full real-time 3D reconstruction view of the remote user at six different angles. As shown in Fig. 7, the front and back views of the 3D-reconstructed remote user have been aligned into one complete full 3D reconstruction. However, there is a small gap between the front and back 3D reconstruction data. This issue arises in this chapter and requires further improvement. The recommendation for improvement by [36] is to align and stitch two 3D meshes by their boundaries to complete 3D object models without prior registration, using the Joint Alignment and Stitching of Non-Overlapping Meshes (JASNOM) algorithm. A recent study [37] also suggests a mesh stitching method whose algorithm includes a pre-processing step to ensure that the alignment is reasonable.

Fig. 7. Results of merging 3D reconstruction data at the receiver site

Furthermore, Fig. 8 shows the final output of real-time 3D reconstruction for the MR telepresence system using multiple depth sensors. Figure 8 is an image captured from the local user's point of view (POV). In Fig. 8, the remote user appears as a 3D reconstruction in the MR environment. The interaction between the local user and the remote user is in real-time: the local user is able to interact and move around the MR telepresence simultaneously with the remote user.

Fig. 8. Final output of real-time 3D reconstruction for MR telepresence

During the real-time interaction, the compressed data, consisting of vertices, triangles, colours and UVs, is transmitted from the remote user to the local user over a stable local network. Since the transmission relies on this local network, its bandwidth influences the transmission speed: the higher the bandwidth, the faster the compressed data can be transmitted to the local user. If the local network bandwidth is low, latency will increase and reduce the transmission speed, which can affect the real-time performance. Figure 9(a) shows the local user's view of the remote user from the side, while Fig. 9(b) shows the side of the remote user at closer range; a back-of-the-envelope estimate of the per-frame transmission time is sketched after Fig. 9.

Fig. 9. Local user view of the remote user: (a) side view; (b) closer range of the side view
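
To make the bandwidth discussion concrete, the short calculation below estimates the transmission time per compressed frame. Both numbers are assumptions for illustration only; the chapter does not report measured frame sizes or link bandwidth.

```python
# Back-of-the-envelope transmission time per frame (illustrative numbers only).
compressed_frame_mb = 2.0   # assumed size of one compressed mesh frame, in megabytes
bandwidth_mbps = 100.0      # assumed local-network bandwidth, in megabits per second

seconds_per_frame = compressed_frame_mb * 8 / bandwidth_mbps
print(f"{seconds_per_frame:.2f} s per frame, about {1 / seconds_per_frame:.1f} frames/s")
# At 100 Mbit/s a 2 MB frame takes about 0.16 s, i.e. roughly 6 frames per second,
# so higher bandwidth or stronger compression is needed for smooth real-time updates.
```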

5 Conclusion

Based on the literature study, real-time 3D reconstruction technology and MR HMDs can now be used to enable new possibilities for telepresence and teleoperation systems. Users can interact with a distant individual or operate remote devices by perceiving their 3D virtual representations as part of their nearby surroundings.

In this chapter, we proposed employing multiple depth sensors to provide real-time 3D reconstruction for MR telepresence. We utilized two depth sensors to produce depth data for the 3D reconstruction process. First, we captured depth data and formed a triangulation using the Marching Squares algorithm to produce a surface reconstruction. The process continues by generating a texture from the RGB images captured by the depth sensors and mapping it onto the mesh. Once the process is completed, the 3D reconstruction of the remote user is presented in real-time within the MR environment enabled on the HMD worn by the local user.

As for future work, we suggest addressing the gap issue found when merging the 3D reconstructions from multiple depth sensors, in order to produce a more realistic and volumetric presentation. In addition, the interaction between the local user and the remote user can be studied and improved by adding whiteboard interaction, pointing rays and object manipulation in MR telepresence.