1 Introduction

Less than a year into the COVID-19 pandemic, most of us already experience Zoom fatigue on a regular basis. This has been attributed to the added cognitive load of having to focus more intently during video-meetings, while struggling with the dissonance of performing work-related tasks in a contrary environment [21]. Other factors that make video-conferencing strenuous include increased self-awareness, a higher frequency of meetings, and the lack of body-language cues. As a result, people find it difficult to stay focused and work productivity suffers. There is therefore a need for more engaging remote work tools that can capture and relay a shared sense of presence among colleagues.

Extended Reality (XR) can help address this challenge by creating immersive meeting spaces for users to remotely interact within. A shared virtual classroom environment for students and instructors, for example, can serve as a more engaging medium for learning to take place [4]. Virtual environments can transport users to faraway places (both familiar and unknown) where they can interact with 3D objects, spaces and each other. These environments can be modelled after real-world locations (like offices and classrooms) to give users a familiar space to collaborate in. Work artifacts can be visualized as 3D models that can be scaled, rotated and viewed from multiple angles. XR users can greatly benefit from all of the added affordances provided by immersive environments [15]. In addition to visual affordances, embodied interactions can provide users with a natural and intuitive way to interface with the shared virtual environment [16]. Embodied interactions can be used to promote engagement by enhancing a user’s sense of presence in a virtual space [8] and can serve to compensate for the missing body language cues that make video-conferencing tedious (Fig. 1).

Fig. 1. A body-tracked avatar of a remote collaborator in a shared virtual classroom environment (local user’s POV).

In this paper we present a novel approach to facilitating real-time remote collaboration in shared XR environments. Our approach leverages body tracking, XR, and a lightweight communication platform to support telepresence among remote meeting participants. We describe the implementation of this approach and provide a case study. Our prototype implementation comprises three components: 1) a body-tracking component that uses a peripheral depth sensor (Azure Kinect) to capture body frames and recognize user body joints; 2) a communication component that receives and relays captured joint data among connected client applications; and 3) an XR component consisting of a set of XR display devices, each running an instance of a client Unity application for a remotely connected user.

Azure Kinect devices installed at each remote location continuously track each user’s body-joint data, which is then streamed to all other listening client applications. This data is used to rig a virtual avatar representation of each user. The avatars and their body movements are then rendered through each XR display device (client application). All communication between remote client applications is handled via the Message Queue Telemetry Transport (MQTT) messaging protocol (Fig. 2).

Each of the individual devices within the three system components can easily be switched out in favor of more accessible ones. For example, if XR headsets are unavailable to users, Web-based Unity applications can be substituted for the XR component of the system. This makes the proposed system more accessible to remote users, while also opening up exciting new possibilities for platform-agnostic collaboration in shared virtual environments (e.g., MR-VR, MR-Web, Web-VR).

2 Related Work

The objective of this paper is to explore ways of enhancing the experience of remote collaboration. This aligns with a large body of existing computer-supported cooperative work (CSCW) research that envisions trends in remote collaboration and workplace team dynamics. Workplace practices and technology have evolved rapidly over the last few decades, especially since the advent of computers. As a result, we now have various remote collaboration platforms that help us connect and work with people across the world. The importance of these technologies has grown dramatically with the onset of COVID-19. These work-from-home technologies, however, typically exist as standalone applications and are not built to integrate with their users’ surrounding environment.

However, workplace architecture can play an important role in determining local work practices. Streitz et al. [18] present an early conceptual framework for how data visualization can be better facilitated by workplace architecture, making the workplace environment a multi-modal interface for accessing relevant information. They leverage the ideas of ubiquitous computing [20] and invisible technology, arguing that these will play an important role in making such adaptable architectural spaces a reality. Streitz et al. [17] further argue that, although information and communication technologies have reshaped work processes, the design of work spaces has largely remained the same. To remedy this, they propose the idea of cooperative buildings: “flexible and dynamic environments” capable of “supporting and augmenting human communication and collaboration”.

Ladwig and Geiger [9] envision that the technology of the near future will allow us to create the “ultimate device” that will be capable of making the real world indistinguishable from the virtual world. Such a system would be capable of providing “realistic and complete embodied experiences” by utilizing multiple human senses including haptic, auditory, olfactory and even gustatory sensations.

XR is an effective tool that brings us closer to achieving Ladwig and Geiger’s vision, and it has great potential as a work-from-home tool. Zhao et al. discuss an approach that pairs whole-body interaction with the Kinect and responsive large-screen visualizations to support new forms of embodied interaction and collaborative learning. Their findings suggest that pairing physical interaction and visualization in a multi-user environment helps promote engagement and allows children to easily explore cause-and-effect relationships [23].

RGB-D sensors, like the Microsoft Kinect, can enhance a user’s XR experience by providing them with embodied interactions [5]. Anderson et al. [1] developed an AR application, “YouMove”, that lets users train through the various stages of a physical movement sequence; task guidance and feedback were provided through an AR mirror. Handosa et al. [6] combined mixed reality with embodied interaction to create interactive training simulations for nursing students.

However, to connect two or more users in an immersive XR environment, an efficient communication platform is essential. Dasgupta et al. [3] developed an architecture for supporting context awareness in MR, enabling MR devices to scan and recognize workspace tools in real time using an MQTT broker, while also providing instructions on their use. We followed a similar approach and used MQTT as the communication back-end in our prototype implementation.

3 Approach

The COVID-19 pandemic has forced people to carry out most of their daily activities from home. In-person meetings across the globe have been significantly reduced and almost all collaborative work has moved online. However, the resulting overuse of videoconferencing platforms has produced what we now know as “Zoom fatigue”: users report feeling tired, anxious, and worried as a result of overusing these remote collaboration platforms [22]. A major reason for this fatigue is that videoconferencing disrupts the regular communication practices we are accustomed to [13]. Videoconferencing also deprives us of a significant amount of contextual information (through the loss of body-language cues and micro-expressions) due to the physical separation between meeting participants.

Current remote work practices suffer from several of the issues presented above. Given the inherent limitations of videoconferencing, our goal is to make remote meetings more engaging, life-like, and immersive. Embodied interactions can help bridge the gap between virtual and in-person meetings. With recent advances in XR infrastructure, motion-tracking sensors, and network connectivity, it is possible to incorporate human body movement and gestures into meetings held in virtual environments.

We describe an approach to designing collaborative virtual environments that can be used for a variety of purposes, including education and remote meetings. Our proposed system architecture consists of three components: Body Tracking, Communication, and XR Rendering. Each of these components is flexible and is not bound to specific hardware. Figure 2 depicts the use of these components to create a shared XR space for two users; the system can be further expanded to support additional users. The three components are described in more detail below.

Fig. 2. System architecture diagram. 1) Component 1: Body Tracking—captures a local user’s body frames and identifies joints. 2) Component 2: Communication—transfers messages from the Kinect client to the other remote users’ Unity client applications. 3) Component 3: XR Rendering—renders each user’s avatar at their respective position in the virtual environment. The XR device allows the user to see and interact with other users and the shared environment.

3.1 Component 1: Body Tracking

Advancements in computer vision and object-recognition technology have led to the commercial availability of several body-tracking sensors capable of tracking human skeleton joints in real time. Real-time tracking can be implemented via marker-based (Qualisys, OptiTrack) or marker-less (Microsoft Kinect, Intel RealSense) techniques. While marker-based body trackers provide more robust tracking [19] and have larger coverage areas, they are not practical for personal use at home or in a workplace. Marker-less body trackers are smaller, more portable, and more cost-effective overall.

Our system design makes use of marker-less body trackers to capture user skeleton data. Skeleton joint data is captured repeatedly and used to rig virtual avatars that replicate each user’s body movements in the XR application. Joint data for important body parts (head, arms, pelvis, legs, etc.) is continuously streamed to the XR applications of all the users in the meeting. This streaming is handled by our communication component.

3.2 Component 2: Communication

The skeleton joint data collected by the body-tracking device includes the Cartesian coordinates (x, y, z) and the rotation quaternion (w, x, y, z) for each tracked body joint. Most marker-less body-tracking devices collect joint data for all the joints in the human body and use them to animate avatar models. However, this is a large amount of data to send over a network: it can overload the communication channel and result in significantly higher latency. To avoid this we recommend streaming only the most relevant user joint data through the communication channels.
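For illustration, one possible reduced payload could be assembled as in the Python sketch below; the joint names and JSON field names are our own illustrative choices, not a schema mandated by any particular body-tracking SDK or by our prototype.

```python
import json

# Illustrative subset of joints worth streaming; the names are our own choice.
RELEVANT_JOINTS = ["head", "neck", "pelvis",
                   "left_shoulder", "left_elbow", "left_wrist",
                   "right_shoulder", "right_elbow", "right_wrist",
                   "left_knee", "right_knee"]

def build_joint_message(user_id, joints):
    """Pack only the relevant joints into a compact JSON payload.

    `joints` maps a joint name to a dict holding a position (x, y, z) and an
    orientation quaternion (w, x, y, z), as produced by the body tracker.
    """
    payload = {
        "userId": user_id,
        "joints": {name: joints[name] for name in RELEVANT_JOINTS if name in joints},
    }
    return json.dumps(payload)
```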

For our design, we decided to use the MQTT network protocol [7] to support our communication component. MQTT is an OASIS-standard messaging protocol for the Internet of Things: a lightweight publish/subscribe network protocol that can be used to establish communication between remote devices.

Each user’s skeleton joint data is streamed on a unique MQTT channel (topic) by their respective body-tracker client application, which publishes the data to a centralized MQTT broker. The broker then broadcasts this data to all of the other users’ XR client applications listening for updates on these channels. An XR client application runs on each user’s XR display component. Upon receiving new messages from the broker, these XR applications use the received skeletal joint data to update the corresponding user’s avatar within the shared remote meeting environment.
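A minimal XR-client subscriber following this pattern could look like the sketch below, using the paho-mqtt Python library; the broker address and the hierarchical topic naming scheme are placeholders rather than the exact values used in our prototype.

```python
import paho.mqtt.client as mqtt

BROKER_HOST = "broker.local"              # placeholder address for the MQTT broker
JOINT_TOPICS = "bodytracking/+/joints"    # '+' wildcard matches every user's channel

def on_message(client, userdata, msg):
    # msg.topic identifies which remote user the joint data belongs to.
    # A real XR client would re-pose that user's avatar here; we only log it.
    user_id = msg.topic.split("/")[1]
    print(f"joint update for {user_id}: {len(msg.payload)} bytes")

client = mqtt.Client()                    # paho-mqtt 1.x style client
client.on_message = on_message
client.connect(BROKER_HOST, 1883, keepalive=60)
client.subscribe(JOINT_TOPICS)
client.loop_forever()                     # block and dispatch incoming messages
```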

3.3 Component 3: XR Rendering

The XR component consists of an always-on application (developed with a game engine like Unity) that runs on the XR display hardware. This application displays the current state of the shared virtual environment, along with the avatars of all remote meeting participants. Virtual environments can be designed to replicate traditional collaborative environments (classrooms, conference rooms, office spaces, etc.). The XR application is responsible for controlling everything that a user sees and does in the shared virtual environment, and it updates the user’s position based on their movement.

The application also registers any embodied interaction between the user and an environment object and sends it to a designated channel on the communication component. For instance, if the user moves a chair in the shared virtual classroom environment, this movement is propagated to the XR applications of every other remotely connected user. Finally, the XR application receives streamed data from the communication component (other users’ body movements and environment object interactions) and updates the state of each of these entities in the environment.
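As a sketch of how such an object interaction could be propagated, the snippet below publishes an object-state update over MQTT; the 'environment/objects' topic and the message fields are illustrative assumptions, not the exact format used in our implementation.

```python
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("broker.local", 1883)   # placeholder broker address
client.loop_start()

def publish_object_update(object_id, position, rotation):
    """Notify all other clients that an environment object was moved.

    The 'environment/objects' topic and the field names are illustrative;
    any agreed-upon channel and schema would work equally well.
    """
    message = json.dumps({
        "objectId": object_id,
        "position": position,   # (x, y, z)
        "rotation": rotation,   # quaternion (w, x, y, z)
    })
    client.publish("environment/objects", message)

# Example: the local user drags a chair to a new spot in the virtual classroom.
publish_object_update("chair_07", (1.2, 0.0, 3.4), (1.0, 0.0, 0.0, 0.0))
```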

4 Implementation

In this section, we describe our proof-of-concept implementation based on the approach described in the previous section. Given the impact of COVID-19 and the switch to online modes of education, we chose a virtual classroom scenario for our prototype. We envisioned this to be a space where students can join remotely to attend lectures, work on group projects, and interact with faculty members.

4.1 Virtual Classroom Environment Overview

The first virtual classroom that we designed was inspired by a real classroom located in the Northern Virginia Center, a satellite campus of Virginia Tech. This virtual classroom was designed using the Unity game engine. Images of the real and virtual classrooms can be seen in Fig. 3.

Fig. 3. Left: The physical classroom at the Northern Virginia Center. Right: The virtual classroom environment in the Unity application.

After designing this classroom environment to closely resemble its real-world counterpart, we reconfigured its layout to allow users to interact in multiple environment settings. We created a total of four different classroom layouts:

  • Small Classroom: The original classroom design from the Northern Virginia Center. This classroom environment is built to match the scale of its physical counterpart. It is best suited for small class sizes with more interpersonal engagement.

  • Small Classroom post-COVID: This classroom environment was developed to reflect the “six feet apart” social-distancing policy mandated at locations of in-person instruction as a preventative measure against COVID-19.

  • Conference Room: A modified classroom designed to resemble meeting rooms. The desks in this layout have been brought together to create a large table in the middle of the room.

  • Large Classroom: This virtual environment was developed to seat a large number of students. This scenario is appropriate for lecture halls with minimal group-based discussions.

To speed up future development, we also created a script that allows developers to automatically generate their desired classroom layouts by specifying the required layout parameters in a JSON input file. A demo video highlighting the navigation and interaction aspects of our prototype implementation can be found here [11].
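The exact schema of our layout files is not reproduced here; the Python sketch below illustrates the general idea with a hypothetical set of layout parameters and a simple desk-placement routine.

```python
import json

# Hypothetical layout description; the actual JSON schema of our script may differ.
LAYOUT_JSON = """
{
  "name": "large_classroom",
  "rows": 6,
  "desks_per_row": 8,
  "desk_spacing_m": 1.0,
  "row_spacing_m": 1.5
}
"""

def generate_desk_positions(layout):
    """Return a flat list of (x, z) desk positions for the given layout."""
    positions = []
    for row in range(layout["rows"]):
        for col in range(layout["desks_per_row"]):
            positions.append((col * layout["desk_spacing_m"],
                              row * layout["row_spacing_m"]))
    return positions

layout = json.loads(LAYOUT_JSON)
print(len(generate_desk_positions(layout)))  # 48 desks for this example layout
```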

4.2 Application Components

As described in the Approach section, our implementation comprises three major components:

Body-Tracking. We selected the Microsoft Azure Kinect as the body-tracking device for our implementation. Each remote user is equipped with a Kinect device and a dedicated client PC. Each user’s Kinect captures their skeleton joints and updates the user’s own avatar inside a Unity application running on the XR component. The joints tracked by the Kinect sensor are shown in Fig. 4. The Kinect client also streams the user’s joint information to all other users within the same shared virtual classroom environment (via the communication component). In our prototype implementation, we tested this by having two remote users in the same virtual classroom environment at the same time.

Fig. 4. Left: The skeleton joints of two users tracked by the Kinect. Right: All joints that the Azure Kinect sensor is capable of tracking [12].
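Conceptually, each Kinect client runs a capture-and-publish loop similar to the Python sketch below; the SDK call is replaced by a stand-in function, and the broker address is a placeholder.

```python
import json
import time
import paho.mqtt.client as mqtt

USER_TOPIC = "kinectCoords"   # this user's dedicated channel (see the Communication component)

def get_tracked_joints():
    """Stand-in for the Azure Kinect Body Tracking SDK query that returns the
    latest tracked skeleton, or None when no body is in view."""
    return None   # replace with the actual SDK call in a real client

client = mqtt.Client()
client.connect("broker.local", 1883)          # placeholder broker address
client.loop_start()

while True:
    joints = get_tracked_joints()
    if joints is not None:
        payload = json.dumps({"userId": "userA", "joints": joints})
        client.publish(USER_TOPIC, payload)   # stream this frame's joints to the broker
    time.sleep(1 / 30)                        # roughly the sensor's body-tracking frame rate
```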

Communication. The communication component serves as the backbone of this implementation. Due to the high volume of messages that need to be sent over the network, picking an efficient, lightweight messaging protocol is imperative. We used the MQTT protocol as the communication back-end for our implementation. MQTT follows a publish-subscribe messaging pattern: the MQTT broker is responsible for directing all of the shared virtual environment messages to their appropriate communication channels. A dedicated Raspberry Pi device served as our MQTT broker.

Kinect client PCs are always running an application that connects them to the MQTT broker and allows them to ‘publish’ (send) messages to specific communication channels. In our implementation, for example, coordinates for remote user A were streamed to a channel called ‘kinectCoords’, while coordinates for user B were streamed to a channel called ‘kinectCoords2’. Any device connected to the same MQTT broker can ‘subscribe’ (listen) to messages sent on these channels.

Whenever a user is tracked by the Kinect, their client PC application publishes the tracked skeletal joints, along with the user’s ID, to the MQTT broker on one of these channels. The broker in turn transmits these messages to all other devices listening on these channels, including the XR applications running on each remote user’s XR component. These applications take the skeletal joint messages and update the corresponding user’s avatar inside each user’s local copy of the shared virtual classroom environment.
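On the receiving side, the message handler conceptually routes each payload to the corresponding avatar, as in the Python sketch below; in our prototype this logic lives inside the Unity client, and the field names shown here are illustrative.

```python
import json

class Avatar:
    """Minimal stand-in for the Unity-side avatar rig of one remote user."""
    def __init__(self, user_id):
        self.user_id = user_id
        self.joints = {}

    def apply_joints(self, joints):
        # In the Unity client this re-poses the rigged avatar; here we just store the pose.
        self.joints = joints

avatars = {}   # maps a user ID to that user's avatar in the local scene

def on_message(client, userdata, msg):
    """Route an incoming skeletal-joint message to the matching avatar."""
    data = json.loads(msg.payload)            # 'userId'/'joints' fields are illustrative
    user_id = data["userId"]
    avatars.setdefault(user_id, Avatar(user_id)).apply_joints(data["joints"])
```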

XR Rendering. For each remotely connected user, the XR rendering component is responsible for displaying the user’s local view of the shared virtual environment. This component comprises display hardware such as AR/VR/MR headsets; however, regular PC screens can also be used to present the shared virtual environment (for improved accessibility). The component always runs a Unity application containing the current state of the shared virtual environment. The application is responsible for listening to messages sent via the MQTT broker and accurately updating the positions of all the users in the shared virtual environment. The positions of all interactable objects in the environment are also updated by this application. For our prototype, the Kinect client PCs also ran the Unity XR application and displayed the local view of the shared environment for each remote user.

4.3 Performance Evaluation

To evaluate the suitability of our implementation for real-time remote collaboration, we conducted a few validation tests. Two remote users (A and B) met and interacted with each other in our shared virtual classroom environment. The users performed simple actions like moving around and waving their hands (Fig. 5). The network latency for these interactions was measured for three remote-user configurations. Equation 1 was used to calculate the travel time of 10,000 MQTT messages between the Kinect client and XR applications of user A and user B. The round-trip travel time of each message was measured on user A’s clock and halved, which negates the effect of minor system-clock differences between the users’ client PCs; the resulting one-way delays were averaged over all messages. The results of these latency tests are summarized in Table 1.

$$t_{Travel} = \frac{t_{rec} - t_{sent}}{2} \tag{1}$$

where:

  • \(t_{Travel}\) is the time taken to send a message from user A’s Kinect client application to user B’s XR application over the internet using MQTT (the average one-way network delay).

  • \(t_{sent}\) is the timestamp at which user A’s Kinect client application sent a message to user B’s XR application via MQTT.

  • \(t_{rec}\) is the timestamp at which user A’s Kinect client application receives confirmation from user B’s XR application that the message was received.
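A measurement loop on user A’s side could be sketched as follows, assuming user B’s client simply echoes every ping it receives back on a pong topic; the topic names and broker address are illustrative.

```python
import time
import statistics
import paho.mqtt.client as mqtt

PING_TOPIC = "latency/ping"   # user B's client echoes anything received here...
PONG_TOPIC = "latency/pong"   # ...back on this topic (illustrative convention)
N_MESSAGES = 10000

one_way_delays = []

def on_message(client, userdata, msg):
    t_rec = time.time()
    t_sent = float(msg.payload.decode())          # the echoed message carries t_sent
    one_way_delays.append((t_rec - t_sent) / 2)   # Equation (1), both stamps on A's clock

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.local", 1883)              # placeholder broker address
client.subscribe(PONG_TOPIC)
client.loop_start()

for _ in range(N_MESSAGES):
    client.publish(PING_TOPIC, str(time.time()))
    time.sleep(0.01)                              # pace the pings

time.sleep(2)                                     # allow the last echoes to arrive
if one_way_delays:
    print(f"mean one-way delay: {statistics.mean(one_way_delays):.3f} s "
          f"over {len(one_way_delays)} echoed messages")
```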

Fig. 5. Left: The avatars of the two remote users interacting. Right: The avatar of a remote user as seen from the other user’s point of view.

Table 1. Communication latency for different distances between locations.

In the first test configuration, both users were located in separate rooms within the same building at our research lab facility (Virginia Tech, Blacksburg, VA, USA). As both users were on the same network, the latency of the communication component was a minimal 0.056 s. As a result, avatar movements were updated almost instantaneously for both users.

In the second test configuration, user A was located at our lab facility, while user B joined the virtual classroom environment from the university library, three miles away from the lab. In this test, we measured a delay of 0.06 s, which still allowed for fast avatar position updates.

In our final test configuration, user A was located in Blacksburg, VA, USA, while user B joined the virtual classroom from Zagreb, Croatia. In this test, we measured a larger delay of 0.120 s per message. We also observed more jitter in the avatar movements, but the avatars still matched the users’ body movements. Given the substantial physical distance between the two users (4710 mi), we believe this to be a tolerable amount of latency for real-time collaborative applications.

5 Discussion

Embodied cognition in shared virtual environments can be enhanced by ensuring that the latency between the movement of a user’s body and that of their virtual avatar is minimal. In our evaluation, we measured the communication latency in updating avatars for two remote users in three scenarios. We observed that the time taken to update user avatar positions grows with the physical distance between the users. We also noticed some jitter in the avatar movements, which was more pronounced when the physical distance between the users was large. This jitter can be attributed to packets of skeleton-joint data not reaching the users’ XR applications quickly enough, and it can be reduced in future iterations of this implementation to improve the quality of the embodied experience.

Embodied cognition and interaction in remote collaboration can be of special relevance to online classrooms. Reportedly, many students are not able to pay as much attention in online classes as they could in physical classrooms. For students with attention-deficit/hyperactivity disorder (ADHD), focusing during online lectures is especially challenging. Research shows that ADHD is not simply a set of mental functions, but rather a range of bodily dynamics through which humans engage with their environment [10]. Embodied cognition in remote collaboration therefore has the potential to significantly help students with attention disorders pay attention in online classrooms.

Virtual-environment-based classroom sessions can also gradually be incorporated into existing Zoom-based online classrooms. During these sessions, students and the instructor can log into a shared virtual environment. The classroom can serve as a homeroom where the instructor provides lesson overviews and introductions, and sets expectations for what students will encounter in that day’s virtual environment. Virtual classroom sessions can range from simple sessions, where the instructor presents virtual models to the class, to transportation into completely virtual landscapes [14].

For example, the doors of the classroom could be made to be portals that teleport the students to Africa, where they can interact with local flora and fauna and gaze upon Mount Kilimanjaro. The students can use the doors/portals to move between classrooms for different subjects. Students can be allowed to dynamically interact with the objects in a room to change their state. Virtual objects can also be implemented to have textual and audio properties that play when a student interacts with them [2]. The interactive nature of virtual environments can captivate and hold student attention for much longer than in a typical lecture session. For this reason virtual environments can serve well as a supplement to existing online education.

However, a primary challenge to incorporating embodied interactions in shared virtual environments is making these environments accessible to all users. While VR, AR, and MR headsets are slowly becoming commercially viable, they are still a few years away from being used as personal work devices (like laptops). Body-tracking sensors are also typically confined to niche areas like gaming, animation, and research. The lack of commercial access to specific hardware can be a barrier to the widespread use of embodied interaction and XR.

A potential solution to this problem is to democratize access by also making the shared virtual environments available as web-enabled experiences. Users would be able to open a browser window and log into the virtual environment, just as they would to attend a Zoom meeting. They would interact with the same virtual content through their browser as they would through an XR headset. The primary difference between the two is that XR users would experience the environment from a first-person perspective, while online users would experience it in the third person. Both types of users could create custom avatars to represent themselves in the shared virtual environment. XR users could navigate and interact with the environment using their controllers, while online users would use a keyboard and mouse to move their avatars around. This kind of implementation would be accessible to a much wider range of users and could open up exciting new avenues for platform-agnostic collaboration in shared virtual environments.

6 Conclusion and Future Work

The objective of this study was to explore the scope of embodied interactions for remote meetings in shared virtual environments. The outbreak of COVID-19 has forced people to switch to video-conferencing platforms for remote work. The feeling of immersion, engagement, and presence in such video-conferencing platforms is low, which can result in reduced attention and increased fatigue among remote collaborators.

To address some of these challenges, we proposed and implemented an approach to conducting virtual meetings that leverages embodied interaction. These embodied interaction techniques can help facilitate better remote collaboration in shared virtual environments. Our approach comprises a body-tracking component, a communication component, and an XR rendering component. We developed a virtual classroom scenario as our prototype implementation and connected two locations over 4000 mi apart. The maximum latency we observed during our performance evaluation was 120 ms, suggesting the potential of this kind of approach for facilitating real-time remote collaboration.

Next steps will involve porting the system to standalone XR headsets and testing its usability. We also intend to enhance our system design by exploring more efficient data management and communication protocols that can reduce jitter in avatar movements and allow several remote users to join the virtual environment. These enhancements will be implemented and validated through a user study in our future work.