1 Introduction

This chapter gives an overview of interaction techniques for mixed reality (MR) and its variants, augmented reality (AR) and virtual reality (VR). Early research in the field of MR interaction techniques focused on surface-based, tangible, and gesture-based interaction, which is presented at the beginning of this chapter. Further modalities, such as pen-based, gaze-based, or haptic interaction, have more recently gained attention and are presented next. Further, with the move toward productivity-oriented use cases, interaction with established input devices such as keyboard and mouse has gained interest from the research community. Finally, inspired by the popularity of conversational agents, interaction with intelligent virtual agents is discussed.

The development of interaction techniques is closely related to advances in input devices. Hence, the reader is invited to study the corresponding book chapter as well.

While this chapter follows the abovementioned structure, interaction techniques can also be organized according to interaction tasks [1] such as object selection [2,3,4] and object manipulation [5], navigation [6], symbolic input [7], or system control [8]. Further, interaction techniques for specific application domains have been proposed, such as music [9], games [10], or immersive analytics [11].

Interested readers are also referred to further surveys and books in areas such as 3D interaction techniques [12] or interaction with smart glasses [13].

2 Tangible and Surface-Based Interaction

This section presents the concepts of tangible user interfaces (TUIs) and their applicability in AR. It covers the effects of output media, spatial registration approaches for TUIs, tangible magic lenses, augmenting large surfaces like walls and whole rooms, the combination of AR with shape-changing displays, and the role of TUIs for VR-based interaction. Figure 5.1 depicts an overview of the input and output devices typically found in TUI-based interaction for MR.

Fig. 5.1 A classification of input and output devices used in tangible user interfaces for AR and VR. OST: Optical See-Through. VST: Video See-Through. HMDs: Head-Mounted Displays

TUIs are concerned with the use of physical objects as a medium for interaction with computers [14] and have gained substantial interest in human–computer interaction [15]. Early prototypes utilized tabletop settings on which physical objects were placed to change properties of digital media. For example, Underkoffler and Ishii introduced a simulation of an optical workbench using tangible objects on a tabletop [16] as well as an application for architectural planning [17].

In AR, this concept was introduced by Kato et al. [18] as tangible augmented reality (TAR). They used a paddle as a prop, equipped with a fiducial, to place furniture inside a house model. Fjeld et al. [19] introduced further tangibles, such as a booklet and a cube, for interaction within an educational application for chemistry.

In TAR, AR is used to visualize digital information on physical objects while those physical objects serve as interaction devices. Billinghurst et al. [20] stated that TAR is characterized by a spatial registration between virtual and physical objects and by the users' ability to interact with the virtual objects by manipulating the physical ones. Regenbrecht et al. [21] utilized a rotary plate to allow multiple co-located users to manipulate the orientation of a shared virtual object.

This way, the gap between digital output (e.g., on a flat screen) and physical input (e.g., using a rotary knob) can be reduced, as the digital information is directly overlaid on the physical content.

Lee et al. [22] described common interaction themes in TAR applications, such as static and dynamic mappings between physical and digital objects. They describe a space-multiplexed approach, in which each physical tool is mapped to a single virtual tool or function, as well as a time-multiplexed approach, in which a physical object is mapped to different digital tools depending on the context of use.
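
To make the distinction concrete, the following minimal sketch contrasts a space-multiplexed and a time-multiplexed tool mapping; the prop names, contexts, and functions are illustrative assumptions and not taken from the cited work.

```python
# A minimal sketch contrasting space- and time-multiplexed tool mappings;
# prop names, contexts, and functions are illustrative assumptions.
SPACE_MULTIPLEXED = {
    "paddle": "place_furniture",   # each physical tool has exactly one virtual function
    "cube": "rotate_molecule",
}

TIME_MULTIPLEXED = {
    # one physical prop, different virtual tools depending on the context of use
    ("paddle", "layout_mode"): "place_furniture",
    ("paddle", "paint_mode"): "recolor_surface",
}

def resolve_tool(prop, context=None):
    """Return the virtual tool currently bound to a physical prop."""
    if context is not None and (prop, context) in TIME_MULTIPLEXED:
        return TIME_MULTIPLEXED[(prop, context)]
    return SPACE_MULTIPLEXED.get(prop)
```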

However, the effect of this overlay also depends on the output medium used. For example, when using projection-based systems [23] or video see-through (VST) head-mounted displays (HMDs) (cf. chapter 10 in [24]), the distance between the observer and the physical and virtual objects is the same. In contrast, when using commodity optical see-through (OST) HMDs with a fixed focal plane, there can be a substantial cost to perceiving virtual and physical objects at the same time. Specifically, Eiberger et al. [25] demonstrated that when visual information from objects within arm's reach (in this case, a handheld display) is processed jointly with information presented on an OST HMD at a different distance, task completion time increases by approximately 50% and the error rate by approximately 100% compared with processing this visual information solely on the OST HMD.

For spatially registering physical and virtual objects, early works on TAR often relied on fiducial markers, such as those provided by ARToolKit [26] or ArUco [27]. While easy to prototype (fiducials simply have to be printed out and attached to objects), these markers can inhibit interaction due to their susceptibility to occlusion (typically through hand and finger interaction). Hence, it is advisable to use modern approaches for hand-based interaction [28, 29] with spatially tracked rigid and non-rigid objects [30,31,32].
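
As an illustration of marker-based registration, the sketch below detects an ArUco fiducial and estimates its pose with OpenCV's ArUco module (opencv-contrib-python; the exact API varies across OpenCV versions). The camera intrinsics, distortion coefficients, marker size, and image file are assumed placeholder values, not parameters from the cited systems.

```python
# A minimal sketch of fiducial-based registration using OpenCV's ArUco module.
import cv2
import numpy as np

camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])   # assumed intrinsics from a calibration step
dist_coeffs = np.zeros(5)                     # assume negligible lens distortion
marker_length = 0.05                          # marker side length in meters (assumption)

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
frame = cv2.imread("frame.png")               # hypothetical camera image of the tangible prop

corners, ids, _ = cv2.aruco.detectMarkers(frame, dictionary)
if ids is not None:
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_length, camera_matrix, dist_coeffs)
    # rvecs/tvecs give each marker's pose in the camera frame; virtual content
    # attached to the prop is rendered using this transform.
```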

A specific kind of TAR can be seen in tangible magic lenses, which evolved from a combination of the magic lens [33] and tangible interaction concepts [14]. Tangible magic lenses allow for access to and manipulation of otherwise hidden data in interactive spatial environments. A wide variety of interaction concepts for interactive magic lenses have been proposed within the scope of information visualization (see surveys [34, 35]).

Within AR, various rigid shapes have been explored, including rectangular lenses for tabletop interaction [36] and circular lenses [37]. Flexible shapes (e.g., [38]) have been utilized, as well as multiple sheets of paper [39]. In their pioneering work, Szalavári and Gervautz [40] introduced the personal interaction panel in AR: a two-handed, pen-operated tablet that allowed for the selection and manipulation of virtual objects as well as for system control. Additionally, transparent props (e.g., a piece of plexiglass) have been explored both for tabletop AR [41,42,43] and VR [44]. Purely virtual tangible lenses have been proposed as well [45]. Brown et al. [46] introduced a cubic shape for perspectively correct rendering and manipulation of 3D objects or text. This idea was later revisited by Issartel et al. [47] in a mobile setting.

Often, projection-based AR has been used to realize tangible magic lenses, in which a top-mounted projector illuminates a prop such as a piece of cardboard or other reflective materials [36, 48] and (typically RGB or depth) cameras process user input.

Mobile devices such as smartphones and tablets are also commonly used as a tangible magic lens [49, 50], and can be used in conjunction with posters [49], books [51], digital screens [50], or maps [52, 53].

When using the tangible magic lens metaphor in public space, one should be aware of its social acceptability, specifically due to the visibility of spatial gestures and postures [54, 55]. For example, in a series of studies across gaming and touristic use cases, Grubert et al. [56, 57] explored benefits and drawbacks of smartphone-based tangible lens interfaces in public settings and compared them with traditional static peephole interaction, commonly used in mobile map applications. They found that user acceptance is largely dependent on the social and physical setting. At a public transportation stop in a large open space used as a transit area, participants favored the magic lens over a static peephole interface despite tracking errors, fatigue, and potentially conspicuous gestures. Also, most passersby did not pay attention to the participants and vice versa. However, when deploying the same experience at a different public transportation stop with another spatial and social context (a waiting area with less space to avoid physical proximity to others), participants used and preferred the magic lens interface significantly less compared with a static peephole interface.

Further, when using smartphones or tablets as magic lenses, the user's view is, by default, based on the position of the physical camera attached to the handheld device. However, this can negatively affect the user's experience [58, 59]. Hence, it can be advisable to incorporate user-perspective rendering, which renders the scene from the point of view of the user's head. In this domain, Hill et al. [60] introduced user-perspective rendering as virtual transparency for VST AR. Baričević et al. [61] compared user- vs. device-perspective rendering in a VR simulation. Tomioka et al. [62] presented approximated user-perspective rendering using homographies. Grubert et al. [63] proposed a framework for enabling user-perspective rendering to augment public displays. Čopič et al. [58, 59] quantified the performance differences between device- and user-perspective rendering in map-related tasks. Mohr et al. [64] developed techniques for the efficient head tracking needed for user-perspective rendering.
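
The sketch below illustrates the homography-based approximation in spirit only: it warps the device-camera image toward the user's viewpoint, assuming a roughly planar scene and four reference points known both in the camera image and in the desired user-perspective view. Function and variable names are our assumptions, not the cited implementation.

```python
# A minimal sketch of approximated user-perspective rendering via a homography;
# assumes a roughly planar scene and four known point correspondences.
import cv2
import numpy as np

def user_perspective_warp(camera_frame, pts_device_px, pts_user_px, out_size):
    """Warp the device-camera image so the scene appears as seen from the user's head.

    pts_device_px: four scene reference points in the device-camera image (pixels).
    pts_user_px:   the same points projected into the desired user-perspective view.
    out_size:      (width, height) of the handheld display in pixels.
    """
    H, _ = cv2.findHomography(np.float32(pts_device_px), np.float32(pts_user_px))
    return cv2.warpPerspective(camera_frame, H, out_size)
```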

Beyond handheld solutions, whole surfaces such as tables, walls, or body parts can be augmented and interacted with. Often, projector-camera systems are used for processing input and creating output on surfaces. Early works included augmenting desks using projectors to support the office work of single users [65,66,67] or collaborative settings [68]. Later, the Microsoft Kinect and further commodity depth sensors gave rise to a series of explorations with projector-camera systems.

For example, Xiao et al. [69] introduced WorldKit to allow users to sketch and operate user interface elements on everyday surfaces. Corsten et al. [70] proposed a pipeline for repurposing everyday objects as input devices. Henderson and Feiner also proposed the use of passive haptic feedback from everyday objects to interact with virtual control elements such as virtual buttons [71].

Mistry and Maes [72] utilized a necklace-mounted projector-camera system to sense finger interactions and project content on hands or the environment. Following suit, Harrison et al. [73] introduced OmniTouch, a wearable projector-depth-camera system that allowed for projecting user interface elements on body parts, such as the hand (e.g., a virtual dial pad), or for augmenting paper using touch.

Further, the idea of interacting with augmented surfaces was later expanded to cover bent surfaces [74], walls [75], complete living rooms [76], and even urban facades [77, 78]. For example, in IllumiRoom [75], the area around a television was augmented using a projector after initially scanning it with a depth camera. Possible augmentations included extending the field of view of on-screen content, selectively rendering scene elements of a game, or changing the appearance of the whole environment using non-photorealistic renderings (e.g., a cartoon style or a wobble effect). In RoomAlive, multiple projector-depth-camera systems were used to create a 3D scan of a living room as well as to spatially track the user's movements within that room. Users were able to interact with digital elements projected into the room using touch and in-air gestures. Apart from entertainment purposes, this idea was also investigated in productivity scenarios such as collaborative content sharing in meetings [79]. Finally, the augmentation of shape-changing interfaces has also been explored [80,81,82]. For example, in Sublimate [82], an actuated pin display was combined with a stereoscopic see-through screen to achieve a close coupling between physical and virtual object properties, e.g., for height fields or NURBS surface modeling. InForm [81] expanded this idea to allow both for user input on its pins (e.g., utilizing them as buttons or handles) and for manipulation of external objects (such as moving a ball across its surface).

In VR, tangible interaction has been explored using various props. A benefit of using tangibles in VR is that a single physical object can be used to represent multiple virtual objects [83], even if they show a certain extent of discrepancy. Simeone et al. [84] presented a model of potential substitutions based on physical objects such as mugs, bottles, umbrellas, and a torch. Hettiarachchi et al. [85] transferred this idea to AR. Harley et al. [86] proposed a system for authoring narrative experiences in VR using tangible objects.

3 Gesture-Based Interaction

Touch and in-air gestures and postures make up a large part of interpersonal communication and have also been explored in depth in mixed reality. A driver for gesture-based interaction was the desire for “natural” user interaction, i.e., interaction without the need to explicitly handle artificial control devices, relying instead on easy-to-learn interaction with (to the user) invisible input devices. While many gesture sets have been explored by researchers or users [87], it can be debated how “natural” those gesture-based interfaces really are [88], e.g., due to the poor affordances of gestures.

Still, the prevalence of small sensors such as RGB and depth cameras, inertial measurement units, radar, or magnetic sensors in mobile devices and AR as well as VR HMDs, together with continuing advances in hand [28, 29], head [89], and body pose estimation [90,91,92,93,94,95,96], gave rise to a wide variety of gesture-based interaction techniques for mixed reality.

For mobile devices, researchers began investigating options for interaction next to [97], above [98, 99], behind [100, 101], across [102,103,104,105], or around [106, 107] the device.

These additional modalities either substitute or complement the devices' input capabilities. Such approaches typically relied on modifying existing devices with a variety of sensing techniques, which can limit their deployment to mass audiences. Hence, researchers started to investigate the use of unmodified devices. Nandakumar et al. [108] proposed the use of the internal microphones of mobile phones to determine the location of finger movements on surfaces, although this approach cannot support mid-air interaction. Song et al. [109] enabled in-air gestures using the front- and back-facing cameras of unmodified mobile devices. With Surround See, Yang et al. [110] modified the front-facing camera of a mobile phone with an omnidirectional lens, extending its field of view to 360° horizontally. They showcased different application areas, such as peripheral environment, object, and activity detection (including hand gestures and pointing), but did not comment on the recognition accuracy. In GlassHands, it was demonstrated how the input space around a device can be extended by using the built-in front-facing camera of an unmodified handheld device and reflective glasses, such as sunglasses, ski goggles, or visors [111]. This work was later extended to work with eye reflections [112, 113].

While explored in tabletop-based AR since the mid-1990s [114,115,116], vision-based finger and hand tracking became popular for handheld AR in the mid-2000s [117,118,119,120]. Yusof et al. [121] provide a survey on the various flavors of gesture-based interaction in handheld AR, including marker-based and marker-less tracking of fingers or whole hands.

An early example of in-air interaction using AR HMDs was presented by Kolsch et al. [122], who demonstrated finger tracking with a head-mounted camera. Xiao et al. [123] showed how to incorporate touch gestures on everyday surfaces into the Microsoft HoloLens. Beyond hand and finger tracking, full-body tracking using head-mounted cameras has also been explored [124]. Also, the reconstruction of facial gestures while wearing HMDs, e.g., for reenactment purposes, has gained increased interest [125,126,127,128].

Further solutions for freehand interaction were also proposed, including a wrist-worn gloveless sensor [129], swept frequency capacitive sensing [130], an optical mouse sensor attached to a finger [131], or radar-based sensing [132].

Most AR and VR in-air interactions aim at using unsupported hands. Hence, to enable reliable selection, targets are designed to be sufficiently large and spaced apart [133]. Also, while the addition of hand tracking to modern AR and VR HMDs allows for easy access to in-air gestures, the accuracy of those spatial tracking solutions is still significantly lower than that of dedicated lab-based external tracking systems [134].

Besides interaction with handheld or head-worn devices, whole environments such as rooms can also be equipped with sensors to facilitate gesture-based interaction [135,136,137].

In VR, off-the-shelf controllers were also appropriated to reconstruct human poses in real time [138, 139].

4 Pen-Based Interaction

In-air interactions in AR and VR typically make use of unsupported hands or controllers designed for gaming. In addition, pens (often in combination with tablets as a supporting surface) have also been explored as input devices. Szalavári and Gervautz [40] as well as Billinghurst et al. [140] utilized pens for input on physical tablets in AR and VR, respectively. Watsen et al. [141] used a handheld Personal Digital Assistant (PDA) for operating menus in VR. In the Studierstube framework, pens were used to control 2D user interface elements on a PDA in AR. Poupyrev et al. [142] used a pen for notetaking in VR. Gesslein et al. [143] used a pen for supporting spreadsheet interaction in mobile VR.

Many researchers have also investigated the use of pens for drawing and modeling. Sachs et al. [144] presented an early system for 3D CAD modeling using a pen. Deering [145] used a pen for in-air sketching in a fishtank VR environment. Keeve et al. [146] utilized a brush for expressive painting in a Cave Automatic Virtual Environment (CAVE). Encarnacao [147] used a pen and pad for sketching in VR on top of an interactive table. Fiorentino et al. [148] explored the use of pens in mid-air for CAD applications in VR. Xin et al. [149] enabled the creation of 3D sketches using pen and tablet interaction in handheld AR. Yee et al. [150] used a pen-like device along with a VST HMD for in situ sketching in AR. Gasquez et al. [151, 152], Arora et al. [153], and Drey et al. [154] noted the benefits of supporting both free-form in-air sketching and sketching on a supporting 2D surface in AR and VR. Suzuki et al. [155] expanded previous sketching applications for AR with dynamic and responsive graphics, e.g., to support physical simulations.

The performance of pen-based input has also been investigated in VR. Bowman and Wingrave [156] compared pen and tablet input for menu selection against floating menus and a pinch-based menu system and found that pen and tablet interaction was significantly faster. Teather and Stuerzlinger [157] compared pen-based input with mouse input for target selection in a fishtank VR environment and found that 3D pointing was inferior to 2D pointing when targets were rendered stereoscopically. Arora et al. [158] compared pen-based mid-air painting with surface-supported painting and found evidence that accuracy improved when using a physical drawing surface. Pham et al. [159] indicated that pens significantly outperform controllers for input in AR and VR and are comparable to mouse-based input for target selection. Batmaz et al. [160] explored different pen grip styles for target selection in VR.

5 Gaze-Based Interaction

Besides input using touch gestures or handheld input devices, gaze has also been explored as an input modality in mixed reality.

Duchowski [161] presents a review of 30 years of gaze-based interaction, categorizing it into four forms: diagnostic (off-line measurement), active (selection, look to shoot), passive (foveated rendering, a.k.a. gaze-contingent displays), and expressive (gaze synthesis).

For VR, Mine [162] proposed the use of gaze-directed steering and look-at menus in 1995. Tanriverdi and Jacob [163] highlighted that VR can benefit from gaze tracking. They stated that physical effort can be minimized through gaze and that users' natural eye movements can be employed to perform interactions in VR (e.g., with distant objects). They also showed that a proposed heuristic gaze selection technique outperforms virtual hand-based interaction in terms of task completion time. Cournia et al. [164] found that dwell-time-based selection was slower than manual ray-pointing. Duchowski et al. [165] presented software techniques for binocular eye tracking within VR as well as their application to aircraft inspection training. Specifically, they presented means for integrating eye trackers into a VR framework, novel 3D calibration techniques, and techniques for eye-movement analysis in 3D space. In 2020, Burova et al. [166] also utilized eye-gaze analysis in industrial tasks. They used VR to develop AR solutions for maintenance tasks and collected gaze data to elicit comments from industry experts on the usefulness of the AR simulation. Zeleznik et al. [167] investigated gaze interaction for 3D pointing, movement, menu selection, and navigation (orbiting and flying) in VR. They introduced “Lazy” interactions that minimize hand movements, “Helping Hand” techniques in which gaze augments hand-based techniques, as well as “Hands Down” techniques in which the hand can operate a separate input device. Piumsomboon et al. [168] presented three novel eye-gaze-based interaction techniques for VR: Duo-Reticles, an eye-gaze selection technique based on eye-gaze and inertial reticles; Radial Pursuit, a smooth-pursuit-based technique for cluttered objects; and Nod and Roll, a head-gesture-based interaction based on the vestibulo-ocular reflex.
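
As a concrete example of the gaze-selection techniques discussed above, the sketch below implements simple dwell-time selection. The 600 ms threshold and the per-frame hit-testing interface are assumptions for illustration, not parameters from the cited studies.

```python
# A minimal sketch of dwell-time gaze selection; threshold and interface are assumptions.
import time

DWELL_TIME_S = 0.6  # assumed dwell threshold

class DwellSelector:
    def __init__(self):
        self.current_target = None
        self.gaze_enter_time = None

    def update(self, hovered_target):
        """Call once per frame with the object currently hit by the gaze ray (or None).

        Returns the selected object when the dwell time has elapsed, else None."""
        now = time.monotonic()
        if hovered_target is not self.current_target:
            # Gaze moved to a new target (or away): restart the dwell timer.
            self.current_target = hovered_target
            self.gaze_enter_time = now
            return None
        if hovered_target is not None and now - self.gaze_enter_time >= DWELL_TIME_S:
            self.gaze_enter_time = now  # re-arm so the selection does not fire every frame
            return hovered_target       # selection event
        return None
```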

6 Haptic Interaction

Auditory and visual channels are widely addressed sensory channels in AR and VR systems. Still, human experiences can be enriched greatly through touch and physical motion. Haptic devices enable interaction between humans and computers by rendering mechanical signals that stimulate the human touch and kinesthetic channels. The field of haptics has a long-standing tradition and incorporates expertise from various fields such as robotics, psychology, biology, and computer science. Haptic devices also play a role in diverse application domains such as gaming [169], industry [170], education [171], and medicine [172,173,174]. Haptic interactions are based on cutaneous/tactile (i.e., skin-related) and kinesthetic/proprioceptive (i.e., related to the body pose) sensations. Various devices have been proposed for both sensory channels, varying in form factor, weight, mobility, and comfort, as well as in the fidelity, duration, and intensity of haptic feedback. For recent surveys, we refer to [175, 176].

In VR, the use of haptic feedback has a long tradition [177]. A commonly used active haptic device for stationary VR environments with a limited movement range of the user's hands is the PHANToM, a grounded system (or manipulandum) offering high fidelity but low portability. Hence, over time, substantial research efforts have been made in creating mobile haptic devices for VR [176].

In AR, the challenge in using haptics is that the display typically occludes real objects the user might want to interact with. Also, in OST displays, the haptic device is still visible behind virtual objects rendered on the display. When using VST displays, the haptic device might be removed by inpainting [178].

Besides active haptic systems, researchers have also investigated the use of low-fidelity physical objects to augment virtual environments in passive haptics. An early example of this type of haptic feedback is presented by Insko [179], who showed that passive haptics can improve both sense of presence and spatial knowledge training transfer in a virtual environment.

A challenge when using passive haptic feedback, besides a mismatch in surface fidelity, is that the objects used for feedback are typically static. To mitigate this problem, two strategies can be employed. First, the objects themselves can be moved during interaction by mounting them on robotic platforms [180, 181] or by having human operators move them [182, 183]. Second, the movements of the users themselves can be redirected to a certain extent by decoupling the physical motion of a user from the perceived visual motion. This can be done with individual body parts such as the hands [184, 185] or with the whole body using redirected walking techniques [186, 187].
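
The sketch below illustrates the second strategy for an individual body part with a simple body-warping scheme in the spirit of hand redirection: the rendered hand is gradually offset so that reaching a physical prop coincides with reaching the virtual target. The linear blending and parameter names are our assumptions.

```python
# A minimal sketch of body-warping hand redirection; blending scheme is an assumption.
import numpy as np

def redirected_hand_position(real_hand, real_target, virtual_target, start_pos):
    """Return the position at which to render the virtual hand.

    All arguments are 3D positions (np.array); start_pos is where the reach began.
    The offset between the physical prop (real_target) and the virtual object
    (virtual_target) is blended in with reach progress, keeping the
    visual-proprioceptive mismatch small at any instant."""
    total = np.linalg.norm(real_target - start_pos)
    progress = 1.0 if total == 0 else np.clip(
        np.linalg.norm(real_hand - start_pos) / total, 0.0, 1.0)
    offset = virtual_target - real_target
    return real_hand + progress * offset
```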

7 Multimodal Interaction

While AR and VR systems often offer a single input channel along with audio-visual output, rich interaction opportunities arise when considering the combination of further input and output modalities. Complementing the strengths of multiple channels can lead to improved user experiences. While multimodal (or multisensory) output is typically concerned with increasing immersion and the sense of presence in a scene, multimodal input typically aims to increase the efficiency of user interaction with an AR or VR system. For overviews of multimodal interaction beyond AR and VR, we refer to [188, 189]. Nizam et al. also provided a recent overview of multimodal interaction specifically for AR [190].

The use of multisensory output, such as the combination of audiovisual output with smell and touch, has been shown to increase presence and perceived realism in VR [191, 192] and has been employed as early as the 1960s [193]. Gallace et al. discussed both benefits and challenges of utilizing multiple output modes in VR [194]. Extrasensory experiences [195, 196] (such as making temperature visible through infrared cameras) have also been explored [197].

In AR, Narumi et al. [198] showed that increasing the perceived size of a real cookie using AR also increased the feeling of satiety. Narumi et al. [199] also created a multisensory eating experience in AR by changing the apparent look and smell of cookies. Koizumi et al. [200] modulated the perceived food texture using a bone-conducting speaker. Ban et al. [201] showed that it is possible to influence fatigue while handling physical objects by affecting their perceived weight through modulating their size in AR.

Regarding multimodal input in VR, the combination of speech and gestures is commonly used. In 1980, Bolt [202] introduced Put-That-There: users could immerse themselves in a Media Room and place objects within that environment through a combination of gestures and speech. In 1989, Hauptmann [203] showed that users preferred a combination of speech and gestures for the spatial manipulation of 3D objects. Cohen et al. [204] used a handheld computer along with speech and gesture for supporting map-based tasks on a virtual workbench. LaViola [205] used hand-based interaction (sensed through a data glove) along with speech for interior design in VR. Ciger et al. [206] combined speech with pointing of a magic wand on an immersive wall to create “magical” experiences. Burdea et al. [207] presented an early survey on VR input and output devices as well as an overview of studies that quantify the potential of several modalities for simulation realism and immersion. Prange et al. [208] studied the use of speech and pen-based interaction in a medical setting.

In AR, Olwal et al. [209] combined speech and gestures for object selection. Kaiser et al. [210] extended that work by introducing mutual disambiguation to improve selection robustness. Similarly, Heidemann et al. [211] presented an AR system for online acquisition of visual knowledge and retrieval of memorized objects using speech and deictic (pointing) gestures. Kolsch et al. [122] combined speech input with gestures in an outdoor AR environment. Piumsomboon [212] studied the use of gestures and speech vs. gestures only for object manipulation in AR and found that the multimodal condition was not substantially better than gesture-only interaction for most tasks (except object scaling). This indicates that multimodality per se is not always beneficial for interaction but needs to be carefully designed to suit the task at hand. Rosa et al. [213] discussed different notions of AR and mixed reality as well as the role of multimodality. Wilson et al. [214] used a projector-camera system mounted on a pan-tilt platform for multimodal interaction in a physical room using a combination of speech and gestures.

The combination of touch and 3D movements has also been explored in VR and AR. Tsang et al. [215] introduced the Boom Chameleon, a touch display mounted on a tracked mechanical boom, and used joint gesture, speech, and viewpoint input in a 3D annotation application. Benko et al. [216] combined on-surface and in-air gestures for content transfer between a 2D screen and 3D space. Mossel et al. [217] and Marzo et al. [218] combined touch input and handheld device movement for 3D object manipulation in mobile AR. Polvi et al. [219] utilized touch and the pose of a handheld touchscreen for object positioning in mobile AR. Grandi et al. [220] studied the use of touch and the orientation of a smartphone for collaborative object manipulation in VR. Surale et al. [221] explored the use of touch input on a spatially tracked tablet for object manipulation in VR. Menzner et al. [222] utilized combined in-air and touch movements on and above smartphones for efficient navigation of multiscale information spaces in VR. Several authors combined pen input both in mid-air and on touch surfaces to enhance sketching in VR [154] and AR [151,152,153].

Also, the combination of eye-gaze with other modalities such as mid-air gestures and head movements has seen recent interest for interaction in AR and VR. For example, Pfeuffer et al. [223] investigated the combination of gaze and gestures in VR. They described Gaze + Pinch, which integrates eye gaze to select 3D objects, and indirect freehand gestures to manipulate those objects. They explored this technique for object selection, manipulation, scene navigation, menu interaction, and image zooming. Similarly, Ryu et al. [224] introduced a combined grasp eye-pointing technique for 3D object selection. Kyto et al. [225] combined head and eye gaze for improving target selection in AR. Sidenmark and Gellersen [226, 227] studied different techniques combining eye and head pointing in VR. Gesslein et al. [143] combined pen-based input with gaze tracking for efficient interaction across multiple spreadsheets. Biener et al. [228] utilized gaze and touch interaction to navigate virtual multi-display environments.
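
To illustrate the Gaze + Pinch style of interaction described above, the sketch below lets gaze pre-select an object, a pinch confirm the selection, and relative hand motion manipulate it indirectly. The update interface and field names are illustrative assumptions; positions are assumed to be NumPy-like 3D vectors.

```python
# A minimal sketch of a gaze-select, pinch-manipulate interaction; interface is an assumption.
class GazePinchManipulator:
    def __init__(self):
        self.grabbed = None
        self.last_hand_pos = None

    def update(self, gazed_object, is_pinching, hand_pos):
        """Call once per frame; gazed_object is the object hit by the gaze ray (or None)."""
        if not is_pinching:
            self.grabbed = None                      # releasing the pinch drops the object
        elif self.grabbed is None and gazed_object is not None:
            self.grabbed = gazed_object              # gaze selects, pinch grabs
            self.last_hand_pos = hand_pos
        elif self.grabbed is not None:
            delta = hand_pos - self.last_hand_pos    # indirect, relative manipulation
            self.grabbed.position = self.grabbed.position + delta
            self.last_hand_pos = hand_pos
```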

8 Multi-Display Interaction

Traditionally, the output of interactive systems has often been limited to a single display, ranging from smartwatches to gigapixel displays. However, multi-display environments, from the desktop to gigapixel displays, are increasingly common for knowledge work and complex tasks such as financial trading or factory management as well as for social applications such as second-screen TV experiences [229]. Surveys about multi-display systems and distributed user interfaces have been presented by Elmqvist [230], Grubert et al. [229, 231, 232], and Brudy et al. [233].

Augmented reality has the potential to enhance interaction with both small and large displays by adding an unlimited virtual screen space or other complementary characteristics such as mobility. However, this typically comes at the cost of a lower display fidelity compared with a physical panel display (such as lower resolution, lower contrast, or a smaller physical field of view in OST HMDs).

In 1991, Feiner et al. [234] proposed a hybrid display combining a traditional desktop monitor with an OST HMD and explored a window manager application. Butz et al. [235] combined multiple physical displays, ranging from handheld to wall-sized ones, with OST HMDs in a multi-user collaborative environment. Baudisch et al. [236] used a lower-resolution projector to facilitate focus-and-context interaction on a desktop computer. MacWilliams et al. [237] proposed a multi-user game in which players could interact with tabletop, laptop, and handheld displays. Serrano et al. [238] proposed to use an OST HMD to facilitate content transfer between multiple physical displays on a desktop. Boring et al. [239] used a smartphone to facilitate content transfer between multiple stationary displays. They later extended the work to manipulate screen content on stationary displays [240] and interactive facades [241] using smartphones. Raedle et al. [104] supported interaction across multiple mobile displays through a top-mounted depth camera. Grubert et al. [105, 242] used face tracking to allow user interaction across multiple mobile devices, which could be dynamically re-positioned. They also proposed to utilize face tracking [242, 243] to create a cubic VR display with user-perspective rendering. Butscher et al. [244] explored the combination of VST HMDs with tabletop displays for information visualization. Reipschläger et al. [245, 246] combined a high-resolution horizontal desktop display with an OST HMD for design activities. Gugenheimer et al. [247] introduced FaceTouch, which allows interacting with display-fixed user interfaces (using direct touch) and world-fixed content (using raycasting). This work was later extended to utilize three touch displays around the user's head [248]. Gugenheimer et al. also introduced ShareVR [249], which enabled multi-user and multi-display interactions between users inside and outside of VR.

A number of systems have also concentrated on the combination of HMDs with handheld as well as body-worn displays, such as smartwatches, smartphones, and tablets, in mobile contexts. Here, the head-mounted display typically extends the handheld display with a larger virtual field of view. In MultiFi [250], an OST HMD provides contextual information for higher-resolution touch-enabled displays (smartwatch and smartphone). The authors explored different spatial reference systems such as body-aligned, device-aligned, and side-by-side modes. Similar explorations have followed using video see-through HMDs [251], an extended set of interaction techniques [252], smartwatches [253,254,255], or with a focus on understanding smartphone-driven window management techniques for HMDs [256].

Purely virtual multi-display environments have also been explored in AR and VR. In 1993, Feiner et al. [257] introduced head-surrounding and world reference frames for positioning 3D windows in VR. In 1998, Billinghurst et al. [258] introduced the spatial display metaphor, in which information windows are arranged on a virtual cylinder around the user. Since then, virtual information displays have been explored in various reference systems, such as world-, object-, head-, body-, or device-referenced [259]. Specifically, interacting with windows in body-centered reference systems [260] has attracted attention, for instance, to allow fast access to virtual items [261, 262], mobile multi-tasking [263, 264], and visual analytics [265]. Lee et al. [266] investigated positioning a window in 3D space using a continuous hand gesture. Petford et al. [267] compared the selection performance of mouse and raycast pointing in full coverage displays (not in VR). Jetter et al. [268] proposed to interactively design a space with various display form factors in VR.

9 Interaction Using Keyboard and Mouse

Standard input peripherals such as keyboard and mouse have been the de facto standard for human-computer interaction in personal computing environments for decades. While initially used in projection-based CAVE environments, they were soon replaced by special-purpose input devices and associated interaction techniques for AR and VR (see previous sections). This was partly due to the constraints of those input devices, which make them challenging to use for spatial input with six degrees of freedom: physical keyboards typically support solely symbolic input, and standard computer mice are restricted to two-dimensional pointing (along with button clicks and a scroll wheel). However, with modern knowledge workers still relying on the efficiency of these physical input devices, researchers have revisited how to use them within AR and VR.

With increasing interest in supporting knowledge work using AR and VR HMDs [269,270,271,272], keyboard and mouse interaction drew the attention of several researchers.

The keyboard was designed for the rapid entry of symbolic information, and although it may not be the best mechanism developed for the task, its familiarity, which enables good performance without considerable learning effort, has kept it almost unchanged for many years. However, when interacting with spatial data, keyboard and mouse are perceived as falling short of providing efficient input capabilities [273], even though they are successfully used in many 3D environments (such as CAD or gaming [274]), can be modified to allow 3D interaction [275, 276], and can outperform 3D input devices in specific tasks such as 3D object placement [277, 278]. Also, for 3D object manipulation in AR and VR, they were found to be not significantly slower than a dedicated 3D input device [279].

In VR, a number of works investigated the costs of using physical keyboards for standard text entry tasks. Grubert et al. [280, 281], Knierim et al. [282], and McGill et al. [283] found physical keyboards to be mostly usable for text entry in immersive head-mounted display-based VR but varied in their observations about the performance loss when transferring text entry from the physical to the virtual world. Pham et al. [284] deployed a physical keyboard on a tray to facilitate mobile text entry. Apart from standard QWERTY keyboards, a variety of further text entry input devices and techniques have been proposed for VR; see [7].

Besides using unmodified physical keyboards, there have been several approaches to extending the basic input capabilities of physical keyboards beyond individual button presses. Specifically, input on, above, and around the keyboard surface has been proposed using acoustic [285, 286], pressure [287,288,289], proximity [290], and capacitive sensors [291,292,293,294,295,296], cameras [297,298,299], body-worn orientation sensors [300], or even unmodified physical keyboards [301, 302]. Besides sensing, the actuation of keys has also been explored [303]. Embedding capacitive sensing into keyboards has been studied by various researchers. It lends itself to detecting finger events on and slightly above keys and can be integrated into mass-manufacturing processes. Rekimoto et al. [294] investigated capacitive sensing on a keypad, but not a full keyboard. Habib et al. [292] and Tung et al. [293] proposed to use capacitive sensing embedded into a full physical keyboard to allow touchpad operation on the keyboard surface. Tung et al. [293] developed a classifier to automatically distinguish between text entry and touchpad mode on the keyboard. Shi et al. developed microgestures on capacitive-sensing keys [295, 304]. Similarly, Zheng et al. [305, 306] explored various interaction mappings for finger and hand postures. Sekimoro et al. focused on exploring gestural interactions on the space bar [307]. Extending the idea of LCD-programmable keyboards [308], Block et al. extended the output capabilities of a touch-sensitive, capacitive-sensing keyboard by using a top-mounted projector [296]. Several commercial products have also augmented physical keyboards with additional, partly interactive displays (e.g., Apple Touch Bar, Logitech G19 [309], Razer DeathStalker Ultimate [310]).

Maiti et al. [311] explored the use of randomized keyboard layouts on physical keyboards using an OST display. Wang et al. [312] explored the use of an augmented reality extension to a desktop-based analytics environment. Specifically, they added a stereoscopic data view using a HoloLens to a traditional 2D desktop environment and interacted with keyboard and mouse across both the HoloLens and the desktop.

Schneider et al. [313] explored a rich design space of using physical keyboards in VR beyond text entry. Specifically, they proposed three different input mappings: 1 key to 1 action (standard mode of interaction using keyboards), multiple keys to a single action (e.g., mapping a large virtual button to several physical buttons), as well as mapping a physical key to a coordinate in a two-dimensional input space. Similarly, they proposed three different output mappings: augmenting individual keys (e.g., showing an emoji on a key), augmenting on and around the keyboard (e.g., adding user-interface elements on top of the keyboard such as virtual sliders), as well as transforming the keyboard geometry itself (e.g., only displaying single buttons or replacing the keyboard by other visuals). Those ideas were later also considered in the domain of immersive analytics [314].
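
To make the three input mappings concrete, a minimal sketch is given below; the key names, actions, and normalized coordinate layout are illustrative assumptions rather than the mappings used in the cited work.

```python
# A minimal sketch of 1:1, many-to-one, and key-to-coordinate keyboard mappings;
# key names, actions, and coordinates are illustrative assumptions.
ONE_TO_ONE = {"a": "type_a", "enter": "confirm"}

# One large virtual button mapped onto several physical keys.
MANY_TO_ONE = {key: "big_red_button" for key in ("f1", "f2", "f3", "f4")}

# Each key mapped to a coordinate in a normalized 2D input space over the keyboard surface.
KEY_TO_XY = {"q": (0.05, 0.25), "p": (0.95, 0.25), "space": (0.5, 0.95)}

def handle_keypress(key):
    """Resolve a physical key press into a virtual action or 2D pointer event."""
    if key in MANY_TO_ONE:
        return ("action", MANY_TO_ONE[key])
    if key in ONE_TO_ONE:
        return ("action", ONE_TO_ONE[key])
    if key in KEY_TO_XY:
        return ("pointer", KEY_TO_XY[key])   # e.g., drive a cursor on a virtual panel
    return None
```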

Mouse-based pointing has been studied in depth outside of AR and VR for pointing on single monitors [315] as well as in multi-display environments [316,317,318]. However, it has been found that standard 2D mouse devices do not adapt well to multi-display interaction [319], an issue that is also relevant for AR and VR. Consequently, standard mice have been modified in various ways to add degrees of freedom. For example, Villar et al. [320] explored multiple form factors for multi-touch-enabled mice. Other researchers have added additional mouse sensors to support yawing [321, 322] or pressure sensors for discrete selection [323, 324], allowing for three instead of two degrees of freedom. Three-dimensional interaction was enabled by the Rockin'Mouse [325] and the VideoMouse [326]; both works added a dome below the device to facilitate 3D interaction. Steed and Slater [327] proposed to add a dome on top of the mouse rather than below. Further form factors have also been proposed to facilitate pointing-based interaction in 3D [328, 329]. Recently, researchers have also worked on unifying efficient input in both 2D and 3D [276, 330].

Standard mice with a scroll wheel can also be efficiently used for 3D object selection when combined with gaze tracking in virtual multi-display environments [228]. For example, in the Windows Mixed Reality Toolkit [331], the x and y movements of the mouse can be mapped to the x and y movements on a proxy shape, such as a cylinder (or any object on that cylinder, like a window). The scroll wheel is used for changing the pointer depth (in discrete steps). The x and y movements can be limited to the current field of view of the user to allow for acceptable control-display ratios. The user's gaze can then be used to change the view on different regions of the proxy shape.
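
The following sketch illustrates such a mapping of 2D mouse input onto a cylindrical proxy surface around the user, with the scroll wheel stepping the pointer depth. The gains, minimum radius, and coordinate convention are assumptions for illustration, not the toolkit's actual implementation.

```python
# A minimal sketch of mapping mouse deltas onto a cylindrical proxy shape;
# gains and coordinate convention are assumptions.
import math

class CylindricalPointer:
    def __init__(self, radius=2.0, angular_gain=0.002, height_gain=0.002, depth_step=0.25):
        self.theta = 0.0        # azimuth on the cylinder (radians)
        self.height = 0.0       # height on the cylinder (meters)
        self.radius = radius    # cylinder radius = pointer depth (meters)
        self.angular_gain = angular_gain
        self.height_gain = height_gain
        self.depth_step = depth_step

    def on_mouse_move(self, dx, dy):
        # Horizontal motion rotates around the user; vertical motion moves up/down.
        self.theta += dx * self.angular_gain
        self.height -= dy * self.height_gain

    def on_scroll(self, ticks):
        # Scroll wheel changes pointer depth in discrete steps.
        self.radius = max(0.5, self.radius + ticks * self.depth_step)

    def position(self):
        """3D pointer position in a user-centered frame (x right, y up, z forward)."""
        return (self.radius * math.sin(self.theta),
                self.height,
                self.radius * math.cos(self.theta))
```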

10 Virtual Agents

Virtual agents can be considered “intelligent” software programs performing tasks on behalf of users based on questions or commands. While it can be argued what “intelligent” really means in this context, a widely accepted characteristic of this “intelligence” is context-aware behavior [332, 333]. This allows an agent to interact with the user and the environment through sensing and acting in an independent and dynamic way. The behavior is typically well defined and allows actions to be triggered based on a set of conditions [334].

The rise of voice assistants (or conversational agents) [335], which interact with users through natural language, has brought media attention and widespread adoption in various areas, such as home automation, in-car operation, call center automation, education, and training [336].

In AR and VR, virtual agents often use more than a single modality for input and output. Complementing voice input and output, virtual agents in AR and VR can typically react to body gestures, postures, or even facial expressions of the users. Due to their graphical representations, those agents are embodied in the virtual world. The level of embodiment of a virtual agent has been studied for decades [337, 338]. For example, it has been shown that the effect of adding a face was larger than the effect of visual realism (both photo-realism and behavioral realism of the avatar). In VR, the level of visual realism of the virtual agent is typically matched to the visual realism of the environment. In contrast, in AR, there is often a noticeable difference between the agent representation and the physical scene, and those effects are still underexplored [339]. Hantono et al. reviewed the use of virtual agents in AR in educational settings. Norouzi et al. provided a review of the convergence between AR and virtual agents [340].

Specifically for AR, Maes et al. [341] introduced a magic mirror AR system in which humans could interact with a virtual dog through both voice and gestures. Similarly, Cavazza et al. [342] allowed participants to interact with virtual agents in an interactive storytelling environment. MacIntyre et al. [343] used pre-recorded videos of physical actors to let users interact with them using OST HMDs. Anabuki et al. [344] highlighted that having virtual agents and users share the same physical environment is the most distinguishing aspect of virtual agents in AR. They introduced Welbo, an animated virtual agent that is aware of its physical environment and can avoid standing in the user's way. Barakony et al. [345] presented “AR Puppet”, a system that explored context-aware animated agents within AR and investigated aspects like visualization, appearance, and behavior. They investigated AR-specific aspects such as the ability of the agent to avoid physical obstacles or to interact with physical objects. Based on this initial research, the authors explored various applications [346, 347]. Chekhlov et al. [348] presented a system based on Simultaneous Localization and Mapping (SLAM) [349], in which the virtual agent had to move in a physical environment. Blum et al. [350] introduced an outdoor AR game which included virtual agents. Kotranza et al. [351, 352] used a tangible physical representation of a human that could be touched, along with a virtual visual representation, in a medical education context. They called this dual representation mixed reality humans and argued that affording touch between a human and a virtual agent enables interpersonal scenarios.

11 Summary and Outlook

This chapter served as an overview of a wide variety of interaction techniques for MR, covering device- and prop-based input, such as tangible interaction and pen and keyboard input, as well as human effector-based input, such as spatial gestures, gaze, or speech.

The historical development of the presented techniques was closely coupled to the available sensing capabilities. For example, in order to recognize props such as paddles [18], the props had to be large enough for their fiducials to be recognized by low-resolution cameras. With the advancement of computer-vision-based sensing, fiducials could become smaller, change their appearance to natural-looking images, or be omitted altogether (e.g., for hand and finger tracking). Further, the combination of more than one modality became possible with the increasing computational capabilities of MR systems.

In the future, we expect an ongoing trend of both minimizing the size and price of sensors and making those sensors ubiquitously available in dedicated computing devices, in everyday objects [353], and on [354] or even in the human body itself [355]. Hence, MR interaction techniques will play a central role in shaping the future of both pervasive computing [333] and augmenting humans with (potentially) superhuman capabilities (e.g., motor capabilities [356, 357] or cognitive and perceptual capabilities [358]). Besides the technological and interaction challenges along the way, the field of MR interaction will greatly benefit from considering both social and ethical implications when designing future interfaces.