1 Introduction

Human–computer interaction (HCI) has grown rapidly in recent years and plays a pivotal role in smart living [1], healthcare [2, 3], and virtual/augmented reality (VR/AR) [4, 5]. HCI techniques include graphical user interfaces, voice control, biometric recognition, and gesture recognition. The user interface (UI) is the typical bridge for information transmission between humans and computers. UIs based on touch are already widely used. However, touch interfaces inevitably involve physical contact, which increases the risk of cross-infection by bacteria and viruses [6]. Contactless HCI is desirable in remote communication and control systems for its advantages in hygiene, convenience, and safety [7, 8]. Various contactless HCI methods have been proposed, including the virtual keyboard.

Keyboards in modern computers have driven an ever-increasing volume of text-based communication. As computing technology expands beyond the desktop, academic and industrial researchers have sought efficient text input methods to replace the conventional desktop keyboard. Mobile text input has been a recurring topic at nearly every HCI conference since the 2000s [9,10,11,12]. Resistive touch screens used with a stylus were an early alternative to soft keyboards on mobile devices; finger-operated capacitive touch screens with soft keyboards then became the dominant text input method. With the progress of computing systems, numerous studies on keyboard interaction have emerged, including new interactive keyboard designs [13,14,15], advanced algorithms for gesture control systems [16], and evaluation criteria and design rules for improving user experience [17]. A virtual keyboard typically comprises two components: gesture recognition and interface display. Gesture recognition enables various forms of interaction [8, 18,19,20,21] by detecting and analyzing the movement of users’ hands. In recent years, a variety of movement-tracking devices have emerged that can be employed to implement gesture UIs, and the advancement of diverse display techniques offers an intuitive perceptual experience for the interface. However, most studies on virtual keyboard systems focus on two-dimensional (2D) interfaces, with limited exploration of three-dimensional (3D) virtual keyboard interaction. This constrains the interactive objects and scenarios, thereby limiting the user experience.

A UI featuring 3D display can provide users with an immersive perceptual experience. 3D display techniques include binocular vision display, volumetric display, light field display, and holographic display [22, 23]. Binocular vision display provides psychological and physiological cues, but its vergence–accommodation conflict may cause visual fatigue and dizziness [24,25,26,27,28]. Volumetric display provides all physiological cues but lacks psychological cues such as occlusion, shadow, and texture [29,30,31,32]. Light field display offers psychological, binocular-depth, and monocular-depth cues, yet its focus cues are incomplete [33,34,35,36,37,38]. In contrast, holographic display can in principle provide all depth cues and thus a more realistic sensation [39,40,41,42]. In its original form, holography encodes the wavefront on photosensitive materials through interference. With the development of computer technology, the wavefront can now be calculated numerically and 3D scenes reproduced with programmable spatial light modulators (SLMs), an approach known as computer-generated holography (CGH). Following advances in CGH algorithms [43,44,45,46], fast and high-quality 3D reconstruction can now be achieved, paving the way for AR/VR [23, 47, 48] as well as the Metaverse [49].

Applying CGH in HCI systems is an active research direction. Holographic 3D gesture UIs have been used to build an interactive color electroholography system [50], a video system for rotating and scaling specific virtual objects [51], a holographic projection system for drawing fingertip trajectories [52], and an aerial writing and erasing system [53]. Shimobaba et al. proposed an interactive color electroholography system that used a field-programmable gate array and a time-division switching method for color reconstruction [50]. Adhikarla et al. reported a design for 3D gesture interaction with a full-horizontal-parallax light field display [54]. Yamaguchi et al. used a light field display to establish a 3D user interface [55]; by detecting the light scattered from the user’s fingertips on the 3D floating image, they realized a 3D touch interface [56]. Sang et al. presented an interactive system with a mouse based on a floating light field display [57]. Yamada et al. demonstrated an interactive, full-color holographic video system that rotated and scaled holographic objects [51]. Sando et al. combined a 3D holographic display system with digital micromirror devices and rotating mirrors, projecting viewable 3D videos with mouse interaction [58]. Suzuki et al. proposed a real-time holographic projection system for drawing trajectories with fingertips [52]. Sánchez Salazar Chavarría et al. proposed a 3D user interface based on a holographic optical element that detects light scattered from the fingertips [59]. Nishitsuji et al. demonstrated a holographic display system for drawing basic handwritten content with a tablet and a touch pen [60]. Takenaka et al. built a holographic aerial writing system to draw and erase finger trajectories [53]. Sánchez Salazar Chavarría et al. further put forward a method to register the user’s position with the reconstructed 3D content without calibration [61]. These studies demonstrate the advantages of holographic contactless systems, including quick initiation, an intuitive visual experience, and an accurate interaction process. However, they frequently rely on handheld tools, wearable devices, or projection screens, which may cause inconvenience. Moreover, research on 3D gesture UIs for virtual keyboard interaction is still lacking.

Drawing inspiration from the advantages of CGH, we propose a 3D virtual keyboard system that combines gesture recognition and holographic display. A hand-tracking sensor collects the gestures and fingertip positions, and an SLM generates the 3D display patterns. The hand-tracking sensor and the SLM are operated synchronously with feedback control by a personal computer. We conducted user-interaction experiments to evaluate the system’s accuracy and response time. No wearable devices, handheld tools, or projection screens are required, thereby eliminating potential user inconvenience. The robust 3D virtual keyboard system is expected to serve as a solution for mobile text input and to contribute to 3D user interfaces in VR and AR.

2 Methodology

2.1 System configuration

The system setup is schematically shown in Fig. 1a. A 532 nm fiber-coupled laser is used as the coherent source for illumination. The emitted light from the single-mode fiber is collimated, properly polarized, and then modulated by the SLM. By uploading a pre-calculated modulation pattern to the SLM, a corresponding 3D image can be reconstructed at the target location. To facilitate convenient hand interaction, we introduce a magnification module with two lenses, L1 and L2, producing an enlarged 3D scene within the detection area of a hand-tracking sensor. The hand-tracking sensor and the SLM are synchronously controlled by a computer. A camera is used to observe and record the interaction from the viewpoint of the human eye.

Fig. 1
figure 1

a Schematic diagram of the proposed system. CL collimating lens; P polarizer; BS beam splitter; L1 and L2 lenses; M1 and M2 mirrors. b Flowchart of the interaction process, including holographic display module and gesture recognition module

Figure 1b shows the two core modules: the holographic display module and the gesture recognition module. In the holographic display module, holograms are pre-calculated and uploaded to the SLM to reconstruct the 3D object. The gesture recognition module uses the hand-tracking sensor to detect the hand structure and accurately measure the fingertip positions, which are then used to determine the corresponding interaction behavior. An updated hologram is uploaded to the SLM accordingly. By operating in this closed-loop manner, the system achieves real-time, dynamic 3D display with interactive capability.
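The closed loop can be summarized in pseudocode. The following Python sketch is illustrative only: get_fingertip_positions, detect_pressed_button, upload_to_slm, and the HOLOGRAMS lookup table are hypothetical placeholders standing in for the hand-tracking SDK, the interaction criteria of Sect. 2.2, and the SLM driver, not the authors’ actual code.

```python
import time

# Hypothetical placeholders: get_fingertip_positions() wraps the hand-tracking
# SDK, detect_pressed_button() applies the criteria of Sect. 2.2, HOLOGRAMS maps
# each keyboard state to its pre-calculated phase-only hologram, and
# upload_to_slm() drives the SLM.
def interaction_loop():
    state = "idle"                              # initial state: all buttons popped up
    upload_to_slm(HOLOGRAMS[state])
    while True:
        tips = get_fingertip_positions()        # five (x, y, z) fingertip coordinates
        pressed = detect_pressed_button(tips)   # None, or the name of the pressed button
        new_state = pressed or "idle"
        if new_state != state:                  # only re-upload when the state changes
            upload_to_slm(HOLOGRAMS[new_state])
            state = new_state
        time.sleep(1 / 200)                     # sensor frame rate: 200 fps
```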

The system combines holographic display and gesture recognition techniques, thereby enabling real-time contactless interaction with 3D virtual objects. The magnification module effectively relaxes the constraint imposed by the limited light-field extent of the SLM, and the computer coordinates the hand-tracking sensor and the SLM, providing an intuitive user experience. Together, these features contribute to the system’s capability and extensibility.

2.2 Gesture recognition module

3D hand gesture recognition has attracted increasing research attention in the field of HCI. It can be achieved through vision-based or sensor-based approaches [62]. Techniques for obtaining 3D spatial–temporal data generally involve stereo cameras, motion capture systems, and depth sensors [63, 64]. Stereo cameras are based on human binocular vision. Motion capture systems use wearable markers or motion-tracking techniques for position estimation. Depth sensors include time-of-flight (ToF) cameras and structured-light cameras; popular examples are ToF cameras, Intel RealSense [65], Kinect [66], and Leap Motion [67]. Unlike sensors that capture full-body depth, Leap Motion, which we choose in this work, focuses on hand tracking. It has an accuracy of 0.01 mm in detecting hands and fingers. Its tracking area is an inverted quadrangular pyramid, with a horizontal field of view of 140°, a vertical field of view of 120°, and a depth ranging from 10 to 80 cm, as shown in Fig. 2a. The device comprises two grayscale infrared cameras, four infrared LEDs, and a top filter layer that transmits only infrared light.

Fig. 2
figure 2

a The tracking area of Leap Motion is an inverted quadrangular pyramid, with a horizontal view of 140°, a vertical view of 120°, and a measuring depth between 10 cm and 80 cm. b The hand structure detected and modeled by Leap Motion. c Schematic of the interaction criteria. The “Delete” button is pressed when the fingertip enters the cubic area shown in green

The gesture recognition process involves data acquisition, pre-processing, segmentation, feature extraction, and classification. When a hand enters the detection area, it is automatically tracked and a series of data frames is acquired. The raw data are subsequently pre-processed by built-in software. Figure 2b illustrates the hand structure measured by Leap Motion, which includes information about the fingers, gestures, position, velocity, direction, and rotation. The thumb, index, middle, ring, and pinky fingers and the wrist are recognized, along with the distal, intermediate, proximal, and metacarpal bones of each finger. Additionally, Leap Motion can capture high-speed movements at 200 frames per second.

In this work, we consider a keyboard interaction scenario, where the positions of the fingertips serve as the input. Figure 2c illustrates the criteria for pressing the button “Delete”. The coordinate origin is defined as the center point on top of the Leap Motion. The positions of the five fingertips are denoted as \((x_{n} ,y_{n} ,z_{n} ),n = 1,2,3,4,5\). The foremost fingertip, determined by the minimum position along the z-axis, is considered as the button press candidate:

$$z_{k} = \min \left\{ {z_{n} |n = 1,2,3,4,5} \right\}.$$
(1)

The activation of the button “Delete” occurs when the fingertip enters a pre-defined cubic area, where the following conditions are satisfied:

$$|z_{k} - z_{(0)} | < d,z_{(0)} = 0,$$
(2)
$$|x_{k} - x_{(0)} | < l/2,$$
(3)
$$|y_{k} - y_{(0)} | < w/2,$$
(4)

where \((x_{(0)} ,y_{(0)} ,z_{(0)} )\) denotes the center position of the “Delete” button, \(l\) and \(w\) denote its length and width, respectively, and \(d\) is the depth threshold along the z-axis. The interaction criteria for the other buttons are defined analogously.
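As a minimal sketch (not the authors’ code), the criteria of Eqs. (1)–(4) translate directly into a few lines of Python; the depth threshold d and the button layout entry below are hypothetical values chosen for illustration.

```python
def detect_pressed_button(fingertips, buttons, d=7.5):
    """fingertips: five (x, y, z) tuples in mm, in the sensor's coordinate system.
    buttons: dict mapping a button name to (x0, y0, z0, l, w) in mm."""
    # Eq. (1): the foremost fingertip (minimum z) is the press candidate.
    xk, yk, zk = min(fingertips, key=lambda tip: tip[2])
    for name, (x0, y0, z0, l, w) in buttons.items():
        # Eqs. (2)-(4): the candidate must lie inside the button's cubic volume.
        if abs(zk - z0) < d and abs(xk - x0) < l / 2 and abs(yk - y0) < w / 2:
            return name
    return None

# Hypothetical layout entry for the magnified "Delete" button (10 x 20 mm, see Sect. 3.2).
buttons = {"Delete": (17.0, 108.0, 0.0, 10.0, 20.0)}
```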

2.3 Holographic display

In a contactless interaction system, it is desirable to create 3D scenes that give the user a convincing perceptual experience. Holographic display can provide the full set of depth cues, leading to a realistic sensation, and with the progress of CGH algorithms, high-speed and high-quality holographic display has been achieved. In this work, the 3D display module is therefore based on CGH. Dynamic holographic display can be realized with programmable wavefront modulation devices such as SLMs and digital micromirror devices (DMDs). In our prototype system, a phase-only liquid-crystal SLM (LC-SLM) is used for its high efficiency compared with amplitude-based devices.

Given a target 3D scene, we employ the layer-oriented angular-spectrum method to calculate the 2D phase-only hologram (POH) [44]. Specifically, the free-space propagation of the wavefield is calculated based on the angular spectrum model as

$$U(x,y;\ z) = F^{ - 1} \{ F\{ U(x,y;\ z_{0} )\} \times H(f_{x} ,f_{y} ;\ s,\lambda )\} ,$$
(5)
$$H(f_{x} ,\ f_{y} ;\ s,\lambda ) = \exp ({\text{j}}ks\sqrt {1 - (\lambda f_{x} )^{2} - (\lambda f_{y} )^{2} } ),$$
(6)

where \(U(x,y;z_{0} )\) and \(U(x,y;z)\) denote the original and propagated 2D wavefield distribution at axial locations \(z_{0}\) and \(z\), respectively. \(H(f_{x} ,f_{y} ;s,\lambda )\) is the transfer function. \(\lambda\) denotes the illumination wavelength and \(k = 2\pi /\lambda\) denotes the wave number. \(f_{x}\) and \(f_{y}\) are the spatial frequencies. \(s = z - z_{0}\) denotes the propagation distance. \(F\) and \(F^{ - 1}\) represent the Fourier and inverse Fourier transform, respectively.
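Eqs. (5) and (6) translate directly into a few lines of NumPy. The sketch below is illustrative rather than the authors’ implementation; evanescent frequency components are simply clipped.

```python
import numpy as np

def angular_spectrum_propagate(u0, wavelength, pitch, s):
    """Propagate a complex field u0 (2D array) over a distance s; lengths in metres."""
    ny, nx = u0.shape
    fx = np.fft.fftfreq(nx, d=pitch)                     # spatial frequencies along x
    fy = np.fft.fftfreq(ny, d=pitch)                     # spatial frequencies along y
    FX, FY = np.meshgrid(fx, fy)
    k = 2 * np.pi / wavelength
    arg = np.maximum(0.0, 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2)
    H = np.exp(1j * k * s * np.sqrt(arg))                # transfer function, Eq. (6)
    return np.fft.ifft2(np.fft.fft2(u0) * H)             # Eq. (5)
```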

The 3D scene is discretized into a series of 2D layers along the axial direction, and for each layer the amplitude distribution is extracted from the corresponding 2D slice. Due to the ill-posedness of POH calculation, the directly obtained phase pattern may suffer from limited display contrast and speckle noise. We therefore adopt the Gerchberg–Saxton (GS) algorithm [68] to improve the display quality, which proceeds as follows:

(1) Initialize the complex amplitude at the SLM plane as

$$U_{1} = A_{0} \exp ({\text{j}}\phi_{1} ),$$
(7)

where \(\phi_{1}\) is a random initial phase, and \(A_{0}\) is the amplitude at the SLM plane, determined by the source intensity.

(2) Obtain the complex amplitude distribution at the target plane through forward propagation:

$$U_{2} = F^{ - 1} \{ F\{ U_{1} \} \times H(s)\} = A_{2} \exp ({\text{j}}\phi_{2} ),$$
(8)

where \(s\) denotes the propagation distance.

(3) Preserve the phase \(\phi_{2}\) and replace the amplitude with the target amplitude \(A_{i}\):

$$U_{3} = A_{i} \exp ({\text{j}}\phi_{3} ), \quad \phi_{3} = \phi_{2} .$$
(9)

(4) Obtain the complex amplitude distribution at the SLM plane through back-propagation:

$$U_{4} = F^{ - 1} \{ F\{ U_{3} \} \times H( - s)\} = A_{4} \exp ({\text{j}}\phi_{4} ).$$
(10)

(5) Preserve the phase \(\phi_{4}\) and replace the amplitude with the SLM-plane amplitude \(A_{0}\):

$$U_{1} ^{\prime} = A_{0} \exp ({\text{j}}\phi_{1} ^{\prime}), \quad \phi_{1} ^{\prime} = \phi_{4} .$$
(11)

(6) Steps (2)–(5) proceed iteratively until the phase distribution \(\phi_{m} \left( {m = 1,2,3,...} \right)\) converges, and the refined POH is obtained.
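A compact sketch of this loop is given below. It reuses the angular_spectrum_propagate function sketched after Eq. (6) and is an illustration of the GS refinement for a single target layer, not the authors’ code.

```python
import numpy as np

def gs_refine(A_target, A0, wavelength, pitch, s, iterations=50):
    """Refine a phase-only hologram for one target layer at distance s from the SLM."""
    phi = 2 * np.pi * np.random.rand(*A_target.shape)                     # step (1): random phase
    u_slm = A0 * np.exp(1j * phi)
    for _ in range(iterations):
        u_img = angular_spectrum_propagate(u_slm, wavelength, pitch, s)    # step (2): forward
        u_img = A_target * np.exp(1j * np.angle(u_img))                    # step (3): target amplitude
        u_back = angular_spectrum_propagate(u_img, wavelength, pitch, -s)  # step (4): backward
        u_slm = A0 * np.exp(1j * np.angle(u_back))                         # step (5): SLM amplitude
    return np.angle(u_slm)                                                 # step (6): refined POH
```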

A flowchart of the iterative algorithm is illustrated in Fig. 3b. To accurately simulate the diffusion effect of real-world 3D objects, the layers are initialized with random phases. Through back-propagation, a complex-amplitude optical field is obtained at the SLM plane for each layer. By adding the back-propagated field distributions of all the layers and extracting the phase, a POH is obtained. The entire process is depicted in Fig. 3a. Compared with other CGH algorithms, the layer-oriented angular-spectrum method is favored for its high computational efficiency and its accurate prediction of the complete diffraction field.
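The layer-oriented superposition itself can be sketched in the same style. This again assumes the angular_spectrum_propagate helper sketched after Eq. (6) and is illustrative only, not the authors’ implementation.

```python
import numpy as np

def layer_oriented_poh(layers, distances, wavelength, pitch):
    """layers: list of 2D amplitude slices; distances: SLM-to-layer distances in metres."""
    field = np.zeros(layers[0].shape, dtype=complex)
    for A, z in zip(layers, distances):
        diffuser = np.exp(2j * np.pi * np.random.rand(*A.shape))  # random phase: diffuse surface
        field += angular_spectrum_propagate(A * diffuser, wavelength, pitch, -z)  # back-propagate
    return np.angle(field)                                        # extract phase -> POH
```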

Fig. 3
figure 3

a The layer-oriented angular-spectrum method. \(L_{1} ,L_{2} ,...\) denote the 2D slices of the 3D scene, and \(H_{1} ,H_{2} ,...\) denote the corresponding complex-amplitude optical fields calculated at the SLM plane. b Flowchart of the GS algorithm. The light source determines the SLM-plane amplitude, and the target pattern determines the target amplitude. The iteration continues until the phase distribution converges

3 Experiment

3.1 Experimental setup

We used a fiber-coupled 532 nm laser for illumination. A reflective phase-only LC-SLM (GAEA-2, Holoeye) with a resolution of 3840 × 2160 pixels, a pixel size of 3.74 μm, and a refresh rate of 60 Hz, was used for wavefront modulation. To magnify the displayed 3D target, we introduced two lenses, L1 and L2, with focal lengths of 100 mm and 300 mm respectively. A Canon EOS 70D camera was employed to observe and record the displayed scene. The entire process, including gesture data acquisition, processing, and SLM control, was implemented with Python. The user sequentially pressed all the buttons of the keyboard using the index finger. This procedure allowed for comprehensive data collection for subsequent analysis and evaluation. The experimental setup is shown in Fig. 4 and the experimental settings are shown in Table 1.

Fig. 4
figure 4

Experiment setup of HCI system. CL collimating lens; P polarizer; BS beam splitter; L1 and L2 lenses; M1 and M2 mirrors

Table 1 Experimental settings

3.2 Experimental results

Figure 5 shows the designed pattern, the depth map, the calculated POH, and the display patterns when the button “Delete” is pressed. The initial virtual keyboard formed by the SLM has a size of 10 × 10 mm. The buttons “1” to “9” and “Dot” are 2.5 × 2.5 mm, the buttons “Enter” and “Delete” are 2.5 × 5 mm, and the button “0” is 5 × 2.5 mm. The virtual keyboard plane is positioned 17.7 mm from the SLM plane and the pressed button 18 mm from the SLM plane, giving a depth separation of 0.3 mm between the two layers. After magnification, the virtual keyboard measures 40 × 40 mm and the separation between the two layers becomes 15 mm; the buttons “1” to “9” and “Dot” are then 10 × 10 mm, the buttons “Enter” and “Delete” 10 × 20 mm, and the button “0” 20 × 10 mm. The magnified keyboard is suitable for hand interaction, given that the typical diameter of a human finger is approximately 10 mm. Setting the camera’s focal distance at z = 0 mm (Layer 1) and z = − 15 mm (Layer 2) brings the virtual keyboard and the pressed button into focus, respectively, as Fig. 5d and e show. The dimensions of the virtual 3D keyboard are summarized in Table 2.
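With the numbers reported here, the two layers of Fig. 5 could be assembled as follows. This is a hedged sketch only: render_keyboard_plane and render_pressed_button_plane are hypothetical drawing helpers, and the decomposition of the scene into two amplitude slices is an assumption, while the distances and SLM pitch are taken from the text; layer_oriented_poh refers to the sketch in Sect. 2.3.

```python
wavelength = 532e-9    # illumination wavelength
pitch = 3.74e-6        # SLM pixel pitch (GAEA-2)
z_keyboard = 17.7e-3   # keyboard plane, 17.7 mm from the SLM
z_button = 18.0e-3     # pressed-button plane, 18 mm from the SLM

# Hypothetical helpers returning 2D amplitude slices of the designed pattern.
layer1 = render_keyboard_plane()         # keyboard without the pressed "Delete" key
layer2 = render_pressed_button_plane()   # the pressed "Delete" key alone
poh = layer_oriented_poh([layer1, layer2], [z_keyboard, z_button], wavelength, pitch)
```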

Fig. 5
figure 5

a Designed keyboard pattern and b the depth map. c POH of the keyboard. d and e are the display patterns observed at Layer 1 and Layer 2, respectively. In d the virtual keyboard is in focus, and in (e) the button “Delete” is in focus

Table 2 Size of the virtual 3D keyboard

To demonstrate the functionality of our system, we conducted a series of experiments, sequentially pressing all the buttons. Figure 6 shows the captured images with the keyboard plane (Layer 1) in focus. The buttons “Delete”, “0”, “1”, and “5” were successfully pressed, as indicated by the positions of the interacting fingertip shown below each image. It can easily be verified that these positions satisfy the criteria of Eqs. (2)–(4). The successful interactions validate the accuracy of the press determination, with the uploaded holograms reconstructing the corresponding virtual patterns as expected.

Fig. 6
figure 6

Press (a) button “Delete”, (b) button “0”, (c) button “1”, and (d) button “5”. The virtual keyboard is in focus, and the pressed button is out of focus

To evaluate the functional stability of the holographic 3D virtual keyboard system, a series of interactive experiments was conducted. The transition of the 3D virtual keyboard from its initial state, in which all buttons pop up, to a state in which a specific button is pressed is referred to as a “pressed response”. Whenever the keyboard generated a “pressed response”, the fingertip position collected by Leap Motion was recorded. The interactive experiments were carried out by randomly pressing the virtual buttons a sufficient number of times. We collected and analyzed the statistics of the “pressed responses” for the buttons “Delete”, “0”, “1”, and “5”. The number of responses was counted, as shown in Table 3: the response counts for buttons “Delete”, “0”, “1”, and “5” were 63, 442, 722, and 28, respectively. The total response count was 1255, and all responses were accurate. The fingertip positions were recorded and marked in a coordinate system, as shown in Fig. 7a–d, with the button areas indicated in the same coordinate system. It can be observed that the fingertip positions lie within the predefined cubic areas of the respective buttons. Point “A” at (16.97, 108.78, 1.19) in Fig. 7a, point “B” at (− 17.15, 83.78, 0.23) in Fig. 7b, point “C” at (− 17.83, 94.47, 0.91) in Fig. 7c, and point “D” at (5.31, 103.58, 0.56) in Fig. 7d correspond to the captured images in Fig. 6a–d. Figure 7e and f show all collected fingertip positions and the corresponding button areas.

Table 3 Statistical results in interaction experiment of pressing buttons “Delete”, “0”, “1”, and “5”
Fig. 7
figure 7

Statistical results of interaction experiment of pressing (a) button “Delete”, (b) button “0”, (c) button “1”, and (d) button “5”. (e) All collected data points and (f) the button areas where they are located. “A”, “B”, “C”, and “D” are the corresponding collected data points in Figs. 6(a), (b), (c), and (d)

For an interaction system, response time is an important attribute. The response time of the proposed HCI system depends on the time required for gesture data collection, data processing, hologram uploading, and SLM refreshing. The hand-tracking sensor has a frame rate of 200 frames per second, and the SLM has a refresh rate of 60 Hz; the data processing and hologram uploading executed on the personal computer take a negligible amount of time (on the order of \({10}^{-8}\) s of computation). The response time for a complete interaction is therefore about 0.02 s, which is perceived as real time by users. In this work, the holograms are pre-calculated to ensure real-time rendering, which places greater demands on system memory as the number of display patterns increases. This can be addressed with parallel computing on graphics processing units (GPUs) and deep-learning-based CGH algorithms [69, 70].
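As a rough back-of-the-envelope check, assuming one sensor frame, the (negligible) computation time, and one SLM refresh period simply add up,

$$t_{\text{response}} \approx \frac{1}{200\,\text{fps}} + \frac{1}{60\,\text{Hz}} \approx 5\,\text{ms} + 16.7\,\text{ms} \approx 0.02\,\text{s}.$$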

It should be noted that in the optical system, lens L2 serves as a field lens that widens the field of view, helping the camera capture the entire holographic image. The purpose of lens L2 and the camera is to illustrate the approach in the experimental demonstration; in practical applications this part may not be necessary, so the system complexity can be further reduced. Expanding the viewing angle of holographic 3D display remains a topic for further research; it can be addressed with a liquid crystal grating [71], time- or space-multiplexing techniques, metasurface devices, and holographic optical elements.

4 Conclusion

In this paper, we proposed a contactless 3D virtual keyboard system that combines gesture recognition and holographic display. Specifically, we integrated an SLM, which generates the holographic 3D display patterns, with a hand-tracking sensor, which detects hand gestures and fingertip positions. The hand-tracking sensor and the SLM are operated synchronously with feedback control by a personal computer. The user-interaction experiments demonstrate the system’s 3D display capability, stable performance, and real-time responsiveness. By making minor adjustments to the virtual display patterns and the interaction instructions, the proposed HCI system can accommodate a broad spectrum of applications, particularly contactless interaction scenarios such as those arising during the COVID-19 pandemic. Holographic near-eye display is a potential application of bidirectional holographic display systems with haptic interfaces. The integration of holographic 3D gesture UIs with VR, AR, and MR offers a promising avenue for the development of user interfaces and mobile devices.