
1 Introduction

Virtual reality devices, particularly VR glasses, are widely adopted in the production industry, including rail transportation, for applications such as virtual map navigation, maintenance guidance, intelligent inspection, emergency rescue, safety training, and education. Their use has become a trend with strong development prospects [1,2,3,4,5,6,7,8].

The rail industry requires richer interactive functions in virtual reality devices [3], which has led to the emergence of new human-machine interaction methods such as gesture recognition and facial recognition [9, 10]. Gesture recognition is becoming the mainstream approach [11], but its sensitivity must still improve to provide a better operating experience.

This paper proposes a VR-assisted system for emergency rescue in rail transportation, consisting of a multi-source information fusion system, emergency dispatch instruction system, and a VR-based human-machine interaction system. The human-machine interaction system utilizes gesture recognition for VR/AR scenarios, offering functions like virtual mouse, keyboard, and media control. It enhances user experience, frees the hands of on-site operators, and provides a versatile solution for emergency rescue in rail transportation.

2 Overview of the Virtual Reality-Assisted System for Emergency Rescue in Rail Transportation

The virtual reality-assisted system for emergency rescue in rail transportation consists of three major parts: a field situation awareness system based on multi-source information fusion, an emergency dispatch instruction receiving system, and a human-machine interaction system for virtual reality terminal devices. The structure of the system is shown in Fig. 1. This section provides a detailed explanation of the composition and functions of these three subsystems.

Fig. 1. The structure of the Rail Transit Emergency Rescue Virtual Reality Assisted System

2.1 Field Situation Awareness System Based on Multi-source Information Fusion

The field situation awareness system collects on-site video, audio, environmental footage, and positioning coordinates using sensors installed on the virtual reality devices. This information is transmitted wirelessly to the emergency command platform, which uses its computing power to understand and analyze the multi-modal information, enabling intelligent perception of the emergency situation on the scene.

2.2 Emergency Dispatch Instruction Receiving System

The system sends dispatch instructions to on-site personnel's virtual reality terminals, which receive and display the instructions. It includes a progress feedback function for operators to confirm completion status, and the terminals also send updates to the command platform for tracking.

2.3 Human-Machine Interaction System for Virtual Reality Terminal Devices

This paper explores the human-machine interaction system for virtual reality devices, built on the Mediapipe visual recognition module. It replaces physical controllers, offering a more immersive and natural experience through three functional modules: a virtual mouse, a virtual keyboard, and volume control.

3 Implementation Method of Human-Computer Interaction System for Virtual Reality Terminal Devices

The system comprises three modules: a virtual mouse, a virtual keyboard, and volume control, all based on Mediapipe gesture recognition. This section explains the architecture and working principle of each module.

3.1 Introduction to the Mediapipe Model

Mediapipe Hands is a lightweight gesture recognition module in the Mediapipe deep learning framework [12, 13]. It can infer the 3D coordinates of 21 hand keypoints in real time on mobile devices. The module uses a two-stage process of hand detection followed by hand keypoint regression, providing depth information and rendering the image with the keypoints marked. The resulting hand keypoint positions predicted by Mediapipe Hands are shown in Fig. 2.

Fig. 2. Display of hand keypoint positions

Fig. 3. Flowchart for simulating mouse function with hand gestures

Using the real-time hand keypoint coordinates output by Mediapipe Hands, the spatial position of the hand can be tracked dynamically. In a mixed reality scene, this enables a more natural gesture interaction experience based on the real-time position, orientation, and shape of the hand.
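As an illustrative sketch (not the authors' implementation), the 21 hand keypoints can be obtained from a webcam frame with the Mediapipe Hands Python API roughly as follows; the camera index and confidence threshold are assumed values:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # assumed camera index
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Mediapipe expects RGB input; OpenCV captures BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # Each of the 21 landmarks has normalized x, y and a relative depth z.
                h, w, _ = frame.shape
                tip = hand.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
                print(int(tip.x * w), int(tip.y * h), tip.z)
                mp_draw.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
        cv2.imshow("hands", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
            break
cap.release()
cv2.destroyAllWindows()
```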

3.2 Gesture Recognition for Implementing Virtual Mouse Function

Design Scheme. To replace the physical mouse, hand gesture recognition analyzes the camera image to obtain the fingertip coordinates and maps them to screen coordinates that control the mouse pointer.

$$ \left\{ \begin{gathered} x_2 = 0 + wSlope \ast (x_1 - 0) \hfill \\ y_2 = 0 + hSlope \ast (y_1 - 0) \hfill \\ wSlope = \frac{wScr - 0}{{wCam - 0}} \hfill \\ hSlope = \frac{hScr - 0}{{hCam - 0}} \hfill \\ \end{gathered} \right. $$
(1)

The mathematical formula for one-dimensional linear interpolation is shown in Eq. (1). In this equation, (x2, y2) denote the screen coordinates of the mouse pointer, while (x1, y1) denote the coordinates of the index fingertip keypoint captured by the camera. wScr and hScr are the width and height of the screen; wCam and hCam are the width and height of the image captured by the camera. One-dimensional linear interpolation is performed with NumPy's interp function. Taking the first equation as an example, the function interpolates between two arrays: given a coordinate x1 to be mapped, the range of input values is represented by the array [0, 1, …, wCam], while the corresponding range of mapped values is contained in the array [0, 1, …, wScr].
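A minimal sketch of this mapping using NumPy's interp function; the camera and screen resolutions used here (640 × 480 and 1920 × 1080) are assumptions for illustration:

```python
import numpy as np

# Assumed resolutions for illustration only.
wCam, hCam = 640, 480
wScr, hScr = 1920, 1080

def camera_to_screen(x1, y1):
    """Map a fingertip coordinate from camera space to screen space (Eq. 1)."""
    x2 = np.interp(x1, [0, wCam], [0, wScr])  # 1-D linear interpolation on x
    y2 = np.interp(y1, [0, hCam], [0, hScr])  # 1-D linear interpolation on y
    return x2, y2

print(camera_to_screen(320, 240))  # -> (960.0, 540.0), the screen centre
```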

The autopy module uses the mapped fingertip coordinates to control the mouse pointer on the screen, simulating mouse movement and triggering clicks based on the distance between the fingertips. Figure 3 shows the flowchart for simulating mouse movement with gestures.
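A hedged sketch of the pointer control step with autopy, assuming the screen coordinates have already been mapped as above; the 40-pixel click threshold and the helper names are illustrative, not values from the paper:

```python
import math
import autopy

def move_pointer(x2, y2):
    """Move the OS mouse pointer to an already-mapped screen coordinate."""
    autopy.mouse.move(x2, y2)

def maybe_click(index_tip, middle_tip, threshold=40):
    """Simulate a left click when the index and middle fingertips come close.

    index_tip / middle_tip are (x, y) pixel coordinates; the 40-pixel
    threshold is an assumed value for illustration.
    """
    if math.hypot(index_tip[0] - middle_tip[0],
                  index_tip[1] - middle_tip[1]) < threshold:
        autopy.mouse.click()
```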

Optimization of Pointer Jitter. To reduce pointer jitter caused by finger tremor and gesture recognition errors in Mediapipe, we optimized the code to smooth the mouse movement using the difference between the current and previous coordinates, which improves the user experience. The formula is as follows:

$$ \begin{gathered} clocX = plocX + (x_2 - plocX)/s \hfill \\ clocY = plocY + (y_2 - plocY)/s \hfill \\ \end{gathered} $$
(2)

This formula uses several variables. clocX and clocY represent the optimized horizontal and vertical positions of the mouse pointer on the screen in the current frame, while plocX and plocY represent the pointer's position in the previous frame. x2 and y2 denote the pointer's horizontal and vertical screen coordinates in the current frame before optimization. The smoothing factor s controls the trade-off between smoothing and delay: a higher value increases smoothing but also introduces more delay.
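A short sketch of the smoothing step in Eq. (2); the smoothing factor s = 5 is an assumed value for illustration:

```python
# Previous pointer position, kept between frames.
plocX, plocY = 0.0, 0.0
s = 5  # smoothing factor: larger values give smoother but laggier motion

def smooth(x2, y2):
    """Blend the newly mapped position with the previous one (Eq. 2)."""
    global plocX, plocY
    clocX = plocX + (x2 - plocX) / s
    clocY = plocY + (y2 - plocY) / s
    plocX, plocY = clocX, clocY
    return clocX, clocY
```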

Implementation Results. Enabling the feature displays a camera monitoring window with a green box marking the active finger movement area. The index finger controls the interface pointer, and a click is triggered by bringing the index and middle fingers together until the marker point between them changes color (Fig. 4).

Fig. 4. Image demonstrating virtual mouse controlled by hand gestures

3.3 Virtual Keyboard Input Module

Design Scheme. After capturing the video, the OpenCV preprocessing module converts the image to the RGB color space and applies horizontal mirroring. The image is then passed to the Mediapipe module for hand recognition, which returns the hand keypoint coordinates and an annotated image. The Finger Tip Matching module matches the fingertip coordinates with the virtual keyboard buttons and highlights the corresponding button. The Thumb Tip Distance Judgment module measures the distance between the thumb tip and the middle finger and triggers a click if it falls below a set threshold. The letter corresponding to the button is then sent to the character input module for simulated keyboard input (Fig. 5).

Fig. 5. Flowchart for controlling virtual keyboard input with hand gestures
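A simplified sketch of the fingertip-to-button matching and click test described above; the button layout, the 40-pixel distance threshold, and the use of the pynput package for simulated keystrokes are assumptions, since the paper does not name its input library:

```python
import math
from pynput.keyboard import Controller

keyboard = Controller()

# Illustrative 3-row QWERTY layout: each button is (letter, x, y, w, h) in pixels.
BUTTON_SIZE = 60
KEYS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]
buttons = [(ch, 70 * col + 40, 70 * row + 40, BUTTON_SIZE, BUTTON_SIZE)
           for row, line in enumerate(KEYS)
           for col, ch in enumerate(line)]

def hovered_button(index_tip):
    """Return the letter whose button rectangle contains the index fingertip, if any."""
    ix, iy = index_tip
    for ch, x, y, w, h in buttons:
        if x <= ix <= x + w and y <= iy <= y + h:
            return ch
    return None

def maybe_type(index_tip, thumb_tip, middle_joint, threshold=40):
    """Type the hovered letter when the thumb tip approaches the middle-finger joint."""
    ch = hovered_button(index_tip)
    if ch is None:
        return
    if math.hypot(thumb_tip[0] - middle_joint[0],
                  thumb_tip[1] - middle_joint[1]) < threshold:
        keyboard.press(ch)
        keyboard.release(ch)
```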

Optimization of Clicking Problem. To address the issue of a single finger click triggering multiple clicks, the code was optimized to check the time interval between consecutive clicks. The specific implementation is as follows:

$$ cTime - pTime > t $$
(3)

The formula uses variables related to click timing: cTime denotes the time of the current click, pTime denotes the time of the previous click, and t denotes the minimum interval required to trigger a new click. The click instruction runs only when the interval between two clicks is greater than t. If the interval is less than t, the events are treated as a single click; when it is greater than t, they are treated as two separate click operations.
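A minimal sketch of the debounce check in Eq. (3), assuming an interval t of 0.5 s for illustration:

```python
import time

pTime = 0.0   # time of the previous accepted click
t = 0.5       # assumed minimum interval between clicks, in seconds

def accept_click():
    """Return True only if enough time has passed since the previous click (Eq. 3)."""
    global pTime
    cTime = time.time()
    if cTime - pTime > t:
        pTime = cTime
        return True
    return False
```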

Experimental Results. Figure 6 demonstrates the gesture-controlled virtual keyboard program: users select a button by touching it with the index finger, and a click is triggered by briefly touching the middle finger's second joint with the thumb, allowing text to be typed into a system text document through the virtual keyboard.

Fig. 6. Image demonstrating virtual keyboard input controlled by hand gestures

3.4 Gesture Recognition for Media Volume Control

Experimental Plan. Mediapipe recognizes the hand keypoints, the distance between the thumb tip and index fingertip is calculated, and the volume control module adjusts the system volume according to this distance. OpenCV renders the annotated image on the screen, achieving real-time video at 25 frames per second. Figure 7 shows the flowchart for gesture-controlled media volume control.

Fig. 7. Flowchart for controlling media volume with hand gestures
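The paper does not name the volume-control library; on Windows, pycaw is one common choice, so the following sketch maps the thumb-index fingertip distance to the system master volume under that assumption (the pixel distance range [30, 250] is also assumed):

```python
import math
import numpy as np
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

# Get the Windows master-volume endpoint via pycaw.
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))

def set_volume_from_gesture(thumb_tip, index_tip):
    """Map the thumb-index fingertip distance (pixels) to a volume percentage."""
    dist = math.hypot(thumb_tip[0] - index_tip[0], thumb_tip[1] - index_tip[1])
    level = np.interp(dist, [30, 250], [0.0, 1.0])  # assumed pixel range
    volume.SetMasterVolumeLevelScalar(float(level), None)
    return level * 100
```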

Experimental Results. Figure 8 shows the actual effect of the gesture-controlled media volume program. When a hand is detected, the video window labels the thumb tip and index fingertip and connects them with a line. The left side shows the volume percentage, and the bottom right corner shows the system media volume. Changing the distance between the thumb tip and index fingertip adjusts the volume synchronously, achieving gesture-controlled media volume control.

Fig. 8. Image demonstrating media volume controlled by hand gestures

3.5 Test Platform Hardware and Software Composition

Hardware Composition. One computer (Windows 10, AMD Ryzen 9 5950X processor, NVIDIA GeForce RTX 2080 Ti dedicated graphics card with 11 GB of video memory) and a pair of PICO 4 VR glasses.

Software Composition. PyCharm Community Edition development environment, Python 3.8, PICO Unity Integration SDK.

4 Conclusion

This paper proposes a virtual reality-assisted interaction system for emergency rescue in rail transit, comprising a field situation awareness system, an emergency dispatch instruction receiving system, and a human-machine interaction system for VR terminal devices. The system creates an AR collaboration space for on-site personnel and experts. The VR terminal's human-computer interaction system uses hand recognition to provide a virtual mouse, a virtual keyboard, and gesture-controlled volume. This study demonstrates the feasibility of gesture recognition for VR input control and expands its application in rail transit.