1 Introduction

Robotic avatar systems enable secure and seamless operator control of a robot from a considerable distance while providing a lifelike perception of the robot’s surrounding environment and status. Such robots can therefore replace humans in dangerous places, such as space or underwater environments, and they are also expected to provide everyday assistance and support to individuals with disabilities [1,2,3]. Within the field of avatar robot research, the effective creation of tele-existence, the experience of being in another location, is investigated through a combination of telepresence and the corresponding teleoperation [4]. In addition, the DARPA Robotics Challenge demonstrated the importance and feasibility of robots replacing humans in hazardous environments through remote teleoperation [5,6,7].

Teleoperation, which has been extensively studied in the field of robotics, finds application in various domains such as industrial, surgical, space, and underwater contexts [8,9,10,11,12,13,14,15]. The objective of teleoperation research is to develop intuitive interfaces for operators. The joystick, a device that operators can easily familiarize themselves with, has been used in teleoperation research for a long time [16, 17]. However, the joystick interface faces challenges in directly conveying factors such as contact between the robot and the surrounding environment or the reactive forces occurring when the robot manipulates objects during tasks. To address this limitation, haptic controllers with feedback capabilities have been adopted as an alternative interface [18, 19]. Haptic controllers enable force feedback, allowing the operator to perceive a counterforce when the robot manipulates an object, thereby improving teleoperation performance.

Although joystick controllers or haptic controllers are intuitive interfaces, they have fewer degrees of freedom (DoF) than required to control humanoid robots. Therefore, to teleoperate humanoid robots with multiple DoFs, such as TOCABI [20], more intuitively, a different form of interface is desired. Current research explores how operators control robots by moving their entire bodies, a departure from earlier techniques that relied solely on hand-operated joysticks and haptic controllers. Motion capture methods involve attaching devices to the operator’s body to track their movement and then employing the measured movements to teleoperate the robot. One method of measuring human motion involves attaching markers to the body and recording their positions through cameras [21,22,23]. Another approach is to attach multiple Inertial Measurement Unit (IMU) sensors to the operator’s body to estimate the position and orientation of the operator’s limbs [24,25,26,27]. These methods utilize the positions of markers attached to a person’s body to calculate retargeted robot joint angles through inverse kinematics. As a result, drawbacks, such as tracking errors or pose discrepancies, may arise due to the different DoFs and differences in body lengths between the human and the robot. Furthermore, the system complexity increases as numerous sensors need to be placed on each limb of the human body [28].

Methods using an exoskeleton-type interface have been investigated to directly measure the angles of the operator’s joints [29,30,31] and to teleoperate the robot without relying on inverse kinematics (IK). An advantage of the exoskeleton interface is the allowance for force feedback at the contact points between the operator’s body and the exoskeleton [30, 32]. However, exoskeletons come with the drawback of increased inertia in the interface device due to the addition of actuators for force feedback.

In the recent gaming market, there have been significant advancements in virtual reality (VR) controller interfaces, allowing users to experience feedback through the vibration of the VR controller. In research, VR controllers have been used as teleoperation interfaces [33,34,35,36]. VR controllers offer significant convenience compared to motion capture or exoskeleton devices, as they only require the operator to grasp them, eliminating the time needed to equip motion capture interfaces. However, the information obtained through VR controllers is limited to the position and rotation of the operator’s hands, leading to pose discrepancies between robots and human users. Commercial motion trackers,Footnote 1 in comparison to traditional motion capture interfaces, offer convenience in terms of wearability [37,38,39,40,41,42]. They allow for a relatively precise matching of the operator’s posture and the robot’s. However, the drawback is that these motion trackers cannot provide force feedback to the operator, and problems can occur when the motion tracking camera is accidentally blocked, preventing it from capturing the state of the operator [43].

Another commercial product, Head-Mounted Displays (HMD) for virtual reality (VR), have been researched to provide users with a realistic display for a more immersive sensation, allowing them to experience telepresence while remotely controlling the robot [44,45,46,47,48].

To accelerate the development of robotic avatars, the ANA Avatar XPRIZE competition was established [49]. Unlike other robot competitions that typically involve developers operating their robots to complete predefined missions [50], the ANA Avatar XPRIZE Finals and Semifinals took a different approach. In this competition, randomly assigned judges used and evaluated each team’s systems. Within an hour, each team had to complete the setup of the avatar system in the operator room, where robots cannot be directly seen and where the judge controls the robot remotely. In addition, each team had to provide instructions to the judge on how to use the system. The evaluation of teleoperation technology in this competition was based on the successful completion of missions and the subjective scores given by two judges. Therefore, the system must provide an intuitive and user-friendly interface, even for non-experts. In the semifinals, most of the robots were powered through a tethered connection and supported by a crane while performing tasks. However, the finals required the robots to move around freely [51], meaning they needed to be untethered and equipped with mobile robotic systems capable of providing various sensory feedback.

Most finalists prepared robots that mimicked the operator’s actions using a teleoperation interface with haptic feedback capability [52,53,54,55]. Furthermore, most robots incorporated wheels for efficient long-distance movement (discussed in Sect. 6.4). Our team, Team SNU, also used a mobile base to enhance mobility, meeting the Finals requirements while preventing the robot from falling over. In the semifinals, our system featured a tracking marker-based interface that mimicked the operator’s actions, facilitating the robot’s movement. However, it had limitations, such as the absence of haptic feedback for the operator [38]. To address these limitations, we developed a haptic device and integrated it into our marker-based Semifinals interface. The updated system provides force feedback directly to the operator’s wrist. Additionally, we developed gloves that provide tactile and kinesthetic feedback, along with robot hands capable of manipulating the tools required for the Finals.

This paper outlines the robotic avatar system developed by Team SNU for the ANA Avatar XPRIZE Finals. The remainder of this paper is organized as follows. Section 2 provides an overview of the robotic avatar system, consisting of an operator station and an Avatar robot. Sections 3 and 4 provide detailed explanations of the proposed teleoperation and telepresence systems, respectively. In Sect. 5, we present user studies for evaluating our avatar system’s performance. Section 6 describes the missions and results of the ANA Avatar XPRIZE, offering insights gained from the competition, and Sect. 7 concludes the article.

Fig. 1

Illustration of the robotic avatar system structure, which consists of an Avatar robot and an operator station. The Avatar robot consists of TOCABI, two head cameras, one speaker and microphone, one wrist camera, two robot hands, and one mobile base. The operator station includes a head-mounted display (HMD) (VIVE Pro2), a haptic feedback device, four trackers (VIVE Tracker3), a pair of haptic gloves, and foot pedals. The haptic feedback devices are connected to the haptic gloves worn by the operator. Two VIVE trackers are attached to each upper arm, one to the back of the operator and one to the chair. The operator wears the HMD with feet placed on the pedal. (Color figure online)

2 Proposed Robotic Avatar System

In this section, the robotic avatar system of Team SNU is introduced. Figure 1 shows the structure of our robotic avatar system, and a concise summary of its specifications is presented in Table 1. Our robotic avatar system is described in terms of both hardware and software.

Table 1 Specifications of the robotic avatar system

2.1 Hardware Structure of Robotic Avatar System

We made several modifications to our system between the competition’s semifinals [38] and finals. These changes included the integration of robot hands and a mobile base for the Avatar robot, as well as the incorporation of haptic feedback devices and haptic gloves for the operator station, as shown in Fig. 1. The operator station refers to the equipment set up for remotely controlling the robot. The following sections provide detailed explanations of the operator station and the Avatar robot.

2.1.1 Avatar Robot

Our Avatar robot comprises a humanoid robot equipped with two head cameras, one speaker and microphone, one wrist camera, two robot hands, and a mobile base (Fig. 1).

  • TOCABI: The humanoid robot TOCABI (TOrque Controlled compliAnt BIped) was used for the Avatar robot [20]. Its height is 1.8 m, and its weight is 100 kg. TOCABI has 8 DoF in each arm, 6 DoF in each leg, 3 DoF in the waist, and 2 DoF in the neck. The payload of each arm is approximately 5 kg without the robot hand and approximately 3 kg with the robot hand. Various sensors were integrated to capture and transmit environmental information. Two head cameras, a speaker, and a microphone were mounted on the head to convey environmental information and to facilitate communication between the operator and individuals near the robot [56]. The head cameras are See3CAM_24CUG_CHL cameras from e-con Systems, and a Jabra Speak 410 was employed for integrated speaker and microphone functionality. Each wrist of the robot featured a 6-axis force/torque (F/T) sensor from ATI, used to measure the weight of objects held by the robot. An Intel RealSense Depth Camera D435 was attached to the left wrist. This wrist camera played a crucial role in identifying the surface characteristics of objects, particularly in conjunction with the force sensors on the fingers of the robot hand (Sect. 4.4). We used the wrist camera to capture images directly to train the algorithm for surface detection and classification.

  • Robot Hands: Figure 2 shows our finger module with two DoF and the robot hand with eight DoF [57]. The robot hand consisted of four modular fingers, as shown in Fig. 2c. Considering the velocity and torque of the joints, two types of DYNAMIXEL actuators were used for the fingers: the XC330-M288-T for the adduction/abduction (A/A) joints and the XC330-M181-T for the flexion/extension (F/E) joints. The XC330-M288-T was chosen for robust movement of the A/A joint, whereas the F/E joint, responsible for gripping and pushing objects, needed to move faster than the A/A joint and therefore used the XC330-M181-T. The three F/E joints are coupled by an internal 4-bar linkage. Compared to the three-fingered hand used in the semifinals [38, 58], the robot hand for the Finals featured four fingers. The final competition required a robot hand capable of grasping various objects, such as a switch bar, a canister, a drill, and a stone (Sect. 6.1). When handling a drill, one finger must be dedicated to pressing its trigger button for activation, making it difficult for a hand with only two remaining fingers to grasp the drill securely. Therefore, the robot hand for the Finals was developed with four finger modules. A grasp taxonomy evaluation confirmed the robot hand’s capability to perform 15 out of 16 motions [57]. To improve grasp stability, the fingertips of the right hand were covered with silicone, and 3D force sensors (OptoForce) were attached to the fingertips of the left hand (Fig. 2c). Additionally, the palm of the robot hand was designed to mimic the shape of a human palm to increase the contact area during object grasping. The maximum force of a finger module was experimentally measured as 4.92 N.

  • Mobile Base: TOCABI was mounted on a mobile base with mecanum wheels, which was used for navigation during the Finals missions; in the semifinals, bipedal walking was used instead. The mobile base included a chair that provided a seat for TOCABI (Fig. 3). Its CTM300-7R mecanum wheels allowed the robot to move in any direction, while a mini PC (NUC11TNK) controlled its operations. The mobile base was powered by a 7S6P lithium-ion battery pack built from 21700 cells. The mobile base’s dimensions were 0.66 m in width (0.44 m without wheels), 0.68 m in length, and 0.58 m in height, including the wheels, and it weighed approximately 50 kg. To improve the visibility of the robot’s surroundings, lights were attached to both sides under the chair.

Fig. 2

Illustration of the finger module and robot hand. a Each finger module has two actuators. Blue and red arrows represent the axes of the actuators. The red arrow denotes the flexion/extension (F/E) joint axis, while the blue arrow indicates the adduction/abduction (A/A) joint axis. b The schematic figure illustrates the three links moved by the F/E actuator. The red link corresponds to the metacarpophalangeal link, the green link to the proximal link, and the blue link to the distal link. c Each robot hand consists of four finger modules. Silicone fingertips are attached to the right hand, and force sensors are attached to the fingertips of the left hand. (Color figure online)

Fig. 3

Snapshots of the mobile base [59]. The mobile base includes the chair where TOCABI is sitting. a Isometric view of the mobile base with TOCABI seated. The mobile base includes four mecanum wheels, one battery (21700 cells, 7S6P), and a mini PC (NUC11TNK). b The front view provides a detailed view of the robot’s chair, comprising two bars and a section designed for the robot’s hip. c The top view reveals the bottom box area of the mobile base, which serves as the placement space for TOCABI’s feet. d The side view shows buttons for turning on the mobile base and its mini PC, as well as a display that indicates the base’s battery status. (Color figure online)

2.1.2 Operator Station

The operator station enables the operator to teleoperate the robot and provides the robot’s sensory information to the operator. Figure 1 shows the entire operator station and its components. We used three commercial products (VIVE Pro2, VIVE Tracker3, and rudder pedals) and two devices manufactured in-house (the haptic feedback devices and haptic gloves).

Fig. 4

Snapshots of haptic feedback devices and gloves for teleoperation. The \(\textit{Glove-Haptic Device Junction}\) connects the haptic feedback devices and the gloves. A finger strap is attached to the end of each finger. A vibration motor is on the index finger of the left hand. The BOA fit system is installed on each palm. (Color figure online)

  • VIVE Pro2: The HTC VIVE Pro2 is used as the HMD that delivers visual and auditory information to the operator (Sect. 4.1). The HMD provides the operator with visual information for telepresence and displays the interface needed to operate the Avatar robot. Furthermore, TOCABI’s head moves in response to the operator’s head movements sensed through the HMD.

  • VIVE Tracker3: Four HTC VIVE Tracker 3 units are used to measure the upper body motion of the operator (Sect. 3.1). One tracker is attached to each upper arm, one to the back of the operator, and the other to the back of the chair to serve as a reference frame (Fig. 1). The position and orientation of the trackers are measured optically through external base stations.

  • Haptic Feedback Devices: Our in-house haptic feedback devices accurately measure the position and orientation of the operator’s wrists while also providing force feedback to the operator (Sect. 4.2). The devices are displayed in Fig. 4. Each device is 1.8 m tall and 0.5 m wide and consists of six joints, with the first joint driven by a prismatic actuator and the remaining joints by revolute actuators. The workspace of the haptic feedback device is a cylinder with a 0.9 m radius centered on the J1 axis. The height of the cylinder is 1.8 m, ensuring that the device’s workspace adequately covers most of the human operator’s workspace. A wrist connector between the haptic glove and the feedback device is located next to the sixth axis.

  • Haptic Gloves: The gloves were developed to measure the movement of the operator’s fingers (Sect. 3.2) and to deliver tactile and kinesthetic feedback to the operator (Sects. 4.3 and 4.4): tactile feedback conveys the roughness of an object, and kinesthetic feedback conveys whether the robot has grasped an object. The gloves can exert a maximum force of 1.4 N on the operator’s fingers (Sect. 4.3). The strap of each finger is connected to the middle phalanx of the operator’s finger. For ease of use, the BOA fit system is installed in the palm of each glove. The vibration actuator on the left index finger is placed on the operator’s fingertip and allows the operator to perceive the roughness of an object through vibration (Sect. 4.4). Each glove finger measures the joint positions of the F/E and A/A joints of the operator.

  • Pedal: T.Flight rudder pedals from Thrustmaster are used as the controller of the mobile base (Sect. 3.3). The operator can use the pedals with their feet while seated. The pedals command the mobile base to Drive, Rotate, Move Left, Move Right, and Reverse. Switches attached to both sides of the pedals are used to change the driving mode of the mobile base or to control the VR interface.

Fig. 5

Software system structure of our proposed system. The wireless connection between the operator station and the Avatar robot uses TCP/IP and ROS. Each arrow indicates where the data comes from and where it goes to. The operator station includes the Operator PC and Haptic PC. The HMD, VIVE trackers, haptic gloves, and pedals are connected to the Operator PC. The haptic feedback devices are connected to the Haptic PC. Three PCs of the Avatar robot are wirelessly connected to the Processing PC: the TOCABI PC, Recognition PC, and Mobile PC. The head cameras, speaker & microphone, and robot hands are connected to the Processing PC. TOCABI is connected to the TOCABI PC. The wrist camera is connected to the Recognition PC. The mobile base is connected to the Mobile PC. (Color figure online)

Table 2 Specifications of the robotic avatar system computers

2.2 Software Structure of the Avatar Robot System

Our software structure is illustrated in Fig. 5. The operator station comprises two computers, while the Avatar robot has four. Table 2 provides an overview of each computer’s specifications. The operator station and Avatar robot are connected through a Wi-Fi network. The HMD data is transmitted via TCP/IP, while the robot data is transmitted via the Robot Operating System (ROS) over TCP/IP. The TOCABI PC runs on Ubuntu 20.04 [60], and the Operator PC uses Windows to run Unity3D. To transfer data between the Windows and Ubuntu systems through ROS messages, we used Win-Ros.

The Operator PC is connected to various devices such as the HMD, VIVE trackers, haptic gloves, and pedals. To facilitate real-time voice communication with low latency, we use the open-source tool Mumble. Mumble is installed on both the Operator PC and the Processing PC to transmit sound between the people around the Avatar robot and the operator. The head cameras on the Avatar robot capture the surrounding environment with a resolution of 1920 \(\times \) 1200 and a field of view of 104.6\(^{\circ }\) horizontally and 61.6\(^{\circ }\) vertically. The video manager on the Processing PC transmits video to the Operator PC at a rate of up to 114 Hz and 100 Mbps. On the Operator PC, Unity3D with OpenVR receives the video data through TCP/IP and adjusts the 2D video image to fit the VR screen of the HMD. Each image from the head cameras is projected to the corresponding eye of the operator through the HMD. The camera image from the mobile base is also transmitted to the HMD via TCP/IP.

The VIVE trackers and haptic feedback devices are used to teleoperate TOCABI’s arms for manipulation. Real-time position and orientation information for the VIVE trackers is obtained through their open-source API at a rate of 90 Hz. The operator’s hand position and orientation are calculated using the forward kinematics of the haptic feedback device at 2000 Hz. To determine the relative position between both hands, the distance between the two haptic feedback devices is measured. ROS topics for the VIVE trackers and the haptic feedback devices are published to the motion retargeting algorithm on the TOCABI PC at a rate of 100 Hz. The motion retargeting algorithm calculates the desired joint positions and velocities of TOCABI’s upper body from the transmitted data (Sect. 3.1). The TOCABI PC provides F/T sensor data to the Haptic PC via ROS for force feedback.
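As a rough illustration of this data path, the sketch below publishes an operator hand pose to the retargeting side at 100 Hz over ROS 1. The topic name, frame id, and the placeholder pose source are assumptions for illustration only, not the exact interfaces used in our system.

```python
# Minimal sketch: publish the operator's hand pose to the motion-retargeting
# node at 100 Hz over ROS 1. Topic names and the pose source are illustrative.
import rospy
from geometry_msgs.msg import PoseStamped

def get_haptic_device_pose():
    """Placeholder for the 2 kHz forward-kinematics result of the haptic device."""
    msg = PoseStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = "operator_base"   # assumed frame name
    return msg

def main():
    rospy.init_node("operator_pose_publisher")
    pub = rospy.Publisher("/operator/right_hand_pose", PoseStamped, queue_size=1)
    rate = rospy.Rate(100)                  # retargeting input rate reported above
    while not rospy.is_shutdown():
        pub.publish(get_haptic_device_pose())
        rate.sleep()

if __name__ == "__main__":
    main()
```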

The haptic gloves measure the finger movements of the operator. The measured data is transmitted from the haptic glove API on the Operator PC to the Hand Controller on the Processing PC via ROS. The Hand Controller maps the operator’s hand motion to the robot hand using the transmitted glove data. The Hand Controller and the recognition algorithm send feedback information to the gloves via ROS. The current of each of TOCABI’s finger motors, \(I_{finger}\), is measured and transmitted to the haptic glove API, which provides kinesthetic feedback to the operator, indicating whether TOCABI’s hand has grasped an object. The OptoForce sensors on the robot hand’s fingertips measure the contact forces of the fingertips. The recognition algorithm on the Recognition PC distinguishes the surface of the stone under the palm of the robot’s hand. Using the measured fingertip forces and the recognized object, information about the object that the robot hand grasps or touches is delivered to the operator through the vibration motor of the gloves (Sects. 4.3 and 4.4).

The pedals are used to drive the mobile base. The output of the pedals is transmitted from the pedal API of the Operator PC to the Mobile Controller of the Mobile PC via ROS. The Mobile Controller calculates the velocity of the wheels using the transmitted pedal command (Sect. 3.3).

2.3 Untethered System

In order for our system to operate untethered, we implemented batteries for the Avatar robot and wireless communication between the Avatar robot and the operator station. Two Tattu Plus 22,000 mAh 22.2 V LiPo batteries are carried on both sides of TOCABI’s waist. The batteries supply the rated voltage to the robot PC and Elmo drivers through voltage conversion, with further details explained in [20]. The mobile base uses a separate PC and power supply from the robot and is powered by its own attached battery.

For communication between the operator station and the Avatar robot, a Netgear R7800 router was used. During the ANA Avatar XPRIZE Finals, the XPRIZE network was provided at the venue and team garage. Our operator station (Operator PC and Haptic PC) was connected to the XPRIZE network via an Ethernet line, while the Avatar robot (Processing PC, TOCABI PC, Recognition PC, and Mobile PC) was connected wirelessly via the router.

3 Teleoperation

This section describes the three types of teleoperation: upper body operation, hand operation, and mobile base operation.

3.1 Upper Body Operation

Fig. 6

Description of the coordinates of the operator and the robot. Each arrow indicates which part of the operator is mapped onto the Avatar robot. The yellow circles indicate the origin of each coordinate frame. The red, green, and blue arrows are the orientation axes in x, y, and z, respectively. The wrist and upper arm use the orientations about all three x, y, and z axes. The upper body uses the orientation about the x and y axes, while the head uses the orientation about the y and z axes. a Illustration of the coordinates of the operator. The operator positions their arms in an ‘L’ shape for pose calibration. b Illustration of the coordinates mapped to the Avatar robot. (Color figure online)

Fig. 7

Diagram of motion mapping. \({\textbf {p}}^o_{ha}\) and \({\textbf {R}}^o_{ha}\) are the position and orientation of the hand of the operator. \({\textbf {p}}^o_{b}\) and \({\textbf {R}}^o_{b}\) are the position and orientation of the body of the operator. The body includes upper body, upper arm, shoulder, and head, \(b \in \{upper body, upper arm, shoulder, head\}\). \({\textbf {p}}^r_{ha,d}\), \({\textbf {R}}^r_{ha,d}\) and \({\textbf {R}}^r_{b,d}\) are the desired position and orientation of the robot. \({\textbf {q}}_d\) is the desired joint angle. \(\varvec{\uptau }_d\) and \(\mathbf {\uptau }_g\) are desired torque and gravity torque, respectively. (Color figure online)

The operator’s upper body movements are tracked by haptic feedback devices and VIVE trackers to control TOCABI. We have combined haptic feedback devices and VIVE trackers to accurately measure the position and orientation of the operator’s hand while enabling the Avatar robot to mimic the operator’s upper body movements simultaneously. Additionally, force feedback can be delivered to the operator by the haptic feedback devices.

In Fig. 6, the coordinates of the operator delivered to the Avatar robot using a haptic device and VIVE trackers are shown. Figure 6a shows how the operator uses the haptic feedback devices, VIVE trackers, haptic gloves, HMD, and pedal. The haptic feedback devices measure the position and orientation of the hand of the operator, \({\textbf {p}}^o_{ha}\) and \({\textbf {R}}^o_{ha}\). The VIVE trackers and the HMD measure the position and orientation of the chest, upper arm, shoulder, and head. The measured coordinates of the operator are mapped into the coordinates of the Avatar robot according to the diagram in Fig. 7. The pose required for Pose Calibration in Fig. 7 involves attaching both arms to the body and forming an ’L’ shape with the arms. The method used to map the orientation of the shoulder, upper arms, chest, and head was introduced in our previous research [37]. From this Pose Calibration, the initial position of the operator’s hands, \(\overline{\textbf{p}}^o_{ha,i}\), are obtained. The next step is to calculate the desired velocity of the robot hand. This is done using the following formula:

$$\begin{aligned} \textbf{p}^r_{ha,d}&= \overline{\textbf{p}}^r_{ha,i} + a\,(\textbf{p}^o_{ha} -\overline{\textbf{p}}^o_{ha,i}), \end{aligned}$$
(1)
$$\begin{aligned} \textbf{R}^{r}_{ha,d}&= \textbf{R}^o_{ha}, \end{aligned}$$
(2)
$$\begin{aligned} \dot{\textbf{p}}^r_{ha,d}&= \textbf{K}\,(\textbf{p}^r_{ha,d}-\textbf{p}^r_{ha}), \end{aligned}$$
(3)

where \(\textbf{p}^r_{ha,d}\) and \(\dot{\textbf{p}}^r_{ha,d}\) are the desired position and velocity of the robot hand, and \(\overline{\textbf{p}}^r_{ha,i}\) is the initial position of the robot hand corresponding to the ‘L’ pose. \(a \in [1.0, 1.3]\) is the scaling factor that determines how far the robot’s hand moves in proportion to the distance moved by the operator’s hand [37]. When \(a = 1.0\), the robot hand moves the same distance as the operator’s hand, and when \(a = 1.3\), it moves 1.3 times that distance. This scaling is based on the ratio between the operator’s arm length and the robot’s arm length. \(\textbf{K}\) is the feedback gain for tracking the desired position of the robot’s hand.
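A minimal sketch of this retargeting step, written directly from Eqs. (1)–(3), is shown below. The gain matrix and scaling value are illustrative choices, not the tuned values used on TOCABI.

```python
# Sketch of the hand retargeting in Eqs. (1)-(3): the operator's hand
# displacement from the 'L' calibration pose is scaled by a and added to the
# robot's calibrated hand position; a proportional law then gives the desired
# hand velocity. Gains and poses here are illustrative placeholders.
import numpy as np

def retarget_hand(p_o_ha, p_o_ha_init, p_r_ha_init, p_r_ha, R_o_ha,
                  a=1.2, K=np.diag([2.0, 2.0, 2.0])):
    p_r_ha_d = p_r_ha_init + a * (p_o_ha - p_o_ha_init)   # Eq. (1)
    R_r_ha_d = R_o_ha                                      # Eq. (2): orientation copied 1:1
    v_r_ha_d = K @ (p_r_ha_d - p_r_ha)                     # Eq. (3): desired hand velocity
    return p_r_ha_d, R_r_ha_d, v_r_ha_d
```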

Table 3 Priority of the tasks

The angular velocities of TOCABI’s joints are calculated using hierarchical quadratic programming (HQP)-based inverse kinematics [37]. Table 3 displays the task priorities related to Fig. 7. The top priority is to ensure safety by restricting joint angle, joint velocity, and hand velocity. The second priority is the orientation of the chest, whereas the third priorities are the position and orientation of the hand, and the orientation of the head. The head orientation is used to control the visual feedback in the HMD. The orientation control of the upper arm and shoulders has the lowest priority, which aims to make the robot’s pose similar to that of the operator. The optimal joint velocities of the upper body are computed by solving the HQP problem in (4) while adhering to the task priorities in Table 3,

$$\begin{aligned} \begin{array}{ll} \mathop {\min }\limits _{\dot{{\textbf {q}}}} \quad {\rho _{p}\left\| {\textbf {J}}_{p}\dot{{\textbf {q}}}-\dot{{\textbf {x}}}_{p,d}\right\| ^{2} + \left\| \dot{{\textbf {q}}} \right\| _{A}^{2}}\\ \quad \text {s.t.}\qquad {\underline{\dot{{\textbf {x}}}}^r_{ha} \le {\textbf {J}}_{ha}\dot{{\textbf {q}}} \le \overline{\dot{{\textbf {x}}}}^r_{ha}}\\ \qquad \qquad {{\textbf {K}}_q(\underline{{\textbf {q}}}-{\textbf {q}}_{k-1}) \le \dot{{\textbf {q}}} \le {\textbf {K}}_q(\overline{{\textbf {q}}}-{\textbf {q}}_{k-1})}\\ \qquad \qquad {\underline{\dot{{\textbf {q}}}} \le \dot{{\textbf {q}}} \le \overline{\dot{{\textbf {q}}}}}\\ \qquad \qquad {{\textbf {J}}_n\dot{{\textbf {q}}} = {\textbf {J}}_n\dot{{\textbf {q}}}^*_{p-1}, \quad \forall n \in \{2, \ldots , p-1\}, \ (p \ge 3),} \\ \end{array} \end{aligned}$$
(4)

where p denotes the \(p^{th}\) priority task in Table 3. \(\dot{{\textbf {x}}}_{p,d}\) is the desired velocity of the \(p^{th}\) priority task, and \({\textbf {J}}_{p}\) is the corresponding Jacobian matrix. \(\rho _p\) is the weighting value for the control error and should be much larger than 1 to track the desired motion accurately. The first term of the cost function in Eq. (4) minimizes the velocity error of the \(p^{th}\) task, \(\Vert \textbf{J}_{p}\dot{\textbf{q}}-\dot{\textbf{x}}_{p,d} \Vert ^{2}\). \(\left\| \dot{{\textbf {q}}} \right\| _{A}^{2}\) is a regularization term weighted by the inertia matrix of the robot, \({\textbf {A}}\), which minimizes the kinetic energy of the robot. \({\textbf {J}}_{ha}\) is the Jacobian matrix of the hand, and \(\underline{\dot{{\textbf {x}}}}^r_{ha}\) and \(\overline{\dot{{\textbf {x}}}}^r_{ha}\) are the lower and upper limits of the robot hand velocity, respectively. \(\underline{{\textbf {q}}}\) and \(\overline{{\textbf {q}}}\) are the lower and upper limits of the upper body joint angles, while \(\underline{\dot{{\textbf {q}}}}\) and \(\overline{\dot{{\textbf {q}}}}\) are the lower and upper limits of the upper body joint velocities, respectively. \({\textbf {J}}_n\) is the Jacobian matrix of a task with higher priority than p, and \(\dot{{\textbf {q}}}^*_{p-1}\) is the optimal value obtained from the previous hierarchy. The optimized \(\dot{{\textbf {q}}}^*\) is then integrated to obtain the desired joint position, \({\textbf {q}}_d\). To avoid self-collision, the method introduced in [61] was applied to TOCABI. If a self-collision is detected, TOCABI halts its motion and notifies the operator, allowing the operator to move away from the self-collision situation before resuming the robot’s operation. The desired torque, \(\varvec{\tau }_d\), is calculated using a proportional-derivative (PD) controller in the joint space with gravity torque compensation. The latency between the operator and the controller is around 10 to 20 ms, and the operator is barely aware of the delay.
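To make the structure of one priority level of Eq. (4) concrete, the following sketch solves a single level with CVXPY as a stand-in QP solver; it is not the solver used on TOCABI. All matrices, bounds, and gains are illustrative placeholders, and the equality constraints that freeze higher-priority tasks are passed in from the previous level.

```python
# Sketch of one priority level of the HQP in Eq. (4): minimize the weighted
# task-velocity error plus an inertia-weighted regularizer, subject to
# hand-velocity, joint-angle, and joint-velocity limits.
import numpy as np
import cvxpy as cp

def solve_priority_level(J_p, xdot_d, J_ha, A, q_prev,
                         q_min, q_max, qd_min, qd_max,
                         xdot_ha_min, xdot_ha_max,
                         Kq=2.0, rho=1e4, higher_constraints=()):
    n = A.shape[0]
    qdot = cp.Variable(n)
    cost = rho * cp.sum_squares(J_p @ qdot - xdot_d) + cp.quad_form(qdot, A)
    constraints = [
        J_ha @ qdot >= xdot_ha_min, J_ha @ qdot <= xdot_ha_max,        # hand-velocity limits
        qdot >= Kq * (q_min - q_prev), qdot <= Kq * (q_max - q_prev),  # joint-angle limits
        qdot >= qd_min, qdot <= qd_max,                                # joint-velocity limits
    ]
    constraints += list(higher_constraints)   # J_n qdot == J_n qdot*_{p-1} terms for p >= 3
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return qdot.value                          # optimal upper-body joint velocities
```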

Fig. 8

Five gestures for mapping operator’s hand pose to the robot hand. Each figure illustrates the measured finger joints for each posture. (Color figure online)

3.2 Hand Operation

Exoskeleton-type gloves are worn on the operator’s hands to control the robot hands. The mapping of the operator’s pose to the robot hand is explained in Fig. 8. The glove shown in Fig. 4 has four linkage-type fingers and can measure the F/E and A/A joint angles of the operator’s fingers. The maximum and minimum F/E angles of the operator’s fingers are measured using five mapping gestures [62]. These values are then linearly mapped to the maximum and minimum F/E angles of the robot hand so that the operator’s finger motions control the robot hand, as sketched below. This method extends a previous study that mapped human actions onto a robot hand with three fingers [62] to the four-fingered hand used here. The A/A motion of the operator’s fingers is mapped to the robot hand in the same way. The A/A movement of the robot hand enables it to stably grasp objects of various shapes.
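The sketch below shows the linear F/E mapping described above. The calibration values in the example are made up for illustration; the actual ranges come from the five mapping gestures of each operator and the robot finger limits.

```python
# Sketch of the linear F/E joint mapping: the operator's measured finger angle,
# calibrated between the min/max captured during the mapping gestures, is
# rescaled onto the robot finger's joint range.
import numpy as np

def map_finger_angle(theta_op, op_min, op_max, robot_min, robot_max):
    # Normalize the operator angle to [0, 1] over the calibrated range,
    # then rescale to the robot joint range.
    s = np.clip((theta_op - op_min) / (op_max - op_min), 0.0, 1.0)
    return robot_min + s * (robot_max - robot_min)

# Example: an operator F/E range of 5-80 deg mapped to a robot range of 0-90 deg.
print(map_finger_angle(40.0, 5.0, 80.0, 0.0, 90.0))
```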

3.3 Mobile Operation

Fig. 9

Illustration of the mobile base operation procedure: The button \(\textcircled {L}\) changes the HMD menu. The button \(\textcircled {R}\) cycles through the three modes of the mobile base: Drive, Reverse, and Parking. The buttons \(\textcircled {1}\) and \(\textcircled {2}\) operate the mobile base. When the \(\textcircled {1}\) or \(\textcircled {2}\) button is pushed, the mobile base moves in the respective direction: left for the \(\textcircled {1}\) button and right for the \(\textcircled {2}\) button. Pressing both the \(\textcircled {1}\) and \(\textcircled {2}\) buttons drives the mobile base forward or backward, depending on whether Drive or Reverse mode is active. If the pedal (\(\textcircled {3}\)) is rotated, the mobile base rotates accordingly: clockwise \(\textcircled {-3}\) or counterclockwise \(\textcircled {3}\). (Color figure online)

The mobile base of the robot has four mecanum wheels, which enable it to move in any direction. Out of the available choices of joysticks, 3D Rudder pedals, and flight pedals, we selected the flight pedal as the interface for controlling the mobile base. We excluded the joystick because the operator’s hands needed to control the robot’s arm remotely. Additionally, we ruled out the 3D Rudder pedal because the operator would have to continuously pay attention to keep it in a neutral position when not actively moving the mobile base. The procedure for mobile base teleoperation is illustrated in Fig. 9.

The buttons labeled \(\textcircled {1}\) and \(\textcircled {2}\) can only be pressed in one direction, so a separate reverse button is required to switch to Reverse mode. When in Parking mode, the mobile base does not respond to pedal commands. When a pedal command is generated, the desired velocities (\(v^m_{d,x}, v^m_{d,y}\), \(\omega ^m_{d,z}\)) are mapped using pre-defined maximum velocities; the superscript m denotes the mobile base. For the ANA Avatar XPRIZE Finals, we set a constant maximum velocity of 0.75 m/s for Drive mode, 0.5 m/s for sideways movement, and 0.5 rad/s for rotation. The pedal input value ranges from 0 to 1, depending on the degree to which the pedal is pressed, and is used to determine the desired velocity by scaling the corresponding maximum velocity. The desired angular velocities of each wheel (\(\omega _1, \omega _2, \omega _3,\) and \(\omega _4\)) are calculated from the desired velocity of the mobile base using the kinematics in the equation below, as described in [63].

$$\begin{aligned} \begin{bmatrix} \omega _1 \\ \omega _2 \\ \omega _3 \\ \omega _4 \end{bmatrix} = \frac{1}{R} \begin{bmatrix} 1 & 1 & -(l_1 + l_2) \\ 1 & -1 & l_1 + l_2 \\ 1 & -1 & -(l_1 + l_2) \\ 1 & 1 & l_1 + l_2 \end{bmatrix} \begin{bmatrix} v^m_{d,x}\\ v^m_{d,y}\\ \omega ^m_{d,z} \end{bmatrix}, \end{aligned}$$
(5)

where R is the radius of each wheel, \(l_1\) is the distance from the center of the mobile base to the center of the wheels along its width, and \(l_2\) is the distance from the center of the base to the center of the wheels along its length.
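The sketch below applies Eq. (5) to a pedal-derived velocity command. The wheel radius and base geometry values are assumed for illustration and are not the actual dimensions of our mobile base.

```python
# Sketch of Eq. (5): map the commanded base velocity (vx, vy, wz) to the four
# mecanum wheel angular velocities. Geometry values below are assumptions.
import numpy as np

R  = 0.076   # wheel radius [m] (assumed)
l1 = 0.22    # center-to-wheel distance along the width [m] (assumed)
l2 = 0.28    # center-to-wheel distance along the length [m] (assumed)

def wheel_speeds(vx, vy, wz):
    L = l1 + l2
    M = np.array([[1,  1, -L],
                  [1, -1,  L],
                  [1, -1, -L],
                  [1,  1,  L]])
    return (1.0 / R) * M @ np.array([vx, vy, wz])

# Example: Drive mode at the competition limit of 0.75 m/s, no lateral or angular motion.
print(wheel_speeds(0.75, 0.0, 0.0))
```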

4 Telepresence

Telepresence technology enhances the operator’s perception, creating the feeling as if they are physically present at the location of the Avatar robot. Our robotic avatar system caters to three of the five senses: vision, hearing, and touch. Additionally, the operator can sense the robot’s interactions with the environment through force feedback. In this section, we will elaborate on the HMD that provides vision and hearing in TOCABI, as well as the Around View Monitor (AVM) that enables vision around the mobile base. We will also detail the force, tactile, and kinesthetic feedback mechanisms that convey to the operator the robot’s interactions with objects or the environment.

Fig. 10

Snapshot of the HMD view: the surroundings and UI are presented to the operator. Each rectangle signifies the type of information being transmitted to the operator. (Color figure online)

4.1 Head Mounted Display

4.1.1 Visual Feedback

Our robotic avatar system provides the operator with visual and auditory information through an HMD for telepresence [47, 64]. To achieve this, we utilize the HTC VIVE Pro 2 HMD, which offers a resolution of 2448 \(\times \) 2448 pixels per eye. The image is captured through a USB camera on the robot with OpenCV, encoded with Python’s TurboJPEG codec, and then sent via TCP. In Unity, the image is received over TCP using the TurboJpegWrapper, decoded, applied to a Unity texture, and then displayed on the operator’s HMD with a latency of 100 ms. The HMD has a microphone and speaker, which allow the operator to hear and communicate with individuals near TOCABI. The visual image captured by TOCABI’s head cameras is transmitted to the HMD.
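A simplified robot-side version of this pipeline is sketched below using OpenCV and PyTurboJPEG, matching the OpenCV capture and TurboJPEG encoding described above. The TCP port, JPEG quality, camera index, and length-prefixed framing are assumptions for illustration.

```python
# Sketch of the robot-side video path: grab a frame with OpenCV, JPEG-encode it
# with PyTurboJPEG, and stream it over TCP with a 4-byte length prefix so the
# receiver can split the stream back into frames.
import socket
import struct
import cv2
from turbojpeg import TurboJPEG

jpeg = TurboJPEG()
cap = cv2.VideoCapture(0)                           # head camera (index assumed)
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 5005))                         # port assumed
srv.listen(1)
conn, _ = srv.accept()

while True:
    ok, frame = cap.read()
    if not ok:
        break
    buf = jpeg.encode(frame, quality=80)            # BGR frame -> JPEG bytes
    conn.sendall(struct.pack("!I", len(buf)) + buf) # length prefix + payload
```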

4.1.2 User Interface

The HMD not only displays the robot’s surroundings but also presents a user interface (UI) to assist the operator in teleoperating the robot. The operator perceives the scene viewed by the robot through the HMD, as depicted in Fig. 10. The lower section of Fig. 10 explains the information conveyed to the operator, which includes the Mobile Direction UI and AVM for mobility, Finger Grasp Feedback for kinesthetic feedback assistance, and the Force Bar for force feedback. The UI provides information on the connection status between the operator and the Avatar robot, finger grasp feedback, and force feedback. The AVM shows the surroundings of the mobile base along with the current driving mode, which can be D (Drive), P (Parking), or R (Reverse). The Mobile Direction UI indicates the direction in which the mobile base is moving.

Fig. 11

Display of the UI according to the robot’s status. The stages of the robot connection process are marked with \(\textcircled {1}, \textcircled {2}\), and \(\textcircled {3}\). The corresponding notices for each state are as follows: Ready Pose in \(\textcircled {1}\), Connected in \(\textcircled {2}\), and Disconnected in \(\textcircled {3}\). \(\textcircled {4}\) and \(\textcircled {5}\) represent the steps when a disconnection occurs. In \(\textcircled {4}\), the notice is Disconnected Press Left button to open menu, and in \(\textcircled {5}\), it is Press Ready Pose BUTTON!! before Reconnect. (Color figure online)

The UI provides the operator with real-time information about the robot’s status. As shown in Fig. 11, the robot can be in one of three states: \(\textit{Ready Pose}\), \(\textit{Connected}\), or \(\textit{Disconnected}\). The \(\textit{Ready Pose}\) represents the initial pose of TOCABI, which the operator needs to replicate before connecting to the robot. The robot’s movement is only enabled in the \(\textit{Connected}\) state, and transitioning from \(\textit{Ready Pose}\) to \(\textit{Connected}\) facilitates this motion.

In case of an emergency or a singularity occurrence, the robot system automatically switches to the \(\textit{Disconnected}\) mode while updating the information on the HMD screen. After that, the operator can reset the robot’s state to the \(\textit{Ready Pose}\). Once this reset has been performed, the robot becomes operational again and transitions back to the \(\textit{Connected}\) state. This mechanism ensures safe control of the robot and prevents potential damage during its operation.

4.2 Force Feedback

Fig. 12

Snapshots of force feedback through the HMD. Each circle indicates the object grasped by the robot hand, while the color and length of each rectangle indicate the object’s weight. a The snapshot depicts the moment when TOCABI activates the switch. As shown in the white rectangle, the force bar is short and green, signaling that the force feedback from the switch is light. b The snapshot captures TOCABI lifting the drill. As shown in the white rectangle, the force bar is long and red, indicating the substantial weight of the drill. (Color figure online)

The proposed system provides force feedback in two ways: visual feedback and haptic feedback. When the robot lifts an object, the F/T sensor on the wrist detects the changes in force and torque induced by the object. The weight of the object is then calculated using the wrist orientation and the F/T sensor values and displayed on the force bar on the HMD. The changes in the force bar corresponding to different objects are illustrated in Fig. 12. For example, when TOCABI operates the lightweight switch, the force bar appears as a short green bar, as shown in Fig. 12a. On the other hand, when lifting the heavy drill, the force bar turns red and increases in length, as depicted in Fig. 12b.
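As a rough sketch of how such a weight estimate and bar rendering could be computed, the snippet below rotates the wrist F/T reading into the base frame, removes the hand's own contribution, and maps the remaining vertical force to a bar length and color. The bias term, thresholds, and color rule are assumptions, not the exact values used in our UI.

```python
# Sketch: estimate the held object's weight from the wrist F/T reading and the
# wrist orientation, then map it to a force-bar length and color for the HMD.
import numpy as np

def object_weight(f_sensor, R_base_wrist, f_hand_bias):
    f_base = R_base_wrist @ (f_sensor - f_hand_bias)  # express in the robot base frame
    return max(0.0, -f_base[2])                       # downward force component = weight [N]

def force_bar(weight_n, w_max=15.0):
    length = min(weight_n / w_max, 1.0)               # normalized bar length (assumed scale)
    color = "green" if weight_n < 5.0 else "red"      # light vs heavy object (assumed threshold)
    return length, color
```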

Fig. 13

An illustration of force feedback from the robot to the haptic feedback device of the operator. Yellow circles represent F/T sensors attached to the wrists. The force measured by the F/T sensor is calculated as the desired torque through clipping and mapping using the transpose of the Jacobian matrix. Red, green, and blue arrows indicate the force feedback’s x, y, and z directions, respectively. (Color figure online)

Fig. 14

Desired force for clipping and scaling. \(F_{sensor}\) represents the force measured by the F/T sensor attached to the wrist, oriented in the gravity direction. \(F_{d}\) represents the desired force transmitted to the operator as force feedback. \(F_{dead}\) serves as a threshold for the dead zone, allowing values attributed to uncertainties to be neglected. In the light-weight zone, \(F_{d} = F_{sensor}-F_{dead}\). In the heavy-weight zone, \(F_{d} = F_{light} + K_{scaling}\times (F_{sensor}-F_{light})\), where \(K_{scaling}\) is the force-scaling coefficient. If \(F_{d}\) surpasses \(F_{clipping}\), then \(F_d = F_{clipping}\). (Color figure online)

The haptic feedback devices allow the operator to feel the force feedback physically. The acquisition and conveyance of force feedback to the operator are depicted in Fig. 13. The force exerted on the robot’s hand, denoted as \(F_{sensor}\), is measured with the F/T sensor placed on the wrist. \(F_{sensor}\) reflects only the weight of the object, because the contribution of the robot hand is removed, and it is represented in the robot base frame. When TOCABI is not holding anything, gravity and friction compensation are applied to the haptic feedback device as follows:

$$\begin{aligned} \tau ^{haptic}_{d} = \tau ^{haptic}_{gravity} + \tau ^{haptic}_{friction}, \end{aligned}$$
(6)

where \(\tau ^{haptic}_{d}\) is the input torque for the haptic feedback device, \(\tau ^{haptic}_{gravity}\) and \(\tau ^{haptic}_{friction}\) are the gravity and friction torques for the haptic feedback device, respectively. \(\tau ^{haptic}_{gravity}\) is determined through computations based on the CAD model of the haptic feedback device. The \(\tau ^{haptic}_{friction}\) was calculated by adjusting the coefficients of static friction and viscous friction.

TOCABI measures the force, \(F_{sensor}\), exerted when lifting an object, and this force is scaled by the factor \(K_{scaling}\) and clipped so that it does not exceed a specific value, \(F_{clipping}\), as shown in Fig. 14. According to the competition regulations, we needed to differentiate between objects weighing at most 32 oz (around 900 g) and lighter ones (around 300 g). Our system sets \(F_{dead}\), \(F_{light}\), and \(F_{heavy}\) to 2.0 N, 2.5 N, and 5.25 N, respectively. When the measured force \(F_{sensor}\) falls between 2.5 N and 5.25 N, the difference between \(F_{sensor}\) and \(F_{light}\) is multiplied by the scaling factor \(K_{scaling}\) to calculate \(F_d\). During the ANA Avatar XPRIZE Finals, we set \(K_{scaling}\) to 4. The resulting scaled force, \(F_{d}\), is then added to (6), yielding

$$\begin{aligned} \tau ^{haptic}_{d} = \tau ^{haptic}_{gravity} + \tau ^{haptic}_{friction} + J^T_{haptic}F_{d}, \end{aligned}$$
(7)

where \(J^T_{haptic}\) is the transpose of the Jacobian matrix of the haptic feedback device, and \(F_{d}\) is the reaction force scaled by the object weight.
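The sketch below strings together the dead zone, scaling, clipping, and Jacobian-transpose mapping of Fig. 14 and Eq. (7). The zone boundaries follow the formulas given in the caption of Fig. 14, and the numerical parameters follow the values stated in the text; treat the zone switching point as an assumption.

```python
# Sketch of the force-feedback pipeline: dead zone, light/heavy-weight scaling,
# clipping at F_clipping, and mapping to haptic-device joint torques via J^T.
import numpy as np

F_DEAD, F_LIGHT, F_CLIP, K_SCALING = 2.0, 2.5, 13.5, 4.0   # [N], values from the text

def desired_feedback_force(f_sensor):
    if f_sensor <= F_DEAD:                       # dead zone: ignore sensor uncertainty
        return 0.0
    if f_sensor <= F_LIGHT:                      # light-weight zone
        f_d = f_sensor - F_DEAD
    else:                                        # heavy-weight zone: scaled
        f_d = F_LIGHT + K_SCALING * (f_sensor - F_LIGHT)
    return min(f_d, F_CLIP)                      # clipping

def haptic_torque(J_haptic, f_d_vec, tau_gravity, tau_friction):
    # Eq. (7): gravity/friction compensation plus the Jacobian-transpose map.
    return tau_gravity + tau_friction + J_haptic.T @ f_d_vec
```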

Fig. 15

Illustration of the kinesthetic feedback system. a Snapshot depicting the moment when the Avatar robot grasps the object. b Realization of the kinesthetic feedback component of the haptic gloves. Each finger of the gloves is equipped with a spring and a string. The string is pulled and released by the string motor, while the spring ensures that the finger returns to its original position when the string motor is not in operation. When the robot hand grasps the object tightly, the current of the finger motor surpasses the threshold. The string motor pulls the string, resulting in the movement of the operator’s finger. (Color figure online)

4.3 Kinesthetic Feedback

The gloves provide kinesthetic feedback to the operator, indicating whether or not the robot’s hands have successfully grasped an object. Figure 15 shows how the kinesthetic feedback is transmitted to the glove. When the robot hand fully grasps an object, the fingers can no longer bend, causing their motor currents to increase and surpass a threshold. This triggers the servo motor (Hitec HS-5070MH) to pull the string, and the spring (MISUMI AUA5-15) generates a force that is applied to the operator’s finger, trying to extend it. The servo motor has a maximum torque of \(3.8 ~\text {kg} \cdot \text {cm}\) and drives a pulley with a 1 cm diameter, which can generate a force of up to 38 N; however, because the maximum load of the spring is 3.24 N, the kinesthetic feedback force is limited to 3.24 N. When used in the competition, the spring’s stiffness was further reduced so that a maximum force of 1.4 N is generated, and the spring is limited to prevent it from stretching beyond this force. The operator perceives the sensation of their finger being pulled, providing a tangible indication that the robot hand has successfully grasped the object.
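A minimal sketch of this trigger logic is given below. The current threshold, the servo commands, and the `servo` interface are hypothetical placeholders; only the decision rule (motor current above a grasp threshold pulls the string, otherwise the spring restores the finger) comes from the description above.

```python
# Sketch of the kinesthetic-feedback trigger: when the robot finger motor
# current exceeds a grasp threshold, the glove's string servo pulls the
# operator's finger; otherwise the spring returns it to its original position.
GRASP_CURRENT_THRESHOLD = 0.35          # [A] (assumed)
PULL_ANGLE, RELEASE_ANGLE = 60.0, 0.0   # servo commands in degrees (assumed)

def update_kinesthetic_feedback(i_finger, servo):
    """i_finger: measured robot finger motor current; servo: glove string servo (hypothetical)."""
    if i_finger > GRASP_CURRENT_THRESHOLD:
        servo.set_angle(PULL_ANGLE)     # pull the string -> extend the operator's finger
    else:
        servo.set_angle(RELEASE_ANGLE)  # release; the spring restores the finger
```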

Fig. 16

Illustration of the tactile feedback system. The blue circle in the left figure shows the vibration motor attached to the left index finger of the glove. The green circle and box highlight the Intel RealSense D435 camera affixed to the wrist. The red circle in the right figure indicates the 3-axis force sensors manufactured by OptoForce. These sensors measure the force when the robot hand’s fingers make contact with an object. The RealSense camera is used to recognize the surfaces of stones. The vibration motor imparts varying sensations to the user depending on the recognized roughness of the touched or detected stone. (Color figure online)

Table 4 Classification performance of model (%)

4.4 Tactile Feedback

Fig. 17

Explanation of the recognition result and the strategy of identifying the stone surface. a Snapshot of the recognition of the stone surface. Each stone label denotes its surface characteristics, recognition confidence, and the distance from the robot’s wrist. It’s important to note that this result snapshot is not delivered to the operator. b Stone Surface Identification Strategy: Our approach utilizes two sensor inputs, the optical force sensor on the fingertip and the RGB-D camera on the wrist, to identify the stone surfaces. The strategy yields four types of vibration responses. No vibration occurs when there is no input from the optical force sensor. If there is input from the optical force sensor, the YOLO v5 algorithm is employed to detect whether the object is a stone and to distinguish its roughness. In cases where the object is not identified as a stone, vibration is triggered to simulate touching. On the contrary, if a stone is detected, the vibration is determined based on the perceived roughness within the defined parameters (white box). (Color figure online)

During the ANA Avatar XPRIZE Finals, participating teams were challenged to create methods that provide physical haptic feedback. Our team developed a robotic avatar system that conveys the roughness of surfaces to the operator through the vibration motor on the index finger of the left glove. This capability was crucial for one of the competition missions, which required detecting a rough-surfaced stone hidden behind a curtain and out of view (explained in Sect. 6.1). Measuring the roughness of an object using a force sensor attached to a robot finger requires delicate manipulation of the finger while maintaining contact between the object and the sensor [65, 66], which becomes even more challenging when the object is out of sight. To address this, our system takes a more intuitive approach by employing an Intel RealSense camera mounted on the left wrist (Fig. 16). The YOLO v5 algorithm and Roboflow are used to recognize the surface of the stone [67, 68]. Moreover, since the items to be used in the final competition were disclosed by the XPRIZE organizers, we were able to prepare datasets in advance by acquiring stones of the same type. The performance of the trained model [69] is presented in Table 4. The mean average precision (mAP) [70] at an intersection over union (IoU) threshold of 0.5 and the mean AP at IoU thresholds ranging from 0.5 to 0.95 are expressed as mAP\(_{50}\) and mAP\(_{50:95}\), respectively. The recognition algorithm successfully distinguishes between smooth and rough stones, as shown in Fig. 17a.

The process of delivering tactile feedback about the roughness of a stone to the operator is illustrated in Fig. 17b and described in [69]. The robot hand’s contact with the object is initially determined based on the input values from the OptoForce sensor. If there is no input from the sensor, the operator does not receive any feedback. The wrist-mounted camera, which faces in the same direction as the palm, helps detect the presence of an object beneath the robot’s hand. If the robot hand is in contact but there is no object beneath the palm, the operator receives feedback about the contact only. When a stone is beneath the palm, the roughness recognition algorithm determines the surface roughness of the object. The dataset used for training the algorithm was obtained from various environments, including dark and bright settings, with certain parts of the stones obscured. Once the trained recognition system distinguishes the roughness of the stone, corresponding vibrations are transmitted to the operator: a low-frequency vibration is triggered when a smooth surface is detected, while a high-frequency vibration is transmitted upon detecting a rough surface. Although the system can only distinguish these two kinds of stones, it provides the operator with an intuitive perception of the stone’s roughness. While we successfully validated this approach on a test bed, we were unfortunately unable to test it during the competition.
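The decision logic of Fig. 17b can be summarized by the short sketch below. The contact-force threshold, vibration frequencies, and detector interface are illustrative assumptions rather than the tuned competition values; only the branching structure follows the strategy described above.

```python
# Sketch of the stone-surface feedback strategy: no vibration without fingertip
# contact; a generic "touch" vibration when no stone is detected under the palm;
# otherwise low- or high-frequency vibration for smooth or rough stones.
CONTACT_FORCE_THRESHOLD = 0.2    # [N] on the OptoForce fingertip sensor (assumed)
FREQ_TOUCH, FREQ_SMOOTH, FREQ_ROUGH = 80.0, 60.0, 180.0   # [Hz] (assumed)

def tactile_feedback(fingertip_force, detection):
    """detection: None, or a (label, confidence) pair from the wrist-camera YOLOv5 model."""
    if fingertip_force < CONTACT_FORCE_THRESHOLD:
        return 0.0                   # no contact -> no vibration
    if detection is None:
        return FREQ_TOUCH            # contact, but no stone under the palm
    label, conf = detection
    return FREQ_ROUGH if label == "rough" else FREQ_SMOOTH

# Example: contact with a stone recognized as rough with 0.9 confidence.
print(tactile_feedback(1.1, ("rough", 0.9)))
```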

Fig. 18

Snapshots of benchmark tasks. a Test of grasping a stone, an empty bottle, and a joystick controller. b Test of force feedback to identify the weight of three bottles (12.7 N, 4.9 N, and 0.24 N). c Test of identifying the roughness of stone surfaces (rough and smooth); the curtain blocks the operator from viewing the stones. d Test of mobility in an environment with a trajectory length of 6.5 m and one obstacle. e Test of drill maneuverability (drill weight: 1.7 kg, bolt height from the ground: approximately 1 m). (Color figure online)

5 System Evaluation

We evaluated our system through a user study in which users encountered the tasks for the first time. Participants were given instructions solely through verbal explanations before the evaluation test. Ten participants each carried out a single trial. All participants were members of the research team and had a basic understanding of our system. Participants were divided into three groups based on their prior experience with the system: Beginner (up to 90 min), Intermediate (90–180 min), and Expert (over 180 min). There were 3 Beginners, 4 Intermediates, and 3 Experts.

5.1 Benchmark Tasks

To evaluate the intuitiveness of our avatar system’s telepresence and teleoperation, five benchmark tasks were designed based on the ANA Avatar XPRIZE Competition.

  • Manipulation and grasping task (Fig. 18a): The purpose of this task was to assess the telemanipulation capabilities. The task involved using TOCABI’s arm to reach a specific location and grasp different objects, such as a joystick, water bottle, and stone, placed on a table. The aim was to evaluate the arm and hand operations involved in the task. During the task, the Avatar robot picked up each object sequentially and handed them to a person standing in front of the table. The total time taken for the task was measured from the moment the participant began moving until the task was completed.

  • Identifying weight (Fig. 18b): This task was created to test the participant’s ability to distinguish weight using force feedback. Three water bottles, weighing 12.7 N, 4.9 N, and 0.24 N respectively, were placed on a table. In the Avatar XPRIZE Competition, the objective was to differentiate between canisters that weighed approximately 12 N and 2.5 N [49]. For this evaluation experiment, we aimed to assess whether the system could differentiate between even smaller weight differences. The Avatar robot lifted the bottles one by one, and the participants had to determine the sequence of the weights based on the force feedback. In the force feedback experiment, participants were required to identify the order of heaviness of the three objects solely using the haptic device, without relying on the information from the HMD. The test was considered successful only if the order was entirely correct.

  • Identifying stone surfaces (Fig. 18c): This task tests the ability to identify the roughness of a surface without visual feedback. A curtain obstructed the view between the robot and the table, preventing the participants from seeing the table. On the table, two stones were present: one rough and the other smooth. The participants used the robot’s left hand to differentiate the roughness of the stones. The success or failure of identifying the rough stone was then measured.

  • Mobility (Fig. 18d): In this task, we evaluated how adept participants were at controlling the mobile base using the pedal interface together with the visual interface. The start and finish points were marked with rectangles. The robot had to navigate around a large table to reach the finish point, which was approximately 6.5 m away from the start point. The participants had to move the robot to the finish point without colliding with the obstacle; if the robot hit the table, the trial was considered a failure. We measured the total time from the moment the robot departed the start point until the participants believed that the robot’s mobile base had completely entered the finish point.

  • Drill maneuverability (Fig. 18e): This task was designed to evaluate the participant’s ability to perform precise manipulation tasks with heavy tools using the haptic device. The aim was to test whether participants could teleoperate both the robot arm and hand to reach a bolt positioned approximately 1 m above the ground and then loosen it by activating the drill. The force feedback from the drill was disabled during this test, as the drill was deemed too heavy and could negatively affect the participant’s teleoperation. Note that grasping the drill was not evaluated in this task. The total time taken to remove the bolt with the drill was measured from the moment the robot, holding the drill, was placed in front of the workspace.

Table 5 Results of the system evaluation experiment

5.2 Results

The results of the experiment are presented in Table 5, which shows the average completion times and success rates, along with their standard deviations (SD). The experiments can be viewed via the link provided in [71]. Fig. 19 highlights the tendencies observed within each group for each evaluation, although each experimental group contained only three or four participants.

Fig. 19
figure 19

Box and whisker plots of each evaluation test: manipulation and grasping, mobility, and drill maneuverability. The X-axis of each figure shows the groups of participants, and the Y-axis shows the completion time of each evaluation test. The blue boxes correspond to the beginner group, the green boxes to the intermediate group, and the red boxes to the expert group. The red lines within the boxes represent the median values, while the black circles represent the mean values. (Color figure online)

  • Manipulation and grasping task: The average completion time across all participants was \(79.5 \pm 37.9\) s, with individual times ranging from 28 s to 174 s. The standard deviation of 37.9 s indicates that performance in the manipulation and grasping task varied considerably with the participant’s skill. The box plot for each participant group is displayed in Fig. 19a. The mean completion times for the beginner, intermediate, and expert groups were \(109 \pm 56.3\) s, \(78.3 \pm 20.3\) s, and \(51.7 \pm 30.4\) s, respectively, showing that the task duration decreased with the participant’s experience level.

  • Identifying weight: The success rate of the participants in the identifying weight task was 60 \(\%\). All participants could identify the lightest object, but some had trouble distinguishing between the medium and heaviest objects. The success rate of each group is displayed in Fig. 20. Interestingly, the success rate of identifying weights did not seem to be related to the level of experience with the system, which could be associated with the clipping issue described in Fig. 14. As explained in Sect. 4.2, the desired forces \(F_d\) transmitted to the participant for each bottle after scaling were 0 N, 11.6 N, and 42.8 N (we set \(K_{scaling}\) to 4, the same value used in the ANA Avatar XPRIZE). However, because the scaled force for the heaviest bottle exceeds \(F_{clipping} = 13.5\) N, its \(F_{d}\) becomes 13.5 N instead of 42.8 N, so the feedback force difference between the second heaviest and the heaviest object is only 1.9 N (a minimal sketch of this scaling and clipping is given after this list). Also, due to the significant inertia of the haptic device, a less sensitive participant might have difficulty separating this 1.9 N difference from the counterforce arising from inertia. For this reason, we speculate that the expert group found it challenging to distinguish the force feedback arising from the 1.9 N difference from the inertial counterforce, likely because their familiarity with the haptic device led them to operate it more quickly, increasing the counterforce from inertia.

  • Identifying stone surfaces: All participants were able to identify stone surfaces with a success rate of 100 \(\%\), regardless of their level of experience with the system.

  • Mobility: The average time for the mobility test across all participants was 46.6 s, with a range of 35 s to 58 s. The mobility experiment was the one in which the influence of system experience appeared least significant. As shown in Fig. 19b, the average completion times for the beginner, intermediate, and expert groups were 44.6 s, 47.5 s, and 47.3 s, respectively. Surprisingly, having more experience with the system did not reduce the completion time. These results suggest that the mobile base system is intuitive and allows users with less system experience to perform comparably to those with more experience.

  • Drill maneuverability: The average completion time across all participants was \(39.8\pm 28.9\) s, with the shortest time being 16 s and the longest 124 s, indicating considerable variation. The mean and standard deviation for each group are shown in Fig. 19c: \(59.7\pm 55.7\) s for the beginner group, \(33.5\pm 5.2\) s for the intermediate group, and \(28.3\pm 12.5\) s for the expert group. The drill task was a complex experiment that required participants to align the drill held by the robot with a small bolt and then manipulate the drill button. Interestingly, excluding the first participant, the mean completion time of the remaining participants was 30.4 s, with a standard deviation of 7.62 s, suggesting that the proposed system does not pose significant difficulties in performing fine tasks with the drill. The first participant took over 120 s because he had difficulty keeping the drill perpendicular to the wall while attempting to remove the bolt. The beginner group showed more variability than any other group in the experiment.
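The clipping effect discussed in the identifying-weight result can be summarized with a minimal sketch. The snippet below is an illustration under our own assumptions, not the authors’ implementation: we assume the commanded feedback force is simply the measured load multiplied by \(K_{scaling}\) and saturated at \(F_{clipping}\), and the example loads are back-calculated from the reported \(F_d\) values of 0 N, 11.6 N, and 42.8 N with \(K_{scaling} = 4\).

```python
# Illustrative sketch of the force scaling and clipping discussed above.
# Assumption: the commanded feedback force is the measured load multiplied by
# K_scaling and saturated at F_clipping (offset handling in the real system may differ).

K_SCALING = 4.0      # scaling gain, as used in the ANA Avatar XPRIZE setting
F_CLIPPING = 13.5    # maximum force [N] commanded to the haptic device


def desired_feedback_force(measured_load_n: float) -> float:
    """Scale the measured load and clip it to the device limit."""
    return min(K_SCALING * measured_load_n, F_CLIPPING)


if __name__ == "__main__":
    # Loads back-calculated from the reported F_d values (0, 11.6, 42.8 N).
    for load in (0.0, 2.9, 10.7):
        print(f"load {load:5.2f} N -> commanded {desired_feedback_force(load):5.2f} N")
    # Output: 0.00 N, 11.60 N, 13.50 N -- after clipping, the gap between the
    # two heaviest bottles shrinks to 1.9 N.
```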

Fig. 20
figure 20

The success rate of the evaluation of identifying weight. The blue portions show the success rate, and the red portions show the failure rate. Clockwise from the top left, the charts correspond to all participants, the beginner group, the expert group, and the intermediate group. (Color figure online)

Summing the average completion times of the three timed tasks gives 213.3 s for the beginner group, 159.3 s for the intermediate group, and 127.3 s for the expert group, indicating that participants with more experience tend to handle the system more efficiently. However, the small number of participants in each group makes it difficult to draw strong conclusions from these results, so user studies with a larger number of participants would be beneficial in future research on avatar systems. Additionally, training participants who have never used the system and comparing their task performance as a function of training duration would allow a better analysis of the relationship between the amount of system experience and the ability to use the system effectively. A short check of the per-group totals is given below.
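The check below simply sums the group means reported above for the three timed tasks (manipulation and grasping, mobility, and drill maneuverability); no values beyond those already quoted are assumed.

```python
# Sanity check of the per-group totals quoted above: each total is the sum of
# that group's mean completion times over the three timed tasks.

group_means_s = {
    "beginner":     {"manipulation": 109.0, "mobility": 44.6, "drill": 59.7},
    "intermediate": {"manipulation": 78.3,  "mobility": 47.5, "drill": 33.5},
    "expert":       {"manipulation": 51.7,  "mobility": 47.3, "drill": 28.3},
}

for group, means in group_means_s.items():
    print(f"{group:12s}: {sum(means.values()):.1f} s")  # 213.3, 159.3, 127.3 s
```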

Fig. 21
figure 21

Description of ANA Avatar XPRIZE Finals missions and test course. The operator is in another room. (Color figure online)

Table 6 Missions of the ANA Avatar XPRIZE finals

6 ANA Avatar XPRIZE Finals

In this section, the missions and results of the ANA Avatar XPRIZE Finals are outlined. We also provide a brief introduction and analysis of the interfaces used by other teams and discuss the lessons and insights gained from our participation in the competition.

6.1 Missions of ANA Avatar XPRIZE Finals

During the Avatar XPRIZE finals, teams were ranked based on their scores, with a maximum of 15 points available: the Avatar Ability was worth 10 points, the Operator Experience 3 points, and the Recipient Experience 2 points. The avatar system’s ability was tested through 10 missions carried out by the operator, which evaluated the system’s performance and effectiveness. The locations of these ten missions are shown in Fig. 21, and detailed descriptions of each mission can be found in Table 6. In the Avatar Ability category, teams scored one point for a pass and zero points for a fail. Teams could proceed to the next mission only after completing the current one; if the allocated time expired before they succeeded, their trial ended.

The judges scored the Operator and Recipient Experience items: 0 points for Never/Poor, 0.5 points for Sometimes/Fair, and 1 point for Always/Good. Table 6 provides detailed explanations of the judges’ evaluations. An illustrative computation of a total score under these rules is sketched below.
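The snippet below sketches how a final score is composed under the rules just described. It is not an official scoring tool; in particular, the assumption that the judge score consists of three Operator Experience items and two Recipient Experience items (matching the 3-point and 2-point category maxima) is ours, and the example ratings are hypothetical.

```python
# Illustrative sketch of the scoring rules described above (not an official tool).
# Assumption: three Operator Experience items and two Recipient Experience items,
# each rated 0 / 0.5 / 1, plus one point per passed Avatar Ability mission.

JUDGE_POINTS = {"Never/Poor": 0.0, "Sometimes/Fair": 0.5, "Always/Good": 1.0}


def total_score(missions_passed, operator_ratings, recipient_ratings):
    ability = float(missions_passed)                              # up to 10 points
    operator = sum(JUDGE_POINTS[r] for r in operator_ratings)     # up to 3 points
    recipient = sum(JUDGE_POINTS[r] for r in recipient_ratings)   # up to 2 points
    return ability + operator + recipient                         # up to 15 points


# Hypothetical ratings that reproduce Team SNU's reported 12.5 points
# (8 missions passed, 4.5 judge points); the actual per-item ratings are not given here.
print(total_score(8,
                  ["Always/Good", "Always/Good", "Sometimes/Fair"],
                  ["Always/Good", "Always/Good"]))  # -> 12.5
```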

Fig. 22
figure 22

Snapshots of Team SNU performing the final missions on DAY 1. a Mission 1. b Mission 2. c Mission 3. d Mission 4. e Mission 5. f Mission 6. g Mission 7. h Mission 8. i Mission 9. Team SNU did not attempt Mission 10. (Color figure online)

6.2 Result of the ANA Avatar XPRIZE Finals

In the Avatar Finals, Team SNU received a score of 12.5 points for DAY 1Footnote 20 and DAY 2Footnote 21 (8 points for Avatar Ability and 4.5 points for Judge Experience). Figure 22 shows the missions our team carried out during the two days of the Final competition. Team SNU attempted 9 of the 10 missions over the two days, completing 8 of them; mission 9, which involved grasping a drill and unscrewing a bolt with it, was unsuccessful.

Here, we explain in detail the reasons behind the failure of mission 9. The task required grabbing a drill from a table, turning it on, and then moving to the next wall to remove a bolt by unscrewing it. On DAY 1, several factors led to the failure, as shown in Fig. 23. As evident in Fig. 23a, b, the operator grasped the drill while the orientations of the drill and the robot hand were not aligned. Consequently, the index finger could not fully push the drill button. Furthermore, the operator could not position the drill perpendicular to the wall because of the misaligned grasp. As a result, mission 9 on DAY 1 failed.

On DAY 2, mission 9 failed again; the reasons are shown in Fig. 24. The operator on DAY 2 successfully grasped the drill and placed the index finger on the drill button, as shown in Fig. 24a, b. However, after driving the robot toward the wall (Fig. 24c, d), the operator attempted to keep moving the mobile base instead of moving the arm. As shown in Fig. 24, even though the robot hand holding the drill hit the wall, the operator continued to move the mobile base, causing the drill to rotate within the hand. Consequently, mission 9 failed.

Fig. 23
figure 23

Snapshots of the drill mission failure on DAY 1: a Operator view when the orientation of the drill and the orientation of the robot hand are not aligned while holding the drill. b Test course view when the robot fingertip is not placed on the drill button. c Test course view when the robot hand is approaching a wall with the drill misaligned in the robot hand. d Test course view revealing a discrepancy between the operator’s moving direction toward the wall and the orientation of the robotic hand. e Test course view when the misalignment of orientation between the drill and the robot hand causes the drill button to be released. (Color figure online)

Fig. 24
figure 24

Snapshots of the drill mission failure on DAY 2: a Operator view when the orientation of the drill and the orientation of the robot hand are aligned while holding the drill. b Test course view showing the robot’s index fingertip placed on the drill button. c Test course view capturing the robot approaching the wall while holding the drill. d Test course view just before the robot holding the drill collides with the wall. e Snapshots illustrating the drill rotating within TOCABI’s hand after it hits the wall. The operator on DAY 2 did not stop the mobile base in front of the wall, causing the robot to continue moving toward the wall; as a result, the robot’s hand unintentionally collided with the wall. (Color figure online)

As described in Sect. 5.2, the drill task was difficult for an operator with limited experience with our avatar system to perform proficiently. Furthermore, the system has limitations, such as the difficulty of showing the robot hand holding the drill from various angles and the lack of depth-perception-like cues for detecting obstacles and collisions between the robot and its surroundings. The system’s failure to guide the operator on whether it was better to move the mobile base or manipulate the robot arm for task execution might also have contributed to the failure of the drill task.

6.3 Analysis of the ANA Avatar XPRIZE Finals

Table 7 Scores of the 12 teams on the last day of the ANA Avatar XPRIZE Finals

Table 7 shows the scores of the selected finalists; 12 teams were selected to participate in the DAY 2 test. The time shown in Table 7 is the time at which each team completed its last successful mission. Only four teams, namely NimbRo [72], Pollen Robotics [73], Team Northeastern [53], and AVATRINA [54], completed all the missions. Likewise, only four teams (NimbRo, Pollen Robotics, i-Botics [74], and Inbiodroid [75]) achieved the perfect judge score of 5, and only two teams, NimbRo and Pollen Robotics, achieved perfect scores in both the task and judge evaluations. These scores reveal that receiving the maximum score from the judges does not guarantee that an avatar system executes the missions perfectly.

In Fig. 25, the mission execution times of Team SNU and the top 5 teams are compared. The execution times were taken from the scoreboard on the released video,Footnote 22 so minor errors may be present. As shown in Fig. 25, missions 2 and 3 required the Recipient and Operator to converse about the overall mission, and most teams completed these missions within similar times.

Our team’s performance on missions 6 and 7 was on par with the top 3 teams, suggesting that our robotic avatar system, like those of the top-ranking teams, is adept at providing force feedback to the operator and accurately relocating objects to specified locations.

Nine teams attempted mission 9, and six succeeded. This mission required lifting and manipulating heavier objects than mission 6. Additionally, the task involved precisely pressing the drill button, and, given the object’s weight, the haptic feedback posed a challenge for the operator.

Only four teams completed mission 10, with varying completion times; each team prepared a different method for measuring and conveying the texture of the stone to the operator, which led to the differing execution times. NimbRo [72] used an audio sensor on the robot finger for detection and a vibrotactile actuator for feedback [76]. Team Northeastern also used an audio sensor (a microphone on the wrist) for tactile feedback [53]. Team AVATRINA used the LiDAR camera on the gripper to detect the surface [54]. Pollen Robotics likewise used an audio system for stone surface detection (Pollen Robotics has not published research findings, but the final competition videoFootnote 23 shows microphones and thin white plates attached to both grippers; in the last mission, the white plate was used to scrape the stone surface). Interestingly, not a single team used direct contact feedback through the robot hand to differentiate the roughness of the stone.

Fig. 25
figure 25

Comparison of mission execution time for top 5 teams and Team SNU. M1: Moving to commander desk. M2: Reporting to the commander. M3: Receiving and confirming missions. M4: Activating switch. M5: Traveling to the next task. M6: Identifying the heavy canister. M7: Placing canister into the slot. M8: Navigating a narrow path. M9: Using the drill to remove the door. M10: Identifying the rough textured rock and retrieving it. (Color figure online)

Table 8 Comparison of robots and operation systems participating at ANA Avatar XPRIZE

6.4 Comparison of Avatar Systems

The robotic avatar systems of the ANA Avatar XPRIZE Finals are briefly compared in Table 8. While it is possible to rank each team based on their competition performance, it is difficult to say which team’s approach was the best. Therefore, in this section, we examine the methods that the participating teams used most frequently.

The most common form of Avatar robot combined a wheeled base with dual arms: nine teams utilized a humanoid-type upper body, six teams used two manipulators, and two teams used one manipulator for manipulation.Footnote 24 For mobility, 14 teams employed wheels, two used legs, and one used a combined wheel-leg robot. Only iCub [81] and Janus [82] used bipedal locomotion. During the Avatar XPRIZE Finals, we (Team SNU) used a legged humanoid robot but relied primarily on the mobile base with the robot seated, so Team SNU falls into the category of teams that used wheels. Avatar-Hubo utilized a robot capable of transforming between a bipedal walking mode and a wheel mode [79], using the walking mode for manipulation missions and the wheel mode for mobility missions. Ten teams used robotic hands and five teams used grippers, while two teams, AVATRINA [54] and Cyberselves|Touchlabs, used both a robotic hand and a gripper.

All teams except Dragon Tree Labs and Last Mile employed a method of remotely controlling the robot based on the operator’s gestures; Dragon Tree Labs and Last Mile [80] used a joystick controller or a mouse. The other teams used different interfaces to mimic the operator’s behavior: eight teams used trackers, five used a haptic arm, five used VR controllers, and one used an exoskeleton.Footnote 25 In this context, we distinguish a haptic arm from an exoskeleton by whether the device is connected to the operator at a single point, such as the wrist, or at multiple locations across the body. The haptic feedback device developed by Team SNU was classified as a haptic arm since it connects to the operator at a single point. Additionally, it was clarified during the ANA Avatar XPRIZE Workshop that Pollen Robotics’ exo-elbow is not intended as a remote control device, but rather as a device for providing haptic feedback.Footnote 26 The consensus among numerous teams appears to favor an intuitive teleoperation interface that mimics the operator’s gestures. Among all teams, four used a single device for the teleoperation interface, while 13 used two or more devices. Twelve teams used gloves to control robotic hands or grippers, while the remaining teams operated both the robot arms and the hands or grippers through a single teleoperation interface.

The numbers of teams that used hands and feet as the interface to control robot movement were similar: nine teams used their feet (3D pedal, one-foot pedal, and trackers), and eight teams used their hands (VR controller, flight joystick, and mouse).

Fourteen teams used a VR-capable HMD as the interface to deliver telepresence to the operator, two teams used a larger monitor, and one team used a regular monitor.

6.5 Lessons Learned

The preparation for and testing at the ANA Avatar XPRIZE Finals indicated future research directions for us and the community. The lessons we learned are presented as follows:

  • Fast Networking With Low Latency: Real-time teleoperation requires fast communication between the robot and the operator. During the competition, there were communication delays at the venue on the qualification day, and some teams reported network issues, such as unexpected network drops for UNIST [55] and a network disconnection for AVATRINA [54]. Disconnections or drops in networking can cause delays between the operator and the robot, making real-time remote control and immediate feedback infeasible. Therefore, low-latency, fast networking is indispensable for a robotic avatar system.

  • Intuitive and Ergonomic Teleoperating System with Force Feedback: Using a haptic device and VIVE trackers together had the advantage that the robot closely tracked the operator’s movements, faithfully mimicking their actions while providing force feedback. However, to provide force feedback, the operator device must include actuators, which increases the inertia of the device and diminishes its ergonomic qualities. Team Northeastern also mentioned that a mismatch in inertia between the master and slave devices is unsuitable for teleoperation because it disturbs the operator [53]. We infer that this contributed to the long execution times of the drill evaluation in Sect. 5 and to our two failures of the drill mission in the ANA Avatar competition. We believe that compensating for inertia-induced unintended movements of the haptic device could prevent such failures. This can be achieved by installing F/T sensors between the operator’s hand and the haptic device, measuring unintended interaction forces through these sensors, and applying the necessary corrections (a minimal sketch of such a compensation scheme is given after this list). This approach resembles NimbRo’s method, which detects the operator’s movements using an F/T sensor attached to the operator’s arm and controls the operator arm accordingly [84]. An alternative is to separate force feedback from the operator device, employing a method akin to AVATRINA’s [54], where force information is conveyed through the visual system.

  • Visual Feedback Should Provide a Wide Field of View and Diverse Perspectives: Visual information is the most effective means for operators to comprehend the robot’s surroundings. Nonetheless, the current system’s cameras provide the operator with a narrower field of view (FoV) than natural human vision, so the operator receives less visual information through the robot’s perspective than they would with their own eyes, and this limited information can lead to mistakes. In our team’s case, as described in Sect. 6.2, the operator had difficulty visually confirming whether he had securely grasped the drill during Mission 9, and he failed to notice that the mobile platform was still moving toward the wall, so the hand holding the drill collided with the wall. As another example, the iCub robot collided with the door frame due to an operator mistake,Footnote 27 and a wider FoV would likely have reduced the chance of such errors. A system like NimbRo’s, offering six degrees of freedom to move the camera [72], or like Team Northeastern’s, providing depth information through lasers as visual data [53], could have prevented these issues. Considering this, an avatar system should offer visual feedback with a wide field of view and allow the operator to easily move the camera to see objects from various angles.

  • Difficulty of Bipedal Walking: During the competition, four teams used bipedal robots: Team SNU, Team Avatar-Hubo [79], iCub [81], and Janus [82]. However, only two teams, iCub and Janus, attempted bipedal walking in the competition. Despite the competition venue being visibly flat and suitable for bipedal robots, both teams that attempted bipedal walking faced challenges and did not achieve satisfactory results. While bipedal walking on flat ground is inherently slower than wheeled movement, in reality even slight variations in the floor make the environment uneven. Ultimately, developing robots capable of navigating 3D environments remains a challenge.

  • Need to Develop a Robot Hand that can Move and Feel Similar to a Human Hand: The missions of the ANA Avatar XPRIZE underscored the significance of developing robotic hands that can move as flexibly as a human hand and receive tactile feedback when in contact with objects. On DAY 1, during the drill mission, the operator grabbed the drill with the middle phalanx link of the robot’s index finger pressing the drill button. While the kinesthetic feedback indicated that the drill had been grasped, it failed to specify which finger pushed the drill button. Furthermore, even if the operator had detected the change in the drill’s orientation, the robot hand could not rotate the drill without placing it back on the table: although our robot hand grasped objects effectively, it could not reorient them through in-hand motion. Team UNIST also mentioned the challenge of developing a robot hand capable of moving as freely as a human hand [55]. Additionally, tactile feedback at each finger link could significantly enhance the operator’s ability to sense object roughness. In our team’s case, we used an RGB camera and a recognition algorithm to discern the roughness of stones [69], whereas the four teams that completed the mission employed sound feedback [53, 72] or LiDAR [54]. Enabling a robot hand to perceive object roughness as a human does requires tactile sensing throughout each finger link, and structural and control advancements are needed to develop a robot hand capable of free movements resembling those of a human hand.

  • Shared Autonomy Control: Many teams spent significant time attempting to grasp the drill, activate it, and unscrew the bolt, and only a few succeeded. While the advanced teleoperation systems enabled operators to control the robot step by step as desired, they posed challenges in cases requiring fine control, making precise robot manipulation more demanding. AVATRINA implemented a semi-autonomy technology, distinct from shared autonomy, and reported its advantageous impact on the operator [54]. The development of shared autonomy control, integrating manual control by a human with autonomous control by the robot, could address these challenges in remote operation: the operator’s manual control would move the robot or approach the target object, while the robot’s autonomous control would align its hand with the object or adjust the position and orientation of the held tool to match the target. Such an approach could reduce the time required for executing remote operations.
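The inertia-compensation idea raised in the force-feedback lesson above can be illustrated with the following sketch. It is a minimal example under our own assumptions, not the implemented controller: a wrist-mounted F/T sensor measures the operator’s interaction force, an admittance-style assist term drives the haptic device along that force so the operator feels less of the device’s inertia, and the clipped task force feedback from the robot is superimposed. The gains, signs, and interface are hypothetical.

```python
# Minimal sketch of F/T-based inertia compensation for a haptic device.
# Assumptions (ours, not the authors'): the F/T sensor between the operator's
# hand and the device measures the interaction force f_hand; an assist term
# proportional to f_hand reduces the apparent device inertia; the task force
# feedback from the remote robot is clipped and applied in opposition.

import numpy as np


class InertiaCompensator:
    def __init__(self, assist_gain: float = 0.8, feedback_limit: float = 13.5):
        self.assist_gain = assist_gain        # fraction of the hand force assisted
        self.feedback_limit = feedback_limit  # clipping limit for task feedback [N]

    def command_force(self, f_hand: np.ndarray, f_task: np.ndarray) -> np.ndarray:
        """Force command for the device motors: assist the operator's motion,
        then superimpose the (clipped) remote force feedback."""
        assist = self.assist_gain * f_hand            # push along the operator's effort
        norm = np.linalg.norm(f_task)
        if norm > self.feedback_limit:                # saturate the remote feedback
            f_task = f_task * (self.feedback_limit / norm)
        return assist - f_task                        # feedback opposes the motion


if __name__ == "__main__":
    comp = InertiaCompensator()
    f_hand = np.array([2.0, 0.0, -1.0])           # measured at the wrist F/T sensor [N]
    f_remote = np.array([0.0, 0.0, 20.0])         # scaled force reported by the robot [N]
    print(comp.command_force(f_hand, f_remote))   # -> [ 1.6  0.  -14.3]
```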

7 Conclusion

This paper has comprehensively described our robotic avatar system, comprising the humanoid robot TOCABI and an operator station for remote control. The system is designed to provide operators with an intuitive teleoperation experience and an immersive telepresence, and its effectiveness was validated through self-conducted evaluation tests and participation in the ANA Avatar XPRIZE Finals. It allows operators to remotely control the Avatar robot based on their own movements while receiving haptic feedback that enables them to sense the weight of objects and distinguish the roughness of surfaces.

During the ANA Avatar XPRIZE Finals, our robotic avatar system empowered operators to complete 8 out of 10 missions with just one hour of training. However, limitations were identified, notably the significant inertia of the haptic feedback device, which poses challenges for precise remote control. Additionally, although the participants demonstrated a high task completion rate in the evaluation tests, the method for discerning the roughness of stones, which was not attempted in the Finals, still presents a gap compared to human perception. Our plans involve developing an advanced, intuitive teleoperation interface with minimal inertia, informed by extensive user studies. Furthermore, ongoing research is essential to strike a balance between autonomous control that facilitates fast and precise robot manipulation, potentially surpassing human capabilities, and exact haptic and control feedback for the operator.