1 Introduction

Gaming is a ubiquitous activity that has characterized human behavior in every part of the world and in every historical period. In the last 20 years, there has been growing interest in the study of the beneficial influences of videogames on cognition and emotion [1]. Several games, often defined as “serious games”, are currently being developed with the specific purpose of positively affecting behavior and cognition [2]. While most of this interest is dedicated to cognitive development and rehabilitation contexts, there is also an emerging interest in the effects of gaming in the general adult population. In a systematic review and meta-analysis, Pallavicini and colleagues [3] found support for the hypothesis of a beneficial role of video gaming in healthy adults across multiple cognitive domains, including processing speed, response time, memory, task-switching and mental spatial rotation, as well as in emotion.

There is also evidence of positive influences of non-digital gaming on cognition in general. For example, many studies on board games like Chess, Shogi and Go indicate that they are effective in cognitive rehabilitation [4] and in improving cognitive and perceptual skills [5]. It is important to note that the distinction between videogames and in-person games is often blurred and, as a consequence, similar effects can be obtained with digital and non-digital versions of the same games. For example, most popular board and card games nowadays have digital versions available on the market. Furthermore, there is evidence that serious games have comparable effects in non-digital and digital formats (see for example [6]). Recently, we have seen the emergence of exergames, which combine the involvement of sensorimotor skills typical of physical exercise with the power of digital settings [7]. Finally, hand games have been used in many “serious” settings, such as computational thinking education [8] and early childhood cognitive development [9, 10].

A subset of non-digital games is represented by hand games. Hand games have been highly popular throughout history, perhaps because of their simplicity: they do not require any particular setting, device or apparatus to be played. The most studied hand game is Roshambo (Rock-Paper-Scissors, RPS), a non-cooperative strategic game that has been used as a non-computerized exergame for elderly adults with cognitive decline [11], to investigate cognitive strategies in schizophrenia [12], and to understand strategic interactions between healthy adults [13]. A game similar to RPS, yet more complex, is Morra, an ancient hand game still played today. In its most popular variant, two players simultaneously extend one arm in front of the opponent to show a number of fingers, while uttering a number from 2 to 10. The player who successfully guesses the total number of fingers shown by the two hands scores a point. From a cognitive point of view, Morra is a complex activity which involves, and possibly integrates, many perceptual, cognitive and motor processes. During Morra, while listening to and watching the numbers presented by the opponent, a player needs to select two numbers, one to be shown with the fingers and one to be spoken. To be successful, a player should select these numbers very carefully: the to-be-shown number should be difficult to predict, and the to-be-said number should be selected so as to match the sum resulting from the number the opponent will show. This requires memory of the numbers previously shown and said by the player and by his/her opponent. Moreover, it involves executive functions [14], to inhibit the uttered numerals, which must always be greater than the number of fingers shown, and dual-task attentional performance, to simultaneously detect and process visual (fingers) and verbal (spoken numbers) information. The task also requires the integration of the visual information, through automatic recall of the arithmetical fact [15], with the verbal information, i.e. the numbers said by both players, which are compared with the arithmetical fact to decide who scores the point. All these operations, summarized in the diagram in Fig. 1, are carried out in a very short time (more than one round per second).

Fig. 1. A speculative model of the processes involved in each round of Morra playing

The analysis of Morra can provide a new approach to studying the interaction between several cognitive functions in an ecological setting [16]. Considering its complexity, Morra is also a good candidate for inclusion in the category of serious games. The development of an artificial agent able to play Morra at different levels of expertise against human opponents provides an important tool that can serve several goals in education, rehabilitation, cognitive training of healthy adults and basic research. In this paper we focus on the development of an artificial agent able to play the Morra game against humans.

Previous studies have been published on the development of robots able to play hand games. In particular, several robots have been developed to play RPS against humans [17,18,19,20,21].

Morra and RPS are similar in many respects, as they are both zero-sum competitive games requiring the integration of sensorimotor skills, executive functions, attention and decision making. However, the two games also differ in many ways. Morra has a more complicated structure, with a much larger set of combinations to remember, the integration of two sensory modalities to receive the inputs, and more advanced defensive and attacking strategies required to master the game [16].

In recent years, Zizi developed the Morra system Gavin 1.0, an artificial agent able to autonomously and successfully play Morra against a human opponent [22]. More recently, Gavina 2121, a new implementation based on the experience acquired with Gavin 1.0, has been developed. Gavina 2121 (see Fig. 2) is able to play Morra against human players and also allows the analysis of the behavior of human players in terms of the numeric sequences they produce during games.

Fig. 2. Gavina 2121 plays against a human opponent in a public square in Bitti (Sardinia)

In each round, Gavina moves its robotic right arm synchronously with the opponent’s movement and shows a certain number of fingers. Simultaneously, Gavina, like its human counterpart, tries to guess the sum of all displayed fingers, the ones shown by the opponent and by Gavina itself. Gavina’s main objective is to defeat its opponent. For this purpose, the system tends to show random sequences of numbers while trying to detect repeated, non-random sequences in the numbers shown by its opponents. Gavina achieves this goal by using a machine learning (ML) system based on a fully automated Bayesian network, which converges to progressively more accurate predictions. The fact that Gavina systematically outperforms human competitors supports the theory that humans are poor randomizers of sequences [23].

The first version, Gavin 1.0, worked as a black box and did not allow the extraction of the strategies used to win. As Gavina 2121 has the additional aim of supporting the analysis of the numeric sequences produced by humans, we developed a hybrid system which is able to provide information on how the game estimates are produced, a characteristic of so-called expert systems. The most recent implementation of Gavina uses a nondeterministic version of the n-gram model through a Bayesian network implemented with probability hypercubes, in a way similar to a Markov model. The system consists of 5 predictors and an arbiter that decides which predictor is likely to make the most successful choice, based on measures of support and confidence. Each predictor focuses on human number sequences of a different length and calculates the probability of the repetition of patterns of 1, 2, 3, 4 and 5 numbers, respectively. For example, if the system uses sequences of 2 numbers to accurately predict the next number of a specific human player, it means that, in general, that player tends to repeat a certain pair of numbers, say a 2 followed by a 5, and that using this information is, for the system, the best way to predict the next number. The choices of the arbiter are recorded and can be used to interpret human performance.
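As an illustration of the general idea, a minimal Python sketch of a bank of n-gram predictors with an arbiter might look as follows (the class names, the scoring rule combining support and confidence, the minimum-support threshold and the random fallback are simplifying assumptions and do not reproduce Gavina’s actual code):

```python
# Sketch only: a bank of n-gram predictors plus an arbiter that picks, each round,
# the predictor with the best support and confidence for the current history.
import random
from collections import defaultdict, Counter

class NGramPredictor:
    """Predicts the opponent's next shown number from patterns of fixed length n."""
    def __init__(self, n):
        self.n = n
        self.table = defaultdict(Counter)  # last n numbers -> counts of the number that followed

    def predict(self, history):
        if len(history) < self.n:
            return None, 0, 0.0
        followers = self.table[tuple(history[-self.n:])]
        if not followers:
            return None, 0, 0.0
        guess, freq = followers.most_common(1)[0]
        support = sum(followers.values())      # how many times this context has occurred
        confidence = freq / support            # how often the guess followed the context
        return guess, support, confidence

    def update(self, history, observed):
        if len(history) >= self.n:
            self.table[tuple(history[-self.n:])][observed] += 1

class Arbiter:
    """Selects the most promising predictor each round and logs its choice."""
    def __init__(self, max_n=5, min_support=3):
        self.predictors = [NGramPredictor(n) for n in range(1, max_n + 1)]
        self.min_support = min_support
        self.choices = []                      # which pattern length was trusted each round

    def next_guess(self, history):
        candidates = []
        for p in self.predictors:
            guess, support, confidence = p.predict(history)
            if guess is not None and support >= self.min_support:
                candidates.append((confidence, support, p.n, guess))
        if not candidates:
            self.choices.append(None)
            return random.randint(1, 5)        # no reliable pattern yet: play as a randomizer
        confidence, support, n, guess = max(candidates)
        self.choices.append(n)
        return guess

    def observe(self, history, shown):
        for p in self.predictors:
            p.update(history, shown)

# Usage sketch: after each round, feed the opponent's shown number back to the arbiter.
arbiter, history = Arbiter(), []
for shown in [2, 5, 2, 5, 2, 5]:
    guess = arbiter.next_guess(history)        # predicted next number from the opponent
    arbiter.observe(history, shown)
    history.append(shown)
```

The recorded choices of the arbiter (here, the `choices` list) are what allows the interpretation of human performance, i.e. which pattern length best characterizes a given player.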

Gesture and voice recognition are very important topics in the development of multimodal interfaces [24]. A fundamental component of a hand-game robotic system able to play against human opponents is the gesture recognition module used to recognize the hand configuration of the human player in real time. This is not a simple problem, as human hands during hand games need to be tracked at high speed, with non-blurred images stable enough to allow gesture recognition. In this article we mostly focus on the sensing characteristics and on how they are integrated with the computational core and with the actuation capabilities of the robot.

Previously developed systems [25] used a high-speed vision system (500 fps) to actively track and recognize human hand gestures, processing single frames to identify fingertips located outside a predetermined circular boundary centered on the palm. More recently, a similar system used the Leap Motion device and two separate machine learning architectures to evaluate kinematic hand data on the fly, in order to recognize and segment human motion activity and to classify hand gestures [20]. However, both these implementations rely on costly capture devices and do not ensure sufficient accuracy in finger counts.

In previous implementations of Gavina, we tried different solutions to the gesture recognition problem in Morra. Our first solution used five Hall effect sensors [26], positioned on each fingertip of the human player, and a small magnet positioned in the center of the palm of the same hand. When the player extends a certain number of fingers, this creates an unambiguous pattern of magnetic activation in the Hall sensor system. Specifically, the measurement of the variations of the magnetic fields detected by each of the sensors returns the exact number played. This method, which proved extremely accurate in the experimentation phase, fails to detect the correct number of fingers when the human player bends or extends the fingers ambiguously.

Other solutions relied on systems dedicated to motion detection, such as Kinect and Leap Motion. However, such attempts suffer from several limitations due to the dynamics of the game. Specifically, players often rotate their wrist and tend not to bend or extend their fingers completely. Moreover, the presence of other hands in the frame, the extremely variable lighting conditions and the variability of hand positions in the playing space make motion detection technology unreliable in the specific context of Morra.

Considering these previous attempts, we are currently working on developing a recognition system that can unambiguously return the number of extended fingers of a human hand in real time. In this study we describe the pilot implementation of MediaPipe Hands [27] within our artificial Morra agent Gavina 2121. Our aim is to demonstrate that MediaPipe is a robust, reliable, flexible and easy-to-implement automatic gesture recognition system for Gavina 2121.

2 Methods

The hand tracking solutions previously implemented in Gavina required cumbersome apparatuses or devices. In fact, the magnetic solution described above required a glove in which to install a magnet and sensors, and the use of additional sensors placed on the forearm to signal, through the extension of the arm, the temporal proximity of a new measurement. Analogously, the use of motion detection devices required the integration of different devices that are not easy to deploy in quasi-naturalistic settings, like Morra tournaments. Therefore, we decided to test a different approach in which the hand position is captured and tracked without additional devices such as magnetic systems or motion capture devices. To accomplish this task we used the MediaPipe hand tracking framework [27]. MediaPipe Hands predicts landmarks on an image or a video sequence using a pretrained convolutional neural network (CNN) and represents its prediction by drawing the hand landmarks on the detected hand frame by frame. In detail, an image is passed through a machine learning pipeline that involves two consecutive models: the Palm Detection Model produces an initial bounding box of the palm, which becomes the input of the Hand Landmark Model. The latter model then locates the 21 hand landmarks of the detected hand(s) using a regressor to estimate their positions and draws them on the image, together with a prediction (and a label) of the detected hand(s).

On a practical level, we invoked the constructor of the Hands class, passing the parameters necessary to define the specifics of the model. As the MediaPipe reference documentation indicates, this constructor accepts five parameters. Of these five, four were crucial for our purposes during the testing phase of the algorithm: model_complexity allows opting for a more complex convolutional network structure by passing the value 1; max_num_hands sets the maximum number of hands that the framework must track; min_detection_confidence describes the minimum confidence required to detect the hand(s) in the scene; and min_tracking_confidence sets the confidence threshold for successfully tracking the hand(s). The choice of the values passed to the model was motivated by the outcomes of the preliminary runs of the algorithm; in fact, we did not observe any substantial differences in accuracy when changing these parameters.
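As an illustration, a minimal invocation of the constructor together with a frame-processing loop might look as follows (the parameter values and the display loop shown here are indicative and do not necessarily match the configuration used in Gavina 2121):

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

# Parameter values below are plausible choices, not necessarily those used in Gavina 2121.
hands = mp_hands.Hands(
    model_complexity=1,            # heavier but more accurate landmark model
    max_num_hands=1,               # track a single playing hand
    min_detection_confidence=0.5,  # minimum confidence to detect a hand in the scene
    min_tracking_confidence=0.5)   # minimum confidence to keep tracking it across frames

cap = cv2.VideoCapture(0)          # built-in webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input, whereas OpenCV captures BGR frames.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
    cv2.imshow("Gavina hand tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # ESC to quit
        break
cap.release()
hands.close()
```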

2.1 Apparatus

For our tests we used an HP ProBook 455 G2 laptop with a built-in 708879-3C2 webcam module for image acquisition. We developed our code in the Python 3.9 programming language. For image processing we used real-time image acquisition as well as Morra games pre-recorded with a Tobii Glasses 2 eye tracker, a Sony HDR-MV1 video camera and several commercially available smartphone camera models.

2.2 Procedure

Using the MediaPipe Hands framework, which provides the 3-D positions of 21 hand landmarks in real time, we developed an algorithm for finger counting. In the testing phase we assessed the reliability of automatic finger counting under different conditions, concentrating on hand position variability, counting fingers in real time, and counting fingers from recorded videos.

Hand Position Variability:

Our first step was to test whether the system could detect the position of the fingers of a human hand both when the hand is in a prone position and when it is in a supine position. This test was necessary because, in actual Morra games, players often alternate supine and prone hand positions when extending their fingers in front of their opponent.

Counting Fingers in Real Time:

The second step was to let the system count fingers while random numbers of fingers were displayed in real time, at an approximate frequency of one number per second. This test was necessary to simulate the way Gavina detects and recognizes the number of extended fingers in a Morra game against a human opponent. Specifically, to test whether a finger is extended, the system compares the Y coordinates of the distal phalanx (the tip of the finger) and the proximal phalanx (the one connected to the metacarpus), taken from the hand landmarks received from MediaPipe. If the Y coordinate of the distal phalanx is greater than the Y coordinate of the proximal phalanx, the system assigns the status “extended” to the finger under analysis. Finally, the system counts how many “extended” statuses are present, determining the number presented by the human player.
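A minimal sketch of this counting rule, assuming MediaPipe’s landmark naming and normalized image coordinates, could be the following (the treatment of the thumb and the direction of the Y comparison depend on hand orientation and are assumptions here):

```python
# Sketch of the finger-counting rule described above. MediaPipe's normalized image
# coordinates grow downward, so the direction of the tip/base comparison (and the
# handling of the thumb) depends on the hand's orientation in front of the camera.
import mediapipe as mp

HL = mp.solutions.hands.HandLandmark
FINGERS = [                          # (fingertip landmark, proximal landmark) per finger
    (HL.THUMB_TIP, HL.THUMB_MCP),
    (HL.INDEX_FINGER_TIP, HL.INDEX_FINGER_MCP),
    (HL.MIDDLE_FINGER_TIP, HL.MIDDLE_FINGER_MCP),
    (HL.RING_FINGER_TIP, HL.RING_FINGER_MCP),
    (HL.PINKY_TIP, HL.PINKY_MCP),
]

def count_extended_fingers(hand_landmarks, tip_below_base=True):
    """Count fingers whose tip lies past the proximal phalanx along the image Y axis."""
    lm = hand_landmarks.landmark
    count = 0
    for tip, base in FINGERS:
        if tip_below_base:
            extended = lm[tip].y > lm[base].y   # rule as stated in the text
        else:
            extended = lm[tip].y < lm[base].y   # flipped for the opposite hand orientation
        count += extended
    return count
```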

Counting Fingers in Images from Pre-recorded Videos:

Finally, we tested whether the system was able to detect and recognize the number of extended fingers in pre-recorded videos of Morra games. This step was important to assess the possibility of automatically tabulating data from actual Morra games between two or four human contenders.

3 Results

Hand Position Variability:

We tested the ability of the algorithm to detect the fingers’ position in prone and supine hand positions (Fig. 3). Since in the first tests the software was unable to detect the numbers shown by prone hands, we modified the original function and split the supine/prone hand cases by considering the landmarks of the wrist and of the base of the middle finger and comparing their Y coordinates. MediaPipe demonstrated an accuracy of 95.7% in palm position detection. Indeed, in our final test, results indicate that our apparatus is able to correctly detect palm position and fingers with high accuracy, which is reflected in high finger-count scores (see the next paragraph for statistics).
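A minimal sketch of this orientation check, building on the count_extended_fingers() function sketched in the Methods section, could be the following (which orientation corresponds to a prone vs. a supine hand, and how the counting rule is flipped accordingly, are assumptions that depend on camera placement):

```python
# Sketch of the supine/prone split described above: compare the Y coordinates of the
# wrist and the base of the middle finger. Mapping the comparison to "prone" vs.
# "supine", and the corresponding flip of the counting rule, are assumptions here.
import mediapipe as mp

HL = mp.solutions.hands.HandLandmark

def hand_is_prone(hand_landmarks):
    """Return True when the wrist lies above the base of the middle finger in the image."""
    lm = hand_landmarks.landmark
    return lm[HL.WRIST].y < lm[HL.MIDDLE_FINGER_MCP].y

def count_fingers(hand_landmarks):
    # Flip the tip/base comparison according to the detected hand orientation,
    # reusing count_extended_fingers() from the earlier sketch.
    return count_extended_fingers(hand_landmarks,
                                  tip_below_base=hand_is_prone(hand_landmarks))
```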

Fig. 3. The system shows high accuracy in counting the number of extended fingers in real time from hands in both supine and prone positions

Counting Fingers in Real Time:

Our system counts the number of fingers by comparing the Y coordinates of the distal and proximal phalanges of each of the five fingers and then counting the number of extended vs. non-extended statuses. Our test indicates that, over a total of 30 trials for prone and supine hands, the algorithm achieved 86% and 93% accuracy, respectively.

Counting Fingers in Images from Pre-recorded Videos:

The quality and the recording modality of the Morra game videos were highly variable. Specifically, the games were recorded with smartphones, video cameras and a mobile eye tracking system, from a sample of college students at Lawrence Technological University. Moreover, Morra games in an ecological setting have the two hands of the opponents in close spatial proximity to one another. This makes it very hard for an autonomous recognition system to distinguish, select and process the two hands separately. During our tests, both the variability of the videos and the simultaneous presence of two hands in the same frame made automatic finger counting very challenging for the system. The main issues were unsteady camera recording, broad scene focus and poor framing angles. For these reasons, we split the videos in order to test the model on consistent game sequences in which the frames showed a clear choice by the player. By reducing the exposure to these issues, we were able to investigate how future recordings can be improved to shield the system from possible distractors and to isolate recordings containing only frames of hands involved in the game (see Fig. 4).
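As an illustration, per-frame processing of a pre-recorded clip with two hands in the frame might be organized as follows (the file name, the confidence values and the left-to-right assignment of hands to players are hypothetical, and count_extended_fingers() refers to the sketch in the Methods section):

```python
# Sketch of per-frame processing of a pre-recorded Morra clip containing two hands.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

with mp_hands.Hands(max_num_hands=2,
                    min_detection_confidence=0.6,
                    min_tracking_confidence=0.6) as hands:
    cap = cv2.VideoCapture("morra_game_clip.mp4")   # hypothetical pre-recorded clip
    per_frame_counts = []
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            # Sort detected hands left-to-right so each one can be attributed to a player.
            detected = sorted(results.multi_hand_landmarks,
                              key=lambda h: h.landmark[mp_hands.HandLandmark.WRIST].x)
            per_frame_counts.append([count_extended_fingers(h) for h in detected])
    cap.release()
```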

Fig. 4. Simultaneous detection and tracking of landmarks from two hands in a pre-recorded video

4 Discussion

In this study we tested the robustness, reliability, flexibility and ease of implementation of MediaPipe Hands [27] as an automatic gesture recognition system for our artificial Morra agent Gavina 2121. Specifically, we assessed the accuracy of the automatic recognition of the number of extended fingers of a human hand by MediaPipe in different settings: prone and supine hands, real-time camera acquisition, and pre-recorded videos.

Our results indicate that MediaPipe is able to count the number of extended fingers of a human hand with good precision in both supine and prone hand positions, making it a good candidate for implementation in Gavina 2121. However, recognition accuracy is reduced when the system detects finger positions from pre-recorded videos. In this case, the presence of two hands in the same frame, the variability of the quality of the videos and the ever-changing dynamics of the motor behavior of human players make automatic gesture recognition very challenging.

Reliable gesture recognition is of vital importance for our Morra study, as it is applicable to several research contexts and experimental paradigms. For example, playing Morra against an artificial agent allows the creation of a flexible training environment in which the user can select different levels of difficulty to customize the robot’s skills. Also, accurate gesture recognition enables telemorra, in which two human opponents play Morra against each other online in a virtual setting, with Gavina acting as point counter and referee. Telemorra can be applied to pedagogical and research contexts, especially in cognitive development and rehabilitation settings. Our Morra agent can be flexibly employed in many contexts, from schools and rehabilitation centers to experimental psychology and cognitive neuroscience laboratories.

Like other serious games [4, 5, 28], Morra has the potential to positively affect cognition and to improve cognitive and perceptual skills. Moreover, with its involvement of many perceptual, cognitive and motor processes, it is an ideal tool for testing several cognitive processes, as well as their development and rehabilitation [16]. An artificial Morra player can considerably expand the numerous educational, rehabilitation and research applications of Morra.

Several steps need to be taken in order to use Gavina to the best of its computational capability, including increasing the accuracy of finger counting and integrating speech recognition software within the system to automatically recognize the spoken numbers. Also, a virtual reality rendition of Gavina would allow a simpler implementation and easier reproducibility than the physical robotic agent.