
1 Introduction

Published in Mind in 1950, Alan Turing's seminal paper raised questions about the intelligence of machines, the most famous being "Can machines think?". This inspired many researchers to start examining the potential of human–computer interaction (HCI), leading to a point where technology has become actively involved in the communication process.

To decipher and interpret the features of, and the boundaries between, humans and technology, a first step is to compare human–human interaction with interaction between humans and machines. Research on human–human communication can reveal the most useful information for enhancing the HCI field and is therefore a natural starting point. It has already been shown that people are more willing to discuss and even disclose private information when computers follow and present human-based conversation rules (Nass and Moon 2000).

Baylor (2011) identified the three main factors that characterize a natural social interaction between a human and an agent. An agent, following Ferber's definition, is "a physical or virtual entity that can act, perceive its environment (in a partial way) and communicate with others, is autonomous and has skills to achieve its goals and tendencies" (Ferber 1999). According to Baylor's research, social interaction is shaped by the appearance of the agent (i.e. cartoon-like or realistic figures), its communication features, such as gestures or facial expressions, and the content of the dialogue. This research builds on Bandura's social cognitive learning theory, which holds that people learn behaviors and norms by imitating other people who react in the same way. To foster this imitation, researchers are trying to create more realistic avatars or robots that facilitate human interaction with technology, where realism rests on both human-like responses and reactions and human-like appearance.

2 Human Perception During HCI

2.1 The Role of Human Likeness

Several hypotheses have been tested regarding human likeness. The most commonly used are the uncanny valley, the atypical feature, the category conflict, and the similarity hypotheses. The first, the uncanny valley hypothesis (UVH), described by Professor Mori, suggests that a character that merely resembles a human, without being one, creates awkward feelings in human observers (Mori 1970). The higher the human likeness, the stronger the sensation of eeriness. Several proponents of this hypothesis argue that an agent, avatar, or robot is better designed as cartoon-like rather than realistically human in appearance in order to be preferred by humans (Baylor 2011). Research on virtual representation has shown that too much anthropomorphism can lead to negative effects, less trust, and discomfort (Nowak 2004). Recently, Stein and Ohler (2017) supported an extension of this theory, the "uncanny valley of the mind", arguing that the human-like behavior of the agent, "behavioral anthropomorphism" as they call it, can also cause negative reactions. Some researchers further suggest that it is not only a high degree of human-like appearance that can trigger this effect but also a possible mismatch between form and behavior (Nowak and Fox 2018). However, Mori et al. (2012) noted that if the agent is designed so that it is hardly distinguishable from a real person, the valence becomes positive again. The morphology of an agent, in line with the uncanny valley hypothesis, may indeed influence the perception and behavior of a person during an interaction, but the degree depends on the task; people prefer a more human-like morphology for social roles or real-time interaction, for example (Edwards et al. 2019). An evaluation of the UVH was conducted by Lupkowski and Gierszewska in their recent work, where they used 12 computer-rendered humanoid models to test human perception and the UV effect (Lupkowski and Gierszewska 2019). For this purpose, they used a subscale of the NARS questionnaire regarding human traits. The main points of their research are, first, that the highest comfort level was observed for a cartoon-based character and, second, that a person's belief in human uniqueness can directly affect his/her attitude toward an agent; the higher the belief, the more nervous the person is toward the agent.

The atypical feature hypothesis holds that atypical features of the stimulus may influence perception (Borst and Gelder 2015). Burleigh et al. noticed that eye size constitutes such a feature. Moreover, they found that whenever human likeness was high, eeriness was low (a linear relationship) (Burleigh et al. 2013). Third, the category conflict hypothesis (Borst and Gelder 2015) suggests that "when human likeness of the stimulus is comprised of a morph between two categories, the stimuli in the middle of this scale are perceived as ambiguous, leading to a negative effect". Yamada et al. also tested this hypothesis, concluding that the most ambiguous images correspond to increased processing time (Yamada et al. 2013). Lastly, the similarity hypothesis by Rosenberg-Kima et al. (2010) predicts that gender similarity (male or female) and the attractiveness of an agent have a more positive effect on motivational outcomes. This hypothesis was confirmed by Shiban et al. (2015), who used a young female agent and an older male one to test the effects on performance and motivation in a learning process (Fig. 13.1).

Fig. 13.1

The uncanny valley as described by Mori et al. (2012), depicting the relationship between natural resemblance and the affinity for it. The dotted line represents the effect of the presence of movement
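For readers without access to the printed figure, the short sketch below reproduces the qualitative shape described above: affinity rising with human likeness, dipping sharply near high but imperfect likeness, and recovering for near-perfect likeness, with movement amplifying the effect. The control points are made-up illustrative values, not data from Mori's work.

```python
# Illustrative sketch of the uncanny valley curve described by Mori et al. (2012).
# The control points below are invented values chosen only to reproduce the
# qualitative shape (rise, sharp dip near high human likeness, recovery); they
# are not measurements from Mori's work.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import PchipInterpolator

likeness = np.array([0.0, 0.3, 0.6, 0.75, 0.85, 0.95, 1.0])
affinity_still = np.array([0.0, 0.3, 0.6, 0.2, -0.4, 0.3, 0.8])
affinity_moving = np.array([0.0, 0.45, 0.8, 0.3, -0.7, 0.4, 1.0])  # movement amplifies the effect

x = np.linspace(0, 1, 300)
plt.plot(x, PchipInterpolator(likeness, affinity_still)(x), label="still")
plt.plot(x, PchipInterpolator(likeness, affinity_moving)(x), "--", label="moving")
plt.axhline(0.0, color="gray", linewidth=0.5)   # zero-affinity reference line
plt.xlabel("Human likeness")
plt.ylabel("Affinity")
plt.title("Stylized uncanny valley (illustrative values)")
plt.legend()
plt.show()
```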

However, the big question here is whether it is the appearance of the agent that influences the perception and performance of the user, or whether it is also its behavior in combination with the contextual environment. Are there measurable benefits for the user, and can we reach a level where a virtual avatar or a robot truly simulates human behavior, so that we can compare the different cases and draw conclusions about the usability of such agents?

Nevertheless, the majority of studies that have examined the role of human likeness have been based on questionnaires suited to such purposes. As Kätsyri et al. also mentioned, these kinds of studies cannot easily clear up the existing vagueness in this field, so psychophysiological studies are needed (Kätsyri et al. 2015). Ratajczyk et al. continued the work of Lupkowski and Gierszewska mentioned above, using electrodermal activity (EDA) and response time measurements to evaluate the UV effect and human perception toward the same 12 characters, also assessing the role of their environment (background). Another interesting recent example is that of Ciechanowski et al., who used facial electromyography (EMG), a respirometer, electrocardiography, and EDA to examine the human-nonhuman interaction process between a human and a chatbot (Ciechanowski et al. 2019).

2.2 The Role of Embodiment and Presence

Intelligent systems have two critical features that can affect humans' perception during HCI: embodiment and presence. Embodiment was defined by Pfeifer and Scheier (1999) as a term which refers to the fact that "intelligence cannot merely exist in the form of an abstract algorithm but requires a physical instantiation, a body". The level of embodiment depends on the nature of the agent (physical, virtual, or even a combination of both), its morphology (i.e. human-like or cartoon-based), as well as the modalities it can support and the extent to which these modalities can be carried out (Li 2015). Other variables, such as gestures, speech speed, and haptic stimuli, may also be considered aspects of an agent's embodiment.

Whereas embodiment concerns the agent and its relationship with its environment, presence deals with the way the agent is presented to others. Milgram et al. (1995) defined physical and digital presence as a matter of whether the embodied agent can be touched, saying specifically "whether primary world objects are viewed directly or by means of some electronic synthesis process". Zhao categorized physical and digital presence as copresence and telepresence, respectively (Zhao 2003). Copresence, as a term in a sociological framework, describes the conditions under which humans interact with each other (Zhao 2003). Under the umbrella of HCI and HRI, copresence refers to how the agent is displayed to the user. Zhao (2003) used two dimensions to describe copresence. The first refers to "the mode of being with others" and concerns features that can physically shape a human interaction, whereas the second refers to "the sense of being with others", linked to the feeling and subjectivity of the user (Zhao 2003). We need, though, to differentiate physical embodiment from physical presence (copresence), as an agent that has physical embodiment may not be presented to the user with a physical morphology (Li 2015). Several researchers have tried to evaluate the role and influence of presence and embodiment in virtual environments or in robotics (Li 2015; Lee et al. 2006; Mollahosseini et al. 2018).

So here we can pose the first question, regarding the effect of physical presence: do people react differently when interacting with a copresent agent (robot) compared to a telepresent one?

Research to date has shown that psychological responses differ between these two situations for a variety of reasons. One reason is the size of the agent and, consequently, the influence it can have (Hoffman and Krämer 2011). Robots that are physically present are usually larger than a virtual agent displayed on a screen. As Huang et al. have mentioned, taller individuals tend to exert a greater social influence (Huang et al. 2002), and thus the larger size of the physical robot may be more imposing, having a stronger impact.

Distance is one of the main aspects of presence, as Zhao (2003) also noted, and can be divided into physical and electronic proximity. This leads to the second reason, the physical distance between the user and the agent, as physical proximity naturally has different effects than electronic proximity (Shinozawa et al. 2005). Moreover, interaction with a physical agent allows a better understanding of its morphology and motion, creating a more familiar environment for the user. In general, it has been shown that physical presence can improve the user's behavior as well as increase the level of enjoyment and trust (Li 2015).

When appearance is held constant, a recent survey showed that 79% of the studies to date favored a copresent robot over a telepresent one (Li 2015).

The next question follows from the previous one and examines the effect of physical embodiment: do people react differently when interacting with a physical agent compared to a virtual one?

One reason why embodiment may affect the psychological processing of the user is the degree of realism (Hoffman and Krämer 2011). Han et al. compared, with the use of functional Magnetic Resonance Imaging (fMRI), real and virtual visual worlds through the observation of movie or cartoon clips, aiming to provide information on how we perceive characters in real and virtual worlds (Han et al. 2005). They concluded that the perception of real-world characters engages the medial prefrontal cortex (MPFC) and the cerebellum, which support the online representation of, and empathy with, the mental states of others, whereas cartoon clips of human and non-human agents activated the superior parietal lobes, which are associated with attention to actions (Han et al. 2005). The cartoon-based clips also engaged the occipital area of the brain, which is linked with the visual attention mechanism. The latter was also shown in the study of Baka et al. (2018) where, using electroencephalography (EEG), they compared a physical environment, a virtual environment identical to the physical one, and a virtual cartoon-based environment, and showed that the occipital area reacts differently only in the cartoon-based environment, synchronizing in an alpha state (8–12 Hz). The alpha activity, in that case, is associated with the recruitment of visual attention mechanisms.
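To make the notion of "alpha-state synchronization" more concrete, the snippet below shows one common way to quantify alpha-band (8–12 Hz) activity in a single EEG channel, using Welch's power spectral density estimate. This is only an illustrative sketch with synthetic data and an assumed sampling rate; it is not the analysis pipeline used by Baka et al. (2018).

```python
# Illustrative sketch: estimating alpha-band (8-12 Hz) power from one EEG channel.
# The sampling rate, synthetic signal, and band edges are assumptions for
# demonstration only; this is not the pipeline of Baka et al. (2018).
import numpy as np
from scipy.signal import welch

FS = 256              # assumed sampling rate in Hz
ALPHA = (8.0, 12.0)   # alpha band in Hz

def alpha_band_power(eeg_channel: np.ndarray, fs: int = FS) -> float:
    """Return absolute alpha-band power estimated from Welch's periodogram."""
    freqs, psd = welch(eeg_channel, fs=fs, nperseg=fs * 2)
    band = (freqs >= ALPHA[0]) & (freqs <= ALPHA[1])
    return float(np.trapz(psd[band], freqs[band]))

# Example with synthetic data: a 10 Hz oscillation buried in noise,
# standing in for an occipital channel.
t = np.arange(0, 10, 1 / FS)
synthetic_occipital = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
print(f"Alpha power: {alpha_band_power(synthetic_occipital):.3f}")
```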

Studies that have examined the influence of physical embodiment separately from physical presence, comparing telepresent robots to virtual avatars, have reported no significant differences (Li 2015).

However, what if physical embodiment and physical presence are combined? The majority of studies support that people prefer the physical presence of a robot over a virtual avatar (Li 2015), with significant effects on several behavioral responses such as performance, attention (Looije et al. 2012), and response speed (Jost et al. 2012). Gesturing, however, has been shown to play an important role in people's responses during HCI. Thus, to qualify the above, people prefer copresent agents over telepresent robots or virtual agents, but only when the agents use gestures during their interaction (Hasegawa et al. 2010).

In general, Jamy Li showed through his survey that physical presence plays a greater role in psychological responses to an agent than physical embodiment. So it seems that, regardless of the nature of the embodiment (virtual or physical), which is a feature of the character, it is presence that can directly influence people's responses and behavior (Li 2015). That is, what matters is how the agent is presented to the user and, ultimately, how the embodiment allows that. However, a limitation of this field is that no studies have used avatars of high-level naturalism, which would reduce the confounding effect of human appearance. Moreover, the exploration of how additional variables such as gesturing or voice features influence humans' responses is also clearly missing from the literature to date.

2.3 Other Features that Can Influence Human Perception During HCI

Another important feature that has been tested in such interactions is the role of eye gaze. Eye gaze is one of the most important features of human behavior during a social interaction, as it can serve several purposes and functions, such as enhancing attention, revealing emotional information, and preserving engagement. It has also been shown that physical presence plays a greater role in the perception of gaze than physical embodiment, and thus a robot's eye gaze can be perceived more accurately than that of a virtual agent (Mollahosseini et al. 2018).

Studies examining whether the behavioral and emotional intentions of another person can be predicted through facial expressions began around 1973 (Ekman 1973). It has been shown that observers tend to activate facial muscle activity similar to the speaker's intended facial expressions (Kunecke et al. 2014). This reaction has been characterized as a rapid facial reaction (RFR) (Moody), an affective reaction that can occur automatically after stimulus presentation. The majority of such studies have used a congruent direction of gaze and body, with an eye-tracking system for the gaze, EMG for the facial muscles, and a questionnaire for self-assessment (Schrammel et al. 2009). However, the importance of body direction started to raise questions, and more recent studies (Marschner et al. 2015; Kluttz et al. 2009) examined the influence of differences between body and gaze direction. Thus, although gaze has been shown to be one of the major indicators of socio-communicative dimensions, it modulates emotional experience and attention only when combined and congruent with body orientation.

Humans can express different kinds of emotions when interacting with different types of agents under the same circumstances. A priori, communication between human beings has been guided and facilitated by the existence of emotions. Emotions, as an inherent internal process, are the mirror of what we feel, allowing us to perceive and understand our environment, including ourselves. It has been shown that people experience more positive emotions when interacting with a virtual agent that provides positive rather than negative feedback (Pour et al. 2010). Mollahosseini et al. (2018) studied people's perception of the facial expressions of a virtual agent, a copresent retro-projected robot, a telepresent robot, and a video recording of a human, and found that emotion recognition rates differed among the agent conditions. In other words, humans perceived, and consequently expressed, emotions differently depending on the nature of the agent (Mollahosseini et al. 2018). Lazzeri et al. (2015) also showed that emotions expressed through facial expressions can be better perceived on a robotic agent than on a virtual one. In contrast, virtual agents seem to be more effective for visual speech, since computer graphics can provide better accuracy in realism and animation (Mollahosseini et al. 2018).

3 Human–Robot Interaction (HRI)

There has been a lot of research trying to decipher human behavior and perception when interacting with robots instead of other humans. There are even movies that depict such interactions and, even if we consider them science fiction, we are at a point where people have started meeting and communicating with social robots in real-life contexts, in personal or professional roles (Edwards et al. 2019).

It has already been shown that people's first reaction to an initial communication with a social robot is a feeling of uncertainty and decreased anticipation (Edwards et al. 2019). However, Edwards et al. suggested that this behavior results from the deviation from the familiar social communication pattern, which alters the "script" and the expectations formed during human–human interaction. Humans unconsciously follow a script when interacting with each other, adapted to various social situations. One of the roles of HRI research, then, is to decode these scripts and allow similar behaviors to take place during human–robot interaction.

Communication has been described by Kellerman (1992) as a "heavily-scripted procedure". In the framework of this procedure, humans are used to interacting with other humans, creating an anthropocentric expectancy in communication. Despite these expectations, however, it has been argued that people tend to treat computers and other socially intelligent technologies as if they were people, applying social scripts similar to those used during human–human interaction. Reeves and Nass (1996) first illustrated this view with their Computers Are Social Actors (CASA) paradigm, showing that people mindlessly relate to machines and apply social rules as if they were indeed real people, even when aware of their incapability to embody emotions and intentions. Reeves and Nass, in the same study, also suggested that people treat televisions like real people. This was confirmed by Nass and Moon (2000), who examined users' responses to different kinds of televisions and concluded that humans perceive them as social actors as well. In general, Nass and Moon argued that people tend to focus on social cues, even if they are few, bypassing the asocial features of the entities. CASA has already been applied in several studies in a broader field of research, including AI and social robots. Recently, Mou and Xu (2017) compared initial human-AI social interaction with interaction between humans in terms of personality traits and communication attributes. They argue that their outcome complements the CASA paradigm, as they found that people can change their behavior toward social actors if they are aware that they will interact with an AI.

Edwards et al. (2019) showed that human-like morphology can satisfy this anthropocentric expectancy during an interaction. They also confirmed the hyperpersonal model, introduced by Walther (1996), according to which computer-mediated communication can sometimes surpass face-to-face interaction in terms of intimacy and liking. Thus, they concluded that, depending on the context of the discussion, an interaction with a robot can increase the attribution of social presence and decrease the degree of uncertainty.

While the boundary between human–computer and human–human interaction is described by the CASA concept, social psychologists maintain doubts regarding the psychological invariance that characterizes a person across different situations. This is the so-called personality paradox or consistency paradox, describing the fact that a person can present different personality traits and behaviors under different circumstances. Attempting to resolve this paradox, Mischel and Shoda developed the Cognitive-Affective Processing System (CAPS) (Mischel and Shoda 1995). Mischel wanted psychologists to think like mechanics and evaluate people's responses according to particular conditions. According to this model, the personality system encompasses mental representations consisting of various cognitive-affective units (CAUs) that include a person's goals, beliefs, values, affective responses, and memories (Mischel 2004). Different CAUs can be activated under different conditions and in different contexts, shaping the behavior of the individual accordingly. Consequently, when interacting with a machine, some people may feel more confident during the process, whereas others may feel confused and frightened. Therefore, based on the CAPS model, humans' behavior and reactions when interacting with a machine should differ from those presented when communicating with another human (Mou and Xu 2017).

However, it has been shown that putting robots in an anthropomorphic framework, by giving them a personal name and even a story to follow, can affect humans' behavior and reactions toward them (Edwards et al. 2019).

3.1 Social Robots and Their Features

The idea of robots, as mechanical agents serving specific purposes, dates back a very long time, appearing even in Greek mythology. However, robots with natural language features, able to participate in a conversation, appeared in the 1990s, with examples such as MAIA (Antoniol et al. 1993) and RHINO (Burgard et al. 1998). These robots were developed to cover a specific range of applications and consequently had limitations, such as limited nonverbal communication, difficulty in perceiving human speech, and a specific pre-defined range of responses (Mavridis 2015). All these restraints of the 1990s became the inspiration for subsequent research trying to understand and enhance the features of human–machine interaction.

Robots have been tested in several roles serving various applications where verbal and nonverbal communication are needed, such as assistance and companionship (Wada and Shibata 2007; Dautenhahn et al. 2006), receptionists (Makatchev et al. 2010), education (Li et al. 2016; Kanda et al. 2009), museum robots and tour guides (Yamazaki et al. 2012; Evers et al. 2014), or even art, as musicians (Petersen et al. 2010) and dancers (Kosuge et al. 2003). In all these applications, the main goal is fluid communication between the human and the machine, for any verbal or nonverbal feature. To achieve this, researchers had to address limitations such as breaking the "simple command only" barrier, coordinating motion and nonverbal communication, affective interaction, multiple speech acts, and mixed-initiative dialogue (Mavridis 2015). In contrast, these kinds of restraints had already been addressed in the virtual world since the early seventies, with Winograd's SHRDLU program, which could support different speech acts and basic mixed-initiative dialogue (Winograd 1972). Without the need for a physical entity, the virtual world could be developed faster and in a different way than the area of robotics. We can assume that this is why people are more used to this technology and also express a higher preference toward it. Nevertheless, the main difference between robots and virtual agents remains the physical embodiment.

Birmingham et al. (2020) examined a new role for robots, as a mediator in a multi-party support group. The role of the robot was to motivate people to speak to each other and overcome their stress by increasing the sense of trust. Participants, however, declared at the end that the robot made the discussion mechanical, lacking real flow, and they pointed to the specific features of the robot responsible for that. As the authors used a Nao robot, the participants clearly noticed the lack of humanity, first of all in its facial expressions. Thus, in line with other studies, facial expressions play a crucial role in an efficient interaction. For example, Zawieska et al. (2012) highlighted the importance of facial expressions, as the majority of their participants attributed the intelligent behavior of the robot used in their experiment to its facial expressions. Moreover, Birmingham et al. found that the robot's voice did not sound natural and, consequently, non-native speakers had difficulty understanding it.

One of the most important issues that has been addressed by both worlds is the affective/emotional aspect. Affect during human interaction plays a crucial role, as it is directly associated with learning processes, persuasion, and empathy (Mavridis 2015). Pioneering work in this domain was done on virtual avatars such as Steve (Johnson et al. 2000) or Greta (Rosis et al. 2003), which became the inspiration for Cynthia Breazeal to develop the Kismet robot, an expressive mechanomorphic robot head with perceptual and motor modalities that can support multiple facial expressions (Breazeal and Velásquez 1998; Breazeal 2003).

The second most important feature is the coordination of motor actions and nonverbal communication. When interacting with each other, people use several kinds of motor actions: head nods, hand gestures, gaze movements, and of course lip-syncing (Mavridis 2015). It has also been stated that humans use lip information to better perceive communication, the so-called McGurk effect (Mavridis 2015). Thus, to support even a basic level of naturalness during an interaction, agents should be able to use some of these features to accompany their sound production.

Social robotics is a rapidly growing field aiming to develop robots capable of socio-emotional interaction and communication with humans, serving several domains such as education, health, and entertainment (Mollahosseini et al. 2018). Researchers and recent technologies are trying to determine which choice, robots or virtual agents, is best suited to the needs of social interaction (Table 13.1).

Table 13.1 Examples of social robots from several domains in chronological order

4 Humans and Virtual Avatars

Over the last years, the use of a Virtual Agent (VA) has become known for its effectiveness compared to the use of a real human, boosting users' motivation and even performance. Still, the question of whether to implement a virtual agent or a robot is under a lot of investigation and is considered to depend entirely on the requirements of the task to be performed. The main advantages of a VA, as stated so far, are its overall low cost of use, the ease and flexibility of its use, as it can be employed anytime and from anywhere, the ability to change its appearance dynamically at any time, as well as further possibilities it can offer, such as collecting and examining real-time physiological data or facial and movement expressions (Li 2015; Yokotani et al. 2018). For a better understanding of this comparison (virtual vs. real human), we need to mention two relatively recent terms. The first, agency belief, refers to people's reaction toward VAs and specifically to the extent to which they believe that a VA represents a real human (Lucas et al. 2014), whereas the second, behavioral realism, concerns the degree to which a VA can really behave like a real human (Blascovich 2002).

It has been shown that different levels of agency belief and behavioral realism serve different purposes. For example, a VA's low behavioral realism is considered suitable for interview settings (Rizzo et al. 2016). Specifically, voice-only interviews have proved more effective than face-to-face ones, helping participants feel more comfortable speaking, with a higher level of self-disclosure (Bailenson et al. 2006). Moreover, participants' low agency belief also seems to be more effective in such cases (Lucas et al. 2014). On the other hand, Baylor and Kim (2009) showed that a physically present agent can produce better motivational results than a voice or a text box in learning contexts. Several studies support that embodied talking agents can enhance user engagement (Mollahosseini et al. 2018). However, Mayer and DaPra showed that an agent has a positive effect on a user only when its voice is human-like and not a machine voice (Mayer and DaPra 2012).

4.1 Conceptualization and Perception of Avatars

Virtual reality environments, with their virtual characters, can offer opportunities and enable manipulations that may be difficult, or even impossible, in a natural environment. In these environments, users can control, embody, and interact through avatars in several contexts, shaping the field of computer-mediated communication (Nowak and Fox 2018). The use of an avatar in this kind of communication plays a crucial role, as avatars can be used as a means of influence in a variety of contexts such as health communication, interpersonal communication, nonverbal communication, and advertising (Nowak and Fox 2018). An avatar can also support more complex behaviors and actions, enhancing nonverbal communication through gestures or body movements.

Every avatar has its own characteristics, which can include, for example, appearance, behaviors, or abilities, and can be specified based on several factors such as the users' preferences, their previous experience in such environments, and the technological capabilities of the system. However, as Nowak and Fox (2018) mentioned, the term "avatar" is used by many researchers without being properly defined, sometimes causing misinterpretations within the relevant studies.

The word "avatar" derives from Hinduism and specifically from the Sanskrit word for "descent" (Nowak and Fox 2018). In this concept, an avatar is the incarnation of a deity on earth, able to experience the human aspects of existence. Nowadays, and for more than twenty years, avatars have been acknowledged as digital representations. The term became popular mainly through the novel of Neal Stephenson (1992), who used it repeatedly to refer to characters in digital environments (Nowak 2004). Following that, many researchers offered definitions of the term, trying to capture features such as appearance, abilities, degree of realism, or anthropomorphism. Thus, some definitions include terms like "cartoon-based" or "two-dimensional", but these continue to evolve as technologies advance. We often hear terms such as "embodied avatar", "virtual human", or "agent". In every case, two main purposes are served: the avatar can represent the user in a computer-mediated environment, and it can provide the experience of interacting with the environment or with another user. The most recent definition is that of Nowak and Fox (2018), where "an avatar is a digital representation of a human user that facilitates interaction with other users, entities, or the environment". They chose a broad definition that can act as an umbrella, independent of any specifications or characteristics.

The characteristics of an avatar can directly influence the user's perception. For example, based on information processing theory, people are more easily affected by, and pay more attention to, sources characterized by dynamism (McGuire 1985). Aspects that can influence a person's perception of an agent in a virtual environment can be technical, such as anthropomorphism or realism, or more social, such as gender, age, and ethnicity.

More specifically, anthropomorphism involves the perception of any human trait or quality, such as emotions, behavior, or cognition, in a human or non-human entity. It can be increased mainly by the image of the avatar as well as its behavior (Nowak and Fox 2018). There are many studies on how anthropomorphic representations influence communication, showing that a higher level of anthropomorphism can lead to a more natural and persuasive (Heyselaar et al. 2017) and more attractive (Gong 2008) interaction, with an increased level of social presence and engagement (Kang and Watt 2013). Furthermore, realism is the perception of how realistic a situation or an object is, and it is often confused with anthropomorphism. In the context of realism, an avatar can be judged based on its appearance, its rendering, and the naturalness and fluidity of its movements and way of speaking.

On the other hand, given that avatars are perceived as social entities according to CASA (Reeves and Nass 1996), there are also social factors that can influence users' perception. First of all, the most common categorization humans make is the determination of gender. As Lakoff (1987) noted, people tend to attribute a gender to others even when physical or biological information is not available, probably as an instinctual procedure through which they believe they can understand others or predict behaviors. Studies have shown that, in specific virtual contexts, gender plays a role in people's reactions. For example, children prefer a male voice when the topic is football and a female voice for princesses or make-up (Lee et al. 2007), whereas adults prefer a young female avatar over an older male one for educational purposes (Shiban et al. 2015). Moreover, people often try to decipher the ethnicity of a person, believing that they can then predict his/her behavior (Nowak and Fox 2018). A study by Eastwick and Gardner (2009), among others, showed that people were influenced by the presence of black and white people in a virtual environment.

Another study that demonstrated the role of gender combined with self-similarity in a gaming environment is that of Lucas et al. (2016), where men preferred to be represented by their own avatar whereas women preferred a stranger. This is the only research to date that has used a photorealistic self-similar avatar to study the effect of avatar appearance on the performance and perception of the user in a gaming environment. Lucas et al. tried to answer the question of the importance of the self-relevance of a virtual human in a specific context and, despite the gender difference they found, they noticed that self-similarity provokes greater engagement and connection between the user and the avatar. A similar recent study is that of Wauck et al. (2018), who used a more natural photorealistic self-similar avatar in a gaming context, but with better technological features that also respected the gender aspect, using different animations (male and female) for the two sexes. Their results indicated no difference in user performance based on the appearance of the avatar and no effect of gender either. They attribute this to the better technology they used, with which they avoided any negative effect on the user's experience. However, further investigation is needed in different environments and contexts to verify or contradict all these results.

5 Conclusions

Regardless of the technology, robotics and virtual agents can improve the accessibility of various content. Robots and other intelligent systems are able to improve the quality of human life by providing assistance in intensive and difficult situations, or even independence in daily living for people who need it, such as the elderly or people with motor/cognitive disabilities. Nowadays, agents have the ability to embody and fill social roles (Spence 2019). An embodied agent can be a physical robot or a virtual character that has an identifiable body and can use modalities such as voice, gestures, or facial expressions to communicate. The main differences between a virtual and a robotic agent are the physiology of the human face, natural neck motion, and shared gaze, but above all the physical presence (Spence 2019).

Despite the continuous effort of existing studies to enhance the domain of human–computer interaction by addressing all the aforementioned features, it seems that fluidity and naturalness of interaction have not yet been achieved. As a consequence, people still prefer human communication in any context. Jamy Li et al. (2016), for example, compared robotic and virtual agents as instructors in a video setting with educational content (Li et al. 2016). They showed that attitudes were more positive toward humans than toward robots, but that agents have the potential to act as an alternative, with the strict requirement that they are designed well. Moreover, Yi Mou and Kun Xu showed that people tend to be more open, self-disclosing, outgoing, and in general more positive when interacting with other people than with an AI agent (Mou and Xu 2017). The same was verified by the study of Shechtman and Horowitz (2003), who found that when people talk to a human instead of a computer, they tend to be more talkative and to spend more time on the conversation.

This preference can also be an outcome of the low degree of naturalism. Fischer et al. (2011), for example, found that people laughed when they had to respond to a robot's greetings, admitting that they found its movement unusual during the interaction.

This review of the current state of the art aims to present the limitations of the technology used so far in the broad area of human–computer interaction. Although different kinds of agents have contributed to several domains, such as education, health, and entertainment, both in virtual and physical environments, the virtual character or robot that makes a human feel as comfortable as when interacting with another human has not yet been reported. Undoubtedly, there are many factors to take into account when an agent is prepared for use in the context of human-technology interaction, as stated before, but the most difficult part is the optimal selection and combination of these factors. What is mainly missing from the state of the art to date is the direct comparison of all these technologies with original human–human communication. We need to keep studying human–human communication, and not only features of HCI, as what is missing is how we, as humans, react in several contexts of communication. The extraction of human features in such contexts, such as voice characteristics (range of frequencies, volume, and timbre), gestures or body movements executed by the feet or the body trunk, and even physiological features such as brain or muscle signals, can complement the existing technologies and studies; a minimal sketch of such feature extraction follows below. Moreover, the way humans react to computer-mediated characters and to virtual environments can be a tool to decipher and understand existing human communication theories, which can in turn support the aforementioned goals.
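As a concrete illustration of the kind of human feature extraction suggested above, the sketch below estimates a speaker's fundamental-frequency range, loudness (RMS energy), and a rough timbre descriptor (MFCCs) from an audio recording using the librosa library. The file name, pitch limits, and other parameters are assumptions for demonstration only and are not tied to any of the studies cited in this chapter.

```python
# Minimal sketch of voice-feature extraction (pitch range, volume, timbre).
# The file path and parameter choices are illustrative assumptions, not taken
# from any of the studies discussed in this chapter.
import numpy as np
import librosa

def voice_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=None)  # load audio at its native sampling rate
    # Fundamental frequency (pitch) contour; keep voiced frames only.
    f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    f0 = f0[~np.isnan(f0)]
    # Volume as root-mean-square energy per frame.
    rms = librosa.feature.rms(y=y)[0]
    # A rough timbre descriptor: the mean MFCC vector.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return {
        "f0_range_hz": (float(f0.min()), float(f0.max())) if f0.size else None,
        "mean_rms": float(rms.mean()),
        "mean_mfcc": mfcc.mean(axis=1).tolist(),
    }

# Example usage (hypothetical file name):
# print(voice_features("speaker_sample.wav"))
```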

Thus, through this kind of research and by creating models of human verbal and nonverbal communication, we can contribute to enhancing the naturalism of every kind of agent, offering a higher level of understanding and affect in the context of everyday communication.