
1 Introduction

Over the years, the area of human-computer interaction has focused primarily on cognitive factors. In the last decade, however, there has been growing interest in eminently emotional factors. In addition, studies in the field of psychology have recognized the importance of emotions for human cognition, motivation, learning and behavior. This understanding of emotions has inspired many researchers to build machines that can recognize, express, model, communicate and respond to human emotions.

Computer systems can display emotions, language and communication patterns, and their conversational and cognitive abilities can, to some extent, be compared with those of human beings.

Rosalind Picard’s Affective Computing was published in 1997 [1] and laid the foundation for equipping machines with emotional intelligence. A new area of research, called affective computing, has thus emerged, concerned with the emotional side of computers and their users. The term can be defined as computing that relates to, arises from, or deliberately influences emotions. The definition may not be complete, however, since the goal is the automatic recognition of emotion by computer systems. Yet, for a natural and effective HCI that approaches an affective dimension, computers still need to appear intelligent [1], and for intelligent interaction the reference should be human-human interaction. In this regard, Wiener [2] argues that human-to-human communication should be the model for human-machine interaction (as well as machine-machine interaction), maintaining that communication between humans and machines should not be distinguished from natural communication between humans, and that it is irrelevant whether a communicative signal is processed by a machine rather than by another human.

One of the main limitations of affective computing is that most previous research has focused on recognizing emotions from a single sensory source, or modality. However, as HCI is multimodal, researchers have tried to use various modalities for the recognition of emotions and affective states.

2 Multimodality in HCI

In an HCI, the sender translates concepts (symbolic information) into physical events that are transmitted to the appropriate receiver and the receiver interprets the received signal in terms of abstract symbols. These processes involve the user’s senses and motor skills and symmetrically the input and output mechanisms of the system.

The purpose of a multimodal system is to provide an extension of sensorimotor capabilities that replicates the processes of natural communication between humans [3]. This mode of communication involves the simultaneous use of various modalities, so a computer system should be able to support them in interaction with the user.

For this to happen, a multimodal computer system must be equipped with hardware that enables the acquisition and/or transmission of multimodal expressions (within a time frame compatible with the user’s expectations), be able to choose the output mode appropriate to the content to be transmitted, and be able to understand multimodal input expressions [4].
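As an illustration of the second requirement, the short sketch below shows one way a system might choose an output mode appropriate to the content to be transmitted. It is a minimal, hypothetical example: the content and context descriptors, the modality names and the decision rules are assumptions made for illustration and are not drawn from [4].

```python
from dataclasses import dataclass
from typing import List

# Hypothetical content and context descriptors; the names, fields and rules
# below are illustrative assumptions, not taken from the cited works.
@dataclass
class Content:
    kind: str        # e.g. "alert", "long_text"
    urgency: float   # 0.0 (low) .. 1.0 (high)

@dataclass
class Context:
    available_outputs: List[str]  # e.g. ["speech", "screen", "haptic"]
    ambient_noise: float          # 0.0 (quiet) .. 1.0 (very noisy)
    user_watching_screen: bool

def choose_output_mode(content: Content, ctx: Context) -> str:
    """Pick the output mode judged most appropriate for the content,
    falling back when the preferred channel is unavailable or degraded."""
    # Urgent alerts: prefer a channel the user cannot easily miss.
    if content.kind == "alert" and content.urgency > 0.7:
        for mode in ("haptic", "speech", "screen"):
            if mode in ctx.available_outputs:
                return mode
    # Long textual content reads better on screen than as synthesized speech.
    if content.kind == "long_text" and "screen" in ctx.available_outputs:
        return "screen"
    # In a noisy environment, avoid speech output if a visual channel exists.
    if ctx.ambient_noise > 0.6 and "screen" in ctx.available_outputs:
        return "screen"
    # Default: speech when the user is not watching the screen, else the
    # first available channel (the list is assumed to be non-empty).
    if not ctx.user_watching_screen and "speech" in ctx.available_outputs:
        return "speech"
    return ctx.available_outputs[0]
```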

With the increasing complexity of computer applications, a single modality cannot ensure effective and emotionally expressive interaction across all tasks and environments. Thus, over the years, there has been an investment in natural language processing, computer vision and gesture analysis within HCI. This investment has sought integration into traditional interfaces, giving them greater functional potential.

Multimodal interfaces have represented another direction for computing, and there is great potential for integrating distinct, synergistic modalities, supported by the myriad technologies that have become available. In fact, they present both potentialities and constraints, and each sensory mode of interaction should be selected according to the communication effectiveness it promotes. This effectiveness is conditioned by numerous variables, namely the characteristics of the content to be transmitted, the sender, the receiver, the input and output mechanisms, the cognition systems (human and computational), among others.

2.1 Contribution of the Cognitive Sciences

Multimodality is a research route that requires the contribution of the cognitive sciences concerning human perception and the use of coexistent modalities in a natural context of interaction – for example, speech, gesture, gaze and facial expressions.

Cognitive sciences have evolved over the course of the twentieth century from a model of atomistic (unimodal) perception – the view of the construction of the whole through the joining of individual parts – to the perceptual (multimodal) model of Gestalt theory – the view that the whole is different from the sum of its constituent parts [5].

Recent studies offer increasing evidence that our bodies are attuned to multiple input modes, although the classical theory of neurological sensory processing favors the single-modality model, in which the primary sensory areas of the cortex are unisensory [6].

Ghazanfar and Schroeder [7] consider that the integration of the different qualities of information from the various sensory organs (combined in the brain to produce a unified and coherent representation of the external world) does not occur in the most specialized, higher-level areas of the neocortex after individualized processing in less specialized, lower-level areas (as traditionally suggested). On the contrary, they attest that much (if not all) of the neocortex is multisensory and processes the various qualities of information in an integrated manner, practically from the moment of capture. In this context, human perception results from a unified representation of the set of sensory inputs received from the outside world.

Instead of analyzing sensory modalities autonomously with eventual later integration, researchers began to consider the role of multisensory integration in the perceptual process by demonstrating that there are transmodal effects on human perception (i.e., the senses influence each other) and that temporal synchrony also plays a relevant role in these effects [5]. Indeed, while the concept of multisensory data fusion can hardly be considered recent, as humans and animals have evolutionarily developed the ability to use multiple senses in order to increase their chances of survival, advances in computing and sensors have provided the ability to emulate, through hardware and software, the natural information fusion capabilities of humans and animals [8]; a minimal fusion sketch is given after the list below. Dumas et al. [9] highlight the fact that research in cognitive psychology has revealed that:

  • the working memory that humans dedicate to the different modalities (and the corresponding processing power) is partially independent across modalities, so the presentation of information through different modal channels increases the total memory available to the human for information processing, expanding their capabilities and improving performance;

  • humans tend to reproduce their interpersonal interaction patterns when they multimodally interact with a computer system;

  • the way human perception, communication and memory work leads to improved performance when humans interact multimodally with a computer system.
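The emulation of multisensory fusion mentioned above can be illustrated with a simple late-fusion scheme, in which per-modality recognition scores are combined into a single decision. This is a minimal sketch under assumed inputs (the modality names, labels, scores and reliability weights are invented for illustration), not a reference implementation from the cited works.

```python
from typing import Dict

def late_fusion(per_modality_scores: Dict[str, Dict[str, float]],
                reliability: Dict[str, float]) -> str:
    """Combine per-modality class scores (e.g. from face, voice and gesture
    recognizers) into one decision, weighting each modality by an estimated
    reliability. A purely illustrative weighted-sum fusion."""
    fused: Dict[str, float] = {}
    for modality, scores in per_modality_scores.items():
        weight = reliability.get(modality, 1.0)
        for label, score in scores.items():
            fused[label] = fused.get(label, 0.0) + weight * score
    # Return the label with the highest fused score.
    return max(fused, key=fused.get)

# Example: the voice channel is noisy, so it receives a lower weight.
decision = late_fusion(
    {"face":    {"joy": 0.7, "neutral": 0.3},
     "voice":   {"joy": 0.4, "neutral": 0.6},
     "gesture": {"joy": 0.6, "neutral": 0.4}},
    reliability={"face": 1.0, "voice": 0.5, "gesture": 0.8},
)
print(decision)  # -> joy
```

Weighting each channel by an estimated reliability lets a degraded modality (here, a noisy voice signal) contribute less to the fused decision, which is one simple way of reflecting the transmodal influences described above.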

This understanding is also defended by Anthony et al. [10], who state that it is natural and convenient for humans to communicate with the computer using modal channels to aid thought processing, concept visualization and the establishment of an affective relationship. Tzovaras [11] likewise argues that the “interface” between the human and the environment, as well as between humans, is multimodal, and all senses (even if some are dominant) participate in the operations of perception, action and interaction. In this regard, Landragin [12] mentions that the way we view an object determines the discourse and gestures we use to refer to it, while the gestures we produce structure our visual perception. This understanding corresponds to the realization that visual perception, language and gesture establish multiple interactions with each other. On the other hand, Aran et al. [13] recall that speech components such as lip movements, hand-based sign languages, and head and body movements, in addition to facial expressions, constitute available multimodal information sources that the hearing impaired integrate into communication. Thus, real-world behavior and perception are dominated by the integration of information from multiple and diverse sensory sources.

In recent years, also in the field of linguistics, researchers have become aware that a theory of communication describing real human-human interactions must encompass a diversity of dimensions. This is why multimodality has been considered a better representation of the complexity of discourse [14] than unimodality.

2.2 Potentialities and Constraints of Multimodal Interfaces

Multimodal interfaces are a class of multimedia systems that integrate artificial intelligence and have gradually gained the ability to understand, interpret and generate specific data in response to the analyzed content, differing from classical multimedia systems and applications, which do not understand the semantics of the data (sound, image, video) they manipulate [15, 16].

Although both types of systems may use similar physical input and output (acquiring, storing and generating visual and sound information), each serves a different purpose: in multimedia systems, information is subject to the task and is handled by the user; in multimodal systems, information is a resource used by the task control processes themselves. Martin et al. [17] argue that, in the context of a computer system, the option for a multimodal solution is only convenient if it is validated by usability criteria. They refer, for example, to the following:

  • allow faster interaction;

  • allow selective adaptation to different environments, users or usage behaviors;

  • enable a shorter learning curve or be more intuitive;

  • improve the recognition of information in a noisy environment (e.g. sound, visual or tactile aspects);

  • allow the linking of information presented to a more global contextual knowledge (allowing for easier interpretation);

  • and allow the translation of information between modalities.

This is also the understanding of Ferri and Paolozzi [18], who state that the choice of a multimodal interface, instead of a unimodal solution, depends on the type of action to be performed by the user and on its increased usability potential. In fact, the various input and output channels usable in HCI (keyboard, mouse, touchscreen, microphone, motion sensor, monitor, speaker, haptic devices, etc.) have their own benefits and limitations, so multimodal interaction is often used to compensate for the limitations of one modality by making another available [19]. Each input modality is suited to a set of interaction contexts and may be suboptimal or even inappropriate in others [20]; therefore, the selection of the interaction modality is a matter of extreme relevance in a multimodal system.
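The idea of compensating for the limitations of one modality with another can be sketched as a context-dependent ranking of the available input channels. The suitability table, context names and scores below are illustrative assumptions, not values taken from [19, 20].

```python
from typing import Dict, List

# Hypothetical suitability table: how well each input modality is expected to
# work in a given usage context. Modalities, contexts and values are
# illustrative assumptions.
SUITABILITY: Dict[str, Dict[str, float]] = {
    "speech": {"quiet_office": 0.9, "noisy_street": 0.2, "hands_busy": 0.9},
    "touch":  {"quiet_office": 0.8, "noisy_street": 0.8, "hands_busy": 0.1},
    "gaze":   {"quiet_office": 0.6, "noisy_street": 0.5, "hands_busy": 0.7},
}

def rank_input_modalities(context: str, available: List[str]) -> List[str]:
    """Order the available input modalities from most to least suitable for
    the current context, so a weak channel can be compensated by another."""
    return sorted(available,
                  key=lambda m: SUITABILITY.get(m, {}).get(context, 0.0),
                  reverse=True)

# On a noisy street, touch is preferred and speech becomes a last resort.
print(rank_input_modalities("noisy_street", ["speech", "touch", "gaze"]))
# -> ['touch', 'gaze', 'speech']
```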

2.3 Development Requirements, Limitations and Constraints

System designers are using a growing number of different (often alternative) input/output modalities to exchange information between systems and their users. For this to happen, the design of multimodal interfaces must be based on the following principles: selection of the content to transfer; assignment of appropriate modalities to that content; and functional implementation of each modality so that it ensures the transfer of the content.

Bernsen [21] also presents a similar logic of procedures that an interface designer should consider: identifying the information to be exchanged between users and the system; performing a good match between the information and the available input/output modalities in terms of functionality, usability, naturalness, efficiency, etc.; and designing, implementing and testing the interface. The usability of multimodal interfaces is generally easier to achieve if users are familiar with this mode of interaction. Some interfaces place an excessive cognitive burden on their users, although this problem can be circumvented by their “disappearance” and naturalness, so that users can focus exclusively on the activity.

In recent years, research on multimodal HCIs has focused on analyzing and designing mainstream interfaces. In turn, Bernsen and Dybkjær [22] warn against the danger of overselling multimodal interaction, especially when the aggregation of modalities does not promote any gain in the efficiency of human-computer communication, and point out that the results of the interaction should always be what is valued. In this regard, they criticize, for example, the trivialization of research on more or less elaborate animated conversational agents that occupy valuable screen space and processing resources when compared with mere discursive output.

Empirical studies show that users change their attitude and expectations towards a computer system when confronted with a more or less realistic animated conversational agent, adopting a posture closer to human communication, i.e. a more affective one. Even so, multimodal interactive systems are sometimes contaminated by the designers’ desire to apply new technologies that do not amount to a real increase in usability, and the success of the interface tends to be evaluated only through empirical approaches.

3 Naturalness in Multimodal HCI

The flexibility of a multimodal interface should accommodate a wide variety of users, tasks and environments that are beyond the possibilities of interaction through a single mode. In this regard, Ferri and Paolozzi [18] claim that there is a growing demand for user-centered system architectures with which the user can interact through the natural modalities of human-human communication. The interaction should be natural enough that the user does not need to adapt to the computer system; rather, the opposite should be favored. Such an interface appeals to the user’s intuition, supported by the transfer of skills and knowledge acquired in previously experienced environments and contexts. In this regard, Maybury and Wahlster [23] emphasize the need for increasingly intelligent interfaces, defining them as those that promote the efficiency and naturalness of HCI by aggregating the benefits of adaptability, fitness to context and support for task development. Bernsen and Dybkjær [22] present two possible lines of analysis, built around two interaction paradigms:

  1. paradigm of natural multimodal interaction sustained in the strict use of the modes of communication that individuals use to communicate with each other;

  2. paradigm of functional multimodal interaction in which any modality (natural or not) should be used if it leads to the promotion of more efficient interactions.

For example, Yin [24] defines natural human-computer interaction as multimodal interaction that occurs in a cognitively transparent and effortless manner, leading the computer system to understand what the user is doing or communicating without requiring the user to alter the pattern of natural behavior he would adopt with another person. In this regard, the production of more natural interfaces should involve the use of the sensory modality (or modalities) that most effectively accomplish the task. Yin also argues that computer systems have sensory modalities that may, in some cases, offer higher usability than those currently used in human-human communication, defining the naturalness of an interaction in terms of greater ease of interaction and superior usability rather than its mere parallelism with interaction between human individuals.

4 Transparency of Interfaces

Advances in technology, in addition to the objectives of ease of learning, ease of use and naturalness, are allowing both computers and interfaces to become transparent. Norman [25] argued that the computers of the future should be invisible. In this context, Bolter and Grusin [26] approach the concept of transparency (as a characteristic of immediacy, i.e., the absence of mediation or representation), which occurs when the human user forgets (or is unaware of) the means by which information is being transmitted, thus being in direct contact with the content. They state that “virtual reality, three-dimensional graphics and interface design are, together, seeking to make digital technology ‘transparent’” so that the user may feel part of the system.

Within the scope of HCI and in accordance with the above, interface development processes must ensure standardization, consistency and transparency, in order to satisfy the user’s needs and facilitate human action. There is also a growing desire for the “disappearance” of the interface as the mediator of HCI, in order to make interactions feel more direct and closer to reality.

5 Aspects for the Anthropomorphization of HCI

Human beings tend to attribute anthropomorphic characteristics, motivations and behaviors to animals, artifacts and natural phenomena.

For a computer system to promote an affective relationship with the user, it must promote anthropomorphization, which, in turn, may result from particular characteristics such as the promotion of multimodality, the naturalness of interaction and the transparency of the interface (see Fig. 1). Anthropomorphization, and also personification, seek to instill human characteristics in computer systems through the interface, as well as to promote their transition from technological object to subject. Thus, technological interfaces that, being anthropomorphic, resort to multimodality, naturalness and transparency must become, simultaneously, a representation (of the human) and a device (tool). When this simulation is able to reconstruct the idea of a human relationship in human-computer interaction, it achieves its concrete objective: the machine moves from being an object to being a subject [27]. The passage from object to subject, achievable through the embodiment of physical and psychological aspects of the human being, is indicative of the anthropomorphization of the entire object or machine.

Fig. 1. Aspects for the anthropomorphization of HCI

6 Conclusion

Aspects such as multimodality, naturalness and transparency, which, in synergy, contribute to affective computing, can also contribute to the anthropomorphization of HCI. The whole design of the interaction between humans and computer systems depends on the options taken by the various specialists who intervene in its planning, development and implementation, and in particular on the variety of ethical decisions involved in developing an interaction technology. The challenge of approaching affective systems is growing, but the lack of tools to support interface designers and other stakeholders is a constraint that needs to be resolved. To this end, it is important to develop standards across all interfaces that seek to promote affectivity through multimodality, naturalness and transparency, although this is not, in fact, an easy objective to pursue.

One question that can be asked is whether emotions can be designed, so that affective computing becomes a design requirement. Emotions result from many different aspects, and designers may not be able to control and bring together all the conditions necessary to create specific emotions. They may, at best, establish the context of an emotion rather than the emotion itself, since it is difficult to generate affective responses in judgments of, for example, aesthetics, taste or behavior, as well as in how those judgments influence later decision making. Nevertheless, some guidelines that facilitate the design of mainstream, natural, transparent multimodal systems are presented below, taking affective computing as the objective to be achieved, along with some steps that may be determinant for its success and general acceptance (a minimal checklist sketch follows the list):

  • Terminology – Use of consistent terminology for the presentation and operation of the interface.

  • Feedback – Constant feedback to the user from the interface, so that the user is aware of the point of use at which he finds himself and knows the possibilities and channels of interaction available at all times.

  • Error Prevention – Prevention and proper management of errors by the system and its user, providing ways for them to be consciously corrected.

  • Accessibility – Clear specification of requirements for the interface with particular attention to the imperative that it should cover the maximum number of users, contexts of use and possible applications, in order to ensure flexibility for users with skill limitations and in situations that impose restrictions or different possibilities of use.

  • Imperceptible Tutorial – The users appear to access information accidentally, but the occurrence is actually designed to integrate them into the system and indicate the principles for its use.

  • Personalization – The interface adapts to the user’s needs after an analysis of his behavior and interaction patterns.

  • Privacy – Concern with the necessary flexibility in the decision by users as to how their privacy and security will be managed.

  • Multimodality – Option for multimodality of input and output in order to maximize the cognitive and physical capacities, as well as the usage preferences of the various users.

  • Interaction Starting – The interface initiates the interaction and encourages the users by calling their attention to relevant and timely information, instead of waiting for them to start interaction.

  • Predictive Interface – The interface tries to predict the user’s intention, offers possible results and suggestions that go beyond the user’s immediate intentions.

  • Unpredictability – The interface uses randomness to induce a sense of unpredictability in the interaction and allows access to the system from multiple points, without a defined hierarchy.

  • Automation – Common and repetitive tasks are automated to simplify interaction. The generative potential of the interface is based on flexible rules that grant it a certain degree of autonomy, without user interaction.

  • Visible Information – The interface allows quick observation and gathering of information at opportune moments, bringing accessory information placed at its periphery into the user’s focus at the appropriate time.

  • Natural Interaction Modalities – Use of sensory channels (actuators) that correspond to the natural modes of communication in a human-human or human-environment context and, cumulatively, use of these sensory channels in a way equivalent to how they would be used in that context.

  • Transparency – Disappearance of the interface, placing the user in direct contact with the information through one or more modes of interaction.
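As a loose illustration of how these guidelines might be operationalized during design reviews, the sketch below encodes them as a simple checklist. The guideline identifiers mirror the list above; the scoring scheme and class names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Guideline identifiers mirror the list above; the scoring scheme (0 = absent,
# 1 = partially met, 2 = fully met) is an illustrative assumption.
GUIDELINES: List[str] = [
    "terminology", "feedback", "error_prevention", "accessibility",
    "imperceptible_tutorial", "personalization", "privacy", "multimodality",
    "interaction_starting", "predictive_interface", "unpredictability",
    "automation", "visible_information", "natural_interaction_modalities",
    "transparency",
]

@dataclass
class DesignReview:
    scores: Dict[str, int] = field(default_factory=dict)

    def rate(self, guideline: str, score: int) -> None:
        if guideline not in GUIDELINES:
            raise ValueError(f"unknown guideline: {guideline}")
        self.scores[guideline] = max(0, min(2, score))

    def unmet(self) -> List[str]:
        """Guidelines not yet rated, or rated 0, i.e. open design work."""
        return [g for g in GUIDELINES if self.scores.get(g, 0) == 0]

review = DesignReview()
review.rate("feedback", 2)
review.rate("multimodality", 1)
print(len(review.unmet()))  # -> 13 guidelines still unaddressed
```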

These design requirements seek to promote the development of more affective, effective, simpler and more natural interactions by promoting superior accessibility and usability. This effectiveness is conditioned by numerous variables, namely the characteristics of the content to be transmitted, the sender, the receiver, the input and output mechanisms, the cognition systems (of the human and of the computer), among others.