1 Introduction

This work aims to build an integrative perspective on memory performance in the context of voice-based technology, in particular of voice-based conversational agents (VCAs) where user voice is the only means of operation and human-like speech is the main output directed at a user.

Memory encompasses the encoding, storage and retrieval of information, and as such it is arguably involved in any complex human behavior. The question pursued here concerns user memory performance as an outcome of human-technology interaction, in contrast to retrieving information from user memory during interaction. Put differently, performance refers to how much, and what sort of, information is remembered after an interactive technology has been queried. This matters in situations where technology is designed to enhance knowledge, to educate and to function as a cognitive support tool and aid. Investigating memory performance in such contexts has a long-standing tradition in HCI, for example, when evaluating the outcomes of engaging with online learning environments [1]. However, the settings addressed by this tradition bear little resemblance to the quality of human-technology interaction that can be achieved by voice-based services, although an integration of voice-based interaction is now within easy reach for many day-to-day applications.

Voice and speech can be generated with increasingly human-like qualities. In many situations, this means that the outcomes of human-technology interaction will resemble more and more the outcomes of a conversation with a human interaction partner. The interaction becomes more complex, richer in social cues and more natural in terms of conversational structure and norms [2]. These changes should be most pronounced in the case of technologies that are fully speech-based, i.e., they are voice-operated and use speech as the sole output. These technologies are referred to here as voice-based conversational agents (VCAs). Other forms of voice-based technologies may combine auditory input and output with text and visual display, e.g., by reading back text input, or by turning user voice into text.

1.1 VCAs in the Context of Media Equation and Digital Companionship

While it is possible to interact with a VCA on desktop computers or laptops, the two main modes of use at present arguably involve smart speakers and mobile phones [3, 4]. Both technologies have been shown to be subject to personification on the user’s side [5, 6], and the user-technology relationship in the case of smartphones has been likened to one of digital companionship [7].

The concept of digital companionship is based on media equation theory [8], which states that the medium for information processing, the computer system, gets equated with a human interaction partner [8, 9]. The reasons for this are seen to lie in human evolution: our ability to communicate has evolved with other human, or at least animate, conversation partners. As a result, we look out for social cues during complex interactions, and we respond to such cues in a way that presupposes another human, not inanimate technology. Smartphones have been shown to hold a special place in users’ networks of relevant social actors and technological devices, and they are easily perceived to be part of a more meaningful relationship [7, 10] involving the user. Similarly, interactive digital systems that make more use of social cues have been shown to be more persuasive, i.e., to bring about higher levels of user compliance with system requests [2, 11].

VCAs represent one more functionality that narrows the gap between human-human and human-technology interaction. What is more, they easily add to the companion nature of those technologies that are either always with us and close to us (smartphones), or occupy a space in our personal living environment (smart speakers). So far, the effects of VCAs have been discussed mostly in terms of user acceptance and general quality of interaction [3, 4, 5]. An investigation of memory will help to further our understanding when it comes to cognitive effects, which can be objectively measured, and the question of deeper psychological changes brought about by technology. While memory has always been important as a performance indicator for some technologies, VCAs and digital companions imply that memory for an interaction episode with technology can now become relevant to the user for personal, self-related reasons. This development will be outlined in the next sections.

2 Memory and Novel Forms of Human-Computer Interaction

2.1 Novel Research Questions Arising from Socially Complex Technology

Memory, it is argued, can be expected to be sensitive to the social cues that are included in generated speech, but the determining processes and factors are likely to differ depending on the type of memory that is under investigation.

The memory system in humans is complex and multi-faceted. The focus here is firstly on declarative memory – information that is in principle accessible by the individual through a conscious effort, and is represented by knowledge in a format that can be communicated. A crucial distinction within declarative memory is the one between semantic and episodic memory [12]. Not only do these refer to different types of knowledge, to a certain extent they can be seen as two separate systems affected by different processes.

Semantic memory consists of general, factual knowledge [13, 14]. In its simplest form, this can include dictionary-type definitions (knowing what things or concepts refer to) as much as shopping lists (knowing what items constitute the list) and exam revision material (knowing which statements constitute relevant knowledge).

In the context of human-technology interaction, investigating performance in relation to semantic memory does not pose particular challenges. Clearly defined recall or recognition tasks can be employed to assess whether information encountered during interaction has been retained by the user. Traditionally, a driving research question has been in which format knowledge should be organized and displayed by technology. In the context of VCAs, however, novel questions arise which are concerned with the ways in which social cues contained in speech allow for attention to be focused on, or diverted from, a knowledge task.

Episodic memory refers to personal experiences, to events and situations that unfold in a temporal sequence, remembered from a first-person perspective [15, 16]. Where the emphasis is most strongly on personal life events, episodic memory is also referred to as autobiographical memory [17].

In contrast to semantic memory, autobiographical memory is more clearly of a long-term nature, carries self-relevance and is retrieved in a process of active construction that draws dynamically on a knowledge base. While not all content of episodic memory may conform to these parameters, episodic and autobiographical memory are used interchangeably here, in the interest of simplicity. The dynamics involved in the construal of autobiographical memory are often due to the social context in which memories are retrieved.

Traditionally, there has been little reason to rank interactions with technology as life episodes of any longer-lasting relevance. In consequence, models of autobiographical memory have not been applied to the outcomes of such episodes. The advent of realistic speech generation, however, combined with natural language processing capabilities and algorithmically enhanced interaction, means that interacting with VCAs bears increased resemblance to human-human communication and genuine conversational settings. As such, interaction episodes become sufficiently socially enriched and complex to warrant their investigation from the viewpoint of autobiographical experience. As with semantic memory, the consideration of VCAs gives rise to novel questions that address the meaningfulness of human-technology interaction and its effects on human cognition.

2.2 Effects of Digital Technology on Memory

As outlined in the previous section, VCAs necessitate a rephrasing of established research questions. In the following, a brief overview is provided of different approaches to memory in HCI-related research. This overview, while not claiming to be exhaustive, demonstrates that an understanding of the malleability of memory is essential in order to address the cognitive consequences of interacting with VCAs.

Memory as a Performance Measure of Digital Technology.

Memory performance has been studied in those domains where it is an outcome variable of immediate interest because the technology is supposed to support memory and other cognitive functions. In educational settings, studies on online learning and training have investigated learner memory in the form of knowledge tests and other standard assessments of information retention and information accessibility. General findings in the field are by no means conclusive, in particular when the scaling up and wider implementation of digital learning environments are evaluated [1].

Specific interactive features and functionalities within such learning environments, in the form of learning assistants, have been shown to improve memory [18]. However, memory enhancement can still fluctuate across studies and environments, even for highly specific functions, and may depend on particular design features and the wider context. Consider, for example, findings that ascribe a positive memory effect to instructor visibility in instructional videos [19] compared to those that do not [20].

One feature of learning systems that is closely related to VCAs is the chatbot. These text-based conversational agents have been found to be particularly suitable for self-directed learning since they provide individualized support and feedback on learning progress in a flexible manner. These claims have been supported by a variety of studies showing positive effects on memory, for example in the context of IT university studies [21], foreign language learning [22] and general knowledge [23]. While these findings can be expected to transfer easily to VCAs, any additional effects due to the social cues contained in human-like generated speech and the context of a more life-like conversation remain to be tested.

Another area where technology-supported memory has received particular attention concerns assistive technologies for specific populations with cognitive impairments or disabilities that affect learning and retention [24]. This is where voice-based functionality is given added relevance since the engagement with technology itself is often limited in such populations [25].

Non-interactive forms of voice-based features are used in assistive technologies in the form of text-to-speech tools, and these have been shown to improve reading comprehension in the case of reading disabilities [26]. VCAs have been proposed where users can be expected to benefit the most from human-like interactions as in the case of dementia [27]. However, a rigorous assessment of the effects on memory is at present not available.

Changing the Organization and Accessibility of Knowledge in Memory.

Next to studies that have evaluated the effectiveness of specific features and functionalities, other work has investigated the more pervasive effects on memory that may come from extended interaction with digital technology. This work is concerned not so much with memory performance, as a criterion of success, but with the organization and accessibility of knowledge in memory. Sparrow, Liu, and Wegner [28] found that being faced with a knowledge gap in a difficult quiz activated the concept of Google as a search engine in participants. Their interpretation was that Google search had become part of a transactive memory system with users and was accessed routinely to supply information that was not retrievable from the user’s knowledge base. Sparrow and Chatman [29] go on to outline how the investment and distribution patterns of human cognitive resources may change in the face of ubiquitously available online information.

In a similar vein, Storm and Stone [30] demonstrated how the act of saving information digitally freed up cognitive capacity for the learning and remembering of new information. This off-loading of memory again suggests that digital systems can be used as external memory stores – not external to a central computational device, but external to the human memory system [31].

The concepts of off-loading and transactive memory concern first and foremost semantic memory. Far less is known about effects on autobiographical memory, its structure and the ways in which episodes are constructed from memory. Some researchers have argued that technologies that help us in the charting of everyday life and recording of personal information, namely social media, can stimulate autobiographical memory through reminiscence. This is a process of repeated rehearsal of past events, often directed at social relationships and joint experiences [32]. To the extent that social media can showcase specific pieces of personal information, they can also shape and guide reminiscence, for example by drawing attention to past friends [33].

There is at present a marked absence of research outlining the effects of VCAs on autobiographical memory. Reminiscence through social media is still reminiscence regarding the interaction with other humans. Carrying the concept of digital companionship further, however, requires considering reminiscence regarding the interaction with VCAs. Similarly, transactive memory has been discussed in the context of passive technologies such as search engines, not in relation to highly interactive agents.

In sum, then, memory research in the context of HCI offers some insights into effects on both semantic and episodic memory that may translate to VCAs, but more theoretical elaboration is needed to fully assess the potential that VCAs carry. In the following, therefore, more consideration will be given to the malleability of memory performance and its general dependency on environmental conditions.

2.3 Context-Dependent Memory

Psychological models that imply malleable and context-dependent memory performance illustrate what factors in a more naturalistic human-VCA interaction are likely to affect memory. Memory has been shown to depend on social cues during the encoding and retrieval process. Socially motivated information processing, transactive memory and theories of a dynamic autobiographical memory system are the three main frameworks that allow for addressing the novel research questions associated with VCAs. As stated previously, these questions concern the ways in which social cues steer attention (for semantic memory) and the ways in which we respond to meaningful and socially significant interactions with technology (for episodic memory).

Regarding socially motivated information processing, a fundamental factor concerns animacy. Animacy refers in the first instance to the perceived nature of the stimulus to be remembered (animate, in the sense of alive). Research has consistently shown that animate stimuli improve memory performance [34]. Animacy further affects language comprehension and the organization of knowledge in memory [35]. These effects are explained by the richer, more detailed encoding that is possible with animate stimuli [36] and the increased allocation of attentional resources to animate versus inanimate stimuli [37]. To be clear, research on animacy and memory has addressed the animate nature of the stimulus, but not of the medium. Still, it can be speculated that a more life-like, animate source of information, as constituted by VCAs, has similar effects.

Further forms of socially motivated information processing concern self-relevance and social categorization in the context of ingroup and outgroup membership. Intergroup settings provide strong drivers for selective attention and selective memory, and this extends to all information that can to some extent be associated with such settings. For example, better memory has been demonstrated for positive behaviors in ingroup members as compared to outgroup members [38]. These biases are explained in terms of ingroup favoritism, to preserve a positive ingroup identity [39]. Beyond favoritism, however, research has also shown that there is a general increase in memory performance for information relevant to the ingroup, rather than the outgroup, because ingroup matters command more attentional resources [40]. Transferring these findings to the domain of technology, VCAs are suited to the signaling of social group membership through a range of linguistic markers, and carry the potential to activate an awareness of ingroups and outgroups.

Transactive memory systems [41, 42] have already been used to explain how users turn to the Internet automatically to fill in knowledge gaps [28]. VCAs, however, can be a far more active component in transactive memory. They have the potential to take on different social roles and to simulate closeness and rapport to the extent that the dynamics documented in human dyads and teams can be expected to arise. Memory performance has been shown to be affected by the social characteristics of a team such as social closeness, familiarity and gender [43, 44]. It is worth noting that socially positive characteristics do not necessarily lead to better performance. Being teamed up with friends is more likely to reduce an individual’s memory performance [43, 44]. This can be explained by an adaptive process that is at the core of the transactive memory system. An individual’s information processing is based on assumptions about the knowledge structure of others in the system, and as such, efforts may be reduced when in the presence of reliable partners [45].

A much deeper-running level of self-relevance than discussed so far is at play when human-technology interaction turns into meaningful social experiences and becomes part of the knowledge base that underlies autobiographical memory. The functions and drivers associated with this type of memory are complex and go beyond those associated with semantic memory. Autobiographical memory refers to, as stated previously, personal experiences stored long-term but retrieved in an active and dynamic process [46, 47, 48]. In this perspective, autobiographical memory draws on a self-memory system, a more stable, long-term storage, but the assembly of episodic building blocks is not always the same and depends on goals of the working self, i.e., on the current psychological state of the individual [47].

While VCA-based interactions may show suitable complexity in order to be included in an autobiographical memory system, they also need to display sufficient levels of personal relevance. How can the meaningfulness of such interactions be determined? This may well depend on the extent to which VCA-based interactions can fulfil the various functions that have been ascribed to autobiographical memory. These functions have been labelled as “self”, “social” and “directive” [49, 50, 51].

Self-based functions are those that provide a sense of coherence, identity and continuity. Remembering the past, as a sequence of actions, developments and events involving the self, helps to achieve this. Social functions refer to the maintenance of intimacy, to social bonding and to the upkeep of relationships. Reminiscence, recall of joint activities, instances of support and positive social exchanges all cater to the social needs of the individual. Directive functions provide guidance and orientation in the present, they support decision-making and goal formation. This functional account illustrates how strongly the current psychological needs of the individual determine what gets remembered and how. It also helps to identify how exactly VCAs could become part of this memory system.

In terms of self-based and social functions, VCAs are unlikely to be able to fully compete with human interaction partners. They have neither the durability nor the level of intimacy that come with the friends that really matter. However, echoing again digital companionship, VCAs can emulate weaker forms of human-human interaction and may be able to address self-based and social functions much more than less interactive technologies. In terms of directive functions, past interactions with VCAs may well constitute relevant informative experiences that provide guidance in the present. Psychologically meaningful interactions could result when the VCA meets urgent support needs (e.g., calling for help in an emergency), when the VCA plays an instrumental role within a wider social context (e.g., communicating with friends, being used for a joint activity), or when the VCA enhances an episode of personal relevance (e.g., keeping a holiday organized).

In sum, the memory processes discussed allow for a theoretically informed identification of those characteristics that will make interactions with VCAs more or less memorable. Naturalness of voice, familiarity or resemblance to specific others, signaling of ingroup membership, and impressions of competence and reliability are some of the factors that come with VCAs and that have been shown to affect human (semantic) memory. In addition, the potential for a socially meaningful interaction experience, a genuine conversational setting, is likely to extend effects on semantic memory to episodic memory. This depends on the capacity of VCAs to fulfil one or more functions that are typically ascribed to autobiographical memory (self-based, social, directive). The theoretical and methodological challenges that come with this extension are addressed next.

3 Conversational Engagement and VCAs

Any effects of VCAs on human memory are of particular interest where they are “unique”, i.e., occur over and above those of comparable technologies such as text-based chatbots. Such unique effects can come from multiple sources. First, there are aspects of language and the illusion of natural speech. This includes all paraverbal signals associated with language: cadence, flow, timbre, rhythm, speed, prosody, etc. Paraverbal signals can be expected, above all, to affect perceptions of animacy. Second, there are aspects of familiarity and speaker identity. Is the voice characteristic and conducive to creating a virtual identity? Is the voice modelled on genuine, recognizable humans like the self or close others? Does the voice suggest any particular societal group or stratum? These aspects will trigger effects associated with social cognition and transactive memory.

In addition, beyond the technical aspects of speech generation, there is the question of how VCAs are integrated in devices to form a new, consistent whole that acts as a conversation partner. A smart speaker does not explicitly offer any functionality other than voice-based interaction. The device is merely the embodiment of a VCA, typically with a fixed spatial location. Mobile phones seem to be at the other end of the spectrum. Mobile phones are normally smartphones, devices optimized for screen-oriented, hand-held use, inviting haptic interaction and online connectivity, next to the original function of being a telephone. Yet, in terms of providing VCAs with an environment in which they can take on character and meaning, mobile phones offer great potential.

Bringing this multitude of factors back to systematic research on memory performance poses a serious challenge. One avenue for further research is to isolate factors and subject them to experiment-based scrutiny. To date, this approach has been used to derive a robust hypothesis and method for studying the effects of voice on semantic memory [52], but more work in this area is needed to build a reliable empirical support base.

A second avenue consists in defining and capturing more inclusive concepts from conversational analysis that help to predict what happens as the overall quality of the interaction changes. Conversational engagement is proposed here as such a concept; it has been developed specifically in the context of computer-mediated communication [53, 54].

Conversational engagement refers to the general state of activation and readiness to respond in conversational settings. Engagement is an indicator of immersion in the situation, the amount of cognitive resources and attention devoted to the conversation, and the general degree of social involvement.

Engagement has been conceptualized as a multi-dimensional construct encompassing wider bodily movement, gesturing, facial expression, but also paraverbal and verbal signals [54]. Its operationalization in research is likewise multi-dimensional, ranging from real-time motion capture to in-depth content analysis of recorded conversations.

Crucially, engagement has been related to cognitive conversational outcomes, and increases in engagement have been shown to lead to increased memory for conversational content [53]. Recall of content here did not happen as in the recall of pre-defined pieces of information, but in a wider sense as the recall of any aspect of the interaction that has left some longer-lasting impression on conversation partners. This suggests that engagement is a suitable concept for investigating effects on autobiographical, and not just semantic, memory.

In sum, conversational engagement offers a sufficiently complex concept to capture (a) the various ways in which VCAs can enhance and enrich human-technology interaction and (b) the various cognitive outcomes and different types of memory affected by such enhancement. The approach presented here seems to be particularly suitable where technology is already multi-functional and rich in affordances. It may therefore be that the combination of VCAs and mobile phones forms the best area of application, as compared to smart speakers. Importantly, conversational engagement is not only a particular theoretical approach, but implies a clear set of methods and measures. While engagement has been previously applied to computer-mediated human-human settings, the transition to human-technology settings can be achieved seamlessly by adapting existing methodology [53].

4 Conclusions and Implications

In conclusion, VCAs represent another milestone on the ongoing journey of technology towards becoming a complex conversation partner. A core outcome of conversational settings concerns memory, in the form of factual knowledge, but also in the form of stored personal experiences. VCAs add to the potential meaningfulness of human-technology interaction, and their cognitive effects should be assessed by both semantic and episodic memory performance. A summary of concepts and processes discussed in the present work is presented in Fig. 1.

Fig. 1. Summative presentation of concepts and processes that integrate VCA characteristics, memory processes, and conversational engagement.

Studies that have highlighted the context-dependency of memory can be utilized to identify a range of factors that fall under the social cues that VCAs can convey: animacy, familiarity, group membership and identity, and team roles in transactive memory systems can all be signaled effectively through generated speech. In addition, models of autobiographical memory help to identify conditions under which human-technology interaction will acquire increased meaning and significance, namely when self-based, social and directive functions of memory are addressed by the technology. The theoretical integration presented here therefore allows for the derivation of testable hypotheses for further empirical study.

In addition, the effects of VCAs can also be addressed by recent work on conversational engagement, a broad construct that situates the link between VCA and memory outcomes in a wider framework of the effectiveness of conversational settings. This approach lends itself to further research since it specifies not only variables that influence engagement, but also a methodology to capture engagement and its outcomes.

Beyond the implications for research, the present investigation of memory is also intended to benefit developers and service providers as well as users. The increased use of VCAs in everyday life, in the form of smart speakers or as assistants for mobile phones, makes effects on memory more relevant to a large user base. Developers should consider how different aspects of generated speech affect human cognition and leave enough room for a flexible customization of speech output. Further, the choice of default settings for generated speech should also be informed by the likely effects on user memory. Service providers need to make conscious decisions as to the social role they envisage their products to play, and what users should be expected to “take away” from interacting with VCAs. Users, then, are best empowered by an awareness of the effects of VCA engagement, beyond the mere appeal of the interaction. This will help them to choose system settings, contexts for use, and interaction goals in a more self-determined way. As technology becomes a conversational partner, such an awareness will maintain necessary levels of digital literacy and competence.