
1 Introduction

Voice assistants (e.g., Alexa, Google Assistant, Siri) are voice-controlled devices that allow consumers to perform tasks such as playing music, accessing the latest news, or getting answers to general questions. In the U.S., it is estimated that conversational user interface (CUI) users will surpass 123 million by 2021, an increase of 44% since 2017 (Petrock 2019). In addition, a recent study reports that Amazon sold 75 million dollars’ worth of smart speakers around the globe in 2018, a 600% increase over the previous year (Tung 2018).

Although voice assistants have become omnipresent in our phones, vehicles, and homes, to date, academic research aimed at developing methods to study these increasingly popular technologies is still lacking (Nass 2005; Sciuto et al. 2018; Lopatovska and Oropeza 2018; Lopatovska and Williams 2018; Jiang 2015). In fact, not all traditional methods for evaluating user experience appear to be suited to the context of interaction with intelligent voice assistants. For instance, the “Think Aloud” method (Fonteyn et al. 1993), in which the researcher asks participants to verbalize what they are doing and thinking while performing a task, does not apply in this context since participants are already using their voice to interact with the device.

Therefore, the goal of this paper is to propose a new approach to evaluating user experience during vocal interactions with voice assistants. Specifically, we propose to enrich the self-reported measures used before and after the task with psychophysiological measures (i.e., electrodermal activity and micro facial expressions) to investigate users’ automatic and non-conscious reactions during the interaction. To test the feasibility of this new approach, we conducted a laboratory experiment in which participants (N = 11) were instructed to interact with Alexa. To elicit emotional reactions from participants, we designed a set of tasks likely to generate a wide range of discrete emotions.

The article is structured as follows. We first review existing research using self-reported measures in the context of voice assistants, then discuss related work on psychophysiological measurement in the HCI literature. Next, we describe our research methodology, summarize the results, and interpret them in the discussion.

2 Current Research on Voice Assistants Using Self-reported Measures

Past research on user interaction with voice assistants has relied on both qualitative and quantitative research methods, such as questionnaires, diaries, and interviews.

Questionnaires are a widely used tool since they allow researchers to collect a large amount of data from participants quickly and inexpensively (De Singly 2016). Questionnaires can be presented in several forms. For example, using Likert scales, questionnaires can be administered to participants before or after completing a task without hindering the flow of the experiment. In a study conducted by Jiang et al. (2015), participants were asked to complete a sequence of 10 tasks using the voice assistant Cortana on a smartphone, and a questionnaire was used to assess frustration, success, effort, and reuse intentions. For every task, participants only had to answer a questionnaire regarding their experience using a standard 5-point Likert scale, the most commonly used question format for measuring affective variables (Brown 2000; Burns and Grove 2005).

Similarly, diaries have also been used frequently as a qualitative research method because they provide access to users’ subjective impressions and, more importantly, to reflections on their interactions. This technique is advantageous because studies have shown that the presence of a stranger (e.g., a researcher) might affect the way a user interacts with a voice assistant, since the device is mainly used in a private or comfortable context (e.g., at home, with friends, or alone) (Easwara Moorthy and Vu 2015). Hence, diaries offer a suitable alternative or addition to qualitative research that might be affected by the presence of a researcher in a laboratory (Nicholl 2010). Researchers have used this method in a variety of contexts to study user experience after the use of a voice assistant. For instance, Lopatovska and Williams (2018) used a diary log to study user personification of Alexa. The study data were collected primarily through a structured online diary, which participants were asked to complete once a day for four days. The diary was also the primary method in Lau et al.’s (2018) study on users’ privacy concerns when interacting with voice assistants. Through the analysis of the diary logs, they found that many non-users did not see the utility of smart speakers or did not trust speaker companies. Other studies went further, finding innovative ways to collect data on how Alexa was used in multi-member households over a longer period and in a more natural way, without requiring participants to report their interactions in a diary (Sciuto et al. 2018; Lopatovska and Oropeza 2018). For example, a recent study by Porcheron et al. (2018) used a Conditional Voice Recorder (CVR), a device that is activated when Alexa is turned on, to record the interaction. This makes it possible to record multiple interactions between the voice assistant and family members in a natural context of use.

As a common tool in the HCI literature, interviews are often used as a complementary method in conjunction with the methods discussed above. For example, in order to study users’ sharing practices around voice assistants, Garg and Moreno (2019) used semi-structured interviews in addition to diary logs. In a similar vein, in-depth interviews were conducted to better understand the collected conversational logs with voice assistants in an investigation of Alexa’s in-home usage patterns (Sciuto et al. 2018).

Finally, observation is the only traditional method that provides a way to record user behavior directly during the interaction. For instance, in a recent study examining user interaction with voice assistants in public spaces, the area around Alexa was observed at different times of the day and on different days for one week, totalling 5.5 observation hours and 132 persons observed (Lopatovska and Oropeza 2018). However, observation provides little insight into how the user feels emotionally and cognitively during the interaction.

3 Psychophysiological Measures in HCI

As presented above, most studies used qualitative or quantitative methods, mostly relying on self-reported measures. Although they provide extensive and informative results on user interaction with voice assistants, these methods alone may not precisely capture what the user actually experienced at the moment of the interaction. Researchers are calling for multi-method approaches that consider what users really experience and perceive (Vom Brocke et al. 2020). For instance, it is possible that these results mainly “assess the user’s reflection on the interaction, but not the interaction itself” (Georges et al. 2017, p. 91). Therefore, we posit that what users have really experienced might be different from their subjective evaluation of their experience.

Research in Human-Computer Interaction (HCI) has used psychophysiological measures as viable indicators of cognitive and emotional states such as cognitive effort or frustration (Rowe et al. 1998; De Guinea et al. 2013, 2014; Giroux-Huppé et al. 2019; Beauchesne et al. 2019; Lourties et al. 2018; Agourram et al. 2019; Maunier et al. 2018). The literature has shown that users’ emotional and cognitive states can also be inferred from psychophysiological signals such as electrodermal activity (EDA), heart rate, eye tracking, and facial expressions (see Riedl and Léger 2016 and Riedl et al. 2020).

By using self-reported measures only, researchers may face various cognitive biases such as social desirability (de Guinea et al. 2014). Psychologists suggest that the presence of a stranger (e.g., a researcher) can change the way one interacts with, in our case, a voice assistant, prompting responses that are socially desirable (Piedmont 2014, pp. 6036–6037). For example, by asking participants how likely they would be to use a voice assistant in multiple environments (e.g., alone at home, in the metro, or at work), Easwara Moorthy and Vu (2015) found that the social context in which the interaction occurs influences the information transmitted to the voice assistant. Hence, psychophysiological measurement tools can contribute to overcoming biases arising from self-reported measures or observations (Xiong and Zuo 2020).

Thus, in the context of assessing the experience of users while they are interacting with a voice assistant, psychophysiological tools are an interesting add-on: they complement traditional measurement instruments (e.g., questionnaires, interviews) and, more importantly, pinpoint specific emotional states at moments in time that the user may not remember (Lourties et al. 2018). For example, in the context of evaluating an intelligent voice assistant, it might be difficult for participants to remember how they felt at a particular moment of the interaction (e.g., when they felt frustrated after the CUI gave an irrelevant answer to their question).

How users react at the moment of interacting with a device stems from unconscious and automatic mechanisms (De Guinea et al. 2013). The most accurate way to assess how they felt at one particular moment is to measure the psychophysiological response to the stimulus rather than to rely on their perception of what motivated their reaction (Dijksterhuis and Smith 2005).

In this research, we contribute to the literature on human interaction with voice assistants by proposing a multi-method approach that combines psychological and psychophysiological measures to study user experience with a voice assistant, which could provide insights to researchers and developers in the field of intelligent assistants. Specifically, this study leverages electrodermal activity and micro facial expressions based on Ekman’s universal facial expressions (Ekman 1997) (happy, sad, angry, surprised, scared, disgusted) and emotional valence (positive-negative) to study user experience with intelligent assistants. In the next section, we show how psychophysiological measures can offer interesting additional information beyond conventional self-reported measures.

3.1 Arousal

Arousal is an emotional state related to psychophysiological activity, which manifests linearly from “calm” to “aroused” (Deng and Poole 2010; Russell 2003). Being aroused by a specific stimulus typically results in a feeling of alertness, readiness, or mobility (e.g., body movement, deep breathing) (Boucsein 2012). This emotional state can be measured with electrodermal activity (EDA), which captures changes in the skin conductance response (SCR) driven by nervous system activity (Braithwaite et al. 2013; Dawson et al. 2000; Bethel 2007). It is an easy-to-use and reliable psychophysiological measure that has been widely used in NeuroIS research (Léger et al. 2014; Brocke et al. 2013; Giroux-Huppé et al. 2019; Lamontagne et al. 2020). Arousal can also be measured perceptually with self-reported measures such as the Self-Assessment Manikin (SAM) rating, in which users report their perceived emotional state for a specific stimulus, such as excited, wide-awake, neutral, dull, or calm (Bradley and Lang 1994).
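To give a concrete sense of how arousal can be quantified from a raw EDA signal, the sketch below illustrates one common approach: separating the slowly varying tonic level from the fast phasic component and counting skin conductance responses above an amplitude threshold. This is a minimal, hypothetical Python example and not the pipeline used in this study; the function name, the 4-second smoothing window, and the 0.01 µS threshold are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def count_scrs(eda, fs, min_amplitude=0.01):
    """Count skin conductance responses (SCRs) in a raw EDA signal.

    eda: 1-D array of skin conductance values (in microsiemens), sampled at fs Hz.
    min_amplitude: minimum peak prominence (in µS) to count a deflection as an SCR;
                   0.01 µS is a commonly cited threshold, but conventions vary across labs.
    """
    eda = np.asarray(eda, dtype=float)
    # Estimate the slowly varying tonic level with a ~4-second moving average
    window = max(int(4 * fs), 1)
    tonic = np.convolve(eda, np.ones(window) / window, mode="same")
    # The phasic component carries the stimulus-driven responses
    phasic = eda - tonic
    peaks, _ = find_peaks(phasic, prominence=min_amplitude)
    return len(peaks)
```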

However, the main advantages of using a psychophysiological measure to assess arousal are that it is not invasive, requires no overt behaviour to be recorded, and offers an ecologically valid portrait of the user’s arousal at any time during an experiment (Dirican and Göktürk 2011). For instance, in a study on child-robot interaction, Leite et al. (2013) measured users’ arousal through skin conductance and found that such a method is valuable and reliable for capturing interaction with social robots. It can also be used to complement and validate traditional survey methods (e.g., questionnaires).

Moreover, in a study measuring the effects of time pressure and accuracy demands during computer mouse work, Heiden et al. (2005) asked participants to paint rectangles under a decreasing time limit and found a significant difference in electrodermal data between task difficulty levels. Finally, in a study providing a systematic assessment of IS construct validity, de Guinea et al. (2013) found that the convergent validity of arousal was evidenced by the significant correlation between the SAM scale and the electrodermal data.

3.2 Valence

Emotional valence refers to the emotional response to a specific stimulus, with negative emotions (e.g., fear, anger, sadness) on one side of the spectrum and positive emotions (e.g., joy, surprise) on the other (Lane et al. 1999). Valence can easily be measured perceptually with self-reported measures (e.g., the SAM scale) as the intensity of positive emotions minus the intensity of negative emotions, expressed within a range from −1 to 1 (Bradley and Lang 1994). Another way to measure valence is by interpreting facial expressions, which are produced by micro-movements of facial muscles (e.g., frowning when angry) (Ekman 1993). Historically, the only way to interpret facial expressions was for a trained observer to note changes in facial expressions based on the Facial Action Coding System (FACS) by Ekman and Friesen (1997).

Today, this time-consuming method has been replaced by automatic facial analysis (AFA) tools, which can automatically recognize small changes in facial action units (e.g., brow raise, chin raise, jaw drop) and interpret the data based on the FACS (Cohn and Kanade 2007; Ekman 1997).

This technology allows us to accurately detect facial expressions in real time by distinguishing between a set of discrete emotions such as angry, happy, disgusted, sad, scared, and surprised. For example, Danner et al. (2014) used this technology to examine participants’ facial reactions when tasting orange juice samples, comparing implicit measures from the tool with explicit measures from a questionnaire. They found that the software accurately reported changes in participants’ micro facial expressions between the different samples. Zaman and Shrimpton-Smith (2006) found that, compared to a user questionnaire, data captured through micro facial expressions is more effective in measuring instant emotions and fun of use. Their results also suggest that questionnaire data reflected the outcome of a task rather than a genuine account of how the user felt when accomplishing it. Similarly, in a recent study, Lourties et al. (2018) explored the convergent validity of self-reported measures with psychophysiological measures. Their results suggest that the experience lived by a participant is not the same as the experience reported: users self-evaluate their emotional valence more accurately at the end than at the beginning of a task, while they evaluate their arousal more accurately only at the beginning of a task.

To the best of our knowledge, no studies have yet used automatic facial analysis precisely triangulated with electrodermal activity to study user experience with a voice assistant. The proposed triangulated method could provide new insights for learning or evaluation contexts that rely on voice only.

4 Method

To test the feasibility of using psychophysiological measures in conjunction with psychometric measures to evaluate user experience with a voice assistant, we conducted a pilot laboratory experiment in which participants were invited to actively interact with Alexa through an Amazon Echo Dot (3rd generation) device (Amazon Inc., Seattle, WA) by completing a series of tasks. A total of 11 subjects participated in the experiment (4 males, 7 females, mean age = 24, SD = 5.48) and received a $20 gift card as compensation. This project was approved by the IRB of our institution.

4.1 Participants and Design

Since this is a feasibility study and we wanted to generate as much variance in the data as possible, we designed a within-subject experiment in which each participant was instructed to perform a sequence of interactions. The experiment had one factor with two conditions: impossible tasks (i.e., queries that Alexa was unable to complete), intended to induce negative emotions such as frustration, and possible tasks (i.e., queries that Alexa was able to complete). Participants were randomly assigned to one of two task sequences: in one condition, they completed possible tasks before impossible tasks; in the other, the sequence was reversed. During the experiment, participants completed a total of 8 interactions.

4.2 Procedure and Measures

Participants were informed that they would have to complete a total of 8 tasks. The goal of each task was explained in the form of pictograms on a tablet.

Participants completed a short questionnaire after each interaction as well as a final questionnaire at the end of the study, followed by a brief interview. To measure user perceptions, the 5-point Self-Assessment Manikin (SAM) scale (Bradley and Lang 1994) was used. The tool directly measures a person’s perceived emotional reaction to a stimulus, namely valence and arousal. The scales range from sad (1) to happy (5) for valence and from calm (1) to excited (5) for arousal.

For the psychophysiological arousal measure, we collected EDA with a Biopac MP-160 device (Biopac, Goleta, USA), using pre-gelled sensors placed on the palm of the participant’s non-dominant hand to capture changes in skin conductivity.

Electrodermal measures were standardized against a baseline captured for each participant before the experiment. The baseline consists of measuring the resting electrodermal activity unique to each participant, so that variations from that baseline can be compared across participants. Results were then rescaled from −1 to 1 for analysis purposes.
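As an illustration of this preprocessing step, the sketch below standardizes a task-period EDA signal against a participant’s baseline recording and rescales the result to [−1, 1]. The paper does not specify the exact transformation, so the z-scoring against the baseline and the min-max rescaling shown here are assumptions; the function and variable names are hypothetical.

```python
import pandas as pd

def standardize_eda(task_eda, baseline_eda):
    """Express task-period EDA relative to a participant's resting baseline,
    then rescale the result to the interval [-1, 1]."""
    task_eda = pd.Series(task_eda, dtype=float)
    baseline_eda = pd.Series(baseline_eda, dtype=float)
    # Deviation from the participant's own baseline, in baseline standard deviations
    centered = (task_eda - baseline_eda.mean()) / baseline_eda.std()
    # Min-max rescaling to [-1, 1] so values are comparable across participants
    return 2 * (centered - centered.min()) / (centered.max() - centered.min()) - 1
```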

Finally, psychophysiological emotional valence was captured via micro facial expressions with the software FaceReader (Noldus, Wageningen, Netherlands). This non-obtrusive method can detect up to six emotions: happy, sad, angry, surprised, scared, and disgusted. The valence value was calculated as the intensity of the “happy” emotion minus the intensity of the strongest negative emotion (Noldus FaceReader).
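Expressed as a small helper, this valence computation could look like the Python sketch below. The assumption that each emotion is reported as an intensity between 0 and 1 per video frame is ours; the exact FaceReader export format may differ, and the function name is hypothetical.

```python
NEGATIVE_EMOTIONS = ("sad", "angry", "scared", "disgusted")

def frame_valence(emotions):
    """Valence of one video frame: 'happy' intensity minus the strongest negative emotion.

    `emotions` is assumed to map emotion labels to intensities in [0, 1], e.g.
    {"happy": 0.62, "sad": 0.05, "angry": 0.0, "surprised": 0.1, "scared": 0.0, "disgusted": 0.0}.
    """
    strongest_negative = max(emotions.get(label, 0.0) for label in NEGATIVE_EMOTIONS)
    return emotions.get("happy", 0.0) - strongest_negative

# Example: a frustrated frame yields a negative valence
print(frame_valence({"happy": 0.10, "sad": 0.05, "angry": 0.45, "scared": 0.0, "disgusted": 0.02}))  # -0.35
```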

Since the objective of this study is to investigate user experience at the moment of interaction with a voice assistant, only the psychophysiological measures captured while participants were listening to Alexa’s answers were retained for analysis; it is the participants’ reactions to the responses given by the voice assistant that interest us.

4.3 Material and Apparatus

The apparatus was installed in a quiet room with a one-way mirror to reduce noise and external stimulation, ensuring that there would be no interruptions and that our psychophysiological data would be of good quality (see Fig. 1 for the detailed setup).

Fig. 1. Experimental setup

Our experimental setup was composed of an Alexa device, a microphone mounted with a camera, and a digital tablet. Participants interacted with the Alexa device throughout the experiment. Facial expressions were captured using a Logitech camera (Logitech, Newark, USA) and recorded with the software Media Recorder (Noldus, Wageningen, Netherlands). The software Observer XT (Noldus, Wageningen, Netherlands) and CubeHX (Montréal, Canada) were used to precisely and temporally synchronize all psychophysiological measurements, in line with the guidelines proposed by Léger and colleagues (Léger et al. 2014, 2019; Courtemanche et al. 2018). Statistical analyses were performed using the Statistical Analysis System 9.4 (SAS Inst., USA).

5 Results

To analyze the data, we first performed several linear mixed-effects regressions in which each of the measures was entered as a dependent variable (see Table 1 for detailed results). For the self-reported measures, namely valence and arousal, we found that participants reported significantly more positive valence in the possible tasks compared to the impossible tasks (t(76) = −3.77, p < .001), which was expected. This suggests that participants felt more positive emotions than negative emotions when having successful interactions with the voice assistant. However, arousal did not show a significant difference between the two task sets (t(76) = 0.54, p = .59, NS).

Table 1. Summary of results: means, standard deviations, and linear regression results

For the psychophysiological measures, the arousal results suggest that impossible tasks generated much higher EDA than possible tasks (t(2638) = 7.46, p < .0001). This means that participants experienced a much more aroused emotional state when they were having difficulties during their interactions. However, in terms of valence, we did not find a significant difference between possible and impossible tasks (t(1776) = −0.94, p = .35, NS). Table 1 presents the descriptive statistics and regression results.

In order to understand the relationship between the two self-reported measures and the psychophysiological measures, we conducted two additional linear mixed-effects regression analyses. The results showed that self-reported arousal is positively correlated with psychophysiological arousal (t(2638) = 3.82, p < .0001). However, surprisingly, our analysis revealed that self-reported valence was negatively correlated with psychophysiological valence (t(1776) = −5.09, p < .0001).
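For readers who want to reproduce this type of analysis, the sketch below shows how comparable linear mixed-effects models, with a random intercept per participant, could be specified in Python with statsmodels. Our analyses were run in SAS 9.4, so this is only an illustrative equivalent; the data file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per observation, with columns such as
# participant, condition ("possible"/"impossible"), sr_valence, sr_arousal, eda_arousal, afa_valence.
df = pd.read_csv("observations.csv")

# Effect of task condition on a dependent measure, with a random intercept per participant
condition_model = smf.mixedlm("sr_valence ~ condition", data=df, groups="participant").fit()
print(condition_model.summary())

# Relationship between self-reported and psychophysiological arousal
correlation_model = smf.mixedlm("eda_arousal ~ sr_arousal", data=df, groups="participant").fit()
print(correlation_model.summary())
```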

6 Discussion

The main contribution of this methodological paper lies in the triangulation of psychological and psychophysiological measures: to the best of our knowledge, this study is the first to compare results from both psychophysiological and self-reported measures in the context of user interaction with a voice assistant. Specifically, we found that for arousal, results from EDA showed a significant difference between possible and impossible tasks, whereas the self-reported measure did not capture such a difference. In contrast, for valence, the self-reported measure was more effective than automatic facial analysis (AFA) in detecting variance. Since previous studies mainly used self-reported measures to study user interaction with voice assistants, our study contributes by showing the benefit of a multi-method approach in this context, as each method captures a distinct emotional dimension. This suggests that during interaction with a voice assistant, what users experienced might not be exactly the same as what they report. We note that this finding is in line with previous research combining both types of methods to study similar emotional states (i.e., arousal and valence) (Lourties et al. 2018).

The results also suggest that self-perceived arousal was consistent with the psychophysiological responses measured with electrodermal activity when combining both task sets, as the two showed a significant positive correlation. These results support previous findings in HCI research using EDA and extend them to the context of user interaction with voice assistants. For example, De Guinea et al. (2013) found that the convergent validity of arousal was evidenced by the significant correlation between the SAM scale measure and the electrodermal measure; such a correlation was evidenced in the current research as well.

Moreover, our results indicate that the emotions inferred by AFA from users’ facial expressions during the interaction complement the self-perceived emotional valence reported by the users. However, we note a discrepancy between the valence inferred from AFA and the valence reported by questionnaire: the two are negatively correlated overall when combining both task sets. To investigate this surprising result, we conducted a further observational analysis of the video recordings of our participants performing the tasks. We found that several participants tended to smile when they were unable to complete an impossible task, but this was a smile emanating from frustration rather than joy, which would be aligned with the self-reported valence results.

As a future research avenue, researchers have found a way to overcome this kind of situation by focusing on a set of emotions called epistemic emotions. For example, D’Mello and Calvo (2013) report in their e-learning study with students that “boredom,” “confusion,” “curiosity,” “happiness,” and “frustration” were the most common affective states felt during learning and reading situations. In particular, the affective state of “confusion” might be interesting to test in our context, since there can be a large discrepancy between the answer the participant expects and the actual answer given by the intelligent voice assistant, as speech recognition is not yet optimal. We are currently running a new study in which we are considering the affective states of “boredom,” “confusion,” and “curiosity.”

Our experiment is limited by the fact that it took place in a user experience laboratory. Thus, the user experience may have been slightly different than it would have been in a more natural setting. Future research could extend the current study to other real-life settings, such as home and office, where interaction with voice assistants is more frequent. In addition, our experiment only measured EDA and facial expressions, while many other tools and measurements suggested by the literature still need to be tested in our specific study context. Hence, it would be interesting for future research to consider a more natural setup and to add more psychophysiological tools. Also, voice assistant users rarely use their device without performing other tasks at the same time; the main advantage of this technology is that it allows users to issue a voice command while doing something else (e.g., walking, driving, or listening to television). In our opinion, adding pupillometry to measure cognitive load (Sirois and Brisson 2014; Léger et al. 2018) in a multi-tasking context with a voice assistant would be an excellent contribution to HCI research.