1 Introduction

Conversational agents (CAs), such as chatbots and virtual assistants, are artificial intelligence technologies that help humans interact with computers [1]. Users send control commands by voice, and CAs reply by voice or execute the commands by controlling hardware devices [2]. Voice (vs. text) interaction has been suggested to lead to more positive attitudes toward virtual assistants [3]. Currently, representative CAs such as Amazon Echo, Apple Siri, and Google Meena can perform functions such as checking the weather, making phone calls, and playing music. They can also communicate with humans emotionally, for example by responding to requests such as “Tell me a joke.” and “Hi Siri, what is the weather like today?”.

1.1 Speed effects and user experience

The rate of speech, an important component of speech, may also be a vital cue influencing affective experience in speech interaction. It is a characteristic of human linguistic expression: the number of words produced per unit of time when people express or communicate information, which carries specific communicative meaning. In conversational speech, the speech rate expresses the speaker's style, personality, and psychological state, and it plays an important role in speech and news communication [4, 5]. Chan and Lee [6] examined the relationship between the natural speech rate and intelligibility. They showed that at a rate of 4.3 characters per second (231 ms per character), subjects perceived information with correctness of up to 99%. Another study found an age difference in the effect of speed: older people showed greater deficits and a worse user experience at fast speeds [7]. With regard to perceived understandability, users consider news delivered at a slow speed to be more understandable and to offer a better user experience [8, 9].

The emergence of CAs has increased interactions between humans and machines [10, 11]. The primary design goal of a CA is to become an AI companion for the user and form a long-term emotional connection with the user [12]. Therefore, Maedche et al. [13] suggested that virtual assistants should be designed with specific human characteristics, such as voice, gestures, and facial expressions. Voice interaction is in essence a continuation of human communication, so the use of social expression cannot be ignored [14]. Experiments with the assistant’s voice showed that a human voice builds trust in the virtual assistant and generates stronger behavioral intentions [15]. Many studies have also examined CAs’ personalized expression as a way to improve user experience, such as using personal pronouns [16, 17], using male or female voices [18, 19], and adding personalized greetings [20].

The personalized design of CAs cannot be separated from a suitable communication speed. A study of speech rate in CAs for people with vision impairments suggested that CAs should actively adjust their speech rates according to the context while maintaining humanness [7]. In addition, two recent studies that observed CA use showed that people become frustrated when CAs interact at a slower than expected speed [21, 22]. It is therefore important to clarify how the speech rate of CAs should be designed to enhance the user experience. Therefore, we propose Hypothesis 1 (H1).

H1

The rates of speech that CAs use will affect the user experience.

1.2 Interpersonal communication principle in CA interactions

Communication Adaptation Theory (CAT) is a general theoretical framework for interpersonal and intergroup communication [23]. It attempts to explain and predict why, when, and how people adjust their communication behavior during social interactions and what social consequences these adjustments will have. With regard to speech interaction, the theory suggests that “upon entering an interpersonal encounter, people often unconsciously begin to synchronize aspects of their verbal (e.g., accent and rate of speech) and non-verbal behavior (e.g., gestures and posture)” [24, 25]. Speakers thus typically make their discourse more or less similar to that of their listeners at various linguistic and paralinguistic levels [25]. For example, doctors may slow their speech rate to match the patient's and adapt their use of technical terms to help patients understand their health status [26]. Muir et al. [27] found that linguistic style accommodation shapes impression formation and rapport in CAs. They argued that CAs need to converge toward the user's rate of speech, as humans do in face-to-face communication. The rate of speech thus plays an important role in communication. Therefore, we propose Hypothesis 2 (H2).

H2

As the user's rate of speech increases, the user prefers a corresponding increase in the speed of the CA.

1.3 Gender difference in CA interactions

Gender is an important variable in the user experience because men and women may have different requirements and feelings. Several studies on the impact of gender on voice assistant use have demonstrated that men have higher ratings for voice assistant response [28]. In a study of the impact of gender and user’s technical experience on the performance of voice-activated medical tracking applications, the characteristics of female voice were found to pose certain technical challenges as an output [29]. Adding anthropomorphic cues to computer agents reduced the perception of agent friendliness for males [30]. Research on the effects of gender on voice processing suggests a processing speed asymmetry between female and male voices [31]. Therefore, we propose Hypothesis 3 (H3).

H3

The impact of the CA's speed on user experience is related to gender.

Based on these considerations, this study aims to explore the necessity of rate of speech in speech interaction; how the rate of speech affects users' emotional experience and satisfaction; and how the rate of speech is designed to meet users' emotional needs. Therefore, we propose the following three research questions.

RQ1: How does the rate of speech of CAs affect user experience, and which feedback speed should be used by CAs?

RQ2: How should CAs respond when they are aware of a person's speed preference?

RQ3: How do males and females react to CAs' responses to different speeds?

The findings of this study provide designers with insights into how CAs should control their rate of speech in human–computer interactions.

2 Method

2.1 Materials and design

To investigate the effect of rate of speech on users' subjective evaluation of a CA, as well as users' rate-of-speech preferences and how these differ across scenarios, this study used a 3 (CA’s speed: slow, medium, and fast) × 2 (gender: male and female) mixed factorial experimental design at three user rates of speech (slow, medium, and fast). We collected data from 25 participants using subjective evaluations, forced choice, and interviews. The rates of speech of the CA and the user were presented through one-to-one conversations between the CA and the user, so we tried to give each feedback sentence the same length. For the rate of speech, we set three levels for the speech system's responses based on the common rate of speech in Putonghua on China Central Television: fast (approximately 320 words/minute), medium (approximately 280 words/minute), and slow (approximately 240 words/minute). In practice, users may speak at different speeds in different scenarios and would like to receive feedback at an appropriate speed. Therefore, we selected seven common communication scenarios between users and CAs as scenario variables: a weather query, music play, news query, health query, travel query, restaurant query, and recipe query. The experimental material consisted of simulated speech dialogs in which the user asks a question and the CA answers. After determining the final script through a pretest and internal discussion, we used the Baidu AI open platform to produce the speech dialogs, in which the user asks in a male voice and the CA answers in a female voice. The speech material is shown in Table 1.

Table 1 List of questions for the conversational agent
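The stimulus set described in Section 2.1 can be sketched in Python as a full crossing of scenario, user speed, and CA speed. This is only an illustrative sketch and not the authors' material-generation code: the short scenario labels, the helper `playback_seconds`, and the 40-word reply length are assumptions introduced here, while the three rates are the approximate values reported above.

```python
from itertools import product

# Approximate speech-rate levels reported in Section 2.1 (words per minute).
CA_RATES_WPM = {"slow": 240, "medium": 280, "fast": 320}
USER_SPEEDS = ("slow", "medium", "fast")
# Short labels for the seven scenarios (the labels themselves are shorthand).
SCENARIOS = ("weather", "music", "news", "health", "travel", "restaurant", "recipe")


def playback_seconds(n_words: int, rate_wpm: float) -> float:
    """Return the time in seconds needed to speak n_words at rate_wpm."""
    return n_words / rate_wpm * 60.0


# Each dialog pairs one scenario with one user speed and one CA speed.
dialogs = list(product(SCENARIOS, USER_SPEEDS, CA_RATES_WPM))

# Hypothetical reply length; the real sentence lengths are not given here.
EXAMPLE_REPLY_WORDS = 40
durations = {
    level: playback_seconds(EXAMPLE_REPLY_WORDS, rate)
    for level, rate in CA_RATES_WPM.items()
}

print(len(dialogs))  # 7 scenarios x 3 user speeds x 3 CA speeds
print(durations)
```

Holding the reply length constant, as the design attempts to do, means that the duration of a reply differs only because of the speech-rate level.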

2.2 Participants

A total of 25 participants were recruited for this experiment, including 13 males and 12 females aged between 20 and 41 years old (Mean = 26.5, SD = 4.5). The participants were required to have used at least one voice interaction device before the experiment, including a mobile phone or other CA. Basic participant information is shown in Table 2.

Table 2 Basic information about the study participants

2.3 Procedure

We performed the experiment using E-prime 2.0 in a quiet room. After recording demographic information, participants logged into the E-prime experimental platform and viewed the experimental instructions. Participants were told that they would listen to several conversations between a user and a CA. Before the formal experiment, participants practiced at least twice to ensure they were proficient in the procedure, and they then completed the experiment independently. After each conversation, participants completed the subjective evaluation scale and chose their preferred CA feedback speed at different user speeds; the program then automatically proceeded to the next conversation. To avoid the influence of earlier experimental material on later material, we presented the 7 scenarios in a completely randomized order, and the 9 dialogs in each scenario were also randomized. The experiment lasted approximately one hour, and no specific breaks were scheduled between tasks. After the experiment, we interviewed the participants using the five questions shown in Table 3.

Table 3 Interview questions
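The two-level randomization described in Section 2.3 (scenario order shuffled, then the nine dialogs within each scenario shuffled) can be sketched as follows. The experiment itself was run in E-prime 2.0, so this Python version is only an assumed equivalent for illustration; the function name `build_trial_order` and the `seed` parameter are hypothetical.

```python
import random
from itertools import product

SCENARIOS = ["weather", "music", "news", "health", "travel", "restaurant", "recipe"]
SPEEDS = ["slow", "medium", "fast"]


def build_trial_order(seed=None):
    """Return a list of (scenario, user_speed, ca_speed) trials.

    Scenarios appear in random order, and the nine user-speed x CA-speed
    dialogs inside each scenario are shuffled independently.
    """
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    scenarios = SCENARIOS[:]   # copy so the module-level list is untouched
    rng.shuffle(scenarios)

    order = []
    for scenario in scenarios:
        dialogs = [(scenario, user, ca) for user, ca in product(SPEEDS, SPEEDS)]
        rng.shuffle(dialogs)
        order.extend(dialogs)
    return order


trial_order = build_trial_order(seed=2024)
print(len(trial_order))  # 7 scenarios x 9 dialogs
```

Shuffling within scenario blocks keeps all nine dialogs of a scenario together, which matches the description of randomizing the dialogs "in each scenario" rather than across the whole session.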

2.4 Measured variables

We selected five subjective evaluation items as dependent variables, the same as those used in previous studies [28, 16, 33, 34]. The mean of these items reflects users' responses to the CA and their true feelings. All participants rated the nine combinations of user speed and CA speed in each of the seven experimental scenarios. The items were: ‘the response of the voice system is pleasant’; ‘the response of the voice system is natural’; ‘the response of the voice system is trustworthy’; ‘the response of the voice system can close the distance between myself and the voice system’; and ‘the response of the voice system is satisfying’. The reliability and validity of these items have been verified in previous studies [16, 34]. The participants responded on a 7-point Likert-type scale ranging from totally disagree to totally agree (1 = I do not agree at all, 7 = I totally agree).

3 Results

3.1 Subjective evaluation

We performed an exploratory factor analysis of the subjective evaluation scale. The KMO value was 0.919 (above 0.8), and Bartlett's test of sphericity was significant (p < 0.001). Therefore, we extracted one factor to explain these five items as the "Subjective evaluation score" [35]. This factor explained 81.04% of the variance. The subjective evaluation score was obtained by calculating the mean of the five items.
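The composite score described above is the mean of the five 7-point items. A minimal sketch of that calculation is given below; the item keys (for example `closeness` for the "close the distance" item) and the function name `composite_score` are shorthand introduced here, not names from the study.

```python
from statistics import mean

# Shorthand keys for the five items in Section 2.4 (labels are assumptions).
ITEMS = ("pleasant", "natural", "trustworthy", "closeness", "satisfying")


def composite_score(ratings: dict) -> float:
    """Return the mean of the five item ratings for one dialog.

    Raises ValueError if an item is missing or outside the 1-7 scale.
    """
    missing = [item for item in ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"missing items: {missing}")
    for item in ITEMS:
        if not 1 <= ratings[item] <= 7:
            raise ValueError(f"{item} must be between 1 and 7")
    return mean(ratings[item] for item in ITEMS)


# Illustrative ratings for a single dialog (not data from the study).
example = {"pleasant": 6, "natural": 5, "trustworthy": 6,
           "closeness": 4, "satisfying": 6}
print(composite_score(example))
```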

The descriptive statistics of the composite scores of subjective evaluations are shown in Table 4. We performed a repeated measures analysis of variance (ANOVA) on the subjective evaluation scores with a 3 (CA speed: slow, medium, and fast) × 2 (gender: male and female) design across the three user speeds (slow, medium, and fast). The main effect of CA speed was significant (F(2, 446) = 70.453, p < 0.001, partial η2 = 0.240), consistent with the trend in the average scores (H1). The interaction between user speed and CA speed was also significant (F(4, 892) = 8.831, p < 0.001, partial η2 = 0.038) (H1), as was the interaction between CA speed and gender (F(2, 446) = 13.290, p < 0.001, partial η2 = 0.056) (H3). There was no significant interaction effect between user speed and gender. The mean scores for CA speed and user speed showed that medium-speed responses were rated higher than fast and slow responses (5.592 compared to 5.292 and 4.677, respectively).

Table 4 Mean scores and standard errors of the subjective evaluation
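The reported partial η2 values can be recovered from the F statistics and their degrees of freedom, using the standard identity partial η2 = (F × df_effect) / (F × df_effect + df_error). The sketch below applies this identity to the three significant effects in Section 3.1; the function name is introduced here for illustration.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Compute partial eta squared from an F statistic and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# (label, F, df_effect, df_error) as reported in Section 3.1.
reported_effects = [
    ("CA speed", 70.453, 2, 446),
    ("user speed x CA speed", 8.831, 4, 892),
    ("CA speed x gender", 13.290, 2, 446),
]

for label, f_value, df_effect, df_error in reported_effects:
    value = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: partial eta squared = {value:.3f}")
```

Rounded to three decimals, the results agree with the reported effect sizes (0.240, 0.038, and 0.056).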

Communication adaptation theory is supported by these mean values. Figure 1 shows that when users spoke slowly, the CA that used the medium speed (Mean = 5.669, SD = 1.150) had higher scores than those that used the slow speed (Mean = 4.779, SD = 1.463) and the fast speed (Mean = 5.172, SD = 1.280). When users spoke at a medium speed, the CA that used the medium speed (Mean = 5.596, SD = 1.230) also had higher scores than those that used the slow speed (Mean = 4.613, SD = 1.483) and the fast speed (Mean = 5.264, SD = 1.265). When users spoke quickly, the scores for the CA’s medium speed (Mean = 5.537, SD = 1.277) and fast speed (Mean = 5.465, SD = 1.172) were similar, and the slow speed (Mean = 4.480, SD = 1.489) had the lowest scores. These results indicate that when users speak quickly with the CA, they expect the CA to reply equally quickly (H2). Thus, our findings indicate that when users speak faster, they prefer a corresponding increase in the speed of the CA, but there is an optimal level of feedback speed.

Fig. 1 Interaction effect between the CAs’ speed and users’ speed. Note: CA = Conversational Agent

Gender differences also appeared in the experimental results. As shown in Fig. 2, male participants' preference for the medium feedback speed did not change significantly as their own speed increased. Female participants showed a significant preference for conversational agents to adopt a feedback rate of speech that matched their own speech rate. Comparing male and female users, male users were more tolerant of feedback speed adjustments than female users, as shown by their significantly higher evaluation scores for medium and slow speeds. Male and female users showed the opposite pattern when the feedback speed was fast, with male users being less receptive to fast feedback than female users at all three user speeds.

Fig. 2 Interaction effect between CA speed, user speed, and gender. Note: CA = Conversational Agent; UX-CAX = User and Conversational Agent adopt the speed level X, where X is F, M, or L, denoting Fast, Medium, and Slow

3.2 Forced choice

We analyzed the forced-choice results with descriptive statistics, and they were nearly identical to the scoring results. When the user spoke slowly, participants were most likely to select the medium feedback speed; slow feedback was the second most frequent choice, and acceptance and tolerance of slow feedback were the highest among the three user speeds (Perc Choice slow = 28.7%, Perc Choice medium = 49.3%, Perc Choice fast = 22.0%). When speaking at a medium speed, users were more likely to choose a medium feedback speed (Perc Choice slow = 11.6%, Perc Choice medium = 66.1%, Perc Choice fast = 22.3%). When users spoke quickly, they also preferred a medium feedback speed, but their preference for fast speed increased, and their tolerance for slow speed was the lowest among the three user speeds (Perc Choice slow = 10.7%, Perc Choice medium = 54.2%, Perc Choice fast = 35.1%). Thus, the results of the scoring experiment were confirmed.

Regarding gender, results showed that both male (Perc Choice slow = 18.8%, Perc Choice medium = 56.4%, Perc Choice fast = 24.8%) and female participants (Perc Choice slow = 37.8%, Perc Choice medium = 42.8%, Perc Choice fast = 19.4%) preferred medium-speed responses when speaking slowly. However, men were much less tolerant of slow speeds than women. When male (Perc Choice slow = 9.1%, Perc Choice medium = 66.1%, Perc Choice fast = 24.8%) and female participants (Perc Choice slow = 13.9%, Perc Choice medium = 66.1%, Perc Choice fast = 20.0%) communicated at a medium speed, both also preferred medium-speed responses. However, men chose fast-speed responses more often than women. When male (Perc Choice slow = 10.3%, Perc Choice medium = 53.9%, Perc Choice fast = 35.8%) and female participants (Perc Choice slow = 11.1%, Perc Choice medium = 54.4%, Perc Choice fast = 34.4%) communicated quickly, both again preferred medium-speed responses. However, the proportion of female participants choosing fast feedback gradually increased as user speed increased. Thus, we conclude that males have a lower tolerance for slow speeds when using CAs than females. These results agree with the scoring results above.
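The "Perc Choice" values reported in this section are shares of forced-choice selections per feedback speed. A minimal sketch of how such shares could be tallied is shown below; the function name `choice_percentages` and the example selections are illustrative assumptions and are not data from the study.

```python
from collections import Counter

LEVELS = ("slow", "medium", "fast")


def choice_percentages(choices):
    """Return the percentage of selections per feedback speed, to one decimal."""
    if not choices:
        raise ValueError("choices must not be empty")
    unknown = set(choices) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown speed levels: {sorted(unknown)}")
    counts = Counter(choices)
    total = len(choices)
    return {level: round(100.0 * counts[level] / total, 1) for level in LEVELS}


# Illustrative selections for one user-speed condition (not study data).
example_choices = ["medium"] * 5 + ["fast"] * 3 + ["slow"] * 2
print(choice_percentages(example_choices))
```

Because each share is rounded separately, the three percentages for a condition may sum to slightly more or less than 100%.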

3.3 Interview results

We processed the interview data of the 25 participants using thematic analysis and obtained the following opinions. Users' feelings about using CAs with different speeds varied markedly, but some relatively similar conclusions emerged. Details are shown in Table 5. Regarding feedback speed, users who preferred medium and fast CA speech typically expected CAs to reply at medium or fast speeds, mainly when they themselves communicated at medium and fast speeds. In scenarios with high information content, such as recipes and news, users think that fast speech will be uncomfortable and unintelligible; in scenarios with low information content, complex tasks, or detailed interactions, such as weather and music, users think that slow speeds will make people lose patience. Regarding language continuity, some users said that slow responses would be more difficult to understand and that fast responses would make it easier to identify the approximate meaning of the feedback.

Table 5 Opinions about the rate of speed of conversational agents

Regarding the importance of speed in human–computer interaction, all participants reported that speed is more important in human–computer interaction, particularly when dealing with unfamiliar information. Some users thought that different speeds represented different emotional expressions and affected users' emotions: CA answers that were too fast would make people feel rushed, whereas responses that were too slow would make users feel irritable and impatient. Some users also suggested that different feedback speeds should be provided for different ages and needs.

Regarding CA speed adjustment, more than 80% of the participants wanted the speed to be adjustable. Half of the participants hoped that it could be adjusted automatically according to personal habits and speed, whereas others thought the speed should be adjusted according to the usage scenario. Participants typically wanted the CAs to reply at a speed similar to their own rather than at a fixed speed.

Regarding acceptance of CAs’ speed, 35% of the participants thought it was more acceptable for CAs to speak more slowly than themselves rather than faster. The remaining 65% of the participants thought it was acceptable for CAs to speak faster than themselves.

4 Discussion

This study investigated the effect of conversational agents’ feedback speed on user experience in interaction. According to the results of the quantitative analysis, users prefer CAs to reply at medium speed. This result held both when the evaluation scores for CAs’ feedback speed were analyzed separately for each user speed and when the CAs’ feedback speeds were considered together. These results confirm that CAs can improve user experience through speed adjustment [36] and strengthen the application of linguistics in CAs.

4.1 Application of communication adaptation theory

When users speak more quickly, they tend to expect the CA to increase its feedback speed, although they also believe there is an optimal feedback speed. This trend supports the communication adaptation theory [23]. Users' responses to CAs’ feedback speed indicate discourse convergence in line with linguistic and interpersonal communication principles. As users' speed increases, their preference for fast CA feedback increases. Users adjust their speed to match the system more when they perceive it as immature and limited in capacity than when they perceive it to be relatively mature and capable [37]. This results in a gradual convergence between the users' speed and their choice of feedback speed for CAs, which is medium speed.

4.2 Optimal speech rate of CA

CAs are defined by most people as assistants, with voice interaction moving toward an "intelligent personal assistant" [38]. In this study’s experiments, users tended to humanize the speed of CA feedback, which reflects the expectation of humanizing intelligent machine personal assistants. The expectation of medium-speed responses may be due to the status that users assign to CAs. Most people define CAs as assistants and, in addition to holding anthropomorphic expectations, expect assistants to have a lower status than themselves. Studies have reported that people are satisfied with ordering others and directing their behavior [39]. While speed can be used as a general cue to enhance trustworthiness, fast speeds can enhance persuasiveness [4] for CAs with lower status than the user, and faster speeds are associated with extraversion and dominance [40, 41]. Additionally, in some suggestion-based interactions, faster speeds can make people feel ordered and persuaded, which is incompatible with the CAs' status. Replying at a slow speed can make users feel rushed and disrespected when they are eager to obtain answers and results.

4.3 Gender difference

We found that male and female users have different feelings and choices regarding CAs’ feedback speed. Female users are more tolerant of CAs speaking at a slow speed than male users. We explain this phenomenon by differences between males and females in their purpose for using CAs and in their psychological demands. With regard to purpose, several studies have demonstrated that men are generally outcome-oriented and more focused on achieving desired outcomes [42]. Male participants wanted to communicate as quickly as possible to complete the conversation. Coffman and Marques [43] note that men are more action-oriented in their use of language, while women are more relationship-oriented. Men think that CAs only answer questions and execute their instructions, and they are unwilling to accept CAs’ attempts at different speech rates. Women, in contrast, focused on the process of building a relationship with the CA in communication, so they have a greater tolerance for the CA at slower speeds. Another widely held explanation is that men are more competitive in their use of language, while women are more cooperative [44]. Ni [45] classified male communication styles as power-oriented, self-equipped, and purpose-focused and female styles as relationship-oriented, interpersonal-oriented, and curious. The main purpose of male communication in this process is to support their prominence [43]. Therefore, women have a higher status perception of the CA [46], whereas men have lower status perceptions of CAs [47]. This also contributes to the difference in men's and women's acceptance of the CA’s speed. Metze et al. [48] highlight the importance of tailoring dialogues in human–computer communication, especially regarding gender. Christou and Parmaxi [49] also argued that designing gender-sensitive tools should take into account the differences between men and women and strive to provide equal participation of all genders. Therefore, personalized designs should be implemented for both male and female users in the future.

4.4 Limitations and future studies

This study has several limitations. First, the participants were all Chinese, and we tested speech rates only in Putonghua. The generalizability of the findings to other languages remains to be verified. Similarly, the acceptance of fast, medium, and slow speeds across cultures and contexts is an interesting and meaningful topic. In addition, the scale used in the study was relatively basic, and we analyzed and evaluated only the overall scores of the scale. Future studies can focus on developing or validating more scales to evaluate the user experience of CAs in terms of different dimensions, such as trust and satisfaction.

5 Conclusion

The results of this study can be used to propose a new paradigm in the field of human–computer interaction by introducing the theory of communication adaptation to the problem of feedback speed adjustment in conversational agents and exploring the role of speed in human–computer interaction. We argue that the rate of speech plays a strong role in human interaction and affects the emotions and attitudes of both communicators. The rate of speech also affects the perceptions and satisfaction of users in human–computer interactions. With the widespread use of CAs and virtual social media, this study provides a reference for communication speed to solve the problem of virtual intelligence anthropomorphism. We find that users prefer CAs to respond at a medium speed level. Their responses are most acceptable to users when they are adjusted to the user's speed, gender, and usage scenario. These results confirm the principle of the need for personalized design in human–computer interaction and virtual network interaction.