Within the broad and complex question of how emotions and mood affect learning and cognitive processes, recent research has focused on how multimedia materials impact learners’ affective state and learning outcomes (e.g., Heidig et al. 2015; Liew and Tan 2016; Mayer and Estrella 2014; Park et al. 2015; Plass et al. 2014; Um et al. 2012). Multimedia refers to instructional content delivered through textual and pictorial information (Mayer 2001, 2014). These studies generally consist of an expository text on a scientific topic (for example, how vaccines work) after which participants answer multiple choice comprehension questions and transfer tasks (for example, to explain why a previously affected person does not need the vaccine). Affective influences are either induced prior to learning (for example, with the Velten procedure, or emotional videos), or embedded in the learning materials through esthetic elements such as color, shape, faces in anthropomorphic characters, or animations. Esthetic elements are defined as non-instrumental affective and experiential appeals by means of layout, content, structure, or website design (Hassenzahl and Tractinsky 2006; van Schaik and Ling 2009). Such qualities are assumed to lead to an affective response in the user (e.g., pleasure, satisfaction) and to behavioral responses (e.g., approach, avoidance) (Hassenzahl 2004; Heidig et al. 2015).

Emotions, in these studies, are understood and manipulated from the perspective of core affective states, valence and arousal (Bradley et al. 2001; Lang and Davis 2006). Valence varies along the positive/negative, or pleasure/displeasure, dimension, reflecting neuropsychological motivational systems of appetitive and reward approach, and escape, flight, or avoidance responses. Arousal varies from low to high and expresses the amount of psychophysiological activation induced by either of the two neural systems (Bradley 2009; Lang and Davis 2006). According to Lang (1995), the dimensions of valence (pleasant-unpleasant) and arousal (active-calm) allow us to represent the totality of affective expression in a two-dimensional space defined by these two main axes (see also Jackson et al. 2019). The dimension of dominance (controller-controlled) was added later (Lang and Davis 2006), reflecting the level of control over the emotional response. This dimension proved to be highly correlated with valence, because the feeling of control tends to increase in pleasant situations and decrease in the presence of hostile events (Bradley 2009). In summary, the fundamental parameters that define the emotional experience are the dimensions of valence and arousal, while the individual’s degree of control over their emotional experience allows emotional states to be described with greater precision.

Overall, studies have found a facilitating effect of positive affect (Heidig et al. 2015; Liew and Tan 2016; Mayer and Estrella 2014; Park et al. 2015; Plass et al. 2014; Schneider et al. 2016; Um et al. 2012). For example, Um et al. (2012) induced positive or neutral emotions, and then showed a positive (warm colors and anthropomorphic faces) or neutral (gray-scale informative) design of the learning material. The positive emotional design resulted in higher comprehension and transfer. Using similar materials, Plass et al. (2014) obtained a different pattern in comprehension and transfer, but overall, warm colors and round face-like shapes that induced positive emotions facilitated cognitive outcomes. This effect of positive emotional design has been attributed to intrinsic motivation, which would increase mental effort in the task (Heidig et al. 2015; Liew and Tan 2016; Plass et al. 2014; Um et al. 2012).

Negative affect has been manipulated through prior induction with the Velten procedure (Liew and Tan 2016), or through color and site esthetics (Heidig et al. 2015), and has shown a detrimental effect on performance. Liew and Tan (2016) found that learning transfer was impaired under the influence of negative mood, and that ending the task mitigated the negative affect. They argued that although negative mood led to worse performance, it encouraged learners to invest higher mental effort, presumably as a mood repair mechanism (Forgas 2013). For their part, Heidig et al. (2015) found that the negative design produced a slight decrement in cognitive outcomes.

The classic account of the effect of arousal on learning describes an inverted U-shaped curve: there is an optimal range of arousal, neither too low nor too high, at which performance is best (Yerkes and Dodson 1908). The effects of arousal on memory and learning have been characterized by numerous studies (for an overview, see Kensinger 2004; Schneider et al. 2019). Arousing stimuli might act as an alert signal, attract the focus of attention, and induce a fast response; they are also remembered better. For example, in multimedia learning, high arousal induced by video clips was found to benefit learning in a recall test, as well as to increase motivation and mental effort ratings (Chung et al. 2015). However, higher states of arousal, coupled with negative stimuli, can signal an alarm state which results in decreased learning performance (e.g., Eysenck et al. 2007; Zeidner 2007).

In contrast to the emotional design literature, Harp and Mayer (1997, 1998) and Mayer (2014) argued that increasing interest in learning materials through emotionally appealing but irrelevant illustrations would lead readers to pay more attention to the irrelevant images, which could also interrupt the reader’s mental model construction, and so hinder comprehension. Indeed, including seductive images with decorative and emotional, but not informational, value has been associated with decreases in expository text comprehension, memory, and learning (e.g., Abercrombie et al. 2019; Eitel et al. 2019; Harp and Mayer 1997, 1998; Park and Lim 2007; Sanchez and Wiley 2006; Saux et al. 2015; Strobel et al. 2019; Wiley 2019). This replicated finding (see meta-analysis by Rey 2012) has been termed the seductive detail effect. For example, in Saux et al. (2015), a text explained a mechanism (how aerogels work), with or without a picture depicting the person who invented it. College students remembered less about the mechanism when the picture was included. The seductive detail effect is thought to occur because emotionally appealing irrelevant images distract attention from relevant information, and/or disrupt the chain of thought, leading to incomplete or incoherent representations of the text, or activate incorrect schemas from long-term memory and induce comprehension errors (Harp and Mayer 1997, 1998; Mayer 2014; Rey 2012). The attentional hypothesis has received more support (Rey 2012; Sanchez and Wiley 2006).

The seductive detail effect was not found in all studies; it seems to be moderated by cognitive (previous knowledge, working memory capacity, spatial ability) and motivational aspects (Rey 2012). In particular, Schneider et al. (2016) studied decorative pictures with positive, context-related content (e.g., happy people studying) presented alongside expository texts, and found that these pictures did not have detrimental effects on recall and learning. In line with the emotional design perspective, and following Magner et al. (2014), they considered that some decorative pictures might have positive motivational properties and activate implicit schemas that help learning instead of acting as seductive details.

Schneider et al. (2019) suggested that the seductive detail effect on multimedia learning can be moderated by the students’ level of arousal. Participants studied texts with or without seductive details (provided by irrelevant sentences, not images), and their perceived state of arousal was manipulated through false feedback indicating a low or high heartbeat rate, while their real heartbeat rate was also measured. Seductive details hindered recall and transfer for participants in a perceived low arousal state, but not for those in a perceived high arousal state. However, neither the real heartbeat rate nor the electro-dermal activity varied as a function of the perceived heart rate. The authors speculated that because these participants perceived they were in a high arousal state, their cognitive resources were also high overall during the task, so the cognitive or attentional load caused by seductive details in other studies did not affect performance under high perceived arousal. Schneider et al. (2019), in line with the emotional design perspective, concluded that design and context elements that increase learners’ arousal can enhance attention and cognitive resources, and therefore learning.

In summary, in multimedia comprehension, the emotional design perspective supports the inclusion of graphical elements (e.g., colors, images such as faces) to enhance positive affect, and the avoidance of graphical design inducing negative states (e.g., plain text, grayscale graphs) (Heidig et al. 2015; Liew and Tan 2016; Mayer and Estrella 2014; Park et al. 2015; Plass et al. 2014; Schneider et al. 2016; Um et al. 2012). As for the arousal dimension, arousing stimuli might act as an alert signal, contributing to attention and memory (e.g., Chung et al. 2015; Schneider et al. 2019), but higher states of arousal, alone or coupled with negative stimuli, might negatively affect performance. On the other hand, the seductive detail literature has found a negative effect of attractive but irrelevant images on comprehension and learning (e.g., Harp and Mayer 1997, 1998; Park and Lim 2007; Sanchez and Wiley 2006; Saux et al. 2015), an effect that also seems to be moderated by perceived arousal (Schneider et al. 2019). The present study addresses the effects of presenting decorative images selected for their valence (positive or negative) and arousing properties in multimedia comprehension. Also, it should be noted that both perspectives have focused on comprehension of expository texts. This study extends this line of research to comprehension of procedural multimedia.

Regardless of the setting, in academia, at work, or in everyday contexts, there is another type of text in which a set of instructions guides a certain goal: procedural text, such as the one found in instructions on how to complete forms, to present papers or exams, to follow directions, to assemble objects, to cook a meal, and in similar tasks. Procedural text consists of a series of steps that lead to a final product. To understand procedural text, a reader must construct a coherent and integrated mental model, in which goals, sub-goals, and actions are represented and updated in a strategic, execution-oriented representation (Brunyé et al. 2006; Diehl and Bergfeld Mills 2002; Irrazabal et al. 2016). Research on procedural text has emphasized the advantage of including images to depict the elements and steps in sequential fashion, over text-only presentations, in memory and assembly tasks (Brunyé et al. 2006, 2007; Diehl and Bergfeld Mills 2002; Gyselinck et al. 2008; Irrazabal et al. 2016; Zacks and Tversky 2003). The present study addressed the emotional effect of graphical elements in procedural multimedia.

Furthermore, this study sought to disentangle effects of valence and arousal. Previous studies on emotional design (Heidig et al. 2015; Liew and Tan 2016; Mayer and Estrella 2014; Park et al. 2015; Plass et al. 2014; Um et al. 2012) have not differentiated between the two dimensions. When inducing an affective state prior to learning, they have confounded valence with high arousal. For example, in Plass et al. (2014), the “neutral” mood induction consisted of a “chillout” video, which, it could be argued, induces a positive and low arousal state. In the experimental learning session, affect was manipulated with esthetic elements, which were meant to be subtle. Thus, effects of valence and arousal could not be disentangled in previous studies. The present work systematically varied the stimuli along the valence and arousal dimensions.

Thus, the present study has addressed the effect of decorative graphical elements with affective properties (high/low in valence and arousal) in procedural multimedia comprehension. First, a pilot study determined the images’ affective properties. Pictures depicting construction scenes, workers, and landscapes were selected from the International Affective Picture System (Lang et al. 2008) and Internet searches. Participants rated them in terms of their valence and arousal according to the IAPS scoring procedure (Lang et al. 2008). In this way, stimuli high or low in both dimensions were selected and introduced in the second study, which sought to explore the images’ effect on multimedia instructions. The instructions consisted of a series of steps to assemble a LEGO™ object, in a succession of screens. Outcome variables were processing indexes (mean study time per screen) and assembly accuracy in recall. The emotional design framework would predict better performance with positive images, and worse with negative ones, relative to a control condition. On the other hand, the seductive detail perspective would predict that decorative images with emotional but not informational value negatively affect performance regardless of their emotional valence. In addition, the study assessed the effect of arousal as different from valence.

Pilot Study

The aim of this study was to select the emotional images for the experimental study, which required images with the following emotion-inducing combinations: positive/high arousal, positive/low arousal, negative/high arousal, negative/low arousal.

To that end, the International Affective Picture System (Lang et al. 2008) and its scoring system were employed. The IAPS includes a standardized set of photographs to elicit emotions in experimental contexts, widely used internationally (Bradley and Lang 2007). The system consists of images along with a rating system for valence, arousal, and dominance affective dimensions (the Self-Assessment Manikin, Lang 1980). After observing each photo for a brief period, participants rate them according to their affective dimensions in Likert graphical, non-verbal scales (Bradley and Lang 1994).

Given that the task to be performed along the images in the subsequent study would be to assemble an object, images shown within the task had to be construction-related, but not actual objects or assembly graphs. Therefore, construction scenes or related images (workers, landscapes) were selected from the IAPS original sets. In addition, an online search for images with the keywords “construction” and “assembly” was performed, in order to include more images to be tested in case the IAPS images did not result in the valence and arousal combinations for the experimental study.

Another question regarding image selection for the experimental study was the total number of images to be selected. The experimental task would require participants to watch a series of steps (an instruction) to complete an object assembly. In a previous study with the same instructions as materials, time spent watching each step ranged from 4 to 8 s approximately (Irrazabal et al. 2016). However, according to Lang et al. (2008), an image has to be presented at least 6 s to induce an emotional response. For this reason, we decided to include only one image across the instruction steps. Therefore, the final selection would comprise four images, each for one of the four combinations of valence and arousal.

The set of sixty pictures was rated for their valence and arousal, by a sample of participants with similar characteristics to the second study, to determine the final picture set.

Method

Participants

Sixty undergraduate psychology students from a private university in Buenos Aires, Argentina (Mage = 21.36, SD = 4.76; 49 women, 11 men) completed the study. Participants signed an informed consent before taking part in the study and were debriefed after completing the procedure. The study was approved by an institutional ethics committee.

Materials

From a web search (keywords “construction” and “assembly”), 18 color images were selected, depicting people in construction situations (construction workers, kids assembling toys, and similar). Additionally, 42 IAPS images depicting people, buildings, and landscapes were selected. The picture set was prepared to be shown according to IAPS guidelines, along with a booklet to rate images with the SAM (Lang et al. 2008).

Procedure

In small group sessions (20 participants per group), lasting approximately 30 min each, participants were shown each image and rated it in terms of affective dimensions. Images were implemented in a PowerPoint presentation and projected onto a blank screen (150 × 120 cm). Four images (not construction-related) were used to demonstrate and practice the task. Each image was preceded by a fixation signal, after which the image was presented and attended to for 6 s, followed by 15 s in which participants rated valence, arousal, and dominance on the corresponding manikin. Dominance ratings were included to maintain the standard IAPS administration procedure, but given the goals of the study, they were not further analyzed. The order of presentation was varied across groups to counterbalance order effects in ratings.

Results and Discussion

First of all, validity of ratings obtained in this sample was tested. For this purpose, ratings for the pre-selected subset of 42 IAPS images obtained in this sample were compared to the same images’ ratings in the Argentinean standardization study (Irrazabal et al. 2015). Neither valence nor arousal differed significantly (paired t tests p > 0.05). Table 1 shows descriptive values (M, SD) for valence and arousal ratings, in the standardization sample and in this study, and their statistical comparison. These results imply that overall ratings in this sample do not differ from standardization values, thus validating these ratings.
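The comparison above rests on paired t tests across the 42 IAPS images. As a minimal sketch of how such a statistic is computed, assuming each image contributes one pair (its mean rating in this sample and its mean rating in the standardization sample), the function below returns the t value and its degrees of freedom. The function name and the example ratings are hypothetical and are not the study's data.

```python
import math
from statistics import mean, stdev

def paired_t(sample_ratings, norm_ratings):
    """Return (t, df) for a paired t test on two equal-length rating lists.

    Each position is assumed to hold the mean rating of the same image
    in the two samples (an assumption of this sketch).
    """
    if len(sample_ratings) != len(norm_ratings) or len(sample_ratings) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(sample_ratings, norm_ratings)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical valence means for four images (illustration only).
t_value, df = paired_t([5.0, 6.0, 7.0, 8.0], [4.0, 6.0, 6.0, 8.0])
```

The p value would then be read from the t distribution with the returned degrees of freedom, which is left to a statistics package here.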

Table 1 Descriptive values (M, SD) for valence and arousal ratings in the standardization sample and in this study, and statistical comparison

For all pre-selected images (42 from IAPS, 18 new), valence ratings varied from 3.13 (Min) to 7.85 (Max); arousal from 2.67 (Min) to 5.83 (Max). The experimental study needed images with positive or negative valence, and high or low arousal. To select these images, cutoffs were set around the median valence and arousal ratings of the 60 images, excluding images rated at the median itself. The resulting cutoffs were ≥ 5 for high and ≤ 4.5 for low on both dimensions.
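The selection rule can be sketched in a few lines, assuming the two cutoffs reported above (high ≥ 5, low ≤ 4.5) are applied independently to the mean valence and mean arousal of each image. The function names are hypothetical, and the ratings in the usage lines are made up for illustration.

```python
HIGH_CUTOFF = 5.0  # reported cutoff for "high" on both dimensions
LOW_CUTOFF = 4.5   # reported cutoff for "low" on both dimensions

def level(rating):
    """Classify one mean rating as 'high', 'low', or None (between cutoffs)."""
    if rating >= HIGH_CUTOFF:
        return "high"
    if rating <= LOW_CUTOFF:
        return "low"
    return None

def classify_image(valence, arousal):
    """Return (valence label, arousal label), or None if the image
    falls between the cutoffs on either dimension."""
    v, a = level(valence), level(arousal)
    if v is None or a is None:
        return None
    return ("positive" if v == "high" else "negative", a + " arousal")

# Hypothetical ratings, not taken from Table 2.
print(classify_image(7.0, 5.5))  # eligible as positive, high arousal
print(classify_image(4.8, 3.0))  # valence between cutoffs: not eligible
```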

Images with the following combinations were then selected: high valence (positive), high arousal; high valence (positive), low arousal; low valence (negative), high arousal; low valence (negative), low arousal. The final selection included four images, three from the online search and one from the IAPS (ID 2399). Their content was as follows: a father and a boy playing with blocks (positive, high arousal), a working carpenter (positive, low arousal), an architect crying with despair (negative, high arousal), a woman with facial expression of suffering or pain (negative, low arousal). Their ratings are shown in Table 2.

Table 2 Valence and arousal ratings of the selected emotional images

Thus, this study provided validation for the affective (valence and arousal) inducing value of the images to be employed in the following experiment.

Experimental Study

This study examined the effect of decorative graphical elements with affective properties in procedural multimedia comprehension. Participants watched instructions consisting of three successive steps to assemble a LEGO™ object, with or without decorative images. Images depicted construction or landscape situations, and had positive or negative valence, and low or high arousal ratings according to the previous study. Outcome variables were study time per screen and assembly accuracy. Study time per screen consisted of the time each participant spent watching each of the screens that showed each of the steps of the instruction, which had to be remembered. Assembly accuracy was defined as the number of errors committed by the participant in the execution of the instruction, from memory after each instruction was presented. Both sequence and position errors were registered. Sequence errors consisted of instances in which the participant did not respect the sequence order of the instruction, for example, placing a part in the first place when it should have been after another. Position errors were those in which the participant inserted the LEGO™ block in a location different from that indicated by the instruction.

The emotional design framework would predict better performance with positive images, and worse with negative ones, relative to a control condition. On the other hand, the seductive detail perspective would predict that including a decorative image with emotional but not informational value negatively affects performance, without taking into account valence.

Method

Participants

Fifty-seven undergraduate psychology students (Mage = 24.07, SD = 8.80; 46 women, 11 men) volunteered to take part in the study in exchange for partial course credit. Participants signed an informed consent before taking part in the study and were debriefed after completing the procedure. The study was approved by an institutional ethics committee.

Materials and Design

Fifteen instructions consisting of a series of steps to assemble a LEGO™ object were employed. For each instruction, the complete object was achieved after three steps. Each step instructed on how to put together two parts of the object, until the object was completely assembled.

Instructions were presented on a computer screen, using a self-paced method. All sequences were presented using Paradygm Software, at 300 × 300 pixel resolution, with 14-point Times New Roman font. Steps were presented one at a time, clearing the screen when the step was accomplished.

Instructions were presented in three formats: (i) text-only; (ii) diagram-only; and (iii) multimedia (diagram + text) (Irrazabal et al. 2016). In the present experiment, each participant saw 15 instructions in only one presentation format. In all conditions, the screen was split horizontally into two segments. The diagram-only condition presented instructions in pictorial format (a diagram with a picture of LEGO™ pieces and arrows showing how to assemble the pieces) duplicated on the screen, one in each screen segment (see Fig. 1). The text-only condition presented instructions in verbal format duplicated on the screen (one in each screen segment, see Fig. 2). The multimedia condition presented instructions combining text and image, one format in each screen segment (see Fig. 3). Both screen segments provided the same information twice, simultaneously, in order to control the repetition inherent in multimedia (Brunyé et al. 2006, 2007).

Fig. 1 Picture-only instructions

Fig. 2 Text-only instructions

Fig. 3 Multimedia instructions

Each block consisted of five types of emotional condition: (i) without emotional images; (ii) positive valence, high arousal image; (iii) positive valence, low arousal image; (iv) negative valence, high arousal image; (v) negative valence, low arousal image. The emotional images were approximately the same size as the diagrams depicting the parts of the object to be assembled and placed along the middle axis to the right of the screen. Each type of emotional condition showed one image that appeared constantly along the three steps of the instructions. As described previously, their content was as follows: a father and a boy playing with blocks (positive, high arousal), a working carpenter (positive, low arousal), architect crying with despair (negative, high arousal), woman with a headache (negative, low arousal).

Data Collection Procedures

In individual sessions, participants were asked to watch and try to remember series of steps, in order to assemble LEGO™ objects. First, they completed three training instructions, with supervision and feedback from a research assistant. When the participant had learned the task, he or she saw and executed the set of experimental instructions. Each participant viewed fifteen instructions in only one presentation format (e.g., either text-only, diagram-only, or text + diagram). Five blocks of instructions were designed. Each block contained three assembly instructions with an emotional image type (no emotional, positive high, positive low, negative high, negative low). Then, for the presentation, the order of the instructions was randomized.
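The block structure described above (five emotional conditions, three instructions each, presented in random order) can be sketched as follows. How instructions were assigned to conditions is not stated in the text, so assigning consecutive triplets is an assumption of this sketch, as are the function name and the condition labels.

```python
import random

# Labels are illustrative; the five conditions themselves come from the text.
CONDITIONS = [
    "no image",
    "positive/high arousal",
    "positive/low arousal",
    "negative/high arousal",
    "negative/low arousal",
]

def build_trial_list(instruction_ids, rng):
    """Pair 15 instructions with 5 conditions (3 each) and shuffle the order.

    Assumption: consecutive triplets of instructions share a condition.
    """
    if len(instruction_ids) != 3 * len(CONDITIONS):
        raise ValueError("expected 15 instructions (3 per condition)")
    trials = [(instr, CONDITIONS[i // 3])
              for i, instr in enumerate(instruction_ids)]
    rng.shuffle(trials)  # randomize presentation order across blocks
    return trials

# A seeded generator makes the order reproducible for a given participant.
trials = build_trial_list(list(range(1, 16)), random.Random(0))
```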

Each instruction had the following sequence: after a 500-ms fixation cross, the first step was presented; the participant self-administered the steps pressing the space bar, until the last step was shown, followed by the word “Assemble.” Along each instruction presentation, participants had the LEGO™ pieces to one side and could see but not touch them until prompted with the word “Assemble.” When the participant finished assembling the object, the next instruction was self-administered by pressing the space bar.

Two research assistants tested the participants. The second research assistant was trained in the assembly of each of the stimuli, and her task was to register participants’ responses in a grid: for each item, sequence and location errors. Sequence errors reflected alterations of the temporal order of the instruction. Location errors indicated an alteration in the spatial location, more specifically, that the participant located the LEGO™ piece in a location or position not indicated by the instruction. Thus, number of errors for each instruction varied between 0 and 6.
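The 0 to 6 range follows if each of the three steps can contribute at most one sequence error and one location error. That reading is an assumption of the sketch below, since the scoring grid itself is not reproduced here; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StepScore:
    """Errors registered for one step of an instruction (assumed layout)."""
    sequence_error: bool  # step executed out of the instructed order
    location_error: bool  # piece placed in a position not instructed

def count_errors(steps):
    """Total errors for one three-step instruction, between 0 and 6."""
    if len(steps) != 3:
        raise ValueError("each instruction has exactly three steps")
    return sum(int(s.sequence_error) + int(s.location_error) for s in steps)

# Example: the second step was misplaced and the third was out of order.
score = count_errors([
    StepScore(False, False),
    StepScore(False, True),
    StepScore(True, False),
])
```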

Data Analysis Procedures

Data analyses examined the effects of presentation modality and emotional images on study times and assembly accuracy, taking into account all individual observations per condition, with general linear mixed modeling (Baayen 2012) carried out with the packages lme4 version 1.1-17 (Bates et al. 2015) and lmerTest version 3.0-1 (Kuznetsova et al. 2017) in R version 3.5.0 (R Core Team 2018).

General linear mixed effects models have benefits over traditional analyses such as ANOVA or linear regression (Baayen 2012). They allow the inclusion of all observations as dependent variables in a single analysis, instead of computing an overall mean per condition, thus increasing statistical power. As dependent variables, we considered all individual study times on each step, and the number of errors for each instruction.

Also, general linear mixed models include random intercept parameters in the design, so they are able to account for baseline differences in the data that contribute variability or confounds to the responses. In our analyses, random intercepts accounted for individual differences in participants’ idiosyncratic reaction time and precision, and for differences due to emotional condition or presentation format when analyzing the effect of presentation format or emotional condition, respectively (i.e., when one factor was analyzed, the other was entered as a random parameter). Participants were entered as random intercepts in all analyses.

In addition, generalized linear mixed models are not bound by the response distribution assumptions of traditional analyses, so a variety of dependent variable distributions can be modeled. For accuracy, which was a small count (0 to 6 errors per instruction), we employed generalized linear mixed models with the Poisson family, which is better suited to count distributions (Baayen 2012). Study times were log transformed and entered in Gaussian models (Baayen and Milin 2010).

The analysis plan contemplated first an examination of the effects of presentation format on time and accuracy, with random intercepts per participant and emotional condition, and then analyses of the effects of emotional images on time and accuracy, with random intercepts for participants and presentation format. Analyses consisted of fitting a model with random intercepts only as baseline models, and then comparing with similar models but including fixed factors. This procedure obtains an overall effect of the fixed factors (comparable with an overall F in ANOVA), and the linear model estimates of each condition effect.
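The nested-model comparison can be illustrated with a short sketch of the likelihood ratio test: the statistic is twice the difference in log-likelihood between the model with fixed factors and the baseline model, referred to a chi-square distribution whose degrees of freedom equal the number of added parameters. The closed-form tail probability used here holds only for even degrees of freedom, which covers the two comparisons in this design (2 df for three presentation formats, 4 df for five emotional conditions). The function names and the log-likelihood values are hypothetical.

```python
import math

def lrt_statistic(loglik_baseline, loglik_full):
    """Likelihood ratio statistic for two nested models."""
    return 2.0 * (loglik_full - loglik_baseline)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability for even df (closed form):
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    half = x / 2.0
    term, total = 1.0, 1.0  # i = 0 term of the series
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical log-likelihoods for a baseline and a full model.
stat = lrt_statistic(-100.0, -97.65)
p_value = chi2_sf_even_df(stat, df=2)
```

For odd degrees of freedom, a statistics package would be needed instead of this closed form.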

Results

Dependent measures included study times during instruction and accuracy during assembly. Exploratory analyses of study times revealed outlier observations (more than 1.5 SD from the mean of the distribution for each presentation condition), which were eliminated. After outliers were removed, some participants had more than 10% missing observations and were therefore excluded from the final sample. Thus, the final sample consisted of forty-six participants (per presentation condition: diagram-only n = 16; text-only n = 15; text + diagram n = 15).
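The two exclusion steps can be sketched as follows. The sketch assumes that the 1.5 SD criterion is applied around the mean of all study times within each presentation condition, and that the 10% rule is computed per participant relative to that participant's own number of observations; the text does not say whether raw or log times were trimmed. Function names and the data layout are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

def trim_outliers(observations, n_sd=1.5):
    """Drop study times beyond n_sd SDs from the mean of their format.

    observations: list of (participant, presentation_format, study_time).
    """
    times_by_format = defaultdict(list)
    for _, fmt, time in observations:
        times_by_format[fmt].append(time)
    bounds = {}
    for fmt, times in times_by_format.items():
        center, spread = mean(times), stdev(times)  # needs >= 2 values
        bounds[fmt] = (center - n_sd * spread, center + n_sd * spread)
    return [obs for obs in observations
            if bounds[obs[1]][0] <= obs[2] <= bounds[obs[1]][1]]

def participants_to_keep(original, trimmed, max_missing=0.10):
    """Keep participants who lost at most max_missing of their observations."""
    n_original = Counter(p for p, _, _ in original)
    n_kept = Counter(p for p, _, _ in trimmed)
    return {p for p, n in n_original.items()
            if (n - n_kept[p]) / n <= max_missing}

# Made-up data: participant "a" has two extreme times out of ten.
data = ([("a", "text", 1.0)] * 8 + [("a", "text", 100.0)] * 2
        + [("b", "text", 1.0)] * 10)
trimmed = trim_outliers(data)
kept = participants_to_keep(data, trimmed)
```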

Table 3 shows descriptive statistics for the mean time per screen (i.e., per step), and number of errors (mistakes of sequence or position), as a function of presentation format (text, diagram, text + diagram), and Table 4 as a function of type of emotional image (positive-high, positive-low, negative-high, negative-low, and no emotional image).

Table 3 Descriptive statistics for mean study time per screen and number of errors in assembly as a function of presentation modality
Table 4 Descriptive statistics for mean study time per screen and number of errors in assembly as a function of type of emotional image

In the first place, we examined the effect of presentation format. For both study times and accuracy, we compared two nested models, a model including presentation format as a factor, and random intercepts for participants and emotional images’ condition, against a model with those random intercepts only. After that, we compared types of presentation format within the model, fitted by restricted maximum likelihood estimation, and Satterthwaite’s method for t tests.

For study times, the model with presentation format and random intercepts improved prediction over the model with random intercepts only; the likelihood ratio test was significant (χ²(2) = 18.4, p < 0.0001). Compared with the diagram presentation, log study times were significantly longer for the text-only presentation (b = 0.279, SE = 0.066, 95% CI [0.149, 0.407], t (43) = 4.21, p < 0.0001), but similar for multimedia (b = 0.022, SE = 0.066, 95% CI [−0.107, 0.151], t (43) = 0.332, p = 0.741).

For accuracy, the model with presentation format did not improve prediction over the model with random intercepts only, according to the likelihood ratio test (χ²(2) = 4.7, p = .095). Compared to the diagram presentation, the number of errors did not differ significantly for the text-only presentation (b = 0.201, SE = 0.222, 95% CI [−0.250, 0.650], z = 0.901, p = .367) or for the multimedia presentation (b = −0.308, SE = 0.229, 95% CI [−0.779, 0.149], z = −1.342, p = .180).

Overall then, the presentation format had an effect on study times, so that text-only instructions were studied more slowly, but did not have significant effects on execution accuracy.

Finally, the effects of emotional images on study time and assembly accuracy were examined. For both study times and accuracy, we first compared nested models, that is, a model including the type of emotional image as a factor, and random intercepts for participants and presentation modality against a model with those random intercepts only. Next, the effects of each type of emotional condition (positive valence/high arousal, positive valence/low arousal, negative valence/high arousal, negative valence/low arousal) were tested and compared to the condition without image within the model, fitted by restricted maximum likelihood estimation, and the Satterthwaite’s method for t tests.

For log study times, the model with the emotional condition and random intercepts for participants and presentation modality was better than the model with random intercepts only, according to the likelihood ratio test, χ2 (4) = 51.36, p < .0001. Relative to the instructions without emotional images, low-arousal stimuli led to shorter study times, irrespective of valence: for positive low, b = −0.136, SE = 0.037, 95% CI [−0.209, −0.064], t (41) = −3.676, p < 0.001; for negative low, b = −0.154, SE = 0.037, 95% CI [−0.226, −0.081], t (41) = −4.142, p < 0.001. High-arousal stimuli did not differ significantly from the no-image condition: for positive high, b = 0.029, SE = 0.035, 95% CI [−0.044, 0.101], t (41) = 0.774, p = 0.439; for negative high, b = 0.043, SE = 0.035, 95% CI [−0.029, 0.115], t (41) = 1.152, p = 0.249.

For the number of errors, the model with the emotional condition and random intercepts for participants and presentation modality was better than the model with random intercepts only, likelihood ratio χ2 (4) = 18.57, p < .001. Relative to the instructions without emotional images, high-arousal stimuli led to more errors, irrespective of valence: for positive high, b = 0.274, SE = 0.126, 95% CI [0.025, 0.526], z = 2.171, p = .029; for negative high, b = 0.315, SE = 0.125, 95% CI [0.068, 0.565], z = 2.521, p = .011. Low-arousal stimuli did not differ significantly from the no-image condition: for positive low, b = 0.018, SE = 0.133, 95% CI [−0.247, 0.284], z = 0.137, p = 0.891; for negative low, b = −0.149, SE = 0.139, 95% CI [−0.428, 0.127], z = −1.070, p = 0.284.
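The likelihood ratio tests reported in this section refer the test statistic to a chi-square distribution whose degrees of freedom equal the number of parameters added by the fuller model. The following is a minimal sketch of that last step only, not the analysis code used in the study: it computes the upper-tail p value with the closed form that holds for even degrees of freedom, which covers the 2-df and 4-df tests above. The function name is hypothetical.

```python
import math


def chi2_upper_tail_even_df(stat: float, df: int) -> float:
    """Upper-tail chi-square probability for a positive, even df.

    Uses the closed form
        P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    which is exact when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    if stat < 0:
        raise ValueError("the test statistic must be non-negative")
    half = stat / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Statistics reported in the text.
p_format_accuracy = chi2_upper_tail_even_df(4.7, 2)     # presentation format, errors
p_emotion_errors = chi2_upper_tail_even_df(18.57, 4)    # emotional condition, errors
p_emotion_times = chi2_upper_tail_even_df(51.36, 4)     # emotional condition, study times

print(f"{p_format_accuracy:.3f}")
print(f"{p_emotion_errors:.5f}")
print(f"{p_emotion_times:.2e}")
```

For odd degrees of freedom, a general routine such as the chi-square survival function in a statistics library would be needed instead of this closed form.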

Overall, the effect of emotional images was associated with their arousing properties, not with their valence. Relative to the instructions without emotional images, those with low-arousal images were studied faster, and those with high-arousal images led to more errors. Figure 4 shows mean study time per condition, and Fig. 5 shows the mean number of errors per condition.

Fig. 4 Mean study time per screen as a function of type of emotional image

Fig. 5 Mean number of errors in assembly as a function of type of emotional image

Discussion

We have examined the effect of decorative graphical elements with affective properties on procedural multimedia comprehension. Participants studied three-step instructions to build a LEGO™ object, with or without decorative images that had controlled emotional characteristics. A first study was performed to select images related to construction themes with particular combinations of valence (positive or negative) and arousal (high or low). Images were rated along the valence and arousal dimensions, and those with positive or negative valence and low or high arousal ratings were employed in the second study. In this latter experiment, the outcome variables were study times and assembly accuracy.

Instructions were presented in different formats: text only, diagram only, and text plus diagrams. As in previous studies (Brunyé et al. 2006, 2007; Diehl and Bergfeld Mills 2002; Gyselinck et al. 2008; Irrazabal et al. 2016), we found that instructions including only textual information were processed at a slower pace than those including diagrams, that is, relevant pictorial information. In the case of instructions, elements and their spatial positions are best depicted, and mentally represented during learning, as diagrams. The text-only format does not easily and directly convey the position of objects, so longer study times in the text-only condition would be necessary to compensate for this. However, unlike in previous studies on multimedia learning, presentation format did not affect accuracy. In fact, in a previous study with the same instructions, we found that multimedia, as compared to text only, led to better assembly accuracy (Irrazabal et al. 2016). The multimedia advantage of textual plus pictorial instructions, relative to other conditions, is attributed to dual-channel processing in both verbal and visuo-spatial working memory (Mayer 2014). Processing a decorative but irrelevant image would draw on the same limited visuo-spatial working memory resources, thus detracting from remembering and learning the relevant diagram. Therefore, the multimedia advantage would be weakened. In addition, including emotional information in all conditions could affect multimedia processing via other mechanisms (Schneider et al. 2016, 2019). In this case, as we argue later, processing was affected mainly by the level of arousal.

According to the emotional design framework (Heidig et al. 2015; Liew and Tan 2016; Mayer and Estrella 2014; Park et al. 2015; Plass et al. 2014; Schneider et al. 2016; Um et al. 2012), performance would be better with positive images and worse with negative ones. Under the seductive detail perspective (Harp and Mayer 1997, 1998; Park and Lim 2007; Sanchez and Wiley 2006; Saux et al. 2015), including a decorative image with emotional but not informational value would negatively affect performance in general, regardless of its emotional valence. Also, the arousing properties of the stimuli must be optimal, because a high level of arousal, especially coupled with negative valence, would lead to worse performance (Chung et al. 2015; Schneider et al. 2019). The present study also adds to the literature on emotional effects on comprehension by manipulating both dimensions, valence and arousal. Most studies in the field have not differentiated these dimensions (Heidig et al. 2015; Liew and Tan 2016; Mayer and Estrella 2014; Park et al. 2015; Plass et al. 2014; Um et al. 2012), inferring the source of the emotional effect (valence or arousal) without systematic control of stimulus characteristics. In this study, prior validation of stimulus properties allowed us to manipulate valence and arousal as experimental factors. In this regard, our study is in line with others that have examined the separate effects of valence and arousal (e.g., Gomes et al. 2013; Mather and Sutherland 2011; Schneider et al. 2016, 2019) and found that they have different effects on cognitive processes (for an overview, see Sakaki et al. 2012).

Overall, we found an effect of emotional images on study times and assembly accuracy. Regarding the emotional design hypothesis, the emotional effect could not be attributed to positive or negative valence. Rather than valence, the relevant emotional dimension was arousal. Low-arousal images made participants study the instructions faster, and high-arousal images led to more errors in performance. As stated in the introduction, arousing stimuli generally attract the focus of attention and are also remembered better. In this case, the arousing stimulus is an irrelevant picture inserted in the to-be-learned material, unlike studies in which the arousing properties are part of the target stimuli to be attended or remembered (Kensinger 2004), or induction studies in which arousal is manipulated externally (for example, with false heartbeat rate, Schneider et al. 2019). Therefore, the beneficial effects of low arousal must be interpreted within this task.

This pattern would argue in favor of the seductive images hypothesis: including an arousing emotional stimulus negatively affects comprehension, regardless of valence. Arousing decorative images could capture attentional resources (Rey 2012; Sanchez and Wiley 2006), as reflected in longer study times, diverting attention from what should be learned and leading to more errors. This study adds to the literature on the seductive details effect, in particular, by showing that an embedded irrelevant picture with arousing properties leads to negative outcomes. Future studies with fine-grained measurement of attentional allocation, such as eye-tracking, could test this hypothesis.

This contrasts with Schneider et al.’s (2019) finding that a perceived high arousal state dampened the seductive detail effect. There are several differences between their study and the one presented here. First, their study required the learning of only textual information, and the seductive detail was also textual. Second, they manipulated the emotional information through false heartbeat rate feedback, and that auditory feedback is why they consider their task a multimedia learning setting. Third, they employed an arousal-inducing task (false heartbeat rate, Strain et al. 2013) specifically designed not to overload attentional and working memory resources. Therefore, we can agree in general with their suggestion that design and context elements that increase learners’ arousal can enhance motivation and learning (see also Schneider et al.’s 2016 argument for decorative images’ motivational properties). In fact, in the present study, low arousal was better for learning as compared with a control condition without an emotional image. There might be “an optimum range of arousal, which is not too low (no activation) and not too high (avoidance behavior), but instead activates learners” (Schneider et al. 2019, p. 73). Also, low and high arousal are relative to the induction method or technique; in this case, the IAPS rating system provides a reference for selecting emotional images. Although Schneider et al. (2019) did not find an association between perceived arousal and physiological measures, future studies combining other psychological measures with physiological trackers of arousal during multimedia learning could address this issue.

These results also suggest that emotional images would dampen the multimedia advantage for instructional materials, given that we did not find an advantage for multimedia presentations relative to those containing only diagrams. However, our study might not have had sufficient power to detect presentation format effects, or their interaction with emotional images, because of the low number of participants per group in this between-subjects multimedia manipulation. Future studies with designs better powered for this specific factor could examine whether emotional images modulate the multimedia effect.

In conclusion, we found evidence against the inclusion of irrelevant, highly arousing, emotional pictures in instructional multimedia materials. Even if an image conveys positive emotion, it can decrease performance, an effect that might be attributed to the attentional distraction from relevant content. Our results suggest, therefore, that in instructional design irrelevant emotional elements should be handled with caution, even when conveying positive emotion. On the other hand, an optimum level of arousal might be needed for efficient learning.