Pictures are commonly used in various problem-solving contexts, including formal testing, routine educational exercises such as classroom practice and homework, and problem-based learning (Greiff et al., 2013; Martin et al., 2016; OECD, 2007; Wu et al., 2015). Such pictures take pictorial forms such as diagrams, photos, animations, and videos. The way in which problems are represented affects individual problem-solving performance (Bodner and Domin, 2000; Elia et al., 2007; Gagatsis and Elia, 2004). In this meta-analytic review, the multimedia effect in problem solving is defined as the phenomenon whereby an individual’s problem-solving performance is enhanced when equivalent pictures are added to illustrate or replace part of the problem text. The multimedia effect in problem solving is assumed to exist in various types of problems across different domains, such as structured multiple-choice problems (e.g., Lindner et al., 2017a, 2017b; Saß et al., 2012), word problems (e.g., Dewolf et al., 2015; Hoogland et al., 2018a), insight problems (Schwert, 2007), and analogy problems (e.g., Beveridge and Parkins, 1987). It covers problem solving for both assessment and learning purposes.

Compared with the multimedia effect in learning or the multimedia principle (Mayer, 2001, 2009, 2014), which has been the subject of extensive research (Butcher, 2014; Rop, 2017; Schüler et al., 2019), the multimedia effect in problem solving remains a relatively new topic of research (Lindner et al., 2018). Fairly recently, a growing number of studies have investigated the extent to which pictures affect an individual’s problem-solving performance, often through randomized experimental designs or qualitative analysis of individuals’ solution strategies. These studies have involved various subjects such as mathematics (e.g., Agathangelou et al., 2008; Berends and van Lieshout, 2009; Dewolf et al., 2017; Elia et al., 2007; Elia and Philippou, 2004; Hoogland et al., 2018a; Ögren et al., 2017; Zahner and Corter, 2010), science (e.g., Hao, 2010; Hartmann and Leutner, 2013; Lindner et al., 2017a, 2018; Lindner et al., 2017b; Maries, 2013; Whitley et al., 2006), and language (e.g., Dindar et al., 2013; Garrett, 2008; Ginther, 2001). However, due to inconsistent findings, it remains unclear how the addition of pictures affects students’ problem-solving performance (Wu et al., 2015). To date, no comprehensive meta-analytic review has been conducted to determine the aggregate multimedia effect in problem solving and its relevant boundary conditions. This study aims to fill this gap and provide insights into the design of text–picture problems.

Theoretical Background

Different isomorphic external representations of a problem may generate dramatically different cognitive processes and problem-solving behaviors, which is referred to as the representational effect (Chuah et al., 2000; Zhang and Norman, 1994, 1995). A problem can be represented in a pictorial (i.e., picture) or descriptive (i.e., word) format (Schnotz and Bannert, 2003). External representations of a problem refer to the presentation of the problem to the problem solver, while internal representations describe how the problem solver constructs the problem in his or her mind, also known as a mental model (Johnson-Laird, 2005). Students can also externalize their internal representations through drawing, writing or speaking in the problem-solving process.

There are currently no specific theories to explain the role of pictures in problem solving (Lindner et al., 2017b). The problem-solving process involves mental model construction similar to the learning context. Thus, many scholars have relied on intensively studied multimedia learning theories to explain the benefits of pictures in problem solving (Jarodzka et al., 2015; Lindner et al., 2016). The most frequently cited multimedia learning theories include the cognitive theory of multimedia learning (CTML) (Mayer, 2001, 2009, 2014), the cognitive–affective theory of learning with multimedia (CATLM) (Moreno, 2006, 2009; Moreno and Mayer, 2007), the integrated model of text and picture comprehension (Schnotz and Bannert, 2003), and cognitive load theory (CLT) (Sweller, 1988; Sweller, 1994). These theories basically assume that individuals construct internal representations of an object by encoding features observed in an environment (Newell, 1990). They attribute the multimedia effect in learning to the advantage of dual-channel processing (Baddeley, 1992; Baddeley et al., 2011; Paivio, 1986), that is, the encoding of texts and pictures through two symbolic systems: auditory/verbal and visual/pictorial. Each system is limited in terms of working memory capacity, but their combined capacity is additive (Chandler and Sweller, 1991) (see Fig. 1). Individuals can access greater processing capacity when materials are presented through the two symbolic systems.

Fig. 1 Explanation of the multimedia effect in problem solving based on multimedia learning theories

Students are also more likely to construct better mental models through dual-channel processing, as visual–spatial and verbal representations complement and constrain each other, and thus deepen students’ knowledge construction (Ainsworth, 2006; Mayer, 2014). Mayer (2009, 2014) further differentiated five cognitive processes underlying multimedia learning: selecting relevant words into verbal working memory; selecting relevant pictures into visual working memory; organizing selected words into a verbal model; organizing selected pictures into a pictorial model; and integrating the verbal model, the pictorial model and activated prior knowledge. However, the added value of integration is conditional and requires additional effort. Therefore, there are boundary conditions for the multimedia effect in learning (Schnotz and Bannert, 2003; Mayer, 2014; Schüler et al., 2019). Although these multimedia learning theories can help explain the multimedia effect in problem solving to some extent, very few scholars have followed multimedia learning principles when designing text–picture problems (Hoogland et al., 2018a).

There are a number of stepwise models that describe the problem-solving process across different domains (e.g., Mayer, 1987; Nathan et al., 1992; Reusser, 1996; Verschaffel et al., 2000; Winter et al., 2006; Zahner and Corter, 2010). The problem-solving process includes not only an internalization process (comprehension phase) but also an externalization process (solution phase), which is different from the learning process (see Fig. 1). Thus, pictures in problem solving have two functions. One function is similar to that in the learning context, that is, serving as mental scaffolds to interpret question items (Lindner et al., 2017a, 2017b) and make full use of working memory capacity (Saß et al., 2012). The other function is to facilitate the extraction of information in the answering process and thus to offload problem solving (Lindner et al., 2017b; Zhao et al., 2014). Pictures are easier to remember and recall than words (Nickerson, 1965; Shepard, 1967) and provide easier access to semantic memory (Paivio, 1986). Thus, pictures can serve as prominent recall cues and facilitate the solution phase.

Instead of assuming that cognition occurs solely in the human brain, the theories of embodied and distributed cognition posit that human cognition is distributed throughout the human brain, the human body and the external environment (Hardy-Vallée and Payette, 2009; Hollan et al., 2000). Such theoretical approaches have been adopted to explain human cognitive processing in interaction with multiple external representations (Pande and Chandrasekharan, 2017). External representations can be a way to offload an individual’s cognitive load by providing ready-to-use representations (Zhang and Patel, 2006). Individuals do not necessarily construct internal representations to mediate actions, but they can act directly on information extracted from external representations (Gibson, 1966, 1979). Therefore, internal and external representations are not differentiated but are instead indispensable and distributed across the human mind and cognitive artifacts (Zhang, 1997; Zhang and Norman, 1994) (see Fig. 2). Pictures are advantageous for expressing visual–spatial information; they are more closely aligned with the mental models of problem solvers (Schnotz, 2014). Thus, task-relevant pictures in problems are effective in scaffolding a student’s problem-solving process. However, the benefits of additional pictorial representations depend on whether students can make full use of these external representations and generate their own by adjusting those provided.

Fig. 2 Explanation of the multimedia effect in problem solving through the embodied and distributed cognition approaches

Measuring the Multimedia Effect in Problem Solving

Two major quantitative cognitive outcomes (i.e., response accuracy and response time) are used to measure the multimedia effect in problem solving. Response accuracy measures how close an answer is to the correct answer or the proportion of correct answers in a large problem set. Response time measures the duration of perceptual, cognitive, and psychomotor activities required to generate a solution (Goldhammer et al., 2014). In addition to studies on quantitative cognitive outcomes, some studies have explored the multimedia effect on an individual’s affective experience or metacognition, including pleasure and engagement in problem solving, preference for pictures, perceived task difficulty, and response certainty (i.e., students’ level of confidence in their response) (e.g., Dewolf et al., 2015; Lindner et al., 2018).

Multimedia Effect on Response Accuracy

Response accuracy has been widely used as an indicator of an individual’s problem-solving performance and is typically measured in relation to the difficulty of an item (e.g., Lindner et al., 2017b). Response accuracy and item difficulty are distinct but strongly correlated. Thus, the multimedia effect on response accuracy in this study included the multimedia effect on individual accuracy in problem solving and item difficulty levels.

Measurements of the multimedia effect on response accuracy have yielded inconsistent results. Some findings have suggested that representational pictures make items easier to understand, even without providing solution-relevant information (Lindner et al., 2016; Lindner et al., 2017b; Saß et al., 2012). This phenomenon is referred to as the multimedia effect in testing (Lindner et al., 2018; Lindner et al., 2017a). For example, Lindner et al. (2017b) added pictures to illustrate the stems of multiple-choice questions and found that the test performance of all students improved. Hoogland et al. (2018a) added pictures to math word problems in an integrated format and also observed improvement in student performance.

However, several studies have found no multimedia effect on response accuracy (Agathangelou et al., 2008; Dewolf et al., 2014; Dewolf et al., 2015; Dewolf et al., 2017). For example, the study conducted by Dewolf (2014) on the use of visual aids to support students in solving non-standard math word problems yielded non-significant results. Some studies have even detected a reverse multimedia effect on response accuracy (Berends and van Lieshout, 2009; Elia et al., 2007; Gagatsis and Elia, 2004).

Multimedia Effect on Response Time

Response time, also called “time on task,” is a process measure of students’ processing speed, regardless of accuracy (Goldhammer et al., 2014; Scherer et al., 2015). Researchers have focused on whether pictures can boost students’ problem-solving fluency. However, measuring the multimedia effect through response time has yielded inconsistent results. Some researchers have reported a reduction in response time when students are asked to solve text–picture problems rather than text-only problems (Lindner et al., 2017b; Saß et al., 2012), while others have found no such effect (e.g., Berends and van Lieshout, 2009; Lindner et al., 2017a).

On the one hand, adding pictures to the problem text may complicate the problem and increase the response time by providing more information to process. Attention distraction is a general effect of pictures (Harp and Mayer, 1998; Wiley et al., 2014). Indeed, eye-tracking research has shown that pictures compete for students’ attention in problem solving to varying degrees (Dewolf et al., 2015; Lindner et al., 2017a). Studies on seductive details (i.e., eye-catching but irrelevant information) in multimedia learning have also shown that students spend more time learning when there are seductive details in instructional materials (Park et al., 2011). On the other hand, some eye-tracking studies have shown that more information does not necessarily extend the response time. Additional pictures in problems may capture very little student attention (Dewolf et al., 2015). In addition, more time spent on processing pictures may be compensated by less time spent on processing texts (Lindner et al., 2017a).

Meanwhile, adding pictures may increase students’ problem-solving fluency according to dual processing theory, which distinguishes two mental processes: automatic and controlled processes (Schneider and Chein, 2003; Shiffrin and Schneider, 1977). Pictures physically resemble an object by visually representing many of its specific and idiosyncratic features. This representational process is more akin to low-level construal. By contrast, words highlight the categorical invariant characteristics of an object, which is more akin to high-level construal. Thus, pictures tend to be processed automatically and rapidly, requiring little mental effort, whereas words are processed through a much more controlled and conscious process that requires higher-level cognitive operations. As such, adding pictures to illustrate or replace part of the problem text may speed up students’ mental model construction. According to the picture superiority effect (Nickerson, 1965; Shepard, 1967), pictures are also better recall cues in information retrieval and thus accelerate solution generation. In addition, pictures can facilitate the answering process through analog structure mapping (e.g., Beveridge and Parkins, 1987; Gentner et al., 2003; Schnotz and Bannert, 2003).

Multimedia Effect on Metacognition and Affective Experience

In contrast to the diverse findings concerning the multimedia effect on response accuracy and response time, existing studies have consistently found that pictures in problems generate higher metacognitive judgments and positive affective experiences.

Students have been found to perceive lower item difficulty (Garcia-Retamero and Hoffrage, 2013) and higher response certainty when solving text–picture problems than text-only problems (Lindner et al., 2018; Dewolf et al., 2015; Ögren et al., 2017). According to dual processing theory (Schneider and Chein, 2003; Shiffrin and Schneider, 1977) and the picture superiority effect (Nickerson, 1965; Shepard, 1967), pictures tend to boost perceptual fluency more than words and are easier to remember and retrieve from memory. Moreover, increased perceptual and retrieval fluency tend to induce higher metacognitive judgments (Bjork et al., 2013). Therefore, pictures are more likely to make the item subjectively easier to understand compared with text, boosting students’ confidence in problem solving.

Research has also found that pictures in problems tend to lead to positive affective experiences. For example, students are more likely to report more positive attitudes and higher test-taking pleasure when pictures are embedded in problem texts (Hao, 2010; Lindner et al., 2016). Students also tend to show greater situational interest (Ögren et al., 2017) and a higher level of arousal when presented with engaging pictorial representations.

The multimedia effect on metacognition and affective experience may also influence the multimedia effect on response accuracy and response time by influencing students’ cognitive engagement and behavior in problem solving. Pictures may reduce individuals’ rapid-guessing behavior in testing (Lindner et al., 2017b; Wise et al., 2009), increase their mental effort in problem solving (Ögren et al., 2017), persuade students to believe an accompanying statement (Ögren et al., 2017) or possibly slow down the loss of engagement in problem solving (Lindner et al., 2017b).

Moderators of the Multimedia Effect in Problem Solving

Function Types of Pictures

The multimedia effect in problem solving is conditional upon the types of pictures included in problem-solving tasks. Studies have compared how pictures of various function types affect the multimedia effect in problem solving (e.g., Agathangelou et al., 2008; Berends and van Lieshout, 2009; Dewolf et al., 2015; Elia et al., 2007; Saß et al., 2012; Saß et al., 2017). However, the function types of pictures have been mainly investigated in isolation across studies.

Specifically, pictures in problems have been categorized according to four functions: decorative, representational, organizational, and informational (Berends and van Lieshout, 2009; Carney and Levin, 2002; Levin, 1981; Saß et al., 2017) (see Fig. 3). Decorative pictures have little or no relation to the text and are usually aimed at triggering a positive mood or creating a learning atmosphere for the problem solver. For example, in the study by Dewolf et al. (2015), a cartoon of a student pondering a word problem was added to a text-based test (see Fig. 3a). Representational pictures are concrete illustrations of information conveyed in the text. They are closely related to the context of the problem and serve to scaffold an individual’s understanding or to strengthen his or her engagement in the task. For example, Lindner et al. (2016) visualized the components of four types of solutions to a chemical solution problem using schematic diagrams (see Fig. 3b). Organizational pictures present the information structures in the text through a visual–spatial approach. For example, in a diagnostic inference task used by Garcia-Retamero and Hoffrage (2013), a square map reflected the natural frequency of breast cancer, including the overall number of women at risk, the number of women who had breast cancer, and the number of women who had received a positive mammogram (see Fig. 3c). Informational pictures contain details needed to solve the problem, such as diagrams and charts. For example, in the study by Hoogland et al. (2018a), texts and pictures were integrated in such a way that individuals would be unable to solve a problem if the pictures were removed (see Fig. 3d).

Fig. 3 Overview of the different types of pictures in problems. a Decorative picture; b Representational picture; c Organizational picture; d Informational picture; e Multiple pictures

Some problems involve several pictures that serve different functions. For example, in a multiple-choice science question used by Saß et al. (2012), a picture of a girl with a floodlight in the item stem was representational, whereas the pictures of the girl, the floodlight, and her possible shadow in the answer choices were informational (see Fig. 3e). Specific pictures are usually categorized according to their primary function.

Researchers have investigated how different pictorial functions affect individuals’ problem-solving performance (Agathangelou et al., 2008; Carney and Levin, 2002; Elia et al., 2007; Elia and Philippou, 2004; Gagatsis and Elia, 2004; Lindner et al., 2016, 2018). Pictures that contain task-relevant information are more likely to reduce item difficulty and improve problem-solving performance, while adding decorative pictures to the text tends to neither enhance nor harm problem-solving performance (Agathangelou et al., 2008; Carney and Levin, 2002; Elia et al., 2007; Lindner et al., 2018).

Other Moderators

In addition, the way in which texts and pictures are combined is important for problem-solving performance. Functionally related texts and pictures that are placed separately may lead to a split-attention effect and extraneous cognitive processing (Berends and van Lieshout, 2009; Crisp and Sweiry, 2006; Ögren et al., 2017). Therefore, an individual’s problem-solving performance may worsen if texts and pictures providing complementary information are separated in space or time (Saß and Schütte, 2016; Saß et al., 2017). An individual’s test performance is also likely to be affected when irrelevant content is provided (Jarodzka et al., 2015).

Researchers have also considered the moderating effect of problem complexity (Hoogland et al., 2018b; Lindner et al., 2016; Solano-Flores et al., 2016; Zahner and Corter, 2010). A wordy problem tends to overload students’ cognitive capacity, as it requires better reading comprehension, which may adversely affect their problem-solving performance. Pictures can make the spatial structure of a wordy problem explicit and thus easier to understand (Boonen et al., 2014; Fuchs et al., 2015; Hoogland et al., 2018b), although some studies have not detected such an effect, which may be due to the low variance in the problem’s wordiness (e.g., Lindner et al., 2016).

The multimedia effect in problem solving may also differ across domains. External representations in different domains vary considerably (Pande and Chandrasekharan, 2017). For example, representations in chemistry can include the periodic table, chemical equations, molecular diagrams, energy graphs, simulations, and so on (Johnstone, 1982), while in mathematics, they can include numbers, algebraic equations, geometries, and metric spaces (Lesh et al., 1987). Although this study did not focus on a specific domain due to the limited number of primary studies, the domain was considered a moderator in the following analysis.

Research Hypotheses

This meta-analytic review aimed to bring together current quantitative research on the multimedia effect in problem solving using the three outcome measures and to identify relevant moderators. Among the outcomes related to metacognition and affective experience, the statistics available for affective experience were insufficient to conduct a meta-analysis. Therefore, this study focused only on the multimedia effect on metacognition, in particular response certainty.

Two hypotheses are proposed based on the literature:

Hypothesis 1. The multimedia effect on response accuracy: Individuals have greater response accuracy when solving text–picture problems than when solving text-only problems.

Hypothesis 2. The multimedia effect on metacognition: Individuals have greater response certainty when solving text–picture problems than when solving text-only problems.

Regarding the multimedia effect on response time, that is, whether additional pictures speed up students’ problem solving, neither the empirical findings nor the theoretical accounts provide a clear expectation. Therefore, no hypothesis is proposed on this effect, but the effect was carefully examined based on the included studies.

Method

Data Collection and Reduction

This study followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to identify relevant studies (Moher et al., 2009). Initial searches adopted the following search terms and their variants to limit the titles or subjects: (“multiple representations” OR “text–picture” OR “multiple external representations” OR “image-rich” OR “illustration” OR “picture” OR “visual representation” OR “pictorial representation” OR “depictive representation”) AND (“problem solving” OR “problem-solving” OR test* OR “multimedia effect in testing” OR “multimedia testing” OR “multimedia problem solving”). The initial search identified 2,984 articles in the ProQuest Dissertations and Theses A&I, ERIC, LISA, PsycARTICLES, PsycINFO, PsycTESTS, ScienceDirect, Web of Science, and China National Knowledge Infrastructure databases. There was no start date limit, and the cut-off date was August 8, 2018.

To be included in the final meta-analysis, studies had to meet the following inclusion and exclusion criteria:

(a) The study should use an experimental or quasi-experimental design, with a control group in the text-only condition and at least one experimental group in the text–picture condition;

(b) The problems in the text-only and text–picture conditions should be equivalent, that is, adding pictures or modifying texts should not affect solution-relevant information;

(c) At least one of the following target outcomes should be measured quantitatively: response accuracy, response time or affective experience;

(d) The quantitative results should include the effect size for any specific measure or enough data to compute the effect size;

(e) When multiple publications are based on the same data, only one publication should be included; and

(f) The study should be reported in English or Chinese.

In addition to these search and screening procedures, Google Scholar was searched using the main keywords to identify conference abstracts, government documents, and other online materials. Snowballing from the reference lists of the primary studies helped identify other relevant research and new keywords. This strategy was essential because this meta-analysis topic is relatively new and relevant studies are distributed across different disciplines, so they tend to use inconsistent terminologies to describe similar phenomena.

The entire screening procedure is detailed in the PRISMA flow diagram (see Fig. 4). The initial coder excluded “easy rejects” through title screening. A second coder was then trained and independently screened the remaining papers against the inclusion and exclusion criteria. Overall inter-rater reliability (Cohen’s kappa) for abstract and full-text screening was high (κ = 0.86). All disagreements were discussed and consensus was reached. The first coder then completed the snowball search and contacted a few key researchers who had published many papers in the field (e.g., Lindner, Hoogland, Van Dooren and Dewolf) to check whether any unpublished studies or important publications were missing from the final literature list.
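For reference, Cohen’s kappa adjusts raw percentage agreement for the agreement expected by chance; it is commonly defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where p_o is the observed proportion of agreement between the two coders and p_e is the proportion of agreement expected by chance. Values above 0.80 are conventionally interpreted as high agreement.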

Fig. 4 PRISMA flow diagram of article selection for problem solving with pictorial representations

Coding Procedure and Study Characteristics

Codes were developed to characterize the studies included in the review list (see Table 1). Two researchers conducted the coding work independently. The inter-rater reliability of the main moderators in terms of initial coding was high (κ > 0.80). All disagreements were resolved by seeking consensus.

Table 1 Codes for study characteristics

Computation of Effect Sizes and Analysis

Hedges’s g was adopted to measure the effect sizes. R (a software environment for statistical computing and graphics) and the Comprehensive Meta-Analysis (CMA, version 2) software were used to conduct the meta-analyses in this study. For between-groups designs, the data formats for calculating the effect sizes included “means, SDs, and sample sizes,” “Cohen’s d and variance,” “events and sample size in each group,” “means, sample sizes, and p-values” and “sample sizes and p-values.” The data formats for within-group designs included “means, sample sizes, and paired t-values,” “mean difference (paired) and standard error” and “sample sizes and p-values from paired t tests.”
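To illustrate the between-groups case, the following minimal R sketch computes Hedges’s g and its sampling variance with the escalc function from the metafor package; the summary statistics and variable names are hypothetical placeholders rather than values from any included study:

```r
library(metafor)

# hypothetical summary statistics for one between-groups comparison
dat <- data.frame(
  m1i = 0.71, sd1i = 0.18, n1i = 52,  # text-picture condition: mean, SD, n
  m2i = 0.66, sd2i = 0.20, n2i = 50   # text-only condition: mean, SD, n
)

# measure = "SMD" yields the bias-corrected standardized mean difference (Hedges's g)
dat <- escalc(measure = "SMD",
              m1i = m1i, sd1i = sd1i, n1i = n1i,
              m2i = m2i, sd2i = sd2i, n2i = n2i,
              data = dat)

dat$yi  # effect size (Hedges's g)
dat$vi  # sampling variance of the effect size
```

Other reported formats (e.g., p-values with sample sizes, or paired t-values for within-group designs) would first be converted to a standardized mean difference and its variance before the same pooling steps are applied.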

To control for interdependencies between effect sizes, the effect sizes for the same target outcome in a single study were combined before the studies were pooled if the participants involved in these effect sizes overlapped. However, for the moderator analysis, different effect sizes in one study were retained if they were related to different types of pictures or to item(s) with different levels of difficulty. Influential and outlying studies were also identified using the Baujat plot (Baujat et al., 2002) before computing the summary effect sizes for different outcomes. A variety of outlier and influential case diagnostics can be computed when conducting a meta-analysis (Viechtbauer and Cheung, 2010). A Baujat plot visually identifies sources of heterogeneity in meta-analytic data by plotting each study’s contribution to overall heterogeneity against each study’s contribution to the combined effect size (Baujat et al., 2002).
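A minimal R sketch of these diagnostics, assuming a data frame dat that already holds one combined effect size (yi) and sampling variance (vi) per study, might look as follows:

```r
library(metafor)

# dat: one combined effect size (yi) and sampling variance (vi) per study
res <- rma(yi, vi, data = dat, method = "REML")  # random-effects model

# Baujat plot: each study's contribution to overall heterogeneity (x-axis)
# against its influence on the combined effect size (y-axis)
baujat(res)

# standard outlier/influence diagnostics (studentized residuals, Cook's distances, ...)
inf <- influence(res)
plot(inf)
```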

Based on previous studies, multiple variables, such as picture functions, problem difficulty, and problem domain, were predicted to influence the multimedia effect in problem solving. Thus, random effects and mixed effects models were adopted to conduct the main meta-analysis and moderation analysis using the rma function in the metafor package (Viechtbauer and Cheung, 2010). Regarding publication bias, funnel plots (Light and Pillemer, 1984; Sterne and Egger, 2001) were used for all of the outcomes (Rothstein et al., 2005). Furthermore, the ranktest and regtest functions were used to detect funnel plot asymmetry. When a combined effect size was significant, Rosenthal’s fail-safe N test was conducted using the fsn function in R. The fail-safe number estimates how many missing studies with an average z-value of zero would be needed to render the combined effect size non-significant (Rosenthal, 1979).
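The following R sketch strings these steps together; function_type is a hypothetical moderator column standing in for the coded study characteristics, not a variable taken from the original studies:

```r
library(metafor)

# random-effects summary estimate; the output reports the pooled g, Q, and I^2
res <- rma(yi, vi, data = dat, method = "REML")
summary(res)

# mixed-effects moderator analysis with a categorical moderator
res_mod <- rma(yi, vi, mods = ~ factor(function_type), data = dat)
summary(res_mod)

# publication bias diagnostics
funnel(res)              # funnel plot
regtest(res)             # Egger-type regression test for funnel plot asymmetry
ranktest(res)            # rank correlation (Kendall's tau) test
fsn(yi, vi, data = dat)  # Rosenthal's fail-safe N (the default method)
```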

As a general rule, all meta-analysis results should be interpreted with caution because, by necessity, they only select quantitative evidence (Greco et al., 2013). Furthermore, the combined effect sizes may not be robust enough if the available studies are too few or too different. It is thus essential to examine the heterogeneity of the observed effect sizes in a meta-analysis. Different study characteristics may contribute to differences between studies that could not be explained by subject-level sampling errors (Lipsey and Wilson, 2001). Therefore, if there is heterogeneity, moderator analysis is usually performed in a meta-analysis to help explain the heterogeneity of the observed effect sizes. This study reported Q and I2 statistics to test for heterogeneity of variance. If Q exceeds a critical value (p < .05), the average effect size is interpreted as significantly heterogeneous. I2 estimates the proportion of variance in the observed effect sizes that is due to heterogeneity rather than sampling error. I2 at around 25%, 50%, and 75% is interpreted as indicating low, medium, and high heterogeneity, respectively (Higgins and Thompson, 2002).
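Both statistics derive from the same Q test; following Higgins and Thompson (2002), I2 expresses the excess of Q over its degrees of freedom relative to Q (truncated at zero):

$$I^2 = \max\!\left(0,\ \frac{Q - df}{Q}\right) \times 100\%, \qquad df = k - 1$$

where k is the number of comparisons contributing to the pooled estimate.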

Results

Sample Analysis

There were 51 pairwise comparisons from 26 articles (see Table 2): 23 journal articles, two theses (Garrett, 2008; Maries, 2013), and one research report (Ginther, 2001). The very low number of original articles limited the reliability of aggregate effect sizes in this meta-analysis. All of the publications included in the final meta-analysis were in English. Most of the authors were from Europe (54%) and the USA (38%). The publication dates ranged from 1983 to 2018. Over 60% were published after 2010, indicating that the multimedia effect in problem solving has attracted increasing attention in recent years. In the 51 comparisons, the total number of participants was 38,987, with a median of 63 and a range of 10 to 31,842. Excluding the study by Hoogland et al. (2018a) as an outlier (n = 31,842), the mean number of participants in the remaining studies was 143. The sets of problems used in these comparisons were primarily from science (53%) and mathematics (35%). The participants were generally primary school (43%) and university (41%) students. The number of between-groups designs (67%) was double the number of within-group designs. About half (53%) of the studies presented problems in a computer-based environment. The multiple-choice type of problem (59%) was used slightly more often than structured open responses. The average number of problems included in each study was 19, with a range of 1 to 48.

Table 2 Overview of included studies

Heterogeneity analysis showed that there was a high level of heterogeneity in the outcome constructs of response accuracy (Q (33) = 783.9, I2 = 95.79, p < .001) and response time (Q (12) = 45.553, I2 = 73.66, p < .001) and a moderate level of heterogeneity in response certainty (Q (3) = 7.252, I2 = 58.634, p = .064). In terms of response accuracy, heterogeneity could not be determined for decorative pictures because there was only one comparison available; comparisons between informational pictures (Q (6) = 349.396, I2 = 98.28, p < .001) revealed the greatest heterogeneity, followed by organizational pictures (Q (7) = 41.79, I2 = 83.25, p < .001) and representational pictures (Q (14) = 44.527, I2 = 68.56, p < .001). The substantial amount of heterogeneity in the comparisons necessitated analysis of the possible moderators and indicated the limited reliability of the average effect sizes.

Publication Bias Analysis

The funnel plot for the response accuracy measures (k = 34) appeared to be visually symmetrical (see Fig. 5a). Furthermore, neither the regression test (t = 1.22, df = 32, p = .23) nor the rank correlation (Kendall’s τ = 0.157, p = .19) showed significant asymmetry. The fail-safe N test showed that 1,302 missing publications would be needed to make the significant combined effect size statistically non-significant, and it is unlikely that so many studies would have gone undetected. According to Rosenthal’s (1991) guidelines, the fail-safe number also exceeded 5 × k + 10 (5 × k + 10 = 180 < 1,302). The overall effect size for response accuracy was therefore robust and unlikely to have been inflated by a threatening level of publication bias.

Fig. 5 Funnel plots of the outcomes. a Response accuracy; b Response time

Publication bias for response accuracy was further analyzed for various types of pictures due to the considerable heterogeneity between studies. For decorative pictures (k = 1), publication bias could not be analyzed due to the lack of enough available comparisons. The regression test and the rank correlation for representational (k = 15), organizational (k = 8), and informational (k = 7) pictures all showed no significant asymmetry. The fail-safe number for both representational (N = 107) and organizational (N = 100) pictures exceeded Rosenthal’s (1991) critical values (i.e., 85 and 50, respectively), whereas for informational pictures (N = 7), it was below the critical value of 45. Thus, for representational and organizational pictures, there was no publication bias that could significantly threaten the validity of the average effect sizes, but publication bias could be a problem for informational pictures in measures of response accuracy.

For comparisons measuring response time (k = 13), the regression test (t = 3.06, df = 11, p < .05) and the rank correlation (Kendall’s τ = 0.50, p < .05) revealed the presence of significant publication bias. The funnel plot (see Fig. 5b) also indicated visual asymmetry. In particular, there was one extreme comparison in the lower right corner, representing the study by Goolkasian (1996). After removing this comparison (k = 11), the regression test (t = 2.80, df = 9, p < .05) and the rank correlation (Kendall’s τ = 0.51, p < .05) still indicated significant asymmetry. Therefore, the comparisons included in this study did not provide representative evidence for the multimedia effect on response time, so the aggregate effect size should be interpreted with caution.

In terms of publication bias for the effect sizes of response certainty (k = 4), the regression test (t = 3.156, df = 2, p = .09) and the rank correlation (Kendall’s τ = 0.83, p = .09) indicated no threatening publication bias. However, the fail-safe number (N = 23) for response certainty was below the critical value of 30 (i.e., 5 × 4+10), indicating that publication bias could exist. Due to the limited number of comparisons, it was difficult to evaluate the symmetry of the funnel plot. Given the inconsistent results for publication bias for response certainty, caution should also be exercised when interpreting the aggregate effect size of response certainty.

Overall Multimedia Effect in Problem Solving

Multimedia Effect on Response Accuracy

The Baujat plot for response accuracy suggested no apparent outliers. The overall estimate based on a random effects model showed that including pictures had a significant and positive average effect on response accuracy (Hedges’s g = 0.32, p < .001, 95% CI [0.17, 0.46]), even after removing the two articles (Lindner et al., 2016; Lindner et al., 2017a) measuring item difficulty (Hedges’s g = 0.30, p < .001, 95% CI [0.16, 0.45]).

The evidence for the effect of decorative pictures was not sufficient for a reliable interpretation, as there was only one study (Berends and van Lieshout, 2009) reporting a non-significant effect on response accuracy. Without considering decorative pictures, the results showed that a picture’s function did not significantly moderate the average multimedia effect on response accuracy (Qbet (3) = 3.28, p = .35). The small aggregate effect size for representational pictures reached the significance level (Hedges’s g = 0.24, p < .01, 95% CI [0.10, 0.37]). As indicated by the forest plot (see Fig. 6), Maries’s (2013) study was driving the heterogeneity. Maries’s study was a dissertation examining the effects of drawings in physics problem solving. Maries provided students with additional diagrams usually drawn by physics experts at the initial stage of problem solving and found that these diagrams did not scaffold students’ problem solving as expected and even had a negative effect on their performance. He explained that additional diagrams led to students spending less time on conceptual planning and drawing less productive diagrams in problem solving. Maries used a quasi-experimental approach rather than a randomized controlled trial. His study also included an instruction phase in which all students learned to solve physics problems through drawing. Therefore, Maries’s study may not be comparable to other studies. After removing his study, the average effect of representational pictures on response accuracy increased slightly and became more significant (Hedges’s g = 0.27, p < .001, 95% CI [0.14, 0.39]).

Fig. 6 Forest plot of the effect sizes (Hedges’s g) of response accuracy across different function types of pictures

Organizational pictures had a significant average effect of medium size (Hedges’s g = 0.52, p < .01, 95% CI [0.17, 0.86]). Among the eight comparisons, only Gick and Holyoak (1983) reported a negative effect of organizational pictures on response accuracy. Their study focused on analogical reasoning in problem solving and examined whether an additional diagram illustrating the underlying schema of an analog story could help students solve a disparate but analogous transfer problem. They found that the participating students could interpret the diagram as representing the similar schema implicit in the two analog stories and could use the diagram as a retrieval cue to solve the transfer problem. However, they failed to understand the diagram when it was presented with only one analog story and thus performed even worse than those in the no-diagram condition. The study by Gick and Holyoak indicated that the designed organizational function may not be fulfilled if students cannot effectively understand the purpose of a picture.

Informational pictures (Hedges’s g = 0.12, p = .58, 95% CI [-0.31, 0.56]) had no significant average effect on response accuracy. However, six of the seven studies showed the expected direction of the effect. As indicated by the forest plot (see Fig. 6), the combined effect size of the studies by Lin et al. (2013) and Yang and Huang (2004) was the only negative combined effect size and seemed to drive the heterogeneity of informational pictures. These two studies reported a large negative effect size of informational pictures in solving math problems for primary school students (sixth grade) in Taiwan and suggested that interpreting pictorial representations in mathematics required a specific type of mathematical ability. Multiple pictures (Hedges’s g = 0.56, p = .13, 95% CI [−0.16, 1.27]) also had no significant aggregate effect, although the three included comparisons reported positive effects on response accuracy.

Problem difficulty (Qbet (1) = 0.14, p = .71) was not a significant moderator of the multimedia effect on response accuracy. Adding pictures to difficult (Hedges’s g = 0.28, p < .01, 95% CI [0.11, 0.46]) or easy problems (Hedges’s g = 0.34, p < .01, 95% CI [0.08, 0.60]) both significantly improved an individual’s response accuracy with small-to-medium average effect sizes. Other possible moderators were also tested (see Table 3). They were all found to be non-significant, including institution location (Qbet (2) = 2.86, p = .24), effective sample size (QM (1) = 0.31, p = .58), set size (QM (1) = 0.56, p = .46), problem domain (Qbet (3) = 7.09, p = .07), answer format (Qbet (1) = 0.02, p = .90), test environment (Qbet (1) = 0.69, p = .41), participants’ education level (Qbet (2) = 1.81, p = .41), experimental design (Qbet (1) = 0.72, p = .40), and control conditions (Qbet (1) = 0.40, p = .53).

Table 3 Moderator analysis of the multimedia effect on response accuracy

Multimedia Effect on Response Time

Goolkasian’s (1996) study was found to be a potential outlier for response time. There were two qualified comparisons in this study, both of which were influential, with large residuals (externally standardized residuals were 2.12 and 3.09, respectively) and a strong influence on the results. Removing Goolkasian’s study significantly reduced the heterogeneity (from Q (12) = 45.55 to Q (10) = 13.03). Goolkasian (1996) provided students with pictorial or textual background information, then asked them to respond to the test sentences (true/false) as quickly as possible. Compared with other studies, this study reported a much greater reduction in response time when students were provided with pictures. After removing Goolkasian’s study, the meta-analysis did not find a significant and positive average multimedia effect on response time (Hedges’s g = −0.06, p = .31, 95% CI [−0.16, 0.05]), though 6 of the 11 studies showed a positive multimedia effect on response time (see Fig. 7). The effect was still non-significant when Goolkasian’s study was included (Hedges’s g = 0.15, p = .13, 95% CI [−0.04, 0.34]). Therefore, the multimedia effect on response time remained uncertain based on these limited and inconsistent findings.

Fig. 7 Forest plot of the effect sizes (Hedges’s g) of response time

Due to the limited number of comparisons, the moderating effect of a picture’s function type was not tested. For decorative pictures, there was only one comparison available (see Fig. 7). Berends and van Lieshout (2009) reported that decorative pictures significantly increased the response time. Three of the five comparisons also indicated that representational pictures increased response time. In addition, only one negative effect (Lindner et al., 2017b) among the five comparisons reached the significance level. The only comparison for organizational pictures reported a non-significant and positive multimedia effect on response time (Whitley et al., 2006). For informational pictures, four (Goolkasian, 1996; Saß et al., 2012) of the five comparisons indicated significant reductions in response time. For multiple pictures, there was only one comparison available (Saß et al., 2012), which indicated a positive but non-significant multimedia effect on response time.

Due to the limited number of comparisons, moderator analysis was also not conducted for problem difficulty. After removing Goolkasian’s (1996) study, response time was found to be significantly moderated by test environment (Qbet (1) = 4.39, p < .05) and experimental design (Qbet (1) = 9.42, p < .01). Additional pictures in problems using a paper-and-pencil format significantly increased response time (Hedges’s g = −0.18, p < .05, 95% CI [−0.32, −0.04]), whereas a computer-based environment had a non-significant average effect (Hedges’s g = 0.05, p = .57, 95% CI [−0.11, 0.21]). For studies adopting a between-subjects design, a significant average multimedia effect on response time was detected (Hedges’s g = 0.22, p < .05, 95% CI [0.01, 0.43]), whereas within-subject designs showed a significant reverse effect (Hedges’s g = −0.13, p < .01, 95% CI [−0.21, −0.05]).

Across the included studies, there was no significant moderating effect of institution location (Qbet (1) = 0.23, p = .64), effective sample size (Q (1) = 1.14, p = .28), set size (Q (1) = 0.14, p = .70), problem domain (Qbet (1) = 2.12, p = .15), answer format (Qbet (1) = 3.53, p = .06), participants’ education level (Qbet (1) = 0.84, p = .36) or control conditions (Qbet (1) = 3.03, p = .08). The presence of pictures significantly increased the response time when studies had no control variables (Hedges’s g = −0.12, p < .01, 95% CI [−0.21, −0.03]) or adopted items requiring structured open responses (Hedges’s g = −0.16, p < .05, 95% CI [−0.30, −0.03]).

Multimedia Effect on Metacognition

There were four comparisons for response certainty. It should be noted that Garcia-Retamero and Hoffrage (2013) measured perceived task difficulty, which was assumed to be consistent with response certainty. The analysis revealed a positive and significant aggregate effect size for response certainty (Hedges’s g = 0.74, p < .01, 95% CI [0.27, 1.20]) (see Fig. 8). However, the limited number of comparisons may affect the reliability of this result. In addition, no moderation analysis could be conducted due to insufficient comparisons.

Fig. 8 Forest plot of the effect sizes (Hedges’s g) of response certainty

Discussion

This meta-analysis showed a significant small-to-medium average multimedia effect on response accuracy (Hedges’s g = 0.32) and a significant medium-to-large average multimedia effect on students’ response certainty (Hedges’s g = 0.74), but no significant multimedia effect on response time on average. Overall, we found evidence to support Hypotheses 1 and 2. However, the limited data did not allow for a clear evaluation of these two hypotheses and the multimedia effect on response time.

Multimedia Effect on Response Accuracy

Overall, the significant multimedia effect in problem solving in the included studies is consistent with the multimedia effect in learning (Butcher, 2014; Mayer, 2001, 2009, 2014). Early meta-analyses of the multimedia effect in learning have shown that decorative pictures are not beneficial for text learning, whereas other types of pictures can enhance text-based learning with medium-to-large effect sizes (Levin et al., 1987). Recent studies on the multimedia effect in learning have also reported medium-to-large effect sizes (Butcher, 2014). However, variations in the problem-solving context complicate and diversify the multimedia effect. The 34 pairwise comparisons on response accuracy included in this meta-analysis showed high heterogeneity, but the moderator analysis did not identify important moderators.

Moderating Effect of Function Type

For decorative pictures, there was only one comparison available indicating a non-significant effect on response accuracy (Berends and van Lieshout, 2009). This somewhat echoes the double-edged effects of seductive details or decorative pictures on multimedia learning outcomes. On the one hand, seductive details or decorative pictures are task-irrelevant details and may thus hinder learning when they induce overload (Lehman et al., 2007), unless students can intentionally ignore them, which depends on their task experience (Rop, 2017; Rop et al., 2017) or their perceived irrelevance of these seductive details (Eitel et al., 2019). On the other hand, seductive details or decorative pictures capture very little attention (Lenzner et al., 2013), but help elicit positive affective experiences and lower perceived cognitive load, which may promote germane processing when cognitive resources are available (Park et al., 2011; Schneider et al., 2016).

When decorative pictures were excluded, the function of the pictures did not significantly moderate the multimedia effect on response accuracy. Adding representational (Hedges’s g = 0.24) and organizational (Hedges’s g = 0.52) pictures significantly improved response accuracy based on the included studies, but this effect was not found for informational or multiple pictures. The overall results of this study cannot fully confirm the previous findings that task-relevant pictures are beneficial for response accuracy while decorative pictures have little or no effect on it (Agathangelou et al., 2008; Carney and Levin, 2002; Elia et al., 2007; Lindner et al., 2018). Pictures with different functions showed multimedia effects of varying magnitude on response accuracy, which is difficult to interpret. This study presents some alternative explanations for the effects of organizational and informational pictures.

Regarding the strong positive effect of organizational pictures, the DeFT (Design, Functions, Tasks) framework proposed by Ainsworth (2006) can offer some insights. This framework suggests that various combinations of external representations mainly fulfill three key functions: generating complementary processes or providing complementary information in mental model construction; constraining interpretations of any representation; and constructing a deeper understanding through an integration process. According to the DeFT framework, decorative pictures have no prominent functions; representational pictures tend to complement and constrain interpretations of corresponding texts; informational pictures and texts tend to complement each other by providing unique information; and organizational pictures are most likely to induce all types of functions and thus best support mental model construction.

The significant impact of organizational pictures can also be explained by the difference between pictures and texts. Pictures are assumed to be more computationally efficient than texts in conveying spatial information. Thus, organizational pictures make full use of the picture advantage to provide spatial information and scaffold mental model construction. This is consistent with a recent finding in learning that overlapping spatial information is beneficial, but overlapping non-spatial information has no effect on learning performance (Schüler et al., 2019).

Informational pictures are quite common in problem-solving contexts (Dewolf, 2014). Thus, the non-significant aggregate effect of informational pictures is worth discussing, as well as why they sometimes inhibit problem solving (Lin et al., 2013; Maries, 2013). Informational pictures replace part of the problem text with an equivalent pictorial format. Maries (2013) attributed the detrimental effect of informational pictures to students spending less time on conceptual analysis of the problem.

The text-driven assumption may provide an alternative explanation. There is ample evidence showing that text is a dominant cue in mental model construction and guides picture processing (e.g., Canham and Hegarty, 2010; Folker et al., 2005; Hao, 2010; Hegarty and Just, 1993; Schüler et al., 2019). Therefore, informational pictures may not be fully processed due to the lack of textual guidance, as they have little overlap with the problem text.

Another explanation concerns an individual’s representational competence. Informational pictures contain information that is indispensable for problem solving, and thus usually complement the text. The ability to interpret informational pictures is therefore necessary to solve such problems. Students with low levels of representational competence (i.e., the ability to use representations to understand and communicate ideas and solve problems [Daniel et al., 2018; Huinker, 2015]) may find it difficult to process pictorial information. Students who have difficulty integrating pictorial and descriptive representations may find it more difficult to solve text–picture problems.

The way in which informational pictures are combined with the problem text also affects student performance. The text–picture format has been shown to be more important for problems with informational pictures than for those with other types of pictures (Saß et al., 2017). An integrated format is needed for complementary information representations to support individual problem solving. An individual’s test performance will be hampered when relevant content is presented in a split manner (Jarodzka et al., 2015).

Multimedia Effect on Response Time

The results of this meta-analysis did not show a significant average multimedia effect on response time. The benefits of the picture superiority effect (Nickerson, 1965; Shepard, 1967) and the dual processing assumption (Schneider and Chein, 2003; Shiffrin and Schneider, 1977) do not appear to be strong enough to generate a consistent change in response time when equivalent pictures are added to illustrate or replace part of the problem text. However, this finding should be interpreted with caution due to the presence of publication bias, the very limited number of primary studies, and a high level of heterogeneity in the measures of response time.

Pictures with various functions affect response time in a very complex way. Eye-tracking research has shown that students pay different attention to different types of pictures (Dewolf et al., 2015; Lindner et al., 2017a). However, this study could not effectively separate the average multimedia effect on response time induced by different types of pictures given the small number of studies. Therefore, future primary studies should measure response time and adopt techniques such as eye tracking or thinking aloud to further examine picture processing in problem solving.

Whether pictures can function as recall cues also influences response time. For example, Saß et al. (2012) found that pictures in the stems and answer options of multiple-choice items can significantly reduce students’ response time. They suggested that adding pictures to answer options can facilitate solution processing through analog structure mapping processes (Schnotz and Bannert, 2003). This is consistent with the results of previous studies on analogical problem solving in which pictures in analog items serve as recall cues for information retrieval (Beveridge and Parkins, 1987; Gentner et al., 2003). This visual analog effect in the stems and answer options also applies to study and test contexts in which a matching picture in both study materials and test items can improve test performance (Lindner et al., 2018). This is also referred to as transfer appropriate processing in recognition tasks, which suggests that memory is improved when there is a match rather than a mismatch between item processing for a test and for study (e.g., Morris et al., 1977).

Readers should also be cautious with the interpretation of response time itself due to its heterogeneous effects on an individual’s problem-solving performance (Goldhammer et al., 2014; Goldhammer et al., 2015). There is no clear answer as to whether a lower response time is good or bad. On the one hand, students’ response time may indicate their speed of information processing (Scherer et al., 2015). On the other hand, a reduction in response time may indicate a reduced level of engagement such as rapid-guessing behavior. These two-sided effects are also moderated by task difficulty and individual skill levels (Goldhammer et al., 2014; Goldhammer et al., 2015). For difficult problems or low-skilled individuals, a longer response time is more likely to indicate better performance. Consequently, simply using response time may be inadequate for measuring multimedia problem-solving performance if the response time itself is not an evaluation criterion.

Multimedia Effect on Metacognition

The multimedia effect on an individual’s metacognition in problem solving was supported in this study, but this result may be affected by publication bias. The students perceived greater response certainty when solving text–picture problems than when solving text-only problems (Hedges’s g = 0.74). The higher metacognitive judgment induced by pictures has also been observed in learning contexts, in which seductive details reduce students’ perceived cognitive load and task difficulty (Lenzner et al., 2013; Magner et al., 2014; Park et al., 2011; Schneider et al., 2016). Higher metacognitive judgments under text–picture problem conditions may also benefit students’ positive affective experiences (e.g., motivation, self-efficacy, pleasure, arousal) in multimedia problem solving.

However, increased confidence in problem solving induced by pictures may have detrimental effects when metacognitive judgments are biased. Phenomena such as overconfidence and underestimated task difficulty are commonly referred to as metacognitive illusions or bias due to the use of multimedia materials in both problem solving (Ögren et al., 2017) and learning contexts (Bjork et al., 2013; Eitel, 2016; Serra and Dunlosky, 2010; Wiley et al., 2014). For example, students tend to confirm statements accompanied by pictures and perceive conflicting text information as plausible (Isberner et al., 2013; McCabe and Castel, 2008; Ögren et al., 2017). Inflated metacognitive judgments may also lead to suboptimal restudy behaviors and less mental effort in text comprehension (Lindner et al., 2018; Ögren et al., 2017). Consequently, the question of whether an individual’s increased confidence due to pictures in problems is biased should be carefully considered.

Limitations and Directions for Future Research

This meta-analysis has several limitations and its results should be interpreted with particular caution. First, the quality of the studies included in the sample cannot be fully guaranteed. Due to the current limited number of quantitative studies on this topic, we adopted a broad literature search strategy without strict exclusion criteria in terms of experimental design or publication source. Moreover, publication bias could exist in the multimedia effect on response accuracy (for informational pictures), response time, and metacognition. Future review research should better address these questions by including more stringent quality evaluation criteria when selecting the literature.

Second, the limited number of studies and high levels of heterogeneity may also have influenced the findings. Due to the limited research on this topic, only 26 articles met our criteria. These studies came from different fields, their participants ranged from primary school students to adults, and they had different sample sizes. The heterogeneity of response accuracy and response time was substantial. A high level of heterogeneity may indicate a lack of precision in aggregate effect sizes. Therefore, a random effects model was adopted. Moderator analysis was also conducted to decompose this heterogeneity. However, we did not find significant moderators for response accuracy, leaving the heterogeneity unexplained.

This study estimated problem difficulty based on students’ response accuracy on text problems, because most studies did not report the difficulty level of their items. Therefore, it should be noted that the problem difficulty code in this study may not capture the actual difficulty level. In addition, due to limited data, this study did not examine other important moderators identified in the literature, such as the text–picture integration format, an individual’s prior knowledge, and problem complexity. Future review research should test and interpret these other possible moderating effects when more qualified studies are published.

Finally, this study only categorized pictures according to their functional relationship with the problem texts. However, the main categorization of picture types is not always perfectly clear. In addition, it is difficult to determine whether a certain type of picture was interpreted by the students as it was designed to be. At the same time, pictures identified as having the same function may vary considerably. For example, representational pictures may include cartoons or realistic photographs. The abstractness of a picture may also moderate the multimedia effect in problem solving by invoking different contexts or levels of detail. This study did not categorize pictures based on this dimension. This limitation should encourage further research on the moderating effect of picture type.

Conclusion and Implications

This meta-analytic review supports the multimedia effect in problem solving to some extent and highlights the impact of pictures on problem-solving performance in common exercises and formal tests. The use of multiple media in assessments or problem-based learning raises a number of issues that have not yet been fully explored. On the one hand, multimedia can improve item access and serve as an accommodation to minimize the influence of language proficiency or reading comprehension skills (Solano-Flores et al., 2014; Winter et al., 2006). On the other hand, an individual’s ability to interpret multiple representations is likely to cause access issues if representational competence is not included in the targeted measurement construct. This review also casts doubt on the direct application of multimedia learning principles to guide item design. Kirschner et al. (2017) proposed a cognitive theory of multimedia assessment that reversed most of the original multimedia learning principles to better differentiate novices and experts and ensure ecological validity. However, these claims remain empirically untested. Further prospective studies are needed to investigate item design principles in the contexts of problem-based learning, typical exercises, and formal testing.