Introduction

Over the past decades, a wealth of empirical research has demonstrated that students learn better with text and pictures than with text alone or pictures alone (see Anglin et al. 2004; Carney and Levin 2002; Fletcher and Tobias 2005; Levie and Lentz 1982; Vekiri 2002 for reviews). This finding is known as the multimedia effect (cf. Mayer 2009). Multimedia effects have been found not only when text and pictures were presented simultaneously (see Mayer 2009 for a review), but also when they were presented sequentially (e.g., McCrudden et al. 2009).

If text and pictures are presented in a certain temporal sequence, the question is which sequence is better for learning: picture first or text first? Existing theories on memory representations (e.g., Kulhavy et al. 1993; Schooler 2002) and mental model construction (e.g., Schnotz 2002; Van Dijk and Kintsch 1983) can explain both why it is better for learning to process the picture before text and why it is better to process the picture after text. Accordingly, on an empirical level, some studies obtained better learning outcomes when presenting the picture before rather than after the text (e.g., Robinson et al. 2003; Verdi et al. 1997), whereas other studies demonstrated the exact opposite effect (e.g., Huff and Schwan 2008; Shaw et al. 2012). Therefore, in the present review, we hypothesize that it is not the sequence of presenting text and pictures per se that predicts learning outcomes. Rather, it is the functions that text and pictures have for the processes and outcomes of learning that make the difference (e.g., Carney and Levin 2002; Vekiri 2002); among other influences such as prior knowledge, these functions depend on the sequence in which text and pictures are presented (cf. Ainsworth 2006). Accordingly, results of the empirical studies reviewed in this article are analyzed with respect to the functions of text and pictures in a given sequence. From the analysis, two boundary conditions are derived that may determine when it is better for learning to process the picture or text first. These are (1) the type of assessed knowledge and (2) the relative complexity of information presented within the picture and the text. These boundary conditions should be considered guidelines for further research in this context. Further research is needed to validate empirically when and why it is better for learning to process the picture before text, or text before the picture, and thus to be able to derive more specific instructional recommendations.

To systematically study why processing of a picture is beneficial for processing of text and vice versa, the majority of studies reviewed in the present paper investigated the effects of presenting a picture before or after text in a sequential display (39 out of 42 studies). In the three remaining studies, students’ eye fixations while processing text and pictures in a concurrent display were used as indicators of the sequence in which verbal and pictorial information was processed, as it is assumed that visual attention on a stimulus reflects immediate cognitive processing (cf. eye-mind hypothesis; Just and Carpenter 1980). Whereas influential reviews in the research areas of reading (Rayner 1998, 2009), scene perception (Henderson 2003), and multimedia learning (Van Gog and Scheiter 2010) lend support to the eye-mind hypothesis, it does not hold in every situation. According to Hyönä (2010), the eye-mind hypothesis is likely to hold true if the available visual environment is relevant to the task at hand. In the studies reviewed in the present article, students were usually instructed to learn the given information in preparation for a knowledge or comprehension test. Thus, the available visual environment was indeed relevant to the task at hand, so the eye-mind hypothesis was likely to hold true for these studies.

For the purpose of the review, a broad definition of the terms “text” and “picture” was applied. Text refers to any kind of information in a verbal code such as short or long prose, expository text, or verbal instructions in a written or spoken format. The defining criterion for text is that it comprises arbitrary symbols that are associated with the represented objects only by convention, and not by structural similarity (cf. descriptive representation; Schnotz 2002). Pictures, by contrast, are defined as being associated with the represented object by similarity or common structural properties. Thus, photographs are defined as pictures because they are similar to what they represent (first-order isomorphism; Shah et al. 2005). In addition, other types of visual displays such as maps, diagrams, graphs, graphic organizers (see Footnote 1), matrices, and geographical or concept maps are defined as pictures even though they do not necessarily share physical similarities with what they represent, and even though parts of their structure are specified by convention (cf. Schnotz 2002). Common to these displays, however, is that arrangements of objects in space are used to represent structural and/or conceptual features (cf. Hegarty 2011; Larkin and Simon 1987). For instance, objects belonging together are presented in close proximity in a diagram. Similarly, a higher bar in a graph represents a higher semantic value associated with it. Accordingly, all kinds of visual displays where space is meaningful are treated as pictures in the present article.

Applying this broad definition of text and picture, the present article reviews empirical studies from the research areas of multimedia learning, learning with graphic organizers, learning with maps and text, and learning with multiple external representations. The goal is to explain apparently contradictory findings for sequence effects when learning with text and pictures, and to derive two boundary conditions that may predict when pictures are better processed before versus after the text. To this end, the present review refers to theories on memory representations (e.g., Kulhavy et al. 1993; Schooler 2002) to explain when it is better for recall performance to present the picture before text, or text before the picture. Moreover, it refers to theories on mental model construction (e.g., Schnotz 2002; Van Dijk and Kintsch 1983) to explain when it is better for comprehension to process the picture or text first. In the following section, we address how sequencing of text and pictures may affect memory representations, and therefore recall performance.

To identify potentially relevant studies for the present review, the computerized databases for research in psychology (PsycINFO) and education (ERIC) were searched by entering combinations of the keywords “learning with text and pictures,” “multimedia,” “multiple external representations,” “graphic organizers,” “graphic overviews,” “graphic advance organizers,” and “graphic post organizers” and the keywords “sequence,” “order,” “presentation order,” “before,” and “after” using both “AND” and “OR” of the Boolean operators (up to January 2014). Moreover, to not miss any relevant study, we screened the articles that were cited in already identified papers, and incorporated the relevant articles.
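The keyword combinations described above can be sketched in code. The following is a minimal illustration only: the review does not report the exact query syntax submitted to PsycINFO or ERIC, so the query structure (OR within a keyword group, AND between the two groups) and the helper names `or_group`, `combined_query`, and `pairwise_queries` are assumptions made for this example.

```python
from itertools import product

# Keyword groups quoted from the search description in the text.
TOPIC_TERMS = [
    "learning with text and pictures",
    "multimedia",
    "multiple external representations",
    "graphic organizers",
    "graphic overviews",
    "graphic advance organizers",
    "graphic post organizers",
]
SEQUENCE_TERMS = ["sequence", "order", "presentation order", "before", "after"]


def or_group(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def combined_query(topic_terms, sequence_terms):
    """One broad query: any topic term AND any sequence term.

    Assumed structure; the source only states that AND and OR were used.
    """
    return f"{or_group(topic_terms)} AND {or_group(sequence_terms)}"


def pairwise_queries(topic_terms, sequence_terms):
    """Narrow queries: exactly one topic term AND one sequence term each."""
    return [f'"{t}" AND "{s}"' for t, s in product(topic_terms, sequence_terms)]


if __name__ == "__main__":
    print(combined_query(TOPIC_TERMS, SEQUENCE_TERMS))
    for query in pairwise_queries(TOPIC_TERMS, SEQUENCE_TERMS)[:3]:
        print(query)
```

Under these assumptions, the seven topic keywords and five sequence keywords yield 35 pairwise queries, whereas the single combined query casts the widest net; actual database interfaces may require different quoting or field codes.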

To be selected for inclusion in our review, studies had to meet each of the following five criteria: (1) studies used randomized assignment to groups and quantitative data analysis; (2) text and pictures were processed in an identifiable sequence (either because sequence was experimentally varied or because processing sequence was identifiable via eye movement data); (3) text and pictures were created by researchers (or instructors) and were not self-generated by learners; (4) learning outcomes were measured (recall, recognition, and/or comprehension of presented information); and (5) results of the studies were clearly interpretable (design not confounded by extraneous factors).

In the end, 42 studies located in 26 journal articles, 1 PhD thesis, 1 conference proceeding, and 1 master’s thesis met the criteria, and hence were included for review (see Tables 1 and 2). Of these 42 studies, 16 directly compared learning outcomes from processing the picture before versus after the text (see studies marked with an asterisk in Tables 1 and 2). The remaining 26 studies investigated sequencing effects more indirectly by comparing whether processing the picture before versus after text was better for learning than processing text or picture only.

Table 1 Reviewed studies showing beneficial effects from processing pictures before text
Table 2 Reviewed studies showing beneficial effects from processing pictures after text

How Sequencing Affects Recall Performance

In this section, we review empirical studies that were mostly conducted in the context of theories on memory representations. In these studies, text and pictures that are used as to-be-learned materials usually have a high information overlap (in its extreme form containing redundant information). Pictures represent most of the information that is stated in the corresponding text in an organized manner, since it is meaningfully distributed in space (e.g., graphic organizer; Robinson 1998). The main learning task is to recall and recognize information from text and pictures. As a result, recall and recognition performance are the main learning outcome measures. According to influential theories in this context (e.g., Kulhavy et al. 1993), better recall and recognition result from richer and more connected memory representations. Thus, the more connections can be formed with prior domain knowledge and the to-be-learned materials (i.e., text and pictures), but also between the to-be-learned materials and the learning assessment (recall and recognition test), the better the learning outcomes. Studies that are reviewed in this context yield contradictory results at first sight, with some studies showing better recall performance from presenting the picture before text, and other studies showing better recall performance from presenting the picture after text. In the following, the results of these studies, together with their theoretical explanations, are presented. Subsequently, the apparent inconsistencies among the studies are resolved by referring to a recency effect (cf. Baddeley and Hitch 1993), meaning that it is better for learning outcomes if the type of information that is assessed (text-based vs. picture-based) maps to the information that is provided by the representation (text vs. picture) that is presented last, and therefore, most recently prior to the assessment.

Better Recall from Presenting the Picture Before Text

According to research in the context of the bushiness hypothesis (Baggett 1984) and the model of working memory operations (Kulhavy et al. 1993), recall is fostered by presenting the picture before text. The bushiness hypothesis (cf. Baggett 1984) rests on the assumption that prior domain knowledge as well as information extracted from pictures and text are represented as concepts in memory that have a number of possible associations to be formed with other concepts. Processing a picture leads to a visual concept, which allows forming more associations with other concepts compared to a verbal concept, which is inferred from text. Thus, the visual concept is assumed to be “bushier” than the verbal concept. When learning with pictures and text, the visual and verbal concepts will first be connected to the already existing semantic network. The number of associations that can be formed with the existing semantic network is determined by the learners’ level of prior domain knowledge. If the picture (i.e., bushier concept) is processed first during the learning episode, it allows linking more of the subsequent information than if the text is processed first. This increases the likelihood of creating a compound concept containing information from both picture and text, which can foster recall performance, especially when prior knowledge is low, so that the total number of possible associations that can be formed between the existing semantic network and the visual and verbal concepts is highly constrained (i.e., bushiness hypothesis). This hypothesis was tested in an empirical study (Baggett 1984) in which students with low prior knowledge had to recall the names of pieces of a construction kit from a film that presented the moving pictures either before (by 21, 14, or 7 s), concurrently with, or after (by 21, 14, or 7 s) the corresponding verbal narration. The sequence of presenting text and picture information was experimentally varied. In line with the bushiness hypothesis, the results from both immediate and delayed testing (7 days later) revealed better recall from presenting the pictures before the narration than from presenting the pictures after the narration (see Table 1). Best recall performance was achieved in conditions with concurrent narration as well as with the pictures preceding the narration by 7 s.

The model of working memory operations (Kulhavy et al. 1993; Verdi and Kulhavy 2002) is based on dual coding theory (DCT; Paivio 1986). DCT states that information from pictures and text is encoded into two separate but connected memory stores (i.e., a nonverbal and a verbal memory store). Retrieving information from one memory store automatically activates the corresponding information in the other memory store so that it is sufficient to retrieve the information from one of the two stores. In consequence, the better information from the two memory stores is connected, the more easily it can be retrieved. In their model, Kulhavy et al. (1993) apply DCT to the processing of maps. According to Kulhavy et al. (1993), maps have a special status in memory. They are represented in memory as intact, holistic units that can be held in working memory as a single chunk (Miller 1956), even though they may contain considerable information about features embedded within the map framework. Thus, when a map is presented prior to corresponding text, information from the map picture can be held as an intact unit in working memory while subsequently encoding information from text without exceeding the capacity of the cognitive system. This allows for simultaneous encoding of map and text information, leading to connected memory representations and hence better retrieval.

In contrast, if the map is presented after text, it should be much more difficult to connect the information from map and text. Due to the linear format of text, it is assumed that text is represented as numerous unrelated propositions in memory. Thus, keeping all the information from text active in working memory as well as retrieving it from long-term memory into working memory requires a considerable amount of resources. If the corresponding map picture is subsequently presented, connecting information from text and map might fail because keeping text propositions active in working memory while encoding information from the map picture exceeds the capacity of the cognitive system. As a consequence, the ease of retrieving information with text before map should be inferior to the ease of retrieving information with map before text. Empirical studies that experimentally varied whether an image was presented before versus after a text yielded support for the model of working memory operations (see Table 1). First, studies showed that presenting an image of a map prior to the corresponding text fostered recall performance in low prior knowledge learners compared to presenting the text prior to the map image (Dean and Enemoh 1983; Verdi et al. 1997). Second, two experiments of Verdi et al. (1996) showed that presenting biology diagrams to middle-school students prior to presenting text led to better recall and labeling performance than presenting the text before the diagrams, thereby extending the model to pictorial representations other than maps.

Similarly, studies conducted in the context of learning with graphic organizers yield support for the claim that presenting the picture before text leads to better recall than presenting the picture after the text (Robinson et al. 2003; Simmons et al. 1988) or the text only (Alvermann 1981; Snouffer and Thistlethwaite 1980). In a study by Simmons et al. (1988), delayed recall of information from a science text was better when students studied a graphic organizer before rather than after the text. Moreover, in one of three experiments conducted by Robinson et al. (2003), students were better able to recall macropropositions and relational information when the graphic organizer was presented as a complete set before rather than after the text. Robinson et al. (2003) concluded that presenting the graphic organizer as a complete set before the text provided learners with an overarching scaffold onto which relevant details from subsequently read text could be mapped. Similar to the explanations by Baggett (1984) and Kulhavy et al. (1993), this resulted in connected memory representations and hence in better retrieval.

To sum up, research in the context of the bushiness hypothesis (Baggett 1984) and the model of working memory operations (Kulhavy et al. 1993) suggests that learners are better able to form connections between text and picture representations when the picture is presented prior to presenting text, thereby yielding a richer and better connected memory representation fostering recall.

Better Recall from Presenting the Picture After Text

In this section, we review studies according to which presenting the picture after text is desirable for recall performance (see Table 2). In the reviewed studies, pictures were graphic organizers (e.g., Robinson 1998) or visualizations of dynamic scenes (e.g., Huff and Schwan 2008). Both types of pictures basically display what is stated in the text so that there is a relatively high information overlap. Using text and pictures with such an information overlap, research has shown that presenting the text after the picture is detrimental to recognition performance. This finding is known as the verbal overshadowing effect (e.g., Meissner and Brigham 2001). One viable explanation for the verbal overshadowing effect is a transfer-inappropriate processing shift (Chin and Schooler 2008; Dodson et al. 1997; Schooler 2002). According to this account, subsequent verbalization of an initially presented picture (e.g., a picture of a face) can disrupt the holistic memory representation constructed from the picture, which in turn is detrimental to performance in a recognition test. Accordingly, across the past two decades, several empirical studies have demonstrated that if text is presented after the picture, memory accuracy is disrupted for recognition of various types of visual stimuli such as map configurations, faces, or cars (see Chin and Schooler 2008; Meissner and Brigham 2001; Meissner et al. 2008; Schooler 2002 for reviews).

Moreover, studies have shown that if similar information is given in both text and pictures, presenting the text before picture has desirable effects on the recognition and reproduction of dynamic scenes (Huff and Schwan 2008, 2012), as well as on the recall and application of concept relations compared to presenting the text only (Kauffman and Kiewra 2010; Kiewra et al. 1999; Robinson et al. 1998, Exp. 2; Robinson and Kiewra 1995; Robinson and Schraw 1994). Additionally, in the study of Shaw et al. (2012), students were better able to apply knowledge about concepts and their relations when they learned with picture after text compared with picture before text. It can be concluded that the picture representation was better accessible in the assessment when it was presented after the text rather than before the text because it was the most recent representation prior to the assessment. This, in turn, fostered recall performance (cf. recency effect; Baddeley and Hitch 1993). In a similar vein, in two studies of McCrudden et al. (2009), three types of dependent variables were assessed, one of which was recall of the causal sequence (explicitly depicted in picture). Results of the study revealed that presenting the picture after text led to higher outcomes on all three dependent variables than presenting text twice. As expected by a recency effect, recall of the causal sequence (as depicted in picture) profited the most from presenting the picture after text, and thus, from the picture as the most recent representation prior to the assessment (see Table 2).

Boundary Condition: Type of Assessed Knowledge

In the preceding sections, some studies showed that it was better for learning outcomes to present the picture before text (e.g., Robinson et al. 2003; Verdi et al. 1997), whereas other studies revealed the exact opposite effect (e.g., Huff and Schwan 2008; Shaw et al. 2012). These seemingly contradictory findings may be reconciled by referring to a recency effect (cf. Baddeley and Hitch 1993). On the one hand, given low prior knowledge, inspecting the picture before the text fostered text-based recall because this sequence of presenting information increased the likelihood of creating connected memory representations so that the text representation was better retrieved when asked to recall facts from text (cf. Baggett 1984; Kulhavy et al. 1993). On the other hand, inspecting the picture after text fostered recall and recognition of picture information because the picture as the most recent representation was better accessible during retrieval, thus fostering performance (cf. Baddeley and Hitch 1993; Schooler 2002).

Results from a study of Peverly (1981) are in line with this argumentation. In this study, recall of a story was assessed after presenting either picture before text, text before picture, text twice, or picture twice. The only factor that influenced the results was the medium that was presented second (last). Results for recall of the story were consistently better in the two conditions with the text presented last than in the two other conditions, thereby supporting the recency-effect explanation. If mainly pictorial recall or recognition is assessed, then it may be better to present the picture after text so that the picture representation is better accessible in the assessment, fostering performance. Accordingly, a simple recency effect may underlie apparently contradictory findings from studies investigating sequence effects in the context of theories on memory representations.

In conclusion, the boundary condition that may determine whether it is better to present text or pictures first, and therefore reconcile findings concerning sequence effects, is the type of assessed knowledge. When recall is mainly text-based, then it should be better for learning outcomes to present the picture before text. When recall is mainly picture-based, presenting the picture after text should foster learning outcomes. How sequence effects may be explained in the context of theories on mental model construction is addressed in the following section.

How Sequencing Affects Comprehension

In this section, we review empirical studies that were mostly conducted in the context of theories on mental model construction. In many of these studies, text and pictures are used to explain the processes involved in scientific phenomena (e.g., how cell reproduction works; Stalbovs et al. 2013). The main learning task is to understand the scientific phenomena. As a result, comprehension of such a phenomenon is the main learning outcome measure, which is usually measured by requiring learners to draw inferences based on the presented information. It is assumed that text and pictures both contribute to constructing and updating a mental model that reflects comprehension (e.g., Van Dijk and Kintsch 1983). Depending on the level of prior knowledge and on the sequence in which text and pictures are processed, text and pictures have different functions in the process of mental model construction, and thus, in the process of constructing comprehension. Accordingly, empirical studies found better comprehension from processing a picture before text as well as from processing text before a picture. Results of those studies will be presented and explained in the following. Subsequently, it is suggested that the relative complexity (see Footnote 2) of information presented in text and picture may determine whether it is better for comprehension to process the text before the picture or the picture before the text.

Better Comprehension from Processing the Picture Before Text

When reading to understand the text, according to the construction-integration model (Van Dijk and Kintsch 1983), a reader first constructs a mental representation of the text surface structure, from which both a propositional representation of the semantic content (i.e., text base) as well as a mental model of the specific situation described in the text are generated. The text base is constructed based solely on semantic information explicitly stated in the text. The text base alone usually yields an impoverished and often even incoherent network. To achieve better comprehension, relations that are only implicit in a text must be inferred to yield a coherent mental structure (Glenberg and Langston 1992). Thus, understanding a text often requires interpreting the text by integrating text propositions with prior knowledge, mentally created images, or information extracted from a previously inspected picture (e.g., Bransford and Johnson 1972).

However, especially readers with low prior knowledge sometimes fail to construct a coherent mental model of the situation described in a text (cf. Bransford and Johnson 1972). They construct a mental model that inadequately reflects the contents or situations described in a text, thereby hampering comprehension (Schnotz and Bannert 2003; Schnotz and Kürschner 2008). By contrast, if prior knowledge is high (cf. McNamara et al. 1996), or if a picture is presented prior to reading the corresponding text, the process of constructing an adequate mental model from text, and hence comprehension, can be facilitated. Unlike text, pictures are related to their represented referents via structural similarity or commonality (cf. Hegarty 2011; Schnotz 2002) so that spatial relations expressed among the objects in a picture can be mapped onto the corresponding semantic relations to provide the structure of the mental model (analogical structure mapping; Schnotz and Bannert 2003). This means that information about the structural relations among the objects in a picture is preserved within the mental model (cf. Johnson-Laird 1980). As a consequence, a mental model can be directly constructed from the picture without requiring much interpretation or inference of additional information (Glenberg and Langston 1992; Larkin and Simon 1987; Hegarty and Just 1993). The picture is considered to be one possible expression of a mental model (Gyselinck and Tardieu 1999; Gyselinck et al. 2008).

Thus, processing of a picture may initially provide learners with the structure of a mental model so that part of the mental model construction process may already be completed based on the picture. When processing subsequent text, corresponding steps of mental model construction are not needed anymore. Thus, instead of having to construct a mental model from scratch, initial picture inspection may provide learners with a mental scaffold facilitating subsequent processes of mental model construction from text (cf. Eitel et al. 2013b; Gyselinck et al. 2008; Schnotz and Bannert 2003). Accordingly, in the studies by Eitel et al. (2013a, b), presenting a causal system picture to low prior knowledge learners before presenting the corresponding text led to better comprehension and faster reading of text about the system’s spatial structure compared with presenting just the text. These effects held true even if the initial picture presentation was very short (i.e., 600 ms or 2 s), suggesting that providing low prior knowledge learners with the global structure of an adequate mental model (i.e., a mental scaffold) can have beneficial effects on subsequent mental model construction, and thus on comprehension (see Table 1). Further evidence for this assumption comes from a recent study by Stalbovs et al. (2013), which shows that initially attending to the picture instead of attending to the text was related to more successful learning with multimedia about the biological processes of mitosis and meiosis. In a similar vein, Salmerón et al. (2009) showed that reading a graphical overview at the beginning of a difficult hypertext presentation was related to improvements in comprehension (especially when prior knowledge was low), whereas reading the overview at the end of an easy hypertext was related to a decrease in hypertext comprehension. The authors concluded that initially processing the overview increased the salience of the hypertext structure, thereby supporting low prior knowledge learners in generating inferences based on subsequent text.

Moreover, due to the specific nature of pictures (Stenning and Oberlander 1995), the mental scaffold provided by the picture may constrain the range of (erroneous) interpretations or inferences that are made based on the text (cf. Ainsworth 2006; Scaife and Rogers 1996). In particular, pictures can assist in the process of constructing a mental model from text because they can make relations explicit that are only implicitly conveyed by the text (cf. Glenberg and Langston 1992; Gyselinck and Tardieu 1999; McCrudden et al. 2011; Zwaan and Radvansky 1998). Thus, pictures may give a specific example of how to interpret text (cf. interpretation function; Levin et al. 1987). In the case of a well-designed picture, this can make the text more coherent and comprehensible, thus fostering understanding (Carney and Levin 2002; Gyselinck et al. 2008). Accordingly, in a study by Bransford and Johnson (1972), comprehension of a text passage was improved when a picture about the situation described in the passage was presented prior to the text (see Borges and Robins 1980 for a replication). Comprehension was improved compared with presenting just the text and compared with presenting the picture after text. Moreover, presenting the (coherent) picture before text was also better than presenting the picture before text when the picture contained the same objects but in a rearranged manner (partial context). Bransford and Johnson (1972) concluded that the appropriate context given by the (coherent) picture before text led to better comprehension; for the context to be helpful, it was required that the relations between the objects described in the text were provided by the initial picture. That is, understanding the relations within the context was a prerequisite for understanding the events suggested by the passage. Similarly, McCrudden et al. (2011, Exp. 1) showed that presenting a causal diagram prior to presenting text led to better learning outcomes for sentences that semantically overlapped with the diagram and shorter reading times than when learning with just text. The authors concluded that diagrams helped by making relations explicit, thus facilitating subsequent processing of text.

According to Schnotz (2005), presenting the picture after text may even provide learners with a disadvantage that is absent when pictures are processed prior to text. According to Schnotz, a text never describes a subject matter with enough detail to fit just one single picture or one mental model. Thus, a mental model constructed from just text will always differ in some respects from the picture that illustrates the subject matter. If such a text was presented prior to the corresponding picture, the picture would likely interfere with the mental model initially constructed from text, thus being detrimental to comprehension. In contrast, if the picture was presented before the text, subsequent mental model construction would be based on the specific mental model initially constructed from the picture, thus fostering comprehension. This assumed superiority of presenting pictures before rather than after the text is called the picture-text sequencing effect (Schnotz 2005). Accordingly, in an empirical study in which students learned with text and pictures about the principle of plate tectonics, Ullrich (2011) showed that presenting the picture before text in a sequential format led to better recall and comprehension than presenting the text before the picture (see Table 1).

To sum up, initially processing pictures can facilitate processing of text by constraining interpretation (Ainsworth 2006), and thus by resolving ambiguity that is usually present in text. Moreover, information extracted from the picture can act as a scaffold to facilitate the process of constructing an adequate mental model, which in turn fosters comprehension, especially for learners low in prior knowledge (e.g., Eitel et al. 2013b; McCrudden et al. 2011).

Better Comprehension from Processing the Picture After Text

Similar to processing text, when processing pictures with the goal of understanding their displayed contents, learners are assumed to construct a mental model (cf. Van Dijk and Kintsch 1983). According to influential models of learning with text and pictures such as the cognitive theory of multimedia learning (Mayer 2009) or the integrative model of text and picture comprehension (Schnotz 2002), constructing a mental model from a picture roughly involves two processing steps. First, relevant information from pictures has to be perceived or selected from the instruction. According to Schnotz (2002), this process takes place in a largely automated manner by making use of perceptual processes and visual routines. The learner creates a perceptual representation of the visuospatial relations depicted in the picture. In a second step, visuospatial relations from the perceptual representation are then mapped onto semantic relations to provide the structure of the mental model (analogical structure mapping; Schnotz and Bannert 2003). According to Mayer (2009), selected images are organized into a pictorial mental model by establishing connections between parts of the picture.

When learning with complex pictures, however, selecting the relevant information that is later used for mental model construction may be difficult, which in turn may impair comprehension. In a complex graph, such as a complex weather map in the domain of meteorology, task-relevant information may need to be selected from a much larger amount of displayed information (Canham and Hegarty 2010). In such a complex graph, it may be hard for students to distinguish between which information is relevant and which information is irrelevant with regard to solving the current task, especially when prior knowledge levels of learners are low. With increasing prior knowledge or expertise, however, students learn to separate task-relevant from task-irrelevant information so that they select only the relevant information and ignore the irrelevant information (cf. information reduction hypothesis; Haider and Frensch 1996). Studies using materials from meteorology (Canham and Hegarty 2010; Lowe 1993, 1994, 1996, 2004), medicine (Lesgold et al. 1988), art (Antes and Kristjanson 1991), chess playing (Charness et al. 2001), or biology (Jarodzka et al. 2010) provide evidence for the information reduction hypothesis, showing that more expert students focus more on elements that are thematically relevant than novice students do. Accordingly, a higher level of prior knowledge or expertise can lead to selecting more relevant information from a complex picture, which in turn can be helpful for comprehension (e.g., Canham and Hegarty 2010).

In other words, high prior knowledge can constrain the process of selecting information from a complex picture, in turn being helpful to mental model construction. One way to increase prior knowledge levels of students before they learn with complex pictures is to initially provide them with domain knowledge given in a text, as was done in two experiments reported in Canham and Hegarty (2010). In these studies, novice students were either taught or not taught the principles of meteorology using mainly text prior to processing complex weather maps (text-picture sequential format). Eye movements as well as the ability to draw inferences from the weather maps were compared between students who were taught the principles of meteorology initially (i.e., high prior knowledge students) and students who were not taught the principles of meteorology initially (i.e., low prior knowledge students). Results revealed that high prior knowledge students attended more to task-relevant information in the maps than low prior knowledge students did, which resulted in superior performance in inference generation from the weather map (see Table 2). These results suggest that task-relevant knowledge acquired from initially presented verbal instructions guided and constrained the selection of information from the complex picture by directing attention to its relevant parts, which in turn fostered inference making (comprehension).

The idea of text-guided processing of pictures has received empirical support in research on learning with text and pictures and in related domains. When presenting text and pictures concurrently, the text was used as a guide on how to process the concurrently presented picture (Folker et al. 2005; Hegarty and Just 1993; Ozcelik et al. 2010; Rummer et al. 2011; Schmidt-Weigand et al. 2010a, 2010b; Schwonke et al. 2009; Van Gog et al. 2009b). Thus, text guidance may be helpful for comprehension not only when it provides (additional) content information, but also when it guides attention to the relevant parts in the picture without providing further content information. Such effects have been found in the signaling literature. Here, several types of cues guided attention towards relevant parts in complex static and dynamic learning materials without giving additional content information (e.g., Bétrancourt 2005; Canham and Hegarty 2010; De Koning et al. 2009; Hegarty et al. 2003; Jarodzka et al. 2010; Mautone and Mayer 2001; Ozcelik et al. 2010; Scheiter and Eitel 2010; Van Gog et al. 2009a).

To conclude, given low prior knowledge, pictures may foster comprehension when processed after text because information extracted from initially processed text can act as a guide to facilitate the selection of relevant information subsequently presented in the picture (Canham and Hegarty 2010; Hegarty and Just 1993).

Boundary Condition: Relative Complexity

As shown by studies in the previous sections, inspecting the picture both before and after the text can foster comprehension. On the one hand, studies showed that processing the picture before text helped to constrain the interpretation of text that was ambiguous and hard to understand without sufficient background knowledge or context, thereby fostering comprehension via facilitated mental model construction (e.g., Bransford and Johnson 1972; Glenberg and Langston 1992; Schnotz 2005). On the other hand, other studies showed that initially processed text guided attention towards the relevant parts of a subsequently presented complex picture, thereby fostering comprehension (e.g., Canham and Hegarty 2010; Hegarty and Just 1993; Lowe 2004). One may conclude that it is helpful to learning if the medium that contains less complex information is presented first. As a result, information presented in the first medium is more likely to be understood even for low prior knowledge students, and thus, it can guide or facilitate processing of the more complex information presented in the other medium. Accordingly, the boundary condition that may determine whether it is better for comprehension to process the picture or text first is the relative complexity of picture and text.

This argumentation is in line with Ainsworth (2006), who states that it is reasonable to start an instruction by presenting the least complex representations to the learner. Moreover, this argumentation is in line with assumptions made by the elaboration theory of instruction (Reigeluth et al. 1980). According to this theory, an instruction should present less detailed and less complex information first, before more detailed and complex information. In analogy to a zoom lens, the theory prescribes that an instruction should begin with a wide-angle view of the subject matter, which shows its major parts and the relationships among them but still lacks detail. Afterwards, the subject matter should be divided into the subparts (“zooming in”) so that students can elaborate on each subpart. This zooming in should be continued until the desired level of detail is reached. This type of sequencing an instruction in an easy-to-complex manner has received much empirical support (e.g., Ainsworth et al. 1998; Weidenmann et al. 1999). In the context of learning with text and pictures, one would conclude that the medium containing less complex information, whether text or the picture, should be presented first. As such, it can facilitate processing of the medium presented second (e.g., via constraining interpretation or attention guidance; Ainsworth 2006; Hegarty and Just 1993), and thus foster comprehension.

Further Research Along Boundary Conditions

To sum up, we reviewed empirical studies that were conducted in the context of theories on memory representations (e.g., Kulhavy et al. 1993) and on mental model construction (e.g., Van Dijk and Kintsch 1983). As the present review suggests, a recency effect may explain apparently contradictory findings from studies investigating sequence effects in the context of theories on memory representations. In conclusion, the type of assessed knowledge (text-based vs. picture-based recall) is assumed to moderate whether it is better for learning to present the picture before or after the text. Whereas a picture-before-text sequence should lead to better recall in a text-based assessment, a picture-after-text sequence should lead to better picture-based recall and recognition. The studies reviewed in this article seem to support this hypothesis (see previous sections). However, in the context of theories on memory representations, most studies that directly compared presenting the picture before versus after the text used an assessment that was based on information from both text and picture. Fewer studies used a merely text-based assessment and, to our knowledge, there are so far no studies that directly compare presenting the picture before versus after the text and use a merely picture-based assessment (see Table 3). Such research, however, would be crucial to empirically validate the recency-effect explanation of sequencing effects as formulated within the present article. Further research should therefore systematically manipulate the sequence of presenting text and pictures together with the type of assessed knowledge (text-based vs. picture-based).

Table 3 Reviewed studies that directly compare learning outcomes from presenting pictures before versus after the text as a function of the hypothesized boundary conditions

In addition, the present review suggests that the relative complexity of the picture and text may explain findings of better comprehension from studies conducted in the context of theories on mental model construction. The reviewed studies seem to support the hypothesis that it is helpful for comprehension if the medium that contains the less complex information (text or picture) is presented first, and thus may guide or facilitate processing of the more complex information presented in the second medium (text or picture). However, so far, only a few studies have directly investigated this (see Table 3). Accordingly, further studies are needed that systematically investigate the effects of text-picture versus picture-text sequences in combination with the relative complexity of text and pictures on comprehension outcomes. Results of such studies could provide additional empirical support in favor of our hypothesis that an easy-to-complex sequencing of multimedia instructions could indeed explain the effects of better comprehension, regardless of whether the picture or text is presented first. Hence, this research would contribute to our knowledge about the interplay between the dimensions of sequencing and complexity in the process of mental model construction.

In conclusion, the ultimate goal of the present review was to generate informed hypotheses based on the given research evidence about how to explain sequencing effects when learning with pictures and text. The present review seeks to stimulate further research that more systematically tests for the validity of the proposed hypotheses (boundary conditions) to better understand the processes involved when learning with pictures and text.

In the present review, we made use of two distinct explanations for sequence effects when learning with text and pictures (i.e., recency effects; facilitated processing of medium presented second), but these explanations may not be specific to the situation of learning with text and pictures. For instance, since the 1960s, recency effects have been well-established in memory research, where they were mostly studied using unrelated word lists (e.g., Murdock 1962). This suggests that recency effects are not bound to the situation of learning with text and pictures. Moreover, the idea of an easy-to-complex sequencing of representations with the intention to facilitate comprehension (as suggested in the present review) may also not be specific to learning with text and pictures only. For instance, in a mathematical learning environment designed for primary school children (COPPERS; Ainsworth et al. 1998), coin problems were presented to children via increasingly abstract representations: first as pictures, then as a mixture of text and pictures, then as text only, and then as algebra. One may assume that this sequence was better for learning because initially acquired comprehension of the more concrete or easy representation (i.e., realistic picture) facilitated processing and comprehension of the more abstract and complex representation presented later in the sequence (i.e., algebra). However, to our knowledge, there is not much empirical research investigating the effectiveness of easy-to-complex sequencing compared to other types of sequencing of representations. Hence, it remains to be tested in empirical studies whether sequence effects can generally be explained by facilitated processing of more complex representations due to the initial processing of easier representations.

Regardless of their generalizability, in the present review, explanations of sequence effects, namely recency effects and facilitated processing of the medium presented second, were assumed to be independent. This makes sense, considering that for recency effects to apply, the congruency between the format of the most recent representation at learning (text or picture), and the representation format of the assessment (text-based or picture-based recall), is important. By contrast, according to theories on comprehension and mental model construction, the format of the assessment is not important. Comprehension is assumed to be a modality- or media-unspecific construct such that better comprehension would be equally applicable to text-based and picture-based assessments (cf. Gernsbacher et al. 1990). Accordingly, studies conducted in the two research contexts (memory vs. comprehension) that investigated mainly recall or mainly comprehension were treated separately in this review.

However, the learning outcome measures of recall and comprehension may not be entirely independent of each other. On the one hand, to demonstrate comprehension of a subject matter in a subsequent assessment, one has to recall what one had understood initially. On the other hand, correctly understanding a subject matter often requires processing it on a deeper level to be able to draw the required inferences, and deeper processing is known to facilitate recall in addition to facilitating comprehension (cf. Craik and Lockhart 1972; Salomon 1984). Thus, future empirical research should test whether recall and comprehension outcomes, and therefore their two separate theoretical explanations (recency effects; facilitated processing of medium presented second), are indeed independent of each other. Such research could investigate whether a systematic manipulation of the relative complexity of text and pictures interacts with a systematic manipulation of the assessment type (recall vs. comprehension) when studying sequence effects in learning from text and pictures. It should also take care that the learning outcome measures are valid and reliable in assessing the constructs of recall and comprehension.

Another interesting direction for future research would be to continue analyzing processing data when studying the effects of the sequence of presenting text and pictures. Presenting text and pictures in a sequential manner has a large advantage compared to presenting text and pictures simultaneously; that is, the former allows studying in isolation how the medium presented first (picture or text) affects processing and learning from the medium presented second (text or picture). This can provide valuable information, especially when process data is analyzed. For instance, by analyzing the eye movements of students learning with a sequential text-before-picture presentation, Canham and Hegarty (2010) were able to provide empirical data in favor of the claim that processing of a text prior to inspecting a complex picture can be helpful to comprehension because information extracted from the text guides attention towards the corresponding relevant information in the picture. Similarly, other types of processing data such as think-aloud protocols or self-explanations have been shown to provide valuable information regarding processes taking place when learning with text and pictures (e.g., Ainsworth and Loizou 2003; Butcher 2006; Chi 2000). Hence, further research may continue making use of such data to study how processing of text may interact with the processing of pictures. This may provide further information about processes that underlie successful learning with text and pictures, thereby providing a basis from which instructional recommendations may be derived in the future.

The Influence of Learner Characteristics

Future research regarding this topic should also focus more on the influence of certain learner characteristics such as prior knowledge, reading abilities, or visuospatial abilities, since they might strongly influence effects of the learning instruction. For instance, visuospatial abilities might play a role because presenting the picture before text might reduce the degree of required visuospatial reasoning based on the text. In the studies of Eitel et al. (2013a, b), presenting a picture of a pulley system before text fostered comprehension and sped up the processing of subsequent text about the system’s spatial structure (compared to presenting text only). It was concluded that part of the required mental model construction was already completed based on the initial picture inspection, which facilitated subsequent visuospatial reasoning processes based on the text, thereby speeding up the reading process and fostering comprehension. One might conclude that this facilitating function of the picture would be especially helpful for learners with low visuospatial abilities. If, in contrast, the picture is presented after text, then students would first need to construct a mental model based on text only, which would require a higher degree of visuospatial reasoning that could be detrimental especially for learners low in visuospatial abilities, whereas learners high in visuospatial abilities might be able to compensate for the missing picture in the initial position.

In a similar vein, a study by Dean and Enemoh (1983) showed that presenting the picture before text could compensate for low prior knowledge levels. When the picture was presented before text in their study, students low in prior knowledge scored equally high on a free recall test as students high in prior knowledge, and higher than low prior knowledge students who received the picture after text. Referring to theories on memory representations (e.g., Kulhavy et al. 1993), one may explain these findings by assuming that either prior knowledge or the picture in the primary position provided learners with an organized mental structure that allowed for connecting and integrating subsequent text information, hence fostering retrieval. Thus, it appears that especially students with low prior knowledge may profit from presenting the picture before text, while students high in prior knowledge may not necessarily need this kind of help. It is conceivable that high prior knowledge students might even benefit from having the more demanding task of first trying to understand the text on their own, without having the help of supporting pictures. So far, however, systematic research concerning this potential moderating role of prior knowledge when presenting text and pictures in different sequences is missing. Therefore, an important avenue for further research is to study the role of relevant learner characteristics (e.g., prior knowledge, reading abilities, visuospatial abilities) in learning with different text-picture sequences.

The Influence of Segment Size and Pacing

Two other relevant factors with respect to sequencing effects when learning with pictures and text are the size of the segments and the pacing of the sequence. Several of the studies reviewed in this article investigated the effects of an instructor-paced and coarse-grained sequence of presenting text and pictures (25 in total); that is, they addressed the situation of presenting the whole picture once before or after presenting the whole text (e.g., Robinson et al. 2003; Ullrich 2011; Verdi et al. 1997). Some other studies investigated the effects of multiple cycles of text-picture processing (ten in total), in which only a part of the information from text and picture was given within each cycle (e.g., Baggett 1984; Robinson et al. 1998; Shaw et al. 2012). From the point of view of the temporal contiguity principle, presenting the text and pictures in close temporal proximity is generally seen as more effective for learning than presenting them in a temporally discontiguous manner (see Ginns 2006; Mayer 2009 for overviews). Accordingly, it could be assumed that performance decreases along a continuum from a simultaneous presentation to a fine-grained sequential presentation to a coarse-grained sequential presentation (see also Mayer and Anderson 1991, 1992; Mayer et al. 1999; Mayer and Sims 1994), especially when learning materials are complex (Ginns 2006) and when they are presented in a short and system-paced manner (cf. segmenting principle; Mayer 2009). While we do not doubt this, the present review nevertheless shows that even instructor-paced and coarse-grained sequential presentations of text and pictures produced better learning outcomes than presenting text only or picture only (e.g., McCrudden et al. 2011). Whether segment size and pacing also moderate the effects of presenting the picture before versus after text in a sequential display remains a subject for further empirical research.

Summary and Conclusions

In the present article, studies were reviewed that showed better learning outcomes from presenting the picture before text as well as from presenting text before the picture. At first sight, the reviewed studies revealed a mixed pattern of results regarding whether it is better for learning to process a picture or text first. While in some studies, presenting the picture before text was better for learning outcomes (e.g., Dean and Enemoh 1983; Robinson et al. 2003), other studies revealed the exact opposite effect (e.g., Huff and Schwan 2008; Shaw et al. 2012). Against the backdrop of theories on memory representations and mental model construction, in the present article, we hypothesized that two boundary conditions, namely (1) the type of assessed knowledge and (2) the relative complexity of information conveyed by the picture and by text, would determine whether it is better for learning outcomes to process the picture or text first. Whereas the reviewed studies tended to support our hypotheses, the present review also shows that systematic research still has to be done to provide sufficient empirical evidence in favor of our claims (e.g., research using picture-based assessments in the context of sequencing effects).

Accordingly, with this review, we want to provide guidelines for further research conducted along our hypothesized boundary conditions. Such research could provide evidence for our hypotheses regarding the cognitive processes that may underlie effects of the sequence of presenting text and pictures. Understanding which cognitive processes make a certain sequential presentation better for learning might provide valuable information about which processes ought to be stimulated to foster learning success in the future. Thus, in the long run, such information may provide the theoretical basis from which more specific instructional recommendations could be derived about when and how to process pictures and text to foster learning success.