
In this meta-analysis the efficiency of multimedia applications on literacy skills in developing young children was examined. In particular, we looked at computer-assisted instruction (CAI), picture storybooks presented on a computer with audio and video animations (e-books), and conventional TV/Video applications. The review is restricted to the 2000–2010 period in order to compare the results with a study that covered 1990–2000, an equally long period (Blok et al. 2002). Studies on learning to read and write in alphabetic languages were eligible.

1 Becoming Literate

Initially, all written words with the exception of a few words recognised from environmental print are completely unfamiliar to beginning readers. At school, children first learn how letters are pronounced, and then learn to read words by consecutively translating each letter (grapheme) into a sound (phoneme) and blending the sounds into a whole-word sound, a process called phonological recoding. Alternatively, look-and-say methods or a mix of decoding and whole-word strategies are used for words such as yacht, which retains its sixteenth-century Dutch spelling. Thus, two processes are involved in word recognition: (1) phonological recoding, and (2) visual-orthographic look-up, which Coltheart (1984) termed the dual route model of reading. Share (1995) speaks to the developmental aspects of the dual route model. He proposes that phonological recoding serves as a self-teaching mechanism for visual-orthographic look-up, enabling the beginning reader to proceed from slow deciphering through decoding to fast retrieval of word pronunciations through visual-orthographic look-up. The self-teaching hypothesis (Share 1995) contends that with every phonological recoding attempt, both the phonological (how the word is pronounced) and the orthographic (how the word is written) specifications will be strengthened in the lexicon.

The psychological process underpinning reading comprehension, the ultimate goal of reading, seems to be even more complicated. However, the assumption that reading comprehension builds on listening comprehension has proven to be a good starting point (Kintsch and Rawson 2005). According to these authors, comprehension largely depends on automatic processes that help us build up a representation of the text at hand. Automatic processes are processes that do not require conscious effort to execute them, such as listening comprehension (in one’s native language). Another process that needs to be automatic is word recognition. Word identification processes need to be automatic in order to have resources available for understanding what the text is about. A text is represented at several levels, including a linguistic structure, a semantic representation, and a so-called situation model, that is, a mental model of what the text is about. Perfetti et al. (2005) suggest that the essential skills children should acquire include the following: (1) The parsing of meaning and form of sentences into a text representation; (2) Building up a situation model on the basis of the text representation; and (3) Drawing inferences, that is, making the text coherent, because no text is completely explicit. Finally, the model developed by Perfetti and colleagues assumes that the real “bottleneck” in reading for meaning is decoding skill, that is, quick word recognition (see also Perfetti 1985).

2 Individual Differences in Reading Development

In the following section we describe where and how multimedia might benefit literacy learning, having first looked into developmental and behaviour-genetic studies of reading.

Longitudinal Studies of the Development of Reading Skills

Stanovich (1986, 2000) conducted a series of studies to explain the ‘fan-spread’ effect on the variability of reading skill. He observed that students who started at a relatively high level of initial reading skill developed their skills much more quickly than students who were less able when they started learning to read. He termed this difference the ‘Matthew effect’, after the biblical reference to the rich getting richer and the poor getting poorer. From recent research we know that the driving factor behind the increasing differences in reading skill is leisure-time reading. More precisely, leisure-time reading activities were related to differences in the size of the vocabulary, and, in turn, vocabulary size promotes reading comprehension (Bast and Reitsma 1998).

Differences between students already exist when formal reading instruction starts, usually when children are 5, 6 or 7 years of age. It is clear that general cognitive skill is a powerful predictor of reading ability, as long as no specific skills for the effective processing of print are learnt, that is, when measured in kindergarten (Bowey 1995). Using an assessment of vocabulary in kindergarten, Bowey (1995) and De Jong and Van der Leij (1999) explained between 15% and 22% of the variance in reading in the first grade. Most probably, general cognitive ability contributes to reading success through efficient perceptual processes, such as being able to discriminate letters and sounds. Within normally developing children it is verbal ability at preschool age, rather than general cognitive ability, which determines later success at learning to read (Stanovich 2000). Subsequent studies have examined which specific aspects of verbal ability predict early reading achievement. Vocabulary predicts about 25% of the variance in reading at the end of first grade (Bowey 1995), whereas grammatical skills predict about 17% (Scarborough 1990). Phonological memory, commonly measured with a nonword repetition task (Baddeley and Gathercole 1992), predicts reading development in both deep (English) and relatively shallow orthographies, like Dutch (De Jong and Van der Leij 1999) and German (Naslund and Schneider 1996). Most of the research concentrating on speech perception and speech production has been carried out by Scarborough (1990), who found that errors in spontaneous speech in 30-month-old children predicted reading attainment in the second grade, and by Elbro et al. (1998), who observed that the distinctness with which Danish children pronounced phonologically complex words predicted later reading success, even when effects of letter knowledge and other factors were controlled for.

Phonological sensitivity is perhaps the most researched factor. The initial finding that kindergartners’ ability to count and manipulate phonemes and syllables in spoken words predicts later reading achievement (Mann and Liberman 1984) has led to an enormous amount of research not only in normally developing children, but also in children with dyslexia. The tasks typically require children to select a word that rhymes with a given word, to say a word leaving out the last sound, or similar. Phonological skills play a relatively large role in learning to read in a deep orthography such as English, but are developmentally limited in shallow orthographies (Wimmer et al. 2000), that is, they are only relevant during a limited period (in the beginning of the year in which children start learning to read). Letter-name knowledge appears to be a very strong predictor of later reading achievement, explaining up to 36% of the variance in word identification at the end of the first year of reading instruction, especially when phonics reading programmes were used (Bowey 2005).

Finally, rapid automatised naming (RAN) has been a factor of much research interest. In RAN tasks a subject has to name as quickly as possible a continuous series of stimuli such as digits, common objects, colours, letters or words. There is still a debate over whether RAN is an independent contributor to early reading achievement over and above phonological skills. When assessed with digits and letters, it is likely that the effects are mediated through letter knowledge (Wagner et al. 1997).

Behaviour-Genetic Studies of Reading

The power of behaviour-genetic studies, in which monozygotic (MZ) twins, who share 100% of their genes, are compared with dizygotic (DZ) twins, who share about 50% of their genes, is that they facilitate an assessment of genetic, shared environmental, and non-shared environmental influences. Examples of shared environmental factors are the school, the teacher, and the reading method used. One of the twins breaking a leg and missing school for some time is an example of a non-shared or unique environmental factor. If the correlation in DZ twins is more than half the MZ correlation, then there is an influence of the shared environment (see Footnote 1). If the correlation is smaller, genetic factors play a relatively more important role. In short, behaviour-genetic studies can inform us of where teachers have the best chances to make a difference for their students and of where best to use technology, that is, where influences of the shared environment are relatively large. Behavioural genetics can also help us to find those components of reading skill that are only moderately or less heritable. These components depend much more on the environment and are sensitive to changes in the environment, for example, to teaching, training or intervention (with multimedia).
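The comparison of the DZ correlation with half the MZ correlation can be made concrete with what are commonly known as Falconer's formulas. The following is a minimal sketch only: it is not the model-fitting approach used in the twin studies cited here, and the function name and the example correlations are hypothetical values chosen for illustration.

```python
def falconer_estimates(r_mz: float, r_dz: float) -> dict:
    """Rough variance decomposition from twin correlations (Falconer's formulas)."""
    return {
        # Genetic influence: twice the difference between MZ and DZ correlations.
        "heritability": 2 * (r_mz - r_dz),
        # Shared environment: positive only when r_dz exceeds half of r_mz.
        "shared_environment": 2 * r_dz - r_mz,
        # Non-shared environment (also absorbs measurement error).
        "nonshared_environment": 1 - r_mz,
    }


# Hypothetical correlations, not taken from any study in this review.
example = falconer_estimates(r_mz=0.80, r_dz=0.55)
for component, value in example.items():
    print(f"{component}: {value:.2f}")
```

In this illustration the DZ correlation (0.55) exceeds half the MZ correlation (0.40), so the shared-environment estimate is positive (0.30), alongside a heritability estimate of 0.50.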

Behavioural-Genetic Studies of Decoding Skill

With a genetically sensitive design in three different countries (U.S., Australia, and Norway and Sweden together), Samuelsson et al. (2007) looked at the contributions of phonological awareness (PA), rapid automatized naming (RAN), verbal memory, vocabulary, knowledge of grammar and morphology, and knowledge of and experience with print to reading and spelling at the end of kindergarten. PA, RAN, and verbal memory showed substantial heritability, whereas knowledge of and experience with print and vocabulary showed strong influences of shared environment. Oliver et al. (2005) found similar results in a study conducted with a larger sample of twins in the UK.

Behavioural-Genetic Studies of Reading Comprehension

Byrne et al. (2009) replicated earlier findings that reading comprehension is substantially heritable and mostly determined by vocabulary, which has both substantial heritability and shared environment components in Grade 2. Keenan et al. (2006), working with older students in whom the assessment of reading comprehension is less confounded with decoding skill, found that listening comprehension and word recognition (decoding) were the most important variables that independently drive reading comprehension.

3 Multimedia

Multimedia in the context of this meta-analysis refers to the integration of text, images, and sound presented electronically. Children, even very young children, are increasingly exposed to electronic media in the form of television, video, DVDs, computer programmes, electronic books, talking books, the internet, video games, tablet and smart phone applications, and interactive toys, to name a few.

Nearly 30 years ago, researchers called into question the efficacy of the prevailing teaching paradigm of one-dimensional, primarily verbal delivery of instruction (Clark and Paivio 1991) and recognised the potential for multimedia technologies to facilitate interactive learning opportunities. The National Association for the Education of Young Children (NAEYC) issued a position statement acknowledging that “used appropriately, technology can enhance children’s cognitive and social abilities” and recommended that “computers should be integrated into early childhood practice physically, functionally, and philosophically” (NAEYC 1996, p. 2). An update was published in collaboration with the Fred Rogers Centre in January 2012 (http://www.fredrogerscenter.org). However, while some recognise the potential for multimedia to enhance learning, others debate the desirability of technology in early childhood education settings (Buckingham 2000; Lankshear and Knobel 2003; Stephen and Plowman 2003). Some argue that the use of technology in early childhood may not be developmentally appropriate, particularly in terms of cognitive overload (Kirschner 2002). Conversely, proponents of dual-coding theory maintain that the combination of visual with auditory stimuli results in enhanced comprehension (Sadoski and Paivio 2007). Some reference teacher resistance to incorporating technology into lessons (Turbill 2001), while others argue that integrating technologies into classrooms, particularly those of young children, costs much and produces little in measurable educational gains (Yelland 2005). Still others go so far as to contend that the use of technology undermines the very nature of childhood (Buckingham 2000). Whether or not young children should engage with multimedia has long been debated. Nonetheless, it is clear that children are, in fact, doing so on a daily basis (Etta, this volume; Rideout and Hamel 2006; Rideout 2014).
Whichever side of the debate one takes, whether one views technology as a powerful resource for early literacy enhancement, supporting ‘children of the digital age’ (Marsh 2005), or criticizes it as ‘the death of childhood’ (Buckingham 2000), a meta-analysis can tell us how effective multimedia applications are.

More importantly, it needs to be considered how technology, and multimedia applications in particular, might work, that is, how they actually might benefit literacy learning. Cheung and Slavin (2012) suggest that (new) technology might improve (1) the quality of instruction, because “content can be presented in a visual, varied, well-designed, and compelling way”; (2) the appropriateness of the level of instruction, because the pace and level of instruction can be adapted to individual needs; (3) the incentives to learn; and (4) the time on task and the provision of feedback.

Reviews of Multimedia

Several literature reviews have attempted to provide an overview of the existing research on the topic (Courage this volume; Hisrich and Blanchard 2009; Kamil et al. 2000; Lankshear and Knobel 2003; Plowman and Stephen 2003; Bus et al. this volume; Yelland 2005; Zucker et al. 2009). See also recent reviews on the topic, Courage’s chapter and Bus, Sari, and Takacs’s chapter in this book. Kamil et al. (2000) undertook a comprehensive review of 350 articles including empirical studies and research reviews on the effects of multimedia on literacy. It was suggested that the use of multimedia facilitates children’s comprehension through ‘mental model building’, hypothesized to be a result of information presented as animation. Similarly, Lankshear and Knobel (2003) provided a synthesis of the research on the use of technology in promoting early literacy, focusing on young children. They found only 22 published articles that were relevant for review. In their quantitative assessment of the literature it was found that the research literature was unevenly distributed, with most focusing on the conventional aspects of reading such as decoding, rather than comprehension, or generating texts. Most significantly, they concluded that the effects of technology on early literacy development were “radically under-researched”. Likewise, Burnett’s (2009) literature review on literacy and technology in primary classrooms also noted a lack of research on the topic. A review of 38 studies published between 2000 and 2006, 22 quantitative and 16 qualitative, was conducted. It was concluded that the studies reviewed were limited in scope, as technology was used to support literacy in the same ways as print literacy, “assimilating technology by grafting it onto existing practices”, and therefore rendering the differential impact of multimedia on literacy development difficult to ascertain.

Recognising the need for research evidence on the topic, Zucker et al. (2009) provided a synthesis of studies published between 1997 and 2007 on the effects of electronic books (e-books) on the literacy outcomes of children from preschool through fifth grade. Seven randomized-trial studies and 20 quasi-experimental narrative studies met the selection criteria for their review. The aim of the study was to examine effects of e-books on children’s comprehension and decoding-related skills, specifically in relation to emergent and beginning readers and children with reading disabilities or at risk of reading failure. For the seven randomized-trial studies included, the meta-analysis showed small to medium effect sizes for comprehension. The effect on decoding was inconclusive, as only two studies that met the inclusion criteria examined it. The 20 studies included in the narrative review indicated mixed results. While it was found that e-books overall supported comprehension, they could, under some circumstances, actually undermine it (De Jong and Bus 2002). More recently, Cheung and Slavin (2012) found effect sizes of 0.37 for low-ability children, 0.27 for middle-ability children, and 0.08 for high-ability children, respectively, when reviewing 84 studies conducted in K–12 over the period 1980–2010. Although these effects are small, they clearly indicate that those who need it most benefit most: an indication that Matthew effects can be reversed!

4 Computer-Assisted Instruction

Since the late 1960s computers have been used to assist in the teaching of reading and in the remediation of reading problems. Some computer programmes aim at practising a specific subskill of reading. Other programmes have been designed to combine the training of various subskills. An example of a combination of repeated reading, phonological awareness, and decoding is the WordBuild programme (McCandliss et al. 2003; Harm et al. 2003). The following two categories still seem to describe CAI for reading adequately: (1) computerised versions of basal reader programmes, and (2) tools that have especially been developed for (older) struggling readers.

Computerised Versions of Basal Reader Programmes

These programmes come with a standard reading method and may differ from each other in several ways. In some reading methods the accompanying computer programme offers additional practice for struggling readers; in others all children go through the same programme, more or less at the same pace. The main characteristic is that these programmes contain several types of practice, usually from training phonological skills to text reading. More recently, reading and math programmes have been developed that keep motivation levels high by providing tasks that are neither too easy nor too difficult for the individual learner (e.g., Klinkenberg et al. 2011).

Tools Especially Developed for Struggling Readers and Older Persons with Dyslexia

These programmes, such as Kurzweil 3000 (http://www.kurzweiledu.com/), support the user in reading by reading texts aloud. Kurzweil 3000 also offers the possibility of scanning books while keeping the original layout, including pictures, drawings, and tables. The spoken text can be exported as an MP3 file and can then be listened to anywhere, without the need to take a computer with you.

Reviews of CAI

The Stanford project, aimed at a complete replacement of the teacher by a computer, was the first project to be evaluated. It did not, however, live up to the expectations (Fletcher and Atkinson 1972). The main reason that these reading programmes never would have become cost-efficient is that they ran on very expensive mainframe computers. Slavin (1991) evaluated IBM’s Writing to Read programme in a meta-analysis of 29 studies and concluded that the efficiency of the programme was very low, that is, the costs in comparison to the learning effects were too high, a conclusion that is in line with other reviews (Krendl and Williams 1990).

Seven reviews that evaluate the use of CAI in beginning reading have been published since 1990 (as far as we are aware). Two used meta-analytic techniques: Kulik and Kulik (1991) and Ouyang (1993) found effect sizes of 0.25 (SE = 0.07) and 0.16 (SE = 0.08), respectively. Qualitative reviews were conducted by Torgesen and Horen (1992), Van der Leij (1994), Wise and Olson (1998) and by the National Reading Panel (2000), which were generally positive. However, Torgesen and Horen (1992) pointed out that much work needed to be done on the integration of the computer with the existing curriculum that was highly teacher-driven. The qualitative studies conducted by Van der Leij (1994) and Wise and Olson (1998) both concerned the use of computers with reading-disabled children. Van der Leij (1994) found that studies that concentrated on a specific subskill were generally more effective than multi-component programmes. Wise and Olson (1998) concluded that talking computers combined with phonological awareness training had a positive effect on learning outcomes, especially in children with relatively stronger phonological skills. The National Institute for Literacy report (2008) also concluded that talking computers show promise.

Although most of the recent studies seem to be positive about effects of CAI, the two studies that analysed effect sizes within a meta-analytic approach do not, however, give much reason for optimism, as mean effects of about 0.20 with a standard error of around 0.07 are reported. In Cohen’s (1988) terminology these are small effects. However, it is likely that due to improvements in computer hardware and software and the integration of the computer in classroom learning activities, CAI has become more effective. Therefore, Blok et al. (2002) analysed studies undertaken in the 1990–2000 period. They categorised the studies, which all were concerned with beginning reading, along a variety of criteria in order to be able to find out what the elements are that make computer programmes work. In particular, across 45 studies that reported on 75 experimental conditions, they looked at effect sizes and at characteristics such as year of publication, language of instruction, experimental design (with or without control group, with or without pretest), subject assignment (blocking, randomisation, matching, within-subjects), size of control and experimental group, population (normally developing, reading-disabled), age of participants at the beginning of the study, type of programme (phonological awareness, speech feedback, flash words, reading while listening, or mixed), duration of the programme in weeks and in hours, type of the dependent variable (phonological skills, letter identification, word accuracy, word speed, text accuracy, text reading speed, mixed), and type of posttest score (observed score, gain score, score adjusted for covariates). The combined effect size was 0.254 with a standard error of 0.056. Experimental subjects thus were on average 0.254 standard deviations better off than students in the control condition or compared with a baseline score.
The variance of the effect sizes was 0.083, which means that there were considerable differences in effect sizes between the studies. Thirty-four per cent of the variance could be explained by entering the effect size at pretest into the equation. Language of instruction explained another 27% of the variance; studies conducted with English as medium of instruction obtained effect sizes that were 0.319 SD larger than non-English studies. No other variable was related to effect size at the posttest. The conclusions were very straightforward: computer-assisted instruction has little effect. As noted, another 10 years of further developments in hardware and software had not produced better results than in the decade before. The language effect comes, however, as a surprise. The authors explained it as an effect of the transparency of the language. If this explanation were viable, however, the same would be expected for the Danish studies (there were two Danish studies in the sample), because Danish is nearly as deep as English with respect to the orthography of the language (Seymour et al. 2003). The language effect may reflect that there is more room for improvement in deep orthographies, as reading development lags behind in deep orthographies compared with shallow orthographies.
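To give a sense of the precision and spread implied by these figures, the reported combined effect size, its standard error, and the variance of the effect sizes can be turned into a confidence interval and a standard deviation. This is a sketch under a normal approximation (z = 1.96); the interval and the z value are derived here for illustration and are not figures reported by Blok et al. (2002). The function and variable names are our own.

```python
import math


def confidence_interval(effect_size: float, standard_error: float, z: float = 1.96):
    """Normal-approximation confidence interval around an effect size."""
    half_width = z * standard_error
    return effect_size - half_width, effect_size + half_width


# Values as reported in the text for the 1990-2000 review.
combined_es = 0.254
combined_se = 0.056
variance_between = 0.083

low, high = confidence_interval(combined_es, combined_se)
z_value = combined_es / combined_se          # effect size in standard-error units
sd_between = math.sqrt(variance_between)     # spread of effect sizes across studies

print(f"95% CI: {low:.3f} to {high:.3f}")
print(f"z = {z_value:.2f}")
print(f"SD of effect sizes = {sd_between:.3f}")
```

Under these assumptions the interval runs from roughly 0.14 to 0.36, so the combined effect is reliably above zero yet remains small in Cohen's terms, while the standard deviation of about 0.29 shows that individual studies varied by more than the size of the mean effect itself.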

5 Purpose of the Study

The aim of the current systematic review is to analyse the studies that were conducted after the Blok et al. (2002) review, that is, studies published between 2000 and 2010, an equally long period. The review was extended with e-books that became widely available during that period, together with TV/Video. Furthermore, defining characteristics of the studies associated with the effect sizes are examined. We expected that multimedia applications would be more effective than before, because of the following technological changes. Availability of the Internet in schools made it possible to have access to large databases of learning materials. Generally, also, video and audio animations improved, and, due to new programming methods, programming computers, tablets and smart phones became easier.

6 Method

6.1 Search Criteria

Specific key terms and phrases related to multimedia and early literacy were identified by reviewing the following reference books: Handbook of Early Literacy Research (Neuman and Dickinson 2001), Handbook of Research on New Literacies (Coiro et al. 2008), and International Handbook of Literacy and Technology, Volume II (McKenna et al. 2006). The first two authors independently devised key word search strings, and then cross-referenced these, resulting in the following list of primary search key words: children, young children, children at risk, minority children, language minority children, cultural minority children, low SES children, disadvantaged children, children with reading disabilities, dyslexic children. Secondary search key words were: literacy, emergent and early literacy, reading, early and beginning reading, writing, early writing, beginning writing. Finally, the following tertiary search key words were used: media, multimedia, electronic media, digital media, technology, ICT, information technology, educational technology, interactive technology, digital books, on-line books, talking books, electronic books (e-books), CD-ROM, computers, computer-assisted learning, computer-based learning, CAI, internet, World Wide Web, television (educational television, children’s television), Sesame Street, Between the Lions, DVD, mobile phones.

6.2 Search Strategy

The Educational Resources Information Centre (ERIC) and PsycINFO were searched simultaneously using the aforementioned key word search strings. The broadest terms were input first and ‘find all search terms’, ‘apply related words’, and ‘also search full text’ were options selected in order to attain the highest number of hits. In PsycINFO, a selection was made to narrow the subject age range by selecting the age group ‘childhood (birth – 12 years)’. These databases were then searched for peer-reviewed articles published in English between 2000 and 2010. In addition, the following key journals published in the same period were manually searched: Journal of Early Childhood Literacy, Journal of Research in Reading, Reading Research Quarterly, Early Childhood Research Quarterly, Journal of Literacy Research, Reading and Writing, Computers & Education, and Journal of Computer Assisted Learning. Finally, the following special issues on technology and young children were searched: ‘Technology in early childhood education’ in Early Education and Development (Vol. 17, 1, 2006), ‘Using technology as a teaching and learning tool’ in Young Children (November 2003), ‘Literacy and technology: Questions of relationship’ in Journal of Research in Reading (Vol. 32, 1, 2009), ‘Technology special issue’ in Contemporary Issues in Early Childhood (Vol. 3, 2, 2002), and ‘Technology and young children’, downloaded from www.technologyandyoungchildren.org. References for several hundred potential studies were located. After reviewing the abstracts of each, 92 studies were acquired through the library or, if published in an e-journal, downloaded for further evaluation. The search and the review process were carried out by each of the first two authors independently and then cross-checked.

6.3 Selection of Relevant Studies

Studies were included in the meta-analysis if they met the following criteria, judged on the basis of the article abstract if it provided the necessary information, and of the full text if the abstract was not sufficient. (1) Quantitative research on literacy interventions published in peer-reviewed journals between 1 January 2000 and 1 May 2010. (2) Studies in which participants were classified as ‘early childhood’, that is, subjects 0–8 years old. (3) Studies that included children at risk for literacy failure (e.g., dyslexia, low SES and/or language/cultural minority children). (4) Studies that included mainstream children. (5) Studies that measured at least one of the following literacy outcome variables: phonological awareness, reading comprehension, spelling, accuracy of reading words, accuracy of reading nonwords, fluency of reading, learning about print concepts, vocabulary learning, letter learning, rapid automatized naming and listening comprehension (National Institute for Literacy 2008). (6) Studies that were published in English.
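The selection criteria above can be read as a single screening rule applied to each candidate study. The sketch below is purely illustrative: the screening in this review was carried out by hand from abstracts and full texts, and the record fields, outcome labels, and example records are hypothetical. Criteria (3) and (4) together mean that both at-risk and mainstream samples are eligible, so risk status is not used as a filter.

```python
from dataclasses import dataclass
from datetime import date

# Outcome labels are shorthand for the eleven outcomes listed under criterion (5).
ELIGIBLE_OUTCOMES = {
    "phonological awareness", "reading comprehension", "spelling",
    "word reading accuracy", "nonword reading accuracy", "reading fluency",
    "print concepts", "vocabulary", "letter learning",
    "rapid automatized naming", "listening comprehension",
}


@dataclass
class StudyRecord:
    published: date
    peer_reviewed: bool
    quantitative: bool
    max_age_years: float
    language: str
    outcomes: frozenset


def meets_criteria(study: StudyRecord) -> bool:
    """Apply criteria (1), (2), (5) and (6); (3) and (4) admit all risk groups."""
    in_window = date(2000, 1, 1) <= study.published <= date(2010, 5, 1)
    early_childhood = study.max_age_years <= 8
    has_outcome = bool(ELIGIBLE_OUTCOMES & set(study.outcomes))
    return (study.quantitative and study.peer_reviewed and in_window
            and early_childhood and has_outcome and study.language == "English")


# Hypothetical records for illustration only.
eligible = StudyRecord(date(2005, 3, 1), True, True, 6, "English",
                       frozenset({"vocabulary"}))
too_late = StudyRecord(date(2010, 9, 1), True, True, 6, "English",
                       frozenset({"vocabulary"}))
print(meets_criteria(eligible), meets_criteria(too_late))
```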

6.4 Coding

The first two authors independently coded all studies as to the following study characteristics. (1) Age group of the participants. Categories included kindergarten, preschool and kindergarten, first graders, second graders, kindergarten through second graders, second and third graders. (2) Specificity of treatment. Studies were coded as either training one subskill or training more than one subskill. (3) Risk of reading failure: at-risk (low SES, second language learner, or reading failure) or not-at-risk students. (4) Language of instruction: English, Dutch, French or Hebrew. (5) Country in which the study was conducted: US, UK, Canada, Netherlands, France or Israel. (6) Media type: e-book, computer-assisted instruction, TV/Video. (7) Type of control group/treatment: traditional medium/curriculum, alternative reading treatment, alternative non-reading treatment (e.g., math), pretest used as baseline assessment or no-risk group used as control. (8) Grouping of participants for intervention/treatment: mixed groupings, individual, whole class, small groups. (9) Type of test used to assess learning outcome: standardised test, experimental test. (10) Transfer of training: training to test, transfer of training/curriculum-based. (11) Duration of treatment in weeks. (12) Number of sessions over whole treatment period. (13) Average session duration in minutes. (14) Type of posttest score: raw observed, adjusted (e.g., for pretest score), transformed (e.g., standardised score). (15) Design – experimental: pretest-posttest untreated control group, posttest untreated control group (with gain-scores analysed), pretest-posttest control group with alternative reading treatment, posttest no control group, pretest-posttest no control group. (16) Design – statistical: between classes, within classes, between schools, within schools, counterbalancing within class. 
In addition, publication order was computed by combining the year of publication (2000–2010) and the issue number (1–4 or 6) of the journals into a scale that ranged from 1 to 10. From characteristics (12) and (13) the total time-on-task in minutes was computed. For analysis purposes, this number was divided by 100 and centred around 10. See Table 1 for the coding of all studies.
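The time-on-task transformation can be sketched as follows. Two points are assumptions on our part rather than statements from the coding scheme: that total time-on-task is the product of the number of sessions and the average session duration, and that ‘centred around 10’ means subtracting 10 after dividing by 100, so that 1,000 minutes of treatment maps to zero. The function name and the example values are hypothetical.

```python
def time_on_task_predictor(n_sessions: int, session_minutes: float) -> float:
    """Rescaled time-on-task: total minutes / 100, centred so that 10 becomes 0."""
    total_minutes = n_sessions * session_minutes  # characteristics (12) x (13)
    return total_minutes / 100 - 10               # assumed meaning of 'centred around 10'


# Hypothetical treatments for illustration only.
long_treatment = time_on_task_predictor(n_sessions=40, session_minutes=25)
short_treatment = time_on_task_predictor(n_sessions=20, session_minutes=15)
print(long_treatment, short_treatment)
```

With this scaling, a one-unit change in the predictor corresponds to 100 additional minutes of treatment, and the intercept of a regression on it refers to a treatment of 1,000 minutes rather than to an implausible treatment of zero minutes.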

Table 1 Study coding

Coding of Literacy Outcomes

The selected studies were also coded for type of literacy outcome, according to generally accepted definitions (see Stanovich 2000). However, we have reported elsewhere about whether the various literacy outcomes are differentially affected by the use of multimedia applications (Van Daal and Sandvik 2013). The results are summarised in Appendix 2. In this paper the different literacy outcomes are amalgamated, see below.

Phonological awareness (PA) is defined as the ability to detect, manipulate, or analyse the auditory aspects of spoken language, including the ability to distinguish or segment words, syllables, or phonemes, independent of meaning. Reading comprehension is the ability to comprehend and recall a written story and to make inferences. Both conventional (‘write the word or the sentence’) and invented spelling tasks (for preschoolers) are used to tap spelling ability. The accuracy of reading is defined as the ability to correctly read real words, sentences or text. The accuracy of reading nonwords is defined as the ability to correctly read nonwords or low-frequency words. In some studies, lexical decision-making (decide whether a string of letters is a word or not) was used as a reading accuracy task. Fluency of reading is measured with tasks that involve timed reading of words, sentences or texts. Learning about print concepts is defined as knowledge of print conventions (e.g., reading from left to right and from top to bottom of a page, and going through a book from front to back) and concepts such as book cover, author, and purpose of books. Vocabulary learning comprises being able to use words actively and passively. Letter learning entails knowledge of the names and sounds associated with printed letters, including letter naming fluency, sound discrimination, and letter-sound relations. Rapid automatized naming (RAN) is defined as the ability to rapidly name a sequence of random letters, digits, colours, or objects. Finally, listening comprehension is the ability to comprehend and recall an oral story and to make inferences.

7 Results

7.1 Descriptive Statistics

After reviewing the abstracts or full-text of each article collected, 51 studies met the selection criteria (see Footnote 2). If studies included more than one treatment or more than one experimental group, we treated them as separate studies. Nine studies were later excluded for missing relevant statistics (number of participants, means, standard deviations, or non-aggregated statistics; only five corresponding authors replied positively to our request to supply us with missing statistics). Twenty-eight articles reported on single studies, whereas seven contained multiple treatments/experimental groups. Of the remaining 42 studies, 26 studies included children at risk of reading failure. Of the studies of children at risk, 11 studies reported on interventions with second language learners, most stemming from cultural or language minority groups, six studies included children of low socio-economic status, and nine studies dealt with underachieving readers. Twelve studies on the effects of multimedia interventions in mainstream children were found. The 35 studies that were submitted to the meta-analysis are marked with an asterisk (∗) in the References.

The majority of studies were conducted in English-speaking countries, USA (15), UK (4), and Canada (4). Thirteen studies were conducted in The Netherlands (Dutch), one in France and five in Israel (Hebrew). Two studies dealt with embedded multimedia (TV/Video) in teachers’ reading lessons, two with subtitled video, 14 with e-books, and 24 with Computer-Assisted Instruction. Most of the studies were published in the last 16 months of the period we examined (16), five in 2003 and five in 2006, whilst other publications were evenly spread over the other years. Thirty-eight studies were carried out with participants from preschool and kindergarten. About half of the studies trained a single subskill (18). Seventeen studies used the traditional medium/curriculum, six an alternative reading treatment, eight an alternative non-reading treatment and 11 a pretest baseline or a no-risk group as a control condition. Twenty-seven studies provided an individual treatment. Twenty-three studies used a standardised test to assess the learning outcomes, whereas 19 studies used experimental tests. Twenty-seven studies trained to the test, whereas 13 aimed at transfer or tested targets from the existing curriculum, whilst two studies were unclear about what sort of test was used. Duration of the treatment varied from 3 to 40 weeks, whereas the number of sessions varied from 1 to 74 with average session duration varying from 6 to 90 min. The intensity of the training in terms of total time-on-task varied between 6 and 2220 min. All but three studies analysed raw observed scores, whereas eight studies did not include a control group at all. Finally, three studies compared treatments between classes, 26 within classes, 6 between schools, and 5 within schools. In one study treatments were counterbalanced within classes. 
In total 2525 children participated across all studies, 1201 as experimental subjects (on average 28.6 per study) and 1324 as control subjects (on average 31.5 per study).

7.2 Meta-analysis

For each treatment/experimental group/literacy outcome Hedges’ g, which accounts for small sample sizes, was computed. That is, the difference between the means (of either the experimental group and the control group, or the posttest and the pretest in case there was no control group) was divided by the pooled standard deviation, as different units of measurement were used across studies (see Cornell and Mulrow 1999). Within each study, the effect sizes of multiple literacy outcomes were averaged. In Table 2 multiple and averaged effect sizes of all studies are presented, together with the numbers of participants of the studies.
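The computation of Hedges’ g for two independent groups can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used here: it uses the common approximate small-sample correction factor, the function name and example values are hypothetical, and the pretest-posttest case without a control group is not covered.

```python
import math


def hedges_g(m1, s1, n1, m2, s2, n2):
    """Standardised mean difference with Hedges' small-sample correction."""
    # Pooled standard deviation of the two groups
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    # Uncorrected standardised mean difference
    d = (m1 - m2) / pooled_sd
    # Approximate correction factor J = 1 - 3 / (4 * df - 1), df = n1 + n2 - 2
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)
    return j * d


# Hypothetical groups: experimental (M = 10, SD = 4, n = 20), control (M = 8, SD = 4, n = 20)
print(round(hedges_g(10, 4, 20, 8, 4, 20), 3))
```

The correction factor is always below 1, so g is slightly smaller in absolute value than the uncorrected difference, and the two converge as sample sizes grow.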

Table 2 Aggregated effect sizes and effect sizes for different learning outcomes

However, it was first checked whether literacy outcomes could be averaged without losing information by running a principal component analysis on the results of two studies in our sample that contained the widest range of literacy outcomes. In the Savage et al. (2009) study a total of 11 outcome measures were taken, of which the raw data were made available to us. We combined the two measures for PA (elision and blending) and the two for RAN (objects and letters). All nine remaining outcomes loaded between .466 and .878 on one factor that explained 58.3% of the variance. Steve Hecht ran a similar analysis on the primary data set of the Hecht and Close (2002) study and kindly shared the SPSS output with us. Six assessments explained 58.5% of the total variance and loaded between .631 and .870 on one single factor. Although there were differences between the studies as to instruments used, the age of the participants, and the types of computer programmes, the results of the two principal component analyses, both of which concerned studies examining effects of analytic and synthetic phonics, definitely converge. It was therefore concluded that it is appropriate to average literacy outcomes within studies. Because the factor contains both outcomes that are close to reading (e.g., PA, reading fluency) and literacy (e.g., listening comprehension, vocabulary), we prefer to keep using the term ‘literacy outcomes’. The results of both principal component analyses are presented in Appendix 1.
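How a single principal component yields a proportion of explained variance and a set of loadings can be sketched as follows. The correlation matrix below is hypothetical and is not taken from either study. The sketch also swaps in simple power iteration in place of the SPSS procedure, and it assumes a matrix with positive correlations so that the uniform starting vector converges to the first component.

```python
import math


def first_component(corr, n_iter=200):
    """Largest eigenvalue and eigenvector of a correlation matrix (power iteration)."""
    p = len(corr)
    vector = [1.0 / math.sqrt(p)] * p  # uniform unit-length starting vector
    eigenvalue = 0.0
    for _ in range(n_iter):
        # Multiply the matrix by the current vector and renormalise
        product = [sum(corr[i][j] * vector[j] for j in range(p)) for i in range(p)]
        eigenvalue = math.sqrt(sum(x * x for x in product))
        vector = [x / eigenvalue for x in product]
    return eigenvalue, vector


# Hypothetical correlation matrix for four outcome measures
corr = [
    [1.0, 0.5, 0.5, 0.5],
    [0.5, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
]
eigenvalue, vector = first_component(corr)
explained = eigenvalue / len(corr)  # proportion of total variance explained
loadings = [math.sqrt(eigenvalue) * x for x in vector]  # correlations with the component
print(round(explained, 3), [round(value, 3) for value in loadings])
```

Uniformly high loadings on one component, together with a large share of explained variance, are the pattern that motivates averaging outcomes within a study.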

7.3 Multilevel Modelling

Multilevel modelling (MLwiN) was used to assess the effect that study characteristics have on the effect sizes reported in the studies (Rasbash et al. 2005). For this analysis, studies were regarded as nested under publication year. An average effect size of .645 (SE = .112) was found, whilst 19.53% (.126, SE = .085) of the total variation in effect sizes was explained by the year in which the study was published and 31% (.200, SE = .064) was due to differences between studies. In Table 3 the parameter estimates and the standard errors of the estimates of the final model with only significant effects are presented.

Table 3 Results of multilevel modelling for the effect of study characteristics on effect sizes

Factors that positively affect effect size include total time-on-task (an increase of .153 with every additional 100 min, SE = .051), which was slightly moderated by the number of sessions the participants engaged in (a change of −.035 with every additional session, SE = .010). Effects are .854 larger for preschoolers and kindergartners in comparison to first graders and older children (SE = .248). In comparison with studies in which the traditional medium or curriculum forms the control condition, effect sizes are larger if the control condition consists of an alternative reading intervention (1.139, SE = .357) and if there is a pretest as baseline or if a no-risk group is used as the control group (.758, SE = .266). A study design in which gain scores are analysed gave smaller effect sizes (−.882, SE = .410).
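To illustrate how these fixed-effect estimates combine, the sketch below adds up three of the reported coefficients for a hypothetical change in study characteristics. It assumes a purely additive model and ignores the standard errors. The function and constant names are hypothetical, and only shifts relative to a reference study are computed, so the intercept of the final model is not needed.

```python
# Estimates as reported in the text (see Table 3)
PER_100_MIN = 0.153             # per additional 100 minutes of time-on-task
PER_SESSION = -0.035            # per additional session
PRESCHOOL_KINDERGARTEN = 0.854  # versus first graders and older children


def effect_size_shift(extra_minutes=0.0, extra_sessions=0, young=False):
    """Expected change in effect size relative to a reference study (additive sketch)."""
    shift = PER_100_MIN * extra_minutes / 100 + PER_SESSION * extra_sessions
    if young:
        shift += PRESCHOOL_KINDERGARTEN
    return shift


# Hypothetical comparison: 200 more minutes spread over 5 more sessions
print(round(effect_size_shift(extra_minutes=200, extra_sessions=5), 3))
```

In this hypothetical case the gain from extra time-on-task is partly offset by the larger number of sessions, which is what the moderation by number of sessions amounts to.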

Effect sizes are not influenced if the control condition consists of a non-reading task. Publication year, specificity of the training, type of risk factor, language of instruction, media type, grouping of the participants, type of test used, type of scores analysed, and design (statistical comparison) did not affect the effect sizes of the studies either. Nor did any interaction between significant parameters in the final model.

Finally, we checked whether publication bias affects the current meta-analysis. Publication bias refers to a tendency to publish studies with significant results, thus with sizable effect sizes. The presence of publication bias is assessed by examining the correlation of the effect size of studies with a measure of precision, such as sample size, standard error, or the inverse of the standard error (Cornell and Mulrow 1999). This can be done by visually inspecting the scatterplot of the correlation and by statistically testing the correlation under the assumption that, if publication bias is absent, studies are symmetrically distributed in a funnel shape, with precise studies having less variable and less precise studies having more variable effect sizes. In Fig. 1 effect sizes of the primary studies are plotted against the sample sizes. Visual inspection shows that there are relatively few studies with many (over 80) participants. A funnel-like shape can be recognised in the studies with fewer than 80 participants. For these 30 studies the Kendall rank correlation is −.130 (p = .317). Given the relatively high p-value, it is unlikely, even given a relatively small number of primary studies, that publication bias forms a threat to the validity of this meta-analysis. However, the studies with large numbers of participants clearly show no funnel shape. This may well be due to the fact that very few large-scale studies can be conducted at all, due to financial constraints.
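The rank correlation used in this check can be sketched as follows. The sample sizes and effect sizes below are hypothetical and are not the values of the primary studies. The sketch assumes the tau-b variant, which adjusts for ties; which variant was used here is not stated, and the significance test is omitted.

```python
import math


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b) computed over all pairs of observations."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # pairs tied on both variables are excluded
            if dx == 0:
                tied_x += 1
            elif dy == 0:
                tied_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + tied_x) * (untied + tied_y))


# Hypothetical sample sizes and effect sizes for six studies
sample_sizes = [20, 24, 30, 36, 40, 52]
effect_sizes = [1.2, 0.3, 0.9, 0.6, 0.8, 0.5]
print(round(kendall_tau_b(sample_sizes, effect_sizes), 3))
```

A clearly negative value would indicate that small studies tend to report larger effects, which is the asymmetry that publication bias would produce.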

Fig. 1 Standardised effect sizes (range −1 to 3) plotted against sample sizes (range 0 to 200)

8 Discussion

This systematic review sought to assess how effective multimedia applications were in the 2000–2010 period, in which major developments in hard- and software took place. In addition, this study examined which characteristics of the primary studies are positively related to the effect sizes obtained. The hypothesis that multimedia applications would be more effective than before is supported, as a medium overall effect of .65 was obtained, which is substantially larger than reported by Blok et al. (2002) in their review of CAI and by Zucker et al. (2009), who reviewed the efficiency of e-books. Moreover, this study shows that effects can be and have been replicated in non-English speaking countries, though on a small scale. It also complements a previous report (Van Daal and Sandvik 2013), in which the effect of multimedia applications on specific literacy outcomes was evaluated.

Nineteen percent of variation in effect sizes in the current study can be ascribed to the year in which the study was published, whilst 31% reflected overall differences in effect size between the primary studies. Note that effect sizes vary between studies according to year of publication, but there is no significant association between effect size and year of publication. Time-on-task and being a preschooler or kindergartner were positively related to effect size obtained in the primary study. Three aspects of the design affected the effect size of studies. Larger effect sizes were obtained in studies that compared interventions/treatments with an alternative reading treatment rather than with a traditional medium or curriculum. Also, effect sizes were larger for studies that used a pretest-posttest design without a control condition or took a no-risk group for comparison. Smaller effects were obtained if gain scores were analysed.

The largest effect size, 2.25 on a comprehension measure, was obtained in a study by De Jong and Bus (2004), which compared effects of electronic books and being read aloud by parents with a counterbalanced design. In 10 other studies aggregated effect sizes were greater than 1. In most studies with multiple literacy outcomes a considerable variation in effect sizes across literacy outcomes was found. This is probably due to different contents and different forms of practice. For example, in the Comaskey et al. (2009) study both the analytic and the synthetic phonics training was very effective for letter learning and PA but less so for word and nonword reading, whilst in the Hecht and Close (2002) study a combination of analytic and synthetic phonics training was more effective for word reading and PA but not for letter learning.

In contrast to these two studies, experimental groups were often compared with a control group that did ‘nothing’, which may produce inflated effect sizes. A more telling comparison would be to look at the so-called ‘added value’ of multimedia applications. Several of the studies included in this review offer such a possibility in addition to the aforementioned studies that compared multimedia interventions with regular classroom instruction. For example, in the study by Chambers et al. (2008) computer-assisted tutoring was compared with embedded multimedia. Effect sizes were larger for embedded multimedia than for computer-assisted tutoring with respect to comprehension (.56), word reading (.75), nonword reading (.46), and letter learning (.47). Another way to learn more about how multimedia may work is to include different kinds of experimental groups, as Verhallen et al. (2006) did. The effect size for the experimental group that was presented video pictures was larger than the effect size of the experimental group that was presented static pictures. In addition, the current study clearly showed that larger effect sizes were obtained in studies that compared the experimental group with an alternative reading treatment group, that used the pretest as a baseline when there was no control group, or that used a no-risk group as a control group. The latter should be positively interpreted: if at-risk children can catch up with their not-at-risk peers with the help of multimedia, Matthew effects can be turned around (Stanovich 2000), which is also supported by the finding that effects were larger in preschoolers and kindergartners compared to older children. In other words, the earlier you intervene, the greater the chances of a positive response to intervention. In addition, population (at-risk, not at-risk) did not matter; it can thus be inferred that multimedia applications were equally effective in both populations: at-risk children did not get further behind their peers.

Over the years the methodological quality of the primary studies has definitely increased. Whereas Blok et al. (2002) observed that only 25 of 75 studies had a rigorous design, that is, included a control group and did not lack essential statistics, the current study includes 35 (out of 42 studies) with a control group. In most studies possible differences at pretest between experimental and control groups were accounted for. Also, the use of standardised tests has increased, and unreliable assessments, such as the use of gain or difference scores, have become rare. Nevertheless, three studies that analysed gain scores were included in the current meta-analysis and yielded significantly smaller effect sizes. This is due to a relatively large error variance of such compound scores and a reduction of the true variance, which leads to an underestimation of the effect size (for a discussion of the use of compound scores, see Adèr et al. 2008, p. 261).

We were able to demonstrate that time-on-task between studies makes a difference. This has also been a topic of investigation within studies. For example, Hecht and Close (2002) found that time spent on using the Waterford Early Reading Programme uniquely contributed to effects on four out of six literacy outcomes. A similar result was obtained by Segers and Verhoeven (2005), who found that the more time spent on playing a computer game that promoted phonological skills in native Dutch and immigrant children, the more was learnt. The moderating effect that was found for the number of sessions across primary studies in our meta-analysis could be due to regression to the mean, that is, with very many sessions an asymptote of the effectiveness is reached.

Consistent with Blok et al. (2002), no influence of study characteristics such as year of publication, design (statistical), population (being at risk or not at risk), and specificity of training (one or more subskills trained) was found. Cheung and Slavin (2012) found no effect of year of publication either. It is remarkable that year of publication did not affect the effect sizes, as one would expect that researchers gain insights from previous studies and build more effective multimedia. On the other hand, effect sizes are based on mean differences and variances. This means that some children profit more from interventions with multimedia than others. It could well be that multimedia interventions across different years of publication are beneficial for different subgroups of children without showing an overall increase of effect size. Design-statistical characteristics of the study (between classes, within classes, between schools, within schools, and counterbalancing within class) most probably did not affect the effect sizes, because when weaker designs such as the between comparisons were used, it was usually checked whether valid conclusions could still be drawn, for example, whether differences between groups at pretest could be excluded as a possible confounder.

Study characteristics that Blok et al. (2002) did not examine, and that made no difference in our review, include media type (e-book, CAI, TV/Video), country, language, grouping of students (mixed, individual, whole class, small groups), transfer of training (train to test, transfer of training/curriculum-based), and type of scores analysed (raw observed, adjusted, transformed). Unfortunately, if these effects existed at all, the design of the current study would not have had sufficient power to detect them.

8.1 Comparisons with Other Meta-analyses: CAI

It seems useful to compare our results with the results from other meta-analyses. As far as we know, two recent studies are relevant here: Hattie (2008), who synthesised over 800 existing meta-analyses, which encompassed 52,637 original studies, and Cheung and Slavin (2012), who focussed on the impact of technology in literacy learning, synthesizing 84 studies. Hattie (2008) synthesized meta-analyses of CAI, of which only three original studies focussed on literacy learning, including the aforementioned study by Blok et al. (2002) and two others, with respective effect sizes of .19, .27 (published in 2000) and .31 (published in 1995). Cheung and Slavin (2012), using very stringent inclusion criteria, that is, only studies with print-related outcomes (no phonological awareness or listening comprehension outcomes were included), found a 95% confidence interval for the effect size that ranged from .12 to .21. Cheung and Slavin (2012) also reported a relatively larger effect size for comprehensive models of instruction, that is, using CAI along with other non-computer activities supported by teachers.

Where does the difference between the current study’s results and the results obtained by Blok et al. (2002) and Cheung and Slavin (2012) come from? We think that, whereas our inclusion criteria were similar to the ones used by Blok et al. (2002), there are disparities with the ones used by Cheung and Slavin (2012). Firstly, Cheung and Slavin (2012) selected 84 studies from the 1980s onwards, of which 47 were published in the 2000–2010 period, including not only journal articles as we did (they selected 15), but also unpublished doctoral theses (11), web publications (4), and reports (17). All studies they included were American, of which 4 were included in our meta-analysis. On the other hand, we included 11 more American studies and 21 studies conducted outside the US. In addition, the selection of studies by Cheung and Slavin (2012) was narrower with respect to literacy outcomes as they selected only print-related outcomes, but much wider with respect to the context in which the multimedia applications were used and the age of the participants. Thus, it could well be that selecting studies from the 1980s onwards, conducted in a wider educational context and with older participants in the original studies has led to finding relatively lower effect sizes of multimedia applications. Please note that we found that older participants profit relatively less from multimedia applications.

8.2 Implementation Variables in CAI

In intervention research, pilot and efficacy studies are first run in the lab and under controlled circumstances in schools. Then, a manualised intervention is implemented in real-world settings and it is evaluated whether intervention outcomes are generalizable across various settings and participants (Kaderavek and Justice 2010). Pilot studies and efficacy research are carried out in a controlled setting to assess the causal relation between an intervention and an outcome, for example, whether a phonics programme influences phonological awareness. Maximum control is usually achieved by random allocation of participants to the experimental group, which receives the treatment, or to a comparison group, which receives an alternative treatment and/or a control group, which engages in ‘business as usual’, combined with pre- and post-testing. Efficacy research results in identification of the ‘active ingredients’ of an intervention; it answers the questions of why the intervention produces positive outcomes, of how and why an intervention is effective and of why it works better than other interventions (Longabaugh et al. 2005). In addition, efficacy research informs the effectiveness of the intervention (characteristics) in terms of effect sizes.

Finally, through effectiveness research it is examined how effective an intervention is as implemented (Hulleman and Cordray 2009), that is, how treatment effectiveness reduction can be countered when moving from the lab to the field. The reduction in treatment effectiveness is examined by studying treatment fidelity. Treatment fidelity is defined as the degree to which field implementation of an intervention corresponds to the prototype implementation (Hulleman and Cordray 2009). There are two sources of treatment infidelity which decrease treatment strength: (1) in the experimental condition the treatment may not be implemented as prescribed (the teacher does not follow the manual or missed professional development training sessions), so that the intervention becomes less effective, and (2) in the control condition a teacher may add components from the experimental treatment or an alternative treatment, so that the control becomes more effective than it otherwise would have been. In sum, evidence-based practice (EBP) is based on results of both efficacy and effectiveness research.

Thus, as we already mentioned, for multimedia to be successful in the classroom or at home, the use of computers should be integrated with other teaching/learning activities (Cheung and Slavin 2012). Archer et al. (2014) therefore examined the moderating effects of (1) the quality of training and support teachers received for implementing a CAI intervention, and (2) the degree of implementation fidelity by combining three comparable meta-analyses. These meta-analyses comprised original US studies conducted between 1990 and 2007. The overall effect size was .18, whereas there was an added effect size of .58 for training and support, a result that corroborates the finding of relatively larger effect sizes for comprehensive models of instruction (Cheung and Slavin 2012). However, no effect of treatment fidelity was found.

8.3 Comparisons with Other Meta-analyses: e-Books

For a very comprehensive systematic review of storybooks, see Bus et al. (2015). As far as we know, there is one meta-analysis specifically on the effectiveness of multimedia and interactive features in storybooks (Takacs et al. 2015). They analysed 57 effects on 5 outcomes from publications between 1980 and 2014 with 2147 participants, aged between 3 and 10 years. Effects were .17 (p = .04), .20 (p = .04), −.08 (n.s.), .16 (n.s.) and .26 (n.s.) for story comprehension, expressive vocabulary, receptive vocabulary, code-related literacy skills and engagement and child-initiated communication during reading, respectively. In addition, Takacs et al. (2015) found that animated pictures, music and sound effects were beneficial, whereas hotspots, games, and dictionaries were distracting. It seems difficult to compare effect sizes from this study with ours, as Takacs et al. (2015) also included TV, video and more, whereas we seem to have included interventions based on the very first lab studies, which had been tested in the field by researchers.

8.4 Limitations

As with all research, this study has its limitations. We discuss one that pertains to meta-analyses in general: whether causal conclusions can be drawn (see Footnote 3). For example, it would seem sensible to conclude from this study that especially young children should expand the time they spend on learning with multimedia applications, because it was found that effect sizes were relatively larger for younger children in comparison to older children and for studies in which children spent more time on using the multimedia applications in comparison to studies in which less time was spent. This is however not necessarily true, because we don’t know how exactly the multimedia are used. We suggest an examination of how multimedia applications are actually used. Generally, smart phones and tablets offer opportunities for more interactivity through touch screens that can be used by even very young children. However, evaluating apps is a challenging task for the following reasons. (1) There are very many apps available (see Footnote 4), which makes it difficult to choose among them, not only for teachers and parents, but also for researchers. (2) It is unlikely that any commercially available app is fully adaptive with respect to instruction and testing a (literacy) learner, because a sizable item bank is usually lacking. It is therefore unlikely that many primary studies with adaptive interventions can be run, let alone that a systematic review of them can be conducted. A future methodology that exploits advantages of smart phones and tablets (Dufau et al. 2011) could entail designing an app based on a proven adaptive learning system available to very many children and tagging the devices over the Internet in order to collect data.

9 Conclusion

Multimedia applications evaluated over the 2000–2010 period have proven to be effective, especially when delivered to preschoolers and kindergartners, and if they are used intensively.

We expect that CAI will continue to be used in schools and homes. It also seems unlikely that tablets and smart phones equipped with touch screen technology will soon be replaced. However, as we indicated, it will be hard to conduct evaluation studies for these hand-held systems. Nevertheless, this should be done.

An interesting topic for future research is, in our opinion, to look at when children are ready to use educational apps and games on smart phones and tablets. Looking at school readiness, Duncan et al. (2007) found that school-entry maths, reading and attention predicted later achievement best. Others, for example, Diamond (2012) and Nicolson (2016) have suggested that children are ready for learning maths and reading if their attentional skills are well developed. Moreover, Diamond (2013) found that attentional skills could be trained. Going back to educational apps and games, it would therefore be worthwhile to research how children can best be trained to use educational apps, thereby avoiding distraction by ‘bells and whistles’ (Bus et al. 2015; Takacs et al. 2015).