In the fall of 2017, 15.9% of kindergarten (K) students in the USA were English Learners (ELs; Hussar et al., 2020). In the USA, the proportion of ELs with reading difficulties (RD) is significantly greater than the proportion of English-monolingual (EM) students with RD. This is due to a combination of factors, including the significant contribution of English language vocabulary, morphological awareness, and syntactic knowledge to English language reading comprehension (e.g., Adlof & Catts, 2015; Gottardo et al., 2018; Kieffer & Lesaux, 2012; Lipka & Siegel, 2012; Tong et al., 2014). Economic disadvantage is another contributing factor. In the USA, EL students are disproportionately subject to the stressors and deprivation associated with economic disadvantage. In particular, Spanish-speaking ELs in the USA are more likely than EM students to have family incomes below or near poverty levels (Fry & Gonzales, 2008; Hernandez et al., 2008) and parents with relatively low levels of education and literacy (Capps, 2005; Hernandez et al., 2008). They are also more likely to be enrolled in under-resourced, low-performing schools (Capps, 2005; Cosentino de Cohen et al., 2005). As a result, many ELs in the USA have fewer opportunities to access texts and educational experiences that contribute to academic English literacy acquisition.

Reading difficulties can have serious consequences. Longitudinal research demonstrates that children who struggle with reading in the primary grades are very likely to continue to struggle through high school (e.g., Adlof et al., 2010; Boscardin et al., 2008; Francis et al., 1996; McNamara et al., 2011) and are at higher risk of school dropout, incarceration, and unemployment (Daniel et al., 2006; Greenberg et al., 2007; Hernandez, 2011). Fortunately, there is evidence that early reading interventions are effective in preventing this cascade of negative outcomes. The primary grades (K–3) represent a unique window of opportunity for early reading intervention, with the effects of reading intervention tending to be larger when they are delivered earlier rather than later (Al Otaiba et al., 2009; Lovett et al., 2017; Wanzek et al., 2018).

Much is known about effective early literacy instruction for EM students with or at risk for RD in the primary grades. More than three decades of research demonstrate that interventions with a primary focus on explicit, systematic instruction in foundational word reading skills (including alphabetic principle, phonological awareness [PA], phonics, and reading fluency) and some instruction focused on meaning (both on word meaning and on local and global comprehension of connected text) are associated with positive effects on measures of reading achievement (e.g., Gersten et al., 2020; Wanzek et al., 2016, 2018) for this population. Such multi-component reading instruction for students in the primary grades has the potential to reduce not only the severity but even the incidence of RD (Fletcher et al., 2018; Torgesen, 2002). Al Otaiba et al. (2009) reviewed intervention research evidence demonstrating that effective core (i.e., tier 1) reading instruction can reduce the number of children with below-level reading achievement to around 5–7%, and supplemental tutoring can reduce these numbers further to 0.2–2%.

Although there is much research establishing an understanding of effective early reading instruction for EM students, less is known about the degree to which this approach is effective for primary grade ELs. ELs may encounter challenges in acquiring proficient English reading skills for a variety of reasons. The degree to which decoding and language comprehension skills in the early elementary grades predict reading comprehension in the later elementary grades appears to be very similar for both EL and EM students (Kieffer & Lesaux, 2012; Lesaux et al., 2007). However, research suggests that, at least by the time students progress to the upper elementary grades, reading comprehension problems in ELs are, on average, more strongly predicted by linguistic comprehension difficulties than by decoding difficulties (Babayiğit, 2014; Cho et al., 2019; Grant et al., 2011; Proctor et al., 2005), and the association between language comprehension difficulties and reading comprehension difficulties is stronger for EL students than for EM students (Babayiğit, 2014; Cho et al., 2019). Based on this research, it stands to reason that approaches to reading intervention for primary grade ELs might be more effective in the long term if they target oral language proficiency to a greater degree than do literacy intervention programs designed for primary grade EM students. This reasoning is reflected in the Institute of Education Sciences (IES) practice guide on teaching academic content and literacy to ELs (Baker et al., 2014), which calls for the integration of vocabulary and oral language instruction into reading instruction. Therefore, it is surprising that recent systematic reviews have either not investigated reading intervention effects on oral language outcomes or have identified few research studies investigating the effects of interventions that focus on building oral language comprehension (Richards-Tutor et al., 2016).

Recent Systematic Reviews on Reading Instruction for ELs

Two recent meta-analyses (Ludwig et al., 2019; Richards-Tutor et al., 2016) and one best-evidence synthesis (Cheung & Slavin, 2012) reviewed reading intervention research for ELs, including ELs in Grades K–3. In the best-evidence synthesis conducted by Cheung and Slavin (2012), the authors identified 18 studies with the majority of participants in Grades K–3. While not all effect sizes reported were positive (ES range on combined reading outcomes, −0.38 to 0.68), there were promising findings for a number of programs, including Direct Instruction (DI; Adams & Engelmann, 1996), English Language and Literacy Acquisition (ELLA; Irby et al., 2010), and Success for All (Slavin et al., 2009), as well as for instructional approaches that incorporated cooperative learning, small-group instruction, and one-to-one (1:1) interventions.

The meta-analysis published by Richards-Tutor et al. (2016) reported consistently positive, statistically significant effects on foundational word reading outcomes for the seven included studies that delivered early reading interventions to EL participants with or at risk for reading disabilities (RD) in Grades K–1 (ES range, 0.58–0.91); however, they noted mixed effects of interventions in the two studies with participants in Grades 2–3 (ES range, −0.13–1.00). Richards-Tutor et al. identified three studies with participants in Grades K–3 that reported effects on vocabulary/oral language outcomes, not all of which were positive (ES range, −0.17–0.78); five studies with participants in Grades K–3 reported effects on reading comprehension outcomes, which were positive but varied in effect size (ES range, 0.06–1.00). The last meta-analysis, conducted by Ludwig et al. (2019), identified 20 studies with students in Grades K–3 and found that reading interventions had positive effects overall, with mean effects of 1.22, 0.80, and 0.50 on reading accuracy, fluency, and comprehension outcomes, respectively.

All three previous reviews articulated criteria for inclusion that systematically excluded studies that were included in the present review. For example, the best-evidence synthesis conducted by Cheung and Slavin (2012) only included studies evaluating the effects of English reading interventions on reading performance for Spanish-speaking ELs and excluded studies that did not (a) measure reading comprehension, (b) have a minimum of a 12-week duration, and/or (c) employ a rigorous research design (i.e., random assignment of students to conditions or matching/statistical adjustment for pretest differences). The mean effect of all studies K–6 in Cheung and Slavin (2012) was 0.23, although the authors did not further disaggregate findings (e.g., grade, outcome type), except by program type. The present synthesis differs from Cheung and Slavin in that we did not require EL participants to be Spanish speakers; we also included studies with a range of reading outcomes, including foundational word reading skills (e.g., PA, phonics) outcomes and/or language comprehension (e.g., vocabulary knowledge, listening comprehension) outcomes that are known to be robust predictors of reading comprehension.

The meta-analysis conducted by Richards-Tutor et al. (2016) only included studies for which EL participants were identified as having or being at risk for RD. We did not impose this requirement because many ELs who are at risk for RD are not identified as such in early grades; it is difficult to guarantee that young EL students have been properly screened and identified for risk for reading difficulties (Hutchinson et al., 2004). When EL students are assessed only in English, their difficulties with language acquisition and early literacy skills may be overestimated and attributed to reading difficulties when they are instead explained by students’ still-developing English language proficiency (Sullivan, 2011). Conversely, there is some evidence that ELs may be identified for special education later than their EM peers and thus may not receive appropriate educational supports in their first years of school (O’Connor et al., 2013; Samson & Lesaux, 2009). Due to the unreliable methods for determining and interpreting risk status, we noted when studies only included students with RD (see Table 1) but did not exclude studies when participants were not identified as having or being at risk for RD.

Table 1 Study characteristics

In contrast to Ludwig et al. (2019), we included studies that set out to measure reading outcomes beyond reading accuracy, reading fluency, and reading comprehension. Our decision to analyze effects on additional outcomes was based on previous research (Babayiğit, 2014; Cho et al., 2019; Grant et al., 2011; Proctor et al., 2005) and an IES practice guide (Baker et al., 2014) that suggest the importance of building oral language/vocabulary knowledge and listening comprehension in EL students prior to the upper elementary grades. That said, our review was more exclusive than Ludwig et al. (2019) in one way: like Cheung and Slavin (2012) and Richards-Tutor et al. (2016), who established inclusion criteria that excluded studies with less stringent research designs/methods, we set out to include in this review only studies that met best-evidence synthesis requirements (Cheung & Slavin, 2005; Slavin et al., 2008; Slavin & Lake, 2008), enabling us to identify and describe in detail interventions that were demonstrated to be effective by rigorous research.

Purpose and Research Questions

This best-evidence synthesis aims to provide researchers and practitioners with guidance in understanding current K–3 EL reading interventions evaluated by means of rigorous, high-quality research studies. We followed best-evidence synthesis procedures described by Slavin (1986) and refined in subsequent studies by Slavin and colleagues (e.g., Cheung & Slavin, 2005; Slavin et al., 2008; Slavin & Lake, 2008). To be included in this best-evidence synthesis, studies needed to have (a) a sample size of at least 15 participants per condition, (b) a duration of at least 12 weeks such that reading instructional programs could be considered practical/have the potential to affect reading outcomes, and (c) either random assignment of students or a quasi-experimental research design that used matching to ensure equivalence between groups at baseline/post-hoc statistical adjustments to account for pretest differences. After identifying studies that met these best-evidence criteria, we described the programs in detail. We analyzed the relations between program components and effects achieved by these programs in order to provide meaningful guidance for primary grade teachers of students who are ELs. Finally, to further contextualize recommendations that emerge from this best-evidence synthesis, we identified the quality of the counterfactual condition (i.e., alignment of instruction to evidence-based practices). We did this in response to calls for better measurement and description of the counterfactual and its inclusion of evidence-based reading practices, which enables the field to achieve a better understanding of the relative effectiveness of interventions (Lemons et al., 2014; Scammacca et al., 2016).

We believed this best-evidence synthesis was warranted for the following reasons. First, as described above, too little is known about the content of effective instruction for primary grade ELs, and, in particular, about the ways reading interventions can improve EL students’ language comprehension. Second and relatedly, within the last twenty years, there has yet to be a best-evidence synthesis that has investigated the effects of reading interventions delivered to ELs on a diverse range of reading outcomes (i.e., not just on reading comprehension outcomes, which was the only type of outcome evaluated by Cheung & Slavin, 2012) and including interventions that had a duration of less than one year (i.e., less than the duration of studies included by Cheung and Slavin (2005)). We aimed to build on findings reported in prior research (Cheung & Slavin, 2012; Ludwig et al., 2019; Richards-Tutor et al., 2016) by conducting this best-evidence synthesis (Cheung & Slavin, 2005; Slavin et al., 2008; Slavin & Lake, 2008) and describing the effects of rigorously researched reading instruction for Grade K–3 ELs. We asked:

  1. What is the overall mean effect of reading programs identified in this best-evidence synthesis, and what is the mean effect of best-evidence programs on disaggregated word reading/foundational word reading skill (e.g., PA, phonics, nonword reading, word reading), passage reading fluency, oral language (e.g., vocabulary knowledge, listening comprehension), and reading comprehension outcomes?

  2. Which reading programs demonstrated the largest overall reading effect sizes?

  3. Which programs demonstrated the largest effects on disaggregated word reading/foundational word reading skill, passage reading fluency, oral language, and reading comprehension?

  4. What are the characteristics of the programs that demonstrated the largest reading effect sizes within each outcome domain (i.e., word reading/foundational word reading skill, passage reading fluency, oral language, reading comprehension) and overall?

Method

Search Strategy and Inclusion Criteria

To identify studies for inclusion in this meta-analysis, we searched peer-reviewed research articles published between January 1, 2000, and December 31, 2020, to represent recent research on reading instruction for K–3 ELs. The search procedure followed PRISMA guidelines (Liberati et al., 2009) and used a three-step process, with each step independently completed by a minimum of two members of the research team. First, we searched the ERIC and PsycINFO electronic databases with a combination of the following terms: (read* or “phonological awareness” or “phonemic awareness” or vocab* or fluen* or decod* or comprehend* or lit* or “language arts”) and (“English language learners” or “English second language” or “English as a second language” or “second language learn*” or lingual* or “language minority”) and (“elementary education” or “elementary school” or “primary education” or kinder* or “grade 1” or “grade 2” or “grade 3”). Studies were included if they met the following criteria:

  1. At least 90% of participants were identified as ELs. Various labels (e.g., limited English proficient, language minority) were accepted, as long as labels conveyed that English was not the first language for participating students. Studies with fewer than 90% EL participants were included if disaggregated data were provided for EL students (Hall et al., 2017); only disaggregated data were used to calculate effect sizes in these cases. We queried authors when studies reported disaggregated findings for EL students but did not include enough data to calculate an effect size.

  2. Participants were in Grades K–3 (ages 5–9), or the sample mean age was within the targeted range.

  3. Reading instruction focused on PA, phonics/word reading, vocabulary, fluency, comprehension, or a combination of these domains.

  4. The primary language of instruction was English (i.e., greater than 50% of instruction was delivered in English). However, studies were included when first language supports were incorporated into English language instruction.

  5. The setting was school-based, including school-based after-school tutoring.

  6. The study had at least one calculable effect size on a reading or oral language outcome.

  7. The study used an experimental or quasi-experimental group research design with assignment of participants to at least one treatment and one comparison condition (defined as a no-treatment control condition with or without exposure to materials or a weaker instructional condition that mirrored classroom practice). For studies that did not use random assignment, researchers employed a matching procedure and/or statistical adjustments for pretest differences when appropriate. In this latter case, effect sizes without statistical adjustments were excluded. Historical comparison conditions were not included (e.g., Leafstedt et al., 2004).

  8. Studies had a minimum duration of 12 weeks and a minimum of 15 students per condition.

  9. Studies were conducted in the USA and were either unpublished dissertations, published in a peer-reviewed journal, or published as a technical report.
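The quantitative parts of these criteria can be read as a simple screening rule. The following Python sketch is illustrative only: the `Study` record and its field names are hypothetical, each field stands in for a judgment that coders made from the full text, and the thresholds (90% ELs, more than 50% English instruction, 12 weeks, 15 students per condition) are taken from the criteria listed above.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical coding record; field names are illustrative assumptions."""
    pct_el: float                  # percentage of participants identified as ELs
    el_disaggregated: bool         # EL-only data reported separately
    grades_k3: bool                # participants in Grades K-3 (or mean age 5-9)
    reading_focus: bool            # PA, phonics, vocabulary, fluency, comprehension
    pct_english_instruction: float
    school_based: bool
    has_effect_size: bool          # at least one calculable effect size
    randomized: bool
    pretest_adjusted: bool         # matching and/or statistical adjustment
    weeks: int
    min_n_per_condition: int
    in_usa: bool


def meets_criteria(s: Study) -> bool:
    """Return True when a study passes every screening threshold."""
    el_ok = s.pct_el >= 90 or s.el_disaggregated
    design_ok = s.randomized or s.pretest_adjusted
    return (
        el_ok
        and s.grades_k3
        and s.reading_focus
        and s.pct_english_instruction > 50
        and s.school_based
        and s.has_effect_size
        and design_ok
        and s.weeks >= 12
        and s.min_n_per_condition >= 15
        and s.in_usa
    )
```

A rule of this kind only summarizes the thresholds; decisions such as whether a comparison condition was eligible still required reading the full text.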

Our search yielded 4441 articles. Of these, 4339 articles were excluded on the basis of disqualifying information reported in article abstracts or because they were duplicates. We reviewed the full text of 102 articles. Of these, 92 articles were excluded because they were conducted outside the USA (k = 9); did not include or did not disaggregate findings for K–3 students (k = 5); had an ineligible comparison condition (e.g., comparison condition did not include 90% ELs; k = 40); had fewer than 90% EL participants and findings were not disaggregated for ELs (k = 4); primarily delivered instruction in a language other than English (k = 5); presented follow-up data to an already-published article (k = 7); did not deliver a reading intervention (k = 6); represented an unpublished dissertation which was later published as part of a published article that was reviewed (and excluded) or included in this meta-analysis (k = 3); delivered an intervention that was less than 12 weeks (k = 6); included an inadequate sample size (k = 5); or did not use random assignment and lacked student matching with appropriate adjustments for pretest differences (k = 2).
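The screening counts reported above can be checked with simple arithmetic. The short Python sketch below uses only the numbers stated in the text; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Counts reported in the text.
identified = 4441
excluded_at_abstract = 4339  # abstract-level exclusions and duplicates
full_text_reviewed = identified - excluded_at_abstract

# Full-text exclusion reasons (k values as reported); keys are shorthand labels.
full_text_exclusions = {
    "outside_usa": 9,
    "no_k3_findings": 5,
    "ineligible_comparison": 40,
    "under_90pct_el_not_disaggregated": 4,
    "non_english_instruction": 5,
    "follow_up_data": 7,
    "no_reading_intervention": 6,
    "dissertation_later_published": 3,
    "under_12_weeks": 6,
    "inadequate_sample_size": 5,
    "no_matching_or_adjustment": 2,
}

total_excluded = sum(full_text_exclusions.values())
remaining = full_text_reviewed - total_excluded

print(full_text_reviewed, total_excluded, remaining)
```

The exclusion reasons sum to the 92 full-text exclusions reported, which leaves ten articles.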

After the initial search, we completed an ancestral search of the reference sections of prior relevant literature reviews (Cheung & Slavin, 2012; Ludwig et al., 2019; Richards-Tutor et al., 2016) as well as of studies meeting the inclusion criteria of the current meta-analysis. Finally, we completed a hand search of published articles in the following journals, all of which either published articles included in this synthesis or tend to publish relevant articles related to reading and/or ELs: American Educational Research Journal, Bilingual Research Journal, Exceptional Children, Journal of Behavioral Education, Journal of Educational Psychology, Journal of Literacy Research, Learning Disabilities Research and Practice, Learning Disability Quarterly, Reading and Writing Quarterly, Reading Research Quarterly, Scientific Studies of Reading, and The Elementary School Journal. This hand search began with articles published in 2015 and concluded with articles published by December 31, 2020. No further articles meeting inclusion criteria were identified. A total of ten published articles (Baker et al., 2016; Ehri et al., 2007; Foorman et al., 2018; McMaster et al., 2008; Nelson et al., 2011; Tong, Lara-Alecio, et al., 2008b; Vadasy & Sanders, 2010, 2011; Vaughn, Cirino, et al., 2006; Vaughn, Mathes, et al., 2006) met inclusion criteria and were coded.

Coding Procedures

Code sheets captured information about study and participant characteristics, study outcomes, and WWC determinants of study quality (IES, 2020; Vaughn et al., 2014). All coders were trained by the first author. Prior to coding, all coders independently coded articles until they obtained a minimum of 90% reliability with the first author in each code sheet section (e.g., study characteristics, study outcomes). Once coding began, all articles were independently double coded by the first author and one additional coder. When coding discrepancies arose, coders reached consensus through re-reading and discussion.
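The text does not state how coder reliability was computed. As an illustration only, the Python sketch below assumes simple percent agreement between a trainee's codes and the first author's codes within one code sheet section; the function name and the example codes are hypothetical.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of items on which two coders assigned the same code.

    Assumes simple item-by-item agreement; other indices (e.g., kappa)
    would give different values.
    """
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must code the same number of items.")
    if not coder_a:
        raise ValueError("At least one coded item is required.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Hypothetical codes for ten items in one code sheet section.
first_author = ["RCT", "K", "1:1", "high", "yes", "no", "12", "EL", "PA", "yes"]
trainee = ["RCT", "K", "1:1", "moderate", "yes", "no", "12", "EL", "PA", "yes"]

agreement = percent_agreement(first_author, trainee)
meets_threshold = agreement >= 90.0  # 90% minimum described in the text
```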

One set of articles warrants additional explanation as to its coding (Tong, Irby, et al., 2008a; Tong, Lara-Alecio, et al., 2008b). Tong, Lara-Alecio, et al. (2008b) and Tong, Irby, et al. (2008a) reported findings from years two and three of the same longitudinal study. We opted to include only the year 2 study (Tong, Lara-Alecio, et al., 2008b), because it is most similar to the other studies in this review as no other study exceeded a duration of one year. In Tong, Lara-Alecio, et al. (2008b), we compared the Structured English Immersion-Enhanced/Experimental group to the Structured English Immersion-Typical/Control group. We did not investigate results for the Transitional Bilingual Education-Enhanced/Experimental or the Transitional Bilingual Education-Typical/Control groups, because the primary language of instruction for those groups was Spanish.

Coding of Study Characteristics and Outcomes

In coding study characteristics and outcomes, we used processes similar to those used in previous meta-analyses of reading intervention studies (Wanzek et al., 2016, 2018). Similar to Wanzek et al. (2016), we coded for intervention type by differentiating between studies of interventions that focused on foundational skills, those that were meaning-focused, and those that were multi-component. Foundational skill interventions focused on PA, phonics, and word/nonword reading accuracy or fluency with or without a simultaneous focus on building passage reading fluency. Meaning-based interventions focused on vocabulary or comprehension (i.e., vocabulary knowledge, reading comprehension, listening comprehension). Multi-component interventions included a combination of foundational skills and meaning-based components. We used similar categories to describe reading outcome domains, differentiating between (a) word reading/foundational word reading skills measures (we refer to this category as “foundational skills” measures going forward, and it includes measures of letter naming, PA, phonics, and word/non-word reading accuracy or fluency), (b) passage reading fluency measures, (c) oral language comprehension (i.e., vocabulary, listening comprehension), and (d) reading comprehension outcomes.

We coded comparison conditions based on the extent to which the counterfactual condition utilized evidence-based instruction. Based on a modified coding scheme designed by Scammacca et al. (2016), we developed three categories of counterfactual quality (i.e., high, moderate, low) and a not defined category. The counterfactual condition was considered high quality if all students received an evidence-based, commercially available reading curriculum or an instructional program that was described as evidence-based instruction in phonemic awareness, phonics, fluency, vocabulary, comprehension (National Reading Panel, 2000); moderate quality if some students received a research-based, commercially available reading curriculum or explicit instruction in phonemic awareness, phonics, fluency, vocabulary, comprehension; and low quality if no students received such a curriculum or instructional program. The counterfactual condition was coded attention control if it constituted a condition defined as a researcher-delivered weaker treatment in a program other than reading (e.g., math, social studies). Finally, the counterfactual condition was considered not defined if the counterfactual was not described in sufficient detail to be included in one of the categories outlined above. To the best of our knowledge, this is the first systematic review (i.e., best-evidence synthesis, meta-analysis) on EL reading instruction to code for counterfactual quality.

Effect Size Calculations

We calculated standardized mean differences between intervention and control groups estimated with Hedges’ g (Hedges, 1981), using the reported posttest mean and standard deviation (SD) by conditions.
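For readers less familiar with this statistic, the following Python sketch shows the standard Hedges' g computation from posttest means, SDs, and sample sizes. It is a minimal illustration rather than the code used in the analyses; the function name and the example values are hypothetical.

```python
import math


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, variance of g) for a treatment-control posttest contrast."""
    df = n_t + n_c - 2
    # Pooled posttest standard deviation.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled
    # Small-sample correction factor (Hedges, 1981), common approximation.
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Large-sample variance of d, scaled by the squared correction factor.
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    var_g = j**2 * var_d
    return g, var_g


# Hypothetical example: 30 students per condition, 5-point posttest difference.
g, var_g = hedges_g(mean_t=105.0, sd_t=15.0, n_t=30,
                    mean_c=100.0, sd_c=15.0, n_c=30)
```

With small samples the correction factor pulls g slightly toward zero relative to the uncorrected standardized mean difference.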

For each study, we calculated mean effect sizes separately for different reading outcomes and for the combined outcome using a random effects model with a restricted maximum likelihood (REML) estimator for variance estimation. Two programs (Proactive Reading and Sound Partners) were evaluated in more than one study, and each of these studies reported multiple effect sizes. In such cases, program-level effect sizes were estimated using a robust variance estimator (RVE; Hedges et al., 2010; Tipton, 2015) with a small-sample correction and a rho of .80 (see Footnote 1). All analyses were conducted using the metafor (Viechtbauer, 2010) and robumeta (Fisher, Tipton, & Zhipeng, 2017) packages in R (R Core Team, 2020).
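To make the idea of a random effects mean concrete, the Python sketch below pools several effect sizes using inverse-variance weights that include a between-effect variance term. Note the substitution: it uses the closed-form DerSimonian-Laird estimator of that variance rather than REML, and it does not implement RVE, so it will not reproduce the values reported here. The function name and input values are hypothetical.

```python
import math


def random_effects_mean(effects, variances):
    """Pool effect sizes under a random effects model.

    Uses the DerSimonian-Laird estimate of between-effect variance (tau^2),
    a simpler stand-in for the REML estimator named in the text.
    Returns (pooled mean, standard error, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    # Random effects weights add tau^2 to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return mean, se, tau2


# Hypothetical effect sizes and sampling variances for one study.
mean, se, tau2 = random_effects_mean([0.1, 0.4, 0.7], [0.02, 0.03, 0.05])
```

When the effects are homogeneous, the estimated between-effect variance is zero and the result matches a fixed-effect weighted mean.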

Results

Table 1 presents study characteristics for the ten studies included in this best-evidence synthesis. Program and within-study effect sizes are reported in Tables 2, 3, 4, 5, and 6. Studies encompassed 76 individual effect sizes and a combined sample of 2150 students. Three of the ten studies were published between 2000 and 2009. The remaining seven studies were published between 2010 and 2020. Studies were conducted in kindergarten (k = 3), Grade 1 (k = 5), Grades K–1 (k = 1), or Grades K–2 (k = 1). No studies included students in Grade 3, and therefore, findings reflect the effects of interventions for EL students in Grades K–2. Seven of the ten studies solely included students with or at risk for reading difficulties. Studies provided instruction to students 1:1 (k = 3), in small groups (k = 5), and as a whole class (k = 1). One study delivered instruction within a school-wide multi-tiered system of support framework with students receiving instruction in various grouping arrangements. Counterfactuals were rated as high quality (k = 2), moderate quality (k = 3), and low quality (k = 1). The low-quality counterfactual was interactive book reading (Nelson et al., 2011). Four studies did not describe the counterfactual condition. No study had a counterfactual condition rated as attention control.

Table 2 Early Vocabulary Connections and within-study effect sizes
Table 3 K-PALS and within-study effect sizes
Table 4 Proactive Reading and within-study effect sizes

Three studies implemented foundational skill–focused interventions (McMaster et al., 2008; Vadasy & Sanders, 2010, 2011). McMaster et al. (2008) evaluated the effects of K-PALS, which included both PA and phonics components. The two studies reported by Vadasy and Sanders (2010, 2011) implemented a version of Sound Partners (e.g., Vadasy et al., 2006; Vadasy et al., 2008; Vadasy & Sanders, 2012). The remaining seven studies implemented multi-component interventions. Six studies delivered a multi-component intervention with all five reading instructional components (i.e., PA, phonics, fluency, vocabulary, comprehension; Baker et al., 2016; Ehri et al., 2007; Foorman et al., 2018; Tong, Lara-Alecio, et al., 2008b; Vaughn et al., 2006a, b). Baker et al. (2016) implemented a researcher-designed vocabulary-focused intervention titled Transition Lessons. Reading Rescue, the intervention delivered by Ehri et al., used a sequence of lessons aligned to fiction and non-fiction books that increased in difficulty over time. Foorman et al., Vaughn, Cirino, et al., and Vaughn, Mathes, et al. all used scripted, commercially available curriculums. Multi-component instruction in the study reported by Tong, Lara-Alecio, et al. was delivered within a school-wide multi-tiered system of support framework and included Santillana Intensive English (Ventriglia & González, 2000), Story Telling for English Language and Literacy Education (STELLA; Irby et al., 2004), Academic Oral Language (AOL; in kindergarten only; Lakeshore Learning Materials, 1997), Academic Oral Language in Science (AOLS; in first grade; Lakeshore Learning Materials, 1997), Early Intervention in Reading (EIR; Mathes & Torgesen, 2005), and communication games, with students receiving varying levels of support and curriculum components based on need. Nelson et al. (2011) was the only multi-component study that did not include all five coded reading components. In Nelson et al. (2011), the authors delivered a commercially available phonics and vocabulary-focused reading intervention titled Early Vocabulary Connections (Nelson & Vadasy, 2007).

Reading Program Descriptions

Early Vocabulary Connections

Nelson et al. (2011) was the only study to implement the researcher-designed Early Vocabulary Connections. This program focuses on early vocabulary knowledge development, targeting word meaning, phonology, and orthography in an integrated fashion. The curriculum introduces 184 decodable root words in a way that systematically builds student knowledge of grapheme-phoneme correspondences. Instructional activities teach word meanings while also reinforcing the application of decoding skills (Nelson et al., 2011). Activities include defining, blending, and spelling target words; reading decodable text; and using cloze procedures, matching word meanings to pictures, and sentence production tasks to demonstrate understandings of target words. While this program was designed to be delivered in a variety of instructional settings, Nelson et al. (2011) chose to evaluate the effects of instruction in a small-group format (2–5 children per group) in which the activities listed above were delivered by paraeducators in 20-minute sessions. Nelson et al. (2011) had a mean effect of 0.38 for the Early Vocabulary Connections intervention. Table 2 presents the within-study effects.

K-PALS

One study utilized K-PALS (McMaster et al., 2008), implementing a supplemental peer-tutoring program for kindergarten students. This program emphasizes building phonemic awareness, knowledge of letter-sound correspondences, decoding, and fluency using both teacher-directed activities and activities performed by students in pairs. By designating reciprocal roles to each pair of students (which included a higher-performing and a lower-performing reader), the program helps students facilitate their peers’ learning; students take turns being both coach and reader (D. Fuchs et al., 2001). McMaster et al. (2008) implemented K-PALS as designed. Students engaged in Sound Play and Sounds and Words activities during 20- to 30-minute sessions. Sound Play was teacher-directed and consisted of 5 phonemic awareness games, lasting approximately 5–10 minutes. During the remaining 15–20 minutes, reciprocal pairs engaged in Sounds and Words activities in which they practiced identifying sounds, reading words, and reading sentences. McMaster et al. (2008) had a mean effect of 0.29 for K-PALS instruction. Table 3 presents the within-study effects.

Proactive Reading

Proactive Reading (Mathes & Torgesen, 2005) consists of a scripted curriculum that facilitates teachers’ provision of explicit instruction in phonemic awareness, word recognition, and comprehension strategies. Because efficient word identification is a primary focus of this program, students spend much of each session learning and reviewing letter-sound correspondences, sounding out and reading words, and reading decodable connected text. They are also provided with opportunities to spell words and learn/apply comprehension strategies (Mathes & Torgesen, 2005). Proactive Reading is designed to be delivered in a small-group format.

Two studies utilized Proactive Reading (Vaughn et al., 2006a, b). Each study enhanced the curriculum by incorporating oral language development activities during each session. Interventions in both studies were delivered by teachers to groups of 3 to 5 students, 5 days a week, 50 minutes each day, for an average of 115 sessions. Vaughn, Cirino, et al. (2006a) designated 10 minutes of each session to vocabulary, listening comprehension, and academic language development. This included teaching two or three vocabulary words connected to the text, posing questions about vocabulary and key ideas, supporting students in retelling stories, and engaging students in dialogue. Vaughn, Mathes, et al. (2006b) also devoted 10 minutes of each session to the development of oracy and vocabulary. Both studies (Vaughn et al., 2006a, b) included additional instructional practices identified as effective in improving academic outcomes for ELs in previous research, including the use of visuals/gestures, clarification of English word meanings, and opportunities for teachers and students to work together to elaborate on student responses. Vaughn et al. (2006a, b) reported mean effects of 0.08 and 0.65, respectively. The mean program-level effect for Proactive Reading was 0.35. Table 4 presents within-program and within-study effects for studies that examined the effects of Proactive Reading.

Reading Rescue

Reading Rescue, evaluated by Ehri et al. (2007), is a comprehensive intervention model in which objectives, instructor training, observational tools, and progress-monitoring assessments are integral components that accompany curriculum materials and instructional procedures. This program, guided both by a scope and sequence of skills and by students’ performance on assessments, explicitly and systematically provides instruction in phonological awareness, phonics, fluency, vocabulary knowledge, and comprehension. Teacher analysis of students’ assessment performance and written observations during instruction ensures that instruction targets individual students’ needs (Ehri et al., 2007). The amount of time varied for each component, as it was dependent upon each student’s needs as determined by assessments/observations. Reading Rescue is designed to be delivered 1:1, and Ehri et al. (2007) mostly did so, with the intervention provided by certified reading specialists, educators certified in other areas (e.g., counseling, math, and social work), and paraprofessionals. In this quasi-experimental study, we only reported effects on the one measure that could be adjusted for pretest differences, a standardized composite measure of foundational skills and reading comprehension (g = 0.50); this was the only effect size from Ehri et al. (2007) included in this best-evidence synthesis.
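As an illustration of what adjusting a standardized mean difference for pretest differences can look like, the following Python sketch computes Hedges' g from a difference in pre-to-post gains. It is not the procedure documented in this synthesis: the difference-in-gains adjustment, the function names, and all numeric inputs are assumptions chosen for illustration only.

```python
import math


def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled standard deviation of the treatment and comparison groups."""
    numerator = (n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2
    return math.sqrt(numerator / (n_t + n_c - 2))


def hedges_g(mean_diff, sd_pooled, n_t, n_c):
    """Standardized mean difference with Hedges' small-sample correction."""
    d = mean_diff / sd_pooled
    correction = 1 - 3 / (4 * (n_t + n_c - 2) - 1)
    return correction * d


def pretest_adjusted_g(pre_t, post_t, pre_c, post_c, sd_post_t, sd_post_c, n_t, n_c):
    """Hedges' g on the difference in gains (assumed adjustment method)."""
    adjusted_diff = (post_t - pre_t) - (post_c - pre_c)
    sd = pooled_sd(sd_post_t, n_t, sd_post_c, n_c)
    return hedges_g(adjusted_diff, sd, n_t, n_c)


# Hypothetical scores, not taken from any included study.
g = pretest_adjusted_g(
    pre_t=90.0, post_t=100.0,
    pre_c=92.0, post_c=97.0,
    sd_post_t=10.0, sd_post_c=10.0,
    n_t=20, n_c=20,
)
print(round(g, 2))  # about 0.49
```

In this hypothetical case the treatment group gains 10 points and the comparison group gains 5, so the adjusted difference is 5 points, or half a pooled standard deviation before the small-sample correction is applied.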

Sound Partners

Sound Partners (Vadasy et al., 2006; Vadasy et al., 2008; Vadasy & Sanders, 2012) is a scripted reading program in which students receive daily, 30-minute sessions of 1:1 instruction in foundational reading skills, including phonemic awareness, phonics, and spelling. Each lesson begins with the introduction of a new grapheme-phoneme correspondence, followed by review of previously taught correspondences. This is followed by phoneme decoding in which students orally blend phonemes corresponding to graphemes on the page. Irregular words are then introduced and practiced through a read, spell, and reread format. Spelling is also integrated into the letter sounds and irregular words activities. At the conclusion of the lesson, students engage in oral reading of decodable texts through independent, partner, or echo reading practices.

Three studies examined the effects of Sound Partners (Foorman et al., 2018; Vadasy & Sanders, 2010, 2011). Both Vadasy and Sanders studies evaluated effects when paraeducators delivered Sound Partners 1:1, 4 days a week, during 30-minute sessions for 18 weeks. Foorman et al. (2018) implemented Sound Partners for 25–30 minutes daily using locally hired interventionists and school paraprofessionals. Foorman et al. (2018) supplemented Sound Partners with Bridge of Vocabulary (Montgomery, 2007) instruction, which was delivered three times per week for 15 minutes each session to build listening, speaking, and reading skills using manipulatives. In addition, Foorman et al. (2018) included a language instruction supplement titled Language in Motion. Language in Motion (Phillips, 2014) was delivered two times per week to support students’ syntax and language comprehension by means of activities that incorporated manipulatives, stories, and games. Foorman et al. (2018), Vadasy and Sanders (2010), and Vadasy and Sanders (2011) reported mean effect sizes of −0.01, 0.56, and 0.20, respectively. The mean program-level effect for Sound Partners was 0.22. Table 5 presents the within-program and within-study effects for studies using Sound Partners.

Table 5 Sound Partners and within-study effect sizes

Santillana Intensive English, STELLA and AOL(S)

Tong, Lara-Alecio, et al. (2008b) delivered reading instruction in both tier 2 and tier 3 contexts for student participants with the lowest levels of reading performance. The tier 2 intervention utilized Santillana Intensive English as one of three components. The tier 2 program was designed to be used in a variety of instructional settings. It involves content-based English language instruction, during which a math, science, or social studies topic is explored over a four-day period. Lessons focus on the development of vocabulary and listening comprehension, but also of foundational reading skills. In Tong, Lara-Alecio, et al. (2008b), students in kindergarten and first grade received 40 minutes of daily language instruction, provided by either teachers or paraprofessionals. Vocabulary words were introduced and practiced using flashcards, stories, and role-play conversations with pairs and/or small groups of students.

In addition to Santillana Intensive English, Tong, Lara-Alecio, et al. (2008b) incorporated STELLA and AOL/AOLS into their tier 2 interventions. STELLA was used for 25 minutes daily in kindergarten and 40 minutes daily in first grade to support the development of higher-order thinking skills during reading and discussion of culturally relevant texts. Additionally, 10 minutes of each day was spent developing academic oral language via the AOL/AOLS interventions in both kindergarten and first grade, with the emphasis on general academic language in kindergarten and on the academic language of science in first grade.

Tong, Lara-Alecio, et al. (2008b) also included a tier 3 intervention for the lowest-performing students. The tier 3 intervention included an additional 10 and 20 minutes of researcher-developed communication games targeting vocabulary and phonemic awareness for students in kindergarten and the first half of first grade, respectively. The tier 3 intervention in the second half of first grade was EIR level I, a commercially available intervention with explicit and systematic instruction in phonemic awareness, fluency, and comprehension. This study included two standardized measures of oral language, with an oral language composite score outcome of g = 0.002. This effect size represents an aggregate for students who received the tier 2 and the combined tier 2 and tier 3 interventions.

Transition Lessons

Baker et al. (2016) was the only study to use the researcher-developed Transition Lessons with small groups of students who were receiving tier 2 instruction. These lessons were designed for Spanish-speaking students who were transitioning from reading instruction in Spanish to reading instruction in English. Within this program, there were two fundamental components: the development of academic language through content vocabulary instruction, read-alouds of stories, and instruction in comprehension strategies; and skill-building practice in phonemic awareness, letter-sound knowledge, and decoding (Baker et al., 2016). Scripted lessons were taught by teachers and instructional assistants within an explicit instruction framework that included teacher explanations and modeling and scaffolded opportunities to respond and receive corrective feedback. This study had a mean effect of g = −0.02. Table 6 presents the within-study effects for Baker et al. (2016).

Table 6 Transition Lessons and within-study effect sizes

Discussion

This best-evidence synthesis set out to examine the effects of reading interventions on reading and language outcomes for K–2 EL students, as reported in current studies that employed rigorous research designs/methods. Seven programs were implemented across ten studies. Across 76 effect sizes there was an overall mean effect of 0.23, which is smaller than the effect sizes presented in Ludwig et al. (2019; ES range = 0.50–1.22) and the K–1 effect sizes in Richards-Tutor et al. (2016; 0.58–0.91) but not out of line with the effect size range (−0.13–1.00) reported by Richards-Tutor et al. (2016) for the two studies with participants in Grades 2–3. The 0.23 effect size we identified matches the effect size reported by Cheung and Slavin (2012). The fact that the mean effect size for this best-evidence synthesis was smaller than those reported in meta-analyses that allowed for a wider range of study quality in their inclusion criteria is consistent with the previous research finding (Cheung & Slavin, 2012; Hall et al., 2017; Scammacca et al., 2015; Slavin & Smith, 2009) that effect sizes associated with higher-quality studies tend to be smaller than those associated with lower-quality studies. In the present meta-analysis, mean effects were largest on fluency outcomes (g = 0.30), followed by reading comprehension (g = 0.27) and foundational skill outcomes (g = 0.27), and finally oral language (g = 0.11) outcomes. The fact that effects on oral language were relatively small is concerning, given the significant contribution of oral language to reading comprehension for ELs in the upper elementary grades (e.g., Babayiğit, 2014; Cho et al., 2019).
This meta-analysis differed from previous meta-analyses (i.e., Cheung & Slavin, 2012; Ludwig et al., 2019) that did not report effects on oral language/vocabulary outcomes or did not include studies that investigated effects on oral language/vocabulary outcomes without also investigating effects on reading comprehension outcomes. However, even given our more expansive criteria for inclusion, we did not identify many studies that yielded practically significant immediate effects on oral language outcomes; in fact, only the study conducted by Nelson et al. (2011) fits this description. This finding highlights the need to develop and pilot interventions that show more promise in improving oral language outcomes for ELs, a need that has been articulated in previous reviews (Baker et al., 2014; Richards-Tutor et al., 2016). It may be necessary to provide multiple years of content-rich, academic language–focused intervention to see meaningful effects; nevertheless, research suggests that language comprehension is a uniquely powerful contributor to reading comprehension for EL students, and such intervention is therefore worth investigating in future research.

Several programs were found to be effective, although it is worth noting that the studies reporting the largest effects tended to have counterfactual conditions of lower or undefined quality. In terms of building foundational skills, the multi-component Proactive Reading curriculum (Vaughn et al., 2006a, b) produced a mean foundational skill effect of g = 0.50 after seven to eight months of instruction. This effect size was larger than that achieved by any other intervention that targeted foundational skills outcomes. Provision of 18 weeks of the foundational skill–focused Sound Partners (Vadasy & Sanders, 2010) produced a mean effect size of g = 0.49, although in the 20-week Vadasy and Sanders (2011) study of Sound Partners, the intervention produced a smaller mean foundational skill effect size of g = 0.24. Proactive Reading was designed to be delivered in a small group setting; Sound Partners was designed to be delivered in a 1:1 setting. Both Proactive Reading and Sound Partners had samples of students with or at risk for RD and an undefined counterfactual quality. The whole-class K-PALS intervention study (McMaster et al., 2008) had a comparable mean effect on foundational skill outcomes (g = 0.38) when delivered for 18 weeks and a counterfactual quality rating of moderate; this intervention is thus a good option when instruction must be delivered in a whole-class setting.

Effects on fluency outcomes were largest for the 1:1 Sound Partners intervention (Vadasy & Sanders, 2010, 2011; g = 0.53) and for small-group Proactive Reading (Vaughn et al., 2006a, b; g = 0.26), both of which had a sample of students with or at risk for RD and an undefined counterfactual quality. The whole-class K-PALS intervention (McMaster et al., 2008), which was compared to a moderate-quality counterfactual condition, achieved a mean effect size (g = 0.14) that was not much smaller than that achieved by small-group Proactive Reading. Furthermore, K-PALS may be more feasible, as it was delivered for a shorter duration (18 weeks) than was Proactive Reading (7–8 months).

All of the studies that measured reading comprehension exclusively included students with or at risk for RD. Two long-duration, multi-component programs produced the largest effect sizes. Proactive Reading (Vaughn et al., 2006a, b) produced an effect size of g = 0.54 relative to an undefined counterfactual condition. Reading Rescue (Ehri et al., 2007) produced an effect size of g = 0.50, relative to a counterfactual condition in which some students received a research-based, commercially available reading curriculum or explicit instruction in phonemic awareness, phonics, fluency, vocabulary, and comprehension (i.e., moderate quality rating). Reading Rescue had a duration of six months, and the Proactive Reading intervention was slightly longer at seven to eight months. The shorter-duration, phonics-only Sound Partners intervention yielded a similar effect of g = 0.47 in one study (Vadasy & Sanders, 2010), but a much smaller effect size of g = 0.10 in another (Vadasy & Sanders, 2011). It should be noted that, while foundational skills have been shown to contribute to reading comprehension, reading comprehension is the ultimate goal of reading instruction. We acknowledge that it is difficult to measure reading comprehension in very young students (e.g., in kindergartners) who are able to read only the simplest of texts. Still, future research with K–2 ELs should take care to measure the degree to which improvements in foundational skills transfer to reading comprehension when possible.

Again, the smallest effects were reported for oral language outcomes. Across the five programs with measures of oral language, four had small oral language effect sizes: Transition Lessons (g = 0.08; Baker et al., 2016), Santillana Intensive English as part of a school-wide system of support (g = 0.002; Tong, Lara-Alecio, et al., 2008b), Sound Partners as part of a multi-program package (g = −0.002; Foorman et al., 2018), and Proactive Reading (g = 0.06; mean effect from Vaughn et al., 2006a, b). Only Early Vocabulary Connections (Nelson et al., 2011), the intervention that dedicated the greatest percentage of instructional time to building oral language skills, produced meaningful effects on oral language, with effect sizes of 0.24 and 0.67 on standardized and unstandardized measures, respectively (mean effect on oral language was g = 0.45). For those considering Early Vocabulary Connections (Nelson et al., 2011) as an intervention to support EL oral language outcomes, it is worth noting that the counterfactual condition employed in the study reported by Nelson et al. (2011) consisted of interactive book reading (i.e., low rating) and that the sample was not limited to students with or at risk for RD.

In this best-evidence synthesis, as in all syntheses, effect sizes need to be interpreted relative to the counterfactual conditions in place in included studies. Counterfactual conditions are dynamic conditions that can vary in quality over time, as evidence-based instructional best practices are disseminated (Lemons et al., 2014) and as other contextual factors change. Therefore, we described the quality of the counterfactual (high, moderate, low, or not defined) for all included studies. Both Baker et al. (2016) and Foorman et al. (2018) included a high-quality counterfactual condition. The strength of the counterfactual condition in both of these studies may be related to the fact that they had the lowest overall mean effect sizes, ranging from g = −0.01 to −0.02. Three studies compared the effects of an intervention with a moderately rated counterfactual condition (Ehri et al., 2007; McMaster et al., 2008; Tong, Lara-Alecio, et al., 2008b), with mean effect sizes ranging from g = 0.002 to 0.50. It is worth noting that Ehri et al. (2007) reported an effect size of g = 0.50 on a standardized, composite measure of foundational reading skills and reading comprehension. This is among the largest effect sizes reported in any study and the largest reported in a study with a counterfactual condition rated as moderate or high quality. That said, because only one effect size meeting best-evidence synthesis criteria (e.g., pretest mean or similar information was provided so that it was possible to statistically adjust for pretest differences between groups in quasi-experimental studies) could be calculated for the Ehri et al. (2007) study, conclusions that can be drawn from this evidence are somewhat limited. Nelson et al. (2011) was the only study to have a low-quality counterfactual; it reported an effect size of g = 0.38 in favor of the intervention condition. The four studies with undefined counterfactual conditions had the largest range of mean effect sizes (g = 0.08 to 0.65).
Descriptively, this synthesis does provide some evidence that aligns with observations made by Lemons et al. (2014) that reading effect sizes tend to decrease when the counterfactual condition is of higher quality (i.e., employing evidence-based practices).
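The descriptive pattern above can be tabulated from the study-level mean effects and counterfactual ratings reported in this section. The Python sketch below groups the ten studies by rating and takes a simple unweighted mean per group. This is an illustration only, not the synthesis's analytic method: the unweighted averaging, the variable names, and the function name are assumptions, and with so few studies per category the result should be read descriptively.

```python
from collections import defaultdict
from statistics import mean

# (study, counterfactual rating, mean effect size g) as reported in the text.
STUDY_MEAN_EFFECTS = [
    ("Baker et al. (2016)", "high", -0.02),
    ("Foorman et al. (2018)", "high", -0.01),
    ("Ehri et al. (2007)", "moderate", 0.50),
    ("McMaster et al. (2008)", "moderate", 0.29),
    ("Tong, Lara-Alecio, et al. (2008b)", "moderate", 0.002),
    ("Nelson et al. (2011)", "low", 0.38),
    ("Vaughn et al. (2006a)", "not defined", 0.08),
    ("Vaughn et al. (2006b)", "not defined", 0.65),
    ("Vadasy & Sanders (2010)", "not defined", 0.56),
    ("Vadasy & Sanders (2011)", "not defined", 0.20),
]


def mean_effect_by_rating(studies):
    """Unweighted mean of study-level effects within each rating category."""
    grouped = defaultdict(list)
    for _study, rating, g in studies:
        grouped[rating].append(g)
    return {rating: mean(values) for rating, values in grouped.items()}


summary = mean_effect_by_rating(STUDY_MEAN_EFFECTS)
for rating in ("high", "moderate", "low", "not defined"):
    print(f"{rating:12s} {summary[rating]:+.3f}")
```

With these values, the unweighted means are -0.015 (high), 0.264 (moderate), 0.380 (low, a single study), and 0.3725 (not defined), which matches the direction of the trend described above while making clear how few studies sit in each category.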

Limitations and Future Research

This meta-analysis has limitations that suggest topics for future research. First, only ten studies over the past 20 years met the criteria for inclusion in this best-evidence meta-analysis, such that there was inadequate power to investigate the effects of potential moderator variables. The limited number of studies identified also resulted in the majority of the programs (k = 5) having only one study to demonstrate program-level findings. Furthermore, the remaining programs evaluated in more than one study (i.e., Proactive Reading, Sound Partners) often demonstrated significant variation in effect sizes within programs, across studies. The limited program-level evidence, together with the variation in effect sizes across similar studies of similar programs, highlights the need for more research, including replication studies, evaluating the effects of reading instruction designed to improve outcomes for EL students. Funding and conducting future studies that rigorously evaluate the effects of reading programs for primary grade ELs would allow researchers to better understand how student and intervention characteristics impact intervention effects and could enable educators to better individualize interventions for K–3 ELs. These studies would do well to also investigate the effects of interventions in Grades 2–3, as only Foorman et al. (2018) included students enrolled in grades beyond Grade 1 in their sample, and no study included Grade 3 ELs. In addition, because the measurement of reading comprehension is limited in Grades K–1, it would be valuable to determine the effects of reading interventions on reading comprehension for ELs in Grades 2 and 3.

Furthermore, this best-evidence synthesis confirms a need, also articulated by Richards-Tutor et al. (2016), for more research on interventions that target vocabulary and language development. In particular, the findings from Nelson et al.’s (2011) study of Early Vocabulary Connections, which spent the majority of instructional time on oral language activities, suggest that interventions targeting academic language development in young EL students (with or without reading difficulties) have the potential to produce large effects on oral language outcomes, and effects on oral language that are larger than those achieved by studies focusing primarily on foundational skills and/or on multiple components with a lesser focus on building academic language knowledge. Finally, there is also value in measuring intervention follow-up effects in order to determine how intervention impacts are sustained over time. It is even possible that intervention impacts may increase over time as a result of increasing students’ access to grade-level texts in the upper elementary grades, when a greater percentage of classroom time is spent reading to learn as compared to learning to read (Chall & Jacobs, 1983).

This best-evidence synthesis also revealed that studies with high-quality counterfactuals tended to report smaller effects on reading outcomes than studies with low-quality counterfactuals. This suggests that counterfactual quality is worth considering when interpreting intervention and synthesis findings. Future research needs to acknowledge that not all counterfactual conditions are equal. When studies compare an intervention to a high-quality counterfactual condition, the intervention’s effectiveness may appear to be reduced. So that future meta-analyses and syntheses can control for counterfactual quality, researchers need to better measure and document the nature of business-as-usual instruction received by students; note that four of the ten studies included in this best-evidence synthesis did not include counterfactual descriptions.

Recommendations for Practice

This best-evidence synthesis demonstrates the effectiveness of a number of commercially available programs and/or approaches that are replicable by educators. Proactive Reading, delivered in a small group setting to students with or at risk for RD for nearly a school year, produced consistently large effects on foundational skills, fluency, and reading comprehension outcomes, if not on oral language outcomes. Outcomes were similar but smaller for Sound Partners, whose studies also included samples of students with or at risk for RD. Sound Partners was shown to be effective even when delivered for a shorter duration in a 1:1 setting. Even though Reading Rescue was associated with only one effect size in this best-evidence synthesis, the fact that the effect size on the standardized composite measure of foundational skills and reading comprehension was g = 0.50 makes it worth considering when resources are available to provide 6 months of intervention in a 1:1 setting. Additionally, Early Vocabulary Connections, delivered in a small group setting for 20 weeks, had the largest effects on oral language outcomes. Finally, for administrators or educators looking to identify an effective whole-class intervention, K-PALS is a feasible program shown to improve foundational skills and fluency. The studies of these latter two programs included students with diverse reading abilities rather than restricting their samples to students with or at risk for RD.

In summary, this best-evidence synthesis suggests that reading interventions targeting foundational reading skills, fluency, and comprehension have the potential to benefit reading outcomes for K–2 ELs. That said, this best-evidence synthesis clearly identified that more high-quality research is still needed; having a sufficient sample of high-quality, rigorous studies to analyze will provide greater confidence in findings. Additionally, this best-evidence synthesis documented a need for evaluations of reading interventions with ELs in Grades 2 and 3 and for better documentation of business-as-usual reading practices. Finally, more research is needed to better understand the impact of interventions targeting oral language on improved reading outcomes for K–2 ELs.