Introduction

In the United States, the Individuals with Disabilities Education Improvement Act (IDEA, 2004) prioritizes early intervention models for Learning Disabilities (LD) over the “wait-to-fail” model. In this context, the Response to Intervention (RtI) model emerged to assess students’ response to intervention and adjust pedagogical strategies accordingly. This model consists of four essential components: (1) a multilevel prevention system, (2) universal screening, (3) progress monitoring, and (4) data-based decision making (National Center on Response to Intervention [NCRTI], 2010). The multilevel prevention system has three levels or tiers (i.e., Tiers 1, 2, and 3), with the quality and intensity of the intervention increasing at each successive tier. The other components ensure that students with learning difficulties are correctly identified and receive supplemental intervention based on data obtained from early assessment tools (i.e., universal screening). In Spain, early diagnosis of and intervention for literacy and numeracy difficulties have been mandated by education reform (Ley Orgánica 8/2013, 2013). However, the education system has not yet incorporated a multitiered approach to the early detection and support of students at risk of LD.

Ensuring implementation fidelity of K-12 core and supplemental interventions is critical for addressing reading problems and improving student outcomes (Benner et al., 2011; O’Donnell, 2008). Accordingly, the present study examines the role of fidelity in the effects of a Tier-2-type intervention on growth in reading performance among Spanish students at risk for reading difficulties. Implementation fidelity is broadly defined as the extent to which an intervention is delivered as planned (Wolgemuth et al., 2014). When measured as a construct, researchers have mainly considered five aspects of implementation fidelity: (1) adherence—whether the components of the intervention are delivered as designed; (2) duration—the number, length, or frequency of sessions implemented; (3) quality of program delivery—how intervention teachers use prescribed techniques in delivery; (4) participant responsiveness—the level of engagement by participants; and (5) program differentiation—ensuring that the intervention experience differs from nonintervention conditions (Dane & Schneider, 1998; O’Donnell, 2008).

In studies reporting treatment fidelity data, adherence and implementation quality were the most frequently reported dimensions (Capin et al., 2018). In the review by Capin et al. (2018), nearly 90% of studies that reported treatment fidelity data included an indicator of treatment adherence. Adherence was also the only dimension of treatment fidelity identified as an “essential quality indicator” for group research in special education (Gersten et al., 2005, p. 152).

Implementing reading interventions in kindergarten and first grade

Two RtI approaches for the prevention of and intervention in academic and social problems have emerged: standard-treatment protocols and problem-solving models (Lerner & Johns, 2012). In the present study, we followed a standard-treatment approach, which requires using the same empirically validated treatment for all children with similar problems in a given domain (Fuchs et al., 2003). The advantage of this approach is its lack of ambiguity, which streamlines decision-making for school staff. With scripted protocols, everyone knows what to do; thus, treatment fidelity can be maximized (Fuchs et al., 2010).

Since the publication of IDEA (2004), studies conducted mainly in English have evaluated the efficacy or effectiveness of interventions based on Tier 2 of the RtI model with students at risk of reading difficulties. In the last decade, there have been several systematic reviews of reading interventions for students with reading disabilities (Gersten et al., 2020; Wanzek et al., 2016, 2018). Across these reviews, reading effects were found to be higher in grades K–3 than in the upper grades (Wanzek et al., 2016, 2018). Although evidence for the main components prescribed by the National Reading Panel (NRP, 2000) (i.e., phonemic awareness, phonics, reading fluency, vocabulary, and comprehension) has primarily come from English-language studies, similar findings are expected for other languages, such as Spanish. In fact, a review of studies conducted in Spanish, a transparent orthography, suggests that instruction in the NRP components can be effective for monolingual Spanish-speaking children at risk for reading difficulties (Crespo, Jiménez, Rodríguez, Baker, & Park, 2018).

Nevertheless, only a few studies conducted in Spain (Jiménez et al., 2010; Crespo et al., 2018; González & Martín, 2017) have included some of the components prescribed by the NCRTI (2010) for Tier 2 intervention. For instance, González and Martín (2017) analyzed the effectiveness of an early written language intervention aimed at Spanish children at risk of LD in reading. After 3 years of intervention, students in the experimental group outperformed the control group on syllable and phoneme awareness. Jiménez et al. (2010) conducted a pilot study examining the effectiveness of the Program for the Prevention of Specific Learning Difficulties in Reading (PREDEA by its acronym in Spanish) from kindergarten to second grade. The results revealed that the experimental group achieved higher scores than the control group on phonological awareness, letter-sound knowledge, and oral reading fluency. However, the source of variation due to school was not controlled in this study. In a second study, Crespo et al. (2018) analyzed the effects of PREDEA using Hierarchical Linear Modeling (HLM) to control the variability due to the subject (2nd level of the HLM) and the school (3rd level of the HLM). The authors found significant differences between the experimental group and the control group in vocabulary tasks in kindergarten, and in phonological awareness tasks in kindergarten and first grade. They found no differences in nonsense word fluency or oral reading fluency. However, none of the studies conducted in Spain included fidelity of implementation, which is one of the main focuses of the present study.

Implementation fidelity in reading intervention

Although a substantial body of research has examined the relationship between implementation fidelity and nonacademic outcomes (O’Donnell, 2008), few studies have examined this relationship in reading interventions (Nelson et al., 2020). Moreover, authors rarely report treatment fidelity in their analyses of treatment effects (Capin et al., 2018). According to a review of implementation fidelity in reading intervention studies (van Dijk et al., 2019), fidelity is conceptualized and measured in many different ways. While implementation fidelity is thought to include at least five elements (Dane & Schneider, 1998), the elements most commonly examined in these studies are adherence and dosage. Adherence is mostly measured as the average percentage of steps delivered across multiple observations. In this sense, when implementer adherence is reported as the number of observed components delivered, the amount of program content provided is easy to quantify. Adherence may also be reported as a categorical variable, for example “highly adherent” vs. “low adherent” (Cross & West, 2011). In fact, some authors have classified implementers’ adherence using a categorical variable (i.e., high adherence vs. low adherence) and then used this classification as a blocking variable in the final analysis (e.g., Cantrell et al., 2013; Paeplow, 2015; Vadasy et al., 1997). Paeplow (2015) categorized adherence between 66% and 100% as “high”, between 33% and 66% as “mid”, and below 33% as “low”. The same 66% cut-off for “high” was also used by Cantrell et al. (2013); however, in that study the value became the cut-off because it was the mean adherence value, and intervention providers scoring below 66% were considered “low adherents”. Therefore, the distribution of the scores can shift the cut-off point.
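To make the difference between fixed and sample-dependent cut-offs concrete, the short R sketch below illustrates both classification rules with hypothetical adherence values; it is not code taken from any of the cited studies.

```r
# Minimal sketch of adherence categorization (hypothetical percentages).
adherence <- c(28, 45, 61, 70, 88, 95)   # per-implementer adherence (%)

# Fixed bands, following the 33%/66% cut-offs described by Paeplow (2015).
fixed_labels <- cut(adherence,
                    breaks = c(0, 33, 66, 100),
                    labels = c("low", "mid", "high"),
                    include.lowest = TRUE)

# Sample-dependent rule (as in Cantrell et al., 2013, where 66% was the mean):
# the cut-off shifts with the distribution of the observed scores.
mean_labels <- ifelse(adherence >= mean(adherence), "high", "low")

data.frame(adherence, fixed_labels, mean_labels)
```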

In her review, O’Donnell (2008) found five studies in education settings that measured the relationship between treatment fidelity and outcomes in K-12 studies. In all of them, higher treatment fidelity was associated with statistically significant improvements in student outcomes. Nevertheless, only one study examined the extent to which fidelity of implementation enhanced reading outcomes (Hall & Loucks, 1977). In addition, Capin et al. (2018) provided a systematic synthesis of authors’ efforts to support and measure treatment fidelity in early reading intervention studies for students with or at risk for reading difficulties. They found that only 5% of the studies reviewed examined the effect of fidelity on treatment effects. Some of these studies found that treatment fidelity was a significant predictor of student reading outcomes (e.g., Nunnery et al., 2006). For example, Nunnery et al. (2006) used a three-level hierarchical linear model to estimate the impact of a reading program on students’ reading growth trajectories. Although the quality of program implementation did not predict growth, it was correlated with a reduction in the negative effect of learning disability status. Other studies have found that high adherence to treatment was associated with improved reading outcomes (e.g., Vadasy & Sanders, 2009; Vadasy et al., 2015; Wolgemuth et al., 2014). Vadasy and Sanders (2009) assessed treatment adherence to explain differences in reading gains for students in second and third grades. In kindergarten, Vadasy et al. (2015) found that adherence to treatment was associated with improved outcomes for students at risk for reading problems. Similar results were obtained by Wolgemuth et al. (2014) with students at risk of reading difficulties between the ages of 4 and 8. More recently, Hill and Erickson (2019) reviewed the evidence regarding program fidelity in education and found that treatments implemented with low fidelity averaged 24% fewer positive outcomes than those implemented with medium or high fidelity.

The present study

This study was part of a larger research project that piloted the RtI model to prevent reading and math LD in the Canary Islands (Jiménez, de León, & Gutiérrez, 2021). Because students attended different schools across the Canary Islands, socioeconomic status, core reading instruction, and home literacy practices were evaluated through the Reading Teaching Practices Questionnaire (Jiménez, 1986), which assesses different approaches to teaching reading such as phonics vs. whole language, and the Early Literacy Practices at Home Questionnaire, which considers the components of the Home Literacy Model (Sénéchal & LeFevre, 2002) (i.e., motivational, functional, recreational, and instructional). Jiménez et al. (2021) did not find significant differences between the experimental and control groups in classroom practices, home literacy practices, families’ income, or educational level. Furthermore, to ensure that all intervention teachers knew how to successfully implement early identification and intervention, they were given online training through the Letra web-based training program. In addition, Jiménez et al. (2021) explored differences in teachers’ knowledge at the beginning and the end of the intervention program and found that the training carried out through Letra improved teachers’ knowledge of early intervention with students at risk for reading difficulties.

In the current study, we evaluated the role of implementation fidelity in kindergarten and first grade students’ reading outcomes within a Spanish-language Tier-2-type intervention. Our analysis of implementation fidelity focused on whether, and to what extent, the intervention was delivered as planned, assessed through different measures of screening and instruction implementation. Each measure was administered in two modalities (i.e., self-report and direct observation) to evaluate the accuracy with which teachers followed the assessment and instructional sequence proposed in the reading program. We were also interested in the relationship between implementation fidelity measures and student outcomes, to gauge whether higher fidelity yields better outcomes for students’ reading growth. Thus, this study addressed the following research questions:

  • Research Question 1: Are there significant differences in the reading growth (i.e., a composite measure of foundational reading skills) of at-risk kindergarteners who have received a high-fidelity Tier 2 RtI intervention, medium-fidelity Tier 2 RtI intervention, and no intervention?

  • Research Question 2: Are there significant differences in the reading growth (i.e., a composite measure of foundational reading skills) of at-risk first graders who have received a high-fidelity Tier 2 RtI intervention, medium-fidelity Tier 2 RtI intervention, and no intervention?

Method

Participants

Schools

A total of 77 public schools in the Canary Islands took part in the study; 62 schools located on the seven islands of the Autonomous Community of the Canary Islands were included in the experimental group, while 15 schools located on the capital islands (i.e., Tenerife and Gran Canaria) were included in the control group. To guarantee that both groups were similar in terms of geographical area (i.e., urban, suburban, or rural) and socioeconomic status, the selection of the groups was carried out by the educational administration itself through the Agencia Canaria de Calidad Universitaria y Evaluación Educativa (ACCUEE [Canarian Agency for University Quality and Educational Evaluation]) from the Canarian government.

Students

Table 1 summarizes information about the total sample of students evaluated in the fall, winter, and/or spring of each grade in the experimental schools and the control schools. The sample of the experimental schools was made up of all students in the classrooms evaluated by the 62 intervention teachers who carried out Tier 2 of the RtI model. The sample of the control schools consisted of all students in the classrooms evaluated by research support staff.

Table 1 Total sample assessed in fall, winter, and/or spring in the context of RtI model implementation in the Canary Islands

At-risk students

Students in the experimental and control schools were identified as at risk of reading difficulties using a composite score from the curriculum-based measurement (CBM) Indicadores de Progreso de Aprendizaje en Lectura (IPAL [Indicators of Progress in Learning to Read]; Jiménez & Gutiérrez, 2019) administered in the fall and/or winter. The composite was calculated by averaging unweighted standardized subtest scores, and students scoring below the 25th percentile were flagged as at risk, using a specific cut-off score with sensitivity above .80. Only at-risk students in the experimental group (henceforth, EG) received the Tier 2 intervention from the selected teachers (N = 62). Two cohorts of students participated in a one-year intervention: one cohort of kindergarteners and one of first graders. Since this study was longitudinal and conducted in a natural setting, some students dropped out because they moved to a different school or did not attend school during the posttest. At-risk students who could be evaluated only once (n = 12), students who were identified as at risk in the winter but not evaluated in the fall (n = 35), and students with some type of Special Educational Needs (e.g., sensory impairment, neurological problems) (n = 5) were not included in the sample. This information was provided by the Canary Islands Government’s Directorate General of Educational Planning, Promotion and Innovation. Table 2 shows a summary of the at-risk students included in the EG and Control Group (henceforth, CG) in the fall and/or winter after applying these exclusion criteria.
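As a rough illustration of the screening rule just described, the R sketch below standardizes a set of simulated subtest scores, averages them into an unweighted composite, and flags students below the 25th percentile; the variable names and data are hypothetical and do not reproduce the actual IPAL scoring procedure.

```r
# Sketch of the at-risk screening logic (simulated data, hypothetical subtest names).
set.seed(1)
screening <- data.frame(
  student_id             = 1:200,
  alphabetic_knowledge   = rnorm(200, mean = 20, sd = 6),
  phonological_awareness = rnorm(200, mean = 10, sd = 3),
  riddles                = rnorm(200, mean = 8,  sd = 2)
)

subtests <- c("alphabetic_knowledge", "phonological_awareness", "riddles")
z_scores <- scale(screening[, subtests])        # standardize each subtest
screening$composite <- rowMeans(z_scores)       # unweighted average of z-scores

cutoff <- quantile(screening$composite, probs = 0.25)
screening$at_risk <- screening$composite < cutoff   # percentile < 25 flags risk status
table(screening$at_risk)
```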

Table 2 Distribution of students identified at risk of reading difficulties in the experimental group and control group in fall and/or winter through the IPAL

Intervention teachers

A total of 62 teachers participated in the study (91.6% women, 8.4% men). All teachers taught reading in Spanish and carried out the intervention program within Tier 2 of the RtI model. The participating teachers were mainly special education teachers (37.1%), followed by pedagogical support teachers (24.2%), classroom tutors (17.7%), members of the schools’ management teams (6.5%), and other school personnel (14.5%). Teachers’ years of experience ranged from 0 to 38 in kindergarten (M = 7.51, SD = 8.90) and from 0 to 29 in first grade (M = 7.49, SD = 11.3).

The intervention

We used a quasiexperimental design with assignment at the school level. The Tier 2 intervention was carried out over 17 weeks, from January to May, with three to four sessions per week lasting 45 to 55 min each, in small groups of 3 to 5 students, and always as a supplement to classroom instruction. Children who received the intervention outside the classroom returned to their regular classroom at the end of each session to receive the core program. Teachers were advised to carry out the intervention during the first 3 h of the school day. The intervention was based on the essential components of reading, following the proposed learning sequence and the methodology recommended in the instruction sheets: direct (i.e., explain to students what is expected of them), explicit (i.e., what steps should be followed), and systematic (i.e., fixed instructional dynamics, offering different practice opportunities in order to consolidate learning).

Measures

Student reading outcomes and instruments

Kindergarten

In kindergarten, the IPAL included three alternate forms (i.e., fall, winter, and spring) of the following measures: alphabetic knowledge fluency, phonological awareness, concepts about print-questions, concepts about print-images, and riddles. Alphabetic knowledge fluency consisted of two tasks: the student was asked to provide the names of the 27 letters of the alphabet and then their sounds, as fast as possible, with 1 min allowed for each task. Phonological awareness required children to isolate the initial phoneme of words presented orally. Concepts about print-questions evaluated basic knowledge about how print works through six questions that the student answered while looking at a textbook suitable for kindergarten. Concepts about print-images consisted of 14 images presented in the student's booklet; the student was asked to point at several stimuli according to the following instructions: (1) indicate what can be read in these images (four pages); (2) indicate which of these signs are letters; (3) indicate which of these signs are numbers; and (4) indicate where you see a written word. Riddles aimed to evaluate expressive vocabulary and oral comprehension: the tester read aloud a riddle describing an object, place, or living being while showing the student three drawings representing words from the same semantic field, and the student was asked to say aloud the name of the drawing corresponding to the riddle's solution. For example, for the riddle “Which animal can fly?” the student was shown three drawings: a dove, a donkey, and a cow. The IPAL for kindergarten has shown adequate validity and reliability indexes (Gutiérrez, Jiménez, de León, & Seoane, 2020). The composite score used to select at-risk students revealed AUC values from good to excellent across waves (fall = .83; winter = .97; spring = .94), and 3-month delayed parallel-form analysis showed statistically significant positive correlation coefficients (fall–winter = .66, p < .001; winter–spring = .78, p < .001).

First grade

In first grade, the IPAL also included three alternate forms (i.e., fall, winter, and spring) of the following measures: alphabetic knowledge fluency, concepts about print-questions, phonemic awareness fluency, nonsense word fluency, cloze sentences, and oral reading fluency. Alphabetic knowledge fluency (i.e., letter-sound fluency and letter-name fluency) and concepts about print-questions were administered in the same way as in kindergarten. Phonemic awareness fluency required children to segment nonsense words into phonemes as fast as possible in 1 min. Nonsense word fluency required students to read as many nonsense words as possible, presented on a sheet of paper, in 1 min. Cloze sentences was a 5-min assessment of vocabulary and phrase-level comprehension: students were presented with 20 cloze sentences in which the final word was deleted and replaced with three multiple-choice alternatives. Oral reading fluency was a 1-min assessment consisting of orally reading a connected text with speed and accuracy. In first grade, the IPAL has shown adequate validity and reliability indexes (Gutiérrez, Jiménez, & de León, 2021). The composite score used to select at-risk students revealed high AUC values across waves (fall = .98; winter = .98; spring = .97), and 3-month delayed parallel-form analysis showed statistically significant positive correlation coefficients (fall–winter = .87, p < .001; winter–spring = .91, p < .001).

Corrective reading program

The Instructional Activities for Early Reading Improvement (IAERI) program (Jiménez et al., 2019) was included in the Letra program. It is a structured Tier 2 reading intervention program based on the principles of systematic and explicit instruction and the use of modeling, scaffolding, and feedback (Gersten et al., 2009). It includes structured activities and materials designed to facilitate the development of phonological awareness, alphabetic knowledge, vocabulary, fluency, and comprehension. The IAERI program followed this sequence: (1) presentation of the activity; (2) modeling; (3) group activity; (4) group corrective practice; (5) individual activity; and (6) individual corrective practice. The materials were designed for both students and teachers, to guide teachers’ instruction and guarantee that all teachers followed the same instructional methodology. Teachers were given a sheet to record students’ daily attendance and the activities carried out during the sessions. For both the curriculum-based measurement and the instructional sequence proposed in the IAERI program, teachers marked whether the screening process was completed through the Fidelity of the Assessment Scale (FAS; see Appendix Table 5) and whether the lesson steps were completed after each lesson through the Fidelity of the Intervention Scale (FIS; see Appendix Table 6).

Fidelity of implementation

To examine the role of treatment fidelity in the reading outcomes of kindergarten and first grade students within a Spanish-language Tier-2-type intervention, two measures were designed: the FAS and FIS scales. Each measure was developed in two modalities: self-report and direct observation.

The FIS assessed the accuracy with which teachers followed the instructional sequence proposed in the IAERI program. It included 12 items rated on a 5-point Likert-type scale (0 = none, 1 = little, 2 = some, 3 = most, 4 = all). The FAS assessed the accuracy with which teachers followed the administration instructions of the IPAL. It also included 12 items rated on the same 5-point Likert-type scale (0 = none, 1 = little, 2 = some, 3 = most, 4 = all).

Both the self-report and the external observation versions of the scale for evaluating the administration of the IPAL showed adequate indices of internal consistency (αself = .92; αobs = .84). Regarding interrater reliability, the intraclass correlation coefficient (ICC) for the global observation scale revealed 84.22% agreement (95% CI = .74–.90) between two observers who made five observations together per month for five months (i.e., 30% of the total follow-up observations). Regarding fidelity of instruction, both the self-report and the external observation versions showed adequate indices of internal consistency (αself = .94; αobs = .85). As for the ICC, the FIS showed 94.65% agreement (95% CI = .91–.97) between two observers who made five observations together (i.e., 30% of the follow-up observations).
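A minimal sketch of the interrater reliability check is shown below, assuming a two-way agreement ICC computed on the doubly coded follow-up observations; the exact ICC model used in the study is not specified, so this choice and the simulated ratings are illustrative only.

```r
# Interrater reliability sketch for the doubly coded fidelity observations.
library(irr)   # provides icc()

set.seed(2)
observer_A <- round(runif(25, min = 2, max = 4), 1)               # 25 joint observations
observer_B <- pmin(4, pmax(0, observer_A + rnorm(25, 0, 0.2)))    # second rater, small noise

# Two-way, single-rater, absolute-agreement ICC between the two observers.
icc(cbind(observer_A, observer_B),
    model = "twoway", type = "agreement", unit = "single")
```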

The sample of intervention teachers was divided into two groups based on the accuracy with which they implemented the intervention (i.e., high fidelity vs. medium fidelity). Given that all teachers had fidelity values above 2 points, we used the 50th percentile as the cut-off so that the three groups of interest (i.e., control, medium fidelity, and high fidelity) would have a more similar number of subjects. This cut-off score (cut-off point = 3.34) on the total score generated from the FIS and FAS scales was used to identify a high-fidelity group (M = 3.58, SD = .18) and a medium-fidelity group (M = 2.99, SD = .34). In kindergarten, the high-fidelity group comprised 120 students and the medium-fidelity group 105. In first grade, the group who received the intervention with high fidelity comprised 167 students and the medium-fidelity group 150. Student's t test for independent samples confirmed that there were significant differences between the accuracy means (i.e., total score) of the two groups (t(60) = 8.56, p < .001).
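The grouping step can be sketched as follows (simulated fidelity totals for the 62 teachers; the column names are hypothetical): teachers at or above the 50th percentile of the combined FAS/FIS total form the high-fidelity group, the rest form the medium-fidelity group, and an independent-samples t test checks the difference between the group means.

```r
# Sketch of the median split on the combined FAS/FIS total (simulated data).
set.seed(3)
teacher_fidelity <- data.frame(
  teacher_id     = 1:62,
  fidelity_total = runif(62, min = 2.3, max = 3.9)   # all teachers above 2 points
)

median_cut <- quantile(teacher_fidelity$fidelity_total, probs = 0.50)
teacher_fidelity$group <- ifelse(teacher_fidelity$fidelity_total >= median_cut,
                                 "high-fidelity", "medium-fidelity")

# Independent-samples t test comparing the two groups' mean fidelity totals.
t.test(fidelity_total ~ group, data = teacher_fidelity, var.equal = TRUE)
```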

All teachers were asked to fill out a monthly self-report scale about the quality of their implementation for both intervention (i.e., IAERI) and assessment (i.e., IPAL). These self-reports were hosted on the WebRtI (see WebRtI.ull.es).

External observations were made by psychology graduate students to evaluate the administration of the IPAL (i.e., external observation using the FAS) and the instructional sequence proposed in the IAERI program (i.e., external observation using the FIS). According to Johnson et al. (2006), the frequency with which teachers are observed to ensure fidelity of implementation varies depending on teacher experience level, teacher requests for support, overall class performance, or the increase or decrease of special education students. Several researchers have argued that all teachers should receive at least one observation for implementation fidelity (e.g., Johnson et al., 2006; Nelson et al., 2011). Therefore, we ensured that all participating teachers received at least one direct observation while implementing the IAERI program and another while administering the IPAL assessment. Some authors have used monthly observations (e.g., Nelson et al., 2011); indeed, monthly or weekly observations are desirable as a way to assess the stability of implementation fidelity over the year. However, we did not receive enough funding to afford monthly observations of all teachers. Therefore, observations were planned so that 75% of the participating teachers were observed at least once (i.e., unique observations) during a complete evaluation of a student using the IPAL (approximately 30 min) and during a full session of the IAERI program (approximately 50 min). The remaining 25% of the teachers were observed monthly for both the IPAL administration and the IAERI implementation (i.e., follow-up observations). At the beginning of the school year, each teacher was assigned an observer for the whole year. In order to assess interjudge reliability, 30% of the follow-up observations were carried out by two observers. In addition, the observers offered advice and feedback on how to implement the IPAL and the IAERI program after each observation session. Once each evaluation was completed, the corrections were made in the examiner's template and the data were uploaded to the WebRtI.

Procedure

To carry out this study, a joint collaboration agreement was established between the Dificultades Específicas de Aprendizaje, Psicolingüística & Tecnologías de la Información y la Comunicación (DEAP&NT) research group of the Universidad de La Laguna, and the Department of Education and Universities of the Autonomous Community of the Canary Islands (Boletín Oficial de Canarias, 2017).

Teacher training combined online and face-to-face components. For the online training, the teachers who carried out the intervention were trained through the Letra program (see http://letras.ull.es/) (Jiménez et al., 2020), whose main aim is to provide teachers with knowledge and strategies for effectively teaching the basic components of early reading instruction within the RtI model. The training was carried out over a period of 12 weeks (from September to November), and a researcher from the DEAP&NT group tutored each group of teachers. Letra includes three modules: theoretical, assessment, and intervention. The theoretical module includes information about the RtI model, LD in reading, prevention, and intervention. The assessment module provides information about the use and administration of the IPAL. Finally, the intervention module includes information on implementing RtI Tier 2 using the IAERI program for phonological awareness, alphabetic knowledge, vocabulary, fluency, and comprehension. Thus, this initial training focused on the tutorials in the theoretical module, the CBM module, and the IAERI program. In the face-to-face meetings, teachers were given an example of the materials used to carry out the evaluations: examiner's manual, student's booklet, examiner's IPAL template, examiner's IPAL record sheet, and stopwatch.

Because the Canary Islands comprise seven islands, face-to-face training was carried out in the provincial capitals (i.e., Tenerife and Gran Canaria), so teachers from the other islands needed to travel to the capital islands. The teachers received a total of four face-to-face sessions lasting 4 h each. These meetings were held to guarantee the proper administration of the IPAL and implementation of the IAERI program by the teachers, following the instructional methodology established for Tier 2. A courier company was in charge of sending all the materials to the schools on the different islands.

Screening assessments in both the experimental and control schools took place three times throughout the school year (i.e., fall, winter, and spring) and were carried out with all students in the selected classrooms, regardless of their risk status in previous evaluations. The IPAL was administered individually, in a room isolated from noise and distractions, during the first three class hours. Experimental group assessments were carried out by the intervention teachers who had received the Letra training and attended the face-to-face meetings.

The evaluations in the control group were carried out by external evaluators who received theoretical and practical training. The team consisted of 19 examiners for the screening evaluations, all specialized in psychology or teaching; three of them were researchers from the DEAP&NT group who were responsible only for recording the observations. Each examiner attended three 4-hour sessions (i.e., one session per quarter) on IPAL administration and data entry into the RtI website. The RtI website allowed teachers and research assistants to identify students’ risk status after entering the students’ raw scores for each measure: at-risk students were marked with a red dot, low-achieving students with a yellow dot, and typically developing students with a green dot. Furthermore, the website used all the students’ scores throughout the year to monitor their progress. The examiners also received training in the external observation scales used to evaluate the IPAL assessment and the IAERI program.

Data analysis

The role of treatment fidelity in the reading outcomes of kindergarten and first grade students was examined by tracking changes in the composite score over time using Hierarchical Linear Modeling (HLM) analysis (Raudenbush & Bryk, 2002). A composite IPAL (C-IPAL) score was calculated to compare the growth curves of the EG and CG groups. The C-IPAL measure was calculated using the percent of maximum possible (POMP) method (Cohen et al., 1999), which sets the range of each scale from 0 (minimum possible) to 100 (maximum possible); the resulting POMP-transformed scores can be interpreted as percentages of the maximum possible score (Moeller, 2015). Three-level models were estimated, with repeated observations of student outcomes over time at the first level, differences among students at the second level, and schools at the third level. The step-by-step procedure of Peugh and Heck (2016) was followed to choose the most parsimonious model, and both linear and quadratic models were explored. The likelihood-ratio test and deviance statistics were used as indices to choose the HLM model with the best fit (Raudenbush & Bryk, 2002). When collinearity (r < −.80 or r > .80) was found between the intercept and the time slope set as random, at either the student or school level, only time remained random (Barr et al., 2013). Two models for longitudinal nested data were considered: unconditional and conditional. The unconditional or baseline model was used to explore the need to control for variability among schools (i.e., model 0). In the conditional model, the group factor was added to the analysis as a second-level predictor to analyze whether the rate of change of the global C-IPAL score differed across groups (i.e., model 1). Statistical analyses were carried out using SPSS (v. 21.0) and ULLRToolbox (Hernández-Cabrera, n.d.) for R (v. 3.4.3) software.
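To make the pipeline concrete, the R sketch below shows a POMP rescaling function and an unconditional and a conditional three-level growth model fitted with lme4::lmer on simulated long-format data. The study itself used SPSS and ULLRToolbox for R, so this is only an illustrative equivalent under stated assumptions; all column names and the data are hypothetical.

```r
# Sketch of the analysis pipeline: POMP rescaling and three-level growth models.
library(lme4)

# POMP transformation: rescale a raw score to 0-100 given its possible range.
pomp <- function(raw, min_possible, max_possible) {
  100 * (raw - min_possible) / (max_possible - min_possible)
}
pomp(13, min_possible = 0, max_possible = 27)   # e.g., 13 of 27 letters named -> 48.1

# Simulated long-format data: one row per student per wave (fall = 0, winter = 1, spring = 2).
set.seed(4)
long_data <- expand.grid(student_id = 1:150, time = 0:2)
long_data$school_id <- ((long_data$student_id - 1) %% 30) + 1
long_data$group <- factor(c("CG", "EGM", "EGH")[((long_data$student_id - 1) %% 3) + 1],
                          levels = c("CG", "EGM", "EGH"))          # CG as reference
long_data$c_ipal <- 20 + 10 * long_data$time +
  3 * long_data$time * (long_data$group == "EGH") + rnorm(nrow(long_data), 0, 8)

# Model 0 (unconditional growth): time only; only time is kept random at the
# student and school levels, as when intercept-slope collinearity is detected.
m0 <- lmer(c_ipal ~ time + (0 + time | student_id) + (0 + time | school_id),
           data = long_data, REML = FALSE)

# Model 1 (conditional growth): group and its interaction with time added.
m1 <- lmer(c_ipal ~ time * group + (0 + time | student_id) + (0 + time | school_id),
           data = long_data, REML = FALSE)

anova(m0, m1)   # likelihood-ratio test / deviance comparison of the two models
```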

Results

The fidelity analysis of the IPAL administration, based on both teachers’ self-reports (M = 3.54, SD = .35) and research assistants’ reports (M = 3.44, SD = .50) on the FAS, indicated that teachers administered the IPAL with high fidelity. Regarding the use of the IAERI program, both teachers’ self-reports (M = 3.28, SD = .52) and research assistants’ reports (M = 2.85, SD = .62) on the FIS showed adequate levels of fidelity. Likewise, teacher self-reports on adherence correlated positively with those of independent observers for the IPAL administration (r = .39, p < .01) and the IAERI program (r = .41, p < .01).

EG students received an average of 34.3 (SD = 17.2, minimum = 4, maximum = 75) intervention sessions in kindergarten and 34.1 (SD = 16.3, minimum = 3, maximum = 73) intervention sessions in first grade. Through the HLM analysis, we compared the growth rates of the EG and the CG. For this analysis, the EG was divided according to the accuracy with which the teacher carried out the intervention (i.e., high-fidelity [EGH] and medium-fidelity [EGM]). We used the 50th percentile, so both groups (i.e., high-fidelity and medium-fidelity) would have a similar number of subjects. The cut-off score (i.e., cut-off point = 3.34) of the total score generated from the FIS and FAS scales was used to identify the high-fidelity group (M = 3.58, SD = .18) and the medium-fidelity group (M = 2.99, SD = .34). HLMs were performed with the global C-IPAL score as the dependent variable for kindergarten and first grade. In the conditional HLMs, the CG was included as the reference group to compare the students’ growth with that of the EGH and EGM students.

Kindergarten

Table 3 shows the results obtained after performing the HLMs following the step-by-step procedure (Peugh & Heck, 2016; Raudenbush & Bryk, 2002) throughout the school year. The analysis of residuals revealed that all models met the assumption of normality; therefore, no transformation was necessary. Figure 1 shows the growth in the C-IPAL measure for the different groups. The unconditional baseline model showed that the ICC due to the school was 14.96% in kindergarten, highlighting the need to control for variability among schools.

Table 3 Growth prediction of the C-IPAL score across groups using hierarchical linear models throughout kindergarten
Fig. 1 Growth slopes predicted through the hierarchical linear model in kindergarten C-IPAL scores

The unconditional growth model (i.e., the model including the time variable but without the group variable) revealed significant growth in the C-IPAL score (i.e., growth was > 0 across measurement occasions). Once the group variable was included (i.e., the final conditional growth model), the analysis revealed no significant differences between the groups at the beginning of the school year (i.e., the intercept). Regarding the growth slopes, the interaction between the CG and EGM slopes (i.e., time × medium fidelity) was significant and positive, as was the interaction between the CG and EGH slopes (i.e., time × high fidelity). In general, the EGH group showed greater growth coefficients than the EGM group. The global C-IPAL score showed that all groups had similar performance at the beginning of the school year (intercept γ000 = 24.82 points). However, the EGH and EGM showed significantly greater growth than the CG (which grew 6.69 points per measurement occasion), and the EGH growth (14.66 points per measurement occasion) was greater than that of the EGM (11.38 points per measurement occasion). At the 2nd level of the model, which considered the variability attributable to students, the time variable was included to model the slope of the random effects. The correlation between the intercept and the slope at this level was positive, indicating that students who started the school year at a higher level had steeper growth slopes; for the C-IPAL score this correlation was ρ(u000, u110) = .49, indicating that students who started with a higher overall score grew faster on this measure. The 3rd level of the model, which considered the variability attributable to schools (i.e., an estimator of the expected correlation between the scores of students from the same school), also revealed a positive correlation for the C-IPAL score (ρ(u000, u101) = .14), although of low magnitude.

First grade

Table 4 shows the results obtained after performing the HLMs following the step-by-step procedure in first grade. The analysis of residuals revealed that all models met the assumption of normality; therefore, no transformation was necessary. Figure 2 shows the growth in the C-IPAL measure for the different groups. The unconditional baseline model showed that the ICC due to the school was 15.32% in first grade, highlighting the need to control for variability among schools.

Table 4 Growth prediction of the C-IPAL score across groups using hierarchical linear models throughout first grade
Fig. 2 Growth slopes predicted through the hierarchical linear model in first grade C-IPAL scores

The unconditional growth model (i.e., the model including the time variable but without the group variable) revealed significant growth in the C-IPAL score (i.e., growth was > 0 across measurement occasions). Once the group variable was included (i.e., the conditional growth model), the contrast at the beginning of the school year revealed that, compared with the CG, performance in the EGH and EGM was significantly lower. Regarding the growth slopes, the interaction between the CG and EGM slopes (i.e., time × medium fidelity) was not significant, whereas the interaction between the CG and EGH slopes (i.e., time × high fidelity) was significant. As in kindergarten, the EGH showed greater growth coefficients than the EGM.

In contrast to kindergarten, the HLM of the global C-IPAL score showed that the EG had a significantly lower score than the CG at the beginning of the school year (intercept γ000 = 16.17 points; EGM = 9.82; EGH = 10.02). The EGM showed no difference in growth with respect to the CG (CG slope = 12.24; EGM slope = 14.19), whereas the EGH showed significantly higher growth than the CG (EGH slope = 16.47). At the 2nd level of the model, the correlation between the intercept and the slope was positive; as in kindergarten, the C-IPAL score showed a high correlation (ρ(u000, u110) = .78), indicating that students who started with a higher score had a steeper growth slope. At the 3rd level of the model, the intercept and the slope also revealed a positive correlation (ρ(u000, u101) = .35), indicating that schools with higher average C-IPAL scores at the beginning of the academic year showed faster growth than schools with lower averages.

Discussion

The purpose of this study was to examine the extent to which fidelity of implementation enhanced or constrained the reading growth of at-risk students who received a Tier 2 intervention with high fidelity and at-risk students who received the same intervention with medium fidelity, compared with an at-risk control group. Our analyses revealed that most teachers administered the IPAL with high accuracy: the mean fidelity rate measured through the assessment scale was 87.2%. When implementing the IAERI program, the teachers and the external observers rated the intervention's accuracy using the fidelity of the intervention scale, yielding a fidelity of 76.6%. These results indicate that the degree of agreement between teachers and external observers was above 80% for the IPAL administration and above 70% for the implementation of the instructional reading program. Therefore, the quality of both assessment and intervention delivery was guaranteed.

Adherence to the delivery of lessons as designed (i.e., following the lesson format) and to the assessment (i.e., following the screening format) had a statistically significant effect on students' reading outcomes. The results for the global longitudinal IPAL measure (i.e., the C-IPAL), both in kindergarten and first grade, showed that students at risk for reading difficulties benefited from an intensive Tier 2 intervention carried out with high fidelity. Intensity was increased by offering an intervention supplementary to Tier 1 and by working in small groups. As expected, in both grades, the accuracy with which the program was delivered affected reading performance. This effect was more evident among first grade students, where only the group that received the high-fidelity intervention improved significantly compared with the at-risk control group. One possible explanation is that, in first grade, students were evaluated with measures requiring word and sentence reading, whereas in kindergarten the measures assessed foundational reading skills. In first grade, the number of children identified as at risk was higher than in kindergarten, and intervention teachers were advised to prioritize students with very low reading performance at the beginning of the school year. Consequently, at the beginning of the school year, there were differences in reading performance between the at-risk students in the experimental group and those in the control group.

However, although students in the intervention group started with significantly lower performance, those who received a high-fidelity intervention showed significantly greater growth. These results are similar to those of other studies showing that students who have reading difficulties or who are at-risk readers benefit more from systematic and controlled instruction following a structured program than from less direct implementation approaches (Richards-Tutor et al., 2015). Therefore, although it can be concluded that fidelity of implementation improved reading growth among at-risk students who received a Tier 2 intervention, the intervention implemented with medium fidelity was effective only in kindergarten. In another study, Jiménez et al. (2021) analyzed how risk status changed for the students who participated in the present study and found that the earlier the intervention, the greater the percentage of students whose status switched from at risk to typically developing. This suggests that early intervention reduces early reading difficulties in Spanish, preventing later word reading difficulties and the over-identification of dyslexia.

A few studies have examined the role of fidelity in producing statistically or educationally significant treatment effects on student outcomes (Nunnery et al., 2006; Vadasy & Sanders, 2009; Vadasy et al., 2015; Wolgemuth et al., 2014). Overall, a consistent finding among these studies was that treatment adherence accounted for differences in reading gains for students with reading difficulties. However, these studies did not reveal what level of fidelity is sufficient for reading gains. Within this context, Hill and Erickson (2019) concluded that “moderate and strong fidelity yield the same likelihood of on-average positive impacts on student outcomes, leading to the intriguing hypothesis that moderate fidelity may be enough to yield positive program outcomes” (p. 596). The present study went beyond previous studies to test the effect of fidelity level on reading performance in kindergarten and first grade students. We found that higher-fidelity interventions had greater effects on reading growth in children at risk of developing reading difficulties. Interestingly, our results suggest that medium- and high-fidelity interventions did not yield the same outcomes in each grade; when we analyzed the effect of fidelity level on reading outcomes, the difference between medium and high fidelity appeared much more relevant for explaining the reading growth of at-risk first graders. Nevertheless, we must be cautious when making this comparison across studies because the review by Hill and Erickson focused on classroom-level interventions, whereas our study focused on a small-group intervention.

In sum, we evaluated the role of fidelity, measured as the degree to which teachers adhered to the intervention components, within the context of a standard-protocol reading intervention for at-risk students. This study found that treatment adherence influences the growth of reading performance among at-risk students, suggesting that it is a particularly important component of implementation fidelity. In addition, while both high- and medium-fidelity interventions were effective in kindergarten, our findings highlight the need for a higher level of fidelity to improve the reading growth of first grade students.

Educational implications

This study shows that fidelity of implementation of a supplemental intervention for kindergarten and first grade students at risk of reading problems is critical for improving student outcomes. Previous studies in Spanish have shown the effectiveness of a Tier 2 approach implemented from kindergarten to second grade (Crespo et al., 2018). However, to the best of our knowledge, this is the first study examining intervention fidelity in this population. We believe the findings from our study should be considered when implementing RtI policies in this context. Simply implementing a multi-tiered system is not enough to improve the reading skills of at-risk students; implementation should be supported by teacher professional development that also provides feedback and resources (e.g., materials and time) throughout the implementation. Implementing the RtI model in this way might improve students’ outcomes and avoid the later over-identification of students as presenting reading disabilities.

Limitations of the study and future lines of research

This study has some limitations. First, we did not collect data for all dimensions of the fidelity construct. We used adherence checklists because they are the most common approach to monitoring intervention fidelity (Nelson et al., 2020). However, other important aspects of implementation fidelity (i.e., exposure, quality of delivery, responsiveness, and program differentiation) can also contribute to students’ outcomes (Hill et al., 2012), and deficiencies in any of these areas could diminish the impact of a program or confound the interpretation of its outcomes. Consequently, we would have produced a more complete picture of the adequacy of program delivery if we had monitored all of the above factors. Future research should therefore explore the separate effects of each fidelity dimension and examine their impact on reading outcomes.

Second, implementation fidelity measures were applied only to the intervention groups; therefore, results should be cautiously interpreted as we did not control for the fidelity of the instruction received by the control group.

Third, we used self-reports and field observations to evaluate the fidelity of the intervention. Previous studies have noted that teacher self-reports on adherence correlated negatively with those of independent observers (Bickman et al., 2009). For this reason, it is generally not recommended that self-reports be used as the sole means of assessing treatment integrity. Moreover, when self-reports were used alongside field observations, researchers found that self-report data indicated higher fidelity levels than field observations (Emshoff et al., 1987). In our study, the mean of the teacher self-reports for fidelity of intervention was higher than the mean of the external observations, although this was not the case for fidelity of assessment. This could have influenced the results, although both the self-report and the external observation versions of the fidelity of the intervention scale showed adequate indices of internal consistency, and teachers' self-reports of adherence correlated positively with those of independent observers.

Fourth, although the training emphasized that the intervention should be carried out as an ancillary intervention to Tier 1, no data were collected to verify this. The over-identification of at-risk students in first grade could also be related to a lack of quality instruction at Tier 1. Addressing this issue requires providing specialized training to teachers working in the general classroom and coordination among the school's different educational agents.

Fifth, another limitation was the lack of resources to attend to all the at-risk students in the experimental group, which was especially important in first grade. Given the over-identification of at-risk students, teachers had to choose the students who showed the highest risk status for the intervention group, thus limiting comparability with the control group. Although HLM controls for students' starting points, we have seen that initial performance also has an important effect on students' growth slopes, and it is therefore more difficult to isolate the effect of the intervention if the groups are not comparable at the beginning of the school year. Consequently, future research should make observations or collect data about the type of attention that at-risk students receive at control schools.