Acquisition of the component skills of the alphabetic principle – letter recognition, letter name identification, and letter–sound correspondence – is the best predictor of future success in reading, spelling, reading comprehension, and vocabulary skills (Cunningham & Stanovich, 1997; National Early Literacy Panel [NELP], 2008; Petscher et al., 2020). Failing to acquire these skills is correlated with a future diagnosis of a specific learning disability in reading (Gallagher et al., 2000) and behavioral and emotional problems (Lonigan et al., 1999).

Federal policy, such as the No Child Left Behind Act (2001) from the US Department of Education, emphasizes implementing scientifically based reading instruction with fidelity. However, early childhood educators have reported using free, commercially available programs (e.g., Zoo-Phonics®) and selecting components of programs based on personal preference or teaching style despite the lack of research-based evidence (Gischlar & Vesay, 2018; Kretlow & Helf, 2013; Pianta et al., 2009). Thus, it is imperative to understand the effects of programs and program components commonly implemented in the classroom. Many reading programs utilize multisensory techniques, in which mnemonic devices are paired with stimuli to engage multiple senses (i.e., multisensory). Mnemonic devices include auditory stimuli, tactile stimuli, visual stimuli, and kinesthetic movements and are designed to increase the rate of acquisition for the target skill and enhance short- and long-term maintenance and application of the information, especially when the information is unfamiliar, complex, abstract, and extensive (Levin, 1993; Putnam, 2015).

In addition to the National Reading Panel (2001), many researchers highlight the importance of using visual mnemonics to teach early literacy skills (Roberts & Sadler, 2019; Shmidman & Ehri, 2010), but research comparing visual mnemonics to no mnemonics on letter–sound correspondence has shown varied efficacy depending on the visual mnemonic presentation, that is, whether the picture associated with the letter is presented next to the letter (extra-stimulus) or within the letter (within-stimulus; e.g., “w” forming parts of the wings in a butterfly). Marsh et al. (1974) compared extra-stimulus visual mnemonics and no mnemonics (i.e., letters presented in isolation) on letter–sound correspondence. Participants in the extra-stimulus visual mnemonics group performed significantly better during training trials than participants in the no mnemonics group, but there was no significant difference between groups during testing trials when letters were presented in isolation. This finding may be due to overshadowing, in which the more salient stimulus (picture) in the compound stimulus controlled responding rather than the target stimulus (letter; Dittlinger & Lerman, 2011). Alternatively, within-stimulus visual mnemonics were found to be significantly more efficacious at promoting the acquisition of letter–sound correspondence than the extra-stimulus visual mnemonics and the no mnemonic interventions with kindergarteners (Agramonte & Belfiore, 2002), first-grade students (Ehri et al., 1984; Fulk et al., 1997), and fourth-grade English Language Learner students (Sener & Belfiore, 2005).

Maintenance and generalization following the acquisition of letter–sound correspondence using visual mnemonics were assessed in several studies. Regarding within-stimulus visual mnemonics, Agramonte and Belfiore (2002) and Sener and Belfiore (2005) demonstrated mastery-level responding 1, 2, and 3 weeks post-intervention. Fulk et al. (1997) demonstrated skill maintenance 2 weeks post-intervention and a decrease at 4 weeks. In contrast, de Graaff et al. (2007) implemented a set 6-session training (i.e., no mastery criteria) in a group design and found an increase in correct responding 4 weeks post-training. Further, Ehri et al. (1984) and Shmidman and Ehri (2010) evaluated how letter sound acquisition via extra-stimulus and within-stimulus pictures affects generalization to additional reading skills with a series of pre- and post-tests. Pre- and post-tests included letter–sound correspondence, picture identification, letter writing (Ehri et al., 1984; Shmidman & Ehri, 2010), letter–sound identification, matching letter sounds to pictures, spelling, and reading words (Shmidman & Ehri, 2010). On average, the within-stimulus picture group performed significantly better than the extra-stimulus picture group on all post-tests. These results suggest that pairing within-stimulus pictures with letters results in an immediate benefit (i.e., letter sound acquisition) and long-term advantages (i.e., generalization to reading skills).

Pairing kinesthetic movements with literacy skills (i.e., engaging in physical movement while emitting the auditory target skill) has also been demonstrated as an effective teaching method for young children (Lozy et al., 2020; Rule et al., 2006). Lozy et al. found that pairing kinesthetic movements with a traditional drill flashcard intervention produced faster letter–sound acquisition compared to the traditional drill flashcard intervention alone. Rule et al. demonstrated the efficacy of a tactile intervention on phonemic awareness (auditory discrimination of language into discrete parts of words, syllables, and individual sounds) in which students manipulated letters. Further, Lozy et al. demonstrated that children who engaged in more kinesthetic movements during training trials acquired letter sounds with fewer sessions compared to students who engaged in fewer movements during training trials. The maintenance and generalization of paired kinesthetic movements have also been demonstrated to be superior to traditional teaching approaches. Specifically, researchers showed a higher number of correct responses during maintenance for the paired kinesthetic movement condition compared to the traditional teaching approach at an end-of-day post-training session (Glenberg et al., 2004) and 1, 3, 5, and 7 weeks post-training (Lozy et al., 2020). Related to generalization, Campbell et al. (2008) demonstrated an increase in nonsense word reading due to paired kinesthetic movement training for literacy skills.

Although previous studies examined the effects of visual and kinesthetic mnemonics in isolation, these techniques are commonly presented in combination. The few studies that have examined multisensory curricula as a package came to discrepant conclusions about the benefits of multisensory instruction over traditional instruction without mnemonics (Callinan & Zee, 2010; DiLorenzo et al., 2011). A beneficial preliminary step to determining the necessary and sufficient components of multisensory curricula is systematically examining the effects of components in isolation and in combination. Combined effects of mnemonics cannot be assumed by attempting to add their isolated effects.

The general purpose of these experiments was to extend the literature on the multisensory teaching approach by evaluating commonly used mnemonics in early literacy programs, both in isolation and when combined. First, we evaluated the effects of pairing mnemonic devices with letters on letter–sound correspondence with preschool and kindergarten students. Experiment 1 compared kinesthetic movements to within-stimulus pictures paired with a traditional drill flashcard intervention (i.e., no mnemonic) and a probe-only (no intervention) condition. Experiment 2 compared a single mnemonic intervention condition (kinesthetic movements, based on the results of Experiment 1) to a combined mnemonic condition (within-stimulus pictures plus kinesthetic movements), both paired with a traditional drill flashcard intervention, and a probe-only condition. We chose to pair mnemonics with a traditional drill flashcard intervention because it has been demonstrated to be an effective method to increase early literacy skill acquisition with preschoolers (Griffin & Joseph, 2015; Lozy & Donaldson, 2019) and allows for the isolation of the effects of paired mnemonics (Lozy et al., 2020).

Second, we examined the generalization of letter–sound correspondence, acquired via each intervention, on other pre-reading tasks using a pre-test–post-test design. Pre- and post-tests evaluated stimulus generalization, letter sound recognition, letter sound identification, reading nonsense words, and spelling nonsense words. Third, we evaluated participant preference for each intervention condition using a concurrent chains procedure (Hanley et al., 1997; Lozy et al., 2020). Finally, we evaluated the maintenance of letter–sound correspondence by conducting probes 1 to 9 weeks post-letter set mastery.

Experiment 1

The purpose of Experiment 1 was to compare kinesthetic movements and within-stimulus pictures when paired with a traditional drill flashcard procedure on letter–sound correspondence. Two participants did not have enough unknown stimuli following the pre-test to allow for a replication of all conditions. For these participants, we included a traditional drill flashcard condition (TD; no mnemonics) in the evaluation.

Method

Participants, Consent, Assent, and Setting

Experimenters recruited students from a public preschool and kindergarten in southeast Louisiana and by word-of-mouth. Participants were selected after receiving administrative, teacher, and caregiver consent, and child assent. Teachers and caregivers referred 24 children for participation, and 14 children met the inclusion criteria. Children were included if they (a) responded with fewer than 10 correct responses during the letter–sound correspondence pre-assessment, (b) responded with the correct echoic for at least 12 letter sounds during the echoic assessment (in the absence of the letter card), and if recruited from the school, (c) were regularly in class. Inclusion criteria were based on (a) the minimum number of unknown letters to allow for an intervention comparison; (b) ability to correctly emit the letter sound; and (c) availability to receive the intervention. Seven participants experienced the intervention conditions and seven participants, who did not receive an intervention, served as the control participants. One participant assigned to the intervention group, Dale, provided inconsistent assent based on intervention condition (i.e., only assented to participate in the intervention condition that included pictures). For this reason, we modified the intervention procedures used to target letter–sound acquisition with this participant. Procedures and data for Dale are available via Supplemental Information.

The experimenters intended all participants to experience the interventions after post-tests; however, due to the COVID-19 pandemic, the school year ended prior to the scheduled date.

Participant demographics and assignments are listed in Table 1. We conducted sessions in an empty classroom or cafeteria in the participants’ school or at a table in the participant’s home (Nico). We conducted two to five sessions per day, 3 to 4 days per week in a single session block with 2-min breaks between sessions.

Table 1 Participant demographics

Materials

Intervention materials included stimulus cards, contingency correlated stimulus cards, and reinforcers. Letters were printed on the front and back of the traditional drill cards. The corresponding picture from the Zoo-Phonics® program was printed on the front of the within-stimulus picture cards. The modeling, error correction procedure, and description of the kinesthetic movement (if applicable) were printed on the back of the cards to increase the probability of experimenter treatment fidelity. Sample stimuli are presented in Appendix A. Contingency correlated stimulus cards, denoting each condition, are presented in Appendix B. Pre- and post-tests included test-specific materials. Sample stimuli for pre- and post-tests are presented in Appendix C.

Reinforcers were selected by asking teachers or caregivers about each student’s preferred activity. Reinforcers included access to games on an iPad® or edibles (Milo only, beginning at session 34). When the iPad® was the reinforcer, the experimenter used a token board with eight Velcro® hook-and-loop dots to signal the duration of access at the end of session; each token represented 15 s of access. When edibles were the reinforcer, the experimenter implemented an accumulated reinforcer arrangement (DeLeon et al., 2014; Ward-Horner et al., 2017). Experimenters placed a dime-sized piece of the selected edible on a plate for each correct response and provided participants access to the plate at the end of the session.

Dependent Variables

Data collectors used paper and pencil data sheets to record the participants’ responses during all phases. During the pre-assessment, observers scored a correct response if the participant emitted the letter sound corresponding with the letter card within 5 s of the experimenter presenting the card and vocal instruction. For vowels, observers scored a correct response if the participant emitted the short vowel sound. We only scored the short vowel sound as correct because the long vowel sound is identical to the letter name (Piasta, 2014). Observers scored an incorrect response if the participant failed to respond or responded with anything other than the correct response. Nothing was recorded if the participant emitted the long vowel sound or the letter name; instead, the experimenter said, “That’s one sound the letter makes, but what’s another sound?” or “That’s the name of the letter, but what sound does it make?,” respectively, and then scored the participant’s subsequent response. During the echoic assessment, a correct echoic was scored if the participant’s response matched the sound emitted by the experimenter. Experimenters scored an incorrect echoic if the participant did not respond or if the participant’s response was anything other than the correct echoic.

During the pre- and post-tests, correct responses were specific to the test and were scored if the participant emitted the correct response within 5 s of the prompt. We also evaluated the percentage of correct responses per test by dividing the number of correct responses by the total number of trials and multiplying that number by 100. Correct responses are denoted in Table 2.

Table 2 Pre- and post-tests (Experiments 1 and 2)

The primary dependent variables in the intervention comparison were the number of correct responses during mastery probes and the number of training sessions to mastery criteria (i.e., participant emitting the correct response to all condition stimuli during two consecutive mastery probes). The secondary dependent variables were the percentage of correct responses during training sessions and the percentage of correct movements emitted during training sessions for the kinesthetic movement conditions. We calculated the percentage of correct responses by dividing the number of correct responses by the total number of trials and multiplying that number by 100. During kinesthetic movement conditions, observers scored a correct movement if the participant emitted the physical movement that corresponded with the stimulus card within 5 s of the experimenter presenting the discriminative stimulus. An incorrect movement was scored if the participant responded with anything other than the correct movement. No movement was scored if the participant did not respond with any physical movement. The percentage of correct movements for each stimulus was calculated by dividing the number of correct movements by the total number of trials and multiplying that number by 100. A tertiary dependent variable was the percentage of training sessions at 100% accuracy. We calculated the percentage of sessions at 100% accuracy by dividing the number of sessions at 100% accuracy by the total number of training sessions and multiplying that number by 100.
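As an illustration only, the percentage calculations described above can be sketched in Python. The function names and the sample data are hypothetical and are not part of the study materials; the sketch assumes eight-trial training sessions.

```python
def percent(part: int, total: int) -> float:
    """Return part / total expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be a positive number of trials or sessions")
    return part / total * 100


def percent_sessions_at_full_accuracy(correct_per_session: list[int],
                                      trials_per_session: int) -> float:
    """Percentage of training sessions in which every trial was correct."""
    full = sum(1 for c in correct_per_session if c == trials_per_session)
    return percent(full, len(correct_per_session))


# Hypothetical data: correct responses in four eight-trial training sessions.
correct_counts = [6, 8, 7, 8]
session_percentages = [percent(c, 8) for c in correct_counts]
full_accuracy = percent_sessions_at_full_accuracy(correct_counts, 8)
print(session_percentages, full_accuracy)
```

The same `percent` helper would apply to the percentage of correct movements, with correct movements in place of correct responses.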

The dependent variable during the preference assessment was the selection response, defined as the participant picking up or pointing to the contingency-correlated stimulus card. The selection response was recorded as the condition selected.

Interobserver Agreement

A second observer independently collected in vivo data on the card presentation and participant response and movement (when applicable) for all participants. An agreement was defined as the experimenter and observer recording the same card, participant response, and movement (when applicable). An experimenter calculated interobserver agreement (IOA) using the trial-by-trial exact agreement method. We divided the number of agreements by the total number of trials per session, multiplied that number by 100 to obtain a percentage, and then averaged the percentages across sessions for each condition. During preference assessments, an agreement was defined as both observers recording the same selection response. IOA was calculated using the exact agreement method across sessions. An agreement was coded as 1, and the total number of agreements was divided by the total number of selection responses and multiplied by 100 to obtain a percentage.
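The trial-by-trial exact agreement calculation can be sketched as follows. This is a minimal illustration under stated assumptions: each trial record is represented as a hypothetical (card, response, movement) tuple, and the sample records are invented for the example.

```python
from statistics import mean
from typing import Optional

# Hypothetical trial record: (card, participant response, movement or None).
Trial = tuple[str, str, Optional[str]]


def session_ioa(primary: list[Trial], secondary: list[Trial]) -> float:
    """Percentage of trials on which both observers recorded identical data."""
    if not primary or len(primary) != len(secondary):
        raise ValueError("both observers must record the same, nonzero number of trials")
    agreements = sum(1 for p, s in zip(primary, secondary) if p == s)
    return agreements / len(primary) * 100


def mean_ioa(sessions: list[tuple[list[Trial], list[Trial]]]) -> float:
    """Average the per-session agreement percentages across sessions."""
    return mean(session_ioa(p, s) for p, s in sessions)


# Hypothetical records for two four-trial sessions.
obs_a = [("a", "correct", "correct"), ("b", "incorrect", "none"),
         ("m", "correct", "correct"), ("s", "correct", "incorrect")]
obs_b = [("a", "correct", "correct"), ("b", "incorrect", "none"),
         ("m", "correct", "correct"), ("s", "correct", "correct")]
print(session_ioa(obs_a, obs_b))                       # one disagreement in four trials
print(mean_ioa([(obs_a, obs_b), (obs_a, obs_a)]))
```

A trial counts as an agreement only when the card, the response, and the movement all match, which mirrors the exact agreement definition given above.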

During the pre- and post-tests, across all tests, IOA was assessed during an average of 35% of sessions and was always 100%. During the first evaluation, across all intervention conditions, IOA was assessed during 31% of sessions for Nico, 78% of sessions for Quade, 52% of sessions for Cain, 54% of sessions for Sloan, and 48% of sessions for Calla. Mean IOA was 100% for Nico, Quade, Cain, Sloan, and Milo, and 96% for Calla (range, 90% to 100%). During the second evaluation, across all conditions, IOA data were collected during 34% of conditions for Nico, and IOA was 100%. The experimenters assessed IOA during an average of 50% of preference assessment sessions, and IOA was 100%.

Screening

Pre-assessment

An experimenter conducted a pre-assessment to identify letters to be included in the intervention comparison. The experimenter divided 26 letters into three sets of eight, nine, and nine letters and conducted two sessions for each set across different days. During sessions, the experimenter presented each card once and asked, “What sound does this letter make?” The experimenter delivered praise and a reinforcer following correct responses and no consequence for incorrect responses. If the participant responded with the letter name or long vowel sound, the experimenter did not record the response and said, “That’s the name of the letter, but what sound does it make?” or “That’s one sound the letter makes, but what’s another sound?” The participant’s subsequent response was recorded, and contingencies for emitting a correct and incorrect response were identical to the initial trial. Additionally, the experimenter made descriptive comments (e.g., “You’re working really hard!”) after approximately every third incorrect response.

Experimenters matched participants who had the same or a similar number of correct responses during the pre-assessment. Then, each participant in each matched pair was randomly assigned to either the intervention or control group. Nico was not matched with a control participant because research sessions were conducted during the summer prior to recruiting all other participants. Additionally, Cassie was not matched with an intervention-group participant due to an uneven number of participants during the school year.
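One way to picture the matching and random assignment is the sketch below. It rests on an assumption the text does not state: participants are ranked by pre-assessment score and adjacent participants are paired. The participant labels and scores are hypothetical.

```python
import random


def match_and_assign(scores: dict[str, int],
                     rng: random.Random) -> list[dict[str, str]]:
    """Pair participants with adjacent pre-assessment scores (an assumed
    matching rule), then randomly assign one member of each pair to the
    intervention group and the other to the control group."""
    ranked = sorted(scores, key=scores.get)
    pairs = []
    # With an odd number of participants, the last one stays unmatched.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        pairs.append({"intervention": pair[0], "control": pair[1]})
    return pairs


# Hypothetical pre-assessment scores (number of correct letter sounds).
pre_assessment = {"P1": 2, "P2": 7, "P3": 3, "P4": 8}
print(match_and_assign(pre_assessment, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the assignment reproducible for auditing, which is a design choice of the sketch rather than something reported in the study.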

Echoic assessment

An experimenter conducted an echoic assessment to determine if participants could correctly imitate the letter sound for letters to which they responded incorrectly during the pre-assessment. The experimenter divided the incorrect stimuli into three equal sets and conducted one session for each set. The number of stimuli per set depended on the number of incorrect responses but ranged from four to seven. During sessions, the experimenter did not show the card and told the participant to repeat after them (e.g., “Say /p/”). Experimenters delivered praise following correct echoics. The experimenter provided no consequence for incorrect echoics.

Experimental Design

We used a multielement design in which intervention and probe-only conditions were rapidly alternated in a 6:1 or 4:1 ratio, depending on the number of intervention conditions for each participant. We also used a between-groups pre-test–post-test design to evaluate the intervention effects on complex literacy skills that were not directly targeted during the intervention. We used a matched pairs control group, based on the number of correct responses during the pre-assessment. In addition, we used a concurrent chains procedure to assess participant preference for intervention conditions.

Procedure

All participants followed the same procedure: pre-tests, pre-intervention training, intervention comparison, post-tests, replication (Nico only), preference assessment, and maintenance. The protocol is outlined in Fig. 1. Each session block, for all procedures, was between 10 and 15 min in duration.

Fig. 1 Flowchart of procedures for participants in Experiment 1 and Experiment 2. Note. Dashed line represents participants in the matched pairs control group. Solid line represents participants in the intervention group.

Pre-Tests

The experimenter selected either 12 or 16 letters (3 or 4 vowels and 9 or 12 consonants, respectively) depending on the number of letters to which the participant responded incorrectly during the pre-assessment and that the participant correctly imitated during the echoic assessment. The experimenter divided the letters into three or four groups of four letters, each containing one vowel and three consonants. The experimenter conducted each test once for all letter sets, and tests were similar to those implemented by Shmidman and Ehri (2010). Tests included (1) stimulus generalization (Wunderlich & Vollmer, 2017; Wunderlich et al., 2014), (2) letter sound recognition, (3) letter sound identification (an essential skill for fluent reading; NELP, 2008), (4) reading nonsense words, and (5) spelling nonsense words. The use of nonsense words increased the confidence of intervention effects because the chance of receiving outside instruction for the words was reduced (MacQuarrie & Tucker, 2002). Test descriptions are denoted in Table 2. During each test, the experimenter delivered a reinforcer contingent on a correct response within 5 s of the prompt and did not deliver feedback for incorrect responses.

After Test 3, the experimenters re-assigned the letters to three or four four-letter sets based on the number of correct responses per stimulus during Tests 1, 2, and 3. The experimenters also matched letter sets based on variables pertinent to letter sound acquisition (Wolery et al., 2014). That is, letters were split into sets that resulted in an approximately equal number of correct responses. In addition, letters were separated if they were the initial letter of the participant’s first or last name (Turnbull et al., 2010), shared similar features, such as b and d (Treiman et al., 2006), or were commonly learned later in development, such as v and j (Justice et al., 2005). The experimenter conducted Tests 4 and 5 with the new letter sets.

Letter Sets

The experimenter assigned letter sets to intervention and probe-only conditions based on the number of correct responses during the pre-tests. The experimenter assigned the letter set with the highest number of correct responses to the probe-only condition and randomly assigned the remaining sets to the intervention conditions. The intervention conditions included a traditional drill flashcard method (TD; no mnemonic), a paired kinesthetic movement with traditional drill (KM-TD), and a paired within-stimulus picture with traditional drill (WS-TD).
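The assignment rule can be sketched as below. The set labels and pre-test counts are hypothetical, and the handling of ties (the first set listed with the highest count goes to probe-only) is an assumption of the sketch, since tie handling is not described.

```python
import random


def assign_letter_sets(correct_by_set: dict[str, int],
                       interventions: list[str],
                       rng: random.Random) -> dict[str, str]:
    """Assign the set with the most pre-test correct responses to probe-only
    and randomly assign the remaining sets to the intervention conditions."""
    if len(correct_by_set) != len(interventions) + 1:
        raise ValueError("need exactly one more letter set than intervention conditions")
    # Assumption: on a tie, the first set listed with the highest count is used.
    probe_set = max(correct_by_set, key=correct_by_set.get)
    remaining = [name for name in correct_by_set if name != probe_set]
    rng.shuffle(remaining)
    assignment = {probe_set: "probe-only"}
    assignment.update(zip(remaining, interventions))
    return assignment


# Hypothetical pre-test totals for three four-letter sets.
pretest_correct = {"set_A": 1, "set_B": 3, "set_C": 0}
print(assign_letter_sets(pretest_correct, ["KM-TD", "WS-TD"], random.Random(0)))
```

For participants who also received the TD condition, the same call would take four sets and three intervention labels.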

TD was included for two participants because they emitted an incorrect letter sound and correct echoic for 16 letter sounds. Because there is a cap on stimuli available (i.e., 26 letters in the alphabet), these participants had insufficient remaining letters to conduct a replication of all conditions with new letters.

Pre-Intervention Training

Pre-intervention training consisted of three phases: phoneme segmentation, picture identification, and kinesthetic movement training. During all phases, the experimenter delivered praise and a reinforcer contingent on correct responses and delivered a phase-specific response contingent on incorrect responses. Each training phase ended after the participant independently emitted the correct response for each stimulus in the set.

Phoneme segmentation training was conducted to ensure participants could imitate and segment the first sound in words. Procedures were similar to Ehri et al. (1984) and were conducted with words that began with the letters assigned to each condition. The experimenter selected words that were different from the picture and kinesthetic movement paired with the letter for the intervention comparison. Prior to training, the experimenter said, “I’m going to say a word and the beginning sound, and I want you to copy me.” During training, the experimenter said the word and the first letter sound. For example, the experimenter said, “/k/ kite /k/” and waited for the participant to imitate them. If the participant failed to respond after 5 s or responded with anything other than the correct response, the experimenter repeated the prompt.

Picture training was conducted to ensure participants could correctly identify the animal that would be presented in the WS-TD condition. Before training, the experimenter said, “I want to see if you know these animals.” During training, the experimenter presented each animal that was going to be used in the WS-TD condition and asked, “What animal is this?” If the participant responded with the incorrect animal, the experimenter responded with the correct animal name and shuffled the card back into the pile until the participant emitted the correct response.

Kinesthetic movement training was conducted to ensure participants could emit the physical movement paired with each letter during KM-TD conditions. Prior to training, the experimenter said, “I’m going to do some movements, and I want you to copy me.” During training, the experimenter emitted the movement and said, “Copy me.” An incorrect movement resulted in the experimenter physically guiding the participant to engage in the movement.

Intervention Comparison

Mastery probes were conducted at the beginning of each research block and were conducted with cards only containing the letter (i.e., no picture for the WS-TD set) to determine letter–sound mastery prior to training. Preceding each probe, the experimenter said, “I want to see if you know these letter sounds.” During the probe, the experimenter presented each letter once from each condition and said, “What sound does this letter make?” for a total of 12 or 16 trials. Letters for each condition were grouped together within the probe, and the order of the conditions was randomized each day. Contingencies for emitting an incorrect response, the letter name, and the long vowel sound were identical to the pre-assessment screening. A correct response resulted in praise and the delivery of a reinforcer. If the participant emitted the correct response to all letters for one condition in the probe, the respective training session was not conducted. We defined the mastery criterion as the participant emitting the correct response to all condition stimuli during two consecutive mastery probes.
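The mastery criterion can be expressed as a simple check over successive probe results for one condition. The function name and the sample probe data below are hypothetical and serve only to illustrate the rule of two consecutive all-correct mastery probes.

```python
from typing import Optional


def first_mastery_probe(probe_results: list[list[bool]]) -> Optional[int]:
    """Return the 1-based number of the probe that completes two consecutive
    all-correct mastery probes for one condition, or None if not yet met."""
    previous_all_correct = False
    for number, probe in enumerate(probe_results, start=1):
        all_correct = len(probe) > 0 and all(probe)
        if all_correct and previous_all_correct:
            return number
        previous_all_correct = all_correct
    return None


# Hypothetical probe outcomes for one four-letter set (True = correct).
probes = [
    [False, False, True, False],
    [True, True, True, True],
    [True, True, False, True],   # an error resets the run of all-correct probes
    [True, True, True, True],
    [True, True, True, True],
]
print(first_mastery_probe(probes))
```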

After the mastery probe, prior to the start of the intervention sessions, the experimenter presented the contingency correlated stimulus card and stated a condition specifying statement (e.g., “We are going to learn letters when they have pictures in them”). Additionally, the experimenter conducted modeling before every intervention session with letters that were incorrect in the mastery probe. Modeling procedures were dependent on the intervention condition and were repeated until the participant imitated the experimenter.

All intervention sessions consisted of eight trials. At the start of the session, the experimenter shuffled the training cards, presented each letter once, asked, “What sound does this letter make?,” reshuffled the cards, and then presented each card again. A correct response resulted in praise and the delivery of a reinforcer, and an incorrect response resulted in an error correction procedure identical to modeling for each condition. The error correction procedure was repeated until the participant correctly imitated the experimenter. Contingencies for emitting the letter name and long vowel sound were identical to the pre-assessment screening.
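The trial order within a training session (shuffle, present each card once, reshuffle, present each card again) can be sketched as follows. The letters shown are placeholders, and the seeded random generator is an assumption used only to make the example reproducible.

```python
import random


def build_session_trials(letters: list[str], rng: random.Random) -> list[str]:
    """Return the card order for one session: every letter once in a shuffled
    order, then every letter again after an independent reshuffle."""
    first_pass = list(letters)
    rng.shuffle(first_pass)
    second_pass = list(letters)
    rng.shuffle(second_pass)
    return first_pass + second_pass


# A hypothetical four-letter set yields the eight trials of one session.
trials = build_session_trials(["a", "b", "m", "s"], random.Random(1))
print(trials)
```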

If a participant did not meet the mastery criterion for one condition after 10 intervention sessions following mastery in another intervention condition, the experimenter presented the unmastered set according to the first mastered condition.

Probe-only

A set of letters was presented only during mastery probes (i.e., the probe-only condition) as a control to evaluate if correct responses increased due to outside instruction. That is, one set of letters never entered intervention.

Traditional Drill

Prior to TD modeling, the experimenter said, “We are going to learn letters.” During modeling, the experimenter presented the letter card, said the letter sound twice, and told the participant to imitate them. A TD intervention set was included in Sloan and Cain’s intervention comparison.

Kinesthetic Movement-TD

Prior to KM-TD modeling, the experimenter said, “We are going to learn letters with movements.” During modeling, the experimenter presented the card, said the letter sound, named the animal paired with the movement, and said the letter sound while engaging in the movement. Then, the experimenter told the participant to imitate them. For example, if the experimenter presented the letter card “a,” they said “/a/, alligator, /a/” while extending and opening and closing their arms as if they were an alligator.

Within-Stimulus Picture-TD

Before WS-TD modeling, the experimenter said, “We are going to learn letters with pictures.” During modeling, the experimenter presented the letter card, said the letter sound, named the picture, said the letter sound, and told the participant to imitate them. For example, if the experimenter presented the letter card “b,” they said, “/b/, bear, /b/.”

Quade, Sloan, and Cain did not meet the mastery criterion for the WS-TD condition within 10 intervention sessions following mastery in the KM-TD condition. Thus, the WS-TD stimuli were presented using the KM-TD procedure.

Post-Test

We evaluated the extent to which the intervention increased correct responses for related literacy skills. Five post-tests, identical to the pre-tests, were conducted immediately after set mastery. When post-tests were conducted for a participant who experienced the intervention, post-tests were also conducted for the matched-pair control participant. Post-tests were not conducted with Milo or with matched-pair control participants Riley, Cassie, and Wyatt due to the COVID-19 pandemic school closures.

Replication

After participants mastered each intervention set, we replicated the pre-assessment screening procedures with the probe-only letters and letters excluded from the initial evaluation. If the participant responded correctly to more than five letters during the replication pre-assessment, a replication intervention was not conducted due to lack of available stimuli. A replication was only conducted with Nico.

Preference Assessment

We conducted a concurrent chains preference assessment to identify each participant’s preference for each condition (Hanley et al., 1997; Lozy et al., 2020). First, the experimenter conducted a mastery probe with the letter set previously assigned to the probe-only condition. Second, the experimenter presented the initial link: they placed the contingency correlated stimuli in front of the participant, pointed to each stimulus while saying the condition specifying statement, and asked the participant about the conditions paired with each card. Third, the experimenter said, “I’m going to teach you the letters. Pick how you want to learn them,” and then waited 5 s for the participant to make a selection. If the participant selected more than one card, the experimenter repeated the condition specifying statements and said, “Choose one.” The terminal link consisted of a session of the intervention procedure selected by the participant. For example, if the participant selected the card correlated with WS-TD, the experimenter would implement the WS-TD training procedure. The experimenter presented four contingency correlated stimuli: KM-TD, WS-TD, TD, and probe-only (excluding Nico, whose choices were only the intervention conditions). The concurrent chains procedure was repeated prior to every training session (e.g., 2–3 times per training block) during the preference assessment.

The preference assessment ended when the participant met the mastery criteria for the letter set (i.e., 100% accuracy across two mastery probes across two days) as opposed to a preference criterion. For one participant, Nico, an additional set of stimuli that was not included in the intervention comparison was presented in a second preference assessment. We defined preference as three or more consecutive selections of one condition.
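The preference definition above (three or more consecutive selections of one condition) can be illustrated with a minimal sketch. The function name, the `run_length` parameter, and the example selection sequences are hypothetical and are not drawn from the study's data or materials.

```python
from typing import Optional, Sequence


def preferred_condition(selections: Sequence[str], run_length: int = 3) -> Optional[str]:
    """Return the first condition chosen `run_length` times in a row, else None.

    `selections` is the ordered list of initial-link choices (hypothetical data).
    """
    streak_condition: Optional[str] = None
    streak = 0
    for choice in selections:
        if choice == streak_condition:
            streak += 1
        else:
            # A different card was selected, so the run restarts.
            streak_condition = choice
            streak = 1
        if streak >= run_length:
            return streak_condition
    return None


# Hypothetical selection sequences for illustration only.
print(preferred_condition(["KM-TD", "WS-TD", "WS-TD", "WS-TD", "TD"]))
print(preferred_condition(["KM-TD", "WS-TD", "TD", "KM-TD", "WS-TD", "TD"]))
```

Under this sketch, a participant who rotates among conditions from session to session never produces a qualifying run, so no preference is identified.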

Maintenance

We evaluated the extent to which each condition resulted in maintenance of mastered letter sounds. Maintenance probes were identical to the mastery probe. To equate the delay from set mastery to maintenance probes between conditions, maintenance probes were conducted based on the timing of mastery of each set rather than mastery of both letter sets. Maintenance probes were conducted 1, 5, 7, and 9 weeks post set mastery.

Treatment Integrity

We assessed treatment integrity by determining if the experimenter presented each card twice during training sessions and if the experimenter delivered the programmed consequence on each trial. We calculated treatment integrity by dividing the number of correct trials (i.e., correct card and consequence) by the total number of trials per session and multiplying that number by 100 to obtain a percentage. A second observer collected data on trial consequences to assess IOA for treatment integrity. We defined an agreement as the experimenter and observer recording the same consequence. We calculated IOA using the trial-by-trial exact agreement method.
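The two calculations described above can be sketched as follows. This is an illustration only: the function names and example records are hypothetical, and the IOA function assumes the conventional trial-by-trial formula (agreements divided by total trials, multiplied by 100), which the text names but does not spell out.

```python
from typing import Sequence


def treatment_integrity(correct_trials: int, total_trials: int) -> float:
    """Percentage of trials implemented correctly (correct card and consequence)."""
    if total_trials <= 0:
        raise ValueError("total_trials must be positive")
    return correct_trials / total_trials * 100


def trial_by_trial_ioa(primary: Sequence[str], secondary: Sequence[str]) -> float:
    """Percentage of trials on which both observers recorded the same consequence."""
    if len(primary) != len(secondary) or len(primary) == 0:
        raise ValueError("records must be non-empty and of equal length")
    agreements = sum(p == s for p, s in zip(primary, secondary))
    return agreements / len(primary) * 100


# Hypothetical session: 8 trials, 7 implemented correctly.
print(treatment_integrity(7, 8))

# Hypothetical consequence records from the experimenter and a second observer.
experimenter = ["praise", "correction", "praise", "praise"]
observer = ["praise", "correction", "praise", "correction"]
print(trial_by_trial_ioa(experimenter, observer))
```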

During the first and second evaluation and preference assessment, the experimenters assessed treatment integrity during 100% of WS-TD, KM-TD, TD, and probe-only sessions. The experimenter delivered the correct consequence during 99% (range, 98% to 100%) of all sessions for all participants. During the first and second evaluation, the experimenters assessed IOA for treatment integrity during 48% of WS-TD sessions, 53% of KM-TD sessions, 45% of TD sessions, and 41% of probe-only sessions. Treatment integrity IOA was 100% across all sessions for all participants.

Experiment 1: Results and Discussion

Pairing Mnemonics with Letters and Letter–Sound Correspondence

Figure 2 depicts the number of correct responses during mastery and follow-up probes for the KM-TD, WS-TD, TD, WS-TD presented in KM-TD, and probe-only conditions. Of the six completed evaluations, four participants mastered the stimulus set in the KM-TD condition prior to the WS-TD condition (see Quade, Cain, and Sloan’s evaluations and Nico’s second evaluation), and two participants mastered the WS-TD and KM-TD sets at approximately the same time (see Calla and Nico’s first evaluation). Of the four participants who met the mastery criterion in the KM-TD condition first, three participants met the criterion for the WS-TD stimuli to be presented in the KM-TD condition (see Quade, Cain, and Sloan’s evaluations). Two evaluations included a TD stimulus set: Sloan and Cain. Sloan mastered the TD and KM-TD sets at approximately the same time (i.e., a difference of five sessions, which could occur in one to two session blocks), and both TD and KM-TD were mastered before mastering letters initially assigned to the WS-TD condition that were subsequently mastered in the KM-TD condition. Cain mastered the stimulus sets in the following order: KM-TD, TD, and WS-TD presented in KM-TD. In addition, in 6 of 6 evaluations, participants did not meet the mastery criterion for the probe-only set.

Fig. 2

The number of correct responses per mastery and follow-up probes. Note. WS-TD = Within-Stimulus Picture; KM-TD = Kinesthetic Movement; PO = Probe-Only; TD = Traditional Drill. (1) = Evaluation 1; (2) = Evaluation 2

Engaging in Movements and Letter–Sound Acquisition

For KM-TD sessions, we examined the correlation between the percentage of correct movements and the number of stimulus trials to meet mastery criterion. We calculated the percentage of correct movements for each evaluation by dividing the total number of trials in which the participant emitted the correct movement by the total number of trials and multiplying that number by 100. Quade, Nico (evaluation 1), Calla, Cain, Nico (evaluation 2), and Sloan emitted the correct movement during an average of 1%, 17%, 35%, 41%, 44%, and 48% of trials across each evaluation, respectively. They mastered each set after 40, 50, 22, 20, 8, and 36 training trials per stimulus, respectively. The experimenters conducted a one-tailed Pearson correlation test. Results yielded a moderate negative correlation between the number of sessions required for participants to meet the mastery criterion and the percentage of engagement in the correct movements (r = –.63). As the percentage of correct movements emitted increased, the number of intervention sessions to mastery decreased. This finding is consistent with Lozy et al. (2020), who found a strong negative correlation between engaging in movements during training and the number of sessions until participants met the mastery criterion.
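The correlation above can be reproduced from the values reported in this paragraph. The sketch below is an illustration rather than the analysis script used in the study: it computes only the Pearson coefficient (not the one-tailed significance test), and the function and variable names are hypothetical.

```python
import math
from typing import Sequence

# Values reported in the text, in the same participant order:
# Quade, Nico (evaluation 1), Calla, Cain, Nico (evaluation 2), Sloan.
percent_correct_movements = [1, 17, 35, 41, 44, 48]
trials_to_mastery = [40, 50, 22, 20, 8, 36]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


r = pearson_r(percent_correct_movements, trials_to_mastery)
print(f"r = {r:.3f}")
```

With these six data points the coefficient is approximately –.64, which agrees with the reported value of –.63 to within rounding.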

Correct Responding during Training and Letter–Sound Mastery

The left panel of Fig. 3 depicts the percentage of correct responses during training sessions for KM-TD, WS-TD, TD, and WS-TD presented in KM-TD conditions (Experiment 1). Across all evaluations, participants responded with 100% accuracy during training sessions without mastering the letter set during an average of 14% of KM-TD sessions (range, 0% to 31%), 31% of TD sessions (range, 15% to 48%), and 42% of WS-TD sessions (range, 8% to 63%). During almost half of WS-TD training sessions, responding was at 100% accuracy despite not meeting the mastery criterion during the mastery probes when the letter was presented without the picture.

Fig. 3

The percentage of training sessions at 100% accuracy. Note. WS-TD = Within-Stimulus Picture; KM-TD = Kinesthetic Movement; TD = Traditional Drill; CM-TD = Combined Mnemonics; (1) = Evaluation 1; (2) = Evaluation 2

Generalization to Untaught Literacy Skills

The top panel of Fig. 4 depicts the percentage of correct responses during pre- and post-tests for all participants per test. On average, participants who received interventions responded with fewer correct responses during the pre-tests and more correct responses during the post-tests than those who did not receive interventions. We conducted a two-way repeated measures analysis of variance (ANOVA) between intervention participants and matched-pair control participants at the level of the test, pre and post. That is, we analyzed the pre- and post-test data between intervention participants and matched-pair control participants, and the pre- and post-test data for intervention participants compared to matched-pair control participants. Results generated a significant interaction between participants at the pre- and post-test, F(1, 8) = 12.71, p < .05. The main effects of test level (pre- and post-test), F(1, 5) = 20.65, p < .05, and participants (intervention and control participants), F(1, 8) = 42.44, p < .05, were significant. In short, participants who experienced the interventions, on average, showed larger gains in untaught literacy skills than participants who did not receive the intervention (matched-pair control participants).

Fig. 4

Experiment 1: pre- and post-test data. Note. NSW = Nonsense Word; KM-TD = Kinesthetic Movement; WS-TD = Within-Stimulus Picture; PO = Probe-Only. Each data point represents data from one participant

The bottom panel of Fig. 4 depicts the percentage increase from pre- to post-tests, averaged across all tests, for the probe-only and intervention conditions for the participants in the intervention group. The average increase was 64% for the probe-only condition (range, 2% to 153%), 107% for the TD condition (range, 40% to 175%), 129% for the WS-TD with KM-TD condition (range, 104% to 118%), 129% for the KM-TD condition (range, 39% to 262%), and 156% for the WS-TD condition (range, 35% to 299%).
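The percentage-increase measure can be illustrated with a short sketch. The text does not state the formula, so the sketch assumes the conventional definition, (post − pre) / pre × 100; the function name, the handling of a zero pre-test score, and the example scores are hypothetical.

```python
from typing import Optional


def percent_increase(pre: float, post: float) -> Optional[float]:
    """Percentage change from pre-test to post-test (assumed conventional formula).

    Returns None when the pre-test score is 0, because the ratio is undefined.
    """
    if pre == 0:
        return None
    return (post - pre) / pre * 100


# Hypothetical scores: 20% correct at pre-test, 50% correct at post-test.
print(percent_increase(20, 50))
```

Under this definition, a low pre-test score inflates the measure: a change from 1% to 100% correct, for example, yields a 9,900% increase, so ranges on this measure can be very wide.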

Preference Assessment

Table 3 indicates the number of intervention sessions for each condition during the choice evaluation. On average, participants experienced 14 KM-TD training sessions (range, 4–10), 24 WS-TD training sessions (range, 11–35), 5 WS-TD presented with KM-TD training sessions (range, 3–6), and 17 TD training sessions (range, 13–21).

Table 3 Number of intervention sessions per condition and preferred intervention condition

The last column in Table 3 denotes the preferred condition per participant, and Fig. 5 indicates cumulative intervention selections per session for Nico, Quade, Sloan, and Calla. Of the four participants for whom KM-TD was more efficacious than WS-TD, preference was not identified for one participant (Calla). TD was preferred for one participant (Quade), WS-TD and TD were preferred for one participant (Cain), and WS-TD was preferred for one participant (Nico, second evaluation). Preference was likely not identified for Calla due to a lack of discrimination between the conditions. Anecdotally, when Calla was asked to identify the conditions correlated with the stimuli, she did not respond or responded incorrectly. Additionally, when Calla would select the no-intervention condition, she cried when the experimenter told her it was time to go back to class. Of the two participants for whom there was little differentiation between KM-TD and TD, WS-TD was preferred for one participant (Nico, first evaluation). We were unable to assess preference for one participant (Sloan). Anecdotally, Sloan would rotate between interventions each day, wanting to experience each condition.

Fig. 5

Experiment 1: preference assessment. Note. WS-TD = Within-Stimulus Picture; KM-TD = Kinesthetic Movement; TD = Traditional Drill; PO = Probe-Only

Maintenance of Letter Sounds

Participants’ number of correct responses during follow-up probes remained relatively consistent across time (Fig. 2). That is, participants responded correctly to an average of 3.71 stimuli at week 1 (range, 3–4), 3.42 stimuli at week 5 (range, 1–4), 3.5 stimuli at week 7 (range, 1–4), and 3.78 stimuli at week 9 (range, 2–4). In addition, participants responded with approximately equal numbers of correct responses for all stimulus sets during all weeks. Averaged across weeks 1, 5, 7, and 9, participants responded correctly to 3.34 KM-TD stimuli, 3.82 WS-TD stimuli, 3.33 WS-TD: KM-TD stimuli, and 3.71 TD stimuli.

General

The results of Experiment 1 demonstrate that, overall, pairing letters with movements was more efficacious than pairing letters with pictures and that both interventions and a traditional flash-card method were more efficacious than no intervention. However, the number of letters correct during the maintenance probes and the generalization to more complex reading skills did not significantly differ between intervention conditions. Nevertheless, the intervention conditions produced consistently better generalization effects than the probe-only condition for the intervention group. The participants who received intervention consistently showed greater gains in the generalization tests than the control participants. Efficacy of the intervention did not correlate with participant preference.

Experiment 2

The purpose of Experiment 2 was to compare the effects of paired combined mnemonics (kinesthetic movements and within-stimulus pictures) to a paired single mnemonic (kinesthetic movements) with a traditional drill intervention on letter–sound correspondence. In the most commonly used multisensory curricula, multiple mnemonics are paired with each letter. The combined effects of mnemonics are not necessarily the sum of their isolated effects. Therefore, systematic evaluations of single and combined mnemonics are needed. KM-TD was selected as the single mnemonic condition based on the results from Experiment 1. That is, in 4 of 6 evaluations, KM-TD was mastered with fewer training sessions than WS-TD. In the remaining 2 of 6 evaluations, KM-TD and WS-TD were mastered after approximately the same number of sessions. Additionally, we compared both mnemonics conditions to a probe-only condition in which letters were only presented during mastery probes.

Method

Participants, Consent, Assent, Setting, and Materials

Recruitment, consent, and assessment procedures were identical to Experiment 1. Teachers and caregivers referred three children for participation, and all children were eligible. Participant demographics are listed in Table 1. We conducted sessions in an empty classroom (Amara) or at a table in the participants’ homes (Ginny and Lyle). We conducted two to five sessions per day, 3 to 4 days per week in a single session block with 2-min breaks between sessions.

Session materials were identical to Experiment 1 and included stimulus cards (Appendix A), contingency correlated stimulus cards (Appendix B), and reinforcers (iPad® and edibles).

Dependent Variables and Interobserver Agreement

Dependent variables and interobserver agreement procedures were identical to Experiment 1. During the pre- and post-tests, across all tests, IOA was assessed during an average of 23% of sessions, and IOA was 100%. During the first evaluation, across all intervention conditions, IOA was assessed during 53% of sessions for Amara, 26% of sessions for Ginny, and 35% of sessions for Lyle. Mean IOA was 100% for all participants. During the second evaluation, across all conditions, IOA was assessed during 30% of sessions for Ginny and was 100%. The experimenters assessed IOA during an average of 40% of preference assessment sessions, and IOA was 100%.

Screening: Preassessment and Echoic Assessment

An experimenter conducted a pre-assessment to identify letters to be included in the intervention comparison and an echoic assessment to determine if participants could correctly imitate the letter sound for letters to which they responded incorrectly during the pre-assessment. Procedures were identical to Experiment 1. The average number of correct responses across pre-assessments is listed in the last column in Table 1.

Experimental Design

We used a multielement design in which intervention and probe-only conditions alternated in a 4:1 ratio. We used a pre-test–post-test design to evaluate the interventions’ effects on literacy skills that were not directly targeted during the intervention. In addition, we used a concurrent chains procedure to assess participant preference for intervention conditions.

Procedure

All participants followed the same procedure: Pre-tests, pre-intervention training, intervention comparison, post-tests, preference assessment, and maintenance. The general procedure is identical to Experiment 1, as outlined in Fig. 1.

Pre-Tests and Letter Sets

The experimenter selected 12 letters (3 vowels and 9 consonants) to which the participant responded incorrectly during the pre-assessment but correctly during the echoic assessment and divided the letters into three groups of four letters (one vowel and three consonants). Tests were identical to Experiment 1 and are denoted in Table 2.

Identical to Experiment 1, after Test 3, the experimenters re-assigned the letters to three four-letter sets based on the number of correct responses per stimulus during Tests 1, 2, and 3 and attempted to match sets on variables pertinent to letter sound acquisition (e.g., beginning letter of first and last name, similar features). The experimenter assigned letter sets to intervention and probe-only conditions based on the number of correct responses during the pre-tests. The intervention included a paired kinesthetic movement (KM-TD) and a combined mnemonics (CM-TD) condition.

Pre-Intervention Training

Pre-intervention training consisted of three phases (phoneme segmentation, picture, and kinesthetic movement training) and was identical to Experiment 1.

Intervention Comparison

Mastery probes were conducted once at the beginning of each research block with cards only containing the letter (i.e., no picture for the CM-TD set). Mastery probe procedures and mastery criteria were identical to Experiment 1. Intervention procedures and criteria to present an unmastered set in a previously mastered condition were identical to Experiment 1.

Probe-Only

A set of letters was only presented in the mastery probe as a control to evaluate if correct responses increased due to outside instruction. That is, one set of letters never entered intervention.

Kinesthetic Movement-TD

The contingency specifying statement and modeling procedures for KM-TD were identical to the procedures described in Experiment 1.

Combined Mnemonics-TD

Prior to CM-TD modeling, the experimenter said, “We are going to learn letters with pictures and movements,” and presented the contingency correlated stimulus card. During modeling, the experimenter presented the letter card, said the letter sound, named the animal picture, and said the letter sound while engaging in the movement. Then, the experimenter told the participant to imitate them. For example, if the experimenter presented the letter card “a,” they would say “/a/, alligator, /a/,” while extending their arms and opening and closing their arms as if they were an alligator.

Two participants, Ginny (second evaluation) and Lyle, did not meet the mastery criterion for the CM-TD condition after 10 intervention sessions following mastery in the KM-TD condition. Thus, the CM-TD stimuli were presented using the KM-TD procedure.

Post-Test

Five post-tests, identical to the pre-tests, were conducted immediately after set mastery.

Replication

A replication was only conducted with Ginny.

Preference Assessment

The concurrent chains procedure was similar to Experiment 1 with the following changes. The experimenters conducted the procedure once at the beginning of each day (as opposed to prior to each training session as in Experiment 1). After three selections of the same condition, the contingency correlated stimulus for that condition was removed from the initial link array to determine a hierarchy of preference. The preference assessment included five contingency correlated stimuli: KM-TD, CM-TD, WS-TD, TD, and probe-only.

Maintenance

Maintenance probes were conducted 1, 5, 7, and 9 weeks post set mastery.

Treatment Integrity

Treatment integrity was assessed identically to Experiment 1. During the first and second evaluation and preference assessment, the experimenters assessed treatment integrity during 100% of CM-TD, KM-TD, TD, and probe-only sessions. The experimenter delivered the correct consequence during 100% of sessions for all participants. During the first and second evaluations, the experimenters assessed IOA for treatment integrity during 34% of CM-TD sessions, 35% of KM-TD sessions, and 36% of probe-only sessions. Treatment integrity IOA was 100% across all sessions for all participants.

Experiment 2: Results and Discussion

Pairing Mnemonics with Letters and Letter–Sound Correspondence

Figure 6 depicts the number of correct responses during mastery and follow-up probes for the KM-TD, CM-TD, CM-TD presented in KM-TD, and probe-only conditions. Of the three completed evaluations, two participants mastered the stimulus set in the KM-TD condition prior to the CM-TD condition (Lyle and Ginny’s second evaluation), and one participant mastered the KM-TD and CM-TD sets at approximately the same time (Ginny’s first evaluation). Despite the incomplete evaluation for Amara, when she mastered the KM-TD set, there was no increase in the number of correct letter sounds demonstrated for the CM-TD set. Of the two participants who met the mastery criterion in the KM-TD condition first, both participants met the criterion for the CM-TD stimuli to be presented in the KM-TD condition. In addition, in 4 of 4 evaluations, participants did not meet the mastery criterion for the probe-only set.

Fig. 6

Experiment 2: the number of correct responses per mastery and follow-up probes. Note. KM-TD = Kinesthetic Movement; CM-TD = Combined Mnemonics; PO = Probe-Only; (1) = Evaluation 1; (2) = Evaluation 2

Correct Responding during Training and Letter–Sound Mastery

The right panel of Fig. 3 depicts the percentage of correct responses during training sessions for KM-TD and CM-TD conditions (Experiment 2). Across all evaluations, participants engaged in training sessions at 100% accuracy without mastering the letter set during an average of 16% of KM-TD sessions (range, 11% to 20%) and 48% of CM-TD sessions (range, 0% to 76%). That is, almost half of CM-TD training sessions were at 100% accuracy despite not meeting the mastery criterion. This effect is consistent with Experiment 1.

Generalization to Untaught Literacy Skills

The top panel of Fig. 7 depicts the percentage of correct responses during pre- and post-tests for the participants per test. On average, participants responded with more correct responses during the post-test than the pre-test for all tests, with the greatest percentage increase occurring in identifying the nonsense word beginning sounds. The smallest percentage increase occurred in the test requiring the participant to match the letter with the picture beginning with that letter. The bottom panel of Fig. 7 depicts the percentage increase from pre- to post-tests, averaged across all tests, for the probe-only and intervention conditions. The average increase was 397% for the probe-only condition (range, –18% to 813%), 2975% for the CM-TD in the KM-TD condition (range, 0% to 9900%), 3590% for the KM-TD condition (range, 33% to 9900%), and 5740% for the CM-TD condition (range, 300% to 9900%).

Fig. 7

Experiment 2: pre- and post-test data. Note. KM-TD = Kinesthetic Movement; CM-TD = Combined Mnemonics; PO = Probe-Only. Each data point represents data from one participant. (1) = Evaluation 1; (2) = Evaluation 2

Preference Assessment

Table 4 indicates the number of intervention sessions for each condition during the preference assessment. On average, participants experienced 11 KM-TD training sessions (range, 5–19), 20 CM-TD training sessions (range, 17–22), and 5 CM-TD presented in KM-TD training sessions (range, 4–6). Of the three participants, a preference assessment was not conducted with one participant (Amara), and preference varied for the remaining two. Figure 8 indicates cumulative intervention selections per day for Ginny and Lyle, and a dashed line denotes the removal of a condition. Ginny initially demonstrated a preference for KM-TD and, after KM-TD was removed from the selection array, demonstrated a preference for CM-TD. Lyle initially demonstrated a preference for TD and, after TD was removed from the selection array, demonstrated a preference for WS-TD.

Table 4 Number of intervention sessions per condition
Fig. 8

Experiment 2: preference assessment. Note. CM-TD = Combined Mnemonics; WS-TD = Within-Stimulus Picture; KM-TD = Kinesthetic Movement; TD = Traditional Drill; PO = Probe-Only

Maintenance of Letter Sounds

Participants’ number of correct responses during follow-up probes remained relatively consistent across time (Fig. 6). That is, participants responded correctly to an average of 2.67 stimuli at week 1 (range, 2–4), 2 stimuli at week 5 (range, 1–3), 2.2 stimuli at week 7 (range, 0–4), and 2.6 stimuli at week 9 (range, 1–4). In addition, participants responded with approximately equal numbers of correct responses for all stimulus sets during all weeks. Averaged across weeks 1, 5, 7, and 9, participants responded correctly to 2.6 KM-TD stimuli, 2.8 CM-TD stimuli, and 2.2 CM-TD: KM-TD stimuli.

General

The results of Experiment 2 demonstrated that pairing letters with movements was more efficacious than pairing letters with both pictures and movements and that both interventions were more efficacious than no intervention. Similar to Experiment 1, the number of letters correct during the maintenance probes and the generalization to more complex reading skills did not significantly differ between intervention conditions. However, participants responded consistently better on generalization tests with intervention condition stimuli than probe-only sets.

General Discussion

We evaluated the efficacy, generalization, and maintenance of, and preference for, mnemonics paired with a traditional drill flash-card intervention on literacy skills with eight preschoolers and one kindergarten student. Overall, across both experiments, pairing kinesthetic movements with a traditional drill flash-card method (KM-TD) for letter sound acquisition was more efficacious than presenting letters alone (TD), presenting letters with within-stimulus pictures (WS-TD), and presenting letters with combined movements and pictures (CM-TD). All interventions evaluated were more efficacious at teaching letter–sound correspondence than the probe-only condition. Despite the efficacy of presenting letters with movements on letter–sound correspondence, there was no significant difference between mnemonic interventions on generalization post-tests and maintenance data. However, during post-tests, participants who received the intervention consistently responded better on the intervention conditions than the probe-only conditions and consistently responded better than participants in the control group. Further, participants preferred intervention conditions that were presented with pictures. Thus, when selecting an intervention targeting literacy skill acquisition, clinicians should be attentive to the costs and benefits of each intervention.

The dense schedule of reinforcement during the WS-TD and CM-TD conditions may account for participants’ preference for conditions presented with pictures. The number of training sessions at 100% indicates the density of reinforcement participants experienced during each condition. Participants earned more reinforcers during WS-TD training with a delay in acquisition during mastery probes, compared to earning fewer reinforcers during KM-TD training with relatively faster acquisition during mastery probes. One way to increase the density of the reinforcement schedule during KM-TD training is to intermittently present already acquired targets with unknown targets, as in incremental rehearsal (Finn et al., 2023; Tucker, 1988) and strategic incremental rehearsal (Kupzyk et al., 2011). Because strategic incremental rehearsal is as effective as or more effective than traditional drill (Lozy & Donaldson, 2019), future research should examine the effects of strategic incremental rehearsal, in which the target skill is paired with movements, on learner preference for KM-TD compared to WS-TD.

These results are consistent with previous research demonstrating that pairing movements with letters is effective at increasing literacy skill acquisition (Lozy et al., 2020) that results in some degree of maintenance (Agramonte & Belfiore, 2002) and generalization (Campell et al., 2008). These results also extend research in this area. First, we evaluated the maintenance of skills with a long delay from training to follow-up; previous research conducted maintenance sessions after a maximum of 4 weeks (Agramonte & Belfiore, 2002). Second, only a few studies have evaluated the emergence of untaught literacy skills post mastery, and the skills evaluated were limited (Campell et al., 2008). This study demonstrated the emergence of a range of literacy skills after letter–sound correspondence training.

These results are inconsistent with previous research on visual mnemonics. Presenting pictures as extra-stimulus prompts has been shown to be less effective than presenting letters alone (Marsh & Desberg, 1978; Marsh et al., 1974), but embedding the picture within the letter (i.e., within-stimulus pictures) has been found to be more effective than a traditional drill method (Agramonte & Belfiore, 2002; Sener & Belfiore, 2005). However, this study found that the within-stimulus picture condition was less effective than the traditional drill condition, and that the within-stimulus picture condition hindered acquisition for some participants. There are two possible reasons for this inconsistency in findings. First, experimenters conducted mastery probes at the beginning of each day prior to training, and previous studies conducted mastery probes at the end of training sessions. Conducting probes at the end of sessions may have increased the probability of correct responding due to recent exposure to stimuli. Second, this study was conducted using single-subject methodology, whereas previous studies used group designs. Group designs have the potential to mask the effects of the intervention on individual participants and diminish the effects of outliers that may have failed to respond by averaging effects across participants.

The hindrance of letter sound acquisition by within-stimulus pictures can be explained within the framework of many different theoretical approaches. First, the lack of efficacy for within-stimulus pictures may be conceptualized via the Gestalt law of continuity. The law of continuity states that humans continue to see forms past their actual stopping or breaking point (Moore & Fitz, 1993; Pelli et al., 2009). Concerning letter identification, research has demonstrated correlations between letter font (easy- versus hard-to-read) and typographic variables (e.g., italicized) with speed of and perceived effort for letter identification and memory (Diemand-Yauman et al., 2011; Keage, 2014), and that the degree of perturbation (i.e., disconnect) is inversely proportional to letter identification efficiency (Pelli et al., 2009). Due to the limited number of participants and variability of letters assigned to the WS-TD condition across participants, we were unable to determine if a relationship exists between letter mastery and breaks in visual continuity per letter due to the within-stimulus picture. However, given previous research, it is possible that the degree of picture overlap and the number of letters with breaks due to the picture accounted for some participants’ difficulty in mastering the WS-TD condition. Thus, future researchers should examine if the number of breaks per letter is related to mastery of the letter. If the number of breaks per letter due to the picture is related to mastery, instructional materials using visual mnemonics should be modified so that the letter is fully visible in front of the picture.

A second reason the WS-TD condition was not effective for all participants may be blocking and/or overshadowing (Dittlinger & Lerman, 2011; Johnson & Cumming, 1968; Singh & Solman, 1990). Blocking and overshadowing refer to one stimulus (e.g., picture) controlling the learner’s response in the presence of a compound stimulus (e.g., picture and letter), with prior learning history being the critical component of blocking and saliency of the stimulus the critical component of overshadowing. If the participant had an extensive learning history with the picture, this may have blocked the letter from gaining stimulus control over their response. Although we collected data on whether participants responded correctly to the animal-only picture during the pre-training phase, some animals were likely more familiar (e.g., bear, alligator) to participants than others (e.g., weasel, ox). It is also possible that the picture was the more salient stimulus in the letter–picture compound stimulus and controlled participant responding (i.e., overshadowing). Blocking and/or overshadowing may explain why some participants responded at 100% accuracy for an average of 50% of training sessions but failed to respond correctly during the mastery probes.

A third reason for the discrepancy between WS-TD training and mastery probes may be a stimulus change decrement. Stimulus change decrement refers to the decrease in accurate responding contingent on differences between or novelty in stimuli (Bindra, 1959; Bouton & Todd, 2014). WS-TD training stimuli included an animal picture within the letter, whereas the mastery probe stimuli included the letter only. The change in stimuli may have caused a change in responding, and it is likely that, if mastery probes were conducted with within-stimulus picture cards, accurate responding would have increased. However, including the picture in mastery probes would have reduced the social validity of the intervention because effective reading requires responding to letters and words in the absence of pictures. Further, conducting the mastery probes without the pictures also permits an evaluation of generalization.

One mechanism that can account for the efficacy of KM-TD is that movement functioned as a mediating response. A mediating response is a response that occurs between the presentation of the contingency correlated stimulus and the response and may act as the controlling variable for the response (Michael et al., 2011). Emitting a unique movement and vocal response in the presence of a stimulus during discrete trial training has been conceptualized as a mediating behavior because it specifies the conditions under which the response will be reinforced (Lozy et al., 2020). The movement during training likely acted as a mediating response, increasing the probability that the participants would emit the correct response. KM-TD provided participants with an additional response in which they could engage in the presence of the letter card that could strengthen the letter sound response.

One explanation that can account for the efficacy of the KM-, WS-, and CM-TD conditions is a differential observing response (DOR). A DOR is a procedural modification commonly used during conditional discrimination, such as match-to-sample tasks, to increase correct responding. When a DOR is implemented, a unique response is paired with each stimulus to increase the saliency of critical stimulus features (Urcuioli & Callender, 1989; Walpole et al., 2007). DORs may be inadvertently implemented when mnemonics are paired with stimuli because the experimenter typically emits the picture name followed by the target stimulus (e.g., letter sound or word) and prompts participants to imitate both responses (DiLorenzo et al., 2011). During modeling and error correction for KM-, WS-, and CM-TD, the experimenter presented the letter, emitted the letter sound, animal name, and letter sound, engaged in the movement (when applicable), and prompted the participant to repeat the sequence. The experimenter repeated the modeling and error correction procedure until the participant emitted the letter sound at least once (i.e., they were not required to emit the animal name). Although data were not collected on whether the participant said the animal name, for those who did, it may have acted as a DOR that increased the letter’s saliency and therefore increased correct responding when the picture was removed for the WS- and CM-TD letter sets. Specific to the KM-TD condition, the movement may have also acted as a non-vocal DOR (Lozy et al., 2020). Future research should directly evaluate whether emitting the animal name during modeling and error correction affects mastery in each condition.

In addition, future research should evaluate the effect of including the animal name in modeling and error correction procedures on letter sound acquisition. In the current experiments, the experimenter presented KM-, WS-, and CM-TD cards in the same format. The experimenter presented the card and stated the letter sound, the animal name, and the letter sound once again; the procedures differed only in whether they included a within-stimulus picture, a movement, or both. However, we are unsure to what extent the emission of the animal name is necessary. Future research should compare paired movements and pictures with and without stating the associated animal name to evaluate its effect on letter–sound mastery.

Data were not collected on the duration of each intervention session. It is possible that KM-TD sessions required more time than WS-TD sessions because the experimenter engaged in a physical movement during training and error correction. This difference may have resulted in increased letter exposure for one condition (KM-TD), accounting for a difference in efficacy. Future researchers may consider equating the duration of each flashcard presentation.

Despite the consistency of intervention results across experiments, participants demonstrated better maintenance and a greater increase from pre- to post-tests in Experiment 1 compared to Experiment 2. This may be because participants in Experiment 1 were receiving classroom instruction on letter sounds, and for two of three participants, Experiment 2 was conducted during the summer when participants were less likely to be receiving instruction. Although we did not collect data on the specific duration or modality of classroom instruction on alphabetic principles, preschool children spend approximately 30% of their classroom day on language and literacy activities (Connor et al., 2006).

This study has several limitations that warrant discussion. First, not all participants in Experiment 1 experienced all the same conditions; two participants experienced three intervention conditions as opposed to two. It is possible the addition of another intervention condition altered their rate of acquisition for two reasons: (a) a smaller number of sets with more stimuli per set is more effective than a larger number of sets with fewer stimuli per set (Kodak et al., 2020), and (b) increasing the number of letter sets decreased the degree of control of letter selection. The experimenters balanced letters across sets with respect to important variables, such as vowels, consonants, and letters that shared similar features. Increasing the number of letters selected from the available letters, based on pre- and echoic assessments, limited the degree of equity across sets. Thus, the participants with more intervention conditions may have had more difficult letter sets with respect to the aforementioned variables.

Second, letter sets differed from pre- to post-tests for Tests 1 through 3. To equate the number of correct pre-test responses across sets per letter, experimenters re-organized the letter sets after pre-test 3. However, equating sets based on pre-test responses resulted in a different combination of letters in post-tests 1 through 3 and may have affected participant responding. A difference in responding is less likely for Test 1 (stimulus generalization) because letters were presented one at a time but more likely for Tests 2 (letter sound identification) and 3 (match letter sounds to pictures) because letters were presented in an array. Despite the different combinations of letters, the pre- and post-test data still provide an appropriate comparison because (a) the same letters were evaluated from pre- to post-test, (b) the chance value of selecting a correct letter remained at 25%, and (c) the same considerations for equating variables pertinent to learning letter sounds were taken (e.g., not combining letters with similar features such as b and d). Although future researchers should equate letters across sets based on pre-test responses, it may be beneficial to conduct post-tests with the original pre-test sets (i.e., the sets prior to counterbalancing).

Third, within-participant replications were only possible for two participants, and the replication yielded different results for both participants. Specifically, Nico’s and Ginny’s first evaluations demonstrated both intervention conditions as equally efficacious, and their second evaluations demonstrated KM-TD to be more efficacious. In addition, Nico and Ginny experienced fewer training sessions in the second evaluation compared to the first. One possible explanation for these differences is the learning-to-learn phenomenon (Green, 2001; Shepley & Grisham-Brown, 2019). Both evaluations were conducted during the summer, when participants experienced few structured academic periods. It is possible that the first evaluation targeted stimulus control during structured periods of opportunities to respond and contingencies for reinforcement and may have masked the differences between condition efficacies. Future researchers may also consider pairing mnemonics with arbitrary stimuli to evaluate the basic effect of mnemonic devices on acquisition. Although the use of arbitrary stimuli would decrease the influence of outside instruction, it would also limit the applied value of participating in the research.

Finally, future researchers should collect qualitative social validity data. Although teachers made anecdotal comments regarding their students’ improvement in literacy skills, qualitative data concerning the acceptability of the goals, procedures, and outcomes would be beneficial (Ferguson et al., 2018; Wolf, 1978).

Despite the data supporting KM-TD as the most efficacious condition, participants rarely selected this option during the preference assessment. Preference assessment data and patterns of assent highlight the importance of balancing efficacy and preference when selecting instructional methods. Although embedding pictures during letter–sound correspondence instruction may hinder acquisition, it may afford students more opportunities to respond if they are more likely to interact with the instruction. Future research should examine the effects of implementing preferred instructional modalities on student learning outcomes as well as determine methods to make efficacious instruction more preferred.

We evaluated visual mnemonics and kinesthetic movements derived from the multisensory program, ZooPhonics©, based on the availability of stimuli and limited research support. That is, the embedded pictures and physical movements were specific to the program and do not represent all possible stimuli that are or can be paired with letters for mnemonic-based learning. Although we demonstrated the efficacy of KM-TD compared to WS-TD, the effect may be moderated by the stimuli specific to ZooPhonics©. Future researchers may consider replicating Experiment 1 with stimuli derived from other programs, such as See the Sound-Visual Phonics (Gardener III et al., 2013), Itchy’s Alphabet (DiLorenzo et al., 2011), or JollyPhonics (Loyd, 1992).

Although there is some evidence of the efficacy of combining multiple mnemonics for literacy skills (DiLorenzo et al., 2011), some mnemonics may hinder acquisition, such as within-stimulus pictures, as demonstrated in these experiments. Lozy et al. (2020) was the first study to compare a traditional drill and paired kinesthetic movement flash-card method in a single-subject design, and results demonstrated kinesthetic movements as being an efficacious procedure to increase letter–sound correspondence. The present studies further evaluated kinesthetic movements compared to and paired with within-stimulus pictures and demonstrated that pairing kinesthetic movements was generally more efficacious than both alternative methods and no intervention. The results also demonstrate the lack of correspondence between the most efficacious and preferred condition, suggesting we should consider alternative approaches to incorporating pictures into reading instruction in a manner that makes the instructional arrangement more fun for children but does not hinder skill acquisition.