Introduction

Individuals with autism spectrum disorder (ASD) face challenges related to social interaction and communication, and engagement in repetitive behaviors (American Psychiatric Association 2013). Over the last several decades, researchers have attempted to establish evidence-based intervention strategies for improving outcomes for this unique group of learners across a wide range of skill repertoires (e.g., communication, academic, daily living, employment). Many of these practices involve direct instruction of skills and are designed to be implemented in highly structured teaching arrangements with 1:1 teacher to student ratios (Collins et al. 1991). In these arrangements, an instructor uses prompting strategies to facilitate students’ performance of a targeted response and then provides feedback contingent on accurate responding (Duker et al. 2004). Gradually, instructional supports are faded until students respond independently under naturally occurring conditions.

Despite their well-documented effectiveness, 1:1 instructional arrangements pose several challenges for teachers and students. First, this arrangement requires that much of an instructor’s time be directed at a single student (Bitterman et al. 2008) and, thus, necessitates the availability of extensive resources for hiring, training, and maintaining qualified staff to implement programming. This may be problematic in school settings as data suggest there are persistent teacher shortages in the area of special education (U.S. Department of Education 2014). Second, these arrangements may limit students’ opportunities to interact with peers with and without disabilities and may preclude development of social skills critical for success in less restrictive environments (Kamps et al. 1990).

An alternative to 1:1 arrangements is the application of small-group instruction (SGI) in which a teacher instructs a small group of students at the same time (Ozen et al. 2017; Xin and Sutman 2011). Small-group instruction offers several advantages over 1:1 instructional arrangements in that it enables a more efficient use of teachers’ time and requires fewer staff resources (Collins et al. 1991). Further, SGI more closely approximates less restrictive educational settings in which individuals are often near others and are required to engage in skills that may facilitate positive social interactions (e.g., turn taking, attending to the responses of others) (Fink and Sandall 1978). Researchers have demonstrated the efficacy of SGI arrangements to teach a number of skills to people with ASD including sight words (Ledford et al. 2008; Schoen and Ogden 1995), social skills (Kroeger et al. 2007), and job functions (Leaf et al. 2013). Additionally, small-group instruction provides the opportunity for students to acquire new skills by observing the performance of their peers (Carr and Darcy 1990; Ganz et al. 2008). Researchers have demonstrated that students with ASD can benefit from observational learning, that is, acquiring the instructional targets of their peers by watching them during instruction (Taylor and DeQuinzio 2012). For example, Ledford and colleagues (2008) instructed sight words to six children with ASD in groups of two and demonstrated that five of the participants vicariously learned non-target words taught to their peers.

Despite its potential benefits, small-group instruction often requires more time to prepare and can be more difficult to implement than 1:1 arrangements. Kamps et al. (1992) compared performance of students with ASD and developmental disorders in 1:1 and SGI arrangements. Although the results of the study indicated successful performance of students in both formats and that teachers perceived that group instruction was good for students, the SGI format was rated less favorably by the teachers because it required more preparation time. One possible approach to address this challenge is to capitalize on recent advances in technology such as virtual reality and social robotics.

Virtual Reality

A recent technological advancement in computer-assisted instruction (CAI) is the use of virtual reality and pedagogical agents. Virtual reality (VR) uses computer-generated three-dimensional graphics to simulate a realistic environment. Pedagogical agents (PAs) are interactive characters with instructional roles incorporated in virtual learning environments. These characters interact with students through dialogue, gesture, and emotional expressions. By simulating predictable social environments, VR allows students with ASD to learn skills while interacting with virtual characters that present fewer social requirements and potentially less anxiety (Kandalaft et al. 2013; Wallace et al. 2010). VR enables repeated practice of skills without the fear of making mistakes in a safe, highly controlled environment where instruction can be customized to students’ needs. This potentially reinforcing environment may increase students’ motivation during instructional activities, which also may facilitate engagement and ultimately independence, requiring less supervision by an instructor. Extant literature supports the effectiveness of virtual environments and characters during the instruction of various skills and behaviors such as vocational competencies of young adults with ASD (Smith et al. 2014, 2015; Strickland et al. 2013), management of phobia in young people with ASD (Maskey et al. 2014), and social skills of children with ASD (Cheng et al. 2015; Kandalaft et al. 2013; Ke and Im 2013).

To date, most virtual environments for people with ASD have been implemented as desktop virtual environments (DVEs) displayed on computer screens and with traditional input devices such as keyboard or mouse as a means of interaction (Miller and Bugnariu 2016). As such, DVEs are typically less capable of immersing users into the synthetic world. The user’s level of immersion is determined by the level of sensory stimulation and impression of presence inside the virtual environment (Biocca et al. 2003). Immersive virtual environments (IVEs), on the other hand, attempt to surround users in the synthetic world through the use of head-mounted displays (Parsons and Carlew 2016) or full domes (Bohil et al. 2011). Both DVEs and IVEs have been shown to be effective means of ASD intervention, although their relative efficacy and effects on user experience are less known. It has been suggested that higher immersion levels may produce greater engagement and motivation, thereby enhancing learning outcomes (Miller and Bugnariu 2016). Other studies emphasize the potential of IVEs to produce high ecological validity, which in turn is conducive to skill generalization to the real world (Parsons and Carlew 2016; Parsons and Cobb 2011). On the other hand, there are concerns over the usage of IVEs for individuals with ASD with sensory sensitivities, VR-induced symptoms (i.e., cybersickness) in IVEs, and irritation caused by head-mounted displays (Wallace et al. 2010).

Social Robotics

Another technological approach with potential for treatment of ASD is social robotics. Similar to VR, robots can furnish customized, controllable learning experiences with high degrees of repeatability and engagement (Saadatzi et al. 2012; Scassellati et al. 2012). Unlike virtual characters, robots’ physicality in the real world amplifies their application potential in intervention protocols (Boucenna et al. 2016). Due to their embodiment and communicative shape, they allow for touch and physical interactions. In addition, their physical presence in the real world may help individuals establish a rapport (Powers et al. 2007) and may serve to facilitate joint attention between children with ASD and adults (David et al. 2018). Joint attention (JA) skills are considered an essential building block to social communication (Charman 2003; Mundy 1995) and are often impaired in individuals with ASD (Mundy and Crowson 1997; Werry et al. 2001). Research in robotic intervention for ASD demonstrates that robots can elicit episodes of JA in individuals with ASD (David et al. 2018; Warren et al. 2015). Since robots are appealing and not common in daily experiences of individuals with ASD, they tend to draw these individuals’ attention and elicit a playful response (Henkel and Bethel 2017; Kanda and Ishiguro 2012). This capacity can be harnessed to direct these individuals’ attention to relevant cues of the learning environment and improve their engagement with the instructional material. This novelty may serve as a reinforcer and, hence, boost these students’ investment in the treatment.

Collectively, the available research literature suggests that both virtual reality and robots have the capacity to play important roles as therapeutic tools for individuals with ASD. In the present study, the authors sought to extend the literature by incorporating both technologies into an instructional package to teach sight words.

Sight Word Instruction

Reading skills are critical to individuals’ independence across a range of contexts (Beirne-Smith et al. 1994; Gupta and MacWhinney 1997). Unfortunately, many individuals with ASD have difficulties in acquiring the skills to become competent readers. For example, researchers have demonstrated that some learners with ASD may not achieve phonemic awareness and subsequently fail to learn decoding strategies (Gabig 2010). In the absence of proficient decoding skills, learners must rely on the recognition of whole words to read print within their environments. This approach, referred to as sight word reading, can help individuals increase their fluency and confidence when learning phonetic approaches to reading (Browder and Lalli 1991), but also can assist individuals in navigating their environments through the rapid identification of words relevant to everyday independent functioning (e.g., bathroom, grocery words, job-related words). Substantial evidence supports the effectiveness of sight word instruction in improving reading ability of individuals with ASD (Saadatzi 2016; Saadatzi et al. 2017).

Research Questions

In the current investigation, the authors developed an intelligent tutoring system that resembled a small-group arrangement. This system included a classroom environment with a pedagogical agent playing the role of a teacher, and a humanoid robot with the role of a peer. With the introduction of the robot peer, the traditional dyadic interaction in tutoring systems was augmented to a novel triadic interaction in order to enhance the social richness of the learning environment and to facilitate observational learning. The developed tutoring system reflects the first application of triadic interaction amongst a PA, robot, and students with ASD to teach reading skills. In this study, the following research questions were addressed: (a) Is there a functional relation between PA-delivered instruction within a group arrangement that includes a robot peer and the number of target words read by students with ASD, and (b) is there a relation between PA-delivered instruction within a group arrangement that includes a robot peer and the number of words acquired through observational learning?

Methods

Participants

The participants were recruited through advertisements on the websites of a university autism center and a local parent support group, as well as flyers posted around campus and the local community. Subsequently, interested parents called or emailed the research office to volunteer their children. At the beginning of the study, informed consent and assent forms were obtained from the parents and children, respectively. Approval to conduct the current study was obtained from the university’s institutional review board.

Three children, ages 6–8 years, with medical diagnoses of ASD participated in the study. All three participants received special education services under the eligibility category of autism and were recruited based on their individualized education plans (IEPs) containing objectives related to increasing sight word vocabulary. All three participants met the following inclusion criteria: (a) could vocally describe a wide range of stimuli within the environment, (b) were identified as primarily sight word readers, and (c) were reported to have no known visual or hearing impairments.

Student J was an 8-year-old male with ASD. His most recent evaluation indicated that his intellectual functioning fell in the low average range (i.e., 82; Wechsler Preschool and Primary Scale of Intelligence [WPPSI; Wechsler et al. 2012]). On the Developmental Profile 3 (DP3; Alpern 2007), he scored 103, 61, and below 50 for adaptive behavior, social-emotional, and cognitive tests, respectively. He scored a 38 on the Childhood Autism Rating Scale (CARS; Schopler et al. 2010), indicating performance in the severe autism range. His score on the Brigance Transition Skills Inventory (Brigance 2010) at the beginning of the study indicated that his reading skills (e.g., word recognition, comprehension) were commensurate with students in the first grade. Finally, his official psychological evaluation at the university hospital reported that he did not initiate or maintain interaction with peers and adults, and rarely made eye contact.

Student I was an 8-year-old male with ASD. In his most recent evaluation, he obtained a standard score of 75 (Borderline range) for verbal IQ and 49 (Extremely Low range) for performance IQ (WPPSI; Wechsler et al. 2012). He scored a 42 on the CARS, indicating performance in the severe autism range. He was also diagnosed with Anxiety Disorder Not Otherwise Specified and a moderate-to-severe receptive-expressive language delay. His score on the Brigance Transition Skills Inventory (Brigance 2010) at the beginning of the study indicated that his reading skills were below the first-grade level. Finally, his official psychological and occupational therapy evaluation at the university hospital indicated mild delays in the areas of communication/language and adaptive skills. According to his teacher’s report, he struggled with maintaining engagement during classroom instruction, and occasionally demonstrated inappropriate behaviors and aggression toward peers and teachers.

Student V was a 6-year-old male with ASD. In his most recent evaluation, he was administered the Autism Diagnostic Observation Schedule-2 (ADOS-2; Lord et al. 2012) and WPPSI (Wechsler et al. 2012). He was diagnosed with high-functioning autism and had a full-scale IQ of 107. His score on the Brigance Transition Skills Inventory (Brigance 2010) at the beginning of the study indicated that his reading skills were equivalent to those of second-grade students. According to an official psychological evaluation, behavior concerns in school included high activity level, inattention and off-task behavior, as well as noncompliant behavior. He also struggled to interact appropriately with his peers, and usually played alone. His mother reported that he had outbursts during which he aggressed toward family members with little provocation.

All participants had previously received sight word instruction and were familiar with the constant time delay procedure. They did not, however, have previous exposure to pedagogical agents or humanoid robots.

Settings and Materials

All sessions were conducted in a small room (i.e., 4 m × 3 m) within our laboratory at the University of Louisville. The room contained a small desk, a 26-inch computer screen, and two chairs. The computer screen was placed on the desk and adjusted to an appropriate height for the participant’s eye level. The robot peer (RP) was seated next to the participant facing the computer screen. Figure 1 shows the experiment setting. The experimental sessions occurred one to two times a week over a 4-month period.

Fig. 1 The experiment setting. A special stand was designed for NAO so that it could sit securely and perform gestures. Participants sat on the black chair

A webcam was mounted on top of the screen to record the interactions throughout the experiment for the purposes of offline analysis, inter-observer reliability, and procedural integrity. At the beginning of each session, the primary experimenter launched the educational software as well as an application for recording the video/audio stream from the webcam. The content of the screen (i.e., the PA and chalkboard) was simultaneously recorded in a picture-in-picture format, which enabled recording the PA’s activities along with the RP’s and participant’s reactions and answers. On these recordings, the reading material presented to both group attendants (i.e., the participant and the RP) and their responses could be seen at the same time.

Virtual Environment and Pedagogical Agent (PA)

A desktop virtual environment (DVE) was developed using a commercial VR design package, Vizard from Worldviz, and included a PA, a chalkboard where the stimuli were shown in written form, a text-to-speech engine, and an automatic speech recognition engine (Saadatzi 2016; Saadatzi et al. 2017). To avoid any potential sensory overload, the virtual environment was designed to be austere (i.e., few visual distractors, low noise levels). The PA, which played the role of a teacher and instructed sight words, was designed as a full-bodied character. Such full embodiment (as compared to face only or waist up) enabled the PA to model pointing cues via its arms, gaze, and body orientation. This pointing action was intended to draw students’ attention and direct their gaze toward the stimuli (Fig. 2b). Alcorn et al. (2011) studied how gaze following and joint attention (JA) can best be elicited from individuals with ASD by an embodied virtual agent in a flower-picking game. Their findings showed that the combination of gaze and pointing bids elicited significantly higher JA compared to the single cues. In addition, close-ups of a virtual face have been reported to induce uncomfortable feelings, attempts to increase distance (Argyle and Dean 1965), and anxiety in individuals with ASD (Welch et al. 2010). The full-body representation of the PA was therefore also intended to avoid inducing anxiety and to support comfortable conversation.

Fig. 2 (a) PA gazing at the student, (b) PA pointing to the presented word, (c) PA clapping for the student

The authors programmed the PA to implement a research-based teaching procedure, namely constant time delay (CTD), that requires the implementation of a prescriptive teaching sequence (Neitzel and Wolery 2009). Initially, the PA presented a sight word and a request for the student to read it, but then immediately provided a vocal model (i.e., controlling prompt) of the correct response. After several trials, the PA faded the controlling prompt by inserting a response interval to allow the learner to emit an unprompted response. Across both types of trials, the PA delivered reinforcement for correct responses and corrective feedback for incorrect ones. CTD has been shown to be an effective strategy for teaching a range of skills to students with ASD including sight words (Ledford et al. 2008). Researchers previously have used CTD as a part of CAI to teach multiplication facts (Wilson et al. 1996) and sight words (Mechling et al. 2007; Saadatzi et al. 2017).

A real-time behavior/character algorithm was programmed to produce verbal and expressive attributes of the PA. For correct responses, the PA nodded its head, smiled, and provided verbal praise. In response to errors, it shook its head and provided corrective feedback (i.e., the correct answer). When there was no response, the PA provided a vocal model of the correct word without any expressive movements. Although the tutoring system was capable of automatically recognizing words read by the participants, to ensure high procedural integrity, an experimenter monitored the sessions and advanced the instruction by listening and manually determining whether the participants read the presented words correctly. This capability was added to the autonomous PA developed in our previous study (Saadatzi et al. 2017) due to difficulties related to the accuracy of automatic speech recognition when used with persons with disabilities. All other activities of the tutoring system, including the PA’s actions (i.e., body gestures, facial expressions, delivering feedback and reinforcement), were kept autonomous since the tutoring system delivers perfect procedural integrity in those domains.

Robot Peer (RP)

In this study, NAO, a humanoid robot from Softbank Robotics, was used to emulate a peer within the learning environment. NAO is a programmable humanoid robot that stands 58 cm in height and has 25 degrees of freedom which can be used to program various body gestures. NAO cannot control its eye gaze relative to its head and thus used a head turn to approximate gaze. During instruction, the robot read the stimuli presented by the PA via its built-in text-to-speech engine.

Instructional Stimuli

During instruction, target words were displayed in a 90 pt. Arial font on the virtual chalkboard and with sharp contrast to ensure visibility. The words were presented in lower case letters to simulate their most common occurrence in books and reading materials. A group of unknown words was selected for the child from a screening group of words. From this group, two lists of words, LC and LR, each containing four words (e.g., LC = {walk, from, them, stop}, LR = {walk, from, cold, best}), were chosen for the child and the RP, respectively. These lists were defined as LC = {Γ1, Γ2, Δ1, Δ2}, and LR = {Γ1, Γ2, Π1, Π2}, where Γ1 and Γ2 were common between the child and the RP. Δ1 and Δ2 were unique to the child, whereas Π1 and Π2 were unique to the RP. These two lists consisted of six different words, all unknown to the child. The PA instructed the words in LC to the child, and the words in LR to the robot. During the instruction of these words, and because the student and the RP participated in the tutoring sessions at the same time, three different modes of sight word learning could occur.

Explicit Learning (EL)

Δ words (i.e., Δ1 and Δ2) were unique to the child and served as explicit words, in that, they were not instructed to the RP. Therefore, the student could learn these words only through direct dyadic interaction with the PA.

Vicarious Learning (VL)

Π words (i.e., Π1 and Π2) were exclusive to the RP. The RP simulated learning of these words over time through dyadic interaction with the PA, and the child saw the words instructed to the RP. The child could also hear the RP read those words and observed the consequences (reinforcement or corrective feedback) delivered to the RP. Therefore, Π words could be vicariously learned by the student while watching these words being instructed to the RP but not through direct interaction with the PA.

Explicit-Plus-Vicarious Learning (EVL)

Γ words (i.e., Γ1 and Γ2), were common between the child and the RP. Therefore, the child could acquire these words both explicitly, through dyadic interaction with the PA, and vicariously, through watching the PA instructing them to the RP. These words could be acquired by the child through triadic interaction with the PA and the RP.
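The partition of the two lists into the three learning modes can be illustrated with a short sketch. The word lists are the example given above; the function name and the set-based classification are illustrative assumptions and are not taken from the authors' software.

```python
# Example lists from the text: LC is taught to the child, LR to the robot peer.
child_list = ["walk", "from", "them", "stop"]   # LC
robot_list = ["walk", "from", "cold", "best"]   # LR


def classify_words(lc, lr):
    """Group words by the learning mode available to the child.

    Hypothetical helper: the mode labels follow the text (EL, VL, EVL).
    """
    lc_set, lr_set = set(lc), set(lr)
    return {
        "EVL": sorted(lc_set & lr_set),  # Gamma words: taught to both learners
        "EL": sorted(lc_set - lr_set),   # Delta words: taught to the child only
        "VL": sorted(lr_set - lc_set),   # Pi words: taught to the robot peer only
    }


modes = classify_words(child_list, robot_list)
print(modes)
# {'EVL': ['from', 'walk'], 'EL': ['stop', 'them'], 'VL': ['best', 'cold']}
```

Each mode contains two words, and the union of the two lists yields the six distinct words that make up one word set.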

Screening

Words were selected from the Dolch and Brigance functional word lists (Brigance 1978; Dolch 1948). Prior to the experiment, each participant underwent a series of tests to identify words that would be used during intervention. The first stage of screening was performed by the primary experimenter in a one-to-one format in the experiment room. He presented the words, printed on flash cards, and asked “What word?”, and noted the words that the participant either read incorrectly or did not read at all. A total of 100 unknown words were identified for each participant. During screening, the participant did not receive feedback for his responses.

In the second stage of screening, the words identified in the previous stage as unknown to the participant were again presented to him. The only difference was that the words were displayed on the computer screen instead of flash cards. Likewise, no feedback was provided. This stage was repeated three times on three different days to ensure the words were unknown to the participant. The word order was randomized for each session. From the words that were never read correctly by the participant, a list of 30 words with equal difficulty (based on the number of letters and syllables) was selected. From this list, 18 words were randomly selected for the participant to be included in the study. Since the experiment was conducted during the summer, and the participants did not receive any educational services during this study, there was no concern about the participants receiving instruction of their chosen words. Additionally, the parents agreed not to instruct the words to their children for the duration of this experiment.
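The selection logic of the second screening stage can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and data layout are hypothetical, the difficulty-matching step that produced the 30-word list is not modeled, and the split into three sets of six words reflects the word sets used in the experimental design.

```python
import random


def never_read_correctly(results):
    """Keep words with no correct reading in any screening session.

    results: dict mapping each word to a list of booleans, one per session
    (hypothetical data layout).
    """
    return [word for word, outcomes in results.items() if not any(outcomes)]


def pick_study_words(candidates, n_words=18, n_sets=3, rng=None):
    """Randomly draw n_words from the candidates and split them into equal sets."""
    rng = rng or random.Random()
    chosen = rng.sample(candidates, n_words)
    size = n_words // n_sets
    return [chosen[i * size:(i + 1) * size] for i in range(n_sets)]


# Placeholder words stand in for a participant's 30 difficulty-matched words.
screening = {f"word{i:02d}": [False, False, False] for i in range(30)}
screening["known"] = [False, True, False]  # read correctly once, so excluded

candidates = never_read_correctly(screening)
word_sets = pick_study_words(candidates, rng=random.Random(1))
```

Seeding the generator, as in the usage lines, is only a convenience for reproducing a draw; the source does not state how randomization was implemented.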

Experimental Design

To evaluate the efficacy of the instructional package, a multiple-probe across-word-sets design, replicated across three participants (Tawney and Gast 1984), was employed. For each participant, baseline data were collected on reading performance across all three word sets (18 words). After three data points across 3 days, instruction began on one of the word sets (six words). As soon as the participant met criterion (i.e., 100% accuracy across three sessions in a row) on the first set, another set of data was collected on all three sets for three sessions. Upon completion of this stage, the instruction was delivered on the second word set. This procedure was carried on until the participant reached criterion on the third word set. After the participant reached criterion on the third word set, data across all three word sets were again collected. Additionally, a pre/posttest design was employed to assess each participant’s ability to maintain the acquired words over time and to generalize them to a non-laboratory setting.

Dependent Measures

Data were collected from recorded sessions on two primary dependent variables: (a) percent of words read correctly (i.e., percent reading accuracy) and (b) number of errors made by the participant. A word was considered as correct if the student stated the word presented by the PA within 5 s. Incorrect responses and no responses were counted as errors. The number-of-errors measure was defined as the total of errors from the first intervention session to the participant’s third consecutive day at 100%. Data on the generalization and maintenance of words were also collected.
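The two dependent measures and the mastery criterion can be expressed as a short sketch. The function names and the response coding ("correct", "incorrect", "none") are assumptions made for illustration, and the session data in the usage lines are invented solely to show the arithmetic.

```python
def percent_accuracy(responses):
    """Percent of words read correctly in one session."""
    return 100.0 * sum(r == "correct" for r in responses) / len(responses)


def criterion_session(accuracies, run_length=3):
    """Return the 1-based session at which criterion is met, or None.

    Criterion: 100% accuracy across run_length consecutive sessions.
    """
    streak = 0
    for index, accuracy in enumerate(accuracies, start=1):
        streak = streak + 1 if accuracy == 100.0 else 0
        if streak == run_length:
            return index
    return None


def errors_to_criterion(sessions):
    """Count incorrect and no-response trials up to the criterion session."""
    accuracies = [percent_accuracy(s) for s in sessions]
    last = criterion_session(accuracies) or len(sessions)
    return sum(r != "correct" for s in sessions[:last] for r in s)


# Invented example: six sessions of four responses each.
sessions = [
    ["correct", "incorrect", "none", "correct"],
    ["correct"] * 4,
    ["correct", "correct", "correct", "incorrect"],
    ["correct"] * 4,
    ["correct"] * 4,
    ["correct"] * 4,
]
accuracies = [percent_accuracy(s) for s in sessions]
```

In this invented example the accuracies are 50, 100, 75, 100, 100, and 100, so criterion is met at the sixth session with three errors accumulated along the way.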

Familiarization Session

Prior to the beginning of baseline sessions, to reduce the potential novelty effects and enhance task comprehension, the PA and RP were presented to each participant. The primary researcher described the procedures and the roles of PA and RP, and then took each participant to the experiment room. He then introduced the PA and RP and provided a 10-min demonstration of the robot’s capabilities including speech and regular body movements. The participant was then allowed to touch the robot and familiarize himself with it.

Baseline Sessions

Prior to the beginning of the first baseline session, the experimenter described the task to the participant to facilitate task comprehension. He then launched the software and entered the participant’s name. The PA started by greeting the participant by name (Fig. 2a) and stating “We are reading some words today. I need your attention.” After 3 s, it presented one of the words at random, and asked “What word?” The word was displayed for 5 s on the PA’s chalkboard and then disappeared. No feedback was delivered to the student. Before the PA advanced to the next word, there was a 1-s inter-trial interval. This procedure continued until all 18 words were presented. Across the baseline sessions, the word order was randomized. During the baseline sessions, the RP was seated beside the participant and programmed to be idle. For more familiarization, after each baseline session, the participant was again allowed to interact with and touch the robot.

Instructional Sessions

During instruction, the PA implemented a CTD procedure to instruct sight words to the participant and the RP. In each session, the PA presented three trials per word, that is, one 0-s delay trial followed by two 5-s delay trials. The PA began each session by greeting the student and delivering an attentional cue to the RP (i.e., “NAO, are you ready?”). The RP looked at the PA and responded “Yes.” Then, the PA repeated the question to the participant. When the participant responded, the PA emitted a group attentional cue by saying “Alright. Everybody look at me.” After a 1-s pause, the PA started the first round of trials (i.e., 0-s delay) for each word and group attendant. During each trial, the PA pointed to a randomly selected word from the LR word list and delivered a task directive (i.e., “NAO, what word?”), immediately followed by an auditory model, and waited 5 s for the RP to respond. The RP was programmed to randomly choose among three options: stating the word correctly, stating the word incorrectly, or saying nothing. Rather than always correctly reading the words presented, the RP was programmed to demonstrate learning of the words over time to simulate a natural group setting. Therefore, the likelihood of reading the words correctly was programmatically increased over time for both common target words (Γ words) and RP-specific words (Π words). The PA reacted to correct responses with verbal praise statements (e.g., good work, excellent, nice job) and to errors or no response with corrective feedback (i.e., modeling the word again). When receiving feedback from the PA, the RP was programmed to demonstrate random happy gestures. This differential reaction was programmed into the RP’s behavior repertoire to potentially increase student engagement and to facilitate students’ discrimination of consequences for the RP.
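One way to simulate a peer that appears to learn over time is to raise the probability of a correct response from session to session. The sketch below is an assumption-laden illustration: the source does not report the actual probability schedule, so the linear ramp, its parameters (`start`, `step`, `ceiling`), and the even split of the remaining probability between errors and no response are all hypothetical.

```python
import random


def correct_probability(session_index, start=0.2, step=0.15, ceiling=0.95):
    """Hypothetical schedule: probability of a correct reading rises linearly."""
    return min(ceiling, start + step * session_index)


def robot_response(session_index, rng):
    """Draw one simulated response: 'correct', 'incorrect', or 'none'."""
    p_correct = correct_probability(session_index)
    roll = rng.random()
    if roll < p_correct:
        return "correct"
    # Assumption: the remaining probability is split evenly between the
    # two non-correct options.
    if roll < p_correct + (1.0 - p_correct) / 2.0:
        return "incorrect"
    return "none"


rng = random.Random(42)
first_session = [robot_response(0, rng) for _ in range(12)]
later_session = [robot_response(8, rng) for _ in range(12)]
```

Keeping the ceiling below 1.0 in this sketch means the simulated peer can still make an occasional error late in training; whether the authors' robot behaved this way is not stated.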

After the first trial for the RP and a 1-s inter-trial interval, a word randomly chosen from the list LC appeared on the chalkboard. The PA called the participant by name and asked “what word?”, delivered an auditory model, and waited 5 s for the participant to respond. Subsequently, the PA reacted to correct responses with verbal praise and to incorrect or no responses with corrective feedback.

The PA then continued this procedure, alternating between the RP and the participant, until all the words of the lists LR and LC were presented to the RP and the participant, respectively. After a 1-s pause, at the beginning of the second round of trials (i.e., the first 5-s delay trials), the PA emitted another group attentional cue by saying “Everybody look at me.” During the second round of trials, the PA used procedures similar to those used in the first round, with the exception that a 5-s delay interval followed the PA’s presentation of the words. Likewise, the third round of trials (i.e., the second 5-s delay trials) was performed identically to the second round. Each session concluded (after the third round of trials) with the PA providing non-contingent verbal praise to the participant, such as “You are doing great!” or “Keep up the good work,” while clapping for him (Fig. 2c). The RP also delivered non-contingent praise to the participant by orienting its head toward him and saying “Good job.” Across sessions, the presentation of words was randomized to prevent participants from memorizing the words by order. The experimenter did not intervene in any way since the participant was expected to work independently. The tutoring system addressed the participant and the RP by their names and delivered prompts using language similar to that used in a typical classroom setting.
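The structure of one instructional session, with three rounds that alternate between the RP and the participant, can be laid out as a trial schedule. This is a minimal sketch: the tuple layout and function name are hypothetical, and the assumption that word order is reshuffled independently for each learner in each round goes beyond what the text specifies.

```python
import random


def build_session(robot_words, child_words, rng):
    """Return a list of (round, delay_s, learner, word) trials for one session.

    Round 1 uses a 0-s delay (immediate model); rounds 2 and 3 use a 5-s delay.
    Within each round the PA alternates between the RP and the participant.
    """
    trials = []
    for round_no, delay_s in enumerate([0, 5, 5], start=1):
        # Assumption: a fresh random order per learner in every round.
        rp_order = rng.sample(robot_words, len(robot_words))
        child_order = rng.sample(child_words, len(child_words))
        for rp_word, child_word in zip(rp_order, child_order):
            trials.append((round_no, delay_s, "RP", rp_word))
            trials.append((round_no, delay_s, "participant", child_word))
    return trials


lr = ["walk", "from", "cold", "best"]
lc = ["walk", "from", "them", "stop"]
schedule = build_session(lr, lc, random.Random(7))
```

With four words per learner, the schedule contains 24 trials, and each learner meets each of his or its words once per round.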

During instructional sessions, the primary experimenter closely observed the interaction and advanced the instructional trials by determining whether the responses were correct. On a remote keyboard, he typed “C” if the response was correct, and, consequently, the PA delivered reinforcement. Conversely, he typed “F” for an incorrect response, in which case the PA provided corrective feedback. The experimenter’s keystrokes were continuously monitored by the software in order to determine whether to deliver reinforcement or corrective feedback for each word presented. Additionally, the tutoring system had an internal timer to measure the 5-s interval. As soon as the interval was over (when no keystroke was registered in that period), it automatically scored a “no response” for that specific word, and the PA presented a verbal model of the word. With this mechanism, if the student responded prior to the end of the delay interval, the PA immediately reacted by delivering corrective feedback or reinforcement. Only in the case of no response did the PA wait for the entire 5 s. Instructional sessions were 4 min in length.
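The scoring rule that combines the experimenter's keystroke with the 5-s timer can be written as a pure function. This sketch is illustrative only: the function name, return labels, and the treatment of a keystroke arriving at or after the interval boundary as a no response are assumptions, and real-time keyboard polling is deliberately left out.

```python
def score_trial(key, elapsed_s, interval_s=5.0):
    """Map an experimenter keystroke to a scored outcome and a PA action.

    key: 'C' (correct), 'F' (incorrect), or None if no key was pressed.
    elapsed_s: seconds between word presentation and the keystroke.
    """
    if key is None or elapsed_s >= interval_s:
        # Timer expired without a keystroke: the PA models the word.
        return ("no_response", "verbal_model")
    if key.upper() == "C":
        return ("correct", "reinforcement")
    if key.upper() == "F":
        return ("incorrect", "corrective_feedback")
    raise ValueError(f"unexpected key: {key!r}")


outcome_fast = score_trial("C", 1.8)    # response before the interval ends
outcome_error = score_trial("f", 3.2)   # lower-case input is accepted here
outcome_timeout = score_trial(None, 5.0)
```

Separating the decision rule from the timing loop keeps the rule easy to check against recorded sessions, which is one way the reported procedural integrity could be audited.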

Post-instruction Probes

When the participant reached criterion on each word set, an assessment phase was conducted across three sessions. Each assessment session was identical to the baseline sessions, where the participant was required to read the presented stimuli and did not receive any word modeling, feedback, or reinforcement from the PA. These probes were repeated after the participant met criterion for each word set.

Follow-up Sessions

An additional maintenance probe was performed 2 months after completion of the last post-assessment session for the third word set. Procedures were identical to those of baseline sessions. In addition, a final generalization probe was conducted at the participant’s home with his parent to test whether the acquired words would transfer outside the experiment room and to different people. This session was similar to the screening session conducted before the experiment began. The parent was instructed to present the words printed on flash cards, one by one, and to record correct and incorrect responses without delivering feedback or reinforcement.

Reliability

The first author trained an independent observer (i.e., a graduate research assistant) to collect dependent and independent variable reliability data using simulated scenarios and recordings from a pilot study. The second observer collected dependent variable reliability data on 30% of baseline, instruction, post-instruction evaluation, and maintenance sessions. Inter-observer agreement (IOA) was then calculated using point-by-point comparison between the two observers’ judgements. The number of agreements was divided by the total number of agreements plus disagreements and then multiplied by 100 (Tawney and Gast 1984). IOA was 100% across all three participants. No IOA data were calculated for the generalization sessions as there were no video recordings available.
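The point-by-point agreement calculation described above can be illustrated with a short Python sketch. The function name `point_by_point_ioa` and the trial codes in the example are hypothetical and serve only to show the arithmetic; they are not data from the study.

```python
def point_by_point_ioa(observer_a, observer_b):
    """Return point-by-point inter-observer agreement as a percentage.

    Each argument is a sequence of trial-level scores, one entry per trial,
    recorded independently by one observer.
    """
    if len(observer_a) != len(observer_b):
        raise ValueError("both observers must score the same trials")
    if not observer_a:
        raise ValueError("at least one scored trial is required")

    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    disagreements = len(observer_a) - agreements
    # Agreements divided by agreements plus disagreements, multiplied by 100.
    return agreements / (agreements + disagreements) * 100


if __name__ == "__main__":
    # Made-up scores for four trials: C = correct, F = incorrect, N = no response.
    primary = ["C", "F", "C", "N"]
    second = ["C", "F", "C", "C"]
    print(point_by_point_ioa(primary, second))  # 3 of 4 trials agree -> 75.0
```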

Data were also collected on independent variable reliability across 30% of baseline, instruction, post-instruction evaluation, and maintenance sessions. Since the tutoring system’s functions (such as the PA delivering group attentional cues, saying the RP’s and participants’ names, delivering task directions and prompts, timing and responding to the participants, as well as the RP’s responses and bodily reactions) were automated, 100% procedural integrity was obtained for the tutoring system’s behaviors. In addition, the primary experimenter delivered the baseline, post-instruction evaluation, and maintenance sessions with 100% compliance with the planned steps. Procedural integrity, however, dropped slightly to 99.3% because the experimenter incorrectly typed C (to indicate a correct response) instead of F (to indicate an incorrect response).

Results

The percentage of correct reading responses for words in the explicit learning (EL) and explicit-plus-vicarious learning (EVL) groups across baseline, instructional sessions, and post-instruction probes is displayed in Figs. 3, 4, and 5 for students J, I, and V, respectively. Prior to the intervention, none of the participants read EL or EVL words correctly. Following the introduction of the small-group PA-delivered instruction, data showed an immediate change in a therapeutic direction and, ultimately, criterion-level performance for all three word sets across participants. The average number of sessions to criterion across participants and word sets was 5.89, with a range of 5–8. During the maintenance and generalization probes, all three participants read the words they had acquired through the EL and EVL modes (i.e., Δ and Γ words, respectively) with 100% accuracy. Overall, the participants made few errors (see Table 1). Participants made substantially fewer errors on the EVL words than on the EL words, with the exception of Student I, who made 6 and 14 errors in the EL and EVL modes, respectively, in his third set. Although an attempt was made at the beginning of the study to identify words of equal difficulty, the EVL words in the third set may have been more difficult for Student I than the corresponding EL words. Indeed, the number of errors for his third EVL word set (i.e., 14) is a statistical outlier in Table 1. The difficulty of those words may have exceeded his reading age.

Fig. 3 Percentage of student J’s correct responses

Fig. 4 Percentage of student I’s correct responses

Fig. 5 Percentage of student V’s correct responses

Table 1 Number of errors made to reach criterion for explicit learning (EL) words and explicit-plus-vicarious learning (EVL) words

Tables 2, 3, and 4 show participants’ percentage of correct responding to words in the vicarious learning (VL) category. Participants’ performance on VL words after instruction of those words to the RP is italicized. During the pretest, all participants consistently read VL words (which were later taught exclusively to the RP) at 0% accuracy. Data showed that all participants learned a high percentage of the RP’s exclusive words (i.e., Π words) through vicarious learning as the PA delivered instruction to the RP. Student J learned, maintained, and generalized all the Π words with 100% accuracy. Student I acquired 5 of the 6 VL words and demonstrated 100% maintenance and generalization of those words; his reading accuracy in the maintenance and generalization sessions was therefore 83.33% across all three sets. Student V acquired all the VL words, although he did not show generalization and maintenance for one of the Π words in the first set. He maintained and generalized his Π words (acquired through VL) with 83.33% accuracy on average across the word sets.
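As a check on the arithmetic, the percentages reported for VL words can be reproduced from the word counts given in the text. The sketch below assumes six VL words per participant (stated explicitly only for Student I) and uses a hypothetical dictionary layout; it is an illustration, not part of the study’s analysis.

```python
# Number of VL words maintained and generalized, out of an assumed six per
# participant. Counts follow the text: J retained all six, I retained the
# five he acquired, and V retained all but one.
TOTAL_VL_WORDS = 6
retained = {"J": 6, "I": 5, "V": 5}

# Per-participant maintenance and generalization accuracy, in percent.
accuracy = {name: count / TOTAL_VL_WORDS * 100 for name, count in retained.items()}

# Mean accuracy across the three participants.
mean_accuracy = sum(accuracy.values()) / len(accuracy)

if __name__ == "__main__":
    for name, value in accuracy.items():
        print(f"Student {name}: {value:.2f}%")
    print(f"Mean: {mean_accuracy:.2f}%")
```

Under these assumptions the per-participant values round to 100.00%, 83.33%, and 83.33%, and their mean rounds to 88.89%, which matches the mean level reported in the Discussion.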

Table 2 Student J’s correct responding percentage of vicarious learning (VL) words
Table 3 Student I’s correct responding percentage of vicarious learning (VL) words
Table 4 Student V’s correct responding percentage of vicarious learning (VL) words

Discussion

In response to recent calls for innovation (e.g., U.S. Department of Education’s Office of Educational Technology 2017; Vasquez et al. 2015), the current research combined recent advancements in virtual reality and social robotics to develop an effective tutoring system that reflects a new paradigm in automated instruction for individuals with ASD. This tutoring system featured a virtual character functioning as a teacher and a humanoid robot emulating a peer. This pedagogical scenario mimicked a small-group instructional arrangement, which enhanced the social richness of the tutoring system and afforded novel learning channels such as imitative and vicarious learning. Specifically, the authors evaluated the effectiveness of sight word instruction by the virtual teacher on the reading accuracy of children with ASD within this technology-assisted group arrangement. Further, the effects of this instructional arrangement on participants’ observational learning of sight words taught solely to the robot peer were investigated. The results indicated that all three participants learned all the target stimuli that were taught exclusively to the participants or to both participants and the robot peer. The participants made few errors, especially with words that were taught to both participant and robot peer. Additionally, two of the participants vicariously learned all words taught exclusively to the robot, and the third participant acquired five of the six. They maintained their target words for 2 months following the intervention and were reported to generalize performance to home settings, to written words on paper, and to their parents. They also demonstrated maintenance and generalization of the majority of non-target words (i.e., a mean level of 88.89%).

These acquisition rates for target and non-target stimuli are consistent with studies on human-delivered sight word instruction to students with disabilities within typical group arrangements (Campbell and Mechling 2009; Falkenstine et al. 2009; Stonecipher et al. 1999; Wolery et al. 1991) in that all the participants acquired target words in relatively few sessions (i.e., 5–8). In addition, the participants in the current investigation acquired non-target words at levels similar to and, in some cases, higher than those in studies involving teachers and typical group arrangements (Ledford et al. 2008; Campbell and Mechling 2009; Falkenstine et al. 2009; Ross and Stevens 2003). Moreover, the generalization levels of both target and non-target words are high when compared to studies on sight word instruction in small-group arrangements involving human teachers and peers. For example, Ledford et al. (2008) reported that participants generalized target information and observational information at 83% and 69%, respectively. These findings suggest that the developed package may be as effective as and, in some cases, more effective than traditional group arrangements during reading instruction for individuals with ASD. Further research is warranted to confirm these findings through direct comparisons of group arrangements.

In the current study, the virtual teacher was programmed to draw participants’ attention toward the reading material via its gaze, body orientation, and pointing. Further, the robot peer was programmed to enhance students’ engagement using its locomotion as well as meaningful, interactive, and contingent behavior. For example, it interacted with the participants, delivered reinforcement, modeled appropriate group-specific social behaviors, and provided opportunities for the participants to respond. This permitted participants to practice reading sight words in an environment free from potential negative feedback, a common instructional stressor. Finally, instructional sessions were kept short (i.e., 4 min) to accommodate the typically short attention spans of children and to avoid boredom and fatigue. More research (e.g., component analysis) is needed to parse out the most active ingredients within this novel instructional package.

Slater and Wilbur (1997) argued that a user’s sense of presence and perceptual experience are influenced by the environment’s level of inclusion, which is the degree to which extraneous signals (e.g., keyboard, joystick, weight of head-mounted displays) are removed. Since the virtual teacher and robot peer communicated with participants through natural language, and participants answered the virtual agent’s queries by speaking their responses, neither a mouse nor a keyboard was required of participants. This capitalizes on recent advances in text-to-speech and automatic speech recognition technologies. These technologies not only can enrich the learning environment’s ecological validity but also have the capacity to increase access to computer-based instruction for students with motor impairments or deficits in keyboarding/mouse skills.

During intervention, it was observed that participants responded to the RP’s performance with positive statements (e.g., Thank you, nice job). One of the participants consistently greeted the robot and hugged it when the session was completed. Another participant began imitating the robot’s happy gestures and ultimately emitted these gestures independently after acknowledgement of a correct response. Interestingly, his parent confirmed that he had never exhibited those gestures prior to the experiment. These observations suggest that this package may serve as a context under which learners can safely practice the performance of critical social responses.

Future Implications

Many individuals with ASD lack skills prerequisite for participating in and benefiting from traditional group instructional arrangements, including tolerating intermittent reinforcement, turn-taking, and maintaining attention to instructional stimuli (Greer et al. 2006; Leekam et al. 1998; Masia and Chase 1997; Sallows and Graupner 2005; Stone and Yoder 2001). Added emphasis must be placed on developing strategies that prepare these individuals for group learning and equip them with skills required for group participation. The tutoring paradigm proposed in this study models a small-group arrangement where users are required to apply and practice group/classroom skills, which, in turn, may increase the likelihood of generalization of those skills to multi-student contexts. The authors suggest that, in the developed instructional package, learners with ASD may be able to acquire critical instructional prerequisites within the tightly controlled and potentially reinforcing group-specific interactions involved. Although it is not suggested that the developed package should replace learning environments that more closely reflect natural contexts, the authors propose that it may serve to facilitate increased participation in them.

Future Research

The current study raises some questions that warrant further examination. In this study, the RP was programmed to show gradual acquisition of words to resemble the most frequent situation in a natural group setting. However, future investigators should assess the differential effects of the RP as a model of varying levels of competency, resulting in different ratios of praise to corrective feedback. Further, since many group instructional arrangements include more than two learners, investigators should evaluate the effects of adding more learners to the current arrangement. For example, this would permit one to compare participants’ vicarious learning of words presented solely to the robot to those words presented to a human peer.

The authors implemented a CTD procedure within the instructional package. In this procedure, if students do not know the answer, they wait a specified amount of time before another prompt is emitted. Researchers should investigate the efficacy of other prompting procedures, such as simultaneous prompting and progressive time delay, within similar technology-assisted arrangements. Furthermore, they might consider incorporating active student responding strategies such as choral responding. The virtual teacher delivered instruction to the robot and participant in a predictable order (i.e., in alternation). In future investigations, the effects of an unpredictable order should also be studied. An unpredictable order of instruction might result in higher attention to the instructional material and, thereby, higher acquisition rates.

Limitations

The findings, although encouraging, should be deemed preliminary in the context of three limitations. First, the research employed a multiple-probe design across a small number of participants. Though this design limits threats to internal validity associated with repeated testing of participants during baseline conditions, it is prone to cyclical variation and generally considered less robust than the multiple baseline design. This issue, together with the use of only three participants, limits the generalizability of the findings. Studies with a larger number of participants spanning a wider range of ages and ASD characteristics are needed. Second, participants spent a limited number of instructional sessions (17.67 sessions on average among students) interacting with the tutoring system. It is unknown whether an extended period of instruction would enhance or diminish student motivation and, ultimately, responding. It is plausible that the novelty of the instructional package contributed to the participants’ performance and that extended exposure to the package might have produced gradually weaker effects. Third, it is important to acknowledge that data related to the generalization of responding within the home setting are based on parent report and, thus, must be interpreted with caution.