
Introduction

Many factors influence listening comprehension, such as speaking speed, accent, or topic knowledge. One aspect that has not been examined fully is repetition type. When investigated, the effects of repetition on listening comprehension test performance have been mixed, for several reasons. The primary reason is that repetition has been defined and operationalized too vaguely; some researchers have defined it simply as repeating the stimuli. Another reason is that repetition has rarely been the primary focus of a study; rather, it has been one of several research questions investigated.

The spacing effect (Ebbinghaus 1885/1913) is the finding that memory performance is better when repetitions of an item are separated by other items (i.e., spaced repetition) than when the repetitions immediately follow one another (i.e., massed repetition). The lag effect concerns the amount of time between repetitions of an item; research has indicated that a longer lag between repetitions increases recall. The spacing effect and the lag effect are often confused with each other. The spacing effect refers to the comparison between a massed and a spaced method, whereas the lag effect refers to the different retrieval intervals used within the spaced method.
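The distinction can be made concrete with a small sketch. The schedules below, using hypothetical items A, B, and C, show how the same number of presentations is arranged under each method; the item labels and function names are illustrative, not from the study.

```python
# Illustrative sketch (hypothetical items): massed vs. spaced presentation
# schedules for the same items, each presented the same number of times.

def massed(items, reps=2):
    """Each item's repetitions immediately follow one another."""
    return [item for item in items for _ in range(reps)]

def spaced(items, reps=2):
    """Repetitions of an item are separated by the other items."""
    return list(items) * reps

print(massed(["A", "B", "C"]))  # ['A', 'A', 'B', 'B', 'C', 'C']
print(spaced(["A", "B", "C"]))  # ['A', 'B', 'C', 'A', 'B', 'C']
```

Under this framing, the lag effect concerns how many intervening items (or how much time) falls between the two presentations of A in the spaced schedule, not the massed-versus-spaced contrast itself.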

The use of repetition is common in our daily lives. The spacing effect is commonly used in commercials, advertising, speeches, and public announcements. In teaching, the spacing effect is often associated with the long-term effects of repetition, namely recycling material over a course or in a spiraled curriculum. The spacing effect, however, has benefits in the short term as well as the long term. Greater awareness of the spacing effect would enhance its application in education.

L2 Memory Processes

Second language listening processes do not differ from first language listening processes in any physical aspect (Kroll et al. 2012), except that processing capacity is generally reduced (Call 1985; Faerch and Kasper 1986; Hasegawa et al. 2002). The difficulty for second language learners arises in comprehending specific elements of the language, and any necessary compensation, such as using background knowledge to fill in the gaps, provides another opportunity for miscomprehension. Even two native speakers encounter misunderstandings and do not accurately comprehend everything they hear at all times. Compensatory skills, such as using visual cues or world knowledge, can help listeners offset incomplete listening comprehension.

Mackey et al. (2010) found that working memory accounted for less than 20 % of the variance in measured oral output, a finding that indicates that other factors contribute to the output. Sunderman and Kroll (2009) suggested that working memory for language learners who are beyond the basic elements of the language is the ability to control attention and suppress competing processes. The ability to control attention explains why high L1 working memory capacity does not transfer directly to high L2 working memory capacity; the L1 might interfere with the L2. As numerous studies indicate (De Bot and Kroll 2010; Costa 2005; Dijkstra 2005; Marian and Spivey 2003; Schwartz et al. 2007), L2 learners cannot suppress L1 activations even when the L1 is not in use. This finding holds for reading, listening, and planned speaking. These studies indicate that the parallel activation of both languages influences working memory capacity.

Spacing Effect in L2

Serrano and Muñoz (2007) outlined several studies in which intensive learning was more productive than spaced learning. In vocabulary acquisition, Collins et al. (1999) found that vocabulary items learned over five months were better recalled than items learned over 10 months. In speaking, several studies (Freed et al. 2004; White and Turner 2005) indicated that fluency increased more over a short intensive period than over a long one. In listening, several studies indicated that intensive or compressed classes helped listening comprehension (Lapkin et al. 1998; Lightbown and Spada 1994). Although these studies demonstrate that intensive study can be more effective than traditional instruction, questions remain about how the terms spaced and massed were defined. Most of the studies were based on learning over a year. For example, one class had instruction for 350 h over 10 months versus 350 h over 5 months. Or, the amount of instruction time differed between the conditions. For example, one intensive course had 350–400 h of instruction over the year at 18–20 h per week, but the traditional course had only 120 h of instruction over the year at a maximum of 4 h a week. Because of the lag effect, these results may not reflect the underlying workings of the spacing effect.

There have been several studies focusing on L2 listening comprehension using repetition. Brindley and Slatyer (2002) explored the influence of tasks on test results involving the Certificates in Spoken and Written English (CSWE). The principal aim was to identify how five key task characteristics and task conditions affected test difficulty. One of the variables examined was the number of hearings. The participants were 284 adult ESL learners enrolled in Certificate III in teaching centers across three states in Australia. These participants engaged in various combinations of three tasks. One task (control condition) was given to all the participants. The participants listened once to a two-minute recorded monolog concerning the Australian educational system, and then completed a sentence-completion task in which a few words of English were written in ten sentences. Four versions of the remaining two tasks were created based on the item difficulty variable; these tasks were randomly assigned after the participants had completed the first task. Other than the topic change, the second task’s baseline was similar to the first task, but the other versions included changing the item format, repetition, and the use of live speech. The third task’s baseline was similar to the control version, but the topic was about the Guide Dog Association in Australia. Additionally, the third task’s versions included short answers instead of sentence completion, a dialog instead of a monolog, and a faster speech rate.

In order to determine which tasks influenced item difficulty, the researchers first analyzed the scores using a Rasch-based program to obtain person ability and item difficulty estimates. To do this, all of the test forms were combined and treated as a single test containing 89 items, with task 1 (control) providing a linking set of common items. Based on the ability/difficulty scale, the tasks could be interpreted as easier or more difficult. The easiest task required the participants to complete a table after listening. The most difficult task was characterized by increased speech rate. Although not addressed specifically, the results from a graph indicated that the number of hearings did not affect item difficulty when compared with other stimulus variables, such as listening once or live versus recorded speech.

Chang and Read (2006) investigated four listening support formats in which one of the conditions was repeated input. The first research question concerned whether different types of listening support would affect listening performance. The second research question asked whether the listening support types would affect higher or lower proficiency participants in the same manner. They examined the effects of the different formats on listening comprehension with 160 students from intact classes studying business at a college in Taipei, Taiwan. The participants were given one test condition based on their class and class day. Based on a TOEIC test, each group was further divided into low and high listening proficiency sub-groups. The participants in each condition completed two listening tests with 15 multiple-choice questions for each listening text. In the repeated input condition, the students were asked to listen to the text without any special preparation. Then they previewed questions before listening to the text twice, so they heard the text three times in all. Thereafter they answered 15 multiple-choice questions in three minutes. The steps were repeated for the second listening text.

Chang and Read (2006) conducted a 4 × 2 ANOVA. The dependent variable was the combined test score. The independent variables were four types of listening support (previewing questions, repeated input, topic preparation, and vocabulary instruction) and two listening proficiency levels (high and low). The results indicated that repeated input generated the second-highest mean test scores, while topic preparation generated the highest. For the first research question, significant main effects were found for listening support and listening proficiency. There was also a statistically significant interaction between listening support and listening proficiency. Comparing the two proficiency groups, two of the four types of listening support, previewing questions and repeated input, were statistically significant. The high proficiency group scored higher using these two types of listening support than the lower proficiency group. Their results indicated that the different types of listening support affected comprehension scores for the different proficiency groups, but the effect sizes were small.

In answer to the second research question, the researchers reported that listening support activities affected the low and high proficiency levels differently. Their results indicated that high proficiency learners benefitted the most from repeated input, but the differences between two of the three listening support activities were not statistically significant. The low-proficiency learners benefitted the most from topic preparation, but the difference of this activity from repeated input was not statistically significant whereas the other listening support activities were.

Chang and Read (2007) examined the effects of listening support factors on listening comprehension with 140 students in a five-year postsecondary educational program at a Taiwanese college, all with low levels of listening proficiency as measured by the TOEIC listening section (a scaled score of 165 out of 495). Two research questions were investigated. The first asked what type of listening support (repetition, visual, or textual) would enhance comprehension for low-proficiency listeners. The second asked what type of support (visual, textual, or repeated input) would affect the students’ perceptions of the listening task. Chang and Read stated that learners bring beliefs to the task, and those beliefs influence task performance.

They conducted a one-way ANOVA. The independent variables were visual support, textual support, and repeated support. The dependent variable was the listening comprehension score. The results indicated that all three types of listening support resulted in significantly higher scores than the control condition, and that repeated input produced significantly higher scores than the other types of listening support.

Cognitive Difficulty of Questions

One aspect that influences cognitive item difficulty is the interaction between the item type and the test-taker. For instance, the test item and its relationship to the participant’s background knowledge play a role in determining question difficulty. Yi’an (1998) concluded that background knowledge played a role in how the participants answered the questions. For higher proficiency learners answering multiple-choice questions, background knowledge acted as a facilitator, as it allowed the learners to use the stem questions or distractors when responding. However, as Yi’an pointed out, this facilitating effect does not guarantee a correct response. For lower proficiency learners, multiple-choice questions acted in a compensatory way to fill in missing information. Again, this effect did not necessarily mean that the learners responded correctly.

Another aspect is the relationship of the question stem and the response item. Nissan et al. (1996) concluded that inference items, which ask about information not stated explicitly in the passage, were significantly more difficult than items that required information explicitly stated in the passage. Kostin (2004) replicated Nissan et al.’s (1996) study and came to the same conclusion. In addition, both studies indicated that lexical overlap and redundancy helped comprehension. Further studies have confirmed these findings (Brindley and Slatyer 2002; Buck and Tatsuoka 1998).

For this study, question difficulty was considered in the following ways. First, Bloom’s (1956) taxonomy of six levels of cognitive difficulty was used as the basis for determining question difficulty. Brown (2001) interpreted Bloom’s taxonomy for language purposes and outlined seven levels (p. 172). The first level, which is considered the easiest, is called knowledge questions. These types of questions ask for factual information and test recall and recognition of information. The second level, which is considered more cognitively difficult, is comprehension questions. These types of questions ask individuals to interpret and infer information. The fourth level, which is more difficult than the preceding levels, is called inference questions. These questions involve forming conclusions that are not directly stated in the input. Second, Henning’s (1991) definition was also adapted to make comparisons more applicable. Therefore, lower order cognitive difficulty was defined as questions requiring an understanding of specific information stated in the passage within a single sentence. Higher order cognitive difficulty was defined as questions that can only be answered by using information from two or more sentences or by inference.

Methods

The participants were first- and second-year students attending a Japanese national university. The students are streamed into their courses based on the institution’s TOEIC test. All of the participants’ TOEIC listening scores were under 300. The 242 participants were from six intact classes taking a required TOEIC course. One class from the first year and one class from the second year were randomly grouped together and given one of three treatment conditions. Table 1 gives the descriptive statistics. Although the groups’ scores were not equal and a one-way ANOVA indicated a statistical difference between the massed condition and the other conditions, this does not necessarily indicate that the participants were at different proficiency levels, for several reasons. First, the TOEIC standard error of measurement is ±25 points (TOEIC 2016a), so the control group and the spaced group can be considered equal. Second, all of the TOEIC listening scores were below 300 and the average was below 200, so based on TOEIC’s listening descriptions (TOEIC 2016b), the participants’ listening proficiency was roughly equal. Third, the university divides the classes based on the complete TOEIC score. As TOEIC (2016a) indicates, there is a high correlation among the scores. For example, the listening score correlates at approximately 0.8 with the reading score, while the listening score correlates at approximately 0.9 with the total score.

Table 1 TOEIC scores for each group

Toward the end of the semester, in order to prepare for the semester-ending TOEIC test, the students took the teacher-made listening test over three weeks. Ten listening passages, each with five multiple-choice questions, were created. Each passage was approximately one minute in length. There were five monologs and five dialogs. The five multiple-choice questions for each passage comprised two distinct types, specific detail and inference. A key difference between the TOEIC test and the teacher-made test was the use of distractors. Commonly, TOEIC uses terms or phrases from the listening text as distractors. For this test, none of the distractors were taken from the listening text. The scores were not counted as part of the grade.

The procedures for each listening passage are shown in Table 2. The topic was introduced prior to listening to induce schema building in all the listening conditions. In the control condition, students listened to the passage once and then answered five questions. In the massed repetition condition, students listened to the passage twice and then answered five questions. In the spaced repetition condition, students listened to the passage, counted down from 5 to 1 to interrupt the phonological loop (and thus their working memory), listened a second time, and then answered five questions. The students were encouraged to take notes while listening, but were not allowed to look at the questions while listening.
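As a rough summary, the three conditions can be sketched as ordered step lists. The step wording below is a gloss of the procedure just described, not the authors' exact wording from Table 2.

```python
# A gloss of the three listening conditions described above; step wording is
# illustrative, not the authors' exact protocol from Table 2.
PROCEDURES = {
    "control": ["introduce topic", "listen once", "answer 5 questions"],
    "massed": ["introduce topic", "listen", "listen again immediately",
               "answer 5 questions"],
    "spaced": ["introduce topic", "listen", "count down from 5 to 1",
               "listen again", "answer 5 questions"],
}

# Every condition introduces the topic first and ends with the same question set;
# only the middle (repetition) steps differ.
for steps in PROCEDURES.values():
    assert steps[0] == "introduce topic"
    assert steps[-1] == "answer 5 questions"
```

The design choice this highlights is that the conditions differ only in what happens between topic introduction and question answering, so any score differences can be attributed to the repetition schedule rather than to preparation or question format.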

Table 2 Procedures for each condition

Results

All Rasch analyses were conducted with WINSTEPS version 3.92.

The Rasch model allows us to examine all the items on one form of measurement, i.e., a ruler. For this study, all of the items under all three conditions were combined and analyzed on one measurement scale. Therefore, if the hypotheses hold true, the Rasch model should indicate bias based on the condition an item was under. The raw mean score for the participants was 47.5 with a standard deviation of 1.6. The Rasch mean estimate was 50.0 CHIPS, the item reliability estimate was 0.98, and item separation was 6.65. The Rasch person reliability estimate was 0.81, and person separation was 2.08. The person separation value is somewhat surprising, since the participants had been filtered into groups based on the TOEIC examination. Further examination indicated that the separated participants did not come from a single treatment condition.

Rasch Fit Statistics

The criteria for the fit statistics were set at 0.7–1.5 for the mean squares. Checking outfit scores first, all fifty items were within the set criteria except for item 21 and item 4, whose mean squares were 1.70 and 1.63, with standardized z-scores of 1.5 and 1.9, respectively. Checking infit scores next, all items were within the set criteria.
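The screening rule just described can be expressed as a short filter. In the sketch below, the mean-square values for items 4 and 21 are those reported above; the values for the other items are hypothetical placeholders.

```python
# Flag items whose outfit mean square falls outside the 0.7-1.5 criterion
# used in this study.
FIT_MIN, FIT_MAX = 0.7, 1.5

def misfitting(outfit_mnsq):
    """Return item numbers whose mean square lies outside the criterion band."""
    return [item for item, mnsq in outfit_mnsq.items()
            if not (FIT_MIN <= mnsq <= FIT_MAX)]

# Items 4 and 21 carry the mean squares reported in the text (1.63 and 1.70);
# the remaining entries are hypothetical well-fitting items.
outfit = {4: 1.63, 5: 0.95, 20: 1.10, 21: 1.70}
print(misfitting(outfit))  # [4, 21]
```

With this band, both overfitting (mean square below 0.7) and underfitting (above 1.5) items would be flagged, matching the two-sided criterion stated above.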

Figure 1 shows the Wright map, a pictorial representation of the person–item relationship (Bond and Fox 2007). The CHIPS scale is shown on the far left side of the figure. According to Linacre (2008), CHIPS are a useful transformation in which 1 logit = 4.55 CHIPS; in this user-scaling system, standard errors tend to be about 1 CHIP in size. A comparison of the locations of the person measures (left side) and item measures (right side) shows that the mean of the person measures (M = 47.50; SD = 1.6) is slightly below the mean of the item measures (M = 50.00; SD = 0.80). In addition, Fig. 1 shows that several of the item measures were redundant in that they shared the same location on the scale as at least one other item. Nonetheless, the item measures were spread out sufficiently in that their range extended beyond the listening comprehension ability of all but a few participants on this test. Along the left side of the map, the participants are spread over 22 CHIPS (minimum = 33.5, maximum = 55.8), with higher proficiency students toward the top of the map and lower proficiency students toward the bottom. Along the right side of the map, the items are spread over 23 CHIPS (minimum = 40.3, maximum = 63.5), with the easier items toward the bottom of the map and the more difficult items toward the top. In this case, the listening comprehension test covers the abilities of the highest students, but there are a few low-proficiency students whom the test did not cover. The common linear interval scale for persons and items clearly demonstrates whether the items matched the persons’ abilities on the construct measured. In this sample, the mean of the items was slightly above the participants’ mean ability.
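Under the user scaling just described (1 logit = 4.55 CHIPS, with the scale center conventionally placed at 50 CHIPS, an assumption in this sketch), converting between logits and CHIPS is a simple linear transformation:

```python
# Linear conversion between logits and CHIPS, assuming 1 logit = 4.55 CHIPS
# and the scale centered at 50 CHIPS (the conventional anchoring; assumed here).

def logit_to_chips(logit, center=50.0, scale=4.55):
    return center + scale * logit

def chips_to_logit(chips, center=50.0, scale=4.55):
    return (chips - center) / scale

print(logit_to_chips(0.0))              # 50.0  (a measure at the scale center)
print(logit_to_chips(1.0))              # 54.55
print(round(chips_to_logit(47.5), 2))   # -0.55 (the person mean, below the item mean)
```

On this scaling, the person mean of 47.5 CHIPS sits about half a logit below the item mean of 50.0 CHIPS, which is the mismatch the Wright map displays.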

Fig. 1
figure 1

Wright map of items. Note Each “#” is 4 participants; Each “.” is 1–3 participants

A Rasch principal components analysis of the residuals was applied to detect other potential measurement dimensions in the listening test data. The Rasch model accounted for 25.2 % of the variance in the data. The first residual contrast (Fig. 2) had an eigenvalue of 4.0. Upon further examination, five items separated from the other items, with loadings over +0.5. All five items were from the last listening passage. Although the test was given over three periods to reduce fatigue and aid concentration, the last passage stands out against the other passages.

Fig. 2
figure 2

First contrast loading

The second residual contrast had an eigenvalue of 3.2. There was no clear break between the items, and most of the loadings were under 0.5. The top three items were questions of high difficulty while the bottom three questions were low difficulty.

A differential item functioning (DIF) analysis investigates the differing characteristics of a test item across subpopulations and is useful for identifying items biased toward a particular subpopulation. In this study, it was hypothesized that spaced repetition would increase comprehension scores more than the control or massed repetition conditions. Therefore, bias should exist in the model, favoring spaced repetition the most. Two criteria need to be met when analyzing DIF. First, the DIF contrast must be large enough to be meaningful; therefore, only contrasts greater than 0.5 logits were examined. Second, the Rasch-Welch t-value must be greater than 2.0, indicating that the difference is unlikely to have occurred by chance. Overall, as indicated in Fig. 3, there appears to be bias for the spaced repetition group. However, the bias is neither consistently favorable nor consistently significant. For example, the spaced repetition group found the first item much easier to comprehend than the other groups did: the DIF contrast for item 1 is 4.0 between the control group and the spaced repetition group, with a t-value of 2.60 for the spaced repetition group, while the contrast between the massed repetition group and the control group is only 0.2. In contrast, the spaced repetition group found item 15 more difficult than the other groups did: the DIF contrast for item 15 is −3.4 between the spaced repetition group and the control group, with a t-value of 2.17, while the contrast for the massed repetition group is 0.1. Examining each item using these guidelines, only items 1 and 15 were biased; the remaining items might have contrasts greater than 0.5 logits, but their t-values were less than 2.0. Overall, the items in the spaced repetition condition were inconsistent: about 22 items were easier in the spaced repetition condition than in the other conditions, while about 22 were more difficult. Additionally, question difficulty did not favor any condition. For the most part, the control and massed repetition groups mirrored one another and did not show any significant bias.
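The two DIF criteria above can be combined into a single screening rule. In the sketch below, the (contrast, t) pairs for items 1 and 15 are those reported above; the other entries are hypothetical items that fail one criterion or the other.

```python
# Flag items meeting BOTH DIF criteria used in this study:
# |DIF contrast| > 0.5 logits AND |Rasch-Welch t| > 2.0.

def flag_dif(items, contrast_min=0.5, t_min=2.0):
    """items maps item number -> (DIF contrast in logits, Rasch-Welch t)."""
    return [n for n, (contrast, t) in items.items()
            if abs(contrast) > contrast_min and abs(t) > t_min]

# Items 1 and 15 carry the values reported in the text; items 2 and 16 are
# hypothetical (large-but-nonsignificant and negligible contrasts, respectively).
dif = {1: (4.0, 2.60), 2: (0.8, 1.20), 15: (-3.4, 2.17), 16: (0.1, 0.30)}
print(flag_dif(dif))  # [1, 15]
```

Requiring both a sizable contrast and a significant t-value is what keeps items like the hypothetical item 2, whose contrast exceeds 0.5 logits but whose t-value falls short of 2.0, from being counted as biased.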

Fig. 3
figure 3

Measure for the differential item function. Note Black line Control group; Red line Massed repetition group; Green line Spaced repetition group; low Low cognitive difficulty; high High cognitive difficulty

Discussion

The purpose of this study was to investigate the effect of repetition on question difficulty on a listening comprehension test. Two types of repetition, spaced and massed, were used on a listening comprehension test in order to examine their effect on question difficulty. As in previous research (Brindley and Slatyer 2002), massed repetition had an effect similar to that of the control condition, i.e., listening once. It was hypothesized that spaced repetition would have a greater positive effect, but overall, the results did not indicate greater listening comprehension, nor did the spacing effect influence question difficulty in any systematic way. The spacing effect did affect the difficulty of some items, but it is not clear why the effect was positive for some items and negative for others.