
Introduction

The use of multifaceted Rasch measurement (MFRM) has prevailed in the assessment field, especially in assessing second language (L2) speaking and writing, which involves complex interactions between test takers, tasks, raters, rating scales, and other factors. McNamara and Knoch (2012) describe how Rasch measurement, especially MFRM, was adopted in L2 testing communities in the 1990s. Recent applications include Davis (2016) and Aryadoust (2016), both of which investigated the complex nature of L2 speaking assessment.

One type of L2 speaking assessment that has attracted attention from teachers and test developers is the paired oral test (hereafter, paired oral). In paired orals, test takers form pairs and talk with each other rather than with an interviewer, and their interactions are evaluated by raters. The ability to interact in English with various speakers, convey facts precisely, and express one's opinions promptly, while responding to listeners and managing the interaction, should be fostered and measured, since it is one of the determinants of success in today's globalized world. Although there are benefits to using paired orals in L2 speaking assessment, related research and applications are limited in Japan. Koizumi et al. (in press) developed and examined a paired oral test consisting of four tasks for Japanese university students learning L2 English. This study builds on Koizumi et al., expands the number of paired oral tasks calibrated on a logit scale, and examines the usefulness of the expanded task set.

Previous Studies on Paired Orals

The literature has shown that paired orals have unique value, different from other types of speaking assessment involving teacher–candidate interaction, in two respects. First, paired orals can measure oral interaction of the kind observed in natural, real-life conversation between partners of equal status, because test takers have opportunities, and are required, to be active in maintaining the conversation and constructing the discourse in cooperation with another test taker of equal status (e.g., Galaczi and ffrench 2011). Second, paired oral formats are usually similar to pair activities often conducted in communicative classes. Using paired orals therefore sends students a clear message that what they do in class matters for their grades and, ultimately, for their future use of English. Similarities between teaching and assessment activities also make it easier for teachers to relate assessment results to instruction (e.g., Negishi 2015).

Paired orals have two main disadvantages (Negishi 2015). First, test takers' performance and scores can be affected by factors other than their L2 proficiency, such as their partners' L2 proficiency, their own and their partners' personalities, and their degree of familiarity with each other (e.g., Galaczi and ffrench 2011). Second, paired orals tend to have lower reliability across raters and test occasions than interviewer-led and monolog formats, as can be inferred from a study of a group oral test in which four test takers discussed topics (Van Moere 2006). Although these issues certainly need to be addressed, they may matter less in low-stakes contexts such as classroom assessment, where teachers can assess and judge students' ability on multiple occasions using a combination of formats.

Paired orals have been examined from the perspective of factors affecting test scores (e.g., Davis 2009; Galaczi 2008, 2014; Kley 2015; Taylor and Wigglesworth 2009, special issue in Language Testing) and have been incorporated into major speaking tests such as the Cambridge English exams (Galaczi and ffrench 2011). In Japan, however, research on and applications of paired orals are limited, with a few exceptions such as Negishi (2015) and Koizumi et al. (in press). Negishi (2015) compared university students' performances across three formats (picture description, paired oral, and group oral) using MFRM. She reported that all test formats and raters fit the Rasch model; that the paired oral was the most difficult, followed by the picture description and then the group oral; and that test formats and test takers' proficiency levels affected scores. Koizumi et al. (in press) developed a paired oral test and investigated the validity of the interpretation of its scores from four viewpoints. First, all tasks and raters fit the Rasch model, with appropriate rating scale properties. Second, high reliability was observed with one task and two raters, or with three tasks and a single rater (with a cutoff of Φ = 0.70). Third, the test had a unidimensional structure, with one factor affecting all scores. Fourth, paired oral scores were moderately correlated with scores on the Test of English as a Foreign Language (TOEFL) Institutional Testing Program (ITP), as predicted at the test development stage.

Current Study

Given the benefits of paired orals, especially for L2 classroom assessment, and their limited application in Japan, it is important to make them accessible to teachers there. For this purpose, we are creating a bank of tasks whose difficulty levels are measured with a practical rating scale and whose task-related features are set based on Koizumi et al. (in press). To the authors' knowledge, no publication in or outside Japan has described an attempt to create a large task bank for paired orals.

The current study aims to assess the L2 oral interactive ability of university students at the novice and intermediate levels. Using MFRM as well as structural equation modeling (SEM) and generalizability theory, we examine six questions associated with aspects of validity (Messick 1996) and with the inferences required to make a plausible validity argument (Chapelle et al. 2008). The six research questions (RQs) are listed below, with the corresponding validity aspects and inferences in parentheses. The study examines essential aspects of validity in order to provide building blocks of evidence for the validity of the interpretation and use of paired oral test scores.

  1. Does the test have a unitary factor structure underlying the paired oral? (structural aspect; Explanation inference)

  2. Do all tasks and raters fit the Rasch model? (content and structural aspects; Evaluation and Generalization inferences)

  3. Do test tasks have a wide range of difficulty and no wide gaps in difficulty? (content and structural aspects; Evaluation inference)

  4. Is the difficulty of student cards equal? (generalizability aspect; Generalization inference)

  5. Does the holistic rating scale function properly? (structural aspect; Evaluation inference)

  6. How many tasks and raters are minimally needed to obtain sufficient reliability? (generalizability aspect; Generalization inference)

RQ1 examines the structure of the paired oral test; unidimensionality (measurement of a single dimension) is a crucial assumption of Rasch analysis. RQ2 and RQ5 address the qualities of the tasks, the raters, and the holistic rating scale used in this study. RQ3 examines the distribution of tasks to determine whether they are sufficient in number and range for assessing novice- and intermediate-level learners of English. We argue that a task bank should contain many tasks covering a wide range of difficulty, without regions of the Rasch logit scale left uncovered, since we intend to create a bank of tasks from which teachers can choose depending on their test purposes and on target learners who may differ in ability. RQ4 examines the comparability of the student cards. In the paired oral, each student receives a card for either Student A or Student B; we intend the two cards to be equal in difficulty and examine whether this is the case. RQ6 examines the level of reliability the paired oral can achieve with different numbers of tasks and raters.

Method

Participants

A total of 190 students from three private universities in Japan participated. Their majors were technology, medicine, or English, and their L2 proficiency ranged mostly from novice to intermediate. Most participants were originally from Japan, and their mother tongue was Japanese. A majority were first-year students who had studied English for at least 6 years at secondary school. Some students were from other countries, but we included them because we intended to create a test for classrooms at Japanese universities, which often include some overseas students. In their English lessons, the students were instructed to form pairs by themselves to mitigate the influence of familiarity. We did not control for or examine the effect of proficiency in this study; this should be addressed in the future.

Materials and Procedures

The test included an easy warm-up task and 11 assessment tasks—seven role-plays and four discussions. We used four tasks (Tasks 1–4) similar to the ones used in Koizumi et al. (in press) but modified some instructions by providing specific contexts for the conversation and more familiar place names, and created seven new tasks with familiar topics (see Table 1 for all the tasks). The students were requested to talk for about 2 or 3 min per task. They were not given any planning time and were encouraged to talk in a natural, two-way style with back channeling and eye contact.

Table 1 Warm-up tasks and 11 assessment tasks used

After forming a pair, students received a student card for either Student A or Student B, which covered the warm-up and the 11 tasks. So that raters could identify who was speaking, students were told to begin each task by saying their name. Tasks were either role-plays or discussions. In a role-play task, the card specified the role to play and which student should speak first (see Table 1). For example, in Task 9 (Role-play 5: Toothache), Student A begins the conversation by saying that s/he has a terrible toothache; Student B responds with sympathy and suggests seeing a dentist or taking a painkiller; Student A refuses the suggestions at least once, and the two continue the conversation. Of the seven role-play tasks, Student A begins in three, whereas Student B begins in the remaining four. We intended to make the Student A and B cards comparable in difficulty.

The order of performing the tasks was partially counterbalanced: Approximately a third of the students performed the warm-up task and Tasks 1–10 (not 11) in that order; another third performed the warm-up task, Tasks 4–10, and then Tasks 1–3 (not 11); and the rest performed the warm-up task, Tasks 8–10, and then Tasks 1–7 (not 11). One class performed only Tasks 3, 5, 8, 6, 9, 7, 10, and 11, in that order, because of limited class time.

Analyses

For rating, we used the same holistic 1–3 rating scale as in Koizumi et al. (in press), which considers interactional effectiveness and linguistic elements such as task achievement, fluency, accuracy, and appropriateness (see Table 2). We created a holistic scale because we prioritized practicality over providing detailed feedback to students, aiming for a scale that enables teachers to evaluate a performance after listening to the conversation once.

Table 2 Holistic rating scale

Using this scale, the three authors rated each talk independently. We held a one-day rater training session in which we assessed five pairs (n = 10), discussed any divergences, and added notes to the scale. We then evaluated the remaining students independently: One of the authors (Rater 1) rated all of the remaining 180 students, Rater 2 rated 48, and Rater 3 rated 94. Scores from Raters 2 and 3 were combined and treated as Rating 2, while Rater 1's scores were treated as Rating 1.

For MFRM, we used 190 students' scores on 11 tasks from three raters, with missing values. For SEM and generalizability theory, we used 117 students' scores on 10 tasks with two ratings, without missing values. The two groups can be considered similar because they had similar means and SDs of Rasch ability estimates (M = 0.43, SD = 2.41, N = 190; M = 0.53, SD = 2.38, n = 117). For SEM, we used a robust weighted least squares (WLSMV) estimator in Mplus (Muthén and Muthén 2014), since the scores were on an ordered 1–3 scale (RQ1). For MFRM, we used the rating scale model in the MFRM program Facets (Linacre 2014; RQ2 to RQ5) to estimate test takers' ability, task difficulty, rater severity, and the rating scale structure. We conducted generalizability-theory analyses using GENOVA (Center for Advanced Studies in Measurement and Assessment 2013) to calculate the number of tasks and raters needed to obtain highly consistent scores (RQ6).
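For reference, the many-facet rating scale model estimated by Facets can be written in its standard form as

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,$$

where P_nijk is the probability that test taker n receives a rating of k on task i from rater j, P_nij(k-1) is the probability of a rating of k − 1, B_n is the test taker's ability, D_i the task difficulty, C_j the rater severity, and F_k the threshold for being rated k rather than k − 1. The notation here is generic and is given only to clarify how the three facets and the rating scale enter the analysis.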

Results and Discussion

Does the Test Have a Unitary Factor Structure Underlying the Paired Oral?

SEM allows us to construct models hypothesizing relationships between observed and latent variables, based on substantive theory and previous results, and to test whether these models fit the data well (see, e.g., Ockey and Choi 2015; Kline 2010). We hypothesized two models: a unitary model with one factor of oral interactive ability underlying all 11 tasks (Model 1), and an alternative model with two correlated factors (role-play and discussion abilities) representing the two task formats, since task format may affect the structure (Model 2). For both models, we used Ratings 1 and 2 for each task (see Fig. 1).

Fig. 1

Model 1 is on the left and Model 2 is on the right. Each observed variable is labeled by task type and rating. For example, RP1 Rating1 and D1 Rating1 refer to Role Play 1 Rating 1 and Discussion 1 Rating 1, respectively. See Table 1 for RP and D tasks
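To make the two hypothesized structures concrete, the sketch below re-expresses Model 1 and Model 2 in lavaan-style syntax using the Python package semopy. This is only an illustrative re-specification: the actual analysis was run in Mplus with the WLSMV estimator, and the variable names (rp1_r1 for Role-play 1, Rating 1, and so on), the choice of semopy, and its default estimator are assumptions made for the sketch.

# Illustrative sketch only (the study itself used Mplus with WLSMV).
# Variable names are hypothetical: rp1_r1 = Role-play 1, Rating 1, etc.
import pandas as pd
from semopy import Model, calc_stats

# Two indicators (Rating 1 and Rating 2) per task: 7 role-plays, 4 discussions.
rp_vars = [f"rp{i}_r{j}" for i in range(1, 8) for j in (1, 2)]
d_vars = [f"d{i}_r{j}" for i in range(1, 5) for j in (1, 2)]

# Model 1: a single oral interactive ability factor underlying all indicators.
model1_desc = "oral =~ " + " + ".join(rp_vars + d_vars)

# Model 2: two correlated factors, one per task format.
model2_desc = (
    "roleplay =~ " + " + ".join(rp_vars) + "\n"
    "discussion =~ " + " + ".join(d_vars) + "\n"
    "roleplay ~~ discussion"
)

def fit_and_report(description, data):
    """Fit one measurement model and print fit indices and estimates."""
    model = Model(description)
    model.fit(data)            # data: one row per test taker, one column per indicator
    print(calc_stats(model))   # chi-square, CFI, TLI, RMSEA, etc.
    print(model.inspect())     # loadings and the factor correlation

# Usage, assuming the 1-3 ratings are stored in a hypothetical scores.csv:
# df = pd.read_csv("scores.csv")
# fit_and_report(model1_desc, df)
# fit_and_report(model2_desc, df)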

Table 3 shows fit statistics for the unitary model (Model 1) and the correlated-factors model (Model 2). Although the chi-square statistic was statistically significant for both models (χ2 = 371.921, df = 170, p < 0.01; χ2 = 371.612, df = 169, p < 0.01), some indices showed a good fit (CFI = 0.95, TLI = 0.95), while others showed only a moderate fit (RMSEA = 0.10 [0.09, 0.12] and WRMR = 1.21). Model 2 was particularly problematic because its latent covariance matrix was not positive definite, likely because the correlation between the two latent variables was at or above 1: the standardized path (correlation) between the two ability factors was 1.002. Model 2 was therefore excluded from further consideration.
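The non-positive definiteness can be seen directly from the estimated latent correlation: a 2 × 2 correlation matrix with an off-diagonal value of 1.002 has a negative eigenvalue, as the short numpy check below illustrates (the matrix is built only from the standardized estimate reported above).

# A latent correlation at or above 1 makes the factor covariance matrix
# non-positive definite (it acquires a negative eigenvalue).
import numpy as np

latent_corr = np.array([[1.000, 1.002],
                        [1.002, 1.000]])   # standardized factor correlation in Model 2
eigenvalues = np.linalg.eigvalsh(latent_corr)
print(eigenvalues)                         # approximately [-0.002, 2.002]
print(bool(np.all(eigenvalues > 0)))       # False: not positive definite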

Table 3 Model fit indices

Model 1 was revised based on theory and modification indices. The revised model (Model 3) explained the data well (CFI = 0.97, TLI = 0.97, RMSEA = 0.08 [0.06, 0.10], WRMR = 0.99); its parameter estimates are presented in Table 4. This suggests that the paired oral measures a single trait, which we interpret as oral interactive ability, in accordance with the intended test construct. The unitary structure adopted was essentially the same as in Koizumi et al. (in press).

Table 4 Standardized parameter estimates for Model 3

We also conducted MFRM and found that 45.05 % of the score variance was explained by Rasch measures, which also suggests unidimensionality. This percentage was smaller than, but broadly comparable to, that in Koizumi et al. (in press; 57.90 %).

Do All Tasks and Raters Fit the Rasch Model?

Figure 2 displays the relationships between test-taker ability, task difficulty, rater severity, and the rating scale. As seen in Table 5, test takers' abilities spread very widely, from −3.67 to 7.51 logits. Task difficulty varied from −1.14 to 1.10, with Task 4 (Discussion 2: Trip) the most difficult and Task 3 (Discussion 1: Hobby) the easiest. Koizumi et al. (in press) used similar tasks, and the order of difficulty there was Task 1 (Role-play 1: Club; the most difficult), Task 4 (Discussion 2: Trip), Task 2 (Role-play 2: Dinner), and Task 3 (Discussion 1: Hobby); the relative order of Tasks 4, 2, and 3 was thus the same as in the current study. One reason Task 1 was more difficult in the previous study may be that we have now added a context for talking about a club (e.g., You are in a cooking club. B is considering whether to join it.); without this context, students in the previous study probably found it harder to talk. This modification seems to have lowered the task's difficulty to an appropriate level.

Fig. 2

Wright map for participants (N = 190), tasks (k = 11), raters (n = 3), and the rating scale. Ss = participants; * = 2 participants; . = 1 participant. Fifteen participants with measures of above 4.00 were omitted from the figure. Higher values mean higher ability in the second column, more difficult tasks in the third column, and more severe raters in the fourth column

Table 5 Descriptive statistics for the three facets

Rater severity ranged from −0.66 to 0.60, with Rater 2 the most severe. Test-taker and task separation reliabilities were high (0.91–0.92), indicating that differences in test-taker ability and in task difficulty were reliably distinguished. The high rater separation reliability (0.98) indicated that the raters reliably differed in severity.

Infit mean square statistics between 0.5 and 1.5 were used as the criterion for acceptable model fit (Linacre 2013). We did not regard overfit (an infit mean square below 0.5) as problematic, because it indicates only that the persons, tasks, or raters fit the model too well. Nor did we regard infit mean squares between 1.5 and 2.0 as problematic, because such values are "unproductive for construction of measurement, but not degrading" (Linacre 2013, p. 270). All tasks and raters had values within the acceptable range (0.88–1.22 for tasks and 0.83–1.10 for raters). Among the test takers, 15 (7.89 %, 15/190) had values below 0.50 and were considered overfitting, and 17 (8.95 %, 17/190) had infit mean squares above 1.5 and were considered underfitting. One test taker had a value of 2.13, having performed inconsistently across tasks (well on a difficult task but poorly on an easy one); however, he was the only highly underfitting test taker. The fit of all tasks and raters was appropriate, in line with Koizumi et al. (in press).
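For readers less familiar with these statistics, the sketch below shows how infit and outfit mean squares can be computed for one facet element from observed ratings and model-expected ratings under the rating scale model. All parameter values in the example are hypothetical placeholders, not estimates from our data, and the code is a schematic illustration rather than a re-implementation of Facets.

# Schematic computation of infit/outfit mean squares under the rating
# scale model (categories 1-3). All parameter values are hypothetical.
import numpy as np

def category_probs(theta, difficulty, severity, thresholds):
    """P(rating = 1..m) when the relative measure is theta - difficulty - severity."""
    loc = theta - difficulty - severity
    # Numerators: exp of the cumulative sum of (loc - tau_j), with an empty sum for k = 1.
    steps = np.concatenate(([0.0], np.cumsum(loc - np.asarray(thresholds))))
    numer = np.exp(steps)
    return numer / numer.sum()

def infit_outfit(observed, thetas, difficulty, severity, thresholds):
    """Infit and outfit mean squares for one task rated by one rater."""
    cats = np.arange(1, len(thresholds) + 2)          # rating categories 1, 2, 3
    expected, variance = [], []
    for th in thetas:
        p = category_probs(th, difficulty, severity, thresholds)
        e = np.sum(cats * p)
        expected.append(e)
        variance.append(np.sum(((cats - e) ** 2) * p))
    expected, variance = np.array(expected), np.array(variance)
    squared_residuals = (np.asarray(observed) - expected) ** 2
    outfit = np.mean(squared_residuals / variance)      # unweighted mean square
    infit = squared_residuals.sum() / variance.sum()    # information-weighted mean square
    return infit, outfit

# Hypothetical illustration: six test takers rated on one task by one rater.
abilities = np.array([-1.5, -0.5, 0.0, 0.8, 1.5, 2.5])
ratings = [1, 1, 2, 2, 3, 3]
print(infit_outfit(ratings, abilities, difficulty=0.2, severity=-0.1,
                   thresholds=[-2.0, 0.3]))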

Do Test Tasks Have a Wide Range of Difficulty and No Wide Gaps in Difficulty?

To examine the distribution of tasks in the test, we used the task strata statistic, which Linacre (2013) recommends when the measure distribution is not assumed to be normal and when very high and low task measures are thought to reflect genuinely high and low task difficulty rather than measurement error; both conditions apply here, given the number of tasks toward the periphery of the scale. The task strata value was 4.79, which means that the 11 tasks could be classified into at least four statistically distinct levels of difficulty. Although this satisfies a minimum requirement, a higher strata value would be ideal, since we hope to differentiate a few more levels.
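For reference, the strata statistic is conventionally computed from the separation ratio G (the adjusted standard deviation of the task measures divided by their root mean square standard error) as

$$H = \frac{4G + 1}{3},$$

so a strata value of 4.79 corresponds to a separation ratio of roughly 3.3.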

Figure 2 shows that the tasks were spread far less widely (range = 2.24, from −1.14 to 1.10) than the test takers' abilities (range = 11.18, from −3.67 to 7.51), indicating that we need more tasks at the higher and lower ends of the scale, that is, more difficult and easier tasks. Figure 2 also reveals some gaps on the logit scale. However, as seen in Table 6, most gaps were within the standard error of measurement and were regarded as not very substantive. For example, Tasks 11 (Role-play 7: Victory) and 3 (Discussion 1: Hobby) have difficulties of −0.77 and −1.14, respectively, but their 68 % confidence intervals (CIs) overlap (−1.18 to −0.36 and −1.28 to −1.00). There were two cases in which the differences exceeded the standard error: between Tasks 8 (Discussion 5: Date) and 10 (Role-play 6: Driving), and between Tasks 2 (Role-play 2: Dinner) and 6 (Role-play 4: Movie). When 95 % CIs were used instead, only the former pair showed a gap (0.71 to 1.21 and −0.01 to 0.49; not shown in Table 6 but calculated as Measure ± 1.96 × SE), whereas the latter pair overlapped (−0.34 to 0.20 and −0.63 to −0.13). Nevertheless, we decided to use the 68 % CIs as a stricter criterion for improving our test. These two gaps, as well as those at the higher and lower ends, can be addressed in a future revision by adding tasks at those difficulty levels.
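Because this comparison will be repeated whenever the task bank is revised, the overlap check can easily be automated; the sketch below illustrates the logic, with purely illustrative measure and SE values rather than the Table 6 estimates.

# Minimal sketch: flag pairs of tasks whose difficulty CIs do not overlap.
# The measures and SEs below are illustrative, not the Table 6 values.
from itertools import combinations

def ci(measure, se, z=1.0):
    """Confidence interval: z = 1.0 gives a 68 % CI, z = 1.96 a 95 % CI."""
    return measure - z * se, measure + z * se

def gaps(tasks, z=1.0):
    """Return the task pairs whose intervals do not overlap at the given z."""
    flagged = []
    for (name_a, m_a, se_a), (name_b, m_b, se_b) in combinations(tasks, 2):
        lo_a, hi_a = ci(m_a, se_a, z)
        lo_b, hi_b = ci(m_b, se_b, z)
        if hi_a < lo_b or hi_b < lo_a:
            flagged.append((name_a, name_b))
    return flagged

tasks = [("Task A", 0.95, 0.13), ("Task B", 0.25, 0.13), ("Task C", -1.15, 0.07)]
print(gaps(tasks, z=1.0))     # gaps under 68 % CIs
print(gaps(tasks, z=1.96))    # gaps under 95 % CIs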

Table 6 Task measurement report

Is the Difficulty of Student Cards Equal?

As explained in the Method section, each student received a card for either Student A or Student B, and Students A and B formed a pair. We compared the students' ability estimates across the two groups (Student A vs. Student B) and found no significant difference, with a negligible effect size (Student A: M = 0.30, SD = 2.43, n = 95; Student B: M = 0.57, SD = 2.41, n = 95; t = −0.78, df = 187.99, p = 0.44, d = −0.11, 95 % CI = −0.40 to 0.17). Thus, the difficulty levels of the student cards can be considered equal. It should be noted that this result was obtained under conditions in which tasks requiring Student A to speak first were used almost as often as tasks requiring Student B to do so; when teachers select tasks from the task pool, they may need to maintain this balance.
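The comparison reported above (Welch's t test plus Cohen's d on the Rasch ability estimates) can be reproduced schematically as follows; the two simulated arrays are placeholders standing in for the groups' ability estimates, not our data.

# Schematic Student A vs. Student B comparison: Welch's t test and Cohen's d
# on Rasch ability estimates. The simulated arrays are placeholders only.
import numpy as np
from scipy import stats

ability_a = np.random.default_rng(1).normal(0.30, 2.43, 95)   # Student A group
ability_b = np.random.default_rng(2).normal(0.57, 2.41, 95)   # Student B group

t_stat, p_value = stats.ttest_ind(ability_a, ability_b, equal_var=False)  # Welch's t

# Cohen's d with a pooled standard deviation
n_a, n_b = len(ability_a), len(ability_b)
pooled_sd = np.sqrt(((n_a - 1) * ability_a.var(ddof=1) +
                     (n_b - 1) * ability_b.var(ddof=1)) / (n_a + n_b - 2))
cohens_d = (ability_a.mean() - ability_b.mean()) / pooled_sd

print(round(t_stat, 2), round(p_value, 3), round(cohens_d, 2))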

Does the Holistic Rating Scale Function Properly?

We analyzed the functioning of the rating scale based on Bond and Fox (2007). Table 7 indicates that the scale largely satisfied the criteria: There were more than 10 ratings at each level (420–1960). Thresholds, that is, difficulty estimates for choosing one level over the adjacent lower one (e.g., −1.99 for level 2 over level 1), increased as the level increased. The distances between thresholds of neighboring levels were 2.28 and 5.02 logits; the former fell within the recommended range of 1.4 to 5.0 logits, and the latter was only marginally above 5.0, which we considered a minor issue. The probability curve (Fig. 3) showed a clear peak for Level 2, and the level fit statistics were 1.0–1.1, below 2.0. These results accorded well with the patterns predicted by the Rasch measurement model and with Koizumi et al. (in press).

Table 7 Category statistics for the rating scale
Fig. 3

The probability curve of the scale
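The clear peak for Level 2 can also be checked numerically: under the rating scale model, Level 2 is the most probable rating whenever the relative measure (ability minus task difficulty minus rater severity) lies between the two thresholds. The sketch below uses hypothetical thresholds, not our Facets estimates.

# Sketch of category probabilities for a 1-3 rating scale model.
# Threshold values are hypothetical, not the Facets estimates.
import numpy as np

def category_probs(loc, thresholds):
    """P(rating = 1..3) at relative measure loc (ability - difficulty - severity)."""
    steps = np.concatenate(([0.0], np.cumsum(loc - np.asarray(thresholds))))
    numer = np.exp(steps)
    return numer / numer.sum()

thresholds = [-2.0, 0.3]                        # hypothetical Andrich thresholds
grid = np.linspace(-5, 5, 201)
modal = np.array([np.argmax(category_probs(x, thresholds)) + 1 for x in grid])

# Level 2 should be modal (show a clear peak) between the two thresholds.
region = grid[modal == 2]
print(region.min(), region.max())               # approximately -2.0 and 0.3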

How Many Tasks and Raters Are Minimally Needed to Obtain Sufficient Reliability?

In the generalizability (G) study, using generalizability theory (Brennan 2001), we decomposed the test score variance into components attributable to seven sources: (a) persons' ability (the object of measurement), (b) task difficulty, (c) rater severity, (d) the person-by-task interaction, (e) the person-by-rater interaction, (f) the task-by-rater interaction, and (g) the residual, consisting of the person-by-task-by-rater interaction and random error. We treated tasks and raters (in fact ratings, with Rating 1 [scores from Rater 1] and Rating 2 [scores from Raters 2 and 3 combined]) as random facets. This approach is often used for data in which not all raters evaluate all task responses (Lin 2014).
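In the subsequent decision (D) study, these components enter the dependability coefficient Φ (for absolute decisions) and the generalizability coefficient Eρ² (for relative decisions) in the standard way for a p × t × r random design, with n_t tasks and n_r raters:

$$\Phi = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_t + \sigma^2_{pt}}{n_t} + \dfrac{\sigma^2_r + \sigma^2_{pr}}{n_r} + \dfrac{\sigma^2_{tr} + \sigma^2_{ptr,e}}{n_t n_r}}, \qquad
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_{pt}}{n_t} + \dfrac{\sigma^2_{pr}}{n_r} + \dfrac{\sigma^2_{ptr,e}}{n_t n_r}}.$$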

Table 8 shows the percentage of variance explained by each of the seven sources. The largest share was explained by persons (35.81 %), followed by the residual (33.67 %) and, to a lesser degree, the person-by-task (11.71 %) and person-by-rater (11.31 %) interactions. The percentages explained by tasks, raters, and the task-by-rater interaction were marginal, ranging from 0.71 to 4.34 %, which suggests that the tasks and raters had similar levels of difficulty and severity. This appears to contrast with the MFRM results, which showed that task difficulty differed across tasks; however, MFRM does not indicate the size of the impact of tasks and raters on scores, and the G study showed that this impact was limited. The pattern in the G study was almost the same as in Koizumi et al. (in press), except that the percentage for the person-by-task interaction (11.71 %) was larger in the current study than in Koizumi et al. (5.79 %), probably because of the increased number of tasks.

Table 8 Estimated variance component and percentage of variance explained

In the decision (D) study, we investigated how test reliability changes depending on the number of tasks and raters. We used the phi coefficient (Φ), which is appropriate for absolute decisions; generalizability (G) coefficients, for relative decisions, are also presented for interested readers. We employed a criterion of Φ = 0.70 or higher, considering the intended use in low-stakes classroom assessment. Table 9 shows that when one rater evaluates the test, even ten tasks do not lead to sufficiently high reliability; with two raters, at least four tasks are needed; and with three raters, at least three tasks are needed. In classroom assessment, usually only one rater is available, in which case teachers should be aware that paired orals tend to have low reliability and should use as many tasks as possible. When two raters are available, the required number of tasks falls to four, which may be manageable. Koizumi et al. (in press) reported that one task with two raters, or three tasks with a single rater, would produce sufficient reliability. This suggests that when we add tasks to the task bank, we should re-examine the number of tasks and raters needed, because the addition may change the impact of the related factors on test scores.
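A D-study projection of this kind is straightforward to recompute whenever tasks are added to the bank. The sketch below implements the Φ and Eρ² formulas given earlier; the variance components are hypothetical placeholders chosen only to mirror the relative sizes in Table 8, and the GENOVA estimates would be substituted in practice.

# D-study sketch for a p x t x r design: project phi and the generalizability
# coefficient for different numbers of tasks (n_t) and raters (n_r).
# The variance components are hypothetical placeholders, not the Table 8 estimates.
var = {
    "p": 0.36, "t": 0.02, "r": 0.01,
    "pt": 0.12, "pr": 0.11, "tr": 0.04, "ptr_e": 0.34,
}

def phi(n_t, n_r):
    """Dependability coefficient (absolute decisions)."""
    error = ((var["t"] + var["pt"]) / n_t +
             (var["r"] + var["pr"]) / n_r +
             (var["tr"] + var["ptr_e"]) / (n_t * n_r))
    return var["p"] / (var["p"] + error)

def g_coefficient(n_t, n_r):
    """Generalizability coefficient (relative decisions)."""
    error = var["pt"] / n_t + var["pr"] / n_r + var["ptr_e"] / (n_t * n_r)
    return var["p"] / (var["p"] + error)

for n_r in (1, 2, 3):
    for n_t in range(1, 11):
        if phi(n_t, n_r) >= 0.70:   # criterion used in this study
            print(f"{n_r} rater(s): {n_t} task(s) reach phi >= 0.70")
            break
    else:
        print(f"{n_r} rater(s): phi < 0.70 even with 10 tasks")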

Table 9 Phi coefficients (Φ) and generalizability coefficients (in parentheses) in decision studies (p × t × r design)

Conclusion

We investigated six aspects related to the validity of interpretations based on paired oral scores. We found that our paired oral has a unitary structure; that all tasks and raters fit the Rasch model; that the test tasks had a moderately wide range of difficulty, although with gaps within the range and at the higher and lower ends; that the difficulty of the student cards was equal; that the holistic rating scale functioned properly; and that sufficient reliability required at least four tasks with two raters or three tasks with three raters.

The results obtained in this study were generally positive and in line with expectations at the test development stage. The main unexpected finding was the existence of gaps within and at the higher and lower ends of the difficulty scale; these will be addressed in future research. We will also transcribe the conversations and qualitatively examine the relationships between the linguistic functions each task was intended to elicit and those actually observed. This information, together with the intended construct and the difficulty estimates obtained in the current study, will help us decide what types of tasks should be added to the task bank.

Our results provide teachers with crucial information on how to use paired orals in their classrooms. Moreover, we mainly used multifaceted Rasch measurement (MFRM), along with auxiliary methods (structural equation modeling and generalizability theory), to validate our paired oral. MFRM helped us identify the strengths and weaknesses of our test and suggested improvements. The methods we used should also be useful in other contexts in which test takers, tasks, and raters are involved.