Introduction

A performance test is “a test in which the ability of candidates to perform particular tasks, usually associated with job or study requirements, is assessed” (Davies et al. 1999, p. 144). In the assessment of second languages, tasks are designed to measure learners’ productive language skills through performances which allow candidates to demonstrate the kinds of language use that may be required in a real-world context. For example, a test candidate whose language is being evaluated for entry into an English-speaking university or college might be asked to write a short academic essay, or an overseas-qualified doctor might participate in a job-specific role play with an interviewer playing the part of a “patient.” These kinds of assessments are increasingly used in specific workplace language evaluations and in educational contexts to evaluate language gains during a period of teaching.

The relationship between task and performance testing is a complex one. In the context of language testing and assessment, performance assessment has become increasingly important over the last three decades and has been the focus of substantial empirical investigation. Performance-based assessments can be more or less specific in terms of the language skills they are designed to assess. Tests such as IELTS or TOEFL are large-scale, high-stakes tests designed to evaluate largely academic language skills, while others have proved valuable for assessing candidate performance in specific vocational contexts (e.g., the Occupational English Test, which is used for assessing the language skills of overseas-trained medical professionals prior to accreditation in Australia).

The role of tasks in performance-based assessments has recently attracted considerable attention, from both a theoretical and a practical perspective. There is, however, little agreement about where “task-based language assessment” sits in relation to language testing more generally; Bachman (2002) uses the term “task-based language performance assessment” (TBLPA), while others (e.g., Norris 2002; Mislevy et al. 2002) refer more generally to task-based language assessment, or TBLA. Brown et al. (2002) define task-based language testing as a subset of performance-based language testing. They distinguish clearly between performance-based testing, in which tasks are merely a vehicle for eliciting language samples for rating, and task-based performance assessment, in which tasks are used to elicit language that reflects the kinds of real-world activities learners will be expected to perform and in which the focus is on interpreting learners’ abilities to use language to perform such tasks in the real world.

Early Developments

Performance assessments have been used for the evaluation of second languages for at least half a century. McNamara (1996) argues that their development has been the result of two factors. The first stemmed from the need to evaluate the language of second-language learners entering English-speaking universities and to ascertain the language abilities of second-language learners entering specific workplace contexts (e.g., doctors, nurses, flight controllers, pilots, teachers, tour guides). The second has resulted from the increasing focus in second-language learning and teaching on communicative language ability, with its emphasis on the ability to use language communicatively and appropriately in different contexts. Bachman’s (1990) model of language proficiency, further developed in Bachman and Palmer (1996), with its focus on learners’ abilities to use language, has been hugely influential in setting the agenda for research into task- and performance-based language assessments. For test candidates, this trend toward task- and performance-based assessment means that they are evaluated on a much greater range of language skills than those traditionally measured by more discrete, paper-and-pen-based tests. Thus, second-language task and performance assessments have evolved in parallel with increasingly multicomponential models of language ability, and more communicative approaches to language learning and teaching have necessitated the assessment of language in use, rather than language as object. Building on Bachman’s (1990) model of language ability, Bachman and Palmer (1996) articulate a detailed framework of task characteristics intended as the basis for both test design and test-related research. These characteristics cover the setting, the test rubrics, the input to the task (both in terms of format and language input), the expected response (again in terms of format and language), and the relationship between the input and the response.

Second-language performance assessments can be conducted in a variety of contexts. One option is in situ (e.g., in the classroom, in the workplace) through observation. McNamara (1996, following Slater 1980 and Jones 1985) calls this a “direct assessment,” since the language behavior is evaluated in the context in which it is used. Alternatively, second-language performance may be evaluated through simulations of real-world performance, i.e., tasks tailor-made for the particular communicative purpose of the assessment. McNamara (1996) argues that two factors distinguish second-language performance tests from traditional tests of the second language: the fact that there is a performance by the candidate and the fact that this performance is judged using an agreed set of criteria. Norris et al. (1998) add a third criterion, arguing that the tasks used in performance assessments should be as authentic as possible.

McNamara (1996) draws a distinction between strong and weak forms of second-language performance assessment, based on the criteria used for judging the performance. In the “strong” sense, assessment is made on the basis of the extent to which the task itself has been achieved, with language being the means for fulfilling the task requirements rather than an end in itself. In the “weak” sense, the focus of the assessment is less on the task and more on the language produced by the candidate, with the task serving only as the medium through which the language is elicited; successful performance of the task itself is not the focus of the assessment. This distinction is revisited in the later work of Brown et al. (2002, pp. 9–11), who use the term performance-based testing where tasks are used to elicit language samples for the purposes of rating – “weak” performance assessments in McNamara’s terms – and task-based performance assessment where tasks are used to elicit language that reflects the kinds of real-world activities learners will be expected to perform and where the focus is on interpreting the learner’s ability to perform such tasks in the real world (p. 11) – “strong” performance assessments in McNamara’s terminology. This provides two very different ways of defining the construct. In the “weak” version, the construct is defined as language ability. In the “strong” version, it includes everything which might contribute to the successful completion of the task, which means that a wider range of confounding factors, including task characteristics and test takers’ interactions with them, may affect score interpretation and use.

Major Contributions

In the second-language acquisition (SLA) literature, the properties and characteristics of tasks, and the different conditions under which they can be administered, have been the subject of intense scrutiny. A major focus of this research has been on how learners manage the differential cognitive load associated with different types of tasks and the extent to which these varying conditions and characteristics influence learner productions (see, e.g., Foster and Skehan 1996; Skehan and Foster 1997; Ellis 2003; Yuan and Ellis 2003; Ellis and Yuan 2004; Robinson 2007; Tavakoli and Foster 2008). A range of variables has been systematically investigated, including the conditions under which tasks are administered, i.e., those conditions external to the task itself. The task condition which has received considerable attention is the provision, or not, of varying amounts of planning time (see, e.g., Ellis 2005). The internal characteristics of tasks have also attracted substantial attention. In particular, the series of studies by Foster and Skehan (1996, 1999) and Skehan and Foster (1997, 1999) indicate that different task characteristics (e.g., dialogic versus monologic, structured versus unstructured, simple versus complex in outcome) have differential impacts on measures of fluency, complexity, and accuracy in learners’ discourse (Skehan 2001). Much of this work has been motivated by information-processing models of second-language acquisition (see Skehan 1998) and has used detailed analyses of elicited discourse (written or spoken) to evaluate changes in measures of complexity, accuracy, and fluency which might result from different task conditions and characteristics.
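
The discourse-analytic measures referred to above are typically operationalized as simple ratios computed over transcribed, analyst-coded performances. The sketch below is a minimal illustration of how complexity, accuracy, and fluency (CAF) indices of this general kind might be computed from a pre-segmented transcript; the particular counts and ratios chosen here are illustrative assumptions, not the measures used in any specific study cited above.

```python
from dataclasses import dataclass

@dataclass
class CodedTranscript:
    """A transcribed task performance, already segmented and error-coded by an analyst."""
    words: int              # total words produced
    clauses: int            # total clauses
    as_units: int           # Analysis of Speech (AS) units
    error_free_clauses: int
    speaking_time_sec: float

def caf_indices(t: CodedTranscript) -> dict:
    """Illustrative complexity, accuracy, and fluency ratios for one performance."""
    return {
        # Syntactic complexity: mean number of clauses per AS-unit
        "complexity": t.clauses / t.as_units,
        # Accuracy: proportion of clauses containing no errors
        "accuracy": t.error_free_clauses / t.clauses,
        # Fluency: speech rate in words per minute
        "fluency_wpm": t.words * 60 / t.speaking_time_sec,
    }

# Hypothetical comparison: the same learner under unplanned vs. planned conditions
unplanned = CodedTranscript(words=180, clauses=30, as_units=22,
                            error_free_clauses=18, speaking_time_sec=120)
planned = CodedTranscript(words=210, clauses=36, as_units=24,
                          error_free_clauses=27, speaking_time_sec=120)
print(caf_indices(unplanned))
print(caf_indices(planned))
```

Comparing such indices across conditions (for instance, with and without planning time) is the basic logic behind the studies described above, although operational research designs use far more fine-grained measures and statistical testing.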

In relation to performance testing and assessment, the need to link test tasks to theoretical models of cognition and language learning is evident in Mislevy, Steinberg, and Almond’s (2003) “evidence-centered” approach to designing assessments and in Kane’s (2006) highly influential argument-based approach to test validation. Studies have focused on exploring how different task properties might impact on candidate performance in the context of classroom-based assessment practice and in relation to high-stakes assessments, such as TOEFL and IELTS. The approach taken by many of these studies has been to evaluate the learner performances on two levels – externally through rating and internally through analyses of candidate discourse.

Task-based performance assessments have proved particularly valuable in teaching programs because they can be linked to teaching outcomes, provided outcomes are defined in terms of task fulfillment rather than purely in terms of language ability. A further consequence is that well-designed assessment tasks have the potential to provide positive washback into the classroom. However, the issues raised by the use of tasks for these types of assessments are considerable. Brindley and Slatyer (2002) examined the effect of varying the characteristics and conditions of listening assessment tasks used in the context of an outcome-based reporting system in which teachers themselves develop tasks for assessment purposes, and Wigglesworth (2001) undertook a similar investigation of speaking tasks by manipulating a series of task conditions and characteristics. Both studies found small effects as a result of manipulating the variables but also pointed out that interaction effects impact on the variables in ways which are difficult to separate. Such studies, which systematically manipulate different task variables, are of crucial importance since teachers are often involved in the development of assessment tasks and must understand how these work in order to produce comparable and defensible judgments of students for classroom assessment purposes.

In the high-stakes testing context, the impact of task properties and characteristics on performance has been investigated in a series of studies which used test scores to investigate potential differences (e.g., Lee 2006), as well as measures of complexity, accuracy, and fluency to determine whether finer distinctions imperceptible to raters are marked in the candidate discourse (see, e.g., Iwashita et al. 2001; Elder et al. 2002; Wigglesworth 1997; Brown et al. 2005; Elder and Wigglesworth 2005). The general finding of these studies has been that raters perceive no differences and that very few, if any, differences are detected in the discourse. Given the testing focus, task difficulty has necessarily been a particular concern of these studies, since for testing purposes it would be useful to be able to design tasks of predictable levels of difficulty which can be manipulated to elicit appropriate performances across candidates. Norris et al. (1998) and Brown et al. (2002) provide a comprehensive empirical investigation of the problems of comparability of real-world performance tasks by systematically manipulating three cognitive processing variables (code complexity, cognitive complexity, and communicative demand) in a series of test tasks. In summarizing their findings in relation to task difficulty, Norris et al. (2002, p. 414) point out the importance of individual responses to tasks, which may impact on measures of task difficulty. They argue that:

initial evidence from this study did not support the use of the cognitive processing factors – as operationalized in our original task difficulty framework – for the estimation of eventual performance difficulty differences among test tasks. While there was some indication that average performance levels associated with the three cognitive task types differed in predicted ways, these differences did not extend to individual tasks. What is more, evidence suggests that examinees may have been responding to tasks in idiosyncratic ways, in particular as a result of their familiarity with both task content and task procedures.

Elder et al. (2002) asked candidates about their perceptions of task difficulty and found that they, too, were unable to estimate the difficulty of a task even after they had performed it. Indeed, Bachman (2002) argues that the complex nature of task performances, which involve large numbers of interactions (e.g., between candidate and task, task and rater, candidate and interlocutor), means that task difficulty cannot be conceptualized as a separate factor. Specifically in relation to speaking tests, Fulcher and Reiter (2003) question the assumptions that underlie SLA approaches to conceptualizing task difficulty in terms of particular task conditions and characteristics, suggesting instead that difficulty is more likely explained by interactions between the pragmatic features of tasks and the first-language background of test takers.

While both writing and speaking performance test tasks need to be subjectively rated, with all the rater variability that this entails, performance testing in the assessment of speaking skills brings the additional variable of the interlocutor. As Brown (2003) shows, the same candidate can produce qualitatively different performances when interviewed by different interviewers, and this may mean that raters interpret the candidate’s performance differently. Other studies in which raters evaluated not only the candidate but also the interlocutor’s performance (e.g., Morton et al. 1997; McNamara and Lumley 1997; Davis 2009; May 2009) have found that raters tend to compensate for what they view as deficient interviewer behavior. Studies by Ducasse and Brown (2009) and Galaczi (2014) suggest that interactional features beyond topic development and organization – such as listener support strategies or interactional listening, turn-taking behaviors, and interactional management – should be included in rating scales.

Another aspect of a task which may influence test scores is the nature of the rating scale used to judge the performance. Since these judgments are by nature subjective, they require well-defined rating scales. Rating scales consist of a set of criteria against which a performance can be judged. They are necessarily limited in scope because no rating scale can attend to all possible aspects of performance; choices must therefore be made about what to rate (e.g., intelligibility, accuracy, complexity, clarity) and about what proportion of the score to allocate to each rating criterion – in other words, some criteria may be weighted more heavily than others. Rating scales need to be designed to allow accurate judgments of the speech or writing samples elicited and need to be valid in terms of the relevant language construct. Rating scales may rate task performance globally, based on a holistic impression, or analytically, on a feature-by-feature basis. Knoch (2009) compared two rating scales: a holistic scale consisting of general descriptors and an analytic scale consisting of detailed, empirically derived descriptors. She found that the latter was associated with higher rater reliability and was preferred by raters. Fulcher et al. (2011) distinguish between two broad approaches to rating scale design and development: measurement-driven approaches, whereby descriptors are ordered in a linear fashion on a single scale, and performance data-driven approaches, whereby descriptors are empirically derived. They argue that the latter approach provides richer and more meaningful descriptions of performances.
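
To make the weighting decision concrete, the sketch below shows one way per-criterion band scores might be combined into a single analytic score. The criteria, weights, and 0–9 band range are hypothetical assumptions for illustration only; operational scales differ in their categories and in how, or whether, they weight them.

```python
# Hypothetical analytic rating scale: the criteria, weights, and 0-9 bands
# are illustrative assumptions, not those of any operational test.
CRITERIA_WEIGHTS = {
    "fluency": 0.25,
    "accuracy": 0.25,
    "complexity": 0.20,
    "intelligibility": 0.30,
}

def weighted_score(bands: dict[str, float]) -> float:
    """Combine per-criterion band scores into a single weighted overall score."""
    assert abs(sum(CRITERIA_WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(CRITERIA_WEIGHTS[c] * bands[c] for c in CRITERIA_WEIGHTS)

# One rater's analytic judgments of a single speaking performance
bands = {"fluency": 6, "accuracy": 5, "complexity": 6, "intelligibility": 7}
print(round(weighted_score(bands), 2))  # 6.05
```

The design choice embedded in the weights is exactly the choice the paragraph above describes: deciding, for instance, that intelligibility should count for more of the final score than syntactic complexity is a construct decision, not a statistical one.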

Rating scales can only ever guide human judgments, however, and decisions between raters may vary widely, with potential consequences for test fairness. It is now widely acknowledged that raters differ both in their self-consistency and in their severity (Upshur and Turner 1999; Huhta et al. 2014; Granfeldt and Malin 2014), as well as in the way they construe the different elements of the rating scale (Lumley 2002; Harding et al. 2011; Kuiken and Vedder 2014). Rater training thus becomes a critical component of task-based performance assessment. While rater training may ideally aim to reduce differences in severity across raters, where this is not achievable, training needs to ensure that raters discriminate consistently in terms of severity across different levels of performance. Because of these inherent differences in rater severity, best practice in assessment advocates double rating, or even multiple ratings in the event of discrepancy between pairs. Statistical analyses of scores can then be used to gain a greater understanding of how different raters behave or to compensate for individual rater differences.
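
Operationally, rater effects are usually examined with techniques such as multi-faceted Rasch analysis or generalizability theory. The sketch below is a much simpler, purely illustrative procedure: it averages double ratings, flags discrepant pairs for a third rating, and estimates each rater’s severity as their mean deviation from the pooled mean. The discrepancy threshold and the data are hypothetical, and the severity index shown confounds rater and candidate effects in ways that the formal models are designed to avoid.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical double ratings: (candidate, rater, score on a 0-9 band scale)
ratings = [
    ("cand_01", "rater_A", 6.0), ("cand_01", "rater_B", 7.0),
    ("cand_02", "rater_A", 5.0), ("cand_02", "rater_C", 5.5),
    ("cand_03", "rater_B", 8.0), ("cand_03", "rater_C", 6.0),
]

DISCREPANCY_THRESHOLD = 1.5  # illustrative cut-off for requiring a third rating

def resolve_scores(ratings):
    """Average paired ratings; flag pairs whose difference exceeds the threshold."""
    by_candidate = defaultdict(list)
    for cand, _, score in ratings:
        by_candidate[cand].append(score)
    final, flagged = {}, []
    for cand, scores in by_candidate.items():
        if max(scores) - min(scores) > DISCREPANCY_THRESHOLD:
            flagged.append(cand)          # send to a third rater
        else:
            final[cand] = mean(scores)
    return final, flagged

def rater_severity(ratings):
    """Crude severity index: each rater's mean deviation from the overall mean.

    Note: because raters see different candidates, this simple index mixes
    rater severity with candidate ability; it is illustrative only.
    """
    overall = mean(score for _, _, score in ratings)
    by_rater = defaultdict(list)
    for _, rater, score in ratings:
        by_rater[rater].append(score - overall)
    return {rater: round(mean(devs), 2) for rater, devs in by_rater.items()}

final, flagged = resolve_scores(ratings)
print(final)    # averaged scores for unproblematic pairs
print(flagged)  # ['cand_03'] -- the 2.0-band gap exceeds the threshold
print(rater_severity(ratings))
```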

Work in Progress

A central tenet of task-based language assessment is that tasks are designed to represent authentic activities which test candidates might be expected to encounter in the real world outside the classroom. In particular, as Douglas (2000) points out, authenticity is central to the assessment of language for specific purposes and is part of what differentiates it from more general types of language testing. This is because a “specific purpose language test is one in which test content and methods are derived from an analysis of a specific purposes target language use situation, so that test tasks and content are authentically representative of tasks in the target situation” (p. 19). However, the issue of authenticity is not a trivial one, and the extent to which specific tasks can represent authentic real-world activity has attracted considerable debate and empirical investigation using a variety of approaches (see, e.g., Cumming et al. 2004; Lewkowicz 2000; Spence-Brown 2001; Wu and Stansfield 2001).

While performance-based tests have traditionally focused on measuring the four core language skills (speaking, writing, listening, and reading) independently, efforts to better simulate real-world task demands, and thereby enhance authenticity, have led to the development and use of integrated speaking and writing tasks (e.g., in the TOEFL Internet-based test (iBT)). Integrated tasks require test takers to read or listen to source texts and to incorporate information from these texts into their speaking or writing test performances (Lewkowicz 1997). In addition to enhancing the authenticity of the tasks, integrated tasks also reduce the advantage enjoyed by candidates who are more familiar with the topic than others, since a common source of input is provided.

Existing research into the use of integrated writing tasks has examined how writers make use of source material when responding to integrated tasks (e.g., Cumming et al. 2006; Plakans 2009; Weigle and Parker 2012), as well as the discourse produced by students across different score levels on the writing section of the TOEFL iBT (Gebril and Plakans 2013; Plakans and Gebril 2013). Studies addressing the use of integrated tasks as a measure of speaking ability have examined test takers’ strategic behaviors (Barkaoui et al. 2013), rater orientations to integrated tasks (Brown et al. 2005), the impact of task type on test scores (Lee 2006), and the way in which test takers incorporate source materials into spoken performances (Brown et al. 2005; Frost et al. 2012). In a recent study, Crossley et al. (2014) examine the interaction between test takers’ spoken discourse, characteristics of the task and stimulus materials, and rater judgments of speaking proficiency on a listening-speaking task from the TOEFL iBT. They found that the integration of source text words into spoken performances was predicted by three word properties: the incidence of word occurrence in the source text, the use of words in positive connective clauses, and word frequency in the source text. They also found that the incidence of source text words in spoken responses was a strong predictor of human judgments of speaking quality.
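
A worked example of the general kind of lexical overlap measure at issue here is sketched below: it computes the proportion of content-word tokens in a spoken response that also occur in the source text. The tokenization, the small stop-word list, and the ratio itself are simplified assumptions and do not reproduce the indices used by Crossley et al. (2014).

```python
import re

# Illustrative stop-word list; operational text-analysis tools use larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
              "was", "were", "that", "this", "it", "for", "on", "with"}

def content_words(text: str) -> list[str]:
    """Lower-case word tokens with stop words removed."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def source_overlap(source: str, response: str) -> float:
    """Proportion of content-word tokens in the response that occur in the source text."""
    source_vocab = set(content_words(source))
    response_tokens = content_words(response)
    if not response_tokens:
        return 0.0
    shared = sum(1 for t in response_tokens if t in source_vocab)
    return shared / len(response_tokens)

# Hypothetical listening-speaking task: source passage and a candidate's response
source = "The lecture explains how urban green spaces reduce summer temperatures in cities."
response = "The speaker says green spaces in cities can reduce temperatures during summer."
print(round(source_overlap(source, response), 2))  # 0.6
```

Measures of this kind can then be related to rater judgments of speaking quality, which is the logic underlying the finding reported above that source-word incidence predicts human scores.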

Problems and Difficulties

While there is broad agreement that task authenticity is desirable in performance testing and assessment (e.g., Bachman and Palmer 1996; Douglas 2000; Norris et al. 1998; Brown et al. 2002), the extent to which the language elicited by particular test tasks supports inferences about candidates’ ability to manage such tasks in subsequent real-world contexts is not fully resolved.

The concerns that need to be addressed in relation to authenticity center on the generalizability of the outcome. In the “weak” view of language testing, where the concern is with underlying language abilities, a criterion of task fulfillment may not be considered of great importance. In the “strong” view of performance testing, a task designed to assess the ability of candidates to carry out an activity in a real-world setting would need to be assessed against a criterion of task fulfillment rather than, for example, for its linguistic accuracy. An unresolved issue here is who should decide whether the task has been carried out successfully – language specialists or specialists in the field of the task activity? The gap between linguistic criteria and the aspects of communication valued by professionals in the workplace, for example, is widely acknowledged. A number of studies have examined this issue (e.g., Elder and Brown 1997; Brown 1995; Elder 1993; Elder et al. 2012; Knoch 2014; Kim and Elder 2015), but the question remains one of balancing authenticity and generalizability. While the “weak” view is likely to assess underlying language skills in ways which are relatively broadly generalizable, the “strong” view is likely to produce judgments which are more authentic and relevant to the real-life situations toward which the candidate may be moving. These judgments about the quality of performance may not, however, be replicable in other contexts.

Task-based performance testing is attractive as an assessment option because its goal is to elicit language samples which measure the breadth of candidates’ linguistic ability and because it aims to elicit samples of communicative language (language in use) through tasks which replicate the kinds of activities candidates are likely to encounter in the real world. As a test method, however, it remains one of the most expensive approaches to assessment and, in terms of development and delivery, one of the most complex. There is also the potential for reduced generalizability: tasks used in such assessments tend to be complex and context specific, which means that inferences based on them may not always extrapolate to the domains they are intended to represent. An additional difficulty is that of replicating tasks in a way which ensures consistency of measurement.

Future Directions

The development of appropriate tasks for use in performance assessment must be underpinned by an understanding of how the tasks relate to the construct and of which factors may potentially interfere with their validity and reliability. There is currently only a relatively limited body of empirical research which systematically examines the types of tasks used in task- and performance-based assessments and which can illuminate how different tasks work for assessment purposes. The complex nature of tasks, and their relationship to real-world performances, makes it crucial that we understand more about how the various elements of a task that impact on candidate performance interact.

Performance on integrated tasks, for example, requires candidates to engage skills and strategies that may extend beyond language proficiency in ways that can be difficult to define and measure for testing purposes. As Douglas (1997) and Lee (2006) have noted, test taker performances on integrated tasks involve not only productive skills but also comprehension skills, and the ways in which these dimensions of language ability are integrated by test takers into their language performances remain, as yet, largely a matter of intuition for test developers. Furthermore, while it is well known that stimulus materials affect test performance, the way in which test takers make use of these materials in their responses – particularly the strategies involved in summarizing and incorporating content from written and oral texts into speaking performances – is not well understood and requires further empirical investigation.

Testing is a socially situated activity, although the social aspects of testing have been relatively underexplored (but see McNamara and Roever 2006). Testing and assessment activities take place in a social context, and this is particularly the case with task- and performance-based assessment. In speaking assessments, the interlocutor has a crucial role to play. However, while the interlocutor is often a trained interviewer, this role may also be taken by another test candidate or a group of test candidates. In relation to paired and group test activities, a whole raft of variables is ripe for exploration, since “we can hypothesize that the sociocultural norms of interaction … contribute significantly to variability in performance” (O’Sullivan 2002, p. 291). The extent to which they contribute in systematic ways to how tasks are interpreted and undertaken is yet to be determined.
