Introduction and Early Developments

In most countries worldwide, individuals are subject to tests, whether to enter educational programs, to pass from one level to the next, or to be granted certificates to practice professions. Tests determine whether students will be allowed to enter high schools and higher education and in many cases even kindergartens and elementary schools. In schools, classroom tests are used in all subjects and grades and have an effect on students’ status in their classrooms as well as on their identities and self-concepts. Tests are used by teachers as disciplinary tools to control students’ behaviors and the curricula and to upgrade the status and prestige of specific topics and subjects. High-stakes tests lead to rejections and acceptances, to winners and losers, and to successes and failures and hence have an impact on people’s lives. For adult immigrants, tests determine whether they will be granted permission to immigrate and to obtain citizenship in the countries they have moved to or in which they seek asylum.

Critical language testing (CLT) originated from a focus on the uses of language tests and the realization of their enormous power to influence education, societies, and even the status of nations as a result of performances on international tests. It is the power of tests and the detrimental decisions they bring about that grants them such status in society that people change their behavior in order to succeed on them (Shohamy 2001a). It is this very power that leads decision makers and those in authority to introduce tests: they know that once a high-stakes test is introduced, principals will most likely start imposing the teaching of the tested topics, even if the curriculum has not changed, and students will be forced to learn them. Hence, stakeholders change their behaviors in an intensive effort to achieve high scores. In fact, in many schools the content included in these tests becomes the de facto curriculum and often overrides the written curriculum that already exists, as those who introduce tests often have different educational agendas (Cheng 2004; Cheng and Curtis 2004 and others).

Two examples demonstrate the phenomenon. The first arises in the context of migration (Extra et al. 2009), where adult immigrants moving to a new country are required to take tests of their proficiency in the language used in the new location as a condition for citizenship and residence. At times these tests are administered while applicants are still in their home countries and thus restrict the number of immigrants. Governments implement language testing regimes as a way to control the number of immigrants they allow to enter the country and/or of those who can stay there. In most nations nowadays immigrants are required to pass a test in the main official language of the country. This policy does not originate from research findings demonstrating that proficiency in national languages is relevant for functionality; still, language tests become the tool for screening, leading to decisions as to whether immigrants are allowed to stay in the country or are forced to leave. The policy also ignores situations in which immigrants are at an age at which they are incapable of learning the new language, cannot read or write in their own language, or have no learning opportunities such as language courses where they can learn the new language (McNamara and Shohamy 2008; Shohamy and Kanza 2009). It is also known that many immigrants tend to be employed in their own communities and are very comfortable using their home languages, which are functional for them in most domains of everyday life. The test is thus used primarily as a tool to screen immigrants, which has brought enormous criticism of the ethicality of these types of tests, as they are used for purposes for which they were not intended. The children of immigrants usually acquire the new language relatively fast in comparison to their parents because they are schooled in that new language as a medium of instruction, although this too takes a long time (Levin and Shohamy 2008), as will be reported below.

The second case is the testing of immigrant and minority school students who lack high proficiency in the power language which is the medium of instruction in schools. In this case students are required to take standardized tests, as mandated by national policies, after only a short time in the country. Research shows that it takes immigrants about 10 years to acquire a new language (Collier and Thomas 2002; Valdés et al. 2015; Levin and Shohamy 2008); yet while they are still in the process of learning the new language, they are tested in school content areas via that language. Given that the students are not yet proficient in the language, they often fail these tests in the different academic subjects and become marginalized and discriminated against by their teachers and peers (Levin and Shohamy 2008; Levin et al. 2003).

In both of the above-described cases, language testing policies are used as disciplinary tools, given that test takers have no choice but to comply with the policy demands. While test takers and regional educational systems comply with such disciplinary demands, they also resent them, as they feel the demands were imposed on them without their voice being heard. It is the powerful uses of tests, their detrimental effects and their uses as disciplinary tools, that are responsible for the strong feelings that tests evoke in test takers. The essence of CLT is the raising of critical questions about testing policies, their impact and consequences, and the intentions behind the introduction of these tests.

Major Contributions

A social perspective. The use of tests for power and control was argued convincingly by Foucault. In Discipline and Punish: The Birth of the Prison (1979) Foucault stated that examinations possess built-in features that enable them to be used for exercising power and control. Specifically, he mentions that tests serve as a means for maintaining hierarchies and normalizing judgment. They can be used for surveillance, to quantify, classify, and punish. Their power lies in their capacity to differentiate among people and to judge them. Tests consist of rituals and ceremonies along with the establishment of truth, all in the name of objectivity, as Foucault puts it:

The examination combines the techniques of an observing hierarchy and those of a normalizing judgement. It is a normalizing gaze, a surveillance that makes it possible to qualify, to classify and to punish. It establishes over individuals a visibility through which one differentiates them and judges them. That is why, in all the mechanisms of discipline, the examination is highly ritualized. In it are combined the ceremony of power and the form of the experiment, the deployment of force and the establishment of truth. At the heart of the procedures of discipline, it manifests the subjection of those who are perceived as objects and the objectification of those who are subjected. (p. 184) (my emphasis)

In his biography of Foucault, Eribon (1992) provides evidence of Foucault’s personal experiences of and sufferings from tests, which made him a “test victim” who failed high-stakes tests himself. References are made to situations in which tests played detrimental roles in his own life, possibly giving him special insight into the uses of tests as disciplinary tools. Foucault (1979) also noted that it is only in the twentieth century that testers made tests “objective unobtrusive” messengers, while in the past testers had to face test takers directly and to share the responsibility for the testing verdict.

The notion that tests represent a social technology was introduced by Madaus (1990) as an extension of the uses of tests as disciplinary tools. He claimed that tests are scientifically created tools that have historically been used as mechanisms for control and that their power is deeply embedded in education, government, and business. The test functions as a social technology in that it not only imposes behaviors on individuals and groups but also defines what students are expected to learn and know and can therefore be referred to as “de facto curriculum.” It therefore guaranteed the movement of knowledge from the teacher to the pupil, but it extracted from the pupil a knowledge destined and reserved for the teacher.

Bourdieu (1991) claimed that tests serve the needs of certain groups in society to perpetuate their power and dominance; thus, tests were rarely challenged. Tests have wide support of parents, as they lead to the imposition of social order. For parents who often do not trust schools and teachers, tests provide indication of control and order, especially given their familiarity with tests in their own years of schooling. For many parents tests symbolize control and discipline and are perceived as indications of effective learning. It is often observed that raising the educational standards through testing appeals to the middle classes, partly as it means gaining access to better jobs for their children, and for some it is also a code word for restricting minority access. The paradox is that low-status parents, minorities, and immigrants, who are constantly excluded by tests, have an overwhelming respect for them and often fight against their abandonment.

Hanson (1993) also discusses the power of tests to affect and define people and notes that tests have become social institutions in their own right, taken for granted and rarely challenged. Specifically, while a testing event is only a minute representation of the whole person, tests are used both to define and predict a person’s ability and to keep test takers powerless and often under surveillance. He adds the following:

In nearly all cases test givers are (or represent) organizations, while test takers are individuals. Moreover, test-giving agencies use tests for the purpose of making decisions or taking actions with reference to test takers – if they are to pass a course, receive a driver’s license, be admitted to college, receive a fellowship, get a job or promotion… That, together with the fact that organizations are more powerful than individuals, means that the testing situation nearly always places test givers in a position of power over test takers. (Hanson 1993, p. 19)

The use of language tests as disciplinary tools by powerful political institutions is discussed by McNamara (1998), who notes that tests have become an arm of policy reform in education and vocational training as well as in immigration policies. Such policy initiatives are seen within educational systems as well as in the workforce. A concern for national standards of educational achievement in a competitive global economy, together with a heightened demand for accountability of government expenditures, has propelled a number of initiatives involving assessment as an arm of government educational policy at the national, state, and district levels.

A psychometric perspective. Some psychometricians who themselves develop tests have been critical of them. Most notable is Messick, who was employed at the Educational Testing Service in the USA, a center that develops and researches tests. Messick (1981, 1996) was among those who drew attention to the topic of impact, claiming that tests’ consequences should be incorporated into a broader perspective of a unified concept of validity. He argued that, because social values are associated with the intended and unintended outcomes of the interpretations and uses derived from test scores, the appraisal of the social consequences of tests should be subsumed as an aspect of construct validity (1996, p. 13). Messick (1996) claimed that “[i]n the context of unified validity, evidence of washback is an instance of the consequential aspect of construct validity.” Thus, Messick’s concept of unified validity seems to be the bridge between the narrow range of effects included in washback and the broader one encompassed by “impact” which includes “…evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use… especially those associated with bias… unfairness in test use, and with positive or negative washback effects on teaching and learning” (p. 12). The term “consequences” is used mostly by Messick to encompass washback and construct validity but with a stronger focus on ideological values. This is also how the term is used here to discuss the societal influences of tests in a larger scope. Messick notes that washback is only one form of testing consequence that needs to be weighed when evaluating validity, and testing consequences are only one aspect of construct validity, leading to the term consequential validity.

An additional term often used to refer to the connection between testing and instruction is systemic validity (Frederiksen and Collins 1989), which relates to the introduction of tests into the educational system along with additional variables which are part of the learning and instructional system. In such situations tests become part of a dynamic process in which changes in the educational system take place according to feedback obtained from the test. Similar terms associated with the impact of tests on learning are measurement-driven instruction, referring to the notion that tests drive learning, and curriculum alignment, implying that the curriculum is modified according to test results.

Thus, although psychometricians have developed sophisticated methods for test development and design, in terms of reliability, validity, and the quality of items and tasks, they tend to overlook the important dimension of the consequences of tests. This leads to the need to pose questions that incorporate the consequences, such as: What are the tests being used for? What purposes are they intended for? Do they lead to decisions which are beneficial or harmful for people? Are they meant to evaluate the level of language proficiency or to serve as sanctions for discipline and control? In other words, is a test really a pure measurement of language proficiency, or is it used as a disciplinary tool for other agendas such as selection, expulsion, and differentiation, leading to stigmas about different populations and their rejection from bastions of society?

Language testing perspective. In the book Measured Words, Spolsky (1995) surveyed different language tests in their historical context. He brings up the different agendas that were associated with the TOEFL tests, which were used to prevent people from certain areas of the world from studying in US educational institutions.

Shohamy (2001a) introduced the notion of CLT and focused on three studies which demonstrated how the introduction of high-stakes language tests brought about major changes in the behavior of schools: a test of oral proficiency in English as a second language, which led to teaching to the test; a test in Arabic, which turned classes into preparation for the test, meaning that students studied its exact content; and a national test of reading comprehension, which within a year drastically changed the reading curriculum to texts with multiple questions. In all cases, the curriculum narrowed and the test dominated most activities. Once the test had been administered, however, teaching returned to non-testing activities. A study (Shohamy et al. 1996) examined the effect of these tests several years later and showed that the only meaningful change took place with the high-stakes tests, while in the case of low-stakes tests these changes were totally overlooked. Shohamy and McNamara (2009) critiqued the tests for citizenship. Shohamy (2009) presents a strong argument and a list of reasons against the use of tests for enforcing such policies.

Alderson and Wall (1993) examined a number of hypotheses under which one could expect tests to change the school policy on language learning but found that the tests had very little effect.

Cheng and colleagues (Cheng 2004; Cheng et al. 2011; Cheng and Curtis 2004; Cheng and DeLuca 2011) conducted studies focusing on the washback of high-stakes tests, especially in China but also in Canada and elsewhere. They found that tests had a major impact on teaching. More information about these studies can be found in the chapter “Washback, Impact, and Consequences Revisited” by Tsagari and Cheng in this volume, which examines at least two different types of washback studies, one related to traditional standardized tests and the other in which modified versions of tests are examined as means for achieving a more positive influence on teaching and learning.

Fulcher (2004) critiqued the growing number of rating scales which are expected to provide more accurate scores. Two well-known scales of this kind are the ACTFL scale used in the USA and the CEFR used mostly in Europe and elsewhere. Yet major critiques of these scales have emerged, concerning their linearity and their not being appropriate for all learning settings (see the chapter “The Common European Framework of Reference (CEFR)” by Barni and Salvati in this volume).

Shohamy (2001a) made pleas for developing more democratic views of tests, increasing the responsibility of testers, minimizing the power of tests, protecting test takers, and posing questions about the ethical roles of language testers.

One immediate outcome of the CLT was the development of a Code of Practice to protect test takers. Davies (this volume), who was very attentive to the notion of CLT, examined the professionalism of language testers who design tests and overlook their impacts. He posed ample questions about what it means to be an ethical and professional tester and their responsibilities. Davies served as the chair of the ILTA (International Language Testing Association) committee that developed the Code of Ethics and a Code of Practice to be used by language testers in the development and uses of tests, so testers become aware of their professional and ethical responsibilities. The real aim accordingly was to create tests which are more fair, considerate, constructive, and ethical in terms of their power.

Work in Progress

Over the years, a large number of questions have emerged that fall under the paradigm of CLT. With the introduction of language citizenship tests for immigrants in an expanding number of countries in Europe, Asia, and elsewhere, ample studies have pointed to the harmful effects of those tests. Milani (2008), drawing on protocols about integration in the Swedish Parliament, showed that the debates about the tests carried a tinge of discrimination, given the goal of integrating immigrants into mainstream society. A special issue of the journal Language Assessment Quarterly (Shohamy and McNamara 2009) focused on these tests in a number of countries such as Estonia, Latvia, the UK, the USA, and Israel. A number of comprehensive edited books (Stevenson 2009; Extra et al. 2009) were published as well. Unfortunately, this research did not yield major changes in government policies, and the problem continues as more countries adopt these tests. Recently, Norway joined these countries with the introduction of new citizenship tests as of January 2017. These policies become stricter as the wave of immigration expands in Europe and elsewhere. Likewise in schools, while tests such as the ones mandated by the NCLB Act have ceased to exist in the USA, the new policy of the Common Core represents a new testing policy with higher cognitive demands, thus creating injustices for immigrants (Abedi 2001, 2004; Abedi and Dietal 2004; Shohamy and Menken 2015). Thus, these tests, as Valdés and Poza point out, discriminate against newcomers and minority groups (see their chapter “Assessing English Language Proficiency in the United States” in the present volume).

At the same time, the work on CLT continues (see the chapter “Washback, Impact, and Consequences Revisited” by Tsagari and Cheng in this volume). Indeed, the notion of the “power of tests” puts enormous responsibility on the shoulders of those who wield the tests. Yet there are also new approaches that attempt to respond to the power of tests, to minimize and challenge it, by focusing on tests geared toward more effective learning rather than on tests as tools for punishment. Further, with the changes toward multilingualism, there is more emphasis on the meaning and essence of language in this day and age with regard to globalization, multilingualism, language varieties, and the mixing of languages, so as to include immigrants and minority groups in different types of multilingual tests.

These directions responded to questions such as the following:

  • Do the tests reflect the bi-/multilingual uses of language in this day and age in the context of plurilingual societies?

  • Do tests have realistic goals in terms of their levels of proficiency, considering the dynamic and fluid nature of language?

  • Are the validation procedures based on realistic norms and not on the native speaker?

  • Do they consider all components that contribute to performance, beyond language per se?

  • Are we ethical when we design tests based on definitions and goals provided by central agencies?

  • Are language tests open to monitoring by society, critiqued, and sanctioned?

  • How can immigrants and minority groups be included in spite of their language proficiency, given that they are educated, talented, good people who have difficulties with language or take a long time to acquire it?

  • How can immigrants who have to pay large amounts of money for language courses get resources that will help them learn the languages? And is the “almost” native speaker a realistic goal for all people, regardless of age, background, etc. (see the chapter “Assessing English as a Lingua Franca” by Jenkins and Leung, which discusses the ELF variety that most nonnative speakers of English use, as well as the chapter “High-Stakes Tests as De Facto Language Education Policies” by Menken, both in this volume)?

  • Do all immigrants need to pass a language test when there is no evidence that knowledge of “the” language necessarily contributes to good citizenship? That is, how can we minimize the use of language tests and prevent them from being a major tool in creating immigration policy?

In the section below, a list of additional new initiatives which can defy or minimize the power of tests will be briefly described.

Responses: Minimizing and Resisting Power

The topics below include assessment strategies, initiatives, and practices which can lead to more positive outcomes, with tests that are fairer, more just, and above all educational. These go beyond standardized tests into language testing that proposes means for directing tests toward learning and away from judgment.

Dynamic assessment: An approach whereby testing and teaching are connected, which minimizes the power of tests. It is based on Vygotsky’s sociocultural theory, in which the emphasis is on the use of tests for learning (see the chapter “Dynamic Assessment” by Poehner, Davin, and Lantolf in this volume; see also Levi 2015, 2016).

Assessment literacy broadens basic knowledge about assessment, including CLT, and focuses on the consequences and impact of tests, which need to be addressed as part of language testing along with other factors (see the chapters “Language Assessment Literacy” and “Training in Language Assessment” by Inbar-Lourie and Malone in this volume).

Test accommodation and differential item functioning (DIF) provide tools for assisting immigrants and minority groups who are not yet familiar with the new language, thereby enhancing their achievement in academic and content subjects, especially in the early years of migration while they are learning the new language (see the chapter “Utilizing Accommodations in Assessment” by Abedi in this volume). Further, DIF is a technique for identifying the test items and tasks which discriminate against students of different backgrounds. Removal of such items and tasks results in tests which are fairer to a larger pool of test takers.
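
To make the idea of DIF concrete, the following is a minimal sketch of one widely used procedure, the Mantel-Haenszel method; it is offered as an illustration only and is not necessarily the procedure used in the studies cited here. Test takers are matched on total score, and within each score level the odds of answering a given item correctly are compared between a reference group and a focal group (for instance, immigrant students). The function names, the group labels, and the toy data are all assumptions introduced for this example.

```python
from collections import defaultdict
from math import log


def mantel_haenszel_odds_ratio(responses):
    """Common odds ratio for one item across total-score strata.

    `responses` is an iterable of (group, total_score, item_correct) tuples,
    where group is "reference" or "focal" (hypothetical labels).
    Returns None when the ratio is undefined (zero denominator).
    """
    # Per stratum: [reference right, reference wrong, focal right, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, total_score, item_correct in responses:
        offset = 0 if group == "reference" else 2
        strata[total_score][offset + (0 if item_correct else 1)] += 1

    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata.values():
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator if denominator else None


def mh_d_dif(odds_ratio):
    """Delta-scale index (odds_ratio must be positive).

    Negative values indicate that the item is harder for the focal group
    than for matched members of the reference group.
    """
    return -2.35 * log(odds_ratio)


if __name__ == "__main__":
    # Made-up responses to a single item, for illustration only.
    toy = (
        [("reference", 1, True)] * 3 + [("reference", 1, False)]
        + [("focal", 1, True)] + [("focal", 1, False)] * 3
    )
    ratio = mantel_haenszel_odds_ratio(toy)
    print(ratio, mh_d_dif(ratio))
```

An odds ratio close to 1 suggests that matched test takers from both groups perform similarly on the item, whereas a ratio well above 1 flags the item for review as potentially disadvantaging the focal group. In practice such a statistic would be accompanied by a significance test and by expert judgment before any item is removed.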

Formative/alternative assessment attempts to develop assessment strategies which are more constructive than standard external items, often developed by local agents at the schools and not by central agencies (see the chapter “Task and Performance-Based Assessment” by Wigglesworth and Frost and the chapter “Using Portfolios for Assessment/Alternative Assessment” by Fox, in this volume).

Multilingual/translanguaging and ELF tests. An approach built on the nature of the language construct as it is viewed today, where languages are mixed and people use them in very creative ways. Shohamy (2011) demonstrated that the use of multilingual tests in testing the mathematics of immigrant students in Israel (in Hebrew and Russian on the same test) resulted in higher mathematics scores than those of students who were tested with monolingual (Hebrew) tests. A case in point is English, where assessment can move away from the concept of the native speaker (see the chapter “Assessing Multilingual Competence” by Lopez, Turkan, and Guzman-Orth and also the chapter “Assessing English as a Lingua Franca” by Jenkins and Leung in this volume).

Tests for indigenous contexts. Given that testing grants importance to languages and sends a message that they should be empowered, there is a call to include indigenous languages within the repertoire of testing (see the chapter “Language Assessment in Indigenous Contexts in Australia and Canada” by Wigglesworth and Baker in this volume).

Full language repertoire (FLR). This refers to the expansion of assessment to include all the languages a person knows, regardless of the level of proficiency in each language. This is especially relevant for immigrant students who arrive in a new location, so that the languages they already know are not overlooked and ignored but are instead incorporated into the whole language repertoire and viewed as significant resources.

Other themes included in this volume that have the potential to reduce the power of tests and focus more on learning include the following: the chapter “Assessing Students’ Content Knowledge and Language Proficiency” by Llosa recommending a focus on content and less on language proficiency, the chapter “Culture and Language Assessment” by Scarino with emphasis on culture within assessment, and qualitative methods of validation by Lazaraton. The chapter “Assessing the Language of Young Learners” by Bailey especially warns about the overuse of tests with regard to young learners. Other studies demonstrate the extent to which language tests are instrumental for control. Tsagari and Cheng show how significant it is to examine the consequences of tests so as to limit their powerful status, which is related to the chapter “High-Stakes Tests as De Facto Language Education Policies” by Menken demonstrating that tests should avoid dictating the curriculum but rather reflect it. These are manifested in the use of tests as “de facto” curriculum, approaches which should be minimized. All these are warning signs regarding the ethics, professionalism, rights, and codes as described in the chapter “Ethics, Professionalism, Rights, and Codes” by the late Alan Davies. The existence of the Common European Framework of Reference (CEFR) requires testers to be even more cautious about the power of tests as these scales provide an extra tool to bring about homogeneity, as can be seen in the chapter “The Common European Framework of Reference (CEFR)” by Barni and Salvati. The dangers of using tests which are not appropriate to specific students are critiqued by Poza and Valdés (chapter “Assessing English Language Proficiency in the United States”).

All in all, many of the chapters in this volume discuss and propose ways to focus on learning and hence to minimize the power of tests by using the strategies described above.

Future Directions

CLT led to ample questions about the quality of tests, their consequences, and the difficulties they impose on test takers and systems. Tests often offer simplistic solutions for complex issues. The research in this field attempted to explore areas where tests are misused by examining their consequences and the intentions of those who introduced them. The responses are varied so that what is considered negative or positive is constantly debated given Messick’s views that these are related to the values of test takers, educational systems of nations, and ideologies of governments and regimes to use tests for power and control. Yet, the obligation of those engaged in language testing is to adopt CLT approaches to try to look beyond the tests themselves and toward their uses; in other words, a good test may be necessary but not sufficient. It is the obligation of all those working in test development and use to constantly ask questions as to intentions and uses of tests in education and society with regard to the multiple groups for whom national languages are second languages.

It is encouraging to see that in the past decade a number of different types of assessment strategies and procedures have been developed and implemented. These strategies are currently being used to broaden the construct of tests and provide successful ways of “talking back” to the power of tests. They can minimize that power and protect test takers, parents, teachers, and principals, enhancing the use of assessment procedures to minimize discrimination and marginalization and to maximize learning, fairness, ethicality, equality, and justice. The purpose is not to eliminate tests but rather to see the values behind them as well as their hidden agendas in the area of accountability and the learning of languages and to reflect perspectives of languages in this day and age.

Cross-References

Related Articles in the Encyclopedia of Language and Education