
1 Introduction

This chapter reports on the development of an online tool designed to measure L2 English learners’ proficiency at the CEFR A2-level and, in doing so, to inform teachers about the differences in these learners’ proficiency levels. The chapter first discusses the development of the online tool, which started from the needs expressed by Flemish L2 English teachers. It then focuses on a valid and reliable way to assess learners’ writing skills. In order to assess the learners’ writing in a reliable yet time-efficient manner, we explored the possibilities of a two-stage approach for marking L2 English writing using comparative judgment and benchmark texts.

Below we will discuss the need for an online tool measuring L2 English proficiency, some of the difficulties concerning the assessment of writing skills and the context in which this study took place. We will then describe the different steps that were taken in the development of the online tool, discuss the results of the study and end with some pedagogical implications.

2 Literature Review

2.1 The Need for an Online Assessment Tool

Studies with primary school age L2 English learners have found considerable differences in L2 English proficiency even before the start of formal L2 English instruction (De Wilde et al., 2020; Muñoz et al., 2018; Puimège & Peters, 2019). These large differences in learners’ prior L2 English knowledge pose considerable challenges to teachers. Knowing about these differences is important because prior knowledge can have a huge impact on further learning, e.g. in relation to the amount of instruction that is needed (e.g. De Bruyckere, 2017; Hattie & Yates, 2013). Therefore, it was decided to develop a test that gives teachers an opportunity to obtain information about their learners’ L2 English proficiency level at the start of the lessons in secondary school. The test is meant to inform teachers so they can adapt lessons to the learners’ various levels of proficiency and prior knowledge of English (e.g. through differentiated instruction). It is thus meant to be a low-stakes test.

2.2 Assessing Writing

Assessing students’ writing comprises many different aspects, such as content, organization, and linguistic features. Therefore, scoring writing tasks is often considered a challenging and time-consuming exercise (Hamp-Lyons, 1990; Schoonen, 2005). Teachers and researchers have studied many different methods to rate writing skills, and a distinction is often made between analytic and holistic methods. Holistic scoring methods look at the text as a whole and attribute one single score to the writing product, whereas analytic scoring methods give different scores for different aspects of the text, such as linguistic or content-related aspects (Harsch & Martin, 2013). To rate writing tasks analytically, raters often use rubrics that list the criteria to be taken into account, frequently accompanied by descriptors of the expected performance at the different levels of each criterion. In an analytic scoring method, the final score is a combination of partial scores (Crusan, 2013). When scoring a writing task holistically, raters sometimes use a set of criteria that need to be considered when rating the task, but these criteria serve as a guideline for giving one overall score. There are also other methods to assess writing tasks holistically. Two of these methods will be discussed below.

2.2.1 Two Holistic, Comparative Approaches: Comparative Judgment and Benchmark Texts

Apart from the more traditional analytic and holistic approaches in which students’ writing is assessed in an absolute manner, there are also comparative approaches, in which representations (e.g. written texts or images) are compared.

A method that has recently received some attention is comparative judgment, inspired by the work of Thurstone (1927), who claimed that people’s judgment is more reliable when comparing performances than when judging a single performance. The method was introduced into education by Pollitt and Murray (1996). In this method, multiple raters compare pairs of representations (e.g. written texts) and decide which of the two representations is the better one. After the raters have made a set number of comparisons of all the tasks, each learner’s writing task is assigned a place on a rank order ranging from the weakest to the strongest. The overall quality of a writing task is thus based on repeated comparisons (Lesterhuis et al., 2017). Recently, research teams have set up studies to organize this type of rating process digitally. To build an information system that could be used for comparative judgment, Coenen et al. (2018) identified several design requirements for the tool to be a success. These were: enabling valid and time-efficient assessments, reducing cognitive load, increasing reliability, supporting competence development, and supporting accountability. The tool which resulted from this study is called Comproved (www.comproved.com), but similar tools are available (e.g. No More Marking, Jones, 2016). The studies mentioned above have shown that this approach can result in reliable ratings, but several raters are needed, and many comparisons have to be made. The guidelines on the Comproved website, for example, mention that for a reliability of .70 the following formula should be used: number of representations * 7.5 / number of raters = number of comparisons per rater. This shows that the procedure can be quite time-consuming, which might be a hindrance for teachers in day-to-day classroom practice, as the number of holistic comparisons to be made can be high (Humphry & Heldsinger, 2020; Lesterhuis et al., 2017).
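As an illustration of the guideline above, the short sketch below computes how many comparisons each rater would have to make for a given number of representations and raters. The function name and the example numbers are ours and are not taken from Comproved’s documentation; the sketch only restates the formula quoted above.

```python
import math


def comparisons_per_rater(n_representations: int, n_raters: int) -> int:
    """Comproved guideline for a reliability of about .70:
    number of representations * 7.5 / number of raters."""
    return math.ceil(n_representations * 7.5 / n_raters)


# Illustrative example: 120 texts judged by a team of 30 raters
# would require 30 comparisons from each rater.
print(comparisons_per_rater(120, 30))  # -> 30
```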

Another form of holistic rating can be done with the use of benchmark texts (Lesterhuis et al., 2017). In this procedure, several texts which represent different levels of overall writing quality are chosen to serve as benchmarks. Teachers, or other raters, then compare their students’ work with the chosen benchmarks and decide which of the texts resembles their students’ work the most. The level associated with the most suitable benchmark text is the level allocated to the students’ writing. Bouwer et al. (2016) investigated the possibilities of rating written texts with benchmark texts and found that benchmark ratings were as reliable as ‘absolute’ analytic and holistic ratings. They did this on paper, however, whereas the aim of the present study is to check whether this can also be a suitable approach online.

Recently, several studies which investigated L1 writing have adopted an integrated, two-stage approach that combines the use of comparative judgment and benchmark texts (De Smedt et al., 2020; Humphry & Heldsinger, 2020; McGrane et al., 2018). In this approach, a set of written texts is first compared through comparative judgment. After this procedure, experts choose a number of texts from this set that represent different levels of writing quality to serve as benchmark texts. These benchmarks are then used when scoring new, similar writing tasks. Below, we report on a study in which we adopted this two-stage approach for an L2 picture narration task to investigate whether this approach, which has been used for L1 writing, is also appropriate for L2 writing in an online environment.

This chapter reports on the development of a tool meant to assess L2 English learners’ proficiency level. It describes the process towards a test structure and content that meets teachers’ needs. It further investigates a method to address specific challenges concerning the assessment of written texts. The following research questions are central in this chapter:

  • RQ1: What should an online tool which aims to map learners’ L2 English proficiency at the start of formal instruction look like?

  • RQ2: How efficient and reliable is the assessment of L2 English writing tasks following a two-stage approach (combining the use of comparative judgment and benchmark texts)?

3 Context of the Study

Formal L2 English lessons are compulsory in Flanders, the northern part of Belgium, from the first or second year of secondary school onwards, when learners are 12–14 years old. The starting age for English is rather late compared to other European countries (Enever, 2011) because English is the second foreign language taught in Flemish schools. The first foreign language taught in Flanders is French, which is an official language in Belgium.

Pupils are expected to reach the A2 level of the Common European Framework of Reference (Council of Europe, 2009) for English by the end of the second year of secondary school. In primary education, L2 English is not a compulsory part of the curriculum, and formal instruction only starts in secondary school. However, this does not mean English is absent from most learners’ daily lives. Most learners have been exposed to English regularly before the start of the lessons (for example, when gaming or watching television), and this leads to big differences in pupils’ prior knowledge of English. De Wilde et al. (2020) conducted a study with 780 Flemish learners who were in the last year of primary school. They found that 25% of these 11-year-olds obtained a score of 80% or higher on an A2-level listening test (mean score 15/25), but overall there was a broad range of test scores (from 0 to 25 out of 25). For the A2-level speaking task, scores were considerably lower (mean score 7/20), but still a considerable number of the participants (14%) scored 80% or higher. Finally, 10% of the participants obtained a score of 80% or higher on an A2-level reading and writing test, whereas more than half of them obtained a test score below 50%, again pointing to large individual differences prior to the start of formal instruction.

The online tool presented in this chapter was developed to give teachers in Flanders more insight into the actual differences in their learners’ L2 English proficiency level. First, we investigated what teachers expected from such a tool, and in a second study we looked into an efficient and reliable way to score learners’ writing.

4 Research Question 1

In what follows, we explain the methodology and the results obtained to answer our first research question: “What should an online tool which aims to map learners’ L2 English proficiency at the start of formal instruction look like?”

4.1 Methodology

4.1.1 Instruments and Procedure

To develop a test that would meet teachers’ needs, we decided to consult teachers and other stakeholders before the actual test development was started. The teacher questionnaire was designed using group concept mapping (GCM). This method, which was developed by Trochim (1989a, b) and further adapted by Stoyanov and Kirschner (2004), can be used to gather and organize ideas in a structured manner. In this study, it was used in a simplified version consisting of three rounds. In round one, we sent a list of open questions to experts in the field of education and assessment to gather answers which could lead to items for the questionnaire. The open questions were listed in an online form, and the link to the form was e-mailed to the experts, who answered the questions anonymously. They were given 1 week to answer the questions. In round two, we sent the same seven experts a set of possible items for the questionnaire, which were based on the answers to the open questions from the first round. We asked the experts to rate how important these items were. Again, they had 1 week to complete the questionnaire. In the last round, we made an initial version of the questionnaire for the teachers and asked a focus group of five new participants with expertise in language testing and/or foreign language teaching to comment on the questionnaire and give suggestions for improvement. After the focus group, we made a second and final version of the questionnaire, which we made available to teachers. The questionnaire consisted of some questions about the teachers’ background and teaching experience and a number of statements about what they thought an L2 English test for their learners should look like. Answers to the statements were given on a Likert scale ranging from 1 (totally unimportant) to 5 (very important). The questionnaire can be found in Appendix A.

4.1.2 Participants

As mentioned above, we consulted teachers and other stakeholders in order to be able to have a clear view on their expectations for an L2 English proficiency test. In the first phase, we consulted experts in the field of foreign language education and assessment such as scholars, policymakers, and curriculum designers (n = 12). Seven experts took part in the group concept mapping procedure and five experts took part in the focus group. The participants in the focus group were part of the advisory committee for this research project.

In a second phase, 95 participants filled in the teacher questionnaire. Most participants were teachers in the first 2 years of secondary school (n = 64); 29 teachers also taught in secondary school but taught older pupils; one participant was a teacher trainer, and one was an educational adviser for English. The teachers who completed the questionnaire had varying degrees of experience (between 1 and over 30 years).

4.1.3 Analysis

The results of the teacher questionnaire were analyzed quantitatively and used to make decisions about the structure, content, and duration of the test. Descriptive statistics can be found in the results section.

4.2 Results

Teachers’ answers showed that they believed a test should contain activities looking into learners’ prior knowledge of the four language skills (cf. Table 1). Therefore, it was decided that the test should consist of four parts, each testing one language skill: listening comprehension, reading comprehension, writing, and speaking. Another important aspect for teachers was the possibility to give feedback. The form of the test was less important to the teachers than the content, but they seemed to favor a digital test over a paper-based test. Descriptive statistics of the scores given by the teachers can be found in Table 1.

Table 1 Teacher questionnaire: descriptive statistics of test characteristics (Likert scale 1–5)

The majority of the teachers (60%) asked that the duration of the test be approximately 50 min, the equivalent of one teaching period in Flanders; 31% of the teachers opted for a shorter test (30 min), and 9% would also use a test taking more than 50 min. We decided to settle on a 50-min test. As the test is meant for learners who are at the start of formal instruction and some learners could be absolute beginners, the instructions had to be available in both English and Dutch, the latter being the language of instruction.

During test development, we considered the test’s practicality, and we decided a type of scoring was needed that would be easy to use for the teachers, as they would be the ones scoring their learners’ tests. For the scoring of the writing task, we decided to investigate the possibility of using benchmark texts which were selected via a two-stage approach. This procedure will be discussed below.

5 Research Question 2

Having discussed the design of the testing tool, we now turn to our second research question: “How efficient and reliable is the assessment of L2 English writing tasks following a two-stage approach (combining the use of comparative judgment and benchmark texts)?” We will answer this question through two different studies.

5.1 Study 1

5.1.1 Methodology

Participants

In order to answer the second research question, 121 participants each wrote one or two texts. All the participants were between 12 and 14 years old. We tested pupils in six classes in two different schools, three classes per school. The participants from school one were in the first year of secondary school; those from school two were in the second year. All participants had just started formal L2 English education and had received less than 15 h of formal English instruction.

Fifty-three raters took part in the comparative judgment procedure. All the raters had experience with rating L2 writing tasks: they were either working as teachers or teacher trainers (n = 11) or they were in the second year of a three-year bachelor’s program in which they were being trained as English teachers (n = 42). The students from the teacher training program had already completed a teaching practice in a secondary school and had been trained to score students’ work.

Instruments and Procedure

To be able to capture the different levels of proficiency among the learners while still giving sufficient support to the true beginners, we decided to use a picture narration task. According to Goodier and Szabo, the authors of the Collated Representative Samples of Descriptors of Language Competencies Developed for Young Learners (2018), the task of telling a simple story is a relevant task at the CEFR A2-level for learners aged 11–15 years. The visuals that were added in the writing task in the present study were meant to give extra support to learners with a low(er) proficiency level. Three different picture stories were designed. An example of one of the picture stories can be found in Fig. 1 below.

Fig. 1 Example of a picture story designed for the picture narration task (four numbered cartoon pictures depicting an interaction at the counter of a shop)

In the first phase of the study, we collected 177 writing samples. The participants described a set of four pictures which together made up a story. There were three different stories (picture story A, n = 60; picture story B, n = 56; picture story C, n = 61). The picture stories were designed in such a manner that all pupils would be able to relate to the situations depicted in them. No explicit time limit was given to the pupils. The writing tasks were paper-based and were digitized by the researchers for the next phase of the study (comparative judgment).

The learners’ writing tasks were rated using the comparative judgment tool Comproved (Coenen et al., 2018). In this tool, raters are asked to compare two representations, in this case two of the 177 texts, and to decide which of the two is the better one. There were 53 raters who each made 33 comparisons, resulting in a total of 1749 comparisons. This number of comparisons is sufficient to obtain reliable results (cf. the formula above: number of representations * 7.5 / number of raters = number of comparisons per rater). Per comparison, raters were asked to select the better of the two representations. There were no further instructions concerning how they should rate the writing samples, and no criteria were given for the assessment. They were only asked to indicate which of the two samples presented in each comparison they believed was of higher quality. Raters could choose to add comments to justify their decision, but this was not obligatory, and the comments were not taken into account when making the rank order. Following the procedure of the two-stage approach (Humphry & Heldsinger, 2020), the results of the comparative judgment procedure were used to inform the choice of the benchmark texts. Descriptors of the benchmark texts were taken from the CEFR descriptors for young learners aged 11–15 years (Goodier & Szabo, 2018).
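The chapter does not detail the statistical model Comproved uses to turn pairwise decisions into a rank order; comparative judgment tools typically fit a Bradley-Terry-Luce type model to the comparison outcomes. The sketch below is a minimal illustration of that idea, assuming simulated data and a plain logistic regression; none of the variable names or numbers come from the study, and this is not a description of Comproved’s implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative setup: 30 texts with simulated 'true' writing quality.
n_texts, n_comparisons = 30, 600
true_ability = rng.normal(0, 1.5, n_texts)

# Each comparison pits text a against text b; the judged winner follows
# a Bradley-Terry probability based on the difference in quality.
pairs = np.array([rng.choice(n_texts, size=2, replace=False)
                  for _ in range(n_comparisons)])
p_a_wins = 1 / (1 + np.exp(-(true_ability[pairs[:, 0]] - true_ability[pairs[:, 1]])))
a_wins = (rng.random(n_comparisons) < p_a_wins).astype(int)

# Design matrix: +1 for the first text of each pair, -1 for the second.
X = np.zeros((n_comparisons, n_texts))
X[np.arange(n_comparisons), pairs[:, 0]] = 1.0
X[np.arange(n_comparisons), pairs[:, 1]] = -1.0

# A logistic regression without intercept recovers Bradley-Terry style
# ability scores (identified only up to a constant, hence the centering).
model = LogisticRegression(fit_intercept=False, C=10.0, max_iter=1000)
model.fit(X, a_wins)
ability = model.coef_.ravel() - model.coef_.mean()

rank_order = np.argsort(-ability)  # strongest text first
print(rank_order[:5], np.round(ability[rank_order[:5]], 2))
```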

Analysis

Descriptive statistics of the rank order of the 177 representations (written texts) that resulted from the comparative judgment procedure are given in the results section. To investigate the reliability of the rank order, the scale separation reliability (SSR) was calculated. Verhavert et al. (2018) found that this measure (with values between 0 and 1) can be used as an index of interrater reliability in comparative judgment.
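Verhavert et al. (2018) give the details of the SSR; one common formulation, analogous to Rasch separation reliability, takes the share of variance in the ability estimates that is not attributable to estimation error. The sketch below illustrates that formulation only; it is an assumption on our part rather than the exact computation performed by Comproved, and the example values are invented.

```python
import numpy as np


def scale_separation_reliability(abilities, standard_errors):
    """Share of the observed variance in the ability estimates that is not
    attributable to estimation error (values between 0 and 1)."""
    abilities = np.asarray(abilities, dtype=float)
    standard_errors = np.asarray(standard_errors, dtype=float)
    observed_variance = abilities.var(ddof=1)
    error_variance = np.mean(standard_errors ** 2)
    return (observed_variance - error_variance) / observed_variance


# Invented example values, not the study's data:
print(round(scale_separation_reliability(
    abilities=[-2.1, -0.8, 0.0, 0.9, 2.3],
    standard_errors=[0.6, 0.5, 0.5, 0.5, 0.7]), 2))
```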

5.1.2 Results

The raters’ work resulted in a reliable rank order of the 177 representations. In the current study, the SSR was high (.88), indicating strong agreement between the raters on the quality of the written texts. We could thus be confident that the rank order resulting from the 1749 comparisons was reliable.

We then compared the results of the comparative judgment procedure for the three different picture stories and chose the picture story with the best spread in results. The boxplot in Fig. 2 shows the spread of the scores of the representations (written texts) per picture story. In Table 2, the descriptive statistics for the results of the comparative judgment procedure per picture story are given. Figure 2 and Table 2 show that the scores for story B (the example story given in Fig. 1) are spread almost evenly around the mean: the mean ability is around zero, some representations received a high score (maximum score = 5.52), others a very low score (minimum score = −5.87), and there are no outliers. We therefore decided to continue the study with story B in stage two.

Fig. 2 Boxplot showing the scores for the three stories (A, B, and C) rated through comparative judgment, with ability ranging from −6 to 6. (ability = score assigned to each representation after the comparative judgment procedure)

Table 2 Descriptive statistics showing the spread in the rank order of the representations per picture story
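As a simple illustration of how the spread per picture story can be inspected, the sketch below groups ability scores by story and reports their mean, standard deviation, and range. The data frame layout and values are invented and only show the kind of comparison described above; they do not reproduce Table 2.

```python
import pandas as pd

# Invented ability scores, tagged with the picture story each text belongs to.
results = pd.DataFrame({
    "story":   ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "ability": [-1.2, 0.4, 2.3, -4.8, 0.1, 4.9, -2.0, 0.6, 3.1],
})

# The story whose scores show the widest, most even spread around zero
# (and no outliers) is the best candidate for selecting benchmark texts.
summary = results.groupby("story")["ability"].agg(["mean", "std", "min", "max"])
print(summary)
```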

After the comparative judgment procedure, two researchers selected four benchmark texts. The selection was based on the rank order of the representations established by the 53 raters who took part in the comparative judgment procedure. Starting from that order, the researchers chose texts which were a good representation of the four different levels they wanted to discriminate: above A2, A2, A1 and below A1, based on the level descriptors found in the Common European Framework of Reference for Languages (Council of Europe, 2009). If the learners were unable to answer in English or did not write anything at all, their writing was scored as ‘no output’, which was considered a fifth level. The top and bottom levels corresponded to representations that were ranked very high (5.52) or very low (scores of −3.8 and lower) in the comparative judgment procedure. For the bottom level, however, there is no benchmark text, as this level corresponds with texts written in Dutch or tasks where participants did not write anything at all. The benchmark text which corresponds with the A2 level received a score of 0.34 in the comparative judgment procedure, the A1-level benchmark text corresponds with a score of −0.23, and the benchmark text that was chosen for the below-A1 level corresponds with a score of −1.16. The scores of the benchmark texts show that the rank order of the comparative judgment procedure was respected in the text selection. Following the procedure Humphry and Heldsinger (2020) used for the assessment of L1 writing, performance descriptors were added to the benchmark texts. These descriptors are meant to give the characteristics of a text at a certain level and can help teachers when they are in doubt about which benchmark text is closest to their students’ writing. The benchmark texts and descriptors together should give teachers the tools they need to assess similar writing tasks. The performance descriptors can also be used to give feedback to the students. The performance descriptors in this study were based on the CEFR descriptors for young learners aged 11–15 years (Goodier & Szabo, 2018). The benchmark texts and the descriptors for all levels can be found in Appendix B.

5.2 Study 2

5.2.1 Methodology

Participants

In this second study, 407 pupils from three schools participated. The study was conducted in schools that had not participated in study 1 (cf. Sect. 5.1). All participants were in the first year of secondary school. They were at the very start of formal L2 English education and had received 0 to 5 h of formal English classroom instruction. Each participant did the complete online skills test.

Instruments and Procedure

In this study, the reliability and efficiency of rating learners’ writing with the benchmark texts were investigated using the 407 learners’ written texts. The writing task, a picture narration task based on picture story B, was given to the participants as the third task of our proficiency test measuring the four skills. Listening and reading skills were tested before the writing task; speaking was tested last. The students did the test on a desktop or a laptop, depending on what was available in their own school. They saw the visual shown in Fig. 1. As the learners were at the start of L2 English lessons, instructions were given in English and Dutch. They were asked to type the story in a text box below the picture. Two raters scored the writing tasks using the benchmark texts with performance descriptors.

To further investigate the efficiency of assessing written tasks with benchmark texts, we did an exercise with 98 teachers during a training session in which the tool was presented.

Analysis

We used descriptive statistics to report the spread in results and weighted kappa to calculate interrater reliability. To report on the efficiency of scoring with the benchmark texts, we report the time raters needed to do the scoring activity.
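The chapter does not specify which weighting scheme was used for the kappa statistic. The minimal sketch below assumes quadratic weights and uses invented ratings purely to show how such an agreement index can be computed with a standard library function; it is not the study’s actual data or computation.

```python
from sklearn.metrics import cohen_kappa_score

# Ordinal coding of the five levels used in the study.
levels = {"no output": 0, "pre-A1": 1, "A1": 2, "A2": 3, "above A2": 4}

# Invented scores for eight texts double-marked by two raters
# (in the study, 106 texts were scored by both raters).
rater_1 = ["A1", "pre-A1", "A2", "no output", "A1", "above A2", "A1", "pre-A1"]
rater_2 = ["A1", "pre-A1", "A1", "no output", "A1", "above A2", "A2", "pre-A1"]

kappa = cohen_kappa_score([levels[r] for r in rater_1],
                          [levels[r] for r in rater_2],
                          weights="quadratic")
print(round(kappa, 2))
```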

5.2.2 Results

As mentioned above, two raters scored the 407 written texts with the benchmark texts. The results showed that about 10% of the participants produced no output, 30% of the writing tasks received a pre-A1 score, 36% an A1 score, 17% an A2 score, and 6% a score higher than A2. Overall, 40% of the learners were not able to get the message across, but 60% of the participants were already able to write a simple story at an A1 or A2 level. To investigate the reliability of the scores, 106 writing tasks were scored by both raters. The interrater reliability was high (weighted kappa = .90), indicating very strong agreement between the raters’ judgments. The two raters reported that they scored about one task per minute, which shows this form of assessment to be very efficient.

To further check the reliability of using the benchmark texts with descriptors, 98 teachers, who took part in a session on how to use the online tool, did an activity in which they rated 15 written texts with the benchmark texts. They were asked to first do this individually and then compare the results with a partner. The teachers were able to do the rating in 15 min. When comparing their ratings to those of another teacher, most ratings turned out to be the same, and when they differed, they were never more than one level apart, which is in line with the reliability reported above.

In the online tool which was made available to the teachers, the same procedure can be followed. The teacher can access the students’ results via a results tab and can then access and score the writing with the help of the benchmark texts from the teacher’s manual (cf. Figure 3). When all texts have been scored, the teacher can consult the scores in an overview (cf. Figure 4). The score on the written task is not communicated to the pupils directly via the tool but via the teacher, as was advised by the experts in the focus group on test development. A lower score means that a pupil has less prior knowledge, which is not necessarily a bad thing; however, it could be perceived as a failure by the learner. If teachers communicate the score to the pupils, they can better explain what the score means.

Fig. 3 Screenshot from the teacher view in the tool showing a pupil’s writing (a short story about buying a skateboard) and a drop-down menu in which the teacher can add the score after comparison with the benchmark texts. (Dutch: Score ingeven = English: enter score; Dutch: Geen output = English: no output)

Fig. 4 Screenshot from the teacher view in the tool showing the pupils’ writing scores. (Dutch: Leerling = English: pupil; Dutch: Schrijven = English: writing; Dutch: Ingeven = English: enter, which is where the teacher can click to see a pupil’s writing and enter the score, cf. Figure 3)

6 Discussion

In the study reported in this chapter, we investigated what a test meant to map pre-adolescent learners’ L2 English proficiency can look like. From the questionnaire results it was clear that teachers wanted the test to look into young L2 English learners’ proficiency in the four language skills (reading comprehension, listening comprehension, speaking and writing). This could be because the official curricula in Flanders stress the importance of language skills and focus on pupils’ ability to communicate. When asked about the test format, the teachers had a preference for an online tool. We also opted for an online tool because a website is easily accessible for all teachers.

The second part of this chapter focuses on using benchmark texts gathered via a two-stage approach as a method to rate L2 writing tasks designed for young learners in an online environment. The results show that benchmark texts selected via a two-stage approach, an approach which was shown to be reliable for L1 writing (De Smedt et al., 2020; Humphry & Heldsinger, 2020; McGrane et al., 2018), also lead to a reliable assessment of L2 picture narration tasks in a test designed for novice L2 English learners. One of the main advantages of this approach is that it is straightforward and easy to use for teachers, which is in line with observations in previous research (Humphry & Heldsinger, 2020; Lesterhuis et al., 2017).

We also aimed to investigate whether and how benchmark texts gathered via a two-stage approach can be integrated into an online tool that aims to assess L2 English writing skills. It was shown that the benchmark texts resulting from this approach can be integrated into an online tool that aims to map pupils’ prior knowledge at the start of formal instruction. In the future, a similar approach could be followed for writing at a different level or with a different type of task and the use of benchmark texts could be integrated into other online tools in a similar manner.

The process leading to the choice of representative benchmark texts is quite time-consuming because a large group of raters and a large number of representations are necessary for the comparative judgment procedure to render highly reliable results. However, once this step has been completed, the benchmark texts selected, and the descriptions of the different levels added, the method is very straightforward. The teachers are thus offered an efficient and reliable tool for rating their pupils’ work. Benchmark texts are easy to use because teachers only need to compare their students’ writing to the four available texts (with performance descriptors) or a fifth ‘no output’ option. They then decide which of the benchmark texts is of a similar quality to the students’ writing. Once the teacher is familiar with these texts, the assessment can be done quickly and reliably. This approach makes it possible to integrate productive tasks (here: writing) in an online assessment tool and have them reliably assessed by the classroom teacher. This means that once the, albeit time-consuming, procedure of the two-stage approach has been completed, there is no need for extra raters or a centralized rating system to assess the writing tasks in this online tool.

Furthermore, the descriptors which are added to the benchmark texts can be used by the teachers either to give collective and individual feedback on their pupils’ writing or in the design of their lessons. If, for example, the learners’ writing tasks in a class group show large differences in prior knowledge, the teachers could decide to integrate these results in their lessons. They could offer materials to improve learners’ writing (e.g. vocabulary necessary for the writing task or information on linking ideas in writing) which can appeal to all the learners in the group (e.g. through differentiated instruction).

Teachers or other stakeholders could also use this two-stage approach to select benchmark texts for other types of classroom assessment. They could use one of the online tools which are currently available for comparative judgment and then follow the procedure described in this chapter for the selection of benchmark texts and performance descriptors.

If this approach is too time-consuming or expensive, they could also decide to use a ‘light version’ of this two-stage approach. In the first step, teachers or teams of teachers could rank learners’ writing tasks which they have rated in previous years; in the second step, they could choose a number of tasks which they believe are representative as benchmarks for a certain level and describe why these tasks are considered representative (based on the objectives formulated by, for example, the curriculum or the CEFR). This method would give teachers in the same team the opportunity to all use the same benchmark texts with descriptors and so assess their learners’ writing in a similar way. Choosing a selection of representative benchmarks might add to the reliability of assessment in a team of teachers, but the reliability of this ‘light version’ would have to be investigated in future studies. Furthermore, deciding on the performance descriptors might be an interesting exercise to do as a team, as it might lead to a more deliberate assessment of students’ writing.

In a follow-up study, researchers could develop materials for teachers to tackle these differences in their L2 English classes, but this was not within the scope of our project, which focused on the development of an online tool to assess learners’ prior knowledge. Future studies could also investigate the differences between scores given using rubrics and scores given via the two-stage approach.

A limitation of this study is that the group concept mapping was done via an online questionnaire; however, the results from the online group concept mapping procedure were confirmed in the live focus group.

7 Conclusion

In this chapter we have shown how we decided on the content and form of an online tool to assess L2 English learners’ prior knowledge based on the needs formulated by teachers. This resulted in a tool with tasks for all four language skills which can be completed in one 50-min lesson period.

One of the biggest challenges in the development of the tool was finding a reliable and efficient way to assess learners’ written tasks. We decided to explore the possibility of using benchmark texts gathered in a two-stage approach. Overall, the two-stage approach combining comparative judgment and benchmark texts proved to be a good method to ensure reliable results when rating beginners’ L2 narrative writing, as shown by the interrater reliability in the second part of the study. Furthermore, the use of benchmark texts for assessment is straightforward and leaves little room for interpretation by individual teachers, as there is one single holistic judgment based on comparison with a given set of texts (and descriptors). This makes it a good method for assessing writing skills in an online tool: once the two-stage approach has been completed and the tool is online, the entire rating process can be done by the L2 English teacher. There is no need for external raters to score the writing tasks, and the assessment still leads to reliable scores.

This study reported on an exploration of a holistic way to assess learners’ writing and could be useful for teachers and other stakeholders who are looking for a practical, time-efficient, and reliable way to rate their learners’ writing. Further research with learners of different proficiency levels and different ages is warranted for the approach to be more widely used.

Notes

The tool can be found here: https://www.starttoetsengels.be