Introduction

Assessing students’ free-text answers (e.g., argumentative essays) is an important task for artificial intelligence (AI) and natural language processing (NLP) in education. It also involves developing tutoring systems based on AI-driven assessment procedures (e.g., Bai & Stede, 2022; Mathias & Bhattacharyya, 2020). Advantages of such systems include (1) reduced workload for teachers, (2) immediate information about the performance level of their students without extensive manual correction effort, (3) more frequent and instant feedback for students, and (4) a consistent assessment procedure that is, for instance, not bound to human attention processes (see, e.g., Ramesh & Sanampudi, 2022; Uto, 2021; Yan, 2020). Recent research has emphasized the promise of AI-based tutoring systems in supporting students closely during writing and learning processes (e.g., Hussein et al., 2019; Injadat et al., 2021). However, to support students in the context of complex writing tasks such as argumentative essays, an accurate and comprehensive assessment of several aspects of writing is necessary (Bai & Stede, 2022). Such fine-grained scoring of different aspects is a challenging problem that is largely unresolved in the field of automated essay scoring (AES) (see Horbach et al., 2017; Kumar & Boulanger, 2021).

Different AES approaches using machine learning methods have been proposed to face these challenges. As in many NLP tasks, two general model types have been proposed: feature engineering and deep neural networks (DNN) (Bai & Stede, 2022; Ke & Ng, 2019; Kusuma et al., 2022; Uto, 2021). Recent studies have shown that hybrid models, combining both approaches, can outperform models based on a single resource (e.g., Dasgupta et al., 2018; Uto et al., 2020; see also Mizumoto & Eguchi, 2023). However, most of these comparisons used holistic scoring methods, i.e., assigning one overall grade per essay (Lagakis & Demetriadis, 2021). The holistic approach, however, provides assessment on a superordinate level that is not suitable for meaningful tutorial feedback or in-depth diagnosis of students’ writing abilities (Condon & Elliot, 2022; Narciss, 2008). Moreover, from a methodological point of view, holistic scoring makes it impossible to disentangle possible strengths and weaknesses of the different AES approaches regarding certain aspects of text quality (Andrade, 2018).

In the current study, we therefore compare the performance of different AES approaches scoring analytic essay rubrics (also referred to as traits in the following). For this purpose, we use four different argumentative prompts, containing English essays written by L1 students (from the ASAP corpus; Footnote 1) and L2 students (from the MEWS corpus; Keller et al., 2020; Rupp et al., 2019). We consider analytic benchmark scores assigned by trained human raters representing different aspects of text quality, such as language quality, organization, and content. In doing so, we compare the two single-input approaches to analytic scoring (linguistic features vs. essay-level contextual embeddings from DistilBERT) and explore which approach is superior regarding a given aspect of text quality. Initial studies comparing these approaches for specific text characteristics, such as lexical complexity, indeed suggest performance differences (e.g., Crossley & Holmes, 2023). Furthermore, we investigate whether a hybrid model indeed outperforms the two single-resource approaches across prompts and traits.
In addition, we use addition and ablation tests (i.e., stepwise addition/removal of certain groups of features) to uncover types of linguistic features that are especially important for specific traits and that can hardly be captured by (essay-level) contextual embeddings, making them particularly relevant for the hybrid architecture. Finally, we examine the cross-prompt performance of the models within the L1 and L2 corpora. We aim to answer the following research questions (RQ) that guide our experiments:

  1. How do models based on linguistic features and text-level contextual-embedding-based models differ regarding their performance on scoring certain aspects of text quality?

  2. Under which conditions does a hybrid approach outperform the single-resource models across different aspects of text quality?

  3. Which linguistic feature types carry information not covered by the contextual embeddings of DistilBERT and are therefore most important for the hybrid approach?

  4. How do the different model architectures differ regarding cross-prompt performance?

The remainder of this article is organized as follows. Section “Overview of Different AES Approaches” introduces different AES approaches and discusses their respective presumed advantages and disadvantages. Section “Method” describes the datasets, the different model architectures, and the training procedures used in the present study. In Section “Results”, we present the results of our experiments. Finally, we discuss our results and outline limitations along possible directions for future research in Section “Discussion”.

Overview of Different AES Approaches

Most machine learning (ML) approaches to AES follow a supervised learning strategy (Ke & Ng, 2019), where humans’ assessments of a given set of student essays are used as benchmark scores to train ML models. The trained models are then used to assign scores to new texts written in response to the same or new prompts.

Key characteristics of ML models used in supervised learning AES tasks are (1) the way texts (i.e., student essays) are represented as numerical input vectors (i.e., features) and (2) the actual ML architecture being employed (e.g., linear regression or DNN; see Ramesh & Sanampudi, 2022). Both aspects are interrelated, which has led to two overarching AES approaches being established in previous research: feature engineering and DNN approaches (Ke & Ng, 2019; Kusuma et al., 2022). In the feature engineering approach, domain experts select and manually design the text features to be used for a specific task in step (1), combining these hand-crafted linguistic features with predominantly shallow learning techniques in step (2). Conversely, DNN approaches autonomously learn a suitable representation of the features in step (1) and subsequently (or simultaneously) train a DNN to score essays in step (2). In the upcoming sections, we will outline key characteristics of both approaches. Our focus will be on the different model inputs most relevant for the comparisons carried out in the present study (see Bai & Stede, 2022; Uto, 2021). A comprehensive overview of AES approaches can be found in Ramesh and Sanampudi (2022), Lagakis and Demetriadis (2021), and Uto (2021).

Feature Engineering Approach

AES models following the feature engineering approach are based on a theory-driven way of translating text into numerical data using NLP methods (Chen & Meurers, 2016; Crossley, 2020; McNamara et al., 2014; Zesch et al., 2015). Those features range from simple length-based representations, such as the number of words or paragraphs, to highly elaborated linguistic constructs, such as coherence (Mesgar & Strube, 2018) or cohesion measures (e.g., Crossley et al., 2017).

The feature engineering approach is the traditional AES approach (Lagakis & Demetriadis, 2021). Over the last decade, many tools have been proposed that provide the user with a vast range of linguistic features (Chen & Meurers, 2016; Crossley, 2019; Kumar & Boulanger, 2021; Kyle et al., 2018). In the following sections, we will provide a brief overview of common types of features applied in AES tasks.

Length and Occurrence Features

In the context of formal education, student essays are written under a specific time limit (timed writing; see Weigle, 2002). Therefore, essay length in terms of words, sentences, or paragraphs has been demonstrated to be a powerful predictor of human scores (see, e.g., Fleckenstein et al., 2020; Zesch et al., 2015). Furthermore, ratios of words per sentence or sentences per paragraph can be interpreted as a proxy for syntactic complexity (Crowhurst, 1983). In addition, many traditional readability metrics consider word and sentence length (Pitler & Nenkova, 2008). Other length features typically used in AES models are mean word length (in characters) or the total number of unique words (e.g., Chen et al., 2016).

Occurrence features are closely related to length-based features and include, for instance, counts of instances of certain part-of-speech classes, such as nouns, proper nouns, verbs, and adjectives, as well as special characters. The classification of words into word classes is known as part-of-speech tagging (POS; e.g., Mitkov & Voutilainen, 2012). In addition, the ratios of specific word classes to the total number of words are also frequently used (X. Chen & Meurers, 2016).

Error Features

Another important aspect of student writing that usually factors into performance evaluations is the number of errors (e.g., typos or grammar errors). Error ratios are often calculated for this purpose, such as the proportion of spelling errors to the total number of words. LanguageTool (Footnote 2) is a powerful tool to automate error counts.

Features Relating to Lexical Diversity and Sophistication

A common indicator of the lexical diversity of student essays is the type-token ratio (e.g., Richards, 1987). Tokens are defined as all individual words in a text, whereas types are defined as unique words. Thus, if the type-token ratio is close to one, lexical diversity is high; if it is close to zero, lexical diversity is low.
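
For illustration, a minimal Python sketch of this measure (using naive whitespace tokenization rather than a full NLP tokenizer) could look as follows:

```python
def type_token_ratio(text: str) -> float:
    """Ratio of unique words (types) to all words (tokens)."""
    tokens = text.lower().split()            # simple whitespace tokenization
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)    # close to 1 = high lexical diversity

print(type_token_ratio("writing essays requires writing and revising essays"))  # ~0.71
```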

Common features representing the lexical sophistication of essays are (weighted) counts of occurrences in large word-frequency corpora, such as the British National Corpus (BNC) or the Brown frequency list. For example, to determine the predominant use of “easy words”, the top 1,000 words of the BNC word list have been used as a reference (X. Chen & Meurers, 2016). Conversely, “difficult words” have, for example, been defined as words not included in the top 2,000 of the BNC word list (X. Chen & Meurers, 2016). However, these cut-off values are rather arbitrary and might be adapted and aligned with the characteristics of the respective student population.

Morphological Complexity Features

Morphological complexity measures are related to type-token ratio, but instead of reflecting lexical diversity, they capture the range of different inflections used (Brezina & Pallotti, 2019). For instance, a text with diverse inflected forms such as “writing, wrote, writes” would be deemed to have a higher morphological complexity than one that merely repeats the same form like “writing, writing, writing”. Morphological complexity measures can be calculated by taking the ratio of unique inflectional forms to the sum of all tokens of a given word class (per text, paragraph, or sentence), offering a quantitative insight into the diversity of morphological forms (Brezina & Pallotti, 2019).
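
As an illustration (not the exact operationalization used in the literature cited above), the following sketch computes such a ratio for verbs with spaCy, treating distinct surface forms as inflectional variants:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_inflectional_diversity(text: str) -> float:
    """Unique verb surface forms divided by all verb tokens in the text."""
    doc = nlp(text)
    verbs = [tok for tok in doc if tok.pos_ == "VERB"]
    if not verbs:
        return 0.0
    unique_forms = {tok.text.lower() for tok in verbs}   # e.g., {"writing", "wrote", "writes"}
    return len(unique_forms) / len(verbs)

print(verb_inflectional_diversity("She wrote a draft, writes daily, and is writing more."))
```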

Syntactic Complexity Features

Dependency parsing captures the grammatical relationships between words, offering a structured representation of sentences that reveals their underlying syntactic properties (e.g., Nivre, 2010). Dependency parsing has been used in AES context to count, for instance, the number of fragment clauses, prepositional phrases, coordinate clauses, or relative clauses (e.g., X. Chen & Meurers, 2016). This can provide valuable insight into a student’s ability to compose sophisticated sentences and present complex topics and ideas.

Cohesion Features

Cohesion refers to the lexical linking within a text, providing the reader with a sense of flow and consistency (Crossley et al., 2016). In general, more cohesive texts allow the reader to better follow the ideas presented and to understand links between different topics (Crossley et al., 2017). Cohesion can be operationalized numerically, for instance, as the lexical overlap between consecutive text segments, such as sentences or paragraphs. Another measure of cohesion is the frequency of connectives that structure the essay.
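
For illustration, a simple sketch of such measures (an approximation using Jaccard overlap of word sets and a small, hypothetical connective list) might look as follows:

```python
import re

CONNECTIVES = {"however", "therefore", "moreover", "furthermore", "consequently"}

def sentence_overlap_cohesion(text: str) -> float:
    """Average lexical overlap (Jaccard) between adjacent sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    if len(sentences) < 2:
        return 0.0
    overlaps = []
    for prev, curr in zip(sentences, sentences[1:]):
        a, b = set(prev.split()), set(curr.split())
        overlaps.append(len(a & b) / max(len(a | b), 1))
    return sum(overlaps) / len(overlaps)

def connective_ratio(text: str) -> float:
    """Share of tokens that are connectives from the (illustrative) list above."""
    tokens = text.lower().split()
    return sum(tok.strip(",.") in CONNECTIVES for tok in tokens) / max(len(tokens), 1)
```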

Feature Engineering Approach – Advantages and Challenges

One of the main advantages of the feature engineering approach is the theory-driven way of pre-processing the text-inherent information before feeding it into an ML model. This process of creating features provides a high degree of control over which information may be used by the algorithm. Feature engineering and feature selection might also be adapted to specific types of essays or learner populations. Furthermore, the explicit, theory-based approach of feature selection forms a prerequisite for explainable AES scores (answering how a given score is determined). Therefore, the feature-based approach has usually been combined with ML model architectures that allow a high degree of explainability, such as linear regression, logistic regression, random forests, or decision trees. This approach, however, is rarely combined with DNNs, whose hidden layers (often referred to as the “black box” of DNNs) make it difficult to understand and interpret how scores are calculated.

One primary challenge of the feature-based approach is the adequate representation of content, an element many consider pivotal, if not the most critical, aspect of essays (e.g., Ramesh & Sanampudi, 2022; see also Perelman, 2014). A potential strategy to incorporate content in the realm of feature engineering without a loss of explainability is the application of bag-of-words or n-gram techniques (e.g., Ke & Ng, 2019). These methods employ either word frequencies (uni-grams) or word sequence frequencies (n-grams) to represent an essay’s content in AES tasks. Typically, these approaches are used in conjunction with stop-word filtering and lemmatization. Nevertheless, the representation of content through bag-of-words or n-gram techniques remains significantly limited, reducing it merely to the occurrence of specific words or chunks. This fails to account for the contextual nature of language, wherein a word’s meaning heavily relies on its surrounding lexical environment. Additionally, n-gram techniques pose a risk of feature explosion when every word and word sequence within a given text corpus is represented as an independent feature. However, a powerful alternative for processing text data and encoding the content of a text has been proposed in the context of DNNs, namely word embeddings.
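
For illustration, a minimal n-gram representation of this kind can be built with scikit-learn; the essays and parameter values below are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer

essays = [
    "Advertising to children should be restricted because ...",
    "Teachers need strong content knowledge as well as empathy ...",
]

vectorizer = CountVectorizer(
    ngram_range=(1, 2),      # uni- and bi-grams
    stop_words="english",    # stop-word filtering
    max_features=5000,       # cap the vocabulary to limit feature explosion
)
X = vectorizer.fit_transform(essays)   # sparse matrix: essays x n-gram counts
print(X.shape, vectorizer.get_feature_names_out()[:5])
```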

Deep Neural Networks

Recent applications of DNNs in AES primarily rely on word embeddings (e.g., Beseiso & Alzahrani, 2020; Rodriguez et al., 2019; Uto et al., 2020). The basic idea of word embeddings is to represent the meaning of words with specific loadings (i.e., numerical values) on several latent dimensions. Each dimension represents a different (and largely unknown) aspect of semantic meaning. Each word has a unique set of loadings representing its meaning as a vector in an N-dimensional semantic space. Words with similar meanings have similar loading patterns (i.e., a similar vector representation in the semantic space). The number of latent dimensions N serves as a hyperparameter and can be set to arbitrary values. For instance, the embedding layer of the BERT-base model consists of 768 dimensions. Training embedding models involves a DNN that learns to predict words based on their surrounding words (i.e., the context). After extensive training on large samples of authentic texts, the final embeddings capture nuanced semantic relationships, such as syntactic and thematic similarity between words. Pre-trained embedding models, such as Word2Vec, are publicly available and have been trained on extensive text corpora.
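
The following toy sketch illustrates this idea with invented four-dimensional vectors (real embedding models use hundreds of dimensions learned from large corpora):

```python
import numpy as np

# Invented toy vectors purely for illustration.
embeddings = {
    "teacher": np.array([0.8, 0.1, 0.3, 0.0]),
    "student": np.array([0.7, 0.2, 0.4, 0.1]),
    "banana":  np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["teacher"], embeddings["student"]))  # high similarity
print(cosine(embeddings["teacher"], embeddings["banana"]))   # low similarity
```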

Based on such word embeddings, several text-processing DNN architectures, like recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), have been developed, and many of them have also been adopted for AES tasks (e.g., Alikaniotis et al., 2016; Taghipour & Ng, 2016; Uto & Okano, 2020). One further challenge in processing text arises from the fact that the meaning of a word is never fixed, but highly affected by the context in which it appears. Thus, the words’ latent representations should not be fixed either, but rather adapted according to context. To tackle this issue, various advanced model architectures have been proposed, with attention mechanisms representing a groundbreaking development (Vaswani et al., 2017). Attention mechanisms facilitate the dynamic adjustment of word embeddings based on the surrounding words, enabling models to better capture the meaning of words in a given context. The implementation of such attention mechanisms in large pre-trained transformer models has recently led to significant improvements and breakthroughs in various NLP tasks, such as text classification (e.g., Yang et al., 2019), translation (e.g., Lample & Conneau, 2019), or summarization (e.g., Lewis et al., 2019).

In AES, the application of transformer models has improved state-of-the-art performances (Bai & Stede, 2022; Uto et al., 2020; Wang et al., 2022; Xue et al., 2021). In text classification or regression tasks using transformers like BERT (i.e. encoder models), a fixed-length text-level representation is required that is independent of the number of words or tokens in a given text. Consequently, researchers often employ text-level pooling methods (Shen et al., 2018) or utilize a special token representation, such as the CLS token approach (e.g., Uto et al., 2020). These strategies provide a contextualized representation of the text with a consistent length, which can be effectively used as features for the regression or classification tasks at hand (Mayfield & Black, 2020).

On the one hand, DNNs provide a powerful approach to AES with no need for elaborated feature engineering and with the promise of capturing content much better than n-gram or other content feature approaches such as prompt-similarity analysis or topic dictionaries. On the other hand, contextual embeddings are latent representations of textual information, which complicates the goal of explainable AES.

Hybrid Models

Several recent AES applications have suggested that contextual embedding-based DNNs and feature engineering approaches should not be considered as competing (see, e.g., Ke & Ng, 2019), but rather as complementing each other by forming a combined model (Bai & Stede, 2022; Kusuma et al., 2022). As demonstrated, for instance, by Uto et al. (2020) or Beseiso and Alzahrani (2020), such combined models, typically referred to as hybrid models, can outperform single-resource models (see also, e.g., Dasgupta et al., 2018, and Mizumoto & Eguchi, 2023). This result seems quite intuitive, as both approaches use different strategies to process text data and thus might capture different aspects of essay quality. However, hybrid models have so far only been applied to holistic scoring. Holistic scoring tasks make it impossible to determine which approach has its merits regarding which aspects of text quality. This limitation might be overcome with analytic AES applications.

Method

To address our research questions, we analyzed argumentative student essays written in response to different argumentative writing prompts. Using only argumentative prompts ensured that similar aspects of text quality (also referred to as traits in the following), were relevant across prompts and corpora. Additionally, the aspect of content is generally of particular importance in argumentative essays. We used English essays from L1 and L2 learners to assess the generalizability of the results across different learner populations (see, e.g., Crossley, 2020).

Datasets

To compare the performance of different AES approaches regarding different aspects of text quality, we used four argumentative prompts from two different corpora. Two of these prompts (\(N_1 = 1783\); \(N_2 = 1800\)) stem from the widely used ASAP competition. These two prompts are the only ones from the ASAP corpus that involve argumentative writing. Both prompts contain essays written by US-American L1 learners. Mathias and Bhattacharyya (2018) introduced analytic labels for these two prompts via the so-called ASAP++ annotation, covering five aspects of text quality: content, organization, word choice, sentence fluency, and conventions (more details can be found in the Appendix and in Mathias & Bhattacharyya, 2018).

To expand our analyses to L2 learners, we also included two prompts (\(N_3 = 1179\); \(N_4 = 1112\)) from the MEWS corpus (Measuring Writing Skills in English as a Second Language; Fleckenstein et al., 2020; Rupp et al., 2019). The MEWS corpus contains argumentative essays written by German and Swiss L2 upper secondary school students (Keller et al., 2020). The two prompts required students to write essays on the following topics: (1) whether advertising to young children should be allowed (AD prompt) and (2) whether it is more important for teachers to possess excellent content knowledge or to relate well with students (TE prompt). The essays were labeled analytically in the context of the TrACE project (Training Assessment Competencies in English as a Foreign Language; Keller et al., 2024). The analytic labels contain three traits: content, organization, and language quality. This dataset is available on OSF (Footnote 3).

The four writing prompts, as well as further information and descriptive statistics, can be found in Table 1. Figure 1 represents the prompt-specific distributions of essay lengths. The essays of the L1 learners are slightly longer on average and the distribution is noticeably wider.

Table 1 Prompts of ASAP and MEWS used in the Present Study
Fig. 1 Distributions of Essay Length Counted in Words. The white dashed lines mark the respective mean values

Benchmark Scores and Rater Effects

While the ASAP++ trait scores are already provided as adjudicated true scores, the TrACE trait scores were available as double-rated raw rater data (i.e., at least two scores per essay and analytic rubric; details can be found in Appendix Table 7 and in Keller et al., 2024). As proposed by Uto and Okano (2020), we employed an IRT-based rater model to account for systematic rater effects. To do so, we used the software Facets (Linacre, 2019). However, to keep model complexity low, we decided to account for rater severity effects only (not, for instance, for rater centrality/extremity or consistency; but see Uto & Okano, 2020, or Robitzsch & Steinfeld, 2018, for alternative rater modeling approaches, which can also be combined with AES models). The derived essay scores, after controlling for rater effects, are on a continuous scale but limited to the original scoring range (Linacre, 1994).
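
For reference, a many-facet Rasch model restricted to a rater severity facet can be written in a common rating-scale parameterization as (our notation; the exact specification used in Facets may differ in detail)

\[\log \frac{P_{nrk}}{P_{nr(k-1)}} = \theta_n - \beta_r - \tau_k,\]

where \(\theta_n\) denotes the latent quality of essay \(n\) on the given trait, \(\beta_r\) the severity of rater \(r\), and \(\tau_k\) the threshold of score category \(k\).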

Figure 2 shows the distributions of the analytic target labels of each essay corpus. All targets are approximately normally distributed, except for the two organization traits of MEWS 1 and 2, which are highly negatively skewed.

Fig. 2 Distributions of Essay Trait Labels

Model Inputs

Our guiding research questions focus on the performance of different ML approaches to AES that rely on different input resources. We created a standard DNN architecture (Fig. 3A and B) in tensorflow (TensorFlow Developers, 2024), which was used for all input types. However, it is well known that model performance depends not only on the characteristics of the input vectors (e.g., linguistic features vs. contextual embeddings vs. hybrid) but also on the model architecture (e.g., depth of the DNN) and the fit between the input vector and the model architecture. We therefore systematically varied the hyperparameters of the model architecture in a random search procedure (e.g., Bergstra & Bengio, 2012). This procedure is described in detail in the subsection Training Procedure and Model Architectures. In the following, we first introduce the two different types of model inputs – linguistic features and contextual embeddings.

Fig. 3 Feature-Based (A), Contextual Embedding-Based (B), and Hybrid (C) DNN Architectures. Number of layers, dropout rate, and number of units per Dense layer were varied during the random search procedure

Linguistic Features

We created a set of 220 different linguistic features representing relevant text characteristics typically included in feature-based AES models (following X. Chen & Meurers, 2016; Ke & Ng, 2019; Kumar & Boulanger, 2021; Zesch et al., 2015). Table 2 presents all feature types with examples from our feature set (a comprehensive list of all 220 features can be found in Appendix Table 8 and on OSF, Footnote 4).

Automatic scoring of student essays benefits from a comprehensive analysis of different types of linguistic features. Previous studies suggest that these features constitute relevant dimensions of text quality that can differentiate students’ writing abilities (Attali & Powers, 2008; Deane et al., 2024). For example, length features provide a surface measure of the extensiveness of an essay and its elements, such as sentences and paragraphs, and thus reflect the student’s ability to develop their ideas within a certain amount of time (Shermis & Burstein, 2003). Occurrence features, which count the frequency of specific words or structures, help identify key elements and themes within the text (e.g., Chassab et al., 2021). Error features are crucial for assessing the correctness of language use, highlighting issues in grammar, spelling, and punctuation (e.g., Gamon et al., 2013). Morphological complexity features examine the use of different word forms and structures, indicating the student’s command over language intricacies (Gamon et al., 2013). Cohesion features measure how well the essay’s parts fit together, revealing the student’s ability to create a coherent and logically flowing argument (e.g., Crossley et al., 2016). Readability features assess how easily the text can be understood, which is essential for effective communication (Pitler & Nenkova, 2008). Lexical diversity features indicate the range of vocabulary used, showcasing the student’s linguistic richness and variation (e.g., Jarvis, 2013). Lexical sophistication features highlight the use of advanced and nuanced vocabulary, reflecting a higher level of language proficiency (Crossley, 2020). Lastly, syntactic complexity features analyze the structure of sentences, showing the student’s skill in constructing varied and complex sentences, which is a hallmark of advanced writing ability (Crossley & McNamara, 2014). Together, these features provide a multidimensional assessment of essay quality, capturing both surface-level and deeper linguistic competencies. Empirical studies show that these feature types can be relevant in AES tasks (e.g., Crossley, 2020).

We used the Python library spaCy (Footnote 5) for POS tagging and dependency parsing, LanguageTool (Footnote 6) for error detection, and the BNC, SUBLEX, and NGSL word lists as well as word lists from the Psycholinguistic Database (e.g., the Brown frequency list) to count easy words (i.e., frequently used words in large text corpora) and difficult words (i.e., less frequently used words in large text corpora).
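
To illustrate how a few of these features could be extracted with the tools named above, consider the following sketch (a simplified illustration, not the actual extraction pipeline used in this study):

```python
# Requires: pip install spacy language_tool_python && python -m spacy download en_core_web_sm
import spacy
import language_tool_python

nlp = spacy.load("en_core_web_sm")
tool = language_tool_python.LanguageTool("en-US")

def basic_features(text: str) -> dict:
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_space]
    n_tokens = max(len(tokens), 1)
    return {
        "n_tokens": len(tokens),                                          # length
        "n_sentences": len(list(doc.sents)),                              # length
        "noun_ratio": sum(t.pos_ == "NOUN" for t in tokens) / n_tokens,   # occurrence
        "prep_phrases": sum(t.dep_ == "prep" for t in tokens),            # syntactic complexity
        "error_ratio": len(tool.check(text)) / n_tokens,                  # errors
    }

print(basic_features("Me and him goes to school. We learns a lot there."))
```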

DistilBERT’s Contextual Embeddings

Recent comparisons and reviews of AES applications employing pre-trained transformer models indicated that performance hardly increases when these models are fine-tuned (Mayfield & Black, 2020; Rodriguez et al., 2019). Additionally, runtime and computational demands largely increase due to the extensive fine-tuning processes when using such large models. To keep runtime low, we followed suggestions by Mayfield and Black (2020) and used the distilled version of BERT (DistilBERT; Sanh et al., 2019).

In our present experiments, we focus on the question of how contextual embeddings perform in AES for different traits (e.g., Beseiso & Alzahrani, 2020; Mayfield & Black, 2020; Nadeem et al., 2019). Therefore, we did not fine-tune DistilBERT and kept all its layers frozen (i.e., not trainable) during our experiments. However, we supplemented these non-trainable DistilBERT layers with an essay-level maximum pooling layer (e.g., Shen et al., 2018) and additional Dense layers. In doing so, we obtained a contextual embedding vector of length 768 from DistilBERT for each essay, which served as the input vector for our AES architecture (Fig. 3B).
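
A minimal sketch of this embedding extraction step (simplified relative to the architecture in Fig. 3B) could look as follows:

```python
# Requires: pip install tensorflow transformers
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = TFAutoModel.from_pretrained("distilbert-base-uncased")
encoder.trainable = False                      # keep all DistilBERT layers frozen

def essay_embedding(text: str) -> tf.Tensor:
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="tf")
    hidden = encoder(**enc).last_hidden_state  # shape (1, n_tokens, 768)
    return tf.reduce_max(hidden, axis=1)       # essay-level max pooling -> (1, 768)
```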

Training Procedure and Model Architectures

Main Experiments

To train and evaluate our models, we followed a five-fold cross-validation strategy. For the ASAP datasets, we employed the splits introduced by Taghipour and Ng (2016), which imply five 60-20-20 splits into training, validation, and test data. As the test dataset of the ASAP competition is not publicly accessible, the experiments are based solely on the training data of the ASAP competition (see Taghipour & Ng, 2016; Footnote 7). These predefined splits had also been used for trait scoring of the ASAP++ dataset by Mathias and Bhattacharyya (2018, 2020).

For the MEWS corpus, we also employed five-fold cross-validation. However, because of the considerably smaller datasets in MEWS, we decided to use 70% of each dataset as the training set, 10% of the data as the validation set, and 20% as the test data in each fold. To find the best epoch for each run, we used an early-stopping callback function that tracked the validation loss. The model showing the best performance on the validation set across folds was finally used for evaluation with the test data.

All DNN model architectures were set up in tensorflow (the Python code can be found in the OSF repository, Footnote 8). We designed our AES models as regression models. The values indicating the trait-specific essay qualities are ordinally scaled. Since they range, for instance, from 1 = high quality to 6 = low quality and can be treated as approximately continuous, we decided against classification approaches (see Beseiso & Alzahrani, 2020, for a comparison of classification vs. regression AES models). Thus, we used a single unit with linear activation in the output layer and the mean squared error (MSE) as the loss function in all DNN architectures.

Table 2 Feature Types with Examples from the Feature Set

We employed the Adam optimizer and the mean squared error (MSE) loss function. For each trait of each prompt, different models were trained varying the type of input (features vs. embeddings vs. hybrid). In addition, we systematically changed the hyperparameters defining the model architectures to ensure valid comparisons across the different AES approaches. In doing so, we used a random search procedure (Bergstra & Bengio, 2012) varying the following hyperparameters: learning rate (5e−3, 1e−3, 5e−4, 1e−4), number of dense layers (0 (Footnote 9), 1, 2), units per dense layer (64, 128, 256), and dropout rates (0.2, 0.3, 0.4, 0.5, 0.6). During the random search, we tested 50% of this parameter space.
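
The following sketch illustrates the regression DNN and a single random-search draw over this grid (a simplified illustration; the actual code is available in the OSF repository):

```python
import random
import tensorflow as tf

GRID = {
    "learning_rate": [5e-3, 1e-3, 5e-4, 1e-4],
    "n_dense": [0, 1, 2],
    "units": [64, 128, 256],
    "dropout": [0.2, 0.3, 0.4, 0.5, 0.6],
}

def build_model(input_dim: int, cfg: dict) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(input_dim,))          # linguistic features or embeddings
    x = inputs
    for _ in range(cfg["n_dense"]):
        x = tf.keras.layers.Dense(cfg["units"], activation="relu")(x)
        x = tf.keras.layers.Dropout(cfg["dropout"])(x)
    outputs = tf.keras.layers.Dense(1, activation="linear")(x)  # single-unit regression head
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(cfg["learning_rate"]), loss="mse")
    return model

cfg = {k: random.choice(v) for k, v in GRID.items()}     # one random-search draw
model = build_model(input_dim=220, cfg=cfg)              # 220 linguistic features as input
```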

Hybrid Model

The hybrid architecture used both input resources – linguistic features and essay-level contextual embeddings from DistilBERT. In the first step, the two types of inputs were processed separately through additional Dense and Dropout layers in two parallel model parts (Fig. 3C). The hyperparameters of the feature input part were determined by the best-performing models from the corresponding feature-based models. However, it turned out that the hybrid architecture was much more difficult to optimize and that additional Dense layers hardly improved model performance of the embedding-based models (see Appendix Table 9). Therefore, we decided to employ a reduced second parallel model part for the embeddings. This second part using the embedding input was only fed through one additional dropout layer and then directly into the concatenation layer (Fig. 3C). As the first part of the model architecture was fixed, we only varied the dropout rate for the second (i.e., the embedding input) part of the model (0.3, 0.4, 0.5, 0.6, 0.7) and the learning rate of the Adam optimizer (5e−3, 1e−3, 5e−4, 1e−4). The concatenation layer was incorporated as the last stage of the hybrid model before the (single-unit) output layer. This implies that interactions between linguistic features and contextual embeddings were enabled in this hybrid architecture (Fig. 3C).
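
A minimal sketch of this hybrid architecture (with placeholder layer sizes and dropout rates) could look as follows:

```python
import tensorflow as tf

feat_in = tf.keras.Input(shape=(220,), name="linguistic_features")
emb_in = tf.keras.Input(shape=(768,), name="distilbert_embedding")

# Feature branch: Dense/Dropout layers taken over from the best feature-based model
x = tf.keras.layers.Dense(128, activation="relu")(feat_in)
x = tf.keras.layers.Dropout(0.4)(x)

# Embedding branch: only a dropout layer, then straight into the concatenation
y = tf.keras.layers.Dropout(0.5)(emb_in)

merged = tf.keras.layers.Concatenate()([x, y])
output = tf.keras.layers.Dense(1, activation="linear")(merged)   # single-unit output

hybrid = tf.keras.Model([feat_in, emb_in], output)
hybrid.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```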

Addition and Ablation Tests

RQ 3 was concerned with the question of which linguistic feature types were not covered by the contextual embeddings. To answer RQ 3, we ran two types of tests to gain more insights into the interplay of contextual embeddings and linguistic features. In the first series of tests (addition), we always started with the embedding-based DNN. From there, we ran a reduced form of the hybrid model, supplementing the contextual embeddings with one feature group at a time. In doing so, we distinguished nine types of features (see Table 2). Thus, we reran each trait- and prompt-specific model nine times using the same cross-validation procedure as in the main experiments.

For the second series of tests (ablation), we took the full hybrid model as the starting point. In an iterative process, we reran each model nine times, each time removing a different feature group from the input. Again, we applied the same cross-validation procedure as in the main experiments.
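
Schematically, the ablation series can be implemented as a loop over feature groups; in the following sketch, FEATURE_GROUPS, X_features, X_embeddings, y, and train_and_eval are placeholders rather than objects from our actual pipeline:

```python
# Illustrative column indices per feature group (placeholders).
FEATURE_GROUPS = {"length": [0, 1, 2], "errors": [3, 4], "cohesion": [5, 6, 7]}

def ablation_results(X_features, X_embeddings, y, train_and_eval):
    """Retrain the hybrid model once per feature group, dropping that group's columns."""
    results = {}
    for group, cols in FEATURE_GROUPS.items():
        keep = [c for c in range(X_features.shape[1]) if c not in cols]
        X_reduced = X_features[:, keep]                                # remove one feature group
        results[group] = train_and_eval(X_reduced, X_embeddings, y)   # e.g., QWK on the test folds
    return results
```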

Cross-Prompt Scoring

For cross-prompt scoring (RQ 4), we relied on the hyperparameter settings of the respective best-performing model of each prompt and trait. Again, we employed the cross-validation procedure outlined above but used the complete data from the respective other prompt within a given corpus as the test data instead.

Two Linear Regression Baselines

We additionally compared the three DNNs to two simpler baseline models built with Scikit-learn (Footnote 10). In doing so, we used (1) a linear ridge regression with an n-gram vectorizer (with stop-word filtering) as input and (2) a linear ridge regression with our feature set (described in the Linguistic Features sub-section above) as input. The n-gram baseline model allowed us to disentangle whether and to what extent more complex text processing, as provided by pre-trained transformer models using embeddings and attention mechanisms, is superior to simpler text processing using the classical n-gram approach in automated essay trait scoring tasks. Furthermore, the feature baseline model allowed us to explore whether (and to what extent) complex model architectures (i.e., DNN architectures) are superior to simple linear models in automated essay trait scoring. To fit the baseline models, we used the same cross-validation procedure and a grid search approach varying the n-gram range (unigrams, bigrams, trigrams) and the alpha parameter of the ridge regression (1e−4, 1e−3, 1e−2, 1e−1, 1, 10, 100). The best-performing model on the validation sets across folds was evaluated with the test data.
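
A sketch of the n-gram baseline with this grid could look as follows (simplified; the predefined cross-validation splits are replaced here by scikit-learn's default folds):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("ngrams", CountVectorizer(stop_words="english")),   # n-gram vectorizer with stop-word filtering
    ("ridge", Ridge()),                                   # linear ridge regression
])

param_grid = {
    "ngrams__ngram_range": [(1, 1), (1, 2), (1, 3)],      # uni-, bi-, tri-grams
    "ridge__alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100],
}

search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error")
# search.fit(train_essays, train_scores)   # placeholder lists of raw texts and trait scores
```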

Evaluation Metrics

We used quadratic weighted kappa (QWK; Cohen, 1968) as the evaluation metric. QWK is the most frequently used metric in AES tasks and had also been reported in previous analyses of the ASAP++ datasets (Mathias & Bhattacharyya, 2018, 2020).

A QWK value of one indicates perfect agreement between predicted scores and benchmarks, a value of zero corresponds to a chance agreement and a negative value represents systematic disagreement, with minus one as the extremum corresponding to complete disagreement.

To also compare model- and trait-specific performance across prompts, we employed the average QWK (\(\overline{\text{QWK}}\); see, e.g., Taghipour & Ng, 2016) as well as mean QWK differences (\(\overline{\Delta\text{QWK}}\)). However, as averaging QWK across different scales is notoriously problematic (e.g., Doewes et al., 2023), we also used the average Pearson correlation coefficient (PCC) as an additional metric to compare model performance across traits.
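
For illustration, QWK can be computed with scikit-learn after mapping the continuous predictions (and, where necessary, the continuous benchmark scores) back to the original integer scale; the rounding step below is one possible choice, not necessarily the one used in a given study:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred, low=1, high=6):
    """Quadratic weighted kappa on scores rounded/clipped to the original scale."""
    y_true_int = np.clip(np.rint(y_true), low, high).astype(int)
    y_pred_int = np.clip(np.rint(y_pred), low, high).astype(int)
    return cohen_kappa_score(y_true_int, y_pred_int, weights="quadratic")
```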

T-Tests

To examine whether one approach (feature vs. embedding vs. hybrid) performed significantly better than the other approaches, we used pairwise T-tests (see, e.g., Uto et al., 2020). We employed one-sided testing for comparisons against the baseline models.
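
For illustration, such a paired comparison over trait- and prompt-specific QWK values can be run with SciPy (the values below are placeholders):

```python
from scipy.stats import ttest_rel

qwk_features = [0.61, 0.58, 0.65, 0.70]      # placeholder QWKs per trait/prompt
qwk_embeddings = [0.55, 0.60, 0.50, 0.68]

stat, p_two_sided = ttest_rel(qwk_features, qwk_embeddings)                         # two-sided
stat, p_one_sided = ttest_rel(qwk_features, qwk_embeddings, alternative="greater")  # one-sided
```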

Results

Tables 3 and 4 present the trait-specific test data performances in QWK after the random search cross-validations of our main experiments for ASAP and MEWS, respectively. The best-performing hyperparameter settings for each model can be found in the Appendix (Table 9).

Features Versus Contextual Embeddings

In RQ 1, we aimed to compare the performance of a feature-based and a contextual embedding-based model. In Tables 3 and 4, the QWKs of the feature- and the embedding-based DNN predictions on the test data can be found in the (prompt-specific) third and fourth rows, respectively. For comparison, the same results in terms of PCC can be found in the Appendix (Tables 10 and 11). Across traits and prompts, the feature-based model outperformed the embedding-based model in 11 out of 16 cases. However, the performance of both models was similar. The feature-based model achieved an overall average of \(\overline{\text{QWK}}_{\text{features}} = 0.614\) (\(\overline{\text{PCC}} = 0.673\)), and the embedding-based model achieved an overall average of \(\overline{\text{QWK}}_{\text{embeddings}} = 0.563\) (\(\overline{\text{PCC}} = 0.625\)). The T-test across traits and prompts implied no significant differences between the two approaches (p = .345). In addition, it became apparent that the embedding-based model fell short especially in the trait organization of the two MEWS prompts (QWK differences of \(\Delta\text{QWK}_{\text{MEWS 1}} = 0.31\) and \(\Delta\text{QWK}_{\text{MEWS 2}} = 0.33\), respectively), while performing almost equally well across all other traits (and prompts). Furthermore, the same pattern for the trait organization was evident in ASAP 2 but not in ASAP 1. Nevertheless, this finding seems plausible, as (even contextual) embeddings might not carry information about an essay’s (meta-)structure, which is relevant for human annotators judging student essays. In contrast, such information, for instance the number of paragraphs, is represented in the feature set.

Table 3 Quadratic Weighted Kappa Across ASAP Essay Traits
Table 4 Model Performances Across MEWS Essay Traits

Besides these differences in the trait organization, no systematic superiority of one approach was found across traits. For ASAP, the QWKs even implied more systematic differences between prompts than between traits. While the embedding-based DNN outperformed the feature-based DNN in four out of five traits of ASAP 1, the feature-based DNN outperformed the embedding-based DNN in all traits of ASAP 2.

Furthermore, we compared these two models against the two simpler baseline models, which used a ridge regression with n-grams versus the feature set as input. The prompt-specific first and second rows of Tables 3 and 4 represent the corresponding test data performance for each trait. The comparison of our DNN target models with the n-gram baseline model revealed that both target models consistently outperformed this baseline. The one-sided T-tests indicated significant performance advantages (\(p_{\text{features}} < 0.001\) and \(p_{\text{embeddings}} = 0.010\), respectively). However, comparisons to the feature-based linear regression baseline only partly revealed advantages for the target models, and the T-tests were not significant (\(p_{\text{features}} = 0.818\), \(p_{\text{embeddings}} = 0.465\)). The feature-based linear baseline model even performed consistently above the embedding-based DNN across all traits of the MEWS prompts (\(\overline{\Delta\text{QWK}}_{\text{embeddings vs. baseline 1}} = -0.01\)). The feature-based DNN also fell short in four out of six traits of the MEWS prompts compared to the feature baseline (\(\overline{\Delta\text{QWK}}_{\text{features vs. baseline 1}} = -0.02\)). Regarding the two ASAP prompts, however, the two DNN approaches almost consistently performed above the feature-based baseline, although the differences were small (\(\overline{\Delta\text{QWK}}_{\text{features vs. baseline 2}} = 0.02\) and \(\overline{\Delta\text{QWK}}_{\text{embeddings vs. baseline 2}} = 0.01\)). These relatively small advantages imply that nonlinearities and interactions among features (as well as embeddings) were of minor importance when scoring the essay traits (see also Table 9 in the Appendix). This finding also matches expectations, as raters typically follow strict judgment guidelines for benchmark scoring, and such guidelines are almost exclusively based on linear, additive scoring rules.

Hybrid Architecture

The goal of RQ 2 was to compare a hybrid model architecture containing both input types – linguistic features and contextual embeddings – to the single-resource models. The trait-specific test set performance of the hybrid model is represented in the fifth row of each prompt in Tables 3 and 4. The hybrid model achieved an average performance of \(\overline{\text{QWK}}_{\text{hybrid}} = 0.640\) (\(\overline{\text{PCC}} = 0.681\)). As expected, the hybrid model outperformed the single-resource models in most traits (12 out of 16) across prompts. However, the one-sided T-test comparing the performance of the feature-based model to the hybrid was not significant (\(\overline{\Delta\text{QWK}}_{\text{hybrid}-\text{features}} = 0.03\), \(p = 0.507\)). The difference between the embedding-based DNN and the hybrid also failed to reach significance (\(\overline{\Delta\text{QWK}}_{\text{hybrid}-\text{embeddings}} = 0.08\), \(p = 0.156\)). Despite the non-significant results, the hybrid tended to perform better than the single-resource models across prompts and traits. This finding meets expectations and is in line with recent findings from holistic scoring (Bai & Stede, 2022; Uto et al., 2020). Furthermore, it implies that both types of input indeed capture partially different text information relevant for scoring essay traits. Thus, both types of input complemented each other to a certain extent, even though most of the text information relevant for assessing essay traits seemed to be captured by both input types. This is plausible considering that both single-resource models already achieved high QWKs in almost all traits and prompts.

A closer look at the different traits revealed that the largest average gains when comparing the feature-based model to the hybrid were apparent in the content and language traits (\(\overline{\Delta\text{QWK}}_{\text{content}} = 0.04\), \(\overline{\Delta\text{QWK}}_{\text{language}} = 0.04\)). However, the advantages of the hybrid model were only slightly smaller for the organization traits on average (\(\overline{\Delta\text{QWK}}_{\text{organization}} = 0.02\)). For this comparison, the three language traits word choice, sentence fluency, and conventions, used in the ASAP++ analytic scoring rubric, were all treated as measures of language (to match the less detailed dimensionality of the MEWS rubric). However, a closer look at these three language traits in ASAP++ revealed that the performance on the trait conventions was least likely to benefit from the combined input of the hybrid model.

These findings are not surprising, as the employed features hardly capture content-related information, and the contextual embeddings made a decisive contribution in this respect. Therefore, the largest performance gains had been expected for the trait content. However, the powerful properties of contextual embeddings regarding language and writing style have also been repeatedly demonstrated in recent years. In this context, the successful interplay of features and contextual embeddings for the language traits is also to be expected.

A closer look at the trait-specific gains comparing the embedding-based and the hybrid model revealed the highest performance gains for the organization traits (\(\overline{\Delta\text{QWK}}_{\text{organization}} = 0.16\), \(\overline{\Delta\text{QWK}}_{\text{content}} = 0.02\), \(\overline{\Delta\text{QWK}}_{\text{language}} = 0.02\)). As mentioned above, this result also corresponds to our expectations, since embeddings hardly capture any information about the meta-structure of essays.

Addition and Ablation Tests

To shed more light on the interplay of contextual embeddings and specific feature types when scoring certain essay traits, we ran two series of tests. In the first series, we iteratively supplemented the embedding-based DNN with one feature type (addition tests). In doing so, we tracked the performance gains of these extended models compared to the DNNs that relied on embeddings only. These comparisons allowed us to explore essay characteristics that could hardly be covered by the contextual embeddings but could be covered by the appropriate features, thus improving model performance. Figure 4 presents the respective results in terms of QWK change (i.e., \(\Delta\text{QWK}\)) for each trait and prompt. Across prompts, performance gains on the content traits appeared most often when morphological complexity features supplemented the contextual embeddings. Length features turned out to be the most important supplement to the contextual embedding input for the organization traits. Furthermore, lexical sophistication, error, and occurrence features were most likely to achieve performance advantages across the language traits. Again, these findings seem reasonable. Length features describing the meta-structure of the essays provide structural information that embeddings cannot capture. In the context of the assessment of language traits, text characteristics such as spelling or grammar errors are also not natural ingredients of embeddings but are undoubtedly important for judging the language quality of student essays. The same applies to lexical sophistication and occurrence features, which describe aspects of language quality inaccessible to contextual embeddings. An interesting finding is that morphological complexity features were most relevant for the content traits. At first glance, morphological complexity might not seem to carry content-related information. However, comparatives and superlatives might be highly relevant inflections in argumentative writing; for example, these inflections can be relevant when different arguments are contrasted or weighted to draw conclusions, and students’ ability to contrast and weigh arguments is essential for good argumentative writing.

In the second series of tests, we explored the unique contribution of single feature types. We used the complete hybrid architecture and iteratively removed one of the nine feature types (ablation tests). Figure 5 shows the performance drops for each trait- and prompt-specific model across the nine re-analyses. Consistent performance drops across prompts indicate that a particular feature type contains trait-relevant information and that the contextual embeddings and the other features do not capture this information. The results imply that the performance of models for the content traits dropped across all four prompts when readability and syntactic complexity features were removed. Therefore, both seem to contain unique information relevant to the assessment of content that the other feature types or contextual embeddings could not capture. Consistent performance drops were apparent when removing cohesion and, again, readability features from the trait models for organization. When removing occurrence, length, or error features, performance almost consistently decreased across the language traits. Throughout traits, length features, in particular, emerged as an essential feature type capturing important and unique text characteristics for judging the student essays.

Fig. 4 Addition Tests Tracking Highest Performance Gains (\(\Delta\text{QWK}\)) by Adding One Type of Features to the Embedding-Based Models

Fig. 5 Ablation Tests Tracking Highest Performance Drops (\(\Delta\text{QWK}\)) by Removing One Type of Features from the Hybrid Models

Cross-Prompt Scoring

Tables 5 and 6 present the cross-prompt performance of the DNN models trained on the ASAP and MEWS corpora, respectively. The analyses show that, across models and traits, the performance drop (\(\Delta\text{QWK}\)) when comparing the within-prompt performance to the cross-prompt performance was between −0.01 and −0.30 (\(\Delta\text{PCC}\) range = [−0.15; −0.01]). For the test data of the MEWS 2 organization trait, the embedding-based model trained on MEWS 1 even slightly outperformed the embedding-based model trained on MEWS 2 (\(\Delta\text{QWK} = 0.02\); i.e., the cross-prompt performance was better than the within-prompt performance). However, the embedding-based models generally worked very poorly for the MEWS organization trait.

Regarding the models trained on the MEWS prompts and ASAP 1, the feature-based models outperformed the embedding-based models in cross-prompt performance across traits. However, the embedding-based models trained on ASAP 2 consistently outperformed the feature-based model in cross-prompt performance. T-tests revealed no significant cross-prompt scoring advantages for the feature-based DNN (\(\overline{\text{QWK}}_{\text{features}} = 0.49\); \(\overline{\text{PCC}}_{\text{features}} = 0.52\)) compared to the embedding-based model (\(\overline{\text{QWK}}_{\text{embeddings}} = 0.42\); \(\overline{\text{PCC}}_{\text{embeddings}} = 0.45\)) (p = .131). Unsurprisingly, the results of these cross-prompt performance comparisons are in line with the within-prompt patterns (see Tables 3 and 4). However, adjusting for the within-prompt performance still implies slight advantages for the feature approach (\(\overline{\Delta\text{QWK}}_{\text{features}} = -0.12\), \(\overline{\Delta\text{QWK}}_{\text{embeddings}} = -0.15\); \(\overline{\Delta\text{PCC}}_{\text{features}} = -0.09\), \(\overline{\Delta\text{PCC}}_{\text{embeddings}} = -0.11\)).

Table 5 ASAP Cross-Prompt Scoring Performance in Terms of QWK (and Comparing Cross-Prompt and Within-Prompt Performance)
Table 6 MEWS Cross-Prompt Scoring Performance in Terms of QWK (and Comparing Cross-Prompt and Within-Prompt Performance)

The hybrid model also outperformed the embedding-based approach in cross-prompt scoring but fell just short of the feature-based model on average (\(\overline{\text{QWK}}_{\text{hybrid}} = 0.48\); \(\overline{\text{PCC}}_{\text{hybrid}} = 0.52\)). Surprisingly, when adjusting for the within-prompt performance, the hybrid model even performed worse than both single-resource approaches (\(\overline{\Delta\text{QWK}}_{\text{hybrid}} = -0.16\), \(\overline{\Delta\text{PCC}}_{\text{hybrid}} = -0.12\)). However, T-tests revealed that these differences were not statistically significant.

Furthermore, we also explored trait-specific cross-prompt performance losses. Across models, the most remarkable drop in model performance from within-prompt to cross-prompt scoring was revealed for the content traits (\(\overline{\Delta\text{QWK}}_{\text{content}} = -0.17\); \(\overline{\Delta\text{PCC}}_{\text{content}} = -0.09\)). The comparably smallest drop was apparent for the organization traits (\(\overline{\Delta\text{QWK}}_{\text{language}} = -0.10\); \(\overline{\Delta\text{PCC}}_{\text{language}} = -0.05\)). This result is also in line with expectations, as the topics of the individual prompts differ and the features’ importance can therefore vary depending on the prompt. In addition, indicators of language and organizational text quality might be more stable across different writing prompts.

Discussion

In the present study, we compared different supervised ML models for automated trait scoring of student essays using four argumentative prompts from L1 and L2 upper secondary students. Results implied small performance advantages for trait-specific models based on an extensive set of features compared to models based on contextual embeddings stemming from the pre-trained transformer DistilBERT. The differences between the two approaches were particularly evident in the organization traits. However, since contextual embeddings do not require extensive feature engineering, this approach can serve as a valuable baseline model for essay trait scoring, performing significantly better than an n-gram baseline model in our experiments. The hybrid approach, using both input types, consistently outperformed the two single-resource models across traits. Addition tests revealed that the performance of the embedding-based models was consistently enhanced in content assessment when combined with morphological complexity features. In addition, performance gains were consistently achieved in organization assessment when combined with length features and in the assessment of language traits when combined with lexical sophistication, error, and occurrence features. The feature-based models exhibited slight advantages in cross-prompt scoring over the embedding-based and hybrid models. When comparing trait-specific cross-prompt and within-prompt performance, losses were slightly larger for the content traits across ML approaches and prompts than for the organization and language traits.

Limitations and Future Research

Despite the various models considered and the extensive experiments run, the present study also has limitations that imply several directions for future research.

First, even considering L1 and L2 learners’ essays, the present investigation is limited to upper high school / secondary school students from three countries (American L1 students and German and Swiss L2 students). The performance of different models might vary with learner populations and should be extended, for instance, to primary school (e.g., Trüb et al., under review) or higher education contexts (e.g., Beseiso et al., 2021).

Second, pooling contextualized embeddings at the essay level implies a loss of the token-level information captured by transformer models. This essay-level pooling approach is only one possibility of using transformer models in AES tasks (see, e.g., Xue et al., 2021). Future studies might explore transformer models’ potential, for example, for feature engineering. Valuable strategies might be to use section-level embeddings or cosine similarities with prompts or best-practice solutions (see Bexte et al., 2022, 2023). Furthermore, sentence-level embeddings can be used for calculating cohesion measures.

Third, in our experiments, we relied on contextual embeddings as fixed features provided by DistilBERT. However, although incurring much higher computational costs, transformers can also be fine-tuned to specific tasks. While performance gains were small at best in holistic AES tasks (see, e.g., Mayfield & Black, 2020; Rodriguez et al., 2019), less is known about fine-tuning transformers for analytic scoring (Footnote 11). Future studies could deepen this aspect of transformer models in automated analytic essay scoring.

Fourth, there are other essential topics in AES applications such as fairness and algorithms’ vulnerability to cheating behavior. Future studies could compare feature-based and embedding-based AES models regarding fairness (Schaller et al., 2024) and cheating behavior in trait assessment (see, e.g., Ding et al., 2020; also Bai & Stede, 2022).

Fifth, the performance of supervised ML models depends strongly on the number of training examples. This might, to a certain extent, explain the performance differences between the ASAP and MEWS prompts in our experiments. However, further systematic experiments varying the amount of training data across ML approaches and prompts would be needed to quantify the relevance of training data size. Such investigations might also consider active learning approaches to minimize the required number of training examples (e.g., Firoozi et al., 2023; Horbach & Palmer, 2016).

Finally, large language models (LLMs; i.e., extensively pre-trained generative transformer models such as GPT-4) have recently entered the AI world and also offer new possibilities to the field of AES applications. First approaches have, for instance, explored their potential to be included in an LLM-based hybrid model (Mizumoto & Eguchi, 2023).

Practical Implications

The present study has several implications, especially for creating feedback tools and tutoring systems in the context of student essay evaluation. In our experiments, the feature engineering approach performed as well as or better than the embedding approach across essay traits. Since the feature approach can provide more explainability and, thus, more concrete practical information for student feedback, we consider the feature approach the most promising avenue for implementing real-life AES tools. However, in AES applications, an embedding-based DNN approach can serve as a valuable baseline that is easy to set up, as no feature engineering is required. Furthermore, our experiments imply that a hybrid approach can increase performance compared to single-resource models. Feature engineering approaches can benefit from embedding-based model inputs, especially when scoring content and language quality traits.

In future applications, the hybrid approaches could be chosen for the summative assessment of essay traits if a sufficiently large amount of training data is available. The feature engineering approach, on the other hand, could be used primarily for formative feedback due to its explainability.