Introduction

Synthesis writing is widely taught across academic disciplines and serves as an important means of assessing writing ability, text comprehension, and content learning (Vandermeulen et al., 2019). However, this form of writing can be extremely challenging, even for students in higher education (Van Ockenburg et al., 2018). Synthesis writing differs from other types of writing in terms of both cognitive and task demands because it requires writers to integrate information across source materials. Thus, it is a hybrid task that requires both reading and writing (Spivey & King, 1989); students must read and select appropriate information from sources, contrast and integrate that information while writing, and continuously revise texts as a result.

Synthesis writing is therefore unique because it focuses on selecting information from multiple texts, integrating, organizing, and connecting that information into a written text, and thematically structuring the information (Solé et al., 2013; Spivey, 1997; Spivey & King, 1989). A small number of studies have focused on the processes involved in information selection (Martínez et al., 2015; Mateos & Solé, 2009; Solé et al., 2013), whereas another line of research has sought to characterize the product of integration (e.g., the amount, accuracy, and manner in which information is integrated into an essay from a source text; Gebril & Plakans, 2009; Uludag et al., 2019; Weigle & Parker, 2012). Importantly, product approaches have almost exclusively relied on hand annotations of source integration, which are costly in terms of the time and money invested (Williamson et al., 2012).

The purpose of this study is to examine methods to automatically annotate argumentative synthesis essays for aspects of source integration. To do so, we annotate a large corpus of argumentative synthesis essays composed by college students, military members, and the general adult population for aspects of source integration using natural language processing (NLP) techniques and examine links between these annotations and human ratings of writing quality. The NLP techniques we use include hand-crafted NLP features and pre-existing NLP tools that calculate features related to semantic and keyword overlap between the essays and the source texts, plagiarism from the source texts, and instances of source citation and quoting. The goal of the study is to examine the potential for these NLP features to predict human scores related to holistic essay quality and source use/inferencing beyond text length, which is generally a strong predictor of essay quality (Ferris, 1994; Frase, Faletti, Ginther, & Grant, 1999; Guo et al., 2013). Such an approach can provide information about the incidence of source integration features in source-based writing and the associations between these features and text quality, as well as help assess the reliability of NLP features.

Argumentative synthesis writing

Argumentative synthesis writing prompts individuals to write essays using information from source texts that have been provided to them. Learning to write from and about sources is a key academic skill often developed during secondary and tertiary education (Cumming, Lai, & Cho, 2016; Vandermeulen et al., 2019). Synthesis tasks are often difficult, as they require the successful interpretation of the source material as well as the appropriate integration of information from the various sources (Martínez, Mateos, Martín, & Rijlaarsdam, 2015; Mateos et al., 2008; Solé et al., 2013). Despite the common use of source-based writing in academic contexts, many students struggle to master the task (Grabe & Zhang, 2013; Vandermeulen et al., 2019) even though the synthesis and integration of information into writing is considered essential for success, especially in tertiary education (Newell, Beach, & VanDerHeide, 2011; Haswell, 2000; Hirvela, 2011; Hood, 2008; Leki, 2017; Lillis & Curry, 2010; Melzer, 2009; Tardy, 2009). Indeed, success on synthesis tasks demonstrates that an individual is able to comprehend source texts and, in turn, develop written, argumentative responses that coherently incorporate source material (Solé et al., 2013; Spivey & King, 1989).

At its core, synthesis writing involves organizing ideas while reading and writing, selecting information from source texts, and integrating that information into writing based on inferences from the source texts (Spivey, 1997). More specifically, synthesis involves the accurate integration and summarization of source information (Cumming, Lai, & Cho, 2016), the appropriate use of quotes, and the avoidance of plagiarism. Avoiding plagiarism and developing ownership of source information are perhaps the most difficult tasks for writers beginning to practice synthesis writing. Thus, many studies have examined how students’ text integration represents inappropriate or inadequate textual borrowing, presumably resulting from a lack of knowledge regarding cultural or discourse conventions (Bazerman, 2004; Belcher & Hirvela, 2001; Chandrasoma et al., 2004).

There are two general approaches to assessing successful synthesis in writing: process and product approaches. Process approaches often use audio-visual recordings (Martínez et al., 2015; Mateos & Solé, 2009; Solé et al., 2013) or keystroke logging (Leijten et al., 2019; Vandermeulen et al., 2019) to examine time spent reading sources and writing as well as transitions between the source texts and the essay. Such studies can provide information about sub-processes involved in synthesis writing, including comparing and contrasting information found in sources, linking sources, and integrating source information. A product approach, which is amenable to NLP techniques, focuses on identifying the amount of information integrated into an essay from a source text, how the information is integrated (e.g., quoting, paraphrasing, or summarizing), and the accuracy of the integration (Gebril & Plakans, 2009; Petrić, 2012; Plakans, 2009; Plakans & Gebril, 2013; Uludag et al., 2019; Weigle & Parker, 2012).

Source synthesis and writing proficiency

Research examining the product of successful synthesis writing generally follows two methods. The first method examines associations between features of source integration found in the essay and human scores of writing quality. The second method compares differences in source integration between first language (L1) and second language (L2) writers. The hypothesis underlying this second approach is that L1 writers are more proficient than L2 writers and that differences between the two populations can help identify features of successful writing.

Multiple studies have examined links between textual features related to source integration found in argumentative essays and human ratings of quality. All of this work has relied on hand annotating features related to source integration, and the majority of this research has focused on L2 writing. For instance, studies have indicated that L2 writers with lower writing scores synthesized discourse to a lesser degree and used fewer source-based ideas in their essays (Plakans, 2009) that less accurately represented the source text (Uludag et al., 2019). Gebril and Plakans (2009) and Plakans and Gebril (2013) also reported that more advanced L2 writers used paraphrasing and summarization to integrate source material. Additionally, advanced L2 writers used direct source quoting to a greater extent than less advanced writers. Similarly, Petrić (2012) found that in L2 theses, higher-scoring writers used more direct quotations and that the quotations produced by higher-proficiency writers were fragments while lower-proficiency writers used clause-based quotations. In contrast, Weigle and Parker (2012) found that text integration in L2 writing was generally restricted to short excerpts from the source text regardless of proficiency level and that there were few differences by proficiency level in how text was integrated into the essay, although they did report that less proficient students tended to quote more extensively. Leijten et al. (2019) also found that source integration was not a predictor of essay quality for L2 writers, but it was a predictor for L1 writers.

There are also reported differences in source integration between L1 and L2 writers during argumentative synthesis writing. For instance, studies have reported that L1 writers include more citations than L2 writers (Borg, 2000; van Weijen et al., 2019) and that citations that involve quotations are longer in L1 writing compared to L2 writing (Borg, 2000). Research also reports that when L2 writers do integrate text from the source, they seem to borrow more direct strings of words and are less likely to properly cite the source than L1 writers (Shi, 2004; van Weijen et al., 2019). More recently, van Weijen, Rijlaarsdam, and van den Bergh (2019) conducted a within-writer comparison of L1 and L2 source-based essays. The essays were annotated for six elements of source integration (e.g., unique source integration, source copying, and patchwriting). Van Weijen et al. found that source integration was more strongly related to person-level differences as compared to language differences (although L1 samples included more unique sources and less copied material than L2 samples). Most importantly, they also reported no clear effects of L2 proficiency on source integration.

Annotating source-based writing features

Annotating synthesis writing features (as in the product studies above) can provide important information about features central to source integration. However, as noted, most annotations of source integration have involved hand-coding. Common elements that have been hand-coded include indirect source integration (paraphrasing and summarizing), direct source integration with quoting, direct source integration without quoting, total source integration (Gebril & Plakans, 2009; Plakans, 2009; Uludag et al., 2019), length of quoting, citation use (Leijten et al., 2019; Petrić, 2012; Weigle & Parker, 2012), switching between sources (Leijten et al., 2019), and source accuracy (Uludag et al., 2019). However, hand-coding language data is a difficult, time-consuming, subjective, and resource-intensive process (Dodigovic, 2005; Higgins et al., 2011) and is not a viable option for large corpora of texts. To analyze relevant linguistic features, patterns, and structures in large corpora, automated methods based on NLP are necessary (Granger et al., 2007; Meurers, 2015).

To date, most NLP approaches to source integration have focused on automatically detecting plagiarism. For instance, Clough and Stevenson (2011) examined algorithms to detect plagiarized texts (short passages) from Wikipedia articles. They developed new predictors of plagiarism, including containment measures, which calculated the normalized incidence of trigrams (i.e., three-word segments) shared between the source text and the short passages, and the longest sequence of words found in both the Wikipedia articles and the short passages (i.e., the longest common sequence). These features were able to accurately detect 80 percent of the plagiarized material. Other studies have examined how plagiarism detection can be improved by removing stop-words (e.g., pronouns, prepositions, and articles) and punctuation, lemmatizing words to bring them to their base form, predicting synonym use, and replacing numbers (Ceska & Fox, 2009; Chong et al., 2010).
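
To illustrate, the containment measure can be computed in a few lines. The following is a minimal Python sketch, assuming simple lowercased whitespace tokenization rather than the preprocessing used in the studies above:

```python
from typing import List, Set, Tuple


def ngram_set(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams (trigrams by default) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source_text: str, passage: str, n: int = 3) -> float:
    """Proportion of passage n-grams that also occur in the source text."""
    source_ngrams = ngram_set(source_text.lower().split(), n)
    passage_ngrams = ngram_set(passage.lower().split(), n)
    if not passage_ngrams:
        return 0.0
    return len(passage_ngrams & source_ngrams) / len(passage_ngrams)
```

Higher containment values indicate that more of the passage's word sequences are shared with the source, which Clough and Stevenson (2011) treated as a signal of potential plagiarism.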

A few studies have examined semantic similarity between source texts and written responses using word embedding algorithms such as Latent Semantic Analysis (LSA), which can compute semantic similarity between words, sentences, paragraphs, and texts based on distributional properties of words in large corpora. Studies have shown that semantic similarity between sources and responses is indicative of the quality of text summarizations and paraphrases (Crossley, Kyle, Davenport, & McNamara, 2016; Crossley et al., 2019).

Current study

The current study builds on and extends previous research on synthesis writing by testing the predictive ability of NLP features to assess source integration. We extend previous NLP work on plagiarism and word embedding algorithms to assess the quality of source-based, argumentative essays in terms of holistic and source use/inferencing scores. We also introduce NLP features that represent aspects of source integration, including features related to source citation and quoting. The research question that guides this study is: To what degree can NLP features of source integration predict synthesis writing quality and source use/inferencing quality?

Methods

Corpus

Our final corpus comprised 909 source-based essays collected from prior studies examining source-based writing. The purpose of these experiments varied, from examining individual differences that contribute to source-based writing to interventions intended to improve the processing and integration of source information. Collectively, the essays were written by a diverse group of participants, including adults crowd-sourced from Mechanical Turk (n = 163), military recruits from the United States Navy (n = 177), and undergraduate college students from across the United States (n = 569). The Mechanical Turk participants were paid for their writing samples while the military members and the undergraduate students received course credit for participating. Demographic data were collected for all but 11 of the participants. The mean age of participants was 23.28 (SD = 8.73), with a minimum age of 17 and a maximum age of 74. The majority of the participants identified as White (n = 569), with the remaining participants identifying as Hispanic (n = 111), Asian (n = 86), Black (n = 83), Other (n = 46), Native American (n = 2), or Middle Eastern (n = 1). In terms of gender, 401 of the participants were female and 497 were male. Two participants reported not graduating high school and 245 reported having a high school diploma or equivalent (but nothing else). The majority of the participants had some college experience (n = 550), with 61 reporting an Associate’s degree. Forty-three participants reported obtaining a Bachelor’s degree and one reported a Master’s degree. Additionally, 25 participants reported being non-native speakers of English. In light of van Weijen et al.’s (2019) findings that source integration patterns were linked more strongly to individual differences than to language differences, we retained these participants, which in turn allowed us to keep the database intact.

The essays were written in response to one of four argumentative prompts related to cell phone use and cancer, global warming, green living, and the locavore movement. For each of these prompts, participants were given a set of 4–7 source documents that provided information about the topic. They were asked to read the sources and write a source-based essay; they were explicitly instructed to elaborate on the information in the texts rather than summarize it. They were also asked to use information from the texts to support their ideas, but to put ideas in their own words. Across the studies, participants were given around 40 min to read the sources and write an essay. Within the corpus, each participant wrote a single essay (i.e., there were 909 essays written by 909 individual participants). The average number of words written per essay was 355.92 (SD = 159.22), with a minimum of 16 words and a maximum of 1,313 words. Information for the essays in the corpus by prompt is provided in Table 1. Prompts and assignments are provided in "Appendix A".

Table 1 Descriptive statistics for corpus

Scoring rubric

We developed a scoring rubric that included four analytic scores and one holistic score (see "Appendix B"). The analytic scores included argumentation, source use and inferencing, language sophistication, and organization. However, because this analysis was concerned solely with the quality of source use, we did not analyze the argumentation, language sophistication, or organization scores. Instead, we focus only on the source use and inferencing scores and the holistic scores.

For source use/inferencing, a good essay referred explicitly and accurately to the majority of sources, synthesized information across the sources, and went beyond simple paraphrasing of the sources. Thus, the source use and inferencing score refers to both source citation and source integration. The source use and inferencing scores were rated on a 1–4 scale (with 1 being low and 4 being high). Holistic essay score was based on a 1–6 scale with 1 labeled very poor and 6 labeled excellent.

Human raters

Human ratings for the analytic and holistic scores were provided by two expert raters. The raters were both white, female PhD students in an English composition program housed at a Predominantly Black Institution in the Southeastern United States. Both raters had 3+ years of experience teaching writing at the same university as well as 3+ years of rating experience with standardized rubrics. The raters were paid $20 an hour for their work and were not included as co-authors on this paper. The raters were first trained on the rubric using source-based essays that were not part of the corpus used in this study. Raters scored each essay for the analytic features first and then assigned a holistic score to the essay. When raters reached an acceptable level of reliability (Kappa > 0.70), they independently scored the source-based essays in the corpus. After initial scoring, Kappa scores ranged from 0.64 to 0.65 (see Table 2 for IRR scores). Raters then adjudicated all scores that differed by more than one point. For these scores, raters discussed the essays and made adjustments as needed. The two raters’ scores were then averaged to create the final scores used in this study. After adjudication, all Kappa scores were greater than 0.70. Table 3 shows the mean and standard deviation for source use/inferencing and holistic scores for all essays and by prompt. Source use/inferencing and holistic scores were correlated at r = 0.734.

Table 2 Inter-rater reliability scores
Table 3 Human quality ratings overall and by prompt: Mean (SD)

Linguistic features

We used previously developed NLP measures of source integration along with bespoke NLP features to examine source integration and source citation in the source-based essays. We focused specifically on NLP feature sets that examine lexical and semantic overlap between the essay and the source texts, instances of plagiarism, source citation, and quoting. We discuss each of the NLP feature sets below. In total, we explored the use of 50 features. Descriptions of these features are provided in "Appendix C".

Essay and source overlap

We used the Tool for the Automatic Analysis of Cohesion (TAACO, Crossley et al., 2019) to examine keyword and semantic overlap between the essay and the source texts. In terms of keyword overlap, TAACO measures the degree to which key words and n-grams (i.e., single words and multi-word units) from the source texts are located in the target text. Key words and n-grams are identified by their relative frequency in the source texts compared to the frequency of the same items in the magazine and news sections of the Corpus of Contemporary American English (COCA, Davies, 2008). Once key words and n-grams are identified, TAACO calculates the proportion of these items in the essays written for each specific prompt. TAACO computes keyness indices for single words, bi-grams (two-word phrases), tri-grams (three-word phrases), and quad-grams (four-word phrases). TAACO also examines part of speech tags in n-grams for nouns, adjectives, and verbs wherein part of speech (POS) tags are allowed to substitute for words.
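
TAACO's internal keyness computation is not reproduced here, but the general idea can be sketched in Python; the reference frequencies (e.g., from the COCA news/magazine sections) are assumed to be supplied by the caller, and the ranking heuristic is illustrative rather than TAACO's exact formula:

```python
from collections import Counter
from typing import Dict, List


def key_items(source_tokens: List[str],
              reference_freq: Dict[str, float],
              top_k: int = 50) -> List[str]:
    """Rank source words by relative frequency against a reference corpus
    (frequencies supplied by the caller) and keep the top_k as prompt key words."""
    src_counts = Counter(source_tokens)
    total = sum(src_counts.values())
    keyness = {w: (c / total) / reference_freq.get(w, 1e-6)
               for w, c in src_counts.items()}
    return sorted(keyness, key=keyness.get, reverse=True)[:top_k]


def keyword_proportion(essay_tokens: List[str], keywords: List[str]) -> float:
    """Proportion of essay tokens that are prompt key words."""
    keyword_set = set(keywords)
    return sum(t in keyword_set for t in essay_tokens) / max(len(essay_tokens), 1)
```

The same logic extends to bi-grams, tri-grams, and quad-grams by counting n-gram tuples instead of single tokens.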

For semantic overlap, TAACO relies on LSA (Landauer et al., 1998), Latent Dirichlet Allocation (LDA, Blei et al., 2003), and Word2vec (Mikolov, Chen, Corrado, & Dean, 2013). Semantic spaces for each of these models were developed using the newspaper and magazine sections of COCA (Davies, 2008). Overlap between essay and source texts for each of these models was then computed to examine semantic similarity.
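
As a rough illustration of the semantic overlap computation (not the COCA-trained LSA/LDA/word2vec spaces TAACO uses), an LSA-style similarity between an essay and its source set can be sketched with scikit-learn; the function and variable names here are hypothetical:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def lsa_overlap(essay: str, sources: list, n_components: int = 100) -> float:
    """Cosine similarity between an essay and the concatenated source texts in a
    small LSA space fit on the documents themselves (a stand-in for a semantic
    space trained on a large reference corpus such as COCA)."""
    docs = [essay, " ".join(sources)] + list(sources)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    n_comp = max(1, min(n_components, tfidf.shape[1] - 1, len(docs) - 1))
    vectors = TruncatedSVD(n_components=n_comp).fit_transform(tfidf)
    return float(cosine_similarity(vectors[0:1], vectors[1:2])[0, 0])
```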

Plagiarism

We used the freely available Python package Plagiarism Detection (based on Chong, Specia, & Mitkov, 2010 and available at https://github.com/AashitaK/Plagiarism-Detection/blob/master/notebook.ipynb) to assess the longest common sequences of words shared between the source texts and the essay. The Plagiarism Detection package also calculates a containment score based on the intersection of the tri-grams in the source texts and the tri-grams in the essay, normalized by the number of tri-grams in the essay. It should be noted that the Plagiarism Detection package does not differentiate between quoted and unquoted text.
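
For readers who want to see what the longest common sequence computation involves, the following sketch uses standard dynamic programming over word sequences; it is an illustration of the measure, not the package's code, and like the package it does not distinguish quoted from unquoted text:

```python
def longest_common_subsequence(source_text: str, essay: str) -> int:
    """Length (in words) of the longest common subsequence shared by the
    source text and the essay, via standard dynamic programming."""
    a = source_text.lower().split()
    b = essay.lower().split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a):
        for j, word_b in enumerate(b):
            if word_a == word_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]
```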

Source citation

We developed automated counts for citation use. Citations were defined as instances where the letters or numbers of the associated source texts, and/or the title, the author, the publisher of the texts, or the organization affiliated with the author, were mentioned in the writing. In this sample, references to the source texts generally comprised the word “Source” and a single letter from “A–G”, such as “Source A”, or the parenthesized version, such as “(Source A)”. Title, author, publisher, and organizational information were culled from the source texts. We calculated simple citation metrics including citation counts, frequency of citation, percentage of the most commonly cited source, source text coverage rate, and percentage of paragraphs with citations. We also calculated locational information for citations, including the mean and standard deviation of citation location in essays (both raw and normalized) by character location, by sentence location, and by paragraph location. These variables allow us to calculate the depth of citations within an essay (e.g., are citations found more near the beginning, middle, or end of an essay?).
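
A simplified version of the citation detection can be sketched with a regular expression for the “Source A”-style references described above; matching on titles, authors, and publishers would require prompt-specific lists, which are omitted here, and the metric names are illustrative:

```python
import re

# Matches explicit references such as "Source A" or "(Source A)".
SOURCE_REF = re.compile(r"\(?\bsource\s+([A-G])\b\)?", re.IGNORECASE)


def citation_metrics(essay: str, n_sources: int) -> dict:
    """Simple citation counts, coverage, and normalized location (depth) metrics."""
    matches = list(SOURCE_REF.finditer(essay))
    cited_sources = {m.group(1).upper() for m in matches}
    # Normalized character positions, used to estimate citation depth in the essay.
    positions = [m.start() / max(len(essay), 1) for m in matches]
    return {
        "citation_count": len(matches),
        "source_coverage": len(cited_sources) / max(n_sources, 1),
        "mean_citation_depth": sum(positions) / len(positions) if positions else 0.0,
    }
```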

Quoting

We developed automated counts for quotation use. Quotations were defined as any strings that were located between quotation marks in the essays. We calculated the number of quoted words in the text, the ratio of quoted words in the text, the number of quotations in the essay from the source text, and the percentage of quotations in the essay from the source texts.
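
The quotation counts can be sketched similarly; the regular expression below handles straight double quotes only (curly quotes in real essays would need additional handling), and the metric names are illustrative:

```python
import re

QUOTED = re.compile(r'"([^"]+)"')  # curly quotes would need a second pattern


def quotation_metrics(essay: str, sources: list) -> dict:
    """Counts of quoted material and the share of quotations traceable to the sources."""
    quotes = QUOTED.findall(essay)
    quoted_words = sum(len(q.split()) for q in quotes)
    source_text = " ".join(sources).lower()
    from_source = sum(q.lower().strip() in source_text for q in quotes)
    return {
        "n_quotations": len(quotes),
        "quoted_word_ratio": quoted_words / max(len(essay.split()), 1),
        "quotations_from_source": from_source,
        "pct_quotations_from_source": from_source / len(quotes) if quotes else 0.0,
    }
```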

Statistical analysis

The aim of this study was to investigate whether NLP features related to source integration and citation were predictive of essay scores (source use/inferencing and holistic scores). We included text length as a baseline comparison because multiple studies have shown strong links between essay length and writing quality (Ferris, 1994; Frase, Faletti, Ginther, & Grant, 1999; Guo et al., 2013). We also included prompt as a categorical factor to examine the potential for prompt-based effects (Crossley, Varner, & McNamara, 2013; Hinkel, 2002; Huot, 1990; Tedick, 1990). All statistical analyses were conducted in R 3.6.1.

Bivariate Pearson correlations were run using the cor.test() function (R Core Team, 2017) between our dependent variables (human ratings of either source use/inferencing or holistic writing quality), text length, and the NLP features related to source integration and source citation. When NLP variables are highly collinear (i.e., potentially measuring the same feature), interpreting variable importance can be difficult. Thus, prior to developing our models, we calculated correlations among the human ratings and the NLP variables (including text length). If two or more variables correlated at r > 0.699, the NLP variable(s) with the lower correlation with the human scores were removed and the variable with the higher correlation was retained. We also only retained variables that demonstrated at least a small relationship with the response variable (r > 0.099).
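
The pruning rule can be expressed compactly. The sketch below is a Python (pandas) analogue of the procedure described above, not the R code used in the study; the thresholds are those reported in the text:

```python
from itertools import combinations

import pandas as pd


def prune_features(X: pd.DataFrame, y: pd.Series,
                   collinear_r: float = 0.699, min_r: float = 0.099) -> list:
    """For any feature pair correlated above collinear_r, drop the feature less
    correlated with the human score; then keep only features showing at least a
    small correlation with the score."""
    target_r = X.corrwith(y).abs()
    pair_r = X.corr().abs()
    dropped = set()
    for a, b in combinations(X.columns, 2):
        if a in dropped or b in dropped:
            continue
        if pair_r.loc[a, b] > collinear_r:
            dropped.add(a if target_r[a] < target_r[b] else b)
    return [c for c in X.columns if c not in dropped and target_r[c] > min_r]
```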

We used the caret package (Kuhn, 2016) in R (R Core Team, 2017) and the lm() function to develop linear models. Model training and evaluation were performed using stepwise tenfold cross-validation. For the stepwise process, we used the leapSeq method in leaps (Kuhn et al., 2016). In the tenfold cross-validation procedure, the entire corpus was randomly divided into ten roughly equivalent sets; nine of these sets were used as the training set and one set was left out as the test set. The model from the training set was then applied to the left-out test set. This happened ten times such that each set was used as the test set once. Estimates of accuracy are reported using average summary statistics across the ten test sets, including root mean squared error (RMSE) and mean absolute error (MAE) between the observed and modeled human scores, and the amount of variance explained by the developed model (R²). It should be noted that because the source use/inferencing scores were scaled 1–4 and the holistic scores were scaled 1–6, direct comparisons between the models in terms of RMSE and MAE are not possible. F and t values for each included variable were reported using lm(), and the relative importance of the indices in each model was calculated using the calc.relimp() function in the relaimpo package (Grömping, 2009). Post-hoc analyses were conducted on the regression predictions for each model to see how they correlated with age differences in the populations and group differences (Mechanical Turk workers, Navy recruits, and undergraduate students).
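
For readers working in Python rather than R, a rough analogue of the cross-validated evaluation (without the stepwise selection performed by leapSeq) can be written with scikit-learn; the function name here is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def evaluate_model(X: np.ndarray, y: np.ndarray) -> dict:
    """Tenfold cross-validated linear regression reporting average RMSE, MAE,
    and R^2 across the held-out folds."""
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_validate(
        LinearRegression(), X, y, cv=cv,
        scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"),
    )
    return {
        "RMSE": -scores["test_neg_root_mean_squared_error"].mean(),
        "MAE": -scores["test_neg_mean_absolute_error"].mean(),
        "R2": scores["test_r2"].mean(),
    }
```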

Results

Source use and inferencing score

After controlling for multicollinearity among variables, 22 variables remained, of which 14 showed at least a weak relationship with source use and inferencing scores. Of the 8 variables that did not report at least a small effect size, 7 were related to key words taken from the source texts while the last variable was related to the number of times the most commonly cited source text was cited. The variables that demonstrated at least a small effect size were related to text length, source coverage, citation occurrence, semantic overlap between the essay and the sources, variance in citations, quotation use, and plagiarism (see Table 4 for descriptive statistics for each variable and Table 5 for the correlation matrix for all variables and the source use/inferencing score).

Table 4 Descriptive statistics for selected variables
Table 5 Correlations between holistic score, source use/inferencing score, and selected variables

When these variables were entered into a tenfold cross-validated linear model, the number of variables that performed best in explaining source use and inferencing scores was 7, including number of words, the Locavore prompt, and variables related to source integration and source citations. The linear model reported RMSE = 0.558, MAE = 0.446, r = 0.687, R² = 0.472, F(9, 899) = 93.91, p < 0.001 (see model parameters summarized in Table 6). This model indicates that around 47% of the variance in the human scores for source use/inferencing can be explained by seven linguistic and experimental variables. The relative importance metrics indicate that the strongest predictor was number of words (explaining ~ 29% of the variance), followed by source coverage, depth of citations, semantic overlap between essay and source texts, variance in citation location, and prompt, all of which were positive predictors. There was a single negative predictor, longest common subsequence (i.e., plagiarism), which reported the lowest importance. VIF values indicated no concerns with multicollinearity. Visual and statistical examinations of the residuals indicated that they were normally distributed.

Table 6 Linear model to predict source use/inferencing score

We conducted post-hoc analyses of the predicted source use/inferencing scores from the regression model to assess performance based on age and group (Mechanical Turk workers, Navy recruits, and undergraduate students). The correlation between age and predicted source use/inferencing score was nonsignificant (r = 0.014). A one-way ANOVA between groups reported statistical differences (F(2, 906) = 23.39, p < 0.001). Tukey multiple comparisons indicated significant differences between the predicted source use/inferencing scores for Navy recruits and those for both Mechanical Turk workers and undergraduate students, with means indicating lower predicted source use/inferencing scores for Navy recruits (see Fig. 1 for a boxplot of the data).

Fig. 1 Boxplots for source use/inferencing scores by group

Holistic score

After controlling for multicollinearity among variables, 23 variables remained, of which 15 showed at least a weak relationship with holistic scores. Of the 8 variables that did not report at least a small effect size, 7 were related to key words taken from the source texts while the last variable was related to the number of times the most commonly cited source text was cited. The variables that demonstrated at least a small effect size were related to text length, depth of citation location, semantic overlap with sources, plagiarism, quoting, and source coverage (see Table 4 for descriptive statistics for each variable and Table 5 for a correlation matrix for all variables and the holistic score). These were largely the same variables that were used in the source use and inferencing analysis.

When these 15 variables were entered into a tenfold cross-validated linear model, the number of variables that performed best in explaining holistic scores was 8, including number of words, two prompts, and variables related to source integration and source citations. The linear model reported RMSE = 0.715, MAE = 0.560, r = 0.720, R² = 0.518, F(8, 900) = 121.00, p < 0.001 (see model parameters summarized in Table 7). This model indicates that around 52% of the variance in the human scores for holistic essay quality can be explained by eight linguistic and experimental variables. The relative importance metrics indicate that the strongest positive predictor was number of words (explaining ~ 44% of the variance), followed by depth of citations, semantic overlap between essay and source texts, and source coverage. There was one negative predictor, longest common subsequence (i.e., plagiarism), which explained about 12% of the variance. VIF values indicated no concerns with multicollinearity. Visual and statistical examinations of the residuals indicated that they were normally distributed.

Table 7 Linear model to predict holistic score

We conducted post-hoc analyses of the predicted holistic scores from the regression model to assess performance based on non-textual features, including age and group (Mechanical Turk workers, Navy recruits, and undergraduate students). The correlation between age and predicted holistic score was nonsignificant (r = 0.006). We used a one-way ANOVA between groups to examine differences in means. The ANOVA reported statistical differences (F(2, 906) = 17.82, p < 0.001). Tukey multiple comparisons indicated significant differences between the predicted holistic scores for Navy recruits and those for both Mechanical Turk workers and undergraduate students, with means indicating lower predicted holistic scores for Navy recruits (see Fig. 2 for a boxplot of the data).

Fig. 2 Boxplots for holistic scores by group

Discussion

This study builds on and extends previous research on assessing text integration in synthesis writing. Here, we introduce automated techniques to assess the products of synthesis writing, including the integration of source material (Cumming, Lai, & Cho, 2016) and the use of citations, quotations, and plagiarism. The goal of this study was to examine how natural language processing (NLP) annotations of source integration and citations could be used to predict human ratings of source use/inferencing and holistic scores when controlling for essay length and prompt. Our NLP techniques focused on lexical and semantic similarity between sources and essays, plagiarism detection, source citation metrics, and quoting.

The model for source use and inferencing explained ~ 47% of the variance in scores. As expected, the strongest predictor was number of words. However, five variables related to source integration and citation use explained variance beyond text length. The strongest predictor among these was source coverage, indicating that referring to a greater number of sources resulted in a higher score. Two other variables related to citation practice (i.e., the standard deviation of citation location within paragraphs and the depth of citation location by character) were also significant predictors. These two variables indicated that stronger scores were related to producing citations later in the text (depth of citation) as well as spreading citations throughout the text (standard deviation). Thus, it is not just the incidence of citations that reflects good source use and inferencing; writers need to cite sources later in the text, cite a greater number of sources, and space citations adequately throughout the text.

A variable related to semantic overlap between the essay and the source text was also a strong predictor indicating that higher scored essays contained more semantic similarity with the source texts. In addition, more skilled writers are more likely to avoid copying long strings of words from the source text (i.e., plagiarism). Copying words from the source text is associated with lower source use and inferencing quality scores. The predictive features in this model align nicely with the expectations of the rubric, which indicated that greater source use and inferencing was the result of referring to a majority of sources, synthesizing information within and across sources, and providing interpretations of sources that were not simple iterations/paraphrases. As such, the results provide reliability metrics for these ratings, and in turn substantiate the claim that a number of NLP features can be used to provide proxies for certain aspects of source-based essays.

Our model of holistic score was also strong, predicting ~ 52% of the variance in the human ratings. As expected, the strongest predictor was text length, but four variables related to citation practices and source integration were also significant predictors beyond text length. These predictors were the same as those found in the source use and inferencing model, which indicates their importance in explaining writing quality in general (taking into consideration the relatively high correlation between holistic scores and source use/inferencing scores). The top predictor among them was depth of citation location (by character), indicating that higher holistic scores were related to citations being reported deeper in the essay. The next strongest predictor was a measure of plagiarism (i.e., longest common subsequence), followed by semantic overlap with sources and source coverage (i.e., the percentage of sources cited). In total, the model indicates that higher-scoring source-based essays feature a greater number of citations that occur later in the text, fewer copied sequences from the source texts, and greater lexical and topical overlap with the sources, which were the same patterns reported in the source use and inferencing model.

We conducted post-hoc analyses of the regression predictions to better understand how the predictions differed across age levels and groups. The post-hoc analyses for age for both the source use/inferencing scores and the holistic scores showed no associations. This result indicates that age seems to have no relationship with the NLP operationalizations of source integration and citation practices and their links to human scores of writing quality. However, we did find differences between groups, such that Navy recruits’ integration and citation practices, and their effects on human scores of writing quality, differed from those of college students and adult participants. Specifically, we found that Navy recruits used fewer of the integration and citation practices associated with successful writing when compared to the other groups. These differences likely result from differing academic experiences and expectations between the Navy recruits and the college freshman and M-Turk populations. College freshmen are already enrolled in writing classes and have a strong motivation to successfully synthesize information into their writing as part of ongoing academic practice. M-Turk populations have also been shown to have higher education levels than the general population and are, thus, more likely to have experience with synthesis writing (Huff & Tingley, 2015).

We also noted differences between the source use/inferencing model and the holistic score model in terms of prediction accuracy. While both models included similar integration and citation features, the source use/inferencing model explained 47% of the variance while the holistic model explained 52% of the variance. At first glance, this may appear counterintuitive because the selected variables were generally associated with source integration, citation use, quotation practices, and plagiarism, and one would expect these variables to predict source use and inferencing scores more strongly than holistic scores of writing. However, it is important to note that we included number of words as a variable in our models and that number of words accounted for 44% of the variance in the holistic score model but only 29% in the source use and inferencing model. It is likely that the contribution of number of words was the main difference in the total amount of variance explained between the two models.

Essay prompt was also a significant predictor in both the source use/inferencing and holistic models. In both models, lower scores were assigned to essays written on the Locavore prompt. This may be the result of the Locavore prompt including seven different sources, the greatest number of sources in the data. For the holistic scores, the Global Warming prompt also showed significant differences, but it contained only four sources. Thus, it is unlikely that the number of sources per prompt influenced overall essay quality, but it may have influenced source use/inferencing scores, which asked raters to consider the number of sources referenced.

The findings provide support for NLP approaches not only to explain successful source integration, but also to provide automatic assessments of source integration that could be used to give feedback to writers in online writing systems, allowing students opportunities for deliberate writing practice. From a researcher perspective, automatically annotating texts for evidence of source integration could also help spur additional source-integration research. Previous studies that relied on hand coding of source integration features such as paraphrasing and summarizing, quoting, and citation use (Gebril & Plakans, 2009; Leijten et al., 2019; Petrić, 2012; Plakans, 2009; Uludag et al., 2019; Weigle & Parker, 2012) were resource intensive. The approaches used here can assess these features as well as additional synthesis features, including plagiarism and citation depth, freeing researchers from hand-coding and allowing them to examine larger corpora of essays from various student populations and academic disciplines. However, unlike hand annotations, our NLP approaches are unable to assess whether or not the text integration measured was successful. Our NLP approaches simply measure the existence of ideas and language from the sources (i.e., semantic overlap, topic overlap, quoting, plagiarism) and citation practices (depth of citation, range of citations, and variance in citations). The approaches cannot measure whether the language or ideas integrated into an essay are accurate or whether the citations used are meaningful.

The majority of the features investigated in this study were predictive of human scores of writing quality. For instance, after controlling for multicollinearity, 14 of the 22 remaining variables (i.e., 64%) showed meaningful correlations with source use and inferencing scores. Of the 8 variables that did not report meaningful correlations, all but one calculated the inclusion of key word and POS n-grams from the source texts. Similar results were reported for holistic scores, in which 15 of the 23 variables (i.e., 65%) remaining after controlling for multicollinearity were meaningfully correlated with human scores. As with the source use and inferencing score, 7 of the 8 variables that were not correlated measured the inclusion of key word and POS n-grams from the source texts. These findings indicate that the repetition of words and phrases from the sources is not strongly associated with essay quality (although summarizing ideas from the sources, as captured by the semantic overlap scores, was predictive). In terms of variables that were predictive of quality scores, it should be noted that quotations in general and specific quotations from the source texts were not found to be predictive in the developed models, although they did report positive, but weak, correlations with essay scores. Thus, while quoting text is associated with essay quality, it is likely not as important to modeling writing quality as other elements of source integration such as location of citations, number of citations, source similarity, and plagiarism.

The models derived from this study support previous research indicating that higher quality synthesis writing involves the accurate integration of source information (Cumming, Lai, & Cho, 2016). For instance, like the L2 writers in Plakans’ studies (e.g., Gebril & Plakans, 2009; Plakans, 2009; Plakans & Gebril, 2013), we find that L1 writing of higher quality has greater overlap with source texts, indicating more semantic accuracy. However, unlike a few previous studies, we found no links between the use of quotes or the length of quotes and writing quality (Petrić, 2012), although we did find that borrowing longer strings of words from the source (i.e., plagiarism) was associated with lower writing quality (in a similar fashion to Shi, 2004; van Weijen et al., 2019). In terms of citation use, we find, like previous studies (i.e., Borg, 2000; van Weijen et al., 2019), that higher proficiency writers include more citations than lower proficiency writers.

Conclusion

Our initial research question for this study was: To what degree can NLP features of source integration predict synthesis writing quality and source use/inferencing quality? We found that variables related to source overlap, citation practices, and plagiarism explain a significant amount of variance in writing quality beyond text length and prompt. In doing so, we introduced automatic approaches using NLP techniques to derive features from essays related to source integration. These include semantic and lexical overlap features between an essay and the source texts; citation practices, including number of citations, variance in citations, and position of citations in an essay; plagiarism features; and quoting features. A variety of these features were predictive of not only holistic essay scores but also source use and inferencing scores. The findings have important implications for source-based writing analyses, annotating source integration in texts, and source-based writing pedagogy.

We see this study as an initial investigation into the potential for automated NLP features of source integration to better explain source integration practices and to link these practices with essay quality scores. While there are limitations to this study, including sample size, participants, topics, and length of essays (some essays were shorter than 50 words), the study does provide strong evidence for the effectiveness of NLP features in examining source integration practices. However, there are many areas for development, especially considering that our models only explained ~ 47% to ~ 52% of the variance in human ratings. We envision a number of additional NLP metrics to increase our coverage of source integration in essay writing. For example, metrics for variation in source use should be developed (i.e., what percentage of the overlap between the sources and the text comes from each source text and how much variation there is in that overlap). Additionally, future studies need to assess the degree to which the NLP techniques used here are influenced by text length (i.e., whether the results are reliable with shorter texts). Beyond text length, it is not clear that the results would generalize to different populations (for instance, middle or high school students), nor is it clear whether the results would hold across different writing prompts and sources. Thus, future studies that build on this work need to extend not only the NLP techniques used here but also the variety of texts, prompts, sources, and participants to ensure the results are generalizable.

Beyond NLP, there are a number of other techniques that could be used to identify other aspects of text integration. Eye-tracking could be used to examine time spent on reading sources, the number of sources read, and how often writers switch between sources. If combined with NLP approaches, eye-tracking techniques could tell us much about how reading patterns interact with actual source integration. Lastly, keystroke logging could complement NLP analyses and eye-tracking studies by examining writing bursts, which may be related to source integration, pauses, which may be related to reading of source texts, and determining copy-paste behaviors on the part of writers. Triangulating NLP, eye-tracking, and keystroke logging techniques may explain additional variance in human scores of source use and inferencing, and overall quality. A multidimensional account of source-based writing is needed to fully understand the complex interactions between comprehension and writing processes and how to scaffold students during the various stages of understanding multiple sources to help them convey their understanding of source texts within their essays.