
1 Introduction

Text mining techniques based on advanced Natural Language Processing (NLP) and Machine Learning algorithms, together with ever-growing computing power, enable the design and implementation of new systems that automatically deliver summative and formative assessments to learners using multiple sets of data (e.g., textual materials, behavior tracks, meta-cognitive explanations). New automatic evaluation processes allow teachers and learners to obtain immediate information on the learning or understanding processes. Furthermore, computer-based systems can be integrated into pedagogical scenarios, providing activity flows that foster learning. ReaderBench is a fully functional framework based on text mining technologies [1].

It may be seen as a cohesion-based integrated approach that addresses multiple dimensions of learner comprehension, including the identification of reading strategies, textual complexity assessment and even Computer Supported Collaborative Learning (CSCL). In the latter context, special emphasis is placed on participant involvement and collaboration [2]. However, this facility is not introduced in this chapter for the sake of readability.

In addition to a fully functional NLP processing pipeline [3], in terms of Educational Data Mining, ReaderBench encompasses a wide variety of techniques: Latent Semantic Analysis (LSA) [4], Latent Dirichlet Allocation (LDA) [5] and specific internal processes addressing topic extraction, extractive summarization, identification of reading strategies, as well as textual complexity assessment, all derived from a cohesion-based underlying discourse structure. In this context, ReaderBench provides teachers and learners with information on their reading/writing activities: initial textual complexity assessment, assignment of texts to learners, capture of self-explanations reflected in pupils' textual verbalizations and reading strategies assessment [2].

The remainder of this chapter is organized as follows. The next section introduces a general perspective in terms of educational applications. The third section provides an overview of how learner comprehension can be modeled and predicted, while introducing an educational scenario that makes use of a wide variety of educational activities covered by ReaderBench. The fourth section is centered on cohesion, the core text feature from which almost all ReaderBench measures are computed. The next four sections present the main functionalities of our system: topic extraction, cohesion analysis, reading strategies analysis and textual complexity assessment. The ninth section focuses on the validation of the provided facilities, while the tenth section compares ReaderBench to other systems, highlighting the pros and cons of each approach.

2 Data and Text Mining for Educational Applications

Learning analytics aims at measuring, collecting, analyzing and "reporting data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs" (Society for Learning Analytics Research, http://www.solaresearch.org/). While the main focus of this approach is to gather data about learners (e.g., their behavior, opinions, affects, social interactions), very little research has been conducted to infer what learners actually understand, and learning contexts are rarely taken into account (e.g., which learning material is used, with what pedagogical intent, within which educational scenario) [6].

Educational data analyzed through computer-based approaches typically comes from two broad categories: knowledge (e.g., textual material from course information) and behavior (e.g., learners' behavior in Learning Management Systems from log analysis). Whereas a substantial amount of research is centered on behavioral data [7], relatively little research encompasses the analysis of the textual materials presented beforehand to learners. Raw data is ideally and easily computable with data mining techniques [8], but the inferences necessary to uncover learners' cognitive processes are far more complex and involve comparisons to human experts' judgments.

Our approach stems from a very broad idea. Cohesion, seen as the relatedness between different parts of texts, is a major determinant of text coherence and has been shown to be an important predictor of reading comprehension [9]. In turn, cohesion analyses can be applied to a wide range of data analyses in educational contexts: text readability and difficulty, knowledge relatedness, chat or forum group replies. The next section addresses learner comprehension, its relationships with textual complexity, and how the comprehension level can be inferred from learners' self-explanations.

2.1 Predicting Learner Comprehension

Learners' comprehension of textual materials during reading depends both on text properties and on the learner's reading skills. It has long been recognized that comprehension performance differs according to lexical and syntactic complexity, as well as to the thematic content and to how information is structured [10, 11]. Of particular importance are the cohesion and coherence properties of texts, which can support or impair [12] understanding and, moreover, interact with the reader's personal characteristics [11, 13]. On the reader's side, his/her background knowledge and the reading strategies (s)he is able to use to process the provided information are also strong predictors of reading comprehension, in addition to personal word recognition abilities and semantic skills [14–16].

Therefore, our aim consists of designing an integrated system capable of supporting a wide range of educational activities, enabling three kinds of work loops (see Fig. 13.1) in which teachers/learners can be freely involved, thus triggering self-regulated learning [17]. It is worth noting that these three loops do not generate behavioral data per se that would, in turn, be analyzed by the automatic system.

Fig. 13.1 General educational scenario envisioned for ReaderBench

The first loop addresses reading: learners read some material (e.g., course text, narrative) and can, at any moment, get information about its textual organization. The second one is a gist selection loop, which is somewhat more interactive than the previous one. Learners produce keywords or select the main sentences of the read texts and submit their selection to either the teacher or the automatic evaluation system, which provides feedback. The third is a writing loop that gives learners the opportunity to develop at length what they understood from the text (e.g., summaries) or the way they understood it (strategies self-explanation). Besides these three loops, the teacher can use automatic tools to select appropriate textual materials according to learners' level. Each of the main activities presented in Fig. 13.1, whether tutor or learner centered, is presented from a computational point of view in subsequent sections. The remainder of this section elaborates on the two main factors of text understanding: textual features (through textual complexity) and readers' abilities (through the identification of reading strategies).

3 Textual Complexity Assessment for Comprehension Prediction

Teachers usually need valid and reliable measures of textual complexity for their day-to-day instruction in order to select materials appropriate to learners' reading level. This proves to be a challenging and cumbersome activity, since different types of texts (narrative, argumentative or expository) place different demands on different reading skills [18, 19]. For example, McNamara and her colleagues [19] found that narrative texts contain more familiar words than scientific texts, but also more syntactically complex sentences. Narratives were also found to be less cohesive than expository science texts, the latter more strongly requiring background knowledge. In conclusion, different skills are involved in comprehending different types of texts, and the same reader can be more or less able to comprehend a text corresponding to his/her reading and/or grade level.

Two approaches usually compete for the automated assessment of text complexity: (1) using simple statistical measures that mostly rely on word difficulty (from ready-made scales) and sentence length; (2) using a combination of multiple factors ranging from lexical indicators, such as word frequency, to syntactic and semantic levels (e.g., textual cohesion) [20].

As an in-depth perspective, text cohesion, seen as the relatedness between different parts of texts, is a major determinant for building a coherent representation of discourse and has been shown to be an important predictor of reading comprehension [9]. Understanding cohesive relations (e.g., referential, causal or temporal) is central to the process of building textual coherence at the local level, which, in turn, allows the textual content to be reorganized into its macrostructure and situation model at the global level. Highly cohesive texts are more beneficial to low-knowledge readers than to high-knowledge ones [21]. Hence, textual cohesion is a feature of textual complexity (through some semantic characteristics of the read text) that might interfere with reading strategies (through the inferences made by a reader). Moreover, inference skills and the ability to plan and organize information have been shown to be strongly tied to comprehension performance on more complex texts [18]. These findings lead us to consider cohesion one of the core determinants of textual complexity.

3.1 The Impact of Reading Strategies Extracted from Self-Explanations for Comprehension Assessment

Moving from textual complexity to the assessment of readers' comprehension is not straightforward. Constructing textual coherence requires readers to go beyond what is explicitly expressed. To achieve this, readers make use of cognitive procedures and processes, referred to as reading strategies when those procedures are elicited through self-explanations [22]. Research on reading comprehension has shown that expert readers are strategic readers: they monitor their reading and are able to know at every moment their level of understanding. When faced with a difficulty, learners can call upon regulation procedures, also called reading strategies [23].

Reading strategies have been studied extensively with adolescent and adult readers using the think-aloud procedure, which asks the reader to self-explain at specific breakpoints while reading, therefore providing insight into the comprehension mechanisms they call upon to interpret the information they are reading. In other words, reading strategies are defined here as "the mental processes that are implicated during reading, as the reader attempts to make sense of the printed words" [24, p. 40].

Four types of reading strategies are mainly used by expert readers [25]. Paraphrasing allows the reader to express what she understood from the explicit content of the text and can be considered the first and essential step in the process of coherence building. Text-based inferences, for example causal and bridging strategies, build explicit relationships between two or more pieces of information in texts. On the other hand, knowledge-based inferences build relationships between the information in the text and the reader's own knowledge and are essential to the process of building the situation model. Control strategies refer to the actual monitoring process, in which the reader explicitly expresses what she has or has not understood. The diversity and richness of the strategies a reader carries out depend on many factors, either personal (proficiency, level of knowledge, motivation) or external (textual complexity).

We performed an experiment [26] to extend the assessment of reading strategies to children from 3rd to 5th grade (8–11 years old). Children read aloud two stories and were asked at predefined moments to self-explain their impressions and thoughts about the reading material. An adapted annotation methodology was devised starting from McNamara's [25] coding scheme, covering the following strategy items: paraphrases, textual inferences, knowledge inferences, self-evaluations and "other". The "other" category is very close to the "irrelevant" category [25], as it aggregates irrelevant as well as unintelligible statements. Two dominant strategies were identified: paraphrases and text-based inferences; the frequency of text-based inferences increases from grade 3 to 5, while the frequency of erroneous paraphrases decreases; knowledge-based inferences remain rare, but their frequency doubled from grade 3 to 5, rising from 4 to 8 % of the identified reading strategies within the appropriate verbalizations.

Three results are noteworthy. Firstly, self-explanations are a useful tool to access the reading strategies of young children (8–11 years old), who already possess all the strategies older children carry out. Secondly, we found a relation between the ability to paraphrase and to use text-based inferences, on the one hand, and comprehension and the extraction of internal text coherence traits, on the other. Better comprehension in this age range is tied to fewer erroneous paraphrases and more text-based inferences (R² = 0.18 for paraphrases and R² = 0.16 for text-based inferences). Thirdly, mediation models [27] showed that verbal ability partially mediates the effect of text-based inferences and that age moderates this mediating effect: the effect of text-based inferences on reading comprehension is mediated by verbal ability for younger students, while it becomes a direct effect for older students. Starting from the previous experiments and literature findings, one of the goals of ReaderBench is to enable the use of new texts with little or no human intervention, providing both textual complexity assessments of these texts and a fully automatic identification of reading strategies as a support for teachers. The textual complexity assessment aims at calibrating texts before providing them to learners.

4 Cohesion-Based Discourse Analysis: Building the Cohesion Graph

Text cohesion, viewed as the lexical, grammatical and semantic relationships that link together textual units, is defined within our implemented model in terms of: (1) the inverse normalized distance between textual elements, expressed as the number of analysis elements in-between; (2) lexical proximity, easily identifiable through identical lemmas and semantic distances within ontologies [28]; and (3) semantic similarity measured through LSA [4] and LDA [5].
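As a minimal illustration, the Python sketch below combines the three components into a single score; the equal weighting and the exact normalization are our own assumptions for readability, not the precise ReaderBench formula.

```python
# Toy combination of the three cohesion components described above; the
# equal weights are an illustrative assumption.

def cohesion(dist_in_units, lexical_overlap, lsa_sim, lda_sim):
    """All inputs except dist_in_units are assumed to lie in [0, 1].

    dist_in_units    -- number of analysis elements between the two units
    lexical_overlap  -- share of identical lemmas / ontology-close words
    lsa_sim, lda_sim -- semantic similarities given by the trained models
    """
    proximity = 1.0 / (1.0 + dist_in_units)   # inverse normalized distance
    semantic = (lsa_sim + lda_sim) / 2.0      # combined semantic similarity
    return (proximity + lexical_overlap + semantic) / 3.0

print(cohesion(dist_in_units=1, lexical_overlap=0.4, lsa_sim=0.6, lda_sim=0.5))
```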

Additionally, specific natural language processing techniques [3] are applied to reduce noise and improve the system's accuracy: spell-checking (optional) [29, 30], tokenizing, splitting, part-of-speech tagging [31, 32], parsing [33, 34], stop-word elimination, dictionary-only word selection, stemming [35], lemmatizing [36], named entity recognition [37] and co-reference resolution [38, 39].

In order to provide a multi-lingual analysis platform with support for both English and French, ReaderBench integrates both WordNet [40] and a serialized version of Wordnet Libre du Français (WOLF) [41]. Due to the intrinsic limitations of WOLF, in which concepts are translated from English while their corresponding glosses are only partially translated, resulting in a mixture of French and English definitions, only three frequently used semantic distances were applicable to both ontologies: path length, Wu–Palmer [42] and Leacock–Chodorow's normalized path length [43].
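For illustration, all three distances are exposed for English WordNet by NLTK; the snippet below is a minimal sketch that assumes the WordNet data has been downloaded (a WOLF-backed equivalent for French would require a separate loader).

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

cat, dog = wn.synset('cat.n.01'), wn.synset('dog.n.01')

print(cat.path_similarity(dog))  # inverse of the shortest path length
print(cat.wup_similarity(dog))   # Wu-Palmer: depth of the least common subsumer
print(cat.lch_similarity(dog))   # Leacock-Chodorow: normalized path length
```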

Afterwards, LSA and LDA semantic models were trained using three specific corpora: TextEnfants [44] (approx. 4.2 M words) and Le Monde (a French newspaper, approx. 24 M words) for French, and the Touchstone Applied Science Associates (TASA) corpus (approx. 13 M words) for English.

Moreover, improvements were made to the initial models: the reduction of inflected forms to their lemmas; the annotation of each word with its corresponding part of speech through an NLP processing pipeline (only for English, as it was unfeasible to apply this to the entire French training corpus due to the limitations of the Stanford Core NLP in parsing French) [45–47]; the normalization of occurrences through term frequency-inverse document frequency (Tf-Idf) [3]; and distributed computing for increased speed [48, 49].

LSA and LDA models extract semantic closeness relations from underlying word co-occurrences and are based on the bag-of-words hypothesis. Our experiments have shown that LSA and LDA models can be used to complement one another, in the sense that underlying semantic relationships are more likely to be identified if both approaches are combined after normalization.

Therefore, LSA semantic spaces are generated by projecting the matrices obtained from the reduced-rank Singular Value Decomposition of the initial term-document matrix, and can be used to determine the proximity of words through cosine similarity [4]. From a different viewpoint, LDA topic models provide an inference mechanism for underlying topic structures through a generative probabilistic process [5]. In this context, the similarity between concepts can be seen as the opposite of the Jensen–Shannon divergence [3] between their corresponding posterior topic distributions.
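Both similarities are straightforward to express; the sketch below uses random stand-ins for trained LSA vectors and LDA posterior topic distributions, and computes the Jensen–Shannon divergence with base-2 logarithms so that it stays within [0; 1].

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def jensen_shannon(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))  # assumes p, q > 0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

lsa_w1, lsa_w2 = np.random.rand(300), np.random.rand(300)  # toy LSA vectors
lda_w1 = np.random.dirichlet(np.ones(100))  # toy posterior topic distributions
lda_w2 = np.random.dirichlet(np.ones(100))

print(cosine(lsa_w1, lsa_w2))                # LSA similarity
print(1.0 - jensen_shannon(lda_w1, lda_w2))  # LDA similarity: 1 - divergence
```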

From a computational perspective, the LSA semantic spaces were trained using a Tagged LSA engine [45] that preprocesses all training corpora (stop-word elimination, Part of Speech (POS) tagging, lemmatization) [46, 47], applies Tf-Idf and uses a distributed architecture [48, 50] to perform the Singular Value Decomposition. With regard to LDA, the parallel topics model used iterative Gibbs sampling over the training corpora [49] with 10,000 iterations and 100 topics, as recommended by [5]. Overall, in order to better grasp the cohesion between textual fragments, we combined information retrieval techniques, mostly reflected in word repetitions and normalized numbers of occurrences, with semantic distances extracted from ontologies or from LSA- or LDA-based semantic models.

In order to better represent discourse in terms of underlying cohesive links, we introduced a cohesion graph [2, 51] (see Fig. 13.2), which can be seen as a generalization of the previously proposed utterance graph [52–54]. We build a multi-layered mixed graph consisting of three types of nodes [55]: (1) a central node, the document, which represents the entire reading material; (2) blocks, generic entities that can reflect paragraphs from the initial text; and (3) sentences, the main units of analysis, seen as collections of words and grammatical structures obtained after the initial NLP processing. The same decomposition applies to chat conversations or forum discussion threads, where blocks are instantiated by utterances or interventions.

Fig. 13.2 The cohesion graph as underlying discourse structure

In terms of edges, hierarchical links are enforced by inclusion functions (sentences within a block, blocks within a document), and two types of links are introduced between analysis items of the same level: mandatory and relevant links. Mandatory links are set between adjacent blocks or sentences and are used to best model the information flow throughout the discourse, thus making the identification of cohesion gaps possible. Adjacency links are enforced between the previous block and the first sentence of the next block and, symmetrically, between the last sentence of the current block and the next block. These links ensure cohesiveness between structures at several levels of the cohesion graph that are disjoint with regard to the inclusion function, and they augment the importance of the first/last sentence of the current block, in accordance with the assumption that topic sentences usually occur at the beginning/end of a paragraph and, in most cases, ensure a transition from the previous paragraph [56].

Optional relevant links are added to the cohesion graph to highlight fine-grained and subtle relations between distant analysis elements. In our experiments, using as a threshold the sum of the mean and standard deviation of all cohesion values within a higher-level analysis element provided significant additional links for the proposed discourse structure. As cohesion can be regarded as the sum of semantic links that hold a text together and give it meaning, the underlying cohesive structure influences the perceived complexity level. In other words, a lack of cohesion may increase the perceived textual complexity, as a proper understanding and representation of the text become more difficult to achieve. In order to better highlight this perspective, two measures of textual complexity were defined, to be assessed later: inner-block cohesion, the mean value of all the links within a block (adjacent and relevant links between sentences), and inter-block cohesion, which highlights semantic relationships at the global document level.
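A minimal sketch of the relevant-link selection rule follows, with the pairwise cohesion values held in a matrix as an assumption of convenience.

```python
import numpy as np

def relevant_links(cohesion_matrix):
    """cohesion_matrix[i][j] holds the cohesion between sentences i and j."""
    n = len(cohesion_matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    values = [cohesion_matrix[i][j] for i, j in pairs]
    # Keep an optional link only above mean + standard deviation.
    threshold = np.mean(values) + np.std(values)
    return [(i, j) for i, j in pairs if cohesion_matrix[i][j] > threshold]
```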

5 Topics Extraction

The identification of covered topics or keywords is of particular interest within our analysis model because it enables us to grasp an overview of a document and to observe emerging points of interest or shifts of focus. Tightly connected to the cohesion graph, topics can be extracted at different levels and from different constituent elements of the analysis (e.g., the entire document or conversation, a paragraph or all the interventions of a participant). The relevance of each concept mentioned in the discussion, represented by its lemma, is defined by combining a multitude of factors (a toy combination is sketched after the list):

  1. Individual normalized term frequency, 1 + log(no. of occurrences) [57]; in the end, we opted to eliminate inverse document frequency, as this factor is related to the training corpora, whereas we wanted to grasp the specificity of each analyzed text.

  2. Semantic similarities through the cohesion function (LSA cosine similarity and the inverse of the LDA Jensen–Shannon divergence) with the enclosing analysis element and with the whole document, ensuring global resemblance and significance.

  3. A weighted similarity with the corresponding semantic chain, multiplied by the importance of the chain. Semantic chains are obtained by merging lexical chains, determined from the disambiguation graph modeled through semantic distances from WordNet and WOLF [58], through LSA and LDA semantic similarities; each chain's importance is computed as its normalized length multiplied by the cohesion function between the chain, seen as an entity integrating all semantically related concepts, and the entire document.
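Since the text names the factors without giving the exact combination, the following toy sketch is only one plausible reading of how they could be merged; the multiplicative form and equal inner weights are assumptions.

```python
import math

def topic_relevance(no_occurrences, sim_to_element, sim_to_document,
                    sim_to_chain, chain_importance):
    tf = 1.0 + math.log(no_occurrences)                  # factor 1
    semantic = (sim_to_element + sim_to_document) / 2.0  # factor 2
    chain = sim_to_chain * chain_importance              # factor 3
    return tf * (semantic + chain)

print(topic_relevance(3, 0.5, 0.6, 0.4, 0.7))
```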

In addition, as an empirical improvement, and since the previous list of topics is already pre-categorized by corresponding parts of speech, selecting only nouns provided more accurate results in most cases, as nouns tend to better grasp the conceptualization of the document.

In terms of a document's visualization, the initial text is split into paragraphs, cohesion measures are displayed between adjacent blocks, and the list of topics, sorted by their corresponding relevance scores, is presented to the user, who can filter the displayed results by number and by part of speech.

As an example, Fig. 13.3 presents the user interface of ReaderBench for a French story, Miguel de la faim [59], highlighting the following elements: block scores (in square brackets after each paragraph), the demarcation in bold of the sentences considered most important by the summarization facility, and the identified document topics ordered by relevance. Although a block score can be elevated (e.g., hélas …), the presented values are a combination of individual sentence scores; therefore, the underlying sentences might not be selected in the summarization process. The scoring and summarization facilities are presented in the next sections.

Fig. 13.3 Reading material visualization

An appealing extension of topic identification is the visualization of the corresponding semantic space, which can also be enlarged with semantically similar concepts that are not mentioned within the discourse, referred to in our analysis as inferred concepts (see Fig. 13.4). Therefore, an inferred concept does not appear in the document or in the conversation, but is semantically related to it.

Fig. 13.4 Network of concepts visualization from and inferred from [62]

From a computational perspective, the list of additional inferred concepts identified by ReaderBench is obtained in two steps. The first stage consists of merging lists of similar concepts for each topic, determined through synonymy and hypernymy relations from WordNet/WOLF and through semantic similarity in terms of LSA and LDA, while considering the entire semantic spaces. Secondly, all the concepts from the merged list are evaluated based on the following criteria: semantic relatedness to the list of identified topics and to the analysis element, plus a shorter path to the ontology root, which emphasizes more general concepts.

The overall generated network of concepts, including both topics from the initial discourse and inferred concepts, takes into consideration the aggregated cohesion measure between concepts (LSA and LDA similarities above a predefined threshold) and, in the end, displays only the dominant connected subgraph of related concepts (outliers or unrelated concepts that do not satisfy the cohesion threshold specified within the user interface are disregarded). The visualization uses a Force Atlas layout from Gephi [60], and the dimension of each concept is proportional to its betweenness score [61] within the generated network.
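The construction of the displayed network can be sketched with networkx, as below; the rendering itself relies on Gephi's Force Atlas layout, and the threshold value and data structures here are illustrative.

```python
# Requires: pip install networkx
import networkx as nx

def concept_network(concepts, similarity, threshold=0.3):
    """similarity: {(concept_a, concept_b): aggregated cohesion measure}."""
    G = nx.Graph()
    G.add_nodes_from(concepts)
    G.add_edges_from((a, b) for (a, b), sim in similarity.items()
                     if sim >= threshold)
    # Keep only the dominant connected subgraph; outliers are disregarded.
    dominant = max(nx.connected_components(G), key=len)
    G = G.subgraph(dominant).copy()
    # Node dimensions proportional to betweenness scores.
    return G, nx.betweenness_centrality(G)

sims = {("bird", "nest"): 0.8, ("bird", "fly"): 0.6, ("nest", "tree"): 0.4,
        ("fly", "campaigner"): 0.1}
G, sizes = concept_network(["bird", "nest", "fly", "tree", "campaigner"], sims)
print(sorted(G.nodes()), sizes)
```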

Although the majority of displayed concepts make perfect sense and seem closely related to the initial text, in most cases there are also some dissonant words that appear off-topic at first glance. In the example in Fig. 13.4, campaigner might induce such an effect, but its occurrence in the list of inferred concepts is determined by its WordNet synonymy relationship with candidate, a concept encountered twice in the initial text fragment, with a final relevance of 2.14. Moreover, the concept has only 7 occurrences in the TASA training corpus used for LSA and LDA, increasing the chance of incorrect associations in the semantic models, as no clear co-occurrence pattern can emerge.

In this context, additional improvements must be made to the previous identification method in order to reduce the fluctuations of the generated inferred concepts, which are frequent when the initial topics list is quite limited or the initial text is rather small, and to diminish the number of irrelevant generated terms by enforcing additional filters. All the previously proposed mechanisms were fine-tuned after detailed analyses of different evaluation scenarios and different types of texts (stories, assigned reading materials and chat conversations), generating in the end an extensible and comprehensive method for extracting topics and inferred concepts.

6 Cohesion-Based Scoring Mechanism of the Analysis Elements

A central component in the evaluation of each sentence's importance is our bottom-up scoring method. Although tightly related to the cohesion graph [55], which is traversed from bottom to top and is used to augment the importance of analysis elements, the initial assessment of each element is based on its topic coverage and the corresponding relevance values, with regard to the entire document. Therefore, topics are used to reflect the local importance of each analysis element, whereas cohesive links are used to transpose the local impact onto other inter-linked elements.

In terms of the scoring model, each sentence is initially assigned an individual score equal to the normalized term frequency of each of its concepts, multiplied by the concept's relevance, which is assigned globally during the topic identification process presented in the previous section. In other words, we measure to what extent each sentence conveys the main concepts of the overall document or conversation, as an estimation of on-topic relevance. Afterwards, at block level (utterance or paragraph), individual sentence scores are weighted by cohesion measures and summed up in order to define the inner-block score. This process takes into consideration the sentences' individual scores, the hierarchical links reflected in the cohesion between each sentence and its corresponding block, and all inner-block cohesive links between sentences.

Going further into our discourse decomposition model (document > block > sentence), inter-block cohesive links are used to augment the previous inner-block scores, also considering all block-document similarities as weighting factors of block importance. Moreover, as there would otherwise be a discrepancy in the evaluation of the first and last sentence of each block, for which there are no previous or next adjacency links within the current block, their corresponding scores are increased through the cohesive links enforced to the previous and next blocks, respectively. This augmentation of individual sentence scores is later reflected in our bottom-up approach all the way up to the document level in order to maintain overall consistency, as each higher-level analysis element score should equal a weighted sum of its constituent element scores.

In the end, all block scores are combined at document level, using the cohesion of the block-document hierarchical links as weights, in order to determine the overall score of the reading material. In this manner, all links from the cohesion graph are used in an analogous manner to reflect the importance of each analysis element; in other words, from a computational perspective, hierarchical links are treated as weights that spread information into subsequent analysis elements, whereas adjacency or relevant links between elements of the same analysis level are used to augment their local importance through cohesion to all inter-linked sentences or blocks.
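A condensed sketch of this bottom-up propagation follows; the data structures and the plain weighted sums are illustrative simplifications of the mechanism described above.

```python
def sentence_score(sentence_terms, topic_relevance):
    """sentence_terms: {lemma: normalized term frequency}."""
    return sum(tf * topic_relevance.get(lemma, 0.0)
               for lemma, tf in sentence_terms.items())

def block_score(sentence_scores, sentence_block_cohesion):
    # Each sentence weighted by its cohesion to the enclosing block.
    return sum(s * w for s, w in zip(sentence_scores, sentence_block_cohesion))

def document_score(block_scores, block_document_cohesion):
    # Block-document hierarchical links act as weights.
    return sum(b * w for b, w in zip(block_scores, block_document_cohesion))
```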

In addition, starting from tutors' general observation that an extractive summarization facility, combined with the demarcation of the most important sentences, is useful for providing a quick overview of the reading material, we envisioned such a facility within ReaderBench. This functionality can be considered a generalization of the previous scoring mechanism built on top of the cohesion graph and is easily achieved by considering the sentence importance scores in descending order, as we enforce a deep discourse structure, topic coverage and cohesive links between analysis elements. Overall, the proposed unsupervised extraction method is similar to some extent to TextRank [63], which also uses an underlying graph structure based on the similarities between sentences. Nevertheless, our approach can be considered more elaborate from two perspectives: (1) instead of simple word co-occurrences, we use a generalized cohesion function, and (2) instead of computing the similarities between all pairs of sentences, which results in a highly connected graph inapplicable to large texts, we propose a multi-layered graph that resembles the core structure of the initial text in terms of blocks or paragraphs.

7 Identification Heuristics for Reading Strategies

Starting from the two previous studies and the five types of reading strategies used by [64], our aim was to integrate within ReaderBench automatic extraction methods designed to support tutors in identifying various strategies and to best fit the aligned annotation categories. The strategies automatically identified within ReaderBench comprise monitoring, causality, bridging, paraphrase and elaboration, due to two observed differences. Firstly, very few predictions were used, perhaps due to the age of the pupils compared to McNamara's subjects; secondly, ReaderBench distinguishes between causal inferences and bridging, although a causal inference can be considered a kind of bridging, as can reference resolution, due to their different computational complexities. Our objective was to define a fine-grained analysis in which the different valences generated by both the identification heuristics and the hand-coding rules were taken into consideration when defining the strategies taxonomy.

In addition, we have tested various methods of identifying reading strategies, but we focus solely on presenting the alternatives that ultimately provided the best overall human-machine correlations. In ascending order of complexity, the simplest strategies to identify are causality (e.g., parce que, pour, donc, alors, à cause de, puisque) and control (e.g., je me souviens, je crois, j'ai rien compris, ils racontent), for which cue phrases have been used. Additionally, as causality assumes text-based inferences, all occurrences of keywords at the beginning of a verbalization have been discarded, as such an occurrence can be considered a speech-initiating event (e.g., Donc) rather than the creation of an inferential link.
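A minimal sketch of these cue-phrase heuristics, using the example phrases quoted above; the naive substring matching is an illustrative simplification.

```python
CAUSALITY = ["parce que", "pour", "donc", "alors", "à cause de", "puisque"]
CONTROL = ["je me souviens", "je crois", "j'ai rien compris", "ils racontent"]

def count_strategies(verbalization):
    text = verbalization.lower()
    causality = sum(text.count(cue) for cue in CAUSALITY)
    # A causal cue opening the verbalization is a speech-initiating event.
    if any(text.startswith(cue) for cue in CAUSALITY):
        causality -= 1
    control = sum(text.count(cue) for cue in CONTROL)
    return {"causality": max(causality, 0), "control": control}

print(count_strategies("Donc il part, parce que sa mère est malade."))
```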

Afterwards, paraphrases, which in the manual annotation were identified by human raters as repetitions of the same semantic propositions, were automatically detected through lexical similarities. More specifically, words from the verbalization were considered paraphrases if they had identical lemmas or were synonyms (extracted from the lexicalized ontologies, WordNet/WOLF) with words from the initial text.

In addition, we experimented with identifying paraphrases as the overlap between segments of the dependency graph (combined with synonymy relations between homologous elements), but this was inapplicable to French, as there is no French support within the Stanford Log-linear Part-Of-Speech Tagger [31].

In the end, the strategies most difficult to identify are knowledge inference and bridging, for which semantic similarities have to be computed. An inferred concept is a non-paraphrased word for which the following three semantic distances are computed: the distance from the word w1 in the verbalization to the closest word w2 from the initial text (expressed in terms of semantic distances in ontologies, LSA and LDA), and the distances from both w1 and w2 to the textual fragments between consecutive self-explanations. The latter distances had to be taken into consideration to better weight the importance of each concept with regard to the entire text. In the end, in order to classify a word as inferred or not, a weighted sum of the previous three distances is computed and compared to a minimum imposed threshold, experimentally set at 0.4 to maximize the precision of the knowledge inference mechanism on the sample of verbalizations used.
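A sketch of this decision rule, assuming the three distances are already computed and equally weighted; the source specifies the 0.4 threshold but not the weights.

```python
def is_inferred(sim_w1_to_closest_w2, sim_w1_to_context, sim_w2_to_context,
                threshold=0.4):
    # Weighted sum of the three semantic distances (equal weights assumed).
    score = (sim_w1_to_closest_w2 + sim_w1_to_context + sim_w2_to_context) / 3.0
    return score >= threshold

print(is_inferred(0.5, 0.4, 0.45))  # True: mean 0.45 exceeds the threshold
```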

As bridging consists of creating connections between different textual segments from the initial text, cohesion was measured between the verbalization and each sentence from the referenced reading material. If more than 2 similarity measures were above the mean value and exceeded a minimum threshold, experimentally set at 0.3, bridging was estimated as the number of links between contiguous zones of cohesive sentences. Compared to the knowledge inference threshold, this value had to be lowered, as a verbalization had to be linked to multiple sentences, not necessarily cohesive with one another, in order to be considered bridging. Moreover, the consideration of contiguous zones was an adaptation with regard to the manual annotation, which considered two or more adjacent sentences, each cohesive with the verbalization, to be members of a single bridged entity.
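A simplified sketch of the contiguous-zone estimation (the above-the-mean condition is omitted for brevity):

```python
def bridging(similarities, threshold=0.3):
    """similarities: cohesion of the verbalization with each text sentence."""
    cohesive = [i for i, s in enumerate(similarities) if s >= threshold]
    if len(cohesive) < 2:
        return 0
    # Merge adjacent cohesive sentences into contiguous zones.
    zones = 1 + sum(1 for prev, cur in zip(cohesive, cohesive[1:])
                    if cur - prev > 1)
    # Bridging estimated as the number of links between the zones.
    return zones - 1

print(bridging([0.1, 0.45, 0.5, 0.05, 0.35]))  # two zones -> 1 bridging link
```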

Figure 13.5 depicts the cohesion measures with previous paragraphs from the story (in the last column) and the identified reading strategies for each verbalization marked in the grey areas, coded as follows: control, causality, paraphrasing [index of the referred word from the initial text], inferred concept [*] and bridging over the inter-linked cohesive sentences from the reading material. The grey sections represent the pupil's self-explanations, whereas the white blocks represent paragraphs from "Matilda" [65]. Causality, control and inferred concepts (which, by definition, are not present within the original text) are highlighted only in the verbalization, whereas paraphrases are coded in both the self-explanation and the initial text for clear traceability of lexical proximity or identity. Bridging, if present, is highlighted only in the original text, pinpointing the textual fragments linked together through cohesion in the pupil's meta-cognition.

Fig. 13.5 Visualization of automatically identified reading strategies

8 Multi-Dimensional Model for Assessing Textual Complexity

Assessing textual complexity can be considered a difficult task due to different reader perceptions, primarily caused by prior knowledge and experience, cognitive capability, motivation, interests or language familiarity (for non-native speakers). Nevertheless, from the tutor's perspective, the task of identifying accessible materials plays a crucial role in the learning process, since inappropriate texts, either too simple or too difficult, can cause learners to quickly lose interest.

In this context, we propose a multi-dimensional analysis of textual complexity, covering a multitude of factors that integrate classic readability formulas, surface metrics derived from automatic essay grading techniques, morphology and syntax factors [66], as well as new dimensions focused on semantics [55]. In the end, subsets of specific factors are aggregated through Support Vector Machines (SVM) [67], which have proven to be the most efficient for providing a categorical classification [68, 69]. To provide an overview, the textual complexity dimensions, with their corresponding performance scores, are presented in Table 13.1, whereas the following paragraphs focus solely on the semantic dimension of the analysis. In other words, besides the factors presented in detail in [66], which take a more shallow approach, of particular interest is how semantic factors correlate with classic readability measures [55].
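As an illustration of the aggregation step, the sklearn sketch below trains an SVM over a random stand-in feature matrix and estimates exact agreement through cross-validation; the feature count and kernel settings are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((1000, 12))         # 12 complexity factors per document (toy)
y = rng.integers(1, 7, size=1000)  # six textual complexity classes

clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)  # k-fold cross validation
print(scores.mean())                       # exact-agreement estimate
```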

Table 13.1 Textual complexity dimensions

Firstly, textual complexity is linked to cohesion in terms of comprehension; in other words, in order to understand a text, the reader must first create a well-connected representation of the information it contains, a situation model [70]. This connected representation is based on linking related pieces of textual information that occur throughout the text. Therefore, cohesion, reflected in the strength of the inner-block and inter-block links extracted from the cohesion graph, influences readability, as semantic similarities govern the understanding of a text. In this context, discourse cohesion is evaluated at a macroscopic level as the average value of all links in the constructed cohesion graph [2, 55].

Secondly, a variety of metrics based on the span and coverage of lexical chains [58] provide insight into lexical variety and cohesion, expressed in this context as the semantic distance between different chains. Moreover, we imposed a minimum threshold of 5 words per lexical chain for it to be considered relevant to the overall discourse; this value was determined experimentally by running simulations with increasing values and observing the correlation with predefined textual complexity levels.

Thirdly, entity-density features proved to influence readability, as the number of entities introduced within a text correlates with the working memory demands placed on the text's targeted readers. In general, entities consisting of general nouns and named entities (e.g., people's names, locations, organizations) introduce conceptual information by identifying, in most cases, the background or context of the text. More specifically, entities are defined as the union of named entities and general nouns (nouns and proper nouns) contained in a text, with overlapping general nouns removed.

These entities play an important role in text comprehension because established entities form basic components of the concepts and propositions on which higher-level discourse processing is based [71]. Therefore, the entity-density factors focus on the following statistics: the number of entities (unique or not) per document or sentence, the percentage of named entities per document, the percentage of overlapping nouns removed and the percentage of remaining nouns in total entities.

Finally, another dimension focuses on the ability to resolve referential relations correctly [38, 72], as co-reference inference features also impact comprehension difficulty (e.g., the overall number of chains, the inference distance or span between concepts in a text, and the number of active co-reference chains per word or per entity).

9 Results

Of particular interest are the thorough cognitive validations performed with ReaderBench, centered on comparisons with learners' performances. In terms of the presented functionalities, the validations covered: (1) the aggregated cohesion measure, by comparison to human evaluations of the cohesiveness between adjacent paragraphs; (2) the scoring mechanism, perceived as a summarization facility; (3) the identification of reading strategies, by comparison to the manual scoring scheme; and (4) the textual complexity model emphasizing morphology and semantics factors, compared to the surface metrics used within the Degree of Reading Power (DRP) score [19].

Firstly, to validate the aggregated cohesion measure, we used 10 stories in French for which sophomore students in educational sciences (French native speakers) were asked to evaluate the semantic relatedness between adjacent paragraphs on a [1; 5] Likert scale; each pair of paragraphs was assessed by more than 10 human evaluators to limit inter-rater disagreement. Due to the subjectivity of the task and the evaluators' different personal scales of perceived cohesion, the average values of the intra-class correlations (ICC) per story were: ICC-average measures = 0.493 and ICC-single measures = 0.167. In the end, 540 individual cohesion scores were aggregated and then used to determine the correlation between the different semantic measures and the gold standard. On the two training corpora used (Le Monde and TextEnfants), the correlations were: Combined-Le Monde (r = 0.54), LDA-Le Monde (r = 0.42), LSA-Le Monde (r = 0.28), LSA-TextEnfants (r = 0.19), Combined-TextEnfants (r = 0.06), Wu-Palmer (r = −0.06), Path Similarity (r = −0.13), LDA-TextEnfants (r = −0.13) and Leacock–Chodorow (r = −0.40).

All of these correlations are non-significant, but the inter-rater correlations lie in a similar range and are smaller than the Combined-Le Monde score. The previous results show that the proposed combined method of integrating multiple semantic similarity measures outperforms all individual metrics, that a larger corpus leads to better results, and that Wu–Palmer, besides being scaled to the [0; 1] interval (relevant when integrating measurements with LSA and LDA), behaves best among the ontology-based semantic distances.

Moreover, the significant increase in correlation between the aggregated measure of LSA, LDA and Wu–Palmer, in comparison to the individual scores, proves the benefits of combining multiple complementary approaches in terms of the reduction of errors that can be induced by using a single method.

Secondly, for the preliminary validation of the scoring mechanism and the proposed extractive summarization facility, we performed experiments on two narrative texts in French: Miguel de la faim [59] and La pharmacie des éléphants [73]. Our validation process used measurements initially gathered by Mandin [74], in which 330 high school students (9th to 12th grade) and 25 tutors were asked to manually highlight the 3–5 most important sentences of the two stories. The inter-rater agreement scores were rather low, with ICC values of 0.13 and 0.23, respectively, highlighting the subjectivity of the task at hand.

Afterwards, as suggested by [75], four equivalence classes were defined, taking the mean − standard deviation, the mean and the mean + standard deviation of each distribution as cut-off values. In this context, two measurements of agreement were used: exact agreement (EA), which reflects precision, and adjacent agreement (AA), which allows a difference of one between the class index automatically retrieved and the one assigned by the human raters.
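Both the class construction and the two agreement measures are easily expressed in code; the sketch below assumes the sentence scores and class labels are given as numeric arrays.

```python
import numpy as np

def to_classes(scores):
    m, s = np.mean(scores), np.std(scores)
    # Cut-offs at mean - std, mean and mean + std yield classes 1..4.
    return np.digitize(scores, [m - s, m, m + s]) + 1

def agreements(predicted, gold):
    predicted, gold = np.asarray(predicted), np.asarray(gold)
    ea = float(np.mean(predicted == gold))              # exact agreement
    aa = float(np.mean(np.abs(predicted - gold) <= 1))  # adjacent agreement
    return ea, aa

print(agreements(to_classes([0.2, 0.9, 1.4, 2.0]), [1, 2, 4, 4]))
```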

By using the equivalence classes, we notice major improvements in our evaluation (see Table 13.2), as both documents have the best agreements with the tutors, suggesting that our cohesion-based scoring process captures a deeper perspective of the discourse structure, reflected in each sentence's importance.

Table 13.2 Exact and adjacent agreement using equivalence classes

Moreover, our results became more cognitively relevant, as they are easier to interpret by both learners and tutors: instead of a raw positive value obtained from the scoring mechanism, each sentence has an assigned importance class (1 = less important; 4 = the most important). In addition, we obtained 3 or 4 sentences per document tagged with the 4th class, a result consistent with the initial annotation task of selecting the 3–5 most important sentences.

Therefore, based on these promising preliminary validation results, we can conclude that the proposed cohesion-based scoring mechanism is adequate and effective, as it integrates, through cohesive links, the local importance of each sentence, derived from topic coverage, into a global view of the discourse.

Thirdly, in the context of the validation experiments for the identification of reading strategies, pupils read aloud a 450-word story, "Matilda" [65], stopping at six predefined markers to explain what they had understood up to that moment. Their explanations were first recorded and transcribed, then annotated by two human experts (PhDs in linguistics and in psychology) and categorized according to the scoring scheme.

Disagreements were resolved by discussion after evaluating each self-explanation individually. In addition, automatic cleaning had to be performed in order to process the phonetically transcribed verbalizations.

Verbalizations from 12 pupils were transcribed and manually assessed as a preliminary validation. The results for the 72 verbalization extracts in terms of precision, recall and F1-score are as follows: causality (P = 0.57, R = 0.98, F = 0.72), control (P = 1, R = 0.71, F = 0.83), paraphrase (P = 0.79, R = 0.92, F = 0.85), bridging (P = 0.45, R = 0.58, F = 0.5) and inferred knowledge (P = 0.34, R = 0.43, F = 0.38). As expected, paraphrases, control and causality occurrences were much easier to identify than information coming from pupils’ experience [76].

Moreover, we identified multiple particular cases in which both approaches (human and automatic) each captured a partial truth whose assessment is ultimately subjective. For instance, many causal structures close to each other, but not adjacent, were manually coded as one, whereas the system considers each of them separately. As another example, "fille" ("daughter") does not appear in the text and is directly linked to the main character; it was therefore marked as an inferred concept by ReaderBench, while the evaluator considered it a synonym.

Additionally, when examining the manual assessments, discrepancies between evaluators were identified, due to different understandings and perceptions of the pupils' intentions expressed within their meta-cognitions. Nevertheless, our aim was to support tutors, and the results are encouraging (also considering the previous precision measurements and the substantial noise in the transcriptions), emphasizing the benefits of a regularized and deterministic identification process.

In the end, for training and validating our textual complexity model, we opted to automatically partition English texts extracted from TASA into six complexity classes of equal frequency, based on their DRP scores [19] (see Table 13.3). This validation scenario, comprising approximately 1,000 documents, was twofold: on the one hand, we wanted to prove that the complete model is adequate and reliable and, on the other, to demonstrate that high-level semantic features provide relevant insight that can be used for automatic classification.

Table 13.3 Ranges of the DRP scores defining the six textual complexity classes [after 19]

In the end, k-fold cross validation [77] was applied to extract the following performance measures (see Table 13.1): precision or exact agreement (EA), and adjacent agreement (AA) [68], reflecting how close the SVM came to predicting the correct classification.

Considering the granular factors, readability formulas, the average number of words per sentence, the average length of sentences/words and the balanced CAF [78], although simple in nature, provided the best alternatives at the lexical and syntactic levels; this was expected, as the DRP score is based solely on shallow evaluation factors.

From the perspective of word complexity factors, the average polysemy count and the average word syllable count correlated well with the DRP scores. In terms of part-of-speech tagging, nouns, prepositions and adjectives had the highest correlations of all parts of speech, whereas the depth and size of the parse tree also provided good insight into textual complexity.

In contrast, semantic factors taken individually had lower scores, because the evaluation process at this level is mostly based on cohesive or semantic links between analysis elements, and the variance between complexity classes is lower in these cases. Moreover, considering the evolution from the first complexity class to the last, these semantic features do not necessarily evolve in an ascending, linear fashion; this can fundamentally affect a precise prediction if the factor is considered individually.

Only two entity-density factors had better results, but their values are directly connected to the underlying part of speech (the noun), which had the best EA and AA of all morphology factors. Also, the most difficult classes to identify were the second and the third, because the differences between them were less noteworthy.

Therefore, ReaderBench enables tutors to assess the complexity of new reading materials based on the selected complexity factors and a pre-assessed corpus of texts pertaining to different complexity dimensions. Moreover, by comparing multiple loaded documents, tutors can better grasp each evaluation factor, refine the model to best suit their interests in terms of the targeted measurements, and perform new predictions using only their selected features (see Fig. 13.6).

Fig. 13.6 Document complexity evaluation

Two additional measurements were performed in the end. Firstly, an integration of all metrics from all textual complexity dimensions proved that the SVM results are compatible with the DRP scores (EA = 0.779 and AA = 0.997) and that they provide significant improvements, as they outperform the precision of any individual dimension.

The second measurement (EA = 0.597 and AA = 0.943) used only morphology and semantic measures in order to avoid a circular comparison between factors of similar complexity, as the DRP score is based on shallow factors. This result showed a link between low-level factors (also used in the DRP score) and in-depth analysis factors, which can likewise be used to accurately predict the complexity of a reading material.

10 A Comparison of ReaderBench with Previous Work

As an overview, in terms of individual learning, ReaderBench encompasses the functionalities of both Coh-Metrix [21] and iStart [79, 80], as it provides teachers and learners with information on their reading/writing activities: initial textual complexity assessment, assignment of texts to learners, capture of meta-cognitions reflected in one's textual verbalizations, and reading strategies assessment.

Nevertheless, ReaderBench covers a different educational purpose, as its validation was performed on primary school pupils, whereas iStart mainly targets high school and university students [81] (see Table 13.4 for a detailed comparison between ReaderBench and iStart).

Table 13.4 ReaderBench versus iStart [79, 80, 82]

With regard to Coh-Metrix [21], ReaderBench integrates different factors and measurements, and uses SVMs [67, 68] to increase the validity of textual complexity assessment [66] (see Table 13.5 for a detailed comparison).

Table 13.5 ReaderBench versus Coh-Metrix [21, 83]

Moreover, ReaderBench encompasses textual complexity measures similar to Dmesure [68, 84], but with an emphasis on more in-depth, semantic factors. In other words, the aim of ReaderBench is to provide a shift of perspective towards demonstrating that high-level factors can also be used to accurately predict the complexity of a document (see Table 13.6 for a detailed comparison).

Table 13.6 ReaderBench versus Dmesure [68, 84]

11 Conclusions

ReaderBench, a multi-lingual and multi-purpose system, supports learners and teachers in mining and analyzing textual materials and learners' productions, and in identifying reading strategies, enabling both an a priori and an a posteriori assessment of comprehension.

Our system computes a large range of measures that have been validated against human judgments. Moreover, ReaderBench infers data regarding the cognitive processes engaged in understanding and can be integrated into several pedagogical scenarios.

Recalling Fig. 13.1, our system supports all the proposed learning activities from both the learner- and tutor-centered perspectives. On the one hand, tutors can select learning materials using the multi-dimensional textual complexity model, compare learners' productions to the automatically extracted features (topics, most important sentences or the strength of the cohesive links between adjacent paragraphs) and evaluate self-explanations in terms of the identified reading strategies.

On the other hand, learners can take advantage of the document assessment facilities in order to better understand the structure, difficulty level and topics of the assigned material. Moreover, they can improve their own self-regulation processes through the system's feedback, especially on their self-explanations in terms of the reading strategies used.

In addition, the potential of ReaderBench's multi-lingual unified vision based on textual cohesion is confirmed by the thorough validations performed in both analysis languages, English and French. All the previous aspects make ReaderBench appropriate for integration in various educational settings.

Further research will investigate the use of ReaderBench in classrooms by teachers and learners in order to validate the pedagogical scenarios. Moreover, the large range of raw data generated by ReaderBench will be subject to analysis in other educational data mining platforms, such as UnderTracks [87].