Introduction

Educators and researchers interested in assessing reading comprehension have been in a bind. Comprehension occurs as one reads, but most available assessments rely on measuring comprehension after reading is completed. This has arisen because the vast majority of standardized comprehension assessment tools use a multiple-choice format where questions are answered after the text is read. Although there are advantages to this testing format (Freedle and Kostin 1994; Glover et al. 1979; Malak and Hegeman 1985; van den Bergh 1990), many of the instantiations of this approach do not adequately assess the products and processes specified by discourse theory to be critical for deep comprehension (Magliano and Millis 2003; Magliano et al. 2007). Part of the problem is that the multiple-choice testing format does not afford the assessment of these processes as they occur during reading because questions are answered only after the test passage is read. Perhaps more importantly, the items that comprise these tests are typically not constructed to assess comprehension products and processes specified in theories of discourse comprehension (Magliano et al. 2007; Snow 2002). For example, Magliano et al. (2007) showed that the Nelson-Denny test, a widely used reading test, primarily contained questions that verified word meanings rather than inferences required for comprehension.

The goal of the present study was to assess the viability of an assessment tool that (1) assesses comprehension online, that is, while students are reading a text, (2) is grounded in theory, and (3) assesses not only comprehension, but some of the processes that give rise to comprehension. We call this tool the Reading Strategy Assessment Tool (RSAT; Gilliam et al. 2007). RSAT requires the student to read texts on a computer and answer open-ended questions that are embedded within them. After reading pre-selected target sentences, readers are asked to produce responses to one of two types of open-ended questions: indirect and direct. Indirect questions were intended to tap comprehension processes and required readers to report thoughts regarding their understanding of the sentence in the context of the passage (the instructions included “What are your thoughts regarding your understanding of the sentence in the context of the passage?”). Direct questions were designed to assess comprehension level and required readers to answer specific “wh-” questions about the text at target sentences (e.g., “Why was the Union demoralized?” in a passage about the American Civil War). Based on the answers to the indirect and direct questions, RSAT provides measures of information processing activities and overall comprehension, respectively. These will be described below.

We should be clear at the outset that RSAT provides no direct assessment of metacognition, although there is evidence linking metacomprehension and calibration skills to comprehension (Hacker et al. 2009; Pressley and Afflerbach 1995). One could develop automated procedures for detecting statements that involve an assessment of one’s own comprehension (e.g., “I don’t get this”, “I get this”), but we doubt that this would be sufficient for capturing the nuances of metacognition. Metacomprehension is complicated; readers with a high degree of metacomprehension are able to dynamically adjust their reading strategies to the task, the demands of the text, and an assessment of their comprehension (e.g., Pressley and Afflerbach 1995; McNamara and Magliano 2009a). Clearly, RSAT would need much more intelligent algorithms than the present ones to correctly identify metacognitive thoughts that relate to comprehension.

We should also be clear at the outset that this approach does not measure many of the strategies that readers use when making meaning of text and that educational researchers refer to as ‘reading strategies’ (see McNamara 2007 for an extensive review). Active readers will reread, question the text or the author, generate personal examples, look forward and backward through the text, take notes, create images, etc. (e.g., Pressley and Afflerbach 1995; Pressley et al. 1985). Moreover, skilled comprehenders change their reading strategies and behaviors in light of reading goals (Pressley and Afflerbach 1995; Taraban et al. 2000). It would be very ambitious to develop a computer-based assessment tool that directly tests the use of all of these strategies during reading, although there are tools that assess the self-reported use of many of them.

In the current study, we were interested in whether a computer-based system could detect a small number of reading processes that are known to contribute to comprehension. As will be described below, we focused on paraphrases, bridges, and elaborations, which together we refer to as information processing activities. Although bridges and elaborations are typically referred to as inferences (e.g., Singer 1988), in some cases, they can be under the strategic control of readers (Magliano et al. 1999). Nonetheless, if our efforts are successful, then this research would suggest that it would be worthwhile to pursue the development of a system that detects other important comprehension processes, as well.

The evidence-based approach for RSAT

The development of RSAT followed the evidence-based approach towards assessment development (Mislevy 1993; Pellegrino and Chudowsky 2003; Pellegrino et al. 2001). This framework specifies that test developers consider three primary components during test development: a model of the student, a model of the task, and principles for interpreting data provided by the task. The student model describes the types of mental processes and representations that the test purports to measure, which ideally are based on theories of student proficiency. The model of the task describes how the tasks or problems that the student will encounter in the assessment tool implicate the processes described in the student model. Principles for data interpretation refer to how performance on the task relates to the student model.

The student model: comprehension

Comprehension arises from a series of cognitive processes and activities that contribute to a reader’s ability to connect the meaning of multiple sentences into a coherent mental representation of the overall meaning of text (e.g., Graesser et al. 1994; Kintsch 1988, 1998). The vast majority of theories of discourse comprehension delineate between two classes of information processing activities that support comprehension, namely bridging and elaborative inferences (McNamara and Magliano 2009b).

Bridging inferences provide conceptual links between explicitly mentioned ideas in the text. Bridging inferences include anaphoric and pronominal inferences as well as conceptual and causal-based inferences that require the application of world knowledge. RSAT was designed to detect the evidence for bridging inferences because there is ample evidence indicating that bridging inferences are required for coherence and are routinely generated during comprehension (e.g., Singer and Halldorson 1996) and that skilled and less skilled readers can be differentiated by the extent that they generate these inferences (e.g., Magliano and Millis 2003).

Elaborative inferences are based upon the reader’s world knowledge of the concepts and events described by the text. Unlike bridges, elaborations do not provide connections between explicit units of text (e.g., sentences). They are presumably generated because the semantic context of the current sentence resonates with semantic knowledge of the world and thus becomes available for computations in working memory (e.g., Myers and O’Brien 1998). They embellish the text content with information from the reader’s prior knowledge. This knowledge may be rooted in world knowledge (e.g., schematic knowledge), text- or topic-specific knowledge (e.g., knowledge of cancer or biology), or episodic knowledge (e.g., personal experiences). Moreover, some elaborations can be construed as entailments of the text, whereas others may require reasoning beyond the text (Wolfe and Goldman 2005). Unlike bridging inferences, there is some controversy regarding the frequency and utility of elaborative inferences. For example, some research has shown that elaborative inferences are generated only when there is a strong semantic association between world knowledge and the text (McKoon and Ratcliff 1986, 1992) or that only a small class of these are routinely generated during reading, such as those that support explanations (e.g., Graesser et al. 1994). On the other hand, others have argued that elaborative inferences help readers establish how their knowledge relates to the discourse context and are particularly important when understanding expository discourse (McNamara 2004). With respect to this latter perspective, only those elaborations that are germane to the text and/or learning task are likely to support comprehension (e.g., Wolfe and Goldman 2005).

The task: answering questions

Indirect questions and strategies

As mentioned earlier, readers answer two types of questions in RSAT. The indirect question—“What are you thinking now?”—was used to elicit responses similar to those produced when thinking aloud. In a think-aloud methodology, readers are asked to report thoughts that come to mind as they read a text. Answers to the indirect questions should reveal the content of working memory while comprehending the target sentences (Ericsson and Simon 1993), which has been shown to be predictive of comprehension (Chi et al. 1989; Coté and Goldman 1999; Magliano and Millis 2003; Millis et al. 2006; Trabasso and Magliano 1996b). Traditionally, these thoughts are produced orally, transcribed, and then coded in order to make inferences regarding the strategies, information, and mental processes that occur during reading (e.g., Coté and Goldman 1999; Magliano 1999; Magliano et al. 1999; Pressley and Afflerbach 1995; Trabasso and Magliano 1996a, b). In the studies reported here, when this question appeared, participants were instructed to report thoughts regarding their understanding of the sentence in the context of the text (Magliano et al. 1999; Magliano and Millis 2003; Trabasso and Magliano 1996a, b).

It is important to consider some strengths and weaknesses of using verbal protocols for assessing comprehension. A strength of verbal protocols is that they provide a window onto the products of comprehension and some of the processes that give rise to comprehension (e.g., Pressley and Afflerbach 1995; Trabasso and Magliano 1996a, b). However, distinguishing between the two can be complicated, and perhaps impossible with an automated scoring system. We are fairly certain that verbal protocols do reveal the emerging products of comprehension, such as inferences, because the thoughts corresponding to the inference would be included in the verbal protocol. But in many cases, the processes that led to the inference being generated might not be apparent from the protocol. For example, a reader might make a prediction about a future event, but whether that inference arose from personal experiences, the prior text, or particular schemata might be unanswerable. This points to a weakness of this approach, namely that all a researcher can do is analyze the content present in a protocol and infer the processes that gave rise to it (Ericsson and Simon 1993). Furthermore, a reader may not be able to write down or say everything that comes to mind, and may edit or omit thoughts that do come to mind. It is important to validate verbal protocols against other independent behavioral measures, such as reading times or outcome measures of comprehension (Magliano and Graesser 1991). If the protocol contents are correlated with these independent measures, then that provides some validation that they are indeed indicative of comprehension. RSAT reflects an application of this general approach for validating verbal protocols.

RSAT was designed to detect three activities that have been revealed in the context of thinking aloud: paraphrasing, bridging, and elaborative inferences (Magliano and Millis 2003; Millis et al. 2006). Paraphrasing reflects an activity in which a reader is rephrasing all or part of the current sentence. Consider the example verbal protocols presented in Table 1 produced at Sentence 11 of “How Cancer Develops” (see Appendix A for the entire text). Clause 1 for Participant 1 is a paraphrase because the participant mentioned the event that was explicitly described in the current sentence. Paraphrasing provides readers the opportunity to describe the current discourse content with familiar language (McNamara 2004). However, college students who primarily paraphrase as opposed to using other strategies while thinking aloud may do so because they adopt a low standard of comprehension, although this is uncertain at the current time (Magliano and Millis 2003; Millis et al. 2006).

Table 1 Example indirect protocols for the sentence “A message within each receptor cell becomes activated,” from the text “How cancer develops”

When readers produce bridges, they describe how the current discourse content is related to the prior discourse content. For example, clause 1 for Participant 2 reflects a local bridge because that participant mentioned the event described in the immediately prior sentence (see the text in Appendix A). Conversely, clauses 1 and 2 for Participant 3 reflect distal bridges because they mentioned information contained in sentences 7 and 9, respectively. Sentences 7 and 9 would not be expected to be in working memory as sentence 11 is read. Elaborations are inferences that are based on the reader’s world knowledge and that do not link text units. Clauses 1 and 2 for Participant 4 are considered elaborations because the reader activated relevant world knowledge not presented in the text. In this case, the reader inferred that the text information could be potentially useful for developing a cure for cancer, a topic not discussed in the text.

There is a growing body of evidence that think aloud protocols reveal qualitative differences among skilled and less-skilled readers (Chi et al. 1989; Coté and Goldman 1999; Magliano and Millis 2003; Millis et al. 2006; Pressley and Afflerbach 1995; Trabasso and Magliano 1996b; Whitney et al. 1991). For example, Magliano and Millis conducted a series of studies in which they had students produce verbal protocols while reading simple narratives (Magliano and Millis 2003) and more challenging scientific texts (Millis et al. 2006). They determined reading comprehension proficiency from scores on the Nelson-Denny test of reading comprehension. Based on the verbal protocols, they found that skilled readers tended to bridge more than less skilled readers. More importantly, these studies showed that the extent to which readers engaged in bridging was correlated with the comprehension of both texts in which the protocols were produced and those that were read silently. This finding suggests that the verbal protocols revealed strategies that are related to comprehension and are used somewhat consistently across reading encounters.

Direct questions and comprehension

The direct questions used in RSAT were designed to provide an assessment of comprehension at a particular point in the passage. The majority of the direct questions were why-questions based on the explicit content of the passage. For example, one direct question was “Why does a tumor develop?” and occurred immediately after the sentence “A tumor then develops” in a passage that described the development of cancer. Why-questions expose the lion’s share of inferences that occur during the comprehension of actions and events (Graesser and Clark 1985; Graesser et al. 1987; Long et al. 1992; Magliano 1999; Millis and Graesser 1994). In particular, why-questions require readers to activate causal and explanatory information that provides a critical basis for deep comprehension (Graesser et al. 1996; Graesser and Clark 1985; Graesser and Franklin 1990; Graesser et al. 1994; Millis and Barker 1996). Therefore, why-questions should be ideal for assessing comprehension.

In RSAT, readers type answers to both direct and indirect questions on a keyboard. We are mindful that some researchers have expressed concern that readers adopt different strategies as a function of how the verbal protocols are produced (oral vs. written; Hausmann and Chi 2002). However, Muñoz et al. (2006) found negligible differences in reading inferences and strategies as a function of modality for both simple narratives and more difficult scientific texts. More specifically, they had participants produce verbal protocols both orally and by typing (in the context of a within-participants design) while reading science and narrative texts. They did not find differences in the magnitudes of the processes targeted in RSAT as a function of modality for the science texts. For the narrative texts, participants produced slightly (albeit significantly) more paraphrases and bridges when producing the protocols orally than when typing.

Data interpretation: word counts

RSAT relies on a computer-based assessment of the answers readers give to the questions. The presence of information processing activities is estimated by counting the words in each answer to the indirect question that match words appearing in different sentences up to the point at which the answer is given. Word counts are obtained for content words from the current sentence, the local sentence (the immediately prior sentence), and distal sentences (two or more sentences back) in the prior discourse context. In addition, content words in each answer that do not appear in the text up to that point are also counted (new words). Words from the current sentence are indicative of paraphrasing, words from the local sentence are indicative of local bridging, words from the distal sentences are indicative of distal bridging, and new words are indicative of elaborations (Magliano et al. 2002; Millis et al. 2007). Answers to the direct questions are assessed by counting the number of content words contained in an “ideal” answer. For each reader, the word counts in each category and question type are averaged.
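To make this procedure concrete, the following minimal sketch (in Python) illustrates the word-count logic under simplifying assumptions: words are matched literally after lower-casing against a small illustrative stop list, and a word that matches more than one source is credited to the most recent source (the tie-breaking rule detailed in the Methods section). The production scoring additionally uses Soundex matching (see Methods) and, for each reader, averages the resulting counts over target sentences.

```python
# Minimal sketch of the RSAT word-count scoring described above (not the
# production implementation): literal matching of lower-cased content words,
# with ties between sources resolved in favor of the more recent source.

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
              "is", "are", "was", "were", "it", "that", "this"}  # illustrative only

def content_words(text):
    """Lower-case, strip punctuation, and drop function words."""
    tokens = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return [w for w in tokens if w not in STOP_WORDS]

def score_indirect_answer(answer, current_sentence, prior_sentences):
    """Return word counts indicative of paraphrasing, local bridging,
    distal bridging, and elaboration for one answer to an indirect question."""
    current = set(content_words(current_sentence))
    local = set(content_words(prior_sentences[-1])) if prior_sentences else set()
    distal = set()
    for sentence in prior_sentences[:-1]:
        distal.update(content_words(sentence))

    counts = {"paraphrase": 0, "local_bridge": 0, "distal_bridge": 0, "new_words": 0}
    for word in content_words(answer):
        if word in current:
            counts["paraphrase"] += 1       # word from the current (target) sentence
        elif word in local:
            counts["local_bridge"] += 1     # word from the immediately prior sentence
        elif word in distal:
            counts["distal_bridge"] += 1    # word from two or more sentences back
        else:
            counts["new_words"] += 1        # not in the text so far: possible elaboration
    return counts
```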

We are making the assumption that the sources of the words used in answers to indirect questions reflect the amount of paraphrasing, local and distal bridging, and elaboration done by a reader (Magliano and Millis 2003; Magliano et al. 2002; Millis et al. 2004, 2006, 2007). As mentioned above, we acknowledge that there is inherent ambiguity in what processing activities are indicated when words are classified according to whether or not they appeared in the text. Given that the prior discourse activates only a small portion of a reader’s world knowledge, when a person uses a word from the prior discourse, it is likely that he or she is referring to the prior discourse. Of course, there are no guarantees. A more significant problem arises in the scoring procedure when a person uses a word not found in the prior text. Because the word counting algorithms are unable to detect when a student uses a synonym of a word in the text, synonyms are classified as new words even though they may actually reflect a bridge or paraphrase. Therefore, our measure of elaboration contains words that reflect true elaborations and possibly other processing activities. The extent to which this synonym problem poses interpretational limitations is an empirical issue, and we will be mindful of this fact when we interpret the findings.

Overview of the study

The goal of this study is to provide evidence that RSAT measures comprehension and the information processing activities of paraphrasing, elaboration, and bridging. Consequently, we sought evidence that these measures have construct, convergent, and discriminant validity. Construct validity is established when an instrument truly measures what it intends to measure. Convergent validity helps to establish construct validity when a measure correlates with other measures thought to assess the same construct of interest (i.e., comprehension). Discriminant (divergent) validity occurs when instruments that purport to measure different constructs do not correlate.

To this end, three comprehension assessments were administered in this study. Participants were administered RSAT, the Gates-MacGinitie (GM; Level 10/12, Form T) test of reading comprehension, which is a paper-based, multiple-choice standardized test, and an experimenter-generated test that required participants to answer open-ended questions about expository passages. We also gained access to ACT composite scores as a measure of academic achievement. The ACT (American College Testing) is a high-stakes, standardized test for high school students in the United States that is used by many colleges and universities when making decisions regarding admissions. RSAT’s ability to account for overall comprehension (convergent validity) was assessed by correlating RSAT’s measure derived from the direct questions with the GM, the experimenter-generated test, and the ACT scores. In addition, the convergent validity of RSAT’s measures of the information processing strategies was assessed by correlating the scores derived from the answers to the indirect questions with ratings from blind expert judges who rated the answers on the same reading strategies. We were also able to assess the discriminant validity of the RSAT processing measures. Specifically, the measure for one process (e.g., paraphrasing) should be more highly correlated with human judgments of that process (e.g., paraphrasing) than with human judgments of another process (e.g., elaboration).

It is important to understand the nature of the GM and ACT tests. The comprehension section of the GM requires students to read short text segments and answer 3–5 questions for each segment. The texts are available while the participants answer the questions. No information is provided by the test publishers regarding the underlying student model of comprehension that the GM is designed to test. However, Magliano et al. (2007) conducted an analysis of the items in the GM test and classified them as requiring (1) extraction from a local segment (one or two sentences), (2) extraction from a global segment (e.g., a paragraph), or (3) inference. These categories roughly correspond to the extraction and integration items used in PISA (OECD 2002). Magliano et al. found that 56%, 13%, and 31% of the items could be classified as local extraction, global extraction, and inference items, respectively. As one can see, the test primarily contains local extraction items.

The ACT has subtests for reading, English, science, and mathematics. The reading subtest adopts the traditional multiple-choice format, but no information is provided regarding the types of items on this test. Although it would have been optimal to use scores on this subtest, we had to rely on the composite score, which is the average of the English, mathematics, reading, science, and (optional) writing subtest scores. (We did not have access to scores on the reading subtest.) The composite score can be conceptualized as a measure of general academic achievement at the end of a student’s junior year in high school, to which reading comprehension ability would certainly contribute.

Methods

Participants

One hundred and ninety undergraduates participated for course credit associated with an introductory psychology course taught at Northern Illinois University. Sixty-one, 58, and 71 participants received RSAT lists A, B, and C, respectively. One hundred and fifty-six participants were able to provide their ACT scores; ACT scores were not available for 34 students. The mean composite ACT score was 21.92 (median = 22, SD = 3.28). The minimum score was 12 and the maximum score was 30 (36 is the highest possible score).

Procedure

There were two phases to the study. In phase 1, participants took the GM test of reading comprehension intended for eleventh and twelfth graders. The test is standardized and took 35 min to administer. Participants were also administered a short-answer (SA) test of comprehension created by the experimenters. The SA test required the student to read two texts and then answer open-ended questions about them. Some of the questions measured the textbase, which is the propositional representation of the explicit ideas in the text, and others measured the situation model, a representation which includes inferences based on world knowledge (van Dijk and Kintsch 1983). Answers to the textbase questions could be found in one sentence or two adjacent sentences, whereas answers to the situation model questions required the reader’s world knowledge or the integration of several sentences across the span of the text. One text was a historical narrative and the other was a science text. These texts were approximately a single page in length (single spaced) and had Flesch-Kincaid grade levels of 10.5 and 12 for the science and historical texts, respectively. Ten short-answer questions were constructed for each text.

Two days after phase 1, participants completed phase 2 of the study. In this session, participants took RSAT, administered on personal computers in a web-based environment (Gilliam et al. 2007). The texts were presented in black font on a gray field, left justified near the top of the computer screen. The title of each text remained centered at the top of the screen while participants read the entire text. In the current study, only one sentence of a text was shown on the screen during reading because this presentation has been shown to be a good predictor of comprehension skill (Gilliam et al. 2007). Participants navigated forward through the text by clicking a “next” button, which was located near the bottom left portion of the computer screen. “NEW PARAGRAPH” markers appeared when there was a shift to a new paragraph. After participants clicked the “next” button, the next sentence appeared, provided it was a non-target sentence. The text sentences were not present on the screen when there was a question prompt, nor could the participants navigate back and reread the texts in response to the questions. For target sentences, a response box appeared to the right of the “next” button with a prompt above the box. The prompt for an indirect question was “What are you thinking now?” For direct questions, the target sentence was removed from the screen when the question and response box appeared. Participants typed their answers to the question in the response box. They clicked the next button when they were finished, after which the response box disappeared and the next sentence was presented. Responses were recorded on a computer server. The texts were presented to each participant in a random order.

Materials

Three stimulus lists of passages were used in RSAT. Each stimulus set contained six texts: two science texts, two history texts, and two narratives, for a total of 18 texts. The texts and the type of question at the target sentences were empirically determined (Magliano et al. 2010). Magliano et al. had participants read and answer either direct or indirect questions in the context of the RSAT tool at pre-selected target sentences (sentences which immediately preceded the questions), which were chosen based on a causal network analysis (Trabasso et al. 1989) of each passage. Target sentences had a relatively high number of causal connections to prior sentences, and therefore an ideal reader would theoretically be able to make bridging inferences at these locations. The texts in the three stimulus sets were chosen because they had a high proportion of target sentences for which the computer-based assessments of the answers were correlated with independent outcome measures (e.g., performance on the GM). Within any given text in the current study, participants answered the indirect or a direct question after reading each pre-selected target sentence. The type of probe (i.e., direct or indirect) was also empirically determined based on the strength of the correlations between the automated assessments and the independent outcome measures. The stimulus sets were created such that the strength of these correlations was as equal as possible. Information regarding the characteristics of the texts in the stimulus sets is shown in Table 2. As can be seen in Table 2, the three lists had roughly the same number of direct and indirect target sentences. The readability scores in Table 2 indicate that the texts were suitable for eighth graders and should be understandable by university students. However, the Lexile score for stimulus set B is higher than for the other two sets because one text had an outlier score of 2420L. Appendices A and B show a sample text along with the direct and indirect questions for its target sentences. The order of presentation of the texts was randomized for each participant.

Table 2 Summary characteristics of the three stimulus lists

RSAT coding of the answers

Each answer to the target sentences was automatically scored by identifying the number of content words in the answer that were also in the text or in an ideal answer (Millis et al. 2007). Content words included nouns, adverbs, adjectives, and verbs (semantically depleted verbs, such as is, are, and were, were omitted). Word matching was accomplished by literal word matching and Soundex matching (McNamara et al. 2004), which detects misspellings and changes in verb forms (Birtwisle 2002; Christian 1998). For answers to the indirect question, four scores were computed. The paraphrase score was the number of content words from the target sentence. The local bridging score was the number of content words from the sentence immediately prior to the target sentence. The distal bridging score was the number of content words from sentences that were two or more sentences prior to the target sentence. The elaboration score was the number of content words in the answer that were not present in the text.
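For readers unfamiliar with Soundex, the following sketch shows the textbook version of the algorithm; it is meant only to illustrate the kind of phonetic matching cited above, is not necessarily the exact variant used by RSAT, and ignores some edge cases (e.g., the h/w rule).

```python
# Textbook Soundex coding: misspellings such as "recepter" map to the same
# four-character code as "receptor" (R213), so they count as a word match.

SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4",
    **dict.fromkeys("mn", "5"), "r": "6",
}

def soundex(word):
    """Return the four-character Soundex code for a word (e.g., 'receptor' -> 'R213')."""
    word = word.lower()
    first = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    digits = []
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            digits.append(digit)
        prev = digit
    return (first + "".join(digits) + "000")[:4]

def words_match(a, b):
    """Treat two words as matching if they are identical or share a Soundex code."""
    return a.lower() == b.lower() or soundex(a) == soundex(b)
```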

When the same content word appeared in more than one category, it was omitted from a category according to the following order: distal, local, and current sentence. That is, if the same word appeared in both a distal sentence and the current sentence, it was excluded from the distal category and retained in the current-sentence category. This was done to address an ambiguity that arises when an answer contains a word that occurred in both the current sentence and the prior text. Was the reader referring to the current sentence or to the prior text? It is impossible to tell with certainty. Therefore, we assumed that when this occurred, the source of the thought was the current sentence because of its heightened state in working memory, despite the fact that it may have been mentioned earlier. For the direct questions, only one score was computed: the number of content words in the answer that were also in the ideal answer.
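Note that this tie-breaking rule corresponds to the ordering of the membership tests in the earlier indirect-question sketch. The direct-question score is the simplest of the counts; a minimal sketch, reusing the content_words helper from that earlier sketch and treating the ideal answer as a plain string, might look as follows.

```python
def score_direct_answer(answer, ideal_answer):
    """Count content words in the answer that also appear in the ideal answer
    (content_words is the helper defined in the earlier sketch)."""
    ideal = set(content_words(ideal_answer))
    return sum(1 for word in content_words(answer) if word in ideal)
```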

For each participant, we computed mean scores by averaging over the individual scores obtained for each target sentence. Therefore, we calculated mean scores for paraphrases, local bridges, distal bridges, and elaborations from the answers to the indirect questions, and mean comprehension scores from the answers to the direct questions.

Human coding of the answers

The indirect answers were scored by trained human judges using a coding system designed to identify the presence of at least one of the following information processing activities: paraphrases, bridges (local and distal), and elaborations. Think-aloud protocols can contain other kinds of responses (e.g., Pressley and Afflerbach 1995), and our decision to focus on these activities stemmed from the fact that the automated system was designed to detect them. The unit of analysis was the entire answer to a question. It is important to note that the human judges did not simply count words which appeared or did not appear in the answer or texts. Rather, they were trained to detect the conceptual features of paraphrasing, bridging, and elaboration. Moreover, they were trained to identify the use of synonyms, which the word count algorithms cannot detect.

Judges were instructed that paraphrases occurred when the participant restated or summarized information contained in the target sentence. There were three levels for scoring paraphrases. A “0” indicated that no paraphrase was present. A “1” indicated that the answer contained a noun or noun phrase from the current sentence. A “2” indicated that the answer contained a verb phrase that had its basis in the current sentence. It is important to note that judges included synonyms of both nouns and verbs in their ratings.

Bridges were instances where people mentioned concepts (i.e., content words) and clauses from the prior text. Both local and distal bridges were scored via the same criteria. A “0” indicated that the answer did not contain a bridge. A “1” indicated that the answer contained a noun or noun phrase from a prior sentence. A “2” indicated that the answer contained a verb clause that had its basis in a clause from a prior sentence. Local bridges occurred when the answer contained information from the immediately prior sentence, and distal bridges contained information from all other prior sentences. Again, judges were trained to detect synonymous expressions for local and distal bridges.

Elaborations were instances that contained concepts and inferences not mentioned in the text. A “0” indicated that no elaboration was present. A “1” indicated that the answer contained a noun or noun phrase not present in the text. A “2” indicated that the answer contained a main idea expressed in a verb clause drawn from world knowledge. Protocols given a “2” contained a statement that was entailed by the text or that was likely the result of reasoning (see Table 1 for examples). Personal recollections (e.g., “There was a thunderstorm last night.”, “My grandma died of cancer”) and evaluative statements (e.g., “I hate thunderstorms.”, “Cancer is scary.”) were not considered elaborations. Trained judges worked in pairs and there were two groups of trained judges. Inter-rater reliability for assessing the presence of each category of processing was acceptable (r ranged from .81 to .93).

Given that the unit of analysis was the entire protocol, each protocol could contain evidence for multiple processes, as illustrated in the first two examples in Table 1. In fact, protocols often contain a combination of the processes targeted by RSAT (e.g., Trabasso and Magliano 1996a, b; McNamara 2004). As such, these processes are not to be viewed as mutually exclusive. Moreover, the protocols could contain categories of responses that were not part of the coding system (e.g., evaluative statements, recollections from episodic knowledge).

Another scoring system was developed to assess the quality of responses to the direct questions and the answers to the SA questions (i.e., the experimenter-generated test). The system identified the parts of an ideal answer for each question. Responses were scored on a four-point scale (0–3). A 3 indicated that the answer was complete; a 2 indicated that it was almost complete; a 1 indicated that the answer was vague but largely correct; finally, a 0 indicated that the answer was incorrect. Rules for assigning these numbers were established by the coders for each question-answer pair. Inter-rater reliability was high (r = .89). The SA questions were scored in a similar fashion. The inter-item reliability of the SA test was adequate (r = .92).

To summarize, both RSAT and the human coding used the same unit of analysis, which was the entire answer to an indirect or direct question. RSAT counted the number of words that fell into different categories corresponding to the constructs that RSAT was intended to assess. For indirect questions, RSAT counted the number of content words that were also present in the target sentence, the sentence immediately prior to the target sentence, or other prior sentences, as well as the number of words that did not appear in the text. These corresponded to paraphrasing, local bridging, distal bridging, and elaboration. All of these counts had a lower bound of 0 and no upper bound. The human scoring was based on the conceptual presence of these processing activities and had a lower bound of 0 and an upper bound of 2. For direct questions, RSAT counted the number of words in the participant’s answer that were included in an ‘ideal’ answer, whereas the human judges rated answer completeness on the 0–3 scale described above.

Results

Four sets of analyses were conducted to assess the validity of RSAT. The first set correlated our online measure of comprehension (word counts based on the answers to the direct questions) with the GM test of comprehension, the experimenter-generated comprehension test (i.e., the SA test), and the ACT. If RSAT shows convergent validity for measuring comprehension, then its measure of comprehension should correlate with these other measures to about the same extent they correlate with one another. The second set assessed RSAT’s construct validity for its measure of comprehension, as well as its measures of information processing activities (paraphrasing, bridging, elaboration), which were word counts based on the answers to the indirect questions. We correlated the RSAT-generated word count measures of comprehension and information processing activities with expert human judgments of these same variables. The third set assessed the construct validity of the RSAT measures of information processing activities by testing whether they predict performance on the comprehension measures. According to theory (e.g., Graesser et al. 1994), these measures should be correlated with comprehension. Finally, we tested whether the set of RSAT measures (comprehension, paraphrasing, local bridging, distal bridging, elaboration) predicts performance on the SA test of comprehension over and above that accounted for by the GM comprehension test.

The means and standard errors for all measures of comprehension and information processing activities are presented in Table 3. The means derived from RSAT were based on the answers to the embedded indirect and direct questions and were calculated for each participant separately.

Table 3 Means and standard errors for the measures of comprehension and comprehension processes

Analysis 1: Comparing RSAT’s measure of comprehension with other measures of comprehension

We computed the bivariate correlations between the word counts for the embedded direct questions (RSAT’s measure of comprehension), GM performance, and performance on the SA questions (see Table 4). The correlations among the measures were all comparable and statistically significant (p < .001). The mean word counts for the direct questions were significantly positively correlated with the ACT (r = .54), GM (r = .53), and experimenter-generated SA test (r = .45) scores. The correlations between RSAT and the two multiple-choice tests were similar in magnitude to the correlation between those two tests (r = .59). The experimenter-generated SA questions correlated with the ACT (r = .56) and the GM (r = .52). The similar magnitudes of these correlations provide some degree of construct validity for RSAT’s assessment of comprehension.

Analysis 2: Comparing RSAT’s word counts to human raters

The next research question pertained to establishing that the RSAT measures of comprehension and information processing activities (as revealed by the embedded direct and indirect questions) actually measure what they were intended to measure. The correlations between the RSAT measures and expert human raters were very encouraging, and again were all statistically significant (p < .001). The correlations for paraphrases, local bridges, distal bridges, and elaborations were .70, .70, .64, and .44, respectively. As can be seen, the word counts correlated better for information in the discourse context (paraphrases, local bridges, and distal bridges) than for elaborations. Overall, these correlations help to establish the convergent validity of the information processing measures. That is, RSAT and humans show relatively high correlations for each processing activity.

It is also helpful to address discriminant validity. Discriminant validity refers to when a measure is uncorrelated with measures of theoretically unrelated constructs. That is, RSAT’s measure of paraphrasing should be correlated with human ratings of paraphrasing (convergent validity) but relatively uncorrelated with human ratings of local bridges, distal bridges, and elaborations (discriminant validity), since these latter ratings relate to other processing activities. A simple way to address this is to compare the correlations in the preceding paragraph to the off-diagonal correlations when the correlations are arranged in a 4 (RSAT) × 4 (Human) matrix. For example, the RSAT measure of paraphrasing correlated with the human rating of paraphrasing at .70, and this correlation should be, and in fact was, higher than its correlations with the human ratings of local bridging (.43), distal bridging (.27), and elaboration (.19). Hence, the on-diagonal and average off-diagonal correlations for paraphrases were .70 and .29, respectively. The corresponding on- and off-diagonal correlations for local bridges, distal bridges, and elaborations were .70 vs. .41, .64 vs. .41, and .44 vs. .32, respectively. Overall, the pattern of correlations indicates the following order of convergent and discriminant validity: paraphrasing > local bridges > distal bridges > elaborations.
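Because the full 4 × 4 matrix is not reproduced here, the following generic sketch simply shows how the on-diagonal (convergent) and mean off-diagonal (discriminant) correlations reported in this paragraph would be extracted from such a matrix, with rows as RSAT measures and columns as human ratings in the same order.

```python
import numpy as np

def convergent_discriminant(corr, labels):
    """For a square measure-by-rating correlation matrix, return each measure's
    on-diagonal (convergent) value and mean off-diagonal (discriminant) value."""
    corr = np.asarray(corr, dtype=float)
    summary = {}
    for i, label in enumerate(labels):
        row = corr[i]
        summary[label] = {
            "convergent": row[i],                       # same construct, RSAT vs. human
            "mean_discriminant": np.delete(row, i).mean(),  # other constructs
        }
    return summary
```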

In reference to RSAT’s measure of overall comprehension, the word counts from the direct answers were significantly correlated with the human judgments (r = .70, p < .001). Overall, the pattern of correlations suggests that most of the word counts are correlated with human expert judgments and therefore can serve as a proxy for them.

Analysis 3: Predicting comprehension from RSAT’s measures of information processing activities

According to theory, measures of paraphrasing, bridging, and elaboration should predict comprehension (Graesser et al. 1994). First, we predicted performance on the direct questions from the mean scores for paraphrasing, local bridging, distal bridging, and elaboration using multiple regression (dummy-coded variables were entered for each of the three RSAT stimulus lists). These variables were entered simultaneously (forced entry). Table 5 contains the resulting coefficients. The regression equation for the embedded direct questions accounted for a significant 38% of the variance, F(6, 144) = 14.40, p < .001. Paraphrase, distal bridging, and elaboration scores were all significant positive predictors of performance on the embedded direct questions.

Table 4 Bivariate Pearson correlations between measures of comprehension and comprehension processes
Table 5 Predicting performance on comprehension measures

Second, we predicted performance on the SA questions from the same measures. The equation accounted for a significant 21% of the variance, F(6, 144) = 6.35, p < .001. As can be seen in Table 5, distal bridging and elaboration scores were both significant positive predictors. The paraphrasing score was a significant negative predictor of performance on the SA test, consistent with prior findings (Magliano and Millis 2003; Millis et al. 2006).
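For readers who wish to see the form of these analyses, the sketch below fits a regression of the same kind (four process scores plus dummy-coded stimulus lists, entered simultaneously) using the statsmodels package; the column names and randomly generated values are placeholders standing in for the per-participant RSAT means and SA scores, not the actual data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data with hypothetical column names standing in for the
# per-participant RSAT process means and short-answer (SA) comprehension scores.
rng = np.random.default_rng(0)
n = 190
df = pd.DataFrame({
    "paraphrase": rng.normal(3.0, 1.0, n),
    "local_bridge": rng.normal(1.0, 0.5, n),
    "distal_bridge": rng.normal(2.0, 1.0, n),
    "elaboration": rng.normal(4.0, 1.5, n),
    "stim_list": rng.choice(["A", "B", "C"], n),
})
df["sa_score"] = (0.4 * df.distal_bridge + 0.2 * df.elaboration
                  - 0.1 * df.paraphrase + rng.normal(0, 1, n))

# Four process measures plus dummy-coded stimulus lists, entered simultaneously.
model = smf.ols(
    "sa_score ~ paraphrase + local_bridge + distal_bridge + elaboration + C(stim_list)",
    data=df,
).fit()
print(model.rsquared)   # proportion of variance accounted for
print(model.params)     # regression coefficients (cf. Table 5)
```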

Overall, these results illustrate that RSAT’s measures of information processing activities predict measures of comprehension provided by the RSAT tool (direct questions) and by independent measures (SA questions), thus establishing their construct validity.

Analysis 4: Comparing RSAT to standardized tests

The final question pertained to whether the RSAT measures can account for comprehension performance at a level comparable to standardized tests. Our “gold standard” measure of comprehension performance was the answers to the SA test. The standardized tests were the GM and ACT scores. A series of regression analyses was conducted to address this question. The first involved simultaneously entering the RSAT scores for comprehension (direct questions), paraphrasing, local bridging, distal bridging, and elaboration into a regression equation predicting performance on the SA questions. This analysis revealed that the measures provided by RSAT accounted for a significant 33% of the variance, F(7, 143) = 10.21, p < .001. Comparing the multiple correlation coefficient provided by the regression equation to the bivariate correlations between the standardized measures and SA performance provides one basis for comparing the measures. The correlation coefficients for RSAT, GM, and ACT were .58, .52, and .56, respectively. As such, each of the assessment approaches was comparably correlated with performance on the SA questions.

A set of two-step, hierarchical regression analyses was also conducted. These analyses compared RSAT to the GM and ACT on the amount of unique variance each shares with SA performance. In the first step of each analysis, the RSAT measures were entered simultaneously (forced entry), and in the second step, the standardized measure was entered (GM or ACT). Next, another hierarchical analysis was conducted, but the order of entry of the assessment measures was reversed (e.g., GM entered in the first step and RSAT measures entered in the second step). R² changes for the second steps of these analyses were used to assess the unique variance of each measure. With respect to the comparison of the RSAT measures with the GM, the RSAT scores accounted for a significant 14% of the variance, F(5, 142) = 6.94, p < .001, and the GM scores accounted for a significant 8% of the variance, F(1, 142) = 20.54, p < .001. With respect to the comparison of the RSAT measures with the ACT, the RSAT scores accounted for a significant 10% of the variance, F(5, 142) = 4.97, p < .001, and the ACT scores accounted for a significant 9% of the variance, F(1, 142) = 23.19, p < .001. The results of this last set of analyses suggest that the RSAT measures of comprehension and comprehension processes were as predictive of comprehension (SA questions) as the two standardized tests (GM, ACT).
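The two-step logic can be sketched as follows, continuing the placeholder DataFrame from the previous sketch and adding hypothetical columns for the GM and RSAT direct-question scores; the R² change for the added predictor is evaluated with the usual F test.

```python
import statsmodels.formula.api as smf

# Hypothetical placeholder columns standing in for GM and direct-question scores.
df["gm"] = rng.normal(30, 5, n)
df["direct_score"] = rng.normal(2, 1, n)

rsat_terms = "paraphrase + local_bridge + distal_bridge + elaboration + direct_score"
step1 = smf.ols(f"sa_score ~ {rsat_terms}", data=df).fit()        # RSAT measures only
step2 = smf.ols(f"sa_score ~ {rsat_terms} + gm", data=df).fit()   # add the standardized test

# Unique variance of GM over and above the RSAT measures; reversing the order
# of entry would give the unique variance of the RSAT measures over GM.
r2_change = step2.rsquared - step1.rsquared
f_change = (r2_change / 1) / ((1 - step2.rsquared) / step2.df_resid)
print(r2_change, f_change)
```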

Discussion

Traditional comprehension assessment tools are typically not designed to assess comprehension as it emerges during reading or the processes that give rise to comprehension (Magliano et al. 2007). Developing assessment tools that provide valid and reliable assessments of these dimensions of comprehension would be a boon to educational practice because they could provide a basis for giving students feedback on how they approach reading for comprehension. The present study assessed the viability of assessing comprehension by using a computer-based scoring system of open-ended responses to questions embedded in text, which we believe could be the basis of such an assessment tool. In this study, we used RSAT, which, unlike the vast majority of commercially published comprehension skills assessments, was developed based on the evidence-based approach for test development (Mislevy 1993; Pellegrino and Chudowsky 2003; Pellegrino et al. 2001). Theory guided the construction of the questions and where they were placed within the passages. The research we presented here can be viewed as providing a “proof of concept” for RSAT and, more generally, for an assessment tool designed not only to assess comprehension skill but also some of the processes that support it.

What evidence do we have for the proof of concept? First, we have evidence of convergent validity between the RSAT measure of comprehension and the standardized and experimenter-generated measures of comprehension. That is, the RSAT comprehension score was correlated with performance on the GM and ACT tests to the same extent that these tests were correlated with each other. Moreover, the RSAT, GM, and ACT scores were all comparably correlated with performance on the SA test. When we assessed the unique variance accounted for by the RSAT measures and the standardized measures (GM and ACT), we found that the RSAT measures accounted for slightly more unique variance than the GM test and a comparable amount as the ACT. One advantage of RSAT over these standardized tests is that it provides an assessment of some of the processes that support comprehension.

Second, the data indicate that the RSAT measures of comprehension and information processing activities were correlated with human judgments. The pattern of correlations showed convergent validity and some degree of discriminant validity. The correlations indicative of convergent validity were high and statistically significant. These data are impressive given that the human judges were trained to detect the strategies conceptually, rather than particular words or word counts. These data indicate that the word counts were indeed valid measures of paraphrasing, local and distal bridging, and answers to the direct questions. The measures of elaboration had the lowest indicators of validity.

Third, we showed that the RSAT measures of information processing activities were predictive of performance on both the RSAT measure of comprehension (direct questions) and the independent SA test. The measures of integration (distal bridging scores) and elaboration (elaboration scores) were significant positive predictors of both measures of comprehension. However, there was a discrepancy with respect to the paraphrase score. The paraphrase score was a significant positive predictor for the direct questions, but a significant negative predictor for the SA test. In our prior research, we have documented similar negative correlations with comprehension tests similar to the SA test used here (Magliano and Millis 2003; Millis et al. 2006). We have interpreted this finding as indicating that readers who paraphrase excessively tend not to integrate the current sentence with prior sentences and therefore do not construct globally coherent representations (see also Wolfe and Goldman 2005). That is, they focus on understanding individual sentences rather than linking each sentence to the existing passage representation. One reason for finding a positive correlation between paraphrasing and performance on the direct questions is that paraphrasing probably strengthens memory for the explicit content of the text, and in many cases, the explicit text provided the correct answer to the direct questions.

Although we believe we have met the ‘proof of concept’ requirement, it is premature to conclude that a test of comprehension skill based on RSAT is ready to be used on a large scale. Comprehension emerges from a complex interaction between the reader, text, and task (Snow 2002). We believe that the development of such an assessment tool would require a deeper understanding of this interaction. For example, some strategies would be more appropriate for some texts than others. If readers are comprehending a text that describes a causal sequence of events, then causal bridging inferences are important for comprehension, but this would not be the case for texts that describe biological nomenclature. We do believe that the data show that tests based on open-ended responses are feasible to construct and that RSAT could provide a valuable research tool for developing assessments of this nature.

It is important to acknowledge that RSAT only assesses paraphrasing, bridging, and elaboration, which comprise only a small subset of the processes that support comprehension. Skilled comprehenders use a number of strategies for meeting their reading goals (Gaskins et al. 2007; Pressley and Afflerbach 1995; Pressley et al. 1985; Pressley and Woloshyn 1995). RSAT cannot determine the processes (strategic or otherwise) that give rise to the answers that readers produce. For example, RSAT cannot determine whether bridging words were produced as part of a strategy to summarize or self-explain, or whether they were under the strategic control of the reader. The cognitive processes that give rise to the content of verbal protocols can be induced by human judges in some instances (Trabasso and Magliano 1996a, b), but it would be challenging to implement this in an automated system with our current approach of essentially taking “snapshots” of comprehension and processes across a text. Despite the limitations of the current version of RSAT, its framework allows for flexibility in that researchers and educators can compose their own direct questions to suit particular goals. For example, one could construct questions that align with well-established assessment frameworks, such as that adopted by the OECD (2002). Specifically, one could develop questions that require the extraction, integration, or evaluation of the materials.

Additionally, one would need to be able to assess the relationship between metacognitive awareness and the processes revealed by RSAT. Certainly, the effective use of these strategies requires metacognitive awareness of their appropriateness given the dynamically changing demands of a text and a reader’s level of comprehension (McNamara and Magliano 2009a). We have conducted subsequent research showing that measures associated with self-regulation (including self-reported awareness and use of metacognitive strategies) partially mediate the relationship between the reading processes measured by RSAT and comprehension (Magliano et al. 2010). It is important to note that this study relied on human judgments of the answers, and one would want to assess whether the RSAT word counts are sensitive to this relationship as well.

Another aspect of RSAT that requires improvement is the detection of elaborations. As we discussed above, the synonym problem means that our measure of new words will include synonyms of words in the discourse. As such, our measure of elaboration can be conceptualized as a continuum of relevant knowledge ranging from synonyms of text words to true knowledge-based elaborative inferences. The low to moderate correlations between new words and human judgments of elaboration can be explained by the fact that our human judges were trained to detect true elaborations rather than synonyms of text content. The synonym problem may also explain why ‘new words’ was not significantly correlated with the outcome measures of comprehension, although it was in the analyses involving the entire data set. In our past attempts to detect elaborations via computer-based coding, we created a list of words from elaborations that readers tend to use when thinking aloud while reading the target sentences and then used LSA (Landauer and Dumais 1997) to compare those words to the protocols (Millis et al. 2004, 2007). LSA stands for latent semantic analysis and is a statistical method for assessing the semantic similarity between two units of text (words, sentences, paragraphs, or whole texts). However, we have obtained higher correlations with human judgments with the current approach. Yet the ‘elaboration problem’ remains: it is difficult to anticipate the variety of elaborations that readers can produce, in contrast to text-based inferences, which are constrained by the content of the text.

Moreover, not all elaborations are relevant and support deep comprehension (Wolfe and Goldman 2005). For example, connecting the text with personal episodic knowledge may not be useful in supporting deep comprehension that reflects the underlying meaning of the text. The human coding system was designed to focus on elaborations that were based on general or text-specific prior knowledge or that were the result of “on the fly” reasoning. However, the RSAT word count algorithms currently have no basis for distinguishing these processes from other kinds of responses that involve words outside the textual context (e.g., valenced evaluations or personal recollections). As such, unlike the word counts for paraphrasing and bridging, the word counts for elaborations are likely reflective of processes that go beyond those that were targeted in the coding scheme. This is clearly an aspect of the current word count algorithms that warrants improvement.

Additionally, we should point out some concerns regarding the ecological validity of RSAT, although some have argued that overzealous concerns regarding ecological validity can stifle innovation in education (Dunlosky et al. 2009). The most pressing, in our opinion, is how embedded questions affect the profile of responses and the comprehension of the RSAT texts. Readers are not typically asked questions as they read. There is an impressive literature from the last three decades on the effect that adjunct questions have on increasing comprehension (e.g., Callender and McDaniel 2007; Peverly and Wood 1999; Rothkopf 1970; Sagerman and Mayer 1987). It is likely that the indirect question had less of an impact on comprehension than the direct questions because the think-aloud instructions on which the indirect question was modeled do not interfere with complex cognitive activities (Ericsson and Simon 1993). Future research is needed to indicate the extent to which answering indirect and direct questions affects the comprehension of the texts, if it does so at all. Another ecological issue arises from the fact that readers were only able to see one sentence at a time and were unable to go back to prior sentences. Normally, readers can go back to reread prior text when they wish. An alternative version of RSAT would allow readers to regress, but not during question answering, because the reader might use the prior text to answer the question, and that would be antithetical to the goal of assessing comprehension as it happens.

In conclusion, RSAT provides a new approach to reading assessment. We believe that its greatest promise is as a formative assessment that could be used to inform and guide interventions designed to help struggling readers. For example, it could be used to tailor strategy training to the specific needs of students in computer-based strategy training environments (e.g., McNamara et al. 2004). If RSAT demonstrates that a student does not bridge, then that strategy could be emphasized during training. Additionally, RSAT could be valuable in developmental reading programs in postsecondary education. Most of these rely on the Nelson-Denny test of reading comprehension to diagnose students with comprehension problems (Boylan 1983; Wang 2006). RSAT could be valuable for this population of readers because we have demonstrated that RSAT outperformed the GM, a multiple-choice test similar to the Nelson-Denny, in accounting for comprehension. In addition, RSAT identifies subcomponents of comprehension, namely paraphrasing, bridging, and (to some extent) elaborating, in addition to providing an overall measure of comprehension. The Nelson-Denny and GM give an overall account of comprehension, but no measure of strategies or inferences.