Introduction

Reading proficiency is critical for academic achievement (e.g., Sparks et al., 2014), career opportunities (e.g., Hernandez, 2011), and mental and physical wellbeing (e.g., Jordan et al., 2014). Decades of research on reading development and reading instruction yield broad consensus as to how reading develops and how teachers can facilitate the acquisition of reading skills (Castles et al., 2018). Still, there is evidence that some teacher preparation programs do not adequately expose teachers to this science of teaching reading. Teachers, teacher educators, literacy consultants, and administrators responding to an International Literacy Association (2020) survey identified “the variability of teacher knowledge and teaching effectiveness” as the greatest barrier to equity in literacy education. The majority of respondents (60%) did not agree that teacher preparation programs equip educators with the skills they need for effective reading instruction.

Developing reliable, valid instruments that measure educator reading-related knowledge can facilitate the provision of customized professional learning opportunities that better build on teachers’ strengths and address areas for growth. Because learning to read is a complex process that relies on multiple types of knowledge and skill, there is value in designing instruments that measure teachers’ knowledge to support learning within multiple domains. Specifically, reading comprehension depends on the development of language knowledge and skill (e.g., vocabulary, syntax, verbal reasoning) to support meaning making, as well as on code-focused knowledge and skills that contribute to word reading (e.g., phonemic awareness, phonics knowledge; Cervetti et al., 2020; Hoover & Tunmer, 2018). For this reason, it is valuable to assess teachers’ knowledge of key concepts and evidence-based instructional practices associated with improved language and reading comprehension as well as with foundational knowledge and skills to teach word reading. It is problematic that previous surveys of elementary-grade teacher knowledge have primarily limited their focus to measuring code-focused knowledge (e.g., Binks-Cantrell et al., 2012; Carlisle et al., 2011; Cohen et al., 2017; Cunningham et al., 2004; Fielding-Barnsley & Purdie, 2005; Foorman & Moats, 2004; Jordan et al., 2018; McMahan et al., 2019; Moats, 1994; Piasta et al., 2020; Spear-Swerling & Brucker, 2003; Washburn et al., 2011, 2017), with a few surveys focused on language comprehension alone (e.g., Duguay et al., 2016; Piasta et al., 2022) or solely on knowledge to teach reading comprehension strategies (Wijekumar et al., 2019).

A handful of previous studies have measured elementary-grade teachers’ knowledge to facilitate both word reading and language comprehension. As with most studies of knowledge to build foundational word-reading skills, most of these surveys have not reported detailed information about instrument psychometric properties (e.g., factor structure; Brady et al., 2009; Goldfeld et al., 2021; Spear-Swerling & Cheesman, 2012; Spear-Swerling & Zibulsky, 2014). However, a few have explored these properties and report adequate overall or subscale reliability (Carlisle et al., 2011; Davis et al., 2022; Phelps & Schilling, 2004). We see the current project as building on this prior research. Although it does not directly replicate any previous survey study, this project serves as a conceptual replication, exploring the degree to which findings remain the same across different instruments and research teams. Replication experiments function as an operationalization of objectivity by demonstrating that the same findings are obtainable in different contexts by different researchers. Knowledge gained from such experiments is separated from other confounding elements of a research design (i.e., time, place, or persons). A recent report from the National Science Foundation and the Institute of Education Sciences (2018) highlights the importance of such replication studies: “Efforts to reproduce and replicate research findings are central to the accumulation of scientific knowledge that helps inform evidence-based decision making and policies” (p. 1). Thus, the present study reports on the development and validation of a new survey of educator knowledge across five domains (phonological awareness [PA]; phonics, decoding, and encoding; reading fluency; oral language; and reading comprehension) that is aligned with reading science and for which we report detailed information about psychometric properties.

What do elementary educators need to know to teach students to read?

The simple view of reading (SVR; Gough & Tunmer, 1986), an empirically validated framework for understanding the contributions of reading component skills to reading comprehension, posits that reading comprehension is the product of word recognition and language comprehension. Word recognition depends on the development of phonemic awareness, phonics knowledge, knowledge of orthography, and decoding skill; language comprehension depends on background and vocabulary knowledge, syntactical knowledge, knowledge of text genre and macrostructure, and verbal reasoning skill (Scarborough, 2001).

Large-scale meta-analyses conducted during the last two decades have confirmed that effective elementary grade reading instruction addresses both factors described by the authors of the SVR, including the different subcomponent skills (Donegan & Wanzek, 2021; Foorman et al., 2016). There is strong evidence that effective elementary-grade reading instruction emphasizes phonemic awareness, grapheme-phoneme correspondences and decoding, and fluent passage reading (Foorman et al., 2016). There is moderate evidence supporting the effectiveness of instruction that (a) focuses on morphology, including instruction in affixes and roots; (b) builds students’ vocabulary/world knowledge and targets other elements of academic language; and (c) provides explicit explanations, modeling, and practice using reading comprehension strategies (Foorman et al., 2016).

Knowledge and practice standards for teachers of reading articulated by the International Literacy Association ([ILA]; ILA, 2017), the International Dyslexia Association ([IDA]; IDA, 2018) and other international, national, and state-level organizations also emphasize the importance of knowledge within these code- (i.e., word recognition) and meaning- (i.e., language comprehension) focused domains. Importantly, standards typically address both content knowledge (knowledge of the subject matter to be taught) and pedagogical content knowledge (knowledge required to teach this subject matter effectively to students; Shulman, 1987); research has demonstrated positive relations of each type of knowledge with student learning (e.g., Baumert et al., 2010; Podhajski et al., 2009). Phelps and Bridgeman (2022) articulate three related categories of knowledge necessary for effective reading teaching: (1) principles of teaching and learning (e.g., can teachers retrieve definitional knowledge, or knowledge related to principles of reading development or instruction?); (2) concepts and skills students are learning (e.g., can teachers segment a word into constituent phonemes or resolve an anaphor in a text?); and (3) applied content knowledge (i.e., when presented with a scenario, can teachers demonstrate situated knowledge by identifying the most appropriate instructional move?). Phelps and Bridgeman (2022) note that the first category of items focuses on declarative knowledge, or “knowing that”; the second focuses on procedural knowledge, or “knowing how”; and the final category largely taps conditional knowledge, or “knowing when and why” (p. 2027). Knowledge and practice standards thus foreground the importance of teachers’ content knowledge related to English phonology, orthography, morphology, syntax, semantics, and text structure and genre. They also prioritize an understanding of literacy development and knowledge of evidence-based instructional methods that support students’ literacy learning across the domains of oral language, PA, phonics and decoding, word recognition, reading fluency, vocabulary, and reading comprehension.

Teacher characteristics and teacher knowledge

Numerous research teams have explored the degree to which elementary teacher characteristics predict knowledge to teach reading. In most studies, significant associations of teacher characteristics with teacher knowledge are the exception rather than the rule. For example, most research does not consistently demonstrate significant relations of teacher education or access to professional development with knowledge to teach reading (e.g., Davis et al., 2022; Jordan et al., 2018). Studies exploring relations of certification type with reading-related knowledge also rarely yield significant associations (e.g., Cunningham et al., 2004; Pittman et al., 2020).

One possible explanation for non-significant associations of teacher education with knowledge to teach reading is that teacher education coursework may not be aligned with questions on teacher knowledge surveys (Hoffman & Roller, 2001). In addition, teacher education programs vary in their emphasis on reading-related knowledge. Spear-Swerling et al. (2005) and Piasta et al. (2022) suggest measures defining education more narrowly (e.g., as amount of reading-related coursework completed) may more reliably yield positive relationships with teacher knowledge. However, research examining relations of reading-focused courses completed by teachers with teachers’ reading-related knowledge has not furnished evidence supporting this theory (Jordan et al., 2018; Washburn et al., 2017).

Receiving post-degree training on reading development or instruction is associated with increased knowledge in some studies (e.g., McMahan et al., 2019; Peltier et al., 2022). However, other studies suggest teachers’ exposure to reading-related professional development opportunities is not a significant predictor of knowledge (e.g., Washburn et al., 2017; White et al., 2020). Taking traditional paths to teacher certification is also not reliably associated with increased knowledge to teach reading. Pittman et al. (2020) found traditionally certified teachers did not possess more reading-related knowledge than alternatively certified teachers. Cunningham et al. (2004) found no difference between fully credentialed and not fully credentialed elementary-grade teachers on tasks measuring PA and phonics knowledge. Washburn et al. (2017) found certification type (general versus special education) did not predict reading-related knowledge.

Limitations of previous surveys of educator knowledge to teach reading

As noted earlier, previous surveys of reading-related knowledge have prioritized knowledge to teach word reading; it is rarer for surveys to also measure knowledge to teach language and reading comprehension. The vast majority of teacher knowledge surveys do not report detailed information about instrument psychometric properties, such as factor structure; this is also true of studies measuring elementary-grade teacher knowledge across both code- and meaning-focused domains (Brady et al., 2009; Goldfeld et al., 2021; Spear-Swerling & Cheesman, 2012; Spear-Swerling & Zibulsky, 2014). Psychometric information that is available for more comprehensive instruments frequently reveals low reliability for survey subscales, especially those focused on language or reading comprehension (e.g., Brady et al., 2009; Spear-Swerling & Zibulsky, 2014).

A few surveys of elementary-grade teacher knowledge across code- and meaning-focused domains have reported detailed information about psychometric properties and adequate overall or subscale reliability (Carlisle et al., 2011; Davis et al., 2022; Phelps & Schilling, 2004). However, older surveys occasionally include items based on perceptions of reading development and instruction that are contradicted by recent reading science. For example, we identified previous survey items that were inconsistent with research indicating that matching students to texts at their “appropriate” or “just-right” level (i.e., rather than supporting them in reading grade-level texts) does not improve learning (Shanahan, 2014). Other items suggested that a focus on rhyme awareness in kindergarten was developmentally appropriate and a necessary pre-requisite to instruction in phonemic awareness (when research suggests that phonemic awareness does not depend on awareness of larger units of sound in language, and that phonemic awareness instruction is a developmentally appropriate focus for core reading instruction in K classrooms; Brady, 2020).

That previous surveys sometimes demonstrate an antiquated understanding of reading development and instruction should not be surprising; the science of reading is always evolving, and past reading intervention research has not always been translated accurately to researchers studying teacher knowledge or teacher training to teach reading (Solari et al., 2020). There is always value in developing new instruments aligned with the most recent science. In addition, there is value in conducting conceptual replications of previous research using different assessments of the measured constructs, developed by a different research team and administered at a different time, with a different sample, in a different context.

Study purpose

Thus, the purpose of the present study was to develop a science-aligned instrument that provides a reliable and valid assessment of teacher content and pedagogical knowledge to teach both word reading and language comprehension. Specifically, the Teacher Understanding of Literacy Constructs and Evidence-Based Instructional Practices (TULIP) survey measures teacher knowledge within five literacy domains: PA; phonics, decoding, and encoding; reading fluency; oral language; and reading comprehension. We asked:

  1. What is the factor structure of the TULIP survey?

  2. Does the TULIP survey demonstrate adequate reliability?

  3. What levels of knowledge are demonstrated by the K-5 teachers in this study’s sample?

  4. To what extent do teacher characteristics predict performance on the TULIP survey? In particular, do (a) education, (b) certification, (c) position, or (d) grade level taught relate to educator knowledge to teach reading?

Method

Item development

During the 2021–2022 academic year, a multistep process was used to develop items and ensure content validity for the TULIP survey. First, we searched three electronic databases (ERIC, Google Scholar, PsycINFO) using the following search terms in various combinations (e.g., teacher knowledge, reading, literacy, spelling, evidence-based practices) to identify studies that employed a survey of elementary-grade teacher knowledge related to reading. When a publication met these criteria, we conducted snowball searches (i.e., searched publications’ reference lists) and citation searches (i.e., searched works citing publications we had previously identified) to locate additional studies that did not appear in our initial search of the literature. In this way, we identified 57 studies on the topic of elementary educator knowledge to teach reading. For each article, we collected information about the constructs that were assessed and the degree to which teachers’ knowledge was associated with instructional practice and/or student literacy outcomes. We also consulted What Works Clearinghouse (WWC) practice guides on elementary-grade reading instruction (e.g., Foorman et al., 2016; Shanahan et al., 2010) and position statements and teacher education standards related to reading instruction published by international literacy organizations (e.g., IDA, 2018; ILA, 2017). Through this process, we identified an initial set of five literacy domains (PA, phonics/decoding/encoding, reading fluency, oral language, and reading comprehension) within which research and standards indicate it is important for elementary-grade teachers to have knowledge.

Notably, we decided not to develop a separate subscale measuring knowledge to build morphological awareness. Instead, items related to morphological awareness were included in the Oral Language subscale. Morphemes are typically defined as the smallest meaningful units of language (Carlisle, 2003); awareness of common morphemes (e.g., affixes; Latin roots and Greek combining forms) can help students determine words’ meanings. Thus, although students can learn to decode and automatically recognize common morphemes in a way that aids word reading and spelling, the fact that morphemes are, at a fundamental level, units of meaning informed our decision to include items testing educators’ knowledge to teach about morphology within the meaning-focused oral language subscale.

The first and third authors were the primary item developers, although the entire author team assisted with developing, reviewing, and providing feedback on items. We drew on items used to assess pre-service and in-service educators’ learning during undergraduate and graduate-level reading education courses taught at the university where the first three authors are professors. We also developed new items on topics deemed important within previously described research reviews and teacher education standards. During item development, we adhered to the guidelines for writing items outlined by Bandalos (2018; e.g., when writing multiple-choice items, we used item stems that were clear and concise, stated items in a positive form, made answer choices the same length, and ensured that answer choices were grammatically consistent). Four experts in early literacy development and instruction reviewed the initial set of 98 items, organized by literacy domain. Expert reviewers were associate or full professors of education who provided written feedback on the overall survey (e.g., whether the constructs related to teacher knowledge of evidence-based literacy instruction that should be measured were present) and on specific items (e.g., whether items were accurate and fairly phrased). If an expert reviewer raised a concern about an item, the item was either revised or deleted. A set of items (n = 59) was selected for testing based on expert review and consultation among authors. These items addressed both code- and meaning-focused domains (i.e., PA, phonics/decoding/encoding, reading fluency, oral language, and reading comprehension).

Pilot project

We piloted the revised, 59-item survey with a sample of 206 educators who taught reading to students in Grades K-5. Participants were recruited via Qualtrics Panel, a subdivision of Qualtrics (a private research software company specializing in Web-based data collection that partners with more than 20 Web-based panel providers). Panelists received an invitation to participate in the survey via email or a social media account. Those who completed the pilot survey received approximately $5 or the equivalent in redeemable points directly from Qualtrics; this is a standard incentive used for Qualtrics panel surveys that collect data from elementary-school educators.

Once data collection was complete, we performed a set of preliminary analyses to identify items that needed revision or removal. Specifically, we examined (a) overall mean accuracy to identify items suffering from floor or ceiling effects and (b) item-total score correlations to identify items that were misinterpreted by respondents. We then performed confirmatory factor analyses and examined inter-item correlations to understand the dimensionality of the scale. We also performed reliability analyses to understand the internal consistency of the overall scale and subscales. Based on these preliminary results, we administered a revised version of the TULIP instrument to our validation sample.
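To make this screening step concrete, the sketch below flags items with floor or ceiling effects and low corrected item-total correlations using pandas and NumPy. The synthetic data, column names, and flagging thresholds are illustrative assumptions, not the authors’ actual code or cut-offs.

```python
# Illustrative sketch of pilot item screening (not the authors' actual analysis code).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic placeholder: 206 pilot respondents x 59 dichotomously scored items.
responses = pd.DataFrame(
    rng.binomial(1, 0.55, size=(206, 59)),
    columns=[f"item_{i + 1}" for i in range(59)],
)

item_means = responses.mean()            # (a) item difficulty (proportion correct)
total = responses.sum(axis=1)

# Corrected item-total correlation: correlate each item with the total score
# computed from the remaining items, so the item does not correlate with itself.
item_total_r = pd.Series(
    {col: responses[col].corr(total - responses[col]) for col in responses.columns}
)

# Assumed cut-offs for illustration: floor/ceiling effects or weak discrimination.
mask = (item_means < 0.10) | (item_means > 0.90) | (item_total_r < 0.20)
flagged = mask[mask].index
print(f"{len(flagged)} items flagged for revision or removal")
```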

Final TULIP instrument

The final TULIP instrument (see Supplemental Appendix A for all items) consisted of 55 multiple-choice items measuring teacher knowledge within the domains of PA (12 items); phonics, decoding, and encoding (15 items); reading fluency (7 items); oral language (9 items); and reading comprehension (12 items). The number of answer options for each item ranged from three to seven. Most TULIP survey items assessed content knowledge (n = 42; ~ 76%). These items asked about a concept or skill important for early reading, or the “principles of teaching and learning” described by Phelps and Bridgeman (2022). For example, several content knowledge items asked teachers to identify the most likely explanations for errors students might make in their spelling. Other content knowledge items asked educators about concepts and skills students were learning (the second item type described by Phelps and Bridgeman); for example, one question asked teachers to identify the rule that informs the use of “ck” to spell the /k/ sound. The remaining items in the TULIP survey assessed pedagogical content knowledge (n = 13; ~ 24%). These items asked about instructional practices research indicates are effective or ineffective for teaching a component of early reading and were closely aligned with the “applied content knowledge” category identified by Phelps and Bridgeman. For example, one TULIP item asked, “Which of the following is not an evidence-based component of reading fluency instruction?”

Validation study

Participants and data collection procedures

For the validation study, participants were again recruited via the Qualtrics Panel. Eligible respondents were (a) residing in the United States, (b) working as professional educators of K-5 students, and (c) teaching English language arts, literacy, or reading. Participants who matched the eligibility criteria and consented to participate in the research study were asked to complete the online survey. Panelists who completed the survey received approximately $5 or the equivalent in redeemable points directly from Qualtrics. Although the survey collected no identifying information, Qualtrics ensured that participants were only able to take the survey one time.

Quota sampling was used to ensure a sample of respondents who were diverse in terms of age, race, ethnicity, and region. The survey utilized an opt-in sample and was not designed to be fully representative of the national population. However, Qualtrics panels have been found to be more demographically representative than samples recruited through other opt-in approaches (Boas et al., 2020). Table 1 displays demographic information for the survey sample. The sample broadly mirrored the regional, racial, and ethnic breakdown of the United States, based on the 2020 Census (U.S. Census Bureau, 2020). The average age of survey respondents was 44.62 years (range = 20–77 years). Participants taught in 46 states (with 10% teaching in California, 9% in Texas, and 7% in Florida).

Table 1 Teacher demographic information

All surveys were conducted in English. Although items within each subscale were not randomized, subscales were presented to participants in a random order. Participants were required to choose at least one answer option for each item before moving on to the next item but could end the survey at any time. Participants were excluded from the analysis if they did not answer all items. Additionally, to minimize inattentive responses, participants were excluded from the analysis if their survey was completed in less than ten minutes. Overall, 457 participants began the survey, but only 313 participants provided complete data for our analyses. The average time to complete the entire survey was 19 min.
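A minimal sketch of these exclusion rules, assuming hypothetical column names for completion time and item responses, might look like the following.

```python
# Illustrative sketch of the exclusion rules (complete responses only, >= 10 minutes).
# Column names and rows are hypothetical placeholders, not the study's data.
import pandas as pd

raw = pd.DataFrame({
    "duration_sec": [1140, 480, 1500, 2100],   # synthetic placeholder rows
    "item_1": [1, 1, 0, None],
    "item_2": [0, 1, 1, 1],
})

item_cols = [c for c in raw.columns if c.startswith("item_")]
analytic = raw[
    raw[item_cols].notna().all(axis=1)         # answered every item
    & (raw["duration_sec"] >= 10 * 60)         # completed in at least 10 minutes
]
print(f"Retained {len(analytic)} of {len(raw)} respondents")
```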

Results

TULIP survey item analysis

We performed an analysis of the difficulty and discrimination of the TULIP survey items, which is presented in Supplemental Appendix B. Few items were extremely easy or extremely difficult: the median proportion correct was 0.49 and 80% of the items (45 of 55) had a proportion correct between 0.35 and 0.71. Most of the items had good discrimination in terms of their associations with both the domain and the overall scores: 75% of the items (41 of 55) had item-domain score and item-total score correlations greater than 0.30. The median item-domain score correlation was 0.40 and the median item-total score correlation was 0.43.
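The summaries above (median proportion correct and the share of items with item-domain and item-total correlations above 0.30) can be computed along the following lines. The item-to-domain mapping, the synthetic responses, and the use of corrected (item-removed) correlations are assumptions for illustration, not the authors’ procedure.

```python
# Illustrative sketch of item difficulty and discrimination summaries.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_items_per_domain = {"PA": 12, "Phonics": 15, "Fluency": 7, "OralLang": 9, "Comp": 12}
columns, domain_of = [], {}
for d, k in n_items_per_domain.items():
    for i in range(k):
        name = f"{d}_{i + 1}"
        columns.append(name)
        domain_of[name] = d
responses = pd.DataFrame(rng.binomial(1, 0.5, size=(313, len(columns))), columns=columns)

difficulty = responses.mean()                  # proportion correct per item
total = responses.sum(axis=1)
domain_totals = {d: responses[[c for c in columns if domain_of[c] == d]].sum(axis=1)
                 for d in n_items_per_domain}

item_total_r = pd.Series({c: responses[c].corr(total - responses[c]) for c in columns})
item_domain_r = pd.Series({c: responses[c].corr(domain_totals[domain_of[c]] - responses[c])
                           for c in columns})

print("median difficulty:", round(difficulty.median(), 2))
print("share of items with item-total r > .30:", round((item_total_r > 0.30).mean(), 2))
print("share of items with item-domain r > .30:", round((item_domain_r > 0.30).mean(), 2))
```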

TULIP survey factor structure

We performed confirmatory factor analyses (CFAs) to understand the constructs that underlay responses to the TULIP survey. We believed that responses would be organized around five literacy domains: (a) PA, (b) phonics, decoding, and encoding, (c) reading fluency, (d) oral language, and (e) reading comprehension. To determine the appropriateness of this model, we compared the fit of a one-factor model in which all items loaded on a single overall factor to a five-factor model in which the items were organized around the proposed domains. The latent domain scores were allowed to covary in the five-factor CFA.

Although both the one-factor CFA (χ²(1430) = 1778.36, p < 0.001, RMSEA = 0.03, CFI = 0.96, TLI = 0.95, SRMR = 0.09) and the five-factor CFA (χ²(1420) = 1715.26, p < 0.001, RMSEA = 0.03, CFI = 0.96, TLI = 0.98, SRMR = 0.08) showed evidence of acceptable fit, a model comparison test indicated that the five-factor model fit the data significantly better than the one-factor model (Δχ²(10) = 63.10, p < 0.001). Given that both the one-factor and the five-factor models had acceptable fit statistics, we examined both the overall TULIP score and scores representing knowledge within each of the five domains in subsequent analyses.
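The model comparison follows the standard chi-square difference arithmetic for nested models; the sketch below simply reproduces that calculation from the fit statistics reported above (if robust estimation were used, a scaled difference test would be needed instead, which is an assumption we do not make here).

```python
# Chi-square difference test for nested CFA models, using the reported values.
from scipy.stats import chi2

chisq_1f, df_1f = 1778.36, 1430   # one-factor model (values from the text)
chisq_5f, df_5f = 1715.26, 1420   # five-factor model (values from the text)

delta_chisq = chisq_1f - chisq_5f          # 63.10
delta_df = df_1f - df_5f                   # 10
p_value = chi2.sf(delta_chisq, delta_df)   # survival function gives the p-value

print(f"chi-square({delta_df}) = {delta_chisq:.2f}, p = {p_value:.3g}")
```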

TULIP survey reliability

Table 2 presents the reliability of the overall scale and the scales for the five domains of knowledge as estimated by Cronbach’s α, Coefficient H, and McDonald’s ω. Cronbach’s α was calculated using SPSS version 27, whereas Coefficient H and McDonald’s ω were calculated using a spreadsheet provided in the supplemental materials of McNeish (2018). The overall scale had good reliability. Subscales representing knowledge within separate literacy domains generally had acceptable reliability, although the reliability of the Oral Language composite was notably lower than the others.

Table 2 TULIP survey internal consistency reliability
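For readers less familiar with these indices, the following sketch shows how each could be computed: Cronbach’s α from raw item scores, and McDonald’s ω and Coefficient H from standardized factor loadings. The scores and loadings are synthetic placeholders rather than the TULIP estimates, and the functions are a generic sketch rather than the SPSS or spreadsheet procedures used in the study.

```python
# Illustrative reliability calculations (placeholder data, not TULIP estimates).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: respondents x items matrix of 0/1 item scores."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def mcdonald_omega(loadings: np.ndarray) -> float:
    """Omega from standardized loadings; error variances assumed to be 1 - loading**2."""
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + (1 - loadings ** 2).sum())

def coefficient_h(loadings: np.ndarray) -> float:
    """Maximal reliability (Coefficient H) from standardized loadings."""
    ratio = (loadings ** 2 / (1 - loadings ** 2)).sum()
    return 1 / (1 + 1 / ratio)

rng = np.random.default_rng(2)
scores = rng.binomial(1, 0.5, size=(313, 12))        # placeholder 12-item subscale
loadings = np.array([0.55, 0.48, 0.62, 0.40, 0.51])  # placeholder standardized loadings

print(round(cronbach_alpha(scores), 2),
      round(mcdonald_omega(loadings), 2),
      round(coefficient_h(loadings), 2))
```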

Levels of teacher knowledge as measured by the TULIP survey

We calculated an overall TULIP composite score as the proportion correct across all items in the scale, as well as five domain composite scores calculated as the proportion correct across all items in the corresponding domain. Table 3 presents (a) the correlations among the TULIP composites and (b) the composite means and standard deviations. The overall composite score had very large correlations (rs > 0.70) with all domain composite scores, and all correlations between domain composites would be described as large (rs > 0.50; Cohen, 1988). It is worth noting that correlations with the oral language composite tended to be smaller than correlations with other domains, reflecting its lower reliability. Accuracy across all the domains was near 50%. Teachers had the lowest mean accuracy on items measuring knowledge of phonics, decoding, and encoding and the highest mean accuracy on items measuring knowledge of reading comprehension.

Table 3 TULIP survey descriptive statistics and between-composite correlations
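A sketch of this composite scoring, using a hypothetical item-to-domain mapping and synthetic responses, is shown below; it computes proportion-correct composites and their intercorrelations in the spirit of Table 3.

```python
# Illustrative composite scoring (placeholder item names and data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
domains = {"PA": 12, "Phonics": 15, "Fluency": 7, "OralLang": 9, "Comp": 12}
cols = [f"{d}_{i + 1}" for d, k in domains.items() for i in range(k)]
responses = pd.DataFrame(rng.binomial(1, 0.5, size=(313, len(cols))), columns=cols)

composites = pd.DataFrame({
    d: responses[[c for c in cols if c.startswith(d + "_")]].mean(axis=1) for d in domains
})
composites["Overall"] = responses.mean(axis=1)   # proportion correct across all items

print(composites.agg(["mean", "std"]).round(2))  # descriptives (cf. Table 3)
print(composites.corr().round(2))                # correlations among composites
```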

Relations of teacher knowledge with teacher characteristics

We examined the bivariate relations of the TULIP overall and domain scores with four teacher characteristics: education level (coded as high school degree/GED/associate degree, bachelor’s degree, or master’s degree or beyond), certification type (coded as regular, alternative, temporary, or other), position (coded as general education, special education, reading specialist, or other), and grade level (coded as K-2 or not K-2). The relation of each TULIP score with each categorical variable was tested using an ANOVA.

Because of the large number of relations we examined, we decided to use a significance level of 0.005 (see Greenwald et al., 1996). The results of our analyses revealed that teacher knowledge was not significantly related to certification type, position, or grade taught (all p values > 0.005; specifics of these tests are available in Supplemental Appendix C). Teacher knowledge overall and within each TULIP literacy domain was significantly related to education level, such that teachers with more education had higher TULIP scores. The relations of teacher knowledge and education level are summarized in Table 4.

Table 4 Relations of TULIP survey constructs with education level
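As an illustration of this analytic approach (not the study’s actual code), a one-way ANOVA comparing a TULIP score across levels of a teacher characteristic, evaluated at α = 0.005, could be run as follows with synthetic placeholder data.

```python
# Illustrative one-way ANOVA of a TULIP score by a teacher characteristic.
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "education": rng.choice(["HS/GED/associate", "bachelor's", "master's+"], size=313),
    "overall_score": rng.uniform(0.2, 0.8, size=313),   # placeholder proportion-correct scores
})

groups = [g["overall_score"].to_numpy() for _, g in df.groupby("education")]
f_stat, p_value = f_oneway(*groups)

alpha = 0.005   # adjusted significance level used in the study
print(f"F = {f_stat:.2f}, p = {p_value:.3f}, significant: {p_value < alpha}")
```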

Discussion

Reliable and valid surveys of educator reading-related knowledge permit stakeholders to design effective professional learning opportunities that build on teachers’ strengths and address areas for growth. However, relatively few psychometrically sound, science-aligned instruments measure teacher knowledge across the multiple subcomponent skills and types of knowledge that contribute to reading comprehension. To address this need, we developed the TULIP survey, an instrument that provides estimates of educators’ broad knowledge to teach reading as well as estimates of their knowledge within the domains of PA; phonics, decoding, and encoding; reading fluency; oral language; and reading comprehension. We investigated the factor structure of the TULIP survey, its reliability, and the extent to which teacher characteristics predicted knowledge.

TULIP validity and reliability

Factor analyses provided preliminary evidence that the TULIP survey serves as a valid measure of these five domains of knowledge to teach reading, as well as of knowledge to teach reading broadly. The overall scale and each of the five subscales were shown to reliably assess educators’ knowledge. It is noteworthy that internal consistency was lower within the subscale designed to measure knowledge to foster oral language development than within other subscales. For reasons that are not obvious, previous surveys measuring educator knowledge to teach reading across code- and meaning-focused domains have also reported lower reliability for subscales focused on oral language. For example, Brady et al. (2009), who measured teacher knowledge within the domains of phoneme awareness, code concepts, reading fluency, and oral language, reported low internal consistency for the oral language subscale (α = 0.07 at Time 1; α = 0.45 at Time 2).

Previous surveys that have embedded items measuring knowledge to facilitate oral language development within a broader subscale measuring knowledge to teach meaning-focused aspects of reading have tended to be more internally consistent (presumably in part because these subscales can include more items). For example, Phelps and Schilling (2004) reported Cronbach’s αs of 0.77 and 0.73, respectively, for subscales of their Content Knowledge for Teaching Reading survey that measured (a) knowledge of content to teach comprehension and (b) knowledge of teaching and content to teach comprehension. Davis et al. (2022) reported Cronbach’s α = 0.81 for the subscale of their Knowledge for Enhancing Reading Development Inventory that focused on “meaning/connected text processes” (p. 788). As a counterexample, Spear-Swerling and Zibulsky (2014) reported somewhat lower reliability (α = 0.63) for their survey subscale focused on fluency, vocabulary, and comprehension. Even given somewhat lower internal consistency, we believe there is value in being able to report adequately reliable knowledge scores for separate language comprehension domains (e.g., separating knowledge to build oral language from knowledge to facilitate reading comprehension), as is possible within the TULIP survey. Doing so enables us to provide professional learning opportunities that are more closely aligned with teachers’ needs.

Relations of teacher characteristics with teacher TULIP performance

Our finding that teachers with higher levels of education had higher TULIP scores was intuitively unsurprising, both because accumulating knowledge is usually the goal of education and because individuals with the resources to pursue higher education often have increased access to knowledge-building opportunities in addition to those that are part of their coursework. However, previous research has mostly failed to reveal consistent, statistically significant associations of teacher education with knowledge to teach reading (e.g., Davis et al., 2022; Jordan et al., 2018). The explanation for the difference between our finding and findings reported in previous research may lie in the fact that our sample had more variation in education level than was typically found in previous studies reporting nonsignificant associations (e.g., Davis et al., Form C; Jordan et al.). In the Jordan et al. study, all participants had at least a bachelor's degree; in the Davis et al. Form C study, only 2.8% of teachers had not earned a bachelor’s degree (i.e., their high-school degree was their terminal degree). In the present study and for the Davis et al. Form A and B sample (for which there was a significant association between level of education and knowledge), a larger percentage of participants had only a high-school degree (i.e., 11% and 12.8%, respectively).

Teaching certification, position, and grade level were unrelated to knowledge to teach reading as measured by the TULIP survey: Regularly certified teachers did not have more knowledge than alternatively or temporarily certified teachers, reading specialists did not have more knowledge than teachers certified in other areas, and special educators did not have more knowledge than general educators. Knowledge to teach reading was also unrelated to grade level taught. These results are counterintuitive: one might expect regularly certified teachers to have more knowledge than alternatively or temporarily certified teachers because alternative or temporary certification is usually associated with fewer hours of coursework. Similarly, one might expect reading specialists to have more knowledge to teach reading than general educators given the specialized knowledge of reading development and instruction implicit in the “reading specialist” title. Still, given null findings in previous research investigating the relations of these teacher characteristics with teacher knowledge (e.g., Cunningham et al., 2004; Pittman et al., 2020; Washburn et al., 2017), the lack of statistically significant differences in knowledge between groups based on certification, position, and grade level was not unexpected. It should be noted that some teacher subgroups were small, which has implications for our power to discover significant differences. Only five teachers in the sample of 313 had a temporary certification, and only 12 had pursued alternative certification; only six teachers in the sample were reading specialists and only 16 were special educators. There is a need for future research with increased representation for these subgroups to provide more definitive tests of the relations of certification and position with teacher knowledge.

Limitations

Although it measures knowledge to teach reading more comprehensively than most previous surveys, the TULIP survey still only considers a fraction of the knowledge teachers need to teach reading. For example, it does not address knowledge to negotiate critical literacies with young children (Vasquez, 2014), which facilitates students’ questioning of the biases implicit in authorial perspective and their understanding of the ways in which texts are never neutral. It does not measure teachers’ dispositions related to the funds of knowledge (built through participation in family, community, and culture; González et al., 2006) that students bring to the classroom, or teachers’ skill in inviting students to bring these funds of knowledge to bear when interpreting texts. It does not capture educator knowledge of culturally responsive pedagogy or teaching practices associated with reading motivation. It does not measure knowledge of reading development for students who are bidialectal or for emergent bilingual readers. Finally, it does not assess knowledge to teach within multi-tiered systems of support or to effectively use data to make instructional decisions. We believe there is a need to measure these constructs, which are tremendously important when it comes to teaching quality and student outcomes, within other survey projects. That said, we worked hard to ensure that the TULIP survey could be completed by most teachers in less than 20 min. Longer surveys tend to generate fewer responses and yield less reliable information (e.g., Deutskens et al., 2004). This makes them more difficult to use within school settings, where time is always precious.

Another limitation is that we only assessed the content validity (via literature review and expert feedback) and the construct validity (via factor analysis) of the TULIP survey. As a result, we do not know whether the TULIP demonstrates convergent validity with similar measures of teacher knowledge, or whether it predicts effective classroom reading instruction or student reading outcomes.

Perhaps the largest limitation of this study lies in the fact that the oral language subscale of the TULIP demonstrated lower reliability than what is generally considered adequate. It also did not measure the construct of knowledge to build elementary-grade students’ language comprehension as completely as it might have. Most items in the oral language subscale were devoted to knowledge to build morphological awareness: there were five items focused on morphology, two items focused on semantics/vocabulary, one item focused on syntax, and one item focused on phonology. There would be value in developing a revised version of the oral language subscale that includes additional items focused on vocabulary, narrative language and narrative macrostructure, pragmatics, and knowledge to teach bidialectal students.

Finally, although the TULIP survey included items that measured different knowledge types (e.g., content knowledge and pedagogical content knowledge; Shulman, 1987), we did not make any distinctions between these item types in our analyses. In this project, we were less interested in the theoretical distinction between different knowledge categories and more interested in the literacy domains within which we were tapping teacher knowledge. Although previous studies (e.g., Binks-Cantrell et al., 2012; Davis et al., 2022) also chose not to analyze performance on items according to item format or knowledge type (i.e., they also solely focused on literacy domain), it is a limitation that we do not provide information about the reliability and validity of, or educator level of knowledge within, knowledge-type subscales.

Implications for practice and future research

Our results indicate that the TULIP assessment can reliably assess elementary educators’ knowledge to teach reading. Although research has not determined a threshold level of reading-related knowledge that teachers need to possess to facilitate student reading success, educators’ TULIP performance demonstrated that the average teacher likely has knowledge to gain across all domains of knowledge to teach reading. In practice, school administrators can use this survey as a kind of “screener” of teacher knowledge: it can identify teachers who need support when it comes to knowledge of literacy constructs and evidence-based practices overall or within the domains of PA, phonics/decoding/encoding, reading fluency, and reading comprehension, although it cannot as reliably identify teachers who need support to build students’ oral language. Survey results, when combined with other data sources, can inform the design and provision of professional learning opportunities that build on areas of strength and address areas of weakness for individual teachers. When providing professional learning opportunities for teachers, it may be useful for administrators to know that teachers with more education (e.g., with advanced degrees) tend to have higher levels of knowledge, at least when measured by this survey. However, based on these preliminary findings, one cannot assume that teachers with regular (vs. alternative or temporary) certification have more knowledge, or that special educators or reading specialists have more knowledge than general educators.

There will be value in conducting future research to determine whether the TULIP survey demonstrates convergent validity with similar measures of teacher knowledge, or whether knowledge as measured by the TULIP survey predicts teachers’ reading instructional practices or student reading outcomes. There is also a need to determine whether particular domains of reading knowledge measured by the TULIP survey are more closely associated with teaching quality and student outcomes than other domains of knowledge.

Previous research has revealed mixed associations between teachers’ reading-related knowledge and their instructional practices. Spear-Swerling and Zibulsky (2014) found that K-5 educators’ knowledge of phonology and orthography was positively related to the amount of instructional time they chose to allocate to instruction in PA and phonics. Jordan and Bratsch-Hines (2020) found that K and Grade 1 teachers’ knowledge to teach word reading predicted their use of a “broad range of the elements of high-quality reading instruction documented as necessary for reading achievement” (p. 282). However, although Puliatte and Ehri (2018) found that elementary educators’ linguistic knowledge was significantly associated with their use of research-based instructional strategies when teaching spelling, there was no overall association between knowledge and use of research-based instructional practices. Similarly, McCutchen et al. (2002) found that K and Grade 1 teachers’ knowledge was only associated with increased use of explicit instruction during PA instruction; no associations between teacher knowledge and other aspects of practice were observed. Piasta et al. (2020) determined that Grade 1 educators’ knowledge was not associated with the amount of explicit decoding instruction they provided.

There is a need for future research to determine why this is the case. Moats (2009) and others have hypothesized that teachers must achieve a certain threshold level of knowledge to impact practice. Piasta and colleagues (2020) note that associations between knowledge and practice may not be evident within a sample that has reached this threshold of knowledge: after achieving a sufficient level of knowledge, the association between knowledge and practice may plateau, such that more knowledge may not be associated with better practice. Conversely, if an entire sample of teachers has not achieved this threshold level of knowledge, there may be little association between knowledge and practice. The association between knowledge and practice may depend on having a sample for which some teachers have knowledge below the threshold and some have knowledge above it. Future research revealing whether there is such a threshold level of knowledge needed to impact practice would be tremendously valuable.

Finally, there is a need for research considering a broader set of factors that interact with teacher knowledge to improve classroom practice and student outcomes. The amount and quality of teachers’ instruction within literacy domains depends, at least in part, on access to essential resources: a high-quality, feasible, usable curriculum; curriculum materials (e.g., whiteboards/whiteboard markers, decodable books, and libraries filled with high-interest, non-decodable books for students who have moved beyond decodable books); administrative support; access to professional development and coaching; time to implement reading instruction; and manageable class sizes (Cohen et al., 2003). Educator knowledge to teach reading is likely a necessary but insufficient condition for implementing evidence-based reading instruction.

Conclusion

The TULIP survey serves as a valid and reliable measure of broad knowledge to teach reading and of knowledge within each of five sub-domains: PA; phonics, decoding, and encoding; reading fluency; oral language; and reading comprehension. The overall TULIP scale had good reliability, and subscales representing knowledge within specific literacy domains had acceptable reliability (with the oral language subscale having lower reliability than the other four subscales). Although it does not measure all the factors necessary for improving classroom practice and student outcomes, the TULIP survey permits stakeholders to assess a small but critical piece of the puzzle that is high-quality reading instruction. TULIP results can inform the design and provision of customized professional learning opportunities that build on teachers’ areas of strength and address areas of weakness.