1 Factor Structure and Uses of the Grit Scale

In the original research on grit, the latent construct is reported to comprise two elements: perseverance of effort and consistency of interest (Duckworth et al. 2007). Grit has been shown to predict success and retention in a variety of contexts, such as the national spelling bee, the military, the workplace, school, and marriage (Duckworth et al. 2011; Eskreis-Winkler et al. 2014; Strayhorn 2014). At the same time, first-generation college students (FGCSs) are less likely to complete a 4-year degree than their continuing-generation peers (Chen and Carroll 2005; D’Amico and Dika 2013; Engle and Tinto 2008; Vaughan et al. 2014). Therefore, the measurement of grit among this population has the potential to uncover underlying issues that affect FGCS retention, which could lead to improved interventions and supports. Additionally, a recent meta-analytic study states that “the grit literature may benefit from a refinement of the Grit Scale using methods based on Item Response Theory (IRT)” (Credé et al. 2016). In this chapter, we examine the Grit Scale (Duckworth et al. 2007; Duckworth and Quinn 2009) when used among FGCSs, a population as yet unstudied in conjunction with grit, compared to non-FGCSs, and offer an IRT analysis useful for the emergent area of research into refinements of the Grit Scale.

1.1 Predictive Validity of Grit in the Extant Literature

Previous research has provided evidence of the predictive validity of the Grit Scale for metacognition (Arslan et al. 2013), the retention of first-year military cadets (Maddi et al. 2012), educational attainment among adults, and grade point average (GPA) among Ivy League undergraduates (Duckworth et al. 2007). However, these populations and scenarios differ in important ways from FGCSs pursuing a baccalaureate degree at a large, public research university. For example, the distribution and variance of undergraduate GPAs among students at Ivy League universities are likely to differ from those of FGCSs at a public university, as the students studied by Duckworth et al. (2007) are positioned in the most advantageous university settings, with student populations largely drawn from the most advantaged high school students in the United States. Similarly, Arslan et al. (2013) studied grit and metacognition among college students, yet their research is contextually different in that it focuses on Turkish university students, a group likely to have nontrivially different cultural norms than the average FGCS at a public university in the United States. Other studies of grit, such as those on the retention of military cadets and spelling bee champions, present radically different contexts from that of FGCSs completing a baccalaureate degree. Further, while the previous studies mentioned here demonstrate successful use of the Grit Scale, none offers an IRT analysis that can more deeply examine the psychometric properties of the scale.

1.2 Factor Structure of the Grit Scale

In line with Credé et al. (2016), who call for IRT analyses of grit, the research presented here offers insight into the existing scholarly disagreement over grit as a latent construct: whether it is substantively different from conscientiousness, whether it has incremental validity, and even what its factor structure is. For example, Duckworth and Quinn (2009) used confirmatory factor analysis (CFA) to determine that grit has a higher-order structure with two first-order factors and one second-order factor. Yet this finding is tempered by the fact that, using CFA, a higher-order model would exhibit identical fit to a model using two correlated first-order factors and no higher-order factor, making the analysis of limited utility (Credé et al. 2016). In fact, such a factor structure was examined by Duckworth et al. (2007) and found to have poor fit based on the comparative fit index (0.83) and root mean square error of approximation (RMSEA) (0.11). Credé et al. (2016) suggest that a more meaningful way to assess the factor structure would be to examine the correlation between the two theoretical components of grit, perseverance of effort and consistency of interest. However, the empirical estimates of this correlation summarized in the meta-analysis by Credé et al. (2016) vary widely, with the correlation dropping as low as zero in some empirical studies (Chang 2014; Datu et al. 2016; Jordan et al. 2015). IRT analysis offers insights into the factor structure of the scale that can shed light on the contradictory findings from previous CFA analyses.

1.3 Grit and FGCSs

Despite substantial debate over grit as an important noncognitive factor related to student success, grit has become an important buzzword in education as both an explanation for student achievement and as an intervention (Anderson et al. 2016). As the role of grit in education gains popular attention, it is important to understand if the measurement of grit is both reliable and valid among populations whose retention has historically been at greater risk. This research seeks to assess the validity and reliability of the Grit Scale among one such group—FGCSs—and to answer the overarching research question: What are the psychometric properties of the Grit Scale when used among FGCSs? To answer this question, we examine the reliability and factor structure of the scale and test for local dependence and differential item functioning.

2 Methods

IRT analysis was used to examine the psychometric properties of the items on the Grit Scale. IRT also allows for the assessment of local dependence between items. Analysis of DIF through logistic regression was used to test the validity of the scale for FGCSs and non-FGCSs, as well as by race/ethnicity and gender.

2.1 Data

A total of 648 participants completed a version of the Grit Scale. The sample consisted of 190 undergraduates who completed a survey at the end of their first year of college that contained the 12-item Grit Scale (Duckworth et al. 2007) in Spring 2015. An additional 458 recent graduates completed a survey in Fall 2015 that contained 9 items taken from the 12-item Grit Scale.

2.1.1 Differences in Scale Administration

The 9-item scale is a subset of the 12 items from the published 12-item scale (Duckworth et al. 2007). However, the 9-item scale used response options “very much like me” (1) to “not like me at all” (5), while the 12-item scale used response options “strongly disagree” (1) to “strongly agree” (5). In addition to the different response wording, responses also differed in direction. The 12-item scale contained a neutral option “neither disagree nor agree,” although the 9-item scale did not. Lastly, different prompts were used between the two administrations. The 9-item scale was preceded by the following prompt: “Please indicate how true the following statements are for you. Rate each statement.” The 12-item scale was preceded by the following prompt: “Here are a number of statements that may or may not apply to you. For the most accurate score, when responding think of how you compare to most people—not just the people you know well, but most people in the world. There are no right or wrong answers, so just answer honestly!” Relevant items were reverse coded so that higher scores reflect more grit across both administrations.
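The reverse-coding step can be illustrated with a minimal sketch, assuming responses are stored as integers from 1 to 5. The function name and the example responses are hypothetical; which items were reverse coded is not specified in this section.

```python
def reverse_code(response, n_categories=5):
    """Mirror a response on a 1..n_categories scale (1 <-> 5, 2 <-> 4, ...)."""
    if not 1 <= response <= n_categories:
        raise ValueError("response outside the scale range")
    return n_categories + 1 - response


# Hypothetical raw responses where 1 = "very much like me" (assumed coding).
raw = [1, 2, 5, 4]
recoded = [reverse_code(r) for r in raw]
```

After recoding, a higher number reflects more grit regardless of which response direction the administration used.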

2.1.2 Descriptive Statistics

Of those who answered at least one Grit Scale item, <1% (four students) skipped one or more items. For the IRT analyses, only participants who did not respond to any Grit Scale item were removed from the analyses. For DIF analyses, participants with any missingness on the Grit Scale were listwise deleted. The scale items, along with associated sample size and mean scores, are given in Table 1. Differences in sample sizes are due to the combination of different surveys as explained previously.

Table 1 Combined scale sample item means

2.1.3 Demographics

Of the sample who completed at least one Grit Scale item, 155 FGCSs completed the 12-item scale, 182 FGCS recent graduates completed the 9-item scale, and 254 non-FGCS recent graduates completed the 9-item scale. Participants included 189 men and 402 women. There were 361 non-Hispanic, White participants compared to 234 other races and ethnicities. When disaggregated by FGCS status, there were 333 FGCSs compared to 268 non-FGCSs.

2.2 Analytic Strategy

2.2.1 IRT

Structural validity was evaluated with a factor analytic approach using IRT, implemented in the software IRTPRO (Cai et al. 2011). Each subscale of the Grit Scale was examined in a unidimensional confirmatory factor analysis (CFA) with the graded response IRT model (Samejima 2010), as well as within a bifactor graded response IRT model, to examine whether one underlying factor explained most of the variability in the two subscales. For the bifactor model, the explained common variance (ECV) of the general factor was examined, and the scale was considered unidimensional for values greater than 0.85. Local independence was assessed by inspecting item content and, as the statistical criterion, with the LD χ² statistic proposed by Chen and Thissen (1997). Values larger than 10 are considered evidence of local dependence, and values between 5 and 10 may indicate either local dependence or sparseness in the underlying table of frequencies.
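As a rough illustration of the ECV criterion, the sketch below computes the share of common variance attributable to the general factor from standardized bifactor loadings. The loadings are hypothetical and are not estimates from this study; the function name is also an assumption.

```python
def explained_common_variance(general_loadings, specific_loadings):
    """ECV = sum of squared general-factor loadings divided by the sum of
    squared loadings on the general and specific factors combined."""
    general = sum(l ** 2 for l in general_loadings)
    specific = sum(l ** 2 for l in specific_loadings)
    return general / (general + specific)


# Hypothetical standardized loadings for four items (not from Table 3).
general = [0.7, 0.6, 0.5, 0.4]
specific = [0.3, 0.4, 0.5, 0.6]

ecv = explained_common_variance(general, specific)
treat_as_unidimensional = ecv > 0.85  # threshold stated in the text
```

With these made-up loadings the general factor accounts for well under 0.85 of the common variance, so the unidimensionality criterion would not be met.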

Item fit was assessed based on the SS-χ² fit statistics proposed by Orlando and Thissen (2000, 2003), for which a nonsignificant result (p > 0.05, adjusted for multiple comparisons) was an indicator of adequate model fit. Model fit was determined piecemeal through item fit, as well as the M₂ statistic proposed by Maydeu-Olivares and Joe (2005, 2006) and its associated RMSEA. The −2 log likelihood, the Akaike information criterion (AIC) (Akaike 1974), and the Bayesian information criterion (BIC) (Schwarz 1978) were also examined.
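For readers less familiar with the information criteria, the following sketch shows how AIC and BIC are obtained from a model's −2 log likelihood. The input values are hypothetical, and IRT software typically reports these quantities directly.

```python
import math


def aic(neg2_loglik, n_params):
    """Akaike information criterion: -2LL + 2k."""
    return neg2_loglik + 2 * n_params


def bic(neg2_loglik, n_params, n_obs):
    """Bayesian information criterion: -2LL + k * ln(n)."""
    return neg2_loglik + n_params * math.log(n_obs)


# Hypothetical fit of one model: -2LL of 1000 with 20 free parameters,
# evaluated at the total sample size reported in Sect. 2.1.
example_aic = aic(1000.0, 20)
example_bic = bic(1000.0, 20, 648)
```

Lower values indicate a better trade-off between fit and complexity when competing models are fit to the same data; BIC penalizes additional parameters more heavily than AIC at this sample size.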

2.2.2 DIF

Analysis of potential DIF by FGCS status, gender, and race/ethnicity was conducted using ordinal logistic regression (OLR) of summed scores. For each item within a domain, an OLR model was used to examine whether item responses were significantly associated with group membership after controlling for students’ summed score on the measure. Uniform DIF was detected by a likelihood ratio test comparing an OLR model with one predictor, summed score, to an OLR model with an additional predictor, group membership, representing a shift in the use of the response options due to group membership. Nonuniform DIF was detected by a likelihood ratio test comparing the OLR model with two predictors, summed score and group membership, to an OLR model with an additional interaction term, representing a difference in how strongly the item is related to the underlying construct due to group membership. With each paired-group analysis, an initial OLR model was run to identify a clean anchor group of items without DIF. For each sequential OLR model, any items previously identified as having DIF were removed from the summed score computation, so that the final OLR model tested for DIF using a summed score computed with only the DIF-free anchor items. The Benjamini-Hochberg procedure was used to make inferential decisions in the context of the multiple comparisons. In addition to statistical significance (p < 0.05), the magnitude of DIF was evaluated by examining the expected item scores and estimating effect sizes (ΔR² > 0.02 indicative of salient DIF).
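The two inferential steps described above can be sketched as follows, assuming the log likelihoods of the nested OLR models have already been obtained from fitted models (the fitting itself is not shown). The sketch assumes a two-group comparison, so each nested comparison adds one parameter; all function names and numbers are hypothetical.

```python
import math


def lr_test_df1(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter.
    For 1 degree of freedom the chi-square tail probability is
    erfc(sqrt(x / 2))."""
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(p_values, q=0.05):
    """Return one flag per p-value: True if rejected at false discovery
    rate q under the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank meeting the bound
    flagged = [False] * m
    for rank, idx in enumerate(order, start=1):
        flagged[idx] = rank <= cutoff_rank
    return flagged


# Hypothetical log likelihoods: summed score only vs. summed score + group.
stat, p_uniform = lr_test_df1(-100.0, -98.08)

# Hypothetical p-values for four items tested for uniform DIF.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

The same `lr_test_df1` call would apply to the nonuniform DIF comparison, with the two-predictor model as the reduced model and the interaction model as the full model.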

2.2.3 Reliability

Internal consistency was evaluated by Cronbach’s α for both versions of the scale, as well as for both subscales within each version using the software Mplus (Muthén and Muthén 1998). Alpha values of 0.70 or greater are an acceptable minimum for group-level assessment.
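For reference, Cronbach’s α can be computed from a complete respondent-by-item score matrix as in the minimal sketch below. This is not the Mplus procedure used in the study, and the data shown are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent, with no missing values.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from three respondents to two items.
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
meets_minimum = alpha >= 0.70  # threshold stated in the text
```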

2.2.4 Convergent and Discriminant Validity

Participants who completed the Grit Scale also completed the Growth Mindset Scale (Dweck 2008) and the 5-item Guilt Proneness Scale (Cohen et al. 2011; Cohen et al. 2014). To assess convergent validity, for both grit subscales, the correlation between the mean item score and the mean item score of the Growth Mindset Scale was computed. To assess discriminant validity, for both grit subscales, the correlation between the mean item score and the mean item score of the Guilt Proneness Scale was computed. Growth Mindset response items ranged from “strongly disagree” (1) to “strongly agree” (5) and were scored such that higher mean scores reflect more growth mindset. Guilt Proneness response items ranged from “extremely unlikely” (1) to “extremely likely” (5) and were scored such that higher mean scores reflect more guilt proneness.
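The correlations described here can be sketched as a Pearson correlation between per-participant mean item scores. The sketch assumes Pearson's r was the coefficient used, which the text does not state, and all data and names are hypothetical.

```python
import math
from statistics import mean


def mean_item_score(responses):
    """Mean of the items a participant answered (None marks a skipped item)."""
    answered = [r for r in responses if r is not None]
    return mean(answered)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical item responses for three participants on two scales.
effort = [mean_item_score(r) for r in [[4, 5, 4], [3, 3, None], [2, 1, 3]]]
mindset = [mean_item_score(r) for r in [[5, 4], [3, 4], [2, 2]]]
r_effort_mindset = pearson_r(effort, mindset)
```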

2.2.5 Known Groups Validity

The validity of the Grit Scale was examined by assessing the extent to which it could discriminate between several known groups that should, in theory, differ. These groups included FGCSs who were current students, FGCS recent graduates, non-FGCS recent graduates, race and ethnicity, race and ethnicity interacted with gender, and participants grouped by their reported use of university resources. The use of university resources was measured by a list of resources and the frequency with which students used them while in college, with responses ranging from “never” (1) to “ten or more times” (6). We compared means across all groups using a one-way analysis of variance (ANOVA). Statistical significance was defined at the 0.05 alpha level for evaluation of convergent, discriminant, and known groups validity.
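The one-way ANOVA comparison can be sketched as the ratio of between-group to within-group mean squares. The group scores below are hypothetical, and obtaining a p-value would additionally require the F distribution from a statistics library, which is omitted here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical subscale scores for three known groups.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```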

3 Results

3.1 Item Response Theory

Two items on the perseverance of effort subscale had low cell counts in the extreme categories, and thus for those two items, categories were collapsed for IRT analysis. The factor structure was further examined by estimating the graded response model (Samejima 2010) on the item scores for each grit subscale, and then modeling all scores using a bifactor model. Table 2 shows the fit of the graded response IRT models estimated for the Grit Scale. The bifactor model fit well, although one item had a high factor loading (0.73) on the overall factor and a low loading (0.25) on the perseverance of effort subscale. The ECVs for this model were 0.45 for consistency of interest and 0.63 for perseverance of effort, suggesting these two subscales do not support one underlying factor. Therefore, unidimensional IRT models were fit to each of the Grit Scale subscales, including both a 5-item and a 4-item (dropping the problematic item identified in the bifactor model) version of the perseverance of effort subscale. As Table 2 indicates, these models also fit the data well.

Table 2 Comparison of IRT graded response models

The IRT parameters of the bifactor model as well as the final two unidimensional models for the Grit Scale subscales are presented in Table 3. For the consistency of interest subscale, all items have high IRT a parameters; the item “I have been obsessed with a certain idea or project for a short time but later lost interest” has the strongest relationship with the underlying construct. For the perseverance of effort subscale, the items also have strong IRT a parameters. However, the item “I am a hard worker” has a slightly weaker relationship to the underlying construct, although it also measures the lower end of the Grit Scale (b₁ = −5.16).

Table 3 IRT parameters
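To show how the a and b parameters translate into response probabilities, the sketch below evaluates the graded response model for a single item. Only the first threshold (b₁ = −5.16) comes from the text; the discrimination value and the remaining thresholds are hypothetical placeholders, not the estimates in Table 3.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Graded response model: P(X >= k) = 1 / (1 + exp(-a * (theta - b_k))),
    and each category probability is the difference between adjacent
    cumulative probabilities. thresholds must be in increasing order."""
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# a = 1.0 and all thresholds after the first are assumed for illustration.
probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-5.16, -2.0, 0.0, 1.5])
```

With a first threshold this low, a respondent at the mean of the latent trait has a very small probability of choosing the lowest category, which is why such an item is informative mainly at the low end of the scale.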

3.1.1 Overall Test Information Curves

Figure 1 shows the overall test information curve and standard error for the consistency of interest subscale. This figure indicates that test information is high across the range of consistency of interest, providing adequate measurement from 3 SDs below to 3 SDs above the mean. Figure 2 shows the same information for the 4-item version of the perseverance of effort subscale, after dropping the poorly fitting item “I finish whatever I begin.” However, test information for the perseverance of effort subscale is high only at the lower end of the scale, indicating that the subscale measures perseverance of effort well for those with levels at or below 1 SD above the mean.

Fig. 1 Total information curve for consistency of interest subscale

Fig. 2 Total information curve for perseverance of effort subscale
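The relationship between test information and the standard error plotted in these figures can be sketched as follows: under IRT, the standard error of the trait estimate at a given level is the reciprocal of the square root of the test information at that level. The information values below are hypothetical and are not read from the figures.

```python
import math


def standard_error(information):
    """Standard error of the trait estimate: 1 / sqrt(test information)."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


# Hypothetical test information at three points on the latent scale (in SDs).
info_by_theta = {-2.0: 4.0, 0.0: 6.25, 2.0: 1.0}
se_by_theta = {theta: standard_error(i) for theta, i in info_by_theta.items()}
```

Where information falls, as in the upper range of the made-up values here, the standard error grows and measurement becomes less precise.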

3.2 Differential Item Functioning

Analysis of potential DIF resulted in only one item being flagged. The item, “I have overcome setbacks to conquer an important challenge,” had significant uniform DIF (p.03) with a large effect size for FGCSs (n = 333) and non-FGCSs (n = 268). Figure 3 displays the item means by summed score for each of these groups. FGCSs have slightly higher item means across the score range, indicating they are slightly more likely to endorse this item than non-FGCSs, resulting in a higher overall score on the Grit Scale.

Fig. 3 DIF between FGCSs and non-FGCSs

3.3 Reliability

Cronbach’s alpha was 0.72 for the 12-item scale and 0.65 for the 9-item scale, so only the 12-item version meets the 0.70 minimum adopted in Sect. 2.2.3. The perseverance of effort subscale had lower alphas (0.60, 0.57) in both administered versions of the scale than did the consistency of interest subscale (0.69 in both versions).

3.4 Convergent and Discriminant Validity

The mean item score on growth mindset was 3.55 (s.d. 0.84). It was significantly correlated with perseverance of effort (r = 0.20, p = 0.001), but not with consistency of interest (r = 0.01, p = 0.73). This indicates grit and growth mindset are moderately related; however, it also provides evidence of the difference between the two Grit Scale subscales. The mean item score on guilt proneness was 4.12 (s.d. 0.74) and was significantly correlated with both subscales. Perseverance of effort correlated at r = 0.19 (p < 0.001) and consistency of interest correlated at r = 0.12 (p = 0.002).

3.5 Known Groups Validity

The results from the known groups analysis are given in Table 4. Of the groups analyzed (see Sect. 2.2.5), only two showed statistically significant differences. On the perseverance of effort subscale, men showed less perseverance of effort than women (mean = 3.85 vs. 3.97). In consistency of interest, the interaction of gender, race/ethnicity, and FGCS status resulted in statistically significant differences. Women who were White, non-Hispanic, and not FGCSs scored the highest in consistency of interest (mean = 3.13), while men who were non-White and FGCSs scored the lowest (mean = 2.80). Across all gender and race/ethnicity groupings, non-FGCSs (means range from 2.94 to 3.13) scored higher in consistency of interest than FGCSs (means range from 2.80 to 2.92).

Table 4 ANOVA results of known groups

4 Discussion

The IRT analysis suggests that the 4-item consistency of interest subscale fits well with no local dependence, as does the 4-item perseverance of effort subscale after dropping the poorly fitting item “I finish whatever I begin.” The item “I have overcome setbacks to conquer an important challenge” exhibits uniform DIF between FGCSs and non-FGCSs with a large effect size. Consistent with previous literature (see Credé et al. 2016 for a comprehensive overview), our findings indicate that the higher-order factor structure suggested by Duckworth and Quinn (2009) is not supported. The IRT analysis provided here demonstrates that two unidimensional subscales fit better than the bifactor model, and the factor loadings suggest that there is little evidence of a higher-order construct in the data analyzed in this research. Given the discrepancies in previous research about the factor structure of the Grit Scale, along with the recent observation that IRT analysis is needed, this research contributes the important finding that IRT analysis, conducted using these data, does not support a factor structure wherein consistency of interest and perseverance of effort load onto the higher-order construct grit.