Executive functioning (EF) is one of the most important constructs in understanding individuals’ academic performance. Executive functions (EFs) are a group of higher-level cognitive functions that allow individuals to initiate, maintain, monitor, adjust, and complete goal-directed actions (Dawson and Guare 2010; Dempster 1992; Lezak 1995; Miyake et al. 2000). Major EFs identified from cognitive research include working memory (the ability to mentally hold and manipulate information, often while performing another task and/or dealing with distractions), inhibition (the ability to suppress prepotent responses), and mental flexibility (the ability to switch between tasks or rules or sets; Chan et al. 2008; Diamond 2014; Engle 2002; Miyake et al. 2000; Miyake and Friedman 2012). These core functions facilitate higher-level processes such as problem-solving, reasoning, and decision-making (Collins et al. 2012; Lunt et al. 2012) and predict academic performance in students from PreK to college (Baars et al. 2015; Best et al. 2011).

In college students, for example, EF predicts academic performance above and beyond high school grades and standardized test scores (Credé and Kuncel 2008). EF deficits are also correlated with symptoms and behaviors that negatively impact academic performance, such as anxiety, stress, depression, adjustment problems, and procrastination (Petersen et al. 2006; Rabin et al. 2011; Wingo et al. 2013). The robust relation between EF and academic success has triggered a recent proliferation of promising interventions (e.g., executive coaching; Dawson and Guare 2012) designed to improve students’ EF skills. However, it is unclear how best to plan, tailor, and measure the effectiveness of these interventions.

Measurement of EF

EF can be measured using direct performance tests or rating scales. The bulk of previous EF research has been conducted using direct performance tasks or neuropsychological tests, such as remembering and manipulating digits (digit span; e.g., Diamond 2014). However, researchers have increasingly suggested that EF behavior rating scales are more ecologically valid than direct performance tests (Barkley 2012; Barkley 1991; Dawson and Guare 2010; Dehn 2008; Isquith et al. 2013; Samuels et al. 2016; Toplak et al. 2013). Furthermore, direct tests of EF and EF rating scales are weakly correlated (e.g., r = .19)—indicating that these methods measure different aspects of EF (Toplak et al. 2013). Neuropsychological tests are administered in highly controlled settings and are not representative of how individuals use EFs in their daily lives, whereas rating scales measure EF behaviors that occur in natural environments across time. Additionally, rating scales offer increased efficiency, accessibility, and convenience, especially given advances in web-based administration and scoring platforms. This is particularly important when using measures with large groups to inform interventions and at multiple time points to evaluate interventions.

Currently Available EF and Related Scales

Currently available adult EF rating scales either have high efficiency and accessibility but poor technical adequacy (e.g., Adult Executive Functioning Inventory or ADEXI; Holst and Thorell 2018), or strong technical adequacy but high financial and time costs for training, administration, and scoring. The latter are designed to detect clinical disorders as part of individual diagnostic evaluations rather than to inform and track intervention effectiveness (e.g., the Behavior Rating Inventory of Executive Function-Adult or BRIEF-A, Roth 2005; and the Barkley Deficits in Executive Function Scale or BDEFS, Barkley 2011). There are also study strategies scales specifically designed for college students that include EF (e.g., Learning and Study Strategies Inventory or LASSI, Weinstein et al. 1987); however, these scales tend to blur boundaries between EF and related, but non-synonymous, constructs, such as study skills, learning preferences, and psychological symptoms (Credé and Kuncel 2008). Given the importance of EF for academic success and the limitations of available adult EF rating scales, practitioners would benefit from access to reliable and valid, non-clinical, affordable, efficient, and academically focused measures of a range of specific EF-related behaviors. This study is the first step in providing an EF scale that meets these needs.

The Current Study

The current study describes the refinement and preliminary psychometric evaluation of the self-report Executive Skills Questionnaire-Revised (ESQ-R) rating scale, a substantial revision of the informal checklists (all called Executive Skills Questionnaires, ESQs) previously published in a series of popular and widely available books for educational support professionals that offer a variety of interventions for different EF skill areas (Dawson and Guare 2010, 2012, 2018). We selected items from the original ESQ versions, generated and refined additional items, used factor analyses to reduce the item pool, and evaluated the resulting ESQ-R version for preliminary reliability and validity evidence.

Method

Participants

We recruited 410 participants enrolled at a regional public university as undergraduate, graduate, or post-baccalaureate/non-degree-seeking students through the university Psychology Participant Pool and campus-wide advertisements. Informed consent and, when appropriate for age, parental consent and child assent were obtained for all individual participants included in the study in procedures approved by the university IRB (Committee for the Protection of Human Subjects or CPHS). (There were two 17-year-olds in the current sample, both of whom participated through the Participant Pool, which requires parent/guardian consent to enroll in the Pool and participate in approved research).

Of the 410 participants who consented to the study, 36 (8.8% of the original sample) were removed due to incomplete data (i.e., participants who started the questionnaire but did not complete it). Most non-completers exited the survey early, after reading the consent form and directions but before completing the ESQ or other measures. (On average, non-completers completed 13.94% of the entire questionnaire, which included the consent form and instructions.) Comparisons between completers and non-completers are not available because demographic and other individual data were gathered as the final items in the questionnaire to reduce stereotype threat and other demographic-related impacts on participant responses.

The total sample with completed ESQ-Rs included 374 participants. Demographic characteristics appear in Table 1. Mean age of participants was 26.28 years (range = 17–55, SD = 7.61, with 70% between ages 17 and 27). Mean GPA was 3.25 (range = 0.0–4.0, SD = 0.61, with 65% between 2.7 and 3.7). When compared to the demographic composition of the USA according to the 2010 census, women and Hispanic participants were overrepresented, while men were underrepresented. When compared to the university student body at the institution from which participants were recruited, women and White participants were overrepresented.

Table 1 Demographic characteristics (N = 374)

Procedures

Data collection was part of a larger series of studies and occurred in three waves over a 1-year period. Therefore, sample sizes differ slightly for the various measures that were included in consecutive versions of the study questionnaire. For all waves, participants received a link to an online questionnaire administered through Qualtrics and completed the measures on their own devices. The questionnaire included the online consent form, followed by the measures described below presented in blocks by topic area (e.g., EF, psychological symptoms, etc.) and randomized within each block, followed by demographic questions. (Participants under 18 years old were also required to obtain parental consent prior to registering for research participation through the university psychology subject pool.) To obtain preliminary test-retest estimates, a subset of 38 participants took the measures twice. Average time between administrations was 100 days (SD = 72), as most participants took the questionnaire once in the Fall semester and again in Spring (see “Results”). All participants earned course credit for participation.

Measures

The Executive Skills Questionnaire-Revised (ESQ-R)

The ESQ-R self-report rating scale integrates current scientific understanding of core EFs (Chan et al. 2008; Diamond 2014; Miyake et al. 2000; Miyake and Friedman 2012) with an ecologically valid understanding of EF that is directly applicable to academic contexts and tasks and directly tied to available EF interventions. It represents a substantial revision of Dawson and Guare’s various ESQ versions (Dawson and Guare 2010, 2012, 2018), based on the psychometric and expert review procedures described in the current manuscript.

In the original ESQ versions and popular books, Dawson and Guare conceptualize EF as “executive skills” (ESs; Dawson and Guare 2010, 2012, 2018). This term highlights the malleability of “skills,” as opposed to the traditional conceptualization of EF as “abilities,” which implies inherent or stable competencies that cannot be improved through intervention. The conceptualization of EF as skills also encompasses broader academic and behavioral manifestations of the major EFs than traditional laboratory EF tasks. These skills are observed when individuals apply core EFs to real-world academic tasks, such as studying for tests, planning large projects, and paying attention in class. The 11 ES areas included in the original ESQ versions and on the ESQ-R are planning/prioritization (P), organization (O), time management (TM), working memory (WM), metacognition (M), response inhibition (RI), emotional control (EC), sustained attention (SA), task initiation (TI), flexibility (F), and goal-directed persistence (GDP). The ESQ-R includes items designed to measure this broad range of ESs to emphasize academic applicability and strengthen the link to intervention.

The ESQ-R directions state, “Read each item and decide how often it’s a problem for you.” We changed the original ESQ response scales, which ranged from Strongly Disagree (1) to Strongly Agree (7) on the adult version (Dawson and Guare 2012) and Big Problem (1) to No Problem (5) on the earlier child/adolescent student version (Dawson and Guare 2010), to a frequency-based response scale that includes the following options: Never or Rarely (0), Sometimes (1), Often (2), and Very Often (3). This frequency-based response scaling method better reflects an attempt to measure the quantity of constructs (DeVellis 2017), enhances sensitivity to change over short periods of time (Fok and Henry 2015), and is similar to the response scaling on some of the most well-validated behavior self-report scales (e.g., the Behavior Assessment System for Children, Third Edition; Reynolds and Kamphaus 2015). Items describe difficulties with ESs. Scores for each item range from 0 to 3, with higher scores indicating more ES problems. The order of items is randomized for each participant to minimize order effects.

The original ESQ checklists include slightly different versions for different age groups (Dawson and Guare 2010, 2012, 2018). We selected the items from the versions geared toward older students that were most relevant to academic tasks and most representative of the current scientific understanding of EF (Chan et al. 2008; Diamond 2014; Miyake et al. 2000; Miyake and Friedman 2012). After selecting items from the original ESQs to represent each of the 11 skill areas proposed by Dawson and Guare, we refined item wording using guidelines from survey and scale development literatures (e.g., eliminating compound and confusing items, rewording negatively worded items, removing specific examples to broaden applicability, etc.; DeVellis 2017; Fok and Henry 2015; Holmbeck and Devine 2009; Visser et al. 2000).

An initial 32-item pool was reviewed by one of the authors of the original ESQ versions and the books in which they appear. This content expert is a doctoral-level licensed psychologist and school psychologist with extensive training, clinical expertise, and publications and presentations focused on ESs. We pilot tested the initial 32 items with 30 college students (90% undergraduate; 67% women, 30% men, and 3% prefer not to say; 47% White, 10% Black or African-American, 10% Asian, 3% American Indian or Alaskan Native, 3% prefer not to say; 27% Hispanic, Latino, or Spanish origin). For the 32-item pilot version, internal consistency was good (Cronbach’s alpha = 0.95); however, expert review indicated that some ES areas were underrepresented. Applying the Spearman-Brown Prophecy Formula suggested that increasing the number of items to 60 would increase internal consistency reliability to 0.97. With consultation from the ES expert, we created new items to reflect ESs in underrepresented areas, such that each ES had a minimum of four items, until the development team and the expert agreed that all ESs received adequate coverage. This resulted in an expanded candidate item pool of 61 items, which was administered to study sample participants and used in the factor analyses described below. The 61 candidate items were distributed across the ES areas hypothesized by Dawson and Guare (2010) as follows: five planning/prioritization (P), five organization (O), six time management (TM), five working memory (WM), six metacognition (M), five response inhibition (RI), six emotional control (EC), seven sustained attention (SA), four task initiation (TI), six flexibility (F), and six goal-directed persistence (GDP). Internal consistency for the total sample (n = 374) for the 61-item version was good (Cronbach’s alpha = 0.96). Test-retest reliability for the subsample who took the measures twice (n = 38) was r = .74 for all 61 candidate items.
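
As a worked check of the Spearman-Brown projection mentioned above (a standard psychometric identity, applied here with the pilot alpha of .95 and a lengthening factor of k = 60/32):

$$\rho_{\text{new}} = \frac{k\,\rho_{\text{old}}}{1 + (k - 1)\,\rho_{\text{old}}} = \frac{1.875 \times .95}{1 + .875 \times .95} \approx .97$$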

Convergent Validity Measures

We administered two additional nonclinical, self-report EF scales that are also widely available, efficient, and suitable for older students. The 14-item Adult Executive Functioning Inventory, Self-Report form (ADEXI; Holst and Thorell 2018) measures skills and behaviors related to working memory and inhibitory control. Participants use a five-point scale (e.g., 1 = Not True and 5 = Definitely True) to respond to statements by indicating how well each one “describes [them] as a person.” Higher scores indicate more difficulties. Holst and Thorell reported a two-factor solution for the ADEXI, with the working memory and inhibitory control factors highly correlated (r = .69). Reported internal consistency was .91 for the whole sample and .89 for the nonclinical sample, and reported test-retest reliability was .71 (bivariate correlation) and .67 (intra-class correlation). For the current sample (n = 374), the ADEXI showed adequate internal consistency (Cronbach’s alpha = .90) but poor test-retest reliability (r = .52, n = 38).

We also used eight EF-relevant items from the Current Behavior Scale (CBS; see Biederman et al. 2008), with modified instructions (past 2 weeks instead of past 6 months, to emphasize current behaviors). This informal measure has been published only as part of a research manuscript, but it was authored by the developer of the BDEFS (Barkley 2011) and is similar to that scale. For the current sample (n = 374), the CBS showed adequate internal consistency (Cronbach’s alpha = .91) but low test-retest reliability (r = .52, n = 38).

Discriminant Validity Measures

Psychological difficulties such as depression, anxiety, and stress are known to negatively impact EF (Ajilchi and Nejati 2017; Wingo et al. 2013) but are not synonymous with EF difficulties. The Generalized Anxiety Disorder 7-item scale (GAD-7; Spitzer et al. 2006) is a global measure of anxiety symptoms over the past 2 weeks. Participants use a four-point scale (e.g., 0 = Not at all sure and 3 = Nearly every day) to rate how often they experience specific anxiety symptoms. Higher scores indicate higher levels of anxiety. The GAD-7 has reported internal consistency of .92 (Cronbach’s alpha) and test-retest reliability of .83 (intra-class correlation). In the current sample, internal consistency was alpha = .91 (n = 364) and test-retest reliability was r = .79 (n = 38).

The 10-item Perceived Stress Scale (PSS; Cohen et al. 1983) assesses global perceived situational stress levels. Participants use a five-point scale (e.g., 0 = Never and 4 = Very Often) to indicate how often they experienced various feelings and thoughts in the past month. Four items are positively worded (and reverse scored), and the other six are negatively worded (i.e., ask about problems). Higher scores indicate higher levels of stress. For the current sample, internal consistency was alpha = .79 (n = 374) and test-retest reliability was r = .58 (n = 38).

The 21-item Depression Anxiety Stress Scales (DASS-21; Lovibond and Lovibond 1995) asks participants to indicate how much given statements applied to them over the past week. There are seven items for each subscale: Depression, Anxiety, and Stress. Participants rate items on a four-point scale (e.g., 0 = Did not apply to me at all and 3 = Applied to me most of the time). All items are worded in terms of problems, such that higher scores indicate more symptoms. Research with a large non-clinical adult sample indicated good internal consistency (Cronbach’s alpha = .91, .80, and .84 for Depression, Anxiety, and Stress, respectively) and supported the three-factor structure of the Depression, Anxiety, and Stress subscales (Sinclair et al. 2012). For the current sample, internal consistency was alpha = .94 (n = 364) for the total DASS score, .88 for the Depression scale, .85 for Anxiety, and .86 for Stress. Test-retest correlations were .69 for the total score, .59 for Depression, .60 for Anxiety, and .74 for Stress (n = 38).

Criterion Validity Measures

We evaluated criterion validity for the ESQ-R by investigating correlations with university grade point average (GPA) and academic engagement, an important correlate of achievement and adjustment (Zhang et al. 2012). Students self-reported current GPA on the university’s standard 4.0-point scale. Grade data could not be obtained directly from the university for administrative reasons; however, a previous meta-analysis showed that “self-reported grades generally predict outcomes [such as future GPA] to a similar extent as actual grades” (Kuncel et al. 2005, p. 76), and the correlation between self-reported and actual GPAs in their combined sample of 12,089 college students was r = .90.

The 21-item Student Course Engagement Questionnaire (SCEQ; Handelsman et al. 2005) asks students to rate their academic engagement over the past week on a scale from 1 = Not at All Engaged to 5 = Very Engaged. The measure has a four-factor structure: skills, emotions, participation, and performance, with internal consistency ranging from .76 to .82 across the scales. The SCEQ authors reported that SCEQ scores explained 26% of the variance in homework grades and 30% in final exam grades. For the current sample, internal consistency was alpha = .91 (n = 364) and test-retest reliability was r = .79 (n = 38).

Results

Factor Analyses

We conducted a series of exploratory factor analyses (EFAs) and confirmatory factor analyses (CFAs) to reduce the item pool. No missing data handling was necessary, as only participants with full data for the ESQ-R were included in the factor analyses. The latent trait hypothesized to underlie all ESQ-R items is ES difficulties, with higher scores associated with more difficulties. First, we conducted principal components analyses (PCAs) with Varimax rotation and without constraints on the number of factors, using the criterion of eigenvalues greater than 1.0, in SPSS 25.0. We inspected the results and reduced the number of factors for subsequent models when fewer than three items loaded on an identified factor. We then ran the EFA on the same items, constrained the analysis to that number of factors (e.g., five), and flagged items for removal that showed either (a) no loadings above 0.40 on any factor or (b) loadings of 0.40 or above on more than one factor (cross-loadings). Flagged items were removed after expert review for content (i.e., to ensure approximately equal weight given to the full range of ES areas and to reduce redundancy with other items); items from original ESQ versions published in Dawson and Guare books were given priority for retention.
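
A minimal sketch of this item-screening logic appears below. It is illustrative only: the actual analyses were conducted in SPSS 25.0, the variable names are hypothetical, and the factor_analyzer package is used simply as a convenient stand-in for a rotated principal components solution.

```python
# Illustrative sketch of the eigenvalue and loading rules described above
# (assumption: `items` is a DataFrame of ESQ-R item responses scored 0-3).
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # pip install factor-analyzer

def screen_items(items: pd.DataFrame, cutoff: float = 0.40):
    # Retain as many factors as there are eigenvalues of the item
    # correlation matrix greater than 1.0.
    eigenvalues = np.linalg.eigvalsh(items.corr().to_numpy())
    n_factors = int((eigenvalues > 1.0).sum())

    # Principal-components extraction with Varimax rotation, constrained
    # to that number of factors.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
    fa.fit(items)
    loadings = pd.DataFrame(fa.loadings_, index=items.columns)

    # Flag items with no loading >= .40 on any factor, or loadings >= .40
    # on more than one factor (cross-loadings).
    salient = (loadings.abs() >= cutoff).sum(axis=1)
    mask = ((salient == 0) | (salient > 1)).to_numpy()
    flagged = loadings.index[mask].tolist()
    return n_factors, loadings, flagged
```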

Using Mplus 7.1 (Muthén and Muthén 2013), we ran the remaining items through separate CFAs for each of the factors from the initial EFA. Additional problematic items were identified from CFA modification indices that suggested correlated residuals among items. Items with high suggested correlations were removed when their content overlapped with the correlated items (again, with priority given to retaining original ESQ version items). Once items were removed, the new item pool was assessed through another series of PCAs and CFAs. This process continued until each CFA had either good fit or was just-identified (i.e., the number of estimated parameters equals the number of elements in the covariance matrix, which results in zero degrees of freedom and an inability to estimate fit statistics), and no additional meaningful modification indices were present.
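
For reference, the just-identified cases follow standard CFA identification arithmetic: assuming a single-factor model with the factor variance fixed to 1 and uncorrelated residuals, a factor with p indicators estimates 2p parameters against p(p + 1)/2 observed variances and covariances, so

$$df = \frac{p(p+1)}{2} - 2p, \qquad p = 3: \; df = 6 - 6 = 0 \;\text{(just-identified)}.$$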

In the first round of analyses, the EFA identified eight poorly fitting items. After examining item content to ensure removal would not narrow representation of the range of ES areas, these items were dropped from further analyses. Another EFA was then conducted with the remaining 53 items, and a five-factor model was identified. Fit for each factor in the CFAs was mixed or poor (χ2(9–299) = 82.15–683.04, all p < .001; RMSEA = .06–.21; CFI = .76–.94), and modification indices identified additional items to consider for removal. After examining item content, 11 additional items were dropped from the analyses, which resulted in 42 remaining items (of the initial 61-item pool).

In the second round of analyses, the EFA indicated that these 42 items still fit a five-factor model. One item cross-loaded on multiple factors and, after examining item content, was removed from further analyses. Separate CFAs were run for each factor from the EFA using the remaining 41 items. Fit for three factors was mixed (χ2(2–230) = 25.29–497.03, all p < .001; RMSEA = .06–.18; CFI = .90–.95), fit for one factor was good (χ2(9) = 10.44, p = .32; RMSEA = .02; CFI = 1.00), and one factor was just-identified. CFA modification indices identified items to consider for removal, and six additional items were dropped from further analyses, which left 35 items.

For the third round of analyses, the EFA again supported a five-factor model. Based on cross-loadings and item content, an additional six items were dropped from further analyses. Separate CFAs were run for each factor from the EFA using the remaining 29 items. Fit for one factor was acceptable (χ2(119) = 226.88, p < .001; RMSEA = .05; CFI = .96), fit for two factors was good (χ2(2) = 3.52, p = .17; RMSEA = .05; CFI = 1.00 and χ2(2) = 3.48, p = .18; RMSEA = .05; CFI = .99), and two factors were just-identified. Using modification indices and item content, an additional four items were dropped from the one factor with acceptable fit, which left 25 items from the original measure. After dropping these four items, the factor that previously had acceptable fit had good fit (χ2(44) = 54.68, p = .13; RMSEA = .03; CFI = .99). After establishing good fit in the individual factors, a full model estimating all five factors was run. The full five-factor model had acceptable fit (χ2(265) = 423.38, p < .001; RMSEA = .04; CFI = .95). Table 2 shows the 25 retained items’ loadings on the five factors.

Table 2 Retained items on the 25-item ESQ-R Scale and factor loadings

We used participants’ total scores from this 25-item ESQ-R version to estimate reliability and correlations with other measures, with the total score calculated as the sum of scores for all 25 items (possible range 0–75). The items were distributed across the ES areas hypothesized by Dawson and Guare (2010) as follows: two planning/prioritization (P), two organization (O), two time management (TM), two working memory (WM), three metacognition (M), three response inhibition (RI), four emotional control (EC), two sustained attention (SA), one task initiation (TI), two flexibility (F), and two goal-directed persistence (GDP).
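
A minimal scoring sketch, under the assumption that item responses are stored in a data frame with one column per retained item (column names are hypothetical):

```python
# Total ESQ-R score: sum of the 25 retained items, each scored 0-3
# (possible range 0-75); higher scores indicate more ES difficulties.
import pandas as pd

def esqr_total(responses: pd.DataFrame, item_cols: list) -> pd.Series:
    assert len(item_cols) == 25, "the final ESQ-R version retains 25 items"
    return responses[item_cols].sum(axis=1)
```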

Reliability

Reliability estimates were calculated for ESQ-R total scores using Classical Test Theory (CTT). For the 25-item ESQ-R total score, internal consistency estimates were excellent: Cronbach’s alpha = .91, Guttman split-half coefficient = .91. Internal consistency estimates for the items in the five factors described previously were as follows: .89 for Factor 1 (11 items), .74 for Factor 2 (4 items), .76 for Factor 3 (3 items), .75 for Factor 4 (3 items), and .65 for Factor 5 (4 items; see Table 2 for items included in each factor).
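
The sketch below illustrates how these CTT estimates can be computed from item-level data. It is illustrative only: variable names are assumptions, and the split-half shown uses an odd/even split of items, which may differ from the split used by the statistical package that produced the reported value.

```python
# Classical test theory reliability estimates from a DataFrame of item scores.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def guttman_split_half(items: pd.DataFrame) -> float:
    # Guttman split-half coefficient for an odd/even split of the items
    half_a = items.iloc[:, ::2].sum(axis=1)
    half_b = items.iloc[:, 1::2].sum(axis=1)
    total_var = (half_a + half_b).var(ddof=1)
    return 2 * (1 - (half_a.var(ddof=1) + half_b.var(ddof=1)) / total_var)
```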

Test-retest reliability for ESQ-R total scores (25-item version) between Time 1 and Time 2 was r = .70 (n = 38). Using the final 25 items, there was no significant effect of delay interval between Time 1 and Time 2 on Time 2 ESQ-R scores (β = .05, t(36) = 1.33, p = .19) or on the absolute difference between Time 1 and Time 2 ESQ-R scores (β = .003, t(36) = 0.19, p = .85). Thus, despite variability in delay among the 38-participant subsample, delay was not associated with a consistent score increase or decrease, and scores did not become significantly more inconsistent over time.
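
A sketch of these delay analyses, assuming a retest data frame with hypothetical columns delay_days, esqr_t1, and esqr_t2, and ordinary least squares regressions with an intercept (the original analyses may have been specified differently):

```python
# Regress Time 2 scores, and the absolute Time 1 - Time 2 difference,
# on the number of days between administrations.
import pandas as pd
import statsmodels.api as sm

def delay_effects(retest: pd.DataFrame):
    X = sm.add_constant(retest["delay_days"])
    time2_model = sm.OLS(retest["esqr_t2"], X).fit()
    abs_diff = (retest["esqr_t1"] - retest["esqr_t2"]).abs()
    diff_model = sm.OLS(abs_diff, X).fit()
    return time2_model, diff_model
```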

Convergent, Discriminant, and Criterion Validity

Table 3 shows correlations between ESQ-R total scores and other EF measures (range r = .56–.74), as well as psychological symptom measures (r = .38–.55), student academic engagement (r = −.40), and college GPA (r = −.07). All correlations were statistically significant at the p < .001 level, except for that between ESQ-R scores and GPA (r = − .07, p = .175, n = 374). ESQ-R scores were also significantly correlated with age (r = − .118, p = .023, n = 373). Notably, of all the measures administered in the current study, only the student academic engagement measure correlated significantly with GPA (r = .199, p < .001, n = 374).
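
These coefficients are simple bivariate Pearson correlations; a sketch of how they might be computed is shown below (column names are hypothetical, and pairwise deletion is assumed because sample sizes differ across measures):

```python
# Pearson correlations between ESQ-R total scores and the other measures.
from scipy.stats import pearsonr

def validity_correlations(df, criterion="esqr_total",
                          others=("adexi", "cbs", "dass", "gad7", "pss", "sceq", "gpa")):
    results = {}
    for col in others:
        pair = df[[criterion, col]].dropna()
        r, p = pearsonr(pair[criterion], pair[col])
        results[col] = {"r": r, "p": p, "n": len(pair)}
    return results
```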

Table 3 Correlations among measures

Discussion

The current study described the refinement and initial psychometric evaluation of the Executive Skills Questionnaire-Revised (ESQ-R), a self-report ES rating scale. We designed the ESQ-R to adequately represent a range of ESs important for academic success in a way that is specifically tied to available EF interventions.

ESQ-R Development

Initial development was successful in improving the efficiency of the measure by reducing the candidate pool of 61 items to a final 25-item version that retained representation of all 11 ES areas originally hypothesized by Dawson and Guare (2010). Notably, factor analyses resulted in a five-factor structure for the ESQ-R, which deviates from Dawson and Guare’s (2010) original conception of 11 distinct ES areas. The five factors appear to represent ESs related to making, adjusting, monitoring, and sticking with a plan (Factor 1, Plan Management: 11 items); time management and switching tasks (Factor 2, Time Management: 4 items); organization of materials (Factor 3, Materials Organization: 3 items); emotional regulation (Factor 4, Emotional Regulation: 3 items); and impulsivity/inhibition (Factor 5, Behavioral Regulation: 4 items). All factors had internal consistency estimates at or above .70 in the current sample, except for Factor 5. This suggests that additional development may be needed for consistent and comprehensive representation of behavioral regulation in the ESQ-R.

The first factor is the largest and includes items originally hypothesized to represent Dawson and Guare’s ES areas of planning/prioritization, metacognition, emotional control, sustained attention, flexibility, and goal-directed persistence. It is common to identify a large and inclusive first factor, but this factor’s representation of multiple ES areas from Dawson and Guare’s (2010) model complicates the one-to-one link between ESQ-R scores and specific ES area interventions. Although future test development efforts may identify items that would more clearly distinguish among the included ES areas in this factor, it is also possible that there are fewer distinguishable ES areas than originally hypothesized and that interventions for one ES area in this factor (e.g., metacognition) would result in skill transfer to related areas (e.g., planning/prioritization). This hypothesis would need to be explicitly tested with participants receiving ES interventions.

The latter explanation is supported by the similarity of the current results to those found for clinical EF rating scales, which generally include two to five factors. For example, Barkley (2011) identified five EF factors measured on the BDEFS, and Roth and colleagues (2013) identified a three-factor structure on the BRIEF-A. In the future, researchers should examine the relations among factors on the ESQ-R and such clinical EF scales. Unfortunately, we were unable to include these scales because of funding constraints. The cost of such measures, as previously noted, is also a likely obstacle for practitioners wishing to use them to plan, tailor, and evaluate interventions—especially at multiple time points and with large groups of students.

Psychometric Properties of the 25-Item ESQ-R

For the 25-item ESQ-R total score, we found excellent internal consistency and adequate test-retest reliability across a wide range of delay intervals, with no significant effects of delay on scores. These are important properties for an intervention-focused measure. Additionally, although our test-retest sample was small, the ESQ-R had the strongest test-retest reliability (r = .70) of the researcher-developed EF measures administered (r = .52 for both the CBS and the ADEXI), and ESQ-R scores were moderately related to scores on those measures (r = .69 to .74). Although further research regarding test-retest stability versus sensitivity to change is needed, the current results are promising for the ESQ-R as a repeatable measure of intervention outcomes.

The current convergent and discriminant validity results are also promising and begin to elucidate a nomological network. Examining patterns in Table 3, ESQ-R correlations with the other EF measures were all higher than those between the ESQ-R and the psychological symptom measures (e.g., DASS, GAD-7, and PSS), and all correlations were in the expected directions (i.e., scores indicating more problems on the ESQ-R were associated with scores indicating more psychological symptoms).

Further, ESQ-R scores were significantly correlated with student academic engagement scores (SCEQ), which were, in turn, correlated with students’ self-reported GPAs. However, the direct correlation between ESQ-R scores and GPA was not significant. This is likely because, as Table 1 indicates, our sample had relatively high GPAs, which may have produced restriction of range and hampered our ability to detect relations at the lower end of the GPA distribution, among struggling students for whom ESs may matter most. Of all the measures administered in the current study, only academic engagement scores showed a significant correlation with GPA, suggesting that the weak correlation between ESQ-R scores and GPA may be a shared feature of EF rating scales rather than unique to the ESQ-R.

Limitations and Future Research

The current study has several limitations and areas for future improvement. First, sample size and representativeness could be improved; cross-validation studies with larger, more representative samples may yield somewhat different correlations and reliability estimates, as well as alternative factor structures. The test-retest sample should also be expanded, especially given the intent of the ESQ-R as an intervention-focused measure. Further, the current sample included multiple individuals with (self-reported) disability conditions such as attention deficit/hyperactivity disorder (ADHD), which could have influenced results; however, other measures have included such heterogeneous samples and have touted this inclusivity as an advantage (see Barkley 2011). In fact, future studies may benefit from explicitly recruiting clinical samples with conditions known to involve EF impairment, such as ADHD.

Future research should focus on further scale refinement and additional psychometric data collection to support using the ESQ-R as a comprehensive but efficient measure for informing and measuring effectiveness of EF interventions. We plan to further evaluate basic psychometric properties (e.g., factor structure in different samples) and advanced measurement characteristics (e.g., invariance across cultural and clinical groups), as well as relations among scores on the ESQ-R and other measures of improvement (e.g., actual GPA, grades, other EF scales, and tests) across time. Future studies should also examine ESQ-R scores as predictors of retention and other important academic outcomes. Finally, we plan to evaluate the psychometric properties of the 25-item ESQ-R with an extended age range, including middle and high school students, and to adjust item content according to the resulting data. This will increase applicability to different populations of older students who may benefit from ES interventions and for practitioners who need efficient, reliable measures to evaluate these possible benefits.

Conclusion

In the current study, we addressed limitations of available EF measures by developing a comprehensive but time- and cost-efficient ES self-report scale with adequate to excellent reliability and validity. Future studies are needed to increase sample representativeness and expand psychometric evidence. Given the current results, the ESQ-R is a promising tool for practitioners to plan, tailor, and evaluate the effectiveness of interventions for multiple ES areas.