Arguably “the single most important issue in management research and practice in the last decade” (Saks 2017, p. 76), work engagement has received considerable and increasing attention (e.g., Saks and Gruman 2014; Seppälä et al. 2015) in the realms of both research and practice. However, engagement – defined as a “positive, fulfilling, and work-related state of mind characterized by vigor, dedication and absorption” (Schaufeli and Bakker 2004, p. 295) – has also seen widespread decline amongst employees worldwide. Gallup estimates that disengaged employees cost US companies alone upwards of $250 billion annually (Rath and Conchie 2009). With these financial stakes as a salient backdrop, focus on this construct also stems from its established relationship with a host of outcomes essential for organizational success, including employee productivity, firm profitability, and competitive advantage (e.g., Buckingham and Coffman 1999; Harter et al. 2002; Rich et al. 2010; Schaufeli 2013).

Nevertheless, despite this sustained importance and relevance, there remains a notable research-practitioner gap in terms of operationalizing the construct, as well as its effective measurement (Bailey 2016; Saks 2017). Indeed, scholars (e.g., Macey and Schneider 2008; Mills et al. 2013; Saks 2017; Shuck et al. 2017) have consistently lamented that, although engagement is relatively easy to recognize, it has been difficult to define and consistently measure – an issue that Saks (2017) calls “the engagement measurement barrier” (p. 77). This point was illustrated by Macey and Schneider (2008), who synthesized the many approaches, definitions, and expected behaviors associated with engagement in both practice and academia. Their overall evaluation was that, with the exception of only a few (Salanova et al. 2005; Schaufeli et al. 2002), the majority of engagement measures have fallen short in truly representing the construct. Moreover, even among the few measures that do effectively represent engagement, most are too lengthy to be usable by practitioners or by researchers pursuing longitudinal designs. Such length constraints inherently limit those measures’ relevance for the most rigorous and impactful measurement efforts.

In an effort to provide clarity surrounding the measurement and operationalization of engagement, academics have invested considerable effort in validating measures of engagement in a transparent way (Christian et al. 2011; Mills et al. 2012; Wefald et al. 2012). Although a number of engagement measures exist (see Saks 2017), academia has seen widespread adoption of the conceptualization put forth by Schaufeli and Bakker (2004) and the associated Utrecht Work Engagement Scale (UWES; Schaufeli et al. 2002). As the most commonly used measure in academic research (Rich et al. 2010; Schaufeli 2012), the UWES has been found to relate to many outcomes important to employees and organizations (see Halbesleben 2010 for an overview). Practitioners, however, vary so widely in their operationalization and measurement of engagement (Bakker and Leiter 2010; Macey and Schneider 2008) that it is questionable whether they are even assessing the same construct.

Moreover, with so many practitioners (e.g., consulting firms, in-house organizational behavior specialists) examining engagement, those in industry often find themselves further divorced from one another in terms of the operationalization and assessment of engagement (Mills et al. 2013). Although practitioners argue their instruments tap engagement and predict critical outcomes, their items are often proprietary, making it difficult to understand what exactly is being measured or how their assessments align with academic approaches. This Tower of Babel-type effect between, and within, practitioners and academics significantly hampers progress in the realms of both research and application. As emphasized by Shuck et al. (2017), “numerous entangled definitions, words, measurements, and frameworks have been proposed when referring to employee engagement... Consequently, researchers have routinely drawn theoretical conclusions about the meaning of employee engagement, limiting [its] applicability in theory building and practice” (p. 263).

To systematically advance work engagement scholarship, the science-practitioner gap must be addressed. We argue that a ready avenue for doing so is the validation of a parsimonious, psychometrically sound measure appropriate for use in practice as well as in rigorous academically based research designs. Without such a measure, practitioners are limited in the extent to which they can leverage empirical research, while simultaneously using measures of engagement that are misaligned with the dominant, psychometrically supported operationalization. Notably, only about half of organizations report employing analytic techniques at all, with even fewer using predictive methodology or more complex modeling (Collins et al. 2017). Organizations can effectively leverage empirical findings from advanced statistical methods, but only if their measurement and operationalization of the construct aligns with that used in the more rigorous academic research. That will not happen as long as the existing validated measures in academia are impractically lengthy. A psychometrically valid, parsimonious measure is needed to help bridge the divide, thereby better positioning academics to collaborate with practitioners regarding a construct of practical interest to organizations (Bailey 2016). Put simply, a validated measure would allow academics and practitioners to speak a common language when examining issues related to employee engagement and how best to promote it throughout organizations.

A major hurdle facing practitioners – and academics collaborating with organizations – when trying to employ existing scholarly measures is that they are too lengthy to be accepted by organizational stakeholders and/or clients (Lapierre et al. 2018). Moreover, short-form measures, when they exist, are often compromised by nonexistent or limited validity evidence (Fisher et al. 2016; Smith et al. 2000) and often fail even to explain the decision rules used to adapt (e.g., shorten) a given measure (Heggestad et al. 2019). Especially in the lengthy annual surveys frequently used within organizations, the goal is usually to measure many constructs simultaneously, making a lengthy measure for any one construct largely infeasible from the perspective of practitioners and organizations. With these issues in mind, within this line of research we develop and thoroughly evaluate a short-form version of the UWES and its psychometric characteristics, giving due attention to clarifying the decision rules used to adapt the measure (Heggestad et al. 2019).

As we will discuss, some effort has been made in this respect (Reina-Tamayo et al. 2017; Schaufeli et al. 2017). However, existing validity evidence for these measures is limited. As such, within the current program of research, we aim to provide evidence regarding the utility of a short-form measure based on a multifaceted validity approach. In so doing, we make several contributions to the literature. Specifically, we (a) offer a short-form measure aligning with Schaufeli et al.’s (2002) conceptualization of engagement, (b) take a multifaceted validation approach to present extensive evidence for the efficacy of this measure across two studies and four data sources, (c) evidence the temporal invariance (factorial/configural, metric, scalar, strict) of engagement as measured via the short-form measure, (d) assess the extent to which engagement as measured is invariant across various demographic characteristics, and (e) evaluate engagement’s over-time relationship with employee life satisfaction as an example of the utility of the measure for longitudinal research. We thereby also move toward closing the research-practitioner gap in the operationalization and measurement of engagement, facilitating implementation in both research and practice. In sum, regardless of application context, the collective evidence presented here should allow individuals (academic or practitioner) to use the resulting measure with the highest level of confidence in its underlying psychometrics.

Work Engagement Matters. But Why?

Several systematic reviews (e.g., Christian et al. 2011; Halbesleben 2010) elaborate on the criticality of work engagement. In brief, work engagement consistently relates to a myriad of employee and organizational outcomes, including decreased absenteeism (Bakker and Schaufeli 2008) and turnover (Halbesleben 2010), decreased counterproductive work behaviors and increased organizational citizenship behaviors (Sulea et al. 2012), as well as employee retention (Buckingham and Coffman 1999). Engagement is also related to bottom-line outcomes of importance to the success of organizations, such as in-role performance broadly defined (Christian et al. 2011; Halbesleben and Wheeler 2008), sales (Xanthopoulou et al. 2009) and service performance (Chen et al. 2018; Salanova et al. 2005), and firm profitability (e.g., Harter et al. 2002; Rich et al. 2010; Schaufeli 2013). Because work engagement is related to so many desirable organizational outcomes, it has been widely utilized in the applied arena and is a frequent construct of interest among consultancies’ clientele – and this interest shows no signs of abating (Collins et al. 2017).

However, as noted, academics’ and practitioners’ respective understandings of engagement are often not well aligned (Bakker and Leiter 2010; Harter et al. 2002). Nevertheless, large firms and organizations overwhelmingly claim to have evidence indicating that their measures relate to those same outcomes that academics have examined (Schaufeli and Bakker 2010). Although this may be true in terms of the general direction of relations, it is unlikely that academic and practice-based measures of engagement are truly measuring the same thing. Rather, they are likely tapping into related, but different, underlying construct spaces. With this in mind, it is important to recognize that the relationships explicated above were identified using measures developed within academic engagement paradigms (with one exception: Gallup’s Q12 – see Harter et al. 2002).

The Measurement of Work Engagement

Researchers (e.g., Bakker et al. 2011; Rich et al. 2010) have consistently emphasized the lack of clarity surrounding the measurement of engagement. Three primary studies (Byrne et al. 2016; Viljevac et al. 2012; Wefald et al. 2012) have compared the psychometric and predictive efficacy of existing engagement measures (i.e., Britt et al. 2013; May et al. 2004; Schaufeli and Bakker 2004; Shirom 2003). Results suggest that the UWES is a superior measure, as it better differentiates engagement from similar constructs (Viljevac et al. 2012) and is most predictive of relevant outcomes (Byrne et al. 2016; Wefald et al. 2012). Byrne et al. further argued that, in practice, the UWES conceptualization of engagement is likely most appropriate because it assesses a broader domain and yields practical information actionable within organizational contexts. Although outstanding measurement issues remain (May et al. 2004; Saks and Gruman 2014; Shuck et al. 2017; Wefald et al. 2012), the UWES demonstrates strong psychometric characteristics, which likely explains why it remains so widely used by researchers. As such, we focus our efforts on improving the utility of this measure.

In line with this conceptualization, scholars characterize engagement predominantly as an affective-cognitive state comprised of vigor, dedication, and absorption (Schaufeli et al. 2002). These three engagement components align with Kahn’s (1990) conceptualization of engagement. Vigor aligns with Kahn’s physical-energetic component and is defined as liveliness and energy at work, encompassing one’s willingness to invest effort in one’s work and to be mentally resilient in the face of difficulties. Dedication describes the extent to which an employee is involved in, committed to, and proud of their work, and corresponds with Kahn’s emotional component. Absorption is the extent to which an employee feels engrossed in his or her work and senses time as passing quickly, thus aligning with Kahn’s cognitive component. With this operationalization in mind, Schaufeli et al. (2002) created the UWES, consisting of 17 items nested within the three conceptual factors. Research has been mixed, however, regarding the factor structure of the measure (e.g., Sonnentag 2003; Mills et al. 2012).

A 9-item version of the UWES was developed to address these issues as well as to provide a more parsimonious measure. This version yielded correlations above .90 among the three latent factors (Schaufeli et al. 2006), leading Schaufeli et al. to posit that this version might be best interpreted as holistic engagement rather than as three components, as the multifactor structure was inconclusive for the given sample. This notion of scoring the UWES-9 as a single unitary engagement index is, in part, reflective of the fact that the conceptual factors often correlate with one another highly enough to indicate possible multicollinearity (e.g., .88, .94; Balducci et al. 2010; Schaufeli et al. 2006). Beyond that, scholars (e.g., Balducci et al. 2010; De Bruin and Henn 2013) have recommended the total score as best practice for interpreting the UWES-9, as the three constructs may independently lack discriminant validity.

Current Program of Research

To date, two known studies have published adapted three-item versions of the UWES. Although it was not their primary focus, Reina-Tamayo et al. (2017) selected three items (one from each dimension of the UWES) to develop an “episodic” short-form measure for use in their diary study, reporting that they retained the item from each dimension with the highest item-total correlation in the multi-item scale. Alternatively, Schaufeli et al. (2017, p. 4) offered a three-item short-form measure wherein they report selecting items “[b]ased on face validity, theoretical reasoning, and earlier feedback from respondents”. These past efforts provide instructive “proof of concept” support for our argument regarding the need for a short engagement measure. However, while both studies contribute to our collective understanding of this need, there remain important opportunities to improve on, and more thoroughly validate, a short-form engagement measure.

Specifically, a more methodological (objective) process for item selection, as well as a more thorough consideration of validity evidence based on established best practices (e.g., Stanton et al. 2002), would ultimately allow for greater confidence in the final items in the resulting measure (Heggestad et al. 2019). As noted by Heggestad et al. (2019), it is common practice for researchers to adapt measures to fit their purposes (e.g., shorten a measure by selecting certain items to be retained), albeit with limited attention to the degree to which the adapted measure remains content valid. To that end, we conducted two separate studies, utilizing four unique data sources, to validate a proposed short form of the UWES, with particular attention given to the necessary conditions for optimal construct measurement and assessment (e.g., Ployhart and Vandenberg 2010). In so doing, we address the issues noted by Heggestad et al. (2019): we provide scholars and practitioners a common, extensively validated short-form measure to work from; if applied consistently, this validated measure would ensure that observed results across studies (or between research and practice domains) do not differ as a function of selecting a different subset of items.

Given that the UWES often yields near-multicollinear correlations between factors and therefore is best indexed by collapsing across components (Christian et al. 2011; Sonnentag 2003), we used a multifaceted approach to validate a short-form measure that leveraged the strongest item, defined and clearly articulated here based on multiple criteria, from each of the conceptual factors. In Study 1 we determined the most representative item from each factor in line with recommendations by Stanton et al. (2002). Specifically, judgmental qualities of all items were evaluated by subject matter experts (SMEs; Sample 1); that is, rather than relying on the potentially idiosyncratic judgments of the authors (Heggestad et al. 2019), we developed a holistic understanding based on a larger number of individuals actively involved in the engagement research domain. Following this, internal and external qualities of the items were evaluated to determine factor loadings and relationships with related constructs (Samples 2 and 3). In Study 1 we also examined the extent to which this engagement measure functions comparably across varying demographic characteristics. That is, with practitioners and researchers alike increasingly seeking to compare demographic groups on engagement, it is important to first establish comparable measurement functioning across groups to ensure unbiased comparison.

Subsequently, in Study 2, we first examined the psychometric characteristics of the short-form measure (i.e., validity evidence for the short-form measure independent of the administration of the long form; Smith et al. 2000). In so doing, we expand our understanding of the validity of this measure by examining its temporal factor invariance (six waves of data with two-week lags between assessments). In turn, to demonstrate the utility of the short measure (and establish predictive validity evidence), we provide a within-individual test of the concept of gain spirals within conservation of resources theory and examine the dynamic (over-time) cause-effect relationship between engagement and life satisfaction.

Study 1

In Study 1 we used data from three independent samples to determine, as objectively as possible, the best items for the short-form measure following best practice recommendations (e.g., Stanton et al. 2002; Smith et al. 2000). Sample 1 consisted of SMEs familiar with the engagement construct and its measurement. To examine issues related to judgmental qualities, SMEs were asked to evaluate engagement items from the UWES-9 (Schaufeli et al. 2006) for conceptual fit with their respective dimension (i.e., vigor, dedication, absorption). Samples 2 (working adults recruited via Mechanical Turk) and 3 (working adults recruited through a peer-nomination approach) were used to examine internal and external qualities of items from the UWES-9 to help identify the final set of items and to further assess the validity of the abbreviated measure. In turn, we applied a stressor taxonomy approach (i.e., Halbesleben 2010) to frame our understanding of the internal and external qualities of items from the UWES. This approach was taken given that existing research has consistently linked, and continues to link, engagement to various stressors [e.g., role stressors and work hours (Kronenwett and Rigotti 2019); irritation (Baethge et al. 2019); job resources (Afsharian et al. 2018)] and outcomes [e.g., organizational citizenship behaviors (Smith et al. 2020); turnover intentions (Steffens et al. 2018)] denoted within the stressor taxonomy framework (Halbesleben 2010). Furthermore, leveraging the larger, heterogeneous nature of Sample 3, we also examined issues of measurement invariance (i.e., based on gender, age, employment status, work schedule, and tenure).

Sample 1: Methods

Participants & Procedures

Data were collected online from 27 SMEs. SMEs included faculty (n = 11), doctoral candidates (n = 14), and applied practitioners (n = 2) in industrial/organizational psychology. SMEs had, on average, 6.04 years of experience in the field (SD = 4.54); 66.7% indicated they were either moderately or extremely familiar with the construct, and 51.8% had either published or presented scholarly work related to engagement and/or consulted with organizations on issues related to engagement.

SMEs were given the UWES-9 (items presented in random order) and construct definitions for both engagement and its three components. As recommended (Hinkin and Tracey 1999), SMEs were asked to assign each item to the characteristic it best represented; SMEs could not assign a given item to multiple characteristics. Then, on separate pages, SMEs reviewed the construct definitions again, along with the three items representing each engagement characteristic, and were asked to rank the items in terms of which best represented each specific definition (i.e., SMEs were asked to pick the best overall item for that characteristic, and so forth).
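Concretely, tallying these SME judgments reduces to two statistics per item: the percentage of SMEs assigning the item to its intended dimension and the item’s mean rank within that dimension. A minimal sketch of such a tally follows (Python; the data layout and column names are illustrative assumptions, not the authors’ actual files).

```python
import pandas as pd

# Hypothetical long-format SME responses (one row per SME-item judgment).
sme = pd.DataFrame({
    "item":     ["Vigor1", "Vigor1", "Vigor2", "Vigor2"],
    "assigned": ["Vigor", "Dedication", "Vigor", "Vigor"],  # SME's chosen dimension
    "true_dim": ["Vigor", "Vigor", "Vigor", "Vigor"],       # item's intended dimension
    "rank":     [1, 2, 2, 1],                               # 1 = best represents dimension
})

# Percent of SMEs assigning each item to its intended dimension.
accuracy = (
    sme.assign(correct=sme["assigned"] == sme["true_dim"])
       .groupby("item")["correct"].mean() * 100
)

# Mean within-dimension ranking (lower = judged more representative).
mean_rank = sme.groupby("item")["rank"].mean()

print(pd.concat([accuracy.rename("pct_correct"),
                 mean_rank.rename("mean_rank")], axis=1))
```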

Sample 1: Results & Discussion

Results are presented in Table 1. Specific to vigor, item 2 was classified correctly 100% of the time, whereas item 1 was ranked highest in terms of the degree to which it represented that dimension. For dedication, item 1 was ranked highest and classified correctly 78% of the time, whereas item 3 was ranked second and classified correctly 100% of the time. For absorption, item 1 was categorized correctly 100% of the time and was also ranked highest. In contrast, Vigor3, Dedication2, and Absorption3 demonstrated classification inconsistency and/or low rankings. However, judgmental assessments from SMEs are just one piece of information when making decisions about items to potentially retain or remove from a measure (Stanton et al. 2002). As such, we retained all items as part of our examination of internal and external qualities in Samples 2 and 3 in order to develop a more holistic understanding of the items and their strengths and weaknesses.

Table 1 Study 1 sample 1 subject matter expert evaluations

Sample 2: Methods

With Sample 2 we evaluated internal and external qualities of the items (Stanton et al. 2002). Internal qualities were examined in terms of factor loadings via confirmatory factor analysis (CFA), as well as issues related to residual error (Smith et al. 2000). Specific to external qualities, we leveraged two taxonomies to select constructs with which to examine the nomological network around the items. Based on Sonnentag and Frese’s (2003) stressor taxonomy, we selected constructs to index four classes of stressors: task-related job stressors, role stressors, social stressors, and schedule-related stressors. We also examined (physical, affective, behavioral) outcomes of engagement indexed based on Weiner et al. (2012).

Participants & Procedures

Participants were recruited from Amazon.com’s Mechanical Turk (MTurk) as part of a larger effort (Matthews and Ritter 2016). Only U.S. participants who had previously completed at least 500 tasks with a 96% approval rating (as tracked and managed by MTurk) were permitted to participate. Respondents were required to work at least 24 hours a week, with no more than 50% of their work being done at home, and were paid $2.50 for participating. Consistent with both emerging recommendations regarding the use of online panels (Huang et al. 2012) and past research (e.g., McGonagle et al. 2016), respondents who failed to correctly complete at least five of six effortful responding questions (e.g., “leave this question blank”) were excluded. We allowed respondents to miss one attention check item given that respondents may mistakenly miss one item but still, generally, be attentive (Huang et al. 2012; McGonagle et al. 2016).
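This “miss at most one check” rule is simple to operationalize; a minimal sketch with illustrative data and placeholder column names (1 = passed a check, 0 = failed) follows.

```python
import pandas as pd

# Illustrative responses: one row per respondent, one column per check.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "check_1": [1, 1, 0], "check_2": [1, 0, 0], "check_3": [1, 1, 1],
    "check_4": [1, 1, 0], "check_5": [1, 1, 1], "check_6": [1, 0, 1],
})
checks = [c for c in df.columns if c.startswith("check_")]

# Retain respondents who passed at least five of six checks,
# i.e., missed at most one (per Huang et al. 2012).
retained = df[df[checks].sum(axis=1) >= len(checks) - 1]
print(retained["id"].tolist())  # -> [1]
```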

A total of 299 people participated; 24 were excluded for not meeting inclusion criteria (e.g., a screener question asked the number of hours per week participants worked, and we excluded respondents who reported working fewer than 24 hours per week). Respondents were blind to our analysis inclusion criteria. To facilitate interpretation, we employed listwise deletion at the item level for the UWES-9 items, resulting in the exclusion of nine additional respondents. The analysis sample (N = 264) was 53.8% female and primarily Caucasian (76.9%), with an average age of 34.44 years (SD = 10.51) and organizational tenure of 4.98 years (SD = 4.77). On average, respondents worked 40.69 h a week (SD = 6.22), and 70.2% worked a day shift.

Measures

Constructs selected to index the classes of stressors and outcomes are reported in Table 2. Participants were asked to consider the past month while responding to all items.

Table 2 Study 1, samples 2 and 3, stressor and outcome indices

Sample 2: Results & Discussion

Item-level means and standard deviations for the UWES-9 are reported in Table 3. To understand the internal qualities of the items, we first examined the factor loadings of the items from the UWES-9 based on a three-factor CFA (AMOS 21; Arbuckle 2012). Items were loaded on their respective latent factors, and the three latent characteristics were allowed to correlate. The model fit the data well [χ2(24) = 48.97, p < .001, CFI = .99, RMSEA = .06, SRMR = .03], with the three factors correlating strongly (r = .82 to .93, p < .001). Table 3 includes standardized loadings and modification indices, which were examined for issues of residual error. The strongest loading items for each characteristic were Vigor2, Dedication1, and Absorption1. Vigor3 demonstrated the most systematic residual error with other items in the model. We also examined a single-factor model, which fit the data [χ2(27) = 108.91, p < .001, CFI = .95, RMSEA = .10, SRMR = .04] but demonstrated worse fit than the three-factor model [∆χ2(3) = 59.94, p < .001].
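For concreteness, the model comparison above is a standard chi-square (likelihood ratio) difference test for nested models, which can be reproduced directly from the reported fit statistics; a minimal sketch in Python (scipy assumed) follows.

```python
from scipy.stats import chi2

def chi_square_difference(chi2_nested, df_nested, chi2_full, df_full):
    """Chi-square difference test for nested CFA models.

    The more constrained model (here the 1-factor CFA) has the larger
    chi-square and degrees of freedom; the difference is itself
    chi-square distributed under the null of equal fit.
    """
    d_chi2 = chi2_nested - chi2_full
    d_df = df_nested - df_full
    return d_chi2, d_df, chi2.sf(d_chi2, d_df)

# Values reported for Sample 2: 1-factor vs. 3-factor model.
d_chi2, d_df, p = chi_square_difference(108.91, 27, 48.97, 24)
print(f"delta chi2({d_df}) = {d_chi2:.2f}, p = {p:.3g}")  # 59.94, p < .001
```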

Table 3 UWES-9 item-level psychometric information for Study 1, Samples 2 and 3

Based on Matthews and Ritter (2016), forward statistical regression was used to examine the external qualities of the different engagement items, that is, to understand which items demonstrated unique shared variance with the selected indices. By using forward statistical regression, redundant items (items that explained limited incremental variance) were identified (Stanton et al. 2002) and excluded from the statistical model. Results for the four stressor index constructs are reported in Table 4, and results for the three outcome index constructs are reported in Table 5 (coefficients are standardized beta weights); a sketch of the forward-entry logic appears after these tables.

Table 4 Results of forward statistical regressions based on study 1 stressor indices, samples 2 and 3
Table 5 Results of forward statistical regressions based on study 1 outcome indices, samples 2 and 3
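For readers wishing to mirror the item-selection logic underlying Tables 4 and 5, a minimal forward-entry regression sketch follows (Python with statsmodels; an SPSS-style p-to-enter rule, simulated data, and placeholder item names are all assumptions of ours, not the authors’ code or data).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y: pd.Series, p_enter: float = 0.05):
    """Forward statistical regression: at each step, add the candidate
    predictor with the smallest entry p-value; stop when no remaining
    predictor meets the p-to-enter criterion."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for cand in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
            pvals[cand] = fit.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break  # no candidate explains enough incremental variance
        selected.append(best)
        remaining.remove(best)
    return selected

# Illustrative: nine simulated "items" predicting one criterion index.
rng = np.random.default_rng(0)
items = pd.DataFrame(rng.normal(size=(264, 9)),
                     columns=[f"uwes_{i}" for i in range(1, 10)])
criterion = items["uwes_2"] * 0.5 + rng.normal(size=264)
print(forward_select(items, criterion))  # items not entered are "redundant"
```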

These results provided additional item selection guidance. Specifically, Dedication1 not only demonstrated the highest factor loading (i.e., internal criteria), but was also uniquely related to six of the seven index constructs. Specific to vigor and absorption, the data are less intuitive, especially if the Sample 1 SME evaluations are taken into consideration. While Vigor2 demonstrated the highest factor loading, it was systematically related to only one of the index constructs. On the other hand, Vigor3 was uniquely related to four of the seven index criteria. However, based on SME evaluations, this item had poor conceptual representation of the construct. Further, Vigor3 demonstrated systematic residual error variance with five of the other engagement items. Finally, while Absorption1 demonstrated the strongest factor loading and had strong SME evaluations, it explained no unique variance in the index constructs. If nothing else, as it relates to the larger measure adaptation literature (e.g., Heggestad et al. 2019), it is clear that no one piece of evidence (i.e., judgmental, internal, or external) is sufficient when making decisions about which items to retain and exclude – a balancing act exists between the varying pieces of evidence.

Sample 3: Methods

Following these preliminary results, Sample 3, a larger heterogeneous sample, was used to evaluate additional internal and external evidence to further inform item selection. Factor loadings and residual errors were again used to examine item internal qualities. For external characteristics, we again leveraged the taxonomies by Sonnentag and Frese (2003) and Weiner et al. (2012), but different index constructs were selected to expand the nomological network under consideration and provide further evidence of concurrent validity.

Participants & Procedures

Participants were recruited using a peer-nomination web-based survey: 133 undergraduate students from eight geographically dispersed universities were trained on ethics relating to participant recruitment and on the inclusion criteria. Students distributed a standardized email invitation to working adults they personally knew and received nominal course credit for doing so. Email recipients were asked to complete the 15-min online survey; participation was voluntary. Participants were entered into a drawing for a nominal online retailer gift certificate. A total of 1132 respondents completed the survey. To facilitate analyses, and consistent with Sample 2, we excluded participants with missing engagement data (n = 34). The final sample (N = 1098) was 62.8% female and primarily Caucasian (63.7%), with an average age of 37.03 years (SD = 13.70) and organizational tenure of 6.51 years (SD = 7.84). On average, respondents worked 39.98 h/week (SD = 12.94).

Measures

Constructs selected to index stressors and outcomes are reported in Table 2. Participants were asked to consider the past month while responding to all items. We also collected demographic data on gender, age, employment status, work schedule, and tenure.

Sample 3: Results & Discussion

Item-level statistics for the UWES-9 are reported in Table 3. We again examined item factor loadings for the UWES-9 based on a three-factor CFA (AMOS 21; Arbuckle 2012); the model fit the data well [χ2(24) = 132.94, p < .001, CFI = .98, RMSEA = .06, SRMR = .03], with the three factors correlating strongly (r = .86 to .90, p < .001). Standardized factor loadings are reported in Table 3, as are modification indices. We again examined a single-factor model. This model fit the data [χ2(27) = 257.66, p < .001, CFI = .95, RMSEA = .09, SRMR = .04] but demonstrated worse fit than the three-factor model [∆χ2(3) = 124.72, p < .001].

We again used forward statistical regression; stressor results are reported in Table 4, and outcomes results in Table 5. Dedication1 again demonstrated the highest factor loading and was uniquely related to five of the index constructs. While Vigor1 demonstrated a slightly higher factor loading than Vigor2, Vigor2 still loaded highly and related systematically to five of the index constructs. Absorption1 again demonstrated the strongest factor loading and it related uniquely to four of the index constructs.

Short-Form Item Selection

Collectively, based on the three independent samples, wherein we leveraged systematically different methodologies and respondents, we identified three items for inclusion in the short-form engagement measure. While some subjectivity is involved here, our decisions were based on the observed (i.e., objective) data. To be as transparent as possible (and to address potential concerns regarding subjective item selection; Heggestad et al. 2019), Table 6 includes the selected items and a summary of the retention rationale for each. The abridged measure demonstrated adequate internal consistency (Sample 2, α = .85; Sample 3, α = .79) and correlated strongly with the overall UWES-9 (Sample 2, r = .95; Sample 3, r = .94, p < .001). Moreover, each item correlated strongly with the longer subscale of its respective dimension (Sample 2, .88–.92; Sample 3, .81–.90). These results suggest that not only is the abridged measure internally consistent, but it also overlaps almost entirely with the UWES-9: even after strategically removing two-thirds of the items, we still effectively capture the overall intended construct space (Smith et al. 2000). Of note, readers intending to use this short-form measure in practice should request permission from the authors of the original UWES (Schaufeli et al. 2002).
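The two statistics underpinning this claim, coefficient alpha for the three retained items and the part-whole correlation between the short-form and full-scale totals, are straightforward to compute; a self-contained sketch on simulated data (placeholder item names, not the actual UWES items) follows.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: (k/(k-1)) * (1 - sum(item variances)/var(total))."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated scores: nine "items" sharing one latent source of variance.
rng = np.random.default_rng(1)
latent = rng.normal(size=500)
full = pd.DataFrame({f"item_{i}": latent + rng.normal(scale=0.8, size=500)
                     for i in range(1, 10)})
short = full[["item_2", "item_4", "item_7"]]  # hypothetical retained items

print(round(cronbach_alpha(short), 2))
# Part-whole overlap: short-form total vs. full-scale total.
print(round(np.corrcoef(short.sum(axis=1), full.sum(axis=1))[0, 1], 2))
```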

Table 6 Justification for final item selection for the three-item measure of engagement

In terms of convergent validity, Table 7 reports bivariate correlations between the short- and long-form engagement measures and the predictor and outcome variables used in the item selection process for Sample 3. The bivariate correlations are consistently smaller for the short-form measure; however, the pattern of relationships between antecedents and outcomes is generally the same for both versions. In fact, the average bivariate correlation for the short-form measure is only .017 smaller than for the long form, resulting in an average difference in variance explained of .0003; per Smith et al. (2000), we would argue this is an acceptable reduction in validity relative to the number of items removed from the long-form measure.

Table 7 Study 1 Sample 3 convergent validity evidence: comparing the short-form to the long-form measure

Multiple-Groups Invariance Testing

Based on Sample 3 data, we conducted a series of multiple-groups CFA invariance tests based on gender (1 = men, 2 = women), age (1 = 18 to 27, 2 = 28 to 45, 3 = 46+ years of age), employment status [1 = part-time work (i.e., less than 35 h/week), 2 = full-time work (35+ h/week)], work schedule (1 = day shift, 2 = alternative shift schedule), and tenure (1 = less than 1 year, 2 = 1–1.99 years, 3 = 2–4.9 years, 4 = 5–9.9 years, 5 = 10+ years). For age and tenure, groups were recoded to produce approximately equal group sizes to facilitate analyses. Hours worked was coded based on the Bureau of Labor Statistics 35-h definition of full-time employment.
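This recoding is routine data preparation; for illustration only (simulated data, placeholder column names), the age bands and full-time cutoff described above might be implemented as follows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"age": rng.integers(18, 70, size=1098),
                   "hours": rng.integers(10, 60, size=1098)})

# Age bands used in the invariance tests: 18-27, 28-45, 46+.
df["age_group"] = pd.cut(df["age"], bins=[17, 27, 45, np.inf], labels=[1, 2, 3])

# BLS-style coding: 35+ hours/week = full-time (2), otherwise part-time (1).
df["employment_status"] = np.where(df["hours"] >= 35, 2, 1)

print(df["age_group"].value_counts().sort_index())
```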

A separate multiple-groups CFA was conducted for each grouping variable, following the recommendations of Vandenberg and Lance (2000). We first estimated an unconstrained model, subsequently testing for metric invariance by constraining item factor loadings to be equal across groups. Next, we tested for scalar invariance by constraining item intercepts to be equal across groups. Finally, we tested for invariant uniquenesses by constraining item residual variances to be equal across groups. As reported in Table 8, each model demonstrated acceptable fit, suggesting that the short-form measure demonstrates strict factorial invariance based on gender, age, employment status, work schedule, and tenure. Complete results, including full correlation tables by grouping variable, are available upon request.
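Each step in this sequence is evaluated by comparing nested models; a compact sketch of the decision logic follows (placeholder fit values, not those in Table 8; the ∆CFI cutoff is a common rule of thumb, not a value the authors report).

```python
from scipy.stats import chi2

# Placeholder fit statistics for the nested sequence of multiple-groups
# models: (label, chi-square, df, CFI).
models = [
    ("configural", 48.0, 48, 0.990),
    ("metric",     54.5, 54, 0.989),
    ("scalar",     61.2, 60, 0.988),
    ("strict",     70.1, 66, 0.987),
]

# Each model is compared with the preceding, less constrained one; invariance
# at a step is retained when the chi-square difference is non-significant and
# the CFI drop is negligible (a common rule of thumb is delta-CFI <= .01).
for (n0, c0, d0, f0), (n1, c1, d1, f1) in zip(models, models[1:]):
    d_chi2, d_df = c1 - c0, d1 - d0
    p = chi2.sf(d_chi2, d_df)
    print(f"{n0} -> {n1}: dchi2({d_df}) = {d_chi2:.1f}, "
          f"p = {p:.2f}, dCFI = {f0 - f1:.3f}")
```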

Table 8 Multiple-groups CFA invariance testing

Study 2

Study 2 was conducted to make two additional key contributions. First, across all three samples in Study 1, data were collected in conjunction with the excluded items (i.e., all items were administered simultaneously). As noted by Smith et al. (2000), when establishing validity evidence for a short-form measure, “key empirical evidence should not be based [strictly] on a sample in which the full, long-form was administered” (p. 107). Independent tests of the validity of a short-form measure are needed to ensure the measure continues to function as expected. Schaufeli et al. (2017) highlight this very issue with respect to their own measure, which was “not independently used from the UWES-9, so that its true reliability and validity is not yet fully understood” (p. 13). Our short-form development process overcomes this limitation.

Furthermore, an underlying argument we have put forth is that a valid short-form measure would be particularly advantageous in advanced research designs, including true longitudinal research (i.e., where data are collected on all constructs at a minimum of three time points; Ployhart and Vandenberg 2010). To this end, we administered the short-form measure in a longitudinal data collection (six waves of data with 2-week lags) to examine whether the measure continued to demonstrate acceptable psychometric characteristics (e.g., internal consistency), as well as to assess whether the measure demonstrates temporal invariance and predictive validity.

Specifically, we focused our analyses on understanding the dynamic relationship between engagement and well-being. We frame our hypotheses within the context of gain spirals as defined by conservation of resources theory (Hobfoll 1989). To date, however, the majority of the existing research examining related issues has been based on cross-sectional (e.g., Steele et al. 2012) or cross-lagged panel models (Hakanen and Schaufeli 2012), all of which take a between-person analysis approach. While between-person analyses are helpful, particularly cross-lagged panel models (Selig and Little 2012), we seek to contribute to the literature by examining whether an engagement/well-being gain spiral exists based on more stringent within-person analyses.

By way of background, research generally supports that more engaged workers report higher indices of well-being, including life satisfaction. A common argument is that engagement results in improved well-being (over time) through mechanisms such as gain spirals in personal resources, wherein conservation of resources theory is often evoked as the driving theoretical framework (Hakanen and Schaufeli 2012; Hobfoll 2011). Specific to life satisfaction, a quintessential index of well-being (Diener and Ryan 2009), much of the research supporting this claim has, as noted, been cross-sectional (e.g., Körner et al. 2012; Matthews et al. 2014; Steele et al. 2012). Existing over-time studies also support this conclusion (e.g., Hakanen and Schaufeli 2012; Reis et al. 2016). However, as noted, we take a more stringent approach to our analyses, applying a lagged fixed-effects structural equation model to examine whether, and the degree to which, changes in engagement predict future (lagged) changes in satisfaction for a given person.

  • Hypothesis 1: Within-person changes in engagement predict lagged within-person changes in life satisfaction.

Interestingly though, consistent with the larger stressor-strain literature (Ford et al. 2014), examination of potential reverse causality (i.e., that life satisfaction predicts engagement) has been limited. Nevertheless, an argument for reverse causation can be grounded in the same gain spirals argument put forth in extant research (e.g., Hakanen and Schaufeli 2012). That is, as a proxy for well-being, higher life satisfaction suggests that an individual has a larger pool of personal resources to draw from across multiple life domains (Diener and Diener 1996). This personal resource (i.e., greater life satisfaction) can in turn be invested to gain further resources, for example, at work (Hobfoll 2011). Put into the context of this study, and consistent with Principle 2 as well as Corollary 1 of conservation of resources theory (Hobfoll 1989, 2011), having more generalized resources, like life satisfaction, should facilitate accumulation of more domain-specific resources, like work engagement.

  • Hypothesis 2: Within-person changes in life satisfaction predict lagged within-person changes in engagement.

Method

Participants & Procedures

Participants were recruited from Amazon.com’s MTurk. Only U.S. participants who had previously completed at least 100 tasks with a 98% approval rating were permitted to participate. While Study 1, Sample 2 respondents were excluded from participating, we followed a similar screening method: respondents were required to work at least 24 h a week and be organizationally employed, with no more than 40% of their work being done at home. Participants were asked to complete the same survey six times, with a 2-week lag between assessments. Reminder emails were sent approximately 4 days after the initial invitation at each wave of data collection. Five validation questions (e.g., “Please leave this item blank”) were included to ensure effortful responding (again, participants were allowed to miss one attention check item; Huang et al. 2012; McGonagle et al. 2016). Participants were paid $1.20 for the first survey and $1.00 for each of the remaining surveys.

A total of 1506 participants were screened at Time 1. Of these, 667 did not meet our inclusion criteria and were excluded from participating; another 31 failed more than one of the attention check items. The remaining 808 were invited back to complete the remaining five surveys. Response rates for these five surveys ranged between 68.2% and 81.8%. To be retained for analyses, respondents had to complete at least two of the six surveys, remain employed over the course of the study with no major job changes, and demonstrate effortful responding on the remaining surveys. This resulted in an analysis sample of 627 that was 49.9% female and primarily Caucasian (75.9%), with an average age of 36.98 years (SD = 10.41) and position tenure of 5.24 years (SD = 4.89). On average, respondents worked 41.69 hours a week (SD = 6.77), and 74.0% worked a day shift. Of note, within the analysis sample, participants completed, on average, 5.21 (SD = 1.01) of the six surveys. Sample sizes, by survey wave, were as follows: Time 2 = 562; Time 3 = 555; Time 4 = 523; Time 5 = 496; Time 6 = 504.

Measures

Engagement was assessed with the short-form measure developed in Study 1 (see Table 6). Life satisfaction was measured with the 5-item Satisfaction with Life Scale (Diener et al. 1985). A sample item is, “In most ways my life is close to my ideal.” For both measures, participants were asked to consider the past 2 weeks when responding; responses were made on a 5-point agreement-based Likert scale.

Results & Discussion

Table 9 reports descriptive statistics. Independent of the excluded items from the larger original form (Smith et al. 2000), the short-form measure demonstrated good internal consistency at all six time points (as did the life satisfaction measure).

Table 9 Study 2 descriptive statistics

Temporal Invariance Testing

Next, we examined both measures for temporal invariance (Ployhart and Vandenberg 2010). We conducted baseline confirmatory factor analyses (CFA) for each construct separately. Within each CFA, items were used as indicators at each measurement assessment (Time 1 – Time 6), and the six latent factors were loaded onto a second-order latent factor. In turn, we sequentially tested for configural (i.e., loadings on the second-order factor were constrained to be equal), metric (i.e., item loadings across first-order latent factors were constrained to be equal), and scalar invariance (i.e., item intercepts were constrained to be equal), as well as for invariant uniquenesses (i.e., item uniquenesses were constrained to be equal; Vandenberg and Lance 2000). While full results are available upon request, the final constrained model for the short-form engagement measure fit the data [χ2(128) = 207.28, p < .001, CFI = .99, RMSEA = .03], as did the model for life satisfaction [χ2(398) = 703.52, p < .001, CFI = .99, RMSEA = .04]. Thus, both constructs demonstrated strict factorial invariance.
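These analyses were conducted in AMOS; for readers working in syntax-based SEM software, the second-order measurement model for the engagement short form could be specified roughly as below (lavaan-style syntax, as used by R’s lavaan and largely shared by Python’s semopy; the variable names are placeholders for the three short-form items at each of the six waves).

```python
# Lavaan-style measurement syntax embedded as a Python string. In lavaan,
# reusing a parameter label (v, d, a) constrains the corresponding item
# loadings to equality across waves (the metric step); intercept and
# residual-variance constraints are added analogously for the scalar and
# strict (invariant uniquenesses) steps.
model_desc = """
eng_t1 =~ v*vig_t1 + d*ded_t1 + a*abs_t1
eng_t2 =~ v*vig_t2 + d*ded_t2 + a*abs_t2
eng_t3 =~ v*vig_t3 + d*ded_t3 + a*abs_t3
eng_t4 =~ v*vig_t4 + d*ded_t4 + a*abs_t4
eng_t5 =~ v*vig_t5 + d*ded_t5 + a*abs_t5
eng_t6 =~ v*vig_t6 + d*ded_t6 + a*abs_t6
engagement =~ eng_t1 + eng_t2 + eng_t3 + eng_t4 + eng_t5 + eng_t6
"""
```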

Examination of Temporal Order

To examine the potential for dynamic interplay between engagement and life satisfaction, as predicted by conservation of resources theory and represented by Hypotheses 1 and 2, we applied a latent variable fixed-effects estimator approach following recommendations by Allison (2000, 2005), with maximum likelihood estimation. A fixed-effects method allows us to control for all unchanging (time-invariant) variables (e.g., personality, job design) and removes all between-individual variation. In effect, not only does this increase our confidence that the lagged (causal) relationship between engagement and life satisfaction is not the result of some extraneous variable, it also allows for examination of the effect in terms of within-individual changes (which is facilitated by our establishment of strict factorial invariance for both measures; Ployhart and Vandenberg 2010). Further, our large sample size (Ford et al. 2014) and overall number of assessments (allowing for repeated estimation of lagged effects) help ensure the observed relationships are not due to chance. Collectively, while not a randomized experiment, our approach affords a strong case for understanding causal order between the constructs (Shadish et al. 2002).

Given that both measures demonstrated strict temporal invariance, to reduce model complexity, constructs were modeled as directly observed (i.e., scale scores were used). While a full discussion of the set-up and estimation process for latent variable fixed-effects estimation within SEM is beyond the scope of this paper, interested readers are encouraged to see Ousey et al. (2011) for an applied example with an expanded discussion of Allison’s (2000, 2005) recommendations. However, the process can be effectively depicted, and to this end, unstandardized results for the latent variable fixed-effects model (and associated constraints) are reported in Fig. 1. As depicted, autoregressive and lagged effects (i.e., coefficients that share a superscript) were constrained to be equal across assessments to further reduce model complexity; to facilitate interpretation, correlations (depicted as double-headed arrows) are also reported. The model fit the data well [χ2(49) = 72.68, p = .02, CFI = .997, RMSEA = .028, SRMR = .025].
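To convey the within-person lagged logic (though not Allison’s latent-variable SEM estimator itself), the sketch below fits a simpler entity-fixed-effects regression of life satisfaction on lagged engagement and lagged life satisfaction, using simulated data and placeholder variable names (the linearmodels package is assumed).

```python
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

# Simulated panel: 627 people, six waves (variable names are placeholders).
rng = np.random.default_rng(3)
n, waves = 627, 6
df = pd.DataFrame({
    "pid": np.repeat(np.arange(n), waves),
    "wave": np.tile(np.arange(1, waves + 1), n),
})
person = rng.normal(size=n)  # stable between-person differences (e.g., personality)
df["engagement"] = person[df["pid"].to_numpy()] + rng.normal(scale=0.5, size=len(df))
df["lifesat"] = person[df["pid"].to_numpy()] + rng.normal(scale=0.5, size=len(df))

# Lag each variable within person; wave 1 rows (with no lag) drop out.
df = df.sort_values(["pid", "wave"])
df["eng_lag"] = df.groupby("pid")["engagement"].shift(1)
df["ls_lag"] = df.groupby("pid")["lifesat"].shift(1)
panel = df.dropna().set_index(["pid", "wave"])

# Entity (person) fixed effects absorb all time-invariant confounds, so the
# lagged coefficients reflect within-person change predicting later change.
res = PanelOLS.from_formula(
    "lifesat ~ eng_lag + ls_lag + EntityEffects", data=panel
).fit()
print(res.params)
```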

Fig. 1

Unstandardized results for the Study 2 latent variable fixed-effects model examining causal order between engagement and life satisfaction. Coefficients sharing a superscript were constrained to be equal across lags. Correlations are reported for all double-headed arrows

In addition to testing our hypotheses, our analyses reveal several noteworthy patterns. First, the non-significant autoregressive effect for engagement (B = .03, p > .05) suggests that once time-invariant predictors are controlled for (e.g., personality, job design), there is no lagged state dependency for engagement. Put another way, although the average autoregressive correlation reported in Table 9 is .74, once time-invariant predictors are accounted for, there is no evidence of an over-time relationship of engagement with itself; for a given individual, a change in engagement (increase/decrease) does not result in a subsequent change in engagement 2 weeks later for that individual. However, there is evidence supporting state dependency for life satisfaction: a one-unit increase in life satisfaction resulted in a .14 increase in life satisfaction 2 weeks later.

Specific to our hypotheses, we see a small, albeit significant, lagged effect of engagement on life satisfaction (B = .04, p < .05); Hypothesis 1 was supported. However, we do not see a lagged effect of life satisfaction on engagement (B = .04, p > .05); Hypothesis 2 was not supported. Collectively then, and in the context of conservation of resources theory as well as the larger well-being literature (e.g., Headey et al. 1991), our results support what might be termed a bottom-up gain spiral in terms of engagement and life satisfaction. That is, increases in engagement drove lagged increases in life satisfaction; gaining resources at work (in the form of engagement) resulted in respondents (based on our within-person analysis) reporting more generalized resources (in the form of increased life satisfaction). However, we did not observe evidence for a top-down gain spiral. Specifically, increases in generalized resources (i.e., life satisfaction) did not drive the accumulation of resources in the work domain (i.e., changes in engagement). We would suggest then that, given the methodological and analytical approach used here, both of which were facilitated by the application of our short-form measure, we have a clearer understanding of the lagged effects of engagement on life satisfaction.

General Discussion

The studies reported herein are motivated by the value of a rigorously validated measure of engagement in facilitating more coalesced research and practice. We present extensive evidence for the efficacy of our short-form measure across two studies and multiple data sources, using a multifaceted approach to validation. Collectively, we demonstrate that the short-form engagement measure is psychometrically sound in and of itself, as well as evidencing that it functions similarly to its parent item set. Subsequently, to further inform an appropriate understanding of this construct, ensure its representative measurement, and address the existing void in the literature, we established that the short-form measure is not only invariant across several demographic characteristics but also temporally invariant, a critical precondition for scholars seeking to understand “changes” in engagement over time. We elaborate on these issues next.

Theoretical and Practical Implications

Our research has important implications for engagement operationalization and measurement, both in and of itself and in terms of how engagement relates to other important constructs within and over time. Importantly, we found expected relations with relevant constructs, supporting concurrent validity (Study 1, Sample 3), and provided a more in-depth evaluation of engagement alongside life satisfaction (Study 2), filling a gap in a literature that has heretofore failed to examine the extent to which work engagement may impact employees’ broad well-being indices over time (or vice versa). This has implications for how we understand the engagement construct conceptually, as well as for the optimal design of future engagement research.

Our findings also have tangible practical implications insofar as they optimize practical and psychometrically sound engagement measurement. In Study 1, we provided strong empirical and theoretical justification for a valid three-item measure of engagement that, while derived from the most popular research-based engagement instrument, is also parsimonious enough for implementation in practice-based efforts. Such a contribution takes an important step toward closing the research-practice gap (e.g., Saks and Gruman 2014; Shuck et al. 2017). In presenting an instrument that can be used across research and practice domains alike, as well as across a variety of research designs (cross-sectional, longitudinal, frequent administrations, etc.), we equip scholars and practitioners with a crucial tool for aligning, and thereby laying the groundwork for, more research-informed engagement practice. In particular, the parsimony of this measure, combined with its theoretical soundness and empirical support, makes it broadly attractive to researchers and practitioners alike seeking to assess engagement. As such, this measure facilitates consistency and generalizability across studies, as well as benchmarking across organizations and industries, thereby allowing for more accurate comparative assessments than are currently provided by inconsistent measures of varying lengths and item wording – thus offering all those seeking to assess engagement a ‘common language,’ as we noted earlier, with which to do so.

Yet another practical implication of this research is the examination of the instrument’s measurement invariance across a number of salient demographic characteristics relevant in the organizational sciences. With both researchers and practitioners increasingly seeking to make comparisons across grouping variables (e.g., men vs women, 20-somethings vs baby boomers), it is critical to establish that any measure used in that way does not function differentially across the groups in question. As such, the present research’s determination that our short-form measure functions invariantly across the critical demographic characteristics – within a large heterogeneous sample – allows researchers and practitioners alike to make comparisons across these groups with an improved level of confidence.

The present series of studies also furthers engagement research on the whole, as well as optimizing its contextualization within the occupational health literature (Bakker et al. 2008). That is, whereas a majority of engagement research is cross-sectional or otherwise fails to account for the invariant nature of the construct over time (e.g., Kim and Beehr 2020), much of that research simultaneously suggests that improving engagement is critical because, among other things, engagement begets engagement in a sort of positive gain spiral such as that put forth in conservation of resources theory. However, preliminary evidence from Study 2 suggests that these are more complex issues than initially recognized. First, we do see evidence for a bottom-up gain spiral; changes in engagement drove (lagged) changes in life satisfaction. That is, gaining resources at work (i.e., becoming more engaged) resulted in perceived changes in generalized resources (i.e., life satisfaction). However, gaining more generalized resources (i.e., life satisfaction) did not drive domain-specific resource accumulation in the form of (lagged) changes in work engagement – there was no evidence for a top-down gain spiral (Headey et al. 1991).

Yet, arguably even more interesting as related to gain spirals is that changes in engagement, at least based on our data, are not related to future changes in engagement at the within-person level. That is, once time-invariant predictors are controlled for (e.g., personality, job design), there was no lagged state dependency for engagement; for a given individual, a change in engagement (increase/decrease) did not result in a subsequent change in engagement 2 weeks later. Specific to conservation of resources theory, our data do not support the argument that engagement begets more engagement over time. These results, collectively, are critical because they raise meaningful questions about the theoretical underpinnings of engagement functioning. Specifically, with conservation of resources theory serving as a popular theory in which to ground engagement research, our findings suggest there are meaningful boundary conditions around the extent to which the gain spiral central to this and other (e.g., broaden-and-build) popular positive organizational behavior theories is an optimal frame within which to view engagement.

As such, our research provides an important baseline understanding of engagement’s invariant nature – an understanding critical in informing future change-focused studies of engagement, as well as practical considerations such as the value of interventions and other such attempts (e.g., job re-design, job crafting; Kuijpers et al. 2020) to promote engagement over time. In this way, longitudinal designs such as that undertaken herein are central to accurately modeling engagement’s true functioning as well as the directionality of its relationship with well-being indices over time, and the short-form measure of engagement that we establish and validate herein greatly facilitates such longitudinal studies without compromising soundness of measurement.

Limitations and Future Directions

Despite the wide-reaching benefits of the short-form measure, it is important to note that the three-item measure does limit the extent to which the instrument can be partitioned into the separate engagement components (vigor, dedication, absorption). It has been argued that such an approach is not psychometrically optimal, or that it may be variably acceptable depending upon the circumstance or construct (e.g., Fisher et al. 2016). That said, the engagement construct in particular has been the subject of several assessments of factorial structure, with a number of researchers (e.g., Christian et al. 2011; Sonnentag 2003) determining that the unifactorial measure often functions better than the tripartite assessment. As such, while an important consideration for interpretation and usage, we do not believe that using the short-form measure for a unifactorial interpretation of engagement is detrimental. On the contrary, in addition to the aforementioned implications for research and practice, the measure also provides an important opportunity for future research to leverage unique designs and rigorous, longitudinal measurement that is largely impossible with the current longer engagement measures, or even with the existing abbreviated measures that lack sufficient validity evidence. In this way, this measure presents a uniquely practicable opportunity for research and practice to align in their assessment of this critical construct, which has been deemed one of the most important in organizational science in recent years (Saks 2017).

Conclusion

We provide extensive and multifaceted validity evidence supporting the psychometric utility of a short-form measure of engagement that aligns with the predominant operationalization of this popular construct. Not only does this short-form measure allow researchers to more confidently conduct complex designs, but its parsimonious length stands to make noteworthy strides in diminishing the ever-problematic scientist-practitioner gap regarding engagement. In this way, this instrument has the potential to begin rectifying a central concern consistently lamented by scholars – that academics and practitioners are failing to measure the same construct, and, relatedly, that an optimal measure for consistent use across research and practice does not yet exist. The measure outlined herein intends to serve that end.