Before investing in any type of wellness and population health management program, those responsible for demonstrating its success must be realistic about the outcomes that can be achieved and over what period of time (Weltz 2009).

Introduction

Return on investment (ROI) analysis is typically presented as the gold standard for evaluating employee wellness program (EWP) outcomes. While some long-established EWPs may earn positive ROIs, this goal may not be realistic for small and mid-size employers offering new programs. Actuarial analysts argue that wellness programs require investments during early years, prior to earning returns in subsequent years (Fitch 2008). Specification of the program goal and the evaluation metric is a salient issue for employers initiating new programs, because those employers may face contract renewal or termination decisions after two or three years of program operation. Instead of measuring ROI, these employers could rely on the United States Preventive Services Task Force (USPSTF) recommendations, which identify preventive measures for which there is “high certainty” that a net benefit will be generated over time (USPSTF 2013), and focus firm-level analysis on vendor management issues, such as the ability of the vendor to: (i) induce participation that aligns with program goals, and (ii) induce those participants to increase their investments in individual health production activities. The Disease Management Association of America (DMAA) advises employers to evaluate both ROI and vendor management measures; however, this does not address the question of whether ROI is a realistic measure for new programs. We use health claims and employment data to evaluate a new EWP offered by a mid-size employer, and we find that the two metrics yield conflicting results:

  • The program successfully recruited a broad spectrum of employees to participate, and it successfully induced short-term behavior change, as manifested by increased preventive screening, and

  • Health care expenditures and absenteeism did not decrease.

We conclude that unrealistic reliance on ROI as the gold standard could lead employers to terminate viable programs, simply because it is premature to attempt to measure ROI.

EWPs have become ubiquitous among large employers (Fitch and Pyenson 2008), but smaller employers have not followed suit (O’Donnell 2010). The Patient Protection and Affordable Care Act (Section 10408) attempts to close this gap, by authorizing $200 million for short-term grants to small employers that initiate new comprehensive wellness programs. This raises the question: how will these employers evaluate the new programs when it is time to make a decision to renew the initial vendor contract, solicit bids from additional vendors, or terminate the program? Specification of the evaluation criteria may play a key role in determining whether these programs continue after the initial vendor contract expires. The published literature offers three views on the evaluation strategy.

Published evaluations of EWP outcomes typically focus on ROI estimates. For example, Nicholson et al. (2005) stress the importance of ROI analysis to guide decisions to invest in employee health, and they provide pragmatic strategies to facilitate estimation of this ROI. Serxner et al. (2006) analyze methodological options for estimating ROI, and provide recommended guidelines. In addition, vendor organizations provide online wellness program ROI calculators that offer to compute expected program impacts, based on basic information such as numbers of benefits-eligible employees, and the level and growth-rate of healthcare expenditures. (See, for example, http://www.wellnessonline.com/about/what-we-offer/return-on-investment/). Finally, a recent meta-analysis concludes that some EWPs have successfully generated net savings after a few years of operation (Baicker et al. 2010). These results are supported by evidence indicating that almost one-fifth of healthcare expenditures are attributed to ten individual health risk factors—that could potentially be modified through EWP efforts (Goetzel et al. 2012).

In contrast, Fitch (2008) argues that “Looking for a financial ROI from medical claims savings is the wrong approach”, because wellness programs require short-term expenditures to support healthy behaviors such as screenings for cancers or chronic conditions, while the potential benefits of these investments may accrue over years—or decades. Similarly, the USPSTF concludes that there is “high certainty” that most of the screenings included in this study will generate moderate or substantial net benefit over time; however, they do not indicate that positive net benefits can be expected in the short run. In addition, Pyenson and Zenner’s (2005) actuarial analysis of the outcomes of cancer screening programs concludes that cancer screening programs generate net costs during the initial years, followed by medium-term net savings. These authors also provide an example of a screening program that is expected to prevent five deaths for an employee group of 50,000. This poses two serious issues for firms that initiate new programs by signing two-year or three-year vendor contracts. First, contract renewal/termination decisions for new programs will be based on short-term results. Second, demonstrating a positive ROI is a tougher hurdle for EWPs implemented by small and mid-size employers, because the screening costs are clearly measurable, but the benefits cannot be measured with precision. Thus, the outcomes measurement question is particularly salient for small and mid-size firms initiating new programs.

Employer survey results are also substantially less positive than the results summarized by Baicker et al. (2010). Fitch and Pyenson (2008) report that only 14 % of employers that offered incentives for behavior change actually observed a positive return on that investment. Similarly, the 2010 Buck Survey reports that only a subset of firms estimate program impacts, and only 45 % of those firms reported that the programs generated net savings (Buck Surveys 2010). Serxner et al. (2009) analyze factors that may underlie this employer skepticism, and conclude that the factors that influence a program’s ROI are complex; employers should therefore engage in ongoing program monitoring and vendor management.

The Disease Management Association of America (2007) addresses this issue by advocating measurement of both intermediate targets (such as behavior change) and the outcomes measures that support ROI estimation. Some analysts implement this two-pronged strategy—and report that the evaluated EWP successfully met both criteria (Loeppke et al. 2008, 2010). However, neither the DMAA nor Loeppke et al. (2008, 2010) address the issues raised by Pyenson and Zenner (2005), who argue that achieving an ROI greater than one is an unrealistic expectation for new programs. If the published literature creates pressure for managers to demonstrate that a firm’s EWP is generating a positive ROI, before renewing a vendor contract or extending this employee benefit beyond the initial contract period, this unrealistic expectation could lead to termination of EWPs that are—in fact—useful. This raises the question of whether these employers should be advised to focus on evaluating the EWP success in recruiting participants and inducing behavior change, while relying on larger studies to provide evidence that the behavior changes are likely to generate long-term savings.

Gross (2012) provides an overview of the tradeoffs between outcomes measures and process measures such as cancer screenings and chronic disease management. While his discussion focuses on measuring the quality of medical care, his analysis is relevant for evaluating wellness programs. He argues that final outcomes measures have generally-accepted “face validity”, but they present three weaknesses. Broad measures of overarching outcomes (such as mortality or healthcare cost) cannot be measured in the short-run, they are influenced by an array of observable and unobservable socioeconomic, genetic, and environmental factors, and these measures do not identify actionable quality-improvement issues. On the other hand, process improvement measures can be analyzed in the short-run and they can support improvement efforts—to the extent that the process measures are linked to the program goals. In this framework, the quality of the national-level evidence linking wellness and preventive activities to long-term health outcomes is the central issue.

We analyze data provided by one mid-size employer with a new EWP, to examine the implications of the ambiguous advice that employers should consider both ROI and behavior change as program evaluation criteria. We use four years of administrative data from a mid-size employer that was facing a decision to either continue or terminate its three-year-old EWP. We examine both types of program-outcome indicators, and we find that an evaluation focused on the impacts of the EWP on healthcare claims and absenteeism would support a decision to terminate the program, because participation in the EWP is not associated with decreases in either of these variables. In contrast, an evaluation focused on the impacts of the EWP on intermediate targets (such as employee engagement and participant behavior change) would support a decision to continue to offer the EWP, because EWP participation is associated with increased rates of recommended health screenings, and participation is spread broadly across demographic groups. The decision to continue or terminate the EWP hinges on the initial specification of the key evaluation metric.

Given the low current adoption rate of EWPs among small and mid-size employers, and federal efforts to induce these entities to initiate new programs, identification of useful metrics for evaluating new programs is a salient issue.

Data and variables

We analyze health claims and encounter data, EWP participation data, and employee data provided by a mid-size employer. This employer signed a three-year contract, to initiate an employee EWP in the third quarter of 2006, and the employer began preparing to make the contract-renewal decision midway through the third year. The employer requested a program evaluation to support the upcoming decision to: renew the three-year contract with the current vendor, solicit bids from new program vendors, or terminate the program. This evaluation was conducted midway through year 3, to allow time to solicit bids and complete the process of contracting with a new vendor in the event that the employer decided to change vendors. This contract renewal decision was, therefore, based on two full years of post-implementation data. (The dataset also includes the third year of post-implementation data. This data was added after the contract decision was made, to assess whether an additional year would have affected the results.)

This EWP, which was the first wellness program offered by this employer, included four components: employees were encouraged to complete an online Health Risk Assessment (HRA), attend the annual employee Health Fair, participate in a class (a one-time event), and/or participate in a campaign (a series of events). While the classes and campaigns addressed general issues of healthy behaviors (e.g. compliance with recommended screening, diet, exercise and stress management behaviors), they particularly emphasized diabetes prevention and management. The employer aimed to increase employee inputs into the health production function, by encouraging employees to (i) increase compliance with recommended screenings for chronic conditions and for cancer, (ii) strengthen individual efforts to prevent and manage chronic conditions, and (iii) adopt healthy behaviors. During the year prior to the implementation of the EWP, employees who elected to enroll in the Health Maintenance Organization (HMO) plan incurred a $10 co-payment for all wellness activities, including screenings for chronic conditions and cancers. For employees who elected fee-for-service (FFS) coverage, the PPO paid the first $250 of expenditures for wellness visits. Beyond that $250, the employee paid a $10 co-payment for wellness activities that occurred in a primary care setting, and a 20 % co-payment for wellness activities (such as the cancer screening tests) that did not occur in a primary care setting. Both types of employees were eligible for free blood sugar and cholesterol tests and $25 osteoporosis screenings at the annual employee Health Fair. The employer considered modifying the HMO and FFS plans, so that both sets of employees faced reduced payments for wellness healthcare visits; however, this was not feasible due to issues that arose during the collective bargaining process.
Hence, the out-of-pocket expenditures incurred by each set of employees (HMO members and FFS enrollees), for wellness visits and the associated screening tests, remained constant before and after implementation of the EWP.

Monetary incentives for participation were minimal. During the first and second years, small prizes (e.g. Starbucks gift cards or raffle tickets for an iPod) were offered for completing the Health Risk Assessment and participating in EWP events. The employer specified that evaluation of the program should be conducted by independent analysts. Based on published literature, the employer anticipated that this analysis would conclude that the program would generate a positive ROI during the initial three-year contract period. The EWP was terminated after four years, in response to recession-induced financial pressures.

We use four years of data, from the third quarter of 2005 through the second quarter of 2009, which provides one full year of “before” data and three years of “after” data. (The final year of data became available only after the employer actually made the contracting decision; we provide year-by-year estimation results to assess whether the results are sensitive to the number of years included in the analysis, and whether the specification of the decision criteria would have been less critical had the employer delayed the decision for a year.) The EWP roll-out started in the third quarter of 2006, and the roll-out process continued for several months. The dataset includes the 2,425 unique individuals who were age 65 or younger during the “before” period, employed during the year prior to implementation of the EWP, and represented in the data with observations for all study variables.

We use data from three sources: employee data (wages and hours absent) was provided by the employer, health claims data was provided by the third-party entity that processes claims for the employees who elected FFS coverage, and encounter data was provided by the HMO that provides care to employees who elected to join the HMO. The HMO provided data that was formatted to be comparable to claims data, with costs included for each healthcare encounter (based on the reimbursement rates at which the HMO contracted with providers).

We use difference in difference (DD) with propensity score matching, to examine the impact of program participation on the overall goals of reducing both healthcare costs and absenteeism, and on the intermediate target of inducing individuals to obtain recommended health screenings. We define variables to measure: (i) program participation, (ii) individual demographic, health insurance and health characteristics, and (iii) program outcomes.

Variable definitions

We define four types of variables, to measure program participation, individual demographic characteristics, individual diagnoses, and program outcomes.

Program participation

A zero-one indicator variable identifies whether each individual participated in any component of the EWP program during any of the three years of program operation: 993 individuals (41 %) participated, while 1,432 (59 %) did not. We use the most liberal definition of program participation: an individual who participated in any component of the program during any program year is categorized as a participant. Therefore, our estimates of program impacts may have a downward bias but the likelihood of upward bias is low.

Demographic characteristics

The dataset includes observable demographic characteristics: age, marital status, weekly pay, and gender. As shown in Table 1, average age, marital status and weekly pay were similar for the participant and non-participant groups: the average age of the participants was 44, while the average age of non-participants was 45 years; 64 % of individuals in both groups were married; weekly pay was $1,101 for participants and $1,119 for the non-participants. In contrast, gender is associated with participation: 72 % of the participants were women, compared with only 50 % of the non-participants. We also include a health insurance variable, to indicate whether each individual elected to enroll in the HMO or the FFS plan. About 44 % of participants were enrolled in the HMO, compared to 46 % of non-participants. Table 1 provides t statistics to test whether the means of these variables are significantly different for the participant versus non-participant groups. Participants are significantly more likely to be female, below-median age, below-median wage, and enrolled in the FFS plan.

Table 1 Variable definitions and descriptive statistics for the year prior to EWP implementation

Diagnoses

We include two types of diagnosis variables, to indicate whether each individual had a condition, prior to implementation of the EWP, that was: (i) potentially impacted by the EWP, or (ii) associated with high healthcare expenditures that are not prevented by an EWP. The program vendor identified five diagnoses for which EWP participation can potentially help individuals prevent or manage the condition: diabetes, mental health conditions, bone and joint conditions, hypertension, and asthma. The incidence of diabetes (which was the key diagnosis targeted by the program) among participants was 29.6 %, compared with 31.1 % among non-participants. In addition, the incidence of diagnoses indicating mental health conditions was 17.3 % among participants and 16.6 % among non-participants; the incidence of diagnoses indicating bone/joint conditions was 41.4 % among participants and 43.4 % among non-participants; the incidence of hypertension was 25.2 % among participants and 29.7 % among non-participants, and the incidence of asthma was 5.9 % among participants and 6.3 % among non-participants (see Table 1).

The EWP vendor also identified high-cost conditions that are not expected to be impacted by the EWP: cancer, pregnancy, hepatitis and HIV/AIDS. We used ICD-9 codes included in the claims and encounter data, to indicate whether each individual had any of these diagnoses during the years prior to initiation of the wellness program. In the pre-EWP period, 4.5 % (4.1 %) of the participants (non-participants) had diagnoses indicating cancer, 3.0 % (3.0 %) of the participants (non-participants) were pregnant, and 0.5 % (1.1 %) of the participants (non-participants) had hepatitis.

Program outcomes

We define two overall outcomes measures: healthcare costs and absenteeism. We denote the cost information provided by the HMO and the FFS plans as “healthcare costs”, and we state these costs in constant 2005 dollars, using the medical care component of the Consumer Price Index. The absenteeism variable measures the number of hours per year each individual was absent from work for either sick leave or unpaid time-off; paid vacation days were not included in this variable. The average health care cost for the participants was $3,063 in year 1, compared to $3,689 for the non-participants; participants averaged 83.6 absent hours in year 1, compared to 96 hours for non-participants. These differences are statistically significant, suggesting that the EWP attracted relatively low-cost and low-absenteeism individuals. This initial situation contrasts with four of the six studies of healthcare costs reported by Baicker et al. (2010) in their second category of studies (which reported evaluations of programs for which participation was voluntary). Average pre-program health care costs were higher in the participant group, compared with the non-participant group, in four of these studies; the costs were roughly comparable in the other two studies. Baicker et al. (2010) also report evaluations of absenteeism for 11 studies in which participation decisions were voluntary. In these 11 studies, pre-program absenteeism was higher among participants, compared with non-participants, in 3 studies, lower in 5, and comparable in 3.
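The conversion to constant 2005 dollars described above is a simple deflation by the medical-care CPI. A minimal sketch, using hypothetical index values rather than the actual BLS series:

```python
# Deflate nominal healthcare costs to constant 2005 dollars using the
# medical-care component of the CPI. Index values below are hypothetical
# placeholders, not the actual BLS series.
MEDICAL_CPI = {2005: 100.0, 2006: 104.0, 2007: 108.5, 2008: 112.2}

def to_2005_dollars(nominal_cost, year, cpi=MEDICAL_CPI, base_year=2005):
    """Convert a nominal cost incurred in `year` to constant base-year dollars."""
    return nominal_cost * cpi[base_year] / cpi[year]

# A $3,000 claim incurred in 2007, expressed in 2005 dollars
real_cost = to_2005_dollars(3000.0, 2007)
```

Dividing by the index ratio removes medical-price inflation, so year-to-year cost comparisons reflect changes in utilization rather than price growth.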

We also examine the impact of the EWP on short-term “behavior change”. We focus on behaviors that can be observed in the health claim/encounter data, to avoid the problems intrinsic to HRA data (the data are self-reported, and they are only available for the subset of individuals who self-select to complete the HRA). The range of behaviors that can be measured in claims data is currently limited; however the range of behaviors that will be visible in administrative data will increase as providers increasingly use electronic medical records systems. The claim/encounter data are unlikely to be completely accurate for all individuals, but there is no a priori reason to expect the inaccuracy to vary systematically with participation (vs. non-participation) in the EWP program (Sing 2004). For example, the incidence of breast cancer and osteoporosis screening reported in our data is zero for the year prior to implementation of the wellness program, but these screening rates are greater than zero for both participants and non-participants in the three subsequent years. There appears to have been a change in the system for coding and billing for these tests, which coincided with implementation of the EWP. While this specific coding issue is unique to this dataset, the issues posed by changes in coding practices are endemic to claims/encounter data.

We focus on the impact of program participation on the probability that an individual will obtain screenings for indications of chronic conditions (blood glucose, cholesterol and osteoporosis) and screenings for cancer (mammogram and Pap test for females only, prostate screening for males only, and colonoscopy), because the HRA and the Health Fair encouraged employees to obtain recommended screenings. Because the vendor’s programming specifically emphasized diabetes prevention and management, we hypothesize that the program may be particularly likely to generate increased screenings for diabetes. The recommended schedules for these screenings may vary across the type of screening and the individual’s gender, health risk and age; hence we estimate separate regressions for each screening, and we control for gender, pre-existing diagnoses, and age in the computation of DD with propensity score matching.

The probabilities of obtaining some of the screenings may not be independent (screenings for all three of the chronic conditions could be obtained at the Health Fair); hence we also construct three screening indices to indicate whether each individual obtained at least one screening for a chronic condition, cancer relevant for males (prostate or colon), or cancer relevant for females (breast, cervical or colon).

The pre-program screening rates, for most of the screening tests, were similar for participants and non-participants: 12.3 % of male participants and 13.4 % of male non-participants received prostate cancer screening, 48.1 % of female participants and 42.3 % of female non-participants received cervical cancer screening, 6.1 % of participants and 5.7 % of non-participants received colorectal cancer screening, 2.0 % of participants and 1.7 % of non-participants received diabetes screening, and 24.1 % of participants and 25.2 % of non-participants received cholesterol screening. The difference between the cervical cancer screening rate for participants (0.481) and for non-participants (0.423) is statistically significant; the other differences are not significant (see Table 1).

Normalized differences of the independent variables that will be used in the propensity score matching equation

To test whether the participant and non-participant groups have enough similarity in the pre-EWP period to support additional analysis, we compute the normalized difference for each variable (Imbens and Wooldridge 2010). The normalized differences [which are reported in Appendix Table 9 (column 5)] are equal to the difference between the average for each variable among participants and the average among non-participants, divided by the square root of the sum of the two variances. We check whether the normalized difference for each variable exceeds the critical value of 0.25. This critical value was initially suggested by Imbens and Rubin (2007) for linear regression methods; however, it is also useful for non-linear analyses. Only one of the normalized differences (gender) exceeds the critical value of 0.25. In addition, most of the normalized differences for our data are lower than the normalized differences reported by Imbens and Wooldridge (2010) after they used the matching techniques proposed in Rubin (2006) and Imbens et al. (2001). In our implementation of propensity score matching, we will impose a common support restriction; however, these numbers already suggest that common support will not be an issue in our sample. In other words, our sample meets the criteria defined by Imbens and Wooldridge (2010) for producing “credible and robust estimates”, with the possible exception of the fact that women are more likely to participate in the EWP than men. We address this issue by reporting estimates for the full sample, and for separate subsamples of males and females.
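The normalized-difference calculation described above can be sketched as follows; the covariate values are illustrative, not the study sample:

```python
import math

def normalized_difference(x_treat, x_control):
    """Normalized difference as defined in the text: the difference in
    group means divided by the square root of the sum of the two
    sample variances."""
    m1 = sum(x_treat) / len(x_treat)
    m0 = sum(x_control) / len(x_control)
    v1 = sum((x - m1) ** 2 for x in x_treat) / (len(x_treat) - 1)
    v0 = sum((x - m0) ** 2 for x in x_control) / (len(x_control) - 1)
    return (m1 - m0) / math.sqrt(v1 + v0)

# Illustrative values for one covariate (not the study data)
participants = [44, 46, 43, 45, 44]
nonparticipants = [45, 44, 46, 45, 47]
nd = normalized_difference(participants, nonparticipants)

# Flag the covariate if it exceeds the 0.25 rule-of-thumb critical value
needs_attention = abs(nd) > 0.25
```

Unlike a t statistic, the normalized difference does not grow with sample size, which is why it is preferred for assessing covariate balance.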

Difference in difference computation—means

Table 2 presents the DD estimates (in means) for the outcomes measures for the pre-EWP year (year 1 in the data) and for the first, second and third years of the EWP (years 2, 3 and 4 in the dataset). By year 3, EWP participation is associated with statistically significant increases (at the 5 % level) in the proportions of individuals screened for prostate cancer, diabetes and cholesterol. By year 4, EWP participation is also associated with a statistically significant increase in absenteeism. The increased number of screenings may account for some of the increase in hours absent from work. However, regression to the mean may also be occurring: participants logged fewer absent hours prior to EWP implementation, and the gap narrowed during the three EWP years.

Table 2 Difference in difference computation for means

These results contrast with the DD (in means) estimates reported by Baicker et al. (2010). While these authors do not report standard errors for the DD estimates, they report negative EWP impacts on cost for 5 of the 6 cost studies, and negative EWP impacts on absenteeism for all 11 of the studies that report absenteeism numbers.
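The DD-in-means computation underlying Table 2 can be sketched as follows, with illustrative 0/1 screening indicators rather than the study data:

```python
# Difference in difference (in means): the before-vs-after change for
# participants minus the same change for non-participants. Data are
# illustrative 0/1 screening indicators, not taken from Table 2.
def dd_in_means(treat_before, treat_after, ctrl_before, ctrl_after):
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(treat_after) - mean(treat_before)) - (
        mean(ctrl_after) - mean(ctrl_before)
    )

dd = dd_in_means(
    treat_before=[0, 0, 1, 0],  # 25 % of participants screened pre-EWP
    treat_after=[1, 1, 1, 0],   # 75 % screened post-EWP
    ctrl_before=[0, 1, 0, 0],   # 25 % of non-participants screened pre-EWP
    ctrl_after=[0, 1, 1, 0],    # 50 % screened post-EWP
)
# dd = (0.75 - 0.25) - (0.50 - 0.25) = 0.25
```

Subtracting the non-participant change nets out time trends (such as the coding change noted earlier) that affect both groups equally.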

Methods

Even though our analysis of the descriptive statistics indicates that the participant and non-participant groups are roughly comparable, this analysis of DD (in means) does not address the key counterfactual question: Would the participant’s behavior have been the same, had he not participated? We use propensity score matching to construct this counterfactual, using the appropriate weights, and then report results for DD propensity score matching. Stata 10 was used to complete the analysis. We designate individuals who participated in the program as the treated group and those who did not participate as the control group.

Propensity score matching estimates

The first step is to estimate propensity scores. We use a Probit model to compute a propensity score for every individual in the sample, as a function of the pre-EWP demographic, health status and absenteeism characteristics of the 2,425 individuals included in our sample. We use two types of variables to measure pre-EWP health status. First, we include the pre-EWP healthcare expenditures as a measure of overall health status. However, Eichner et al. (1997) show that year-to-year variability in healthcare expenditures is substantial; hence we also include variables to measure the diagnoses that are targeted by the EWP and two diagnoses (cancer and hepatitis) with high costs that are not expected to be impacted by an EWP. We also included a variable to indicate pregnancy, because costs incurred for pregnancy and delivery are not expected to continue over multiple years.

We estimate propensity scores for the full sample and for four sets of split samples. The subsamples are split by gender (male vs. female), age (younger than the median age of 45 vs. 45 and older), income (annual wage below the median of $54,080 vs. above this level), and insurance type (managed care vs. FFS). The split sample results allow the employer to assess whether the vendor’s programming successfully appeals to the entire spectrum of employees. Some analysts argue that employers should also consider whether the EWP should be designed to target individuals with either high (or low) healthcare expenditures to maximize the productivity of the EWP (Edington 2009). We do not split the samples by baseline healthcare expenditures, however, because the impacts of such a targeting strategy cannot be observed during the initial years of a new EWP.

Table 3 reports the estimation results for the propensity score equation for the full sample. The likelihood ratio \(\chi ^{2}\) statistic is 183.4. Women, older employees, FFS participants, individuals with a diagnosis of diabetes, and individuals with relatively low rates of absenteeism were significantly more likely to participate in the EWP. The impact of a diabetes diagnosis on participation is consistent with the fact that the program vendor emphasized diabetes prevention and management. While the univariate analysis presented in Table 1 indicated that older individuals were significantly less likely to participate in the EWP, the sign is reversed in the multivariate analysis presented in Table 3.

Table 3 Propensity score matching equation: estimation results

The minimum value of the predicted propensity scores is 0.07 and the maximum is 0.72, so the estimated scores largely satisfy the rule-of-thumb overlap criterion (propensity scores between 0.1 and 0.9) suggested by Crump et al. (2008). The distributions of the estimated propensity scores for the participant and non-participant groups, shown in Fig. 1, indicate that common support is not an issue for the analysis reported here because the treatment and control groups are comparable.

Fig. 1 Overlap of propensity scores for the non-participant (control) and participant (treated) samples
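The scoring and overlap check can be illustrated as follows. The coefficients and covariates are hypothetical placeholders (not the Table 3 estimates); the Probit propensity score is the standard normal CDF evaluated at the linear index:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def propensity_score(x, beta):
    """Probit propensity score: P(participate | x) = Phi(x'beta)."""
    return norm_cdf(sum(xi * bi for xi, bi in zip(x, beta)))

# Hypothetical coefficients for (intercept, female, age/10, diabetes);
# purely illustrative, not the estimated Table 3 coefficients.
beta = [-1.2, 0.6, 0.1, 0.4]
sample = [
    [1, 1, 4.4, 0],  # 44-year-old woman without diabetes
    [1, 0, 4.5, 1],  # 45-year-old man with diabetes
    [1, 0, 6.2, 0],  # 62-year-old man without diabetes
]
scores = [propensity_score(x, beta) for x in sample]

# Crump et al.-style rule-of-thumb overlap check
in_overlap = [0.1 < p < 0.9 for p in scores]
```

In practice the coefficients are estimated by Probit maximum likelihood (Stata in this study); the sketch shows only how fitted coefficients map covariates into scores and how the overlap rule is applied.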

We re-estimate the propensity score equation for split samples using age, annual pay, type of health insurance, and gender (see Table 4). The significant impacts of gender and absenteeism on participation in the full sample are observed in all of the subsamples. However, the impact of a diagnosis of diabetes is concentrated among women, FFS participants, individuals who are relatively young and individuals with above-median income.

Table 4 Propensity score matching equation: probability of participation dependent variable: participation indicator variable

We used the Becker and Ichino (2002) pscore command to check that the balancing criterion is satisfied. (This criterion uses a t test of whether the treatment and control group covariate means are equal within each propensity score block. For our study, the participants are the treatment group, and the non-participants serve as the control group.) After verifying that the balancing criterion is satisfied, we use kernel matching to permit full use of the sample observations. For each participant, this procedure assigns weights to each non-participant based on the difference between the two propensity scores. We completed this procedure for the full sample, and for each of the sub-samples. We use the psmatch2 command written by Leuven and Sianesi (2003) to generate the matching estimates. The common support restriction was imposed in all cases. We also use bootstrapped standard errors based on 200 replications, because analytical standard errors are known to be inaccurate for matching estimators.
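A minimal sketch of kernel matching on propensity scores, using a Gaussian kernel and toy data; this illustrates the weighting idea only, not the psmatch2 implementation (which also handles the common support restriction and bootstrapping):

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_matching_att(treated, controls, bandwidth=0.06):
    """Average treatment effect on the treated via kernel matching.

    `treated` and `controls` are lists of (propensity_score, outcome)
    pairs. Each participant is compared with a weighted average of all
    non-participant outcomes, with weights that decline as the gap
    between the two propensity scores grows.
    """
    effects = []
    for p_t, y_t in treated:
        weights = [gaussian_kernel((p_c - p_t) / bandwidth) for p_c, _ in controls]
        total = sum(weights)
        counterfactual = sum(w * y for w, (_, y) in zip(weights, controls)) / total
        effects.append(y_t - counterfactual)
    return sum(effects) / len(effects)

# Toy (propensity score, annual cost) pairs -- not the study data
treated = [(0.40, 3000.0), (0.55, 2800.0)]
controls = [(0.38, 3100.0), (0.42, 3200.0), (0.56, 2900.0)]
att = kernel_matching_att(treated, controls)
```

Because every non-participant contributes (with some weight) to each participant's counterfactual, kernel matching uses the full sample rather than discarding unmatched observations, which is the motivation given in the text.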

The estimated propensity scores were used to compute DD estimates of the impacts of EWP participation on overall outcomes and short-term screening behaviors. The DD method compares the before-vs-after differences in outcomes for participants with the comparable before-vs-after differences in outcomes for non-participants. Thus, the impacts of unobserved time-constant individual characteristics are differenced-out.
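The DD-with-matching logic can be written compactly: difference each person's outcome over time, then compare treated changes with kernel-weighted control changes. The toy data below are hypothetical, constructed so that the known effect is recovered exactly.

```python
import numpy as np

def dd_matching_estimate(y_pre, y_post, d, weights):
    """DD estimate on matched data.

    weights[i, j] is the kernel weight linking treated unit i to control
    unit j (rows sum to one). Differencing y_post - y_pre removes
    time-constant individual effects before the groups are compared.
    """
    dy = np.asarray(y_post) - np.asarray(y_pre)          # within-person change
    dy_treated = dy[np.asarray(d) == 1]
    dy_matched = np.asarray(weights) @ dy[np.asarray(d) == 0]
    return float(np.mean(dy_treated - dy_matched))

# Toy check: a known effect of +2 for treated units is recovered exactly,
# even though baseline (fixed-effect) levels differ across individuals
fe = np.array([5.0, 1.0, 3.0, 4.0, 2.0, 0.0])    # individual fixed effects
d = np.array([1, 1, 1, 0, 0, 0])
y_pre = fe
y_post = fe + 2.0 * d                             # true treatment effect = 2
w = np.full((3, 3), 1 / 3)                        # uniform matching weights
print(dd_matching_estimate(y_pre, y_post, d, w))  # → 2.0
```

Note that a simple cross-sectional comparison of y_post levels would be contaminated by the fixed effects; the within-person differencing is what removes them.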

However, neither the matching nor the DD methodology addresses unobserved individual characteristics that vary over time. Hence, the matching estimates can only be interpreted as evidence of a causal effect under the assumption of unconfoundedness. This assumption holds when there is no selection (into the participant group) based on unobservable time-varying characteristics. For example, if some of the EWP participants were already planning to make behavioral changes (prior to program participation), these behavior changes would occur concurrently with EWP implementation, but the relationship would not be causal. In this case, the DD propensity score matching method may indicate a statistically significant association, but we could not conclude that the EWP generated the behavior change. Unfortunately, the presence (or absence) of this type of selection on time-varying unobservables (such as intention to change behavior) cannot be tested directly. However, a falsification test is available to provide an indirect test of the unconfoundedness assumption (Smith and Todd 2005; Imbens and Wooldridge 2010).

The falsification test estimates a pseudo treatment effect when the treatment effect is expected to be zero (because there was no known treatment at that time). We implement the falsification test using the first six quarters of data: the first three quarters serve as the “pseudo before” period (2005:Q3–2006:Q1), and the next three quarters serve as the “pseudo after” period (2006:Q2–2006:Q4). Thus, the “pseudo after” period includes one quarter of data prior to program implementation and the first two quarters of data observed after the official program-implementation date. This generates a conservative test, in the sense that the test may be biased toward finding a significant effect if the program generated measurable impacts during the first two quarters. This issue is minor in this case, however, because the program vendor indicated that the program was not fully operational until the end of these two quarters.

We estimate whether a measurable “treatment effect” occurred between these two periods, and we hypothesize that the estimated treatment effect should be zero because no treatment was delivered during the “pseudo after” period. On the other hand, if some individuals were already making changes before the EWP was implemented (and if these individuals selected into the participant group), the estimated “treatment effect” would be positive. Thus, an insignificant estimated treatment effect indicates that the data do not support the hypothesis that concurrent changes were occurring. We use DD propensity score matching to estimate the pseudo treatment effects. The results of the falsification test are presented in Table 5. None of the estimated treatment effects are statistically different from zero; hence the results of the falsification test are consistent with the unconfoundedness hypothesis.
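The placebo logic can be sketched as follows, with hypothetical quarterly data constructed to have a common time trend and no treatment, so the pseudo effect is exactly zero.

```python
import numpy as np

def placebo_dd(y_q, d, pre_q, post_q, weights):
    """Pseudo treatment effect across two pre-program windows.

    y_q: (n, T) array of quarterly outcomes; pre_q / post_q list the
    "pseudo before" and "pseudo after" quarters. Under unconfoundedness
    the estimate should be close to zero.
    """
    dy = y_q[:, post_q].mean(axis=1) - y_q[:, pre_q].mean(axis=1)
    dy_treated = dy[d == 1]
    dy_matched = weights @ dy[d == 0]   # kernel-weighted control changes
    return float(np.mean(dy_treated - dy_matched))

# Toy data: individual fixed effects plus a shared trend, no treatment
fe = np.array([1.0, 2.0, 3.0, 4.0])           # individual fixed effects
trend = np.arange(6.0)                         # shared quarterly trend
y_q = fe[:, None] + trend[None, :]
d = np.array([1, 1, 0, 0])
w = np.full((2, 2), 0.5)                       # matching weights (rows sum to 1)
print(placebo_dd(y_q, d, [0, 1, 2], [3, 4, 5], w))  # → 0.0
```

If future participants were already changing behavior during the pre-program quarters, their changes would outpace the matched controls' and the pseudo effect would be positive rather than zero.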

Table 5 Falsification test results

However, we note that Imbens and Wooldridge (2010) caution that a zero treatment effect in this falsification test does not imply that the unconfoundedness assumption is true; instead, this result simply indicates that the assumption is more plausible. Without a valid instrument or a regression discontinuity design, it is impossible to argue that the estimated treatment effect is indeed causal (Sekhon 2009).

Finally, the coefficients for the seven screening variables (presented for the full sample for each year, and for the full time period for each subsample) could potentially exhibit bias (toward rejecting a null hypothesis that is true) due to multiple comparisons. We use two strategies to address this issue. First, we use the Bonferroni method to adjust for multiple comparisons (Pfaffenberger and Patterson 1987). This method is appropriate because decisions to obtain some of the screenings, particularly screenings available at the Health Fair (osteoporosis, blood sugar, blood pressure), may not be independent. However, this method provides a conservative result, because it focuses on the probability that any one of the seven results is a false positive. Therefore, we also construct three indexes of screening behavior. The first index counts the number of screenings obtained by each individual, among those available at the Health Fair (up to three). The second index counts the number of cancer screenings obtained by each male (colorectal and prostate), and the third counts the number of cancer screenings obtained by each female (colorectal, breast and cervical).
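Both strategies are simple to compute. The sketch below shows the Bonferroni per-test level for seven outcomes and the count-based index construction; the 0/1 screening indicators are hypothetical.

```python
import numpy as np

# Bonferroni: judging k = 7 screening outcomes at an overall 5 % level
# means each individual test is evaluated at alpha / k
alpha, k = 0.05, 7
per_test_alpha = alpha / k
print(f"per-test significance level: {per_test_alpha:.4f}")

# Index construction: count screenings per person. Columns are
# hypothetical 0/1 indicators: osteoporosis, blood sugar, blood
# pressure (Health Fair), then colorectal and prostate (cancer).
screens = np.array([
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
])
health_fair_index = screens[:, :3].sum(axis=1)   # up to 3 per person
male_cancer_index = screens[:, 3:].sum(axis=1)   # up to 2 per male
print(health_fair_index, male_cancer_index)
```

The indexes aggregate the correlated screening decisions into a single outcome per person, avoiding the per-screening multiple-comparison problem entirely.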

Results

Difference-in-difference estimates: full sample

Table 6 reports the DD matching estimates for the comprehensive measures of healthcare cost, absenteeism, the three indices of screening behavior (chronic conditions, male-relevant cancers and female-relevant cancers), and the individual screening tests for the full sample. We report separate estimates for the first, second and third year of program operation.

Table 6 Difference in difference estimate of the impact of program participation with propensity score matching

In the full sample, participation is not associated with changes in healthcare cost or absenteeism in any of the post-treatment years. The coefficients of healthcare cost are positive for all three years; however, they are not statistically significant at conventional levels. Two factors could potentially contribute to the fact that these coefficients are not statistically significant. First, it is well known that healthcare cost distributions are typically highly skewed; hence a small number of extreme observations may be generating relatively high variance. One way to circumvent this problem is to exclude individuals in the top 5 % of the expenditure distribution. Second, employee turnover could be adding noise to the sample. We estimated the effect of participation on a restricted sample that excluded: (i) employees with the top 5 % of healthcare costs in each year and (ii) individuals who were not employed for the full four-year observation period. The DD results, with propensity score matching, for the outcome measures are presented in Appendix Table 10. We see a small (statistically significant at 5 %) increase in healthcare expenditure in the third year of the program for the restricted sample; however, the overall results are roughly comparable to the full sample results.
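The restricted-sample construction amounts to two filters, sketched here with synthetic data (lognormal costs mimic the skew described above; all parameter values are hypothetical).

```python
import numpy as np

# Hypothetical panel: yearly healthcare cost and a four-year tenure flag
rng = np.random.default_rng(2)
n_emp, n_years = 200, 4
cost = rng.lognormal(mean=7.0, sigma=1.0, size=(n_emp, n_years))  # skewed
full_tenure = rng.binomial(1, 0.85, size=n_emp).astype(bool)

# (i) flag employees in the top 5 % of spenders in any year
top5 = (cost >= np.quantile(cost, 0.95, axis=0)).any(axis=1)
# (ii) keep only employees observed for the full four-year period
keep = full_tenure & ~top5
print(f"retained {keep.sum()} of {n_emp} employees")
```

Trimming the top 5 % each year (rather than once, over pooled costs) removes employees with an extreme year even if their other years are typical, which is why the restricted sample can shift the year-specific estimates.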

We also assess whether the program successfully achieved the short-term goal of inducing healthy behavior changes, such as increased participation in healthcare screenings. The full sample results indicate that program participation is associated with increased screening for chronic conditions and cancers relevant for females during each of the three years. It is also associated with increased screening for cancers relevant for males during the second and third years. During the second year, which was most salient for the employer’s contract decision, EWP participation is associated with a 9 percentage point increase in the rate of cancer screening for males, an 11 percentage point increase for females, and a 10 percentage point increase in the rate of screenings for chronic conditions (see Table 6).

The results for the specific screening tests are also presented in Table 6. To test the significance of the estimates for specific screenings, we apply the Bonferroni correction for multiple comparisons. This is a conservative criterion that focuses on the probability that a false positive result will occur on at least one test, in a set of tests. We apply this criterion for the set of individual screenings that might occur in a specific year; this implies that the critical t statistic for a test at the 5 % level is t = 2.45.

Using the Bonferroni significance criterion, there is no statistically significant effect of participation on cancer screenings during the first year. The results also indicate that participation in the wellness program is associated with increased screenings for cholesterol during the first program year (this screening was available during the employee Health Fair). (We should note that, since breast cancer and osteoporosis screenings were not coded in the pre-treatment period, our results for these outcomes are essentially cross-sectional matching estimates.)

During the second year of program operation, EWP participation is significantly associated with increased likelihood of screenings for prostate cancer (by 11 percentage points) and breast cancer (by 13.2 percentage points). Program participation is also associated with a 10.4 percentage point increase in diabetes screening and a 9.8 percentage point increase in cholesterol screening. The third year of program participation was associated with similar increases in screenings. The magnitudes of the estimated impacts increased from the first year to the second year as the program matured; they then remained stable from year 2 to year 3.

Distribution of EWP program impacts across demographic subgroups

To assess whether the wellness program benefits are clustered among demographic subgroups, we report results for four sets of split samples. We use the first three sets of split samples to assess whether EWP impacts are distributed across demographic groups that are relevant for assessing whether the vendor is successfully inducing participation across the employee population. The fourth split, by insurance type, permits assessment of potential interactions between the HMO/FFS systems for providing and funding healthcare, and the EWP incentives.

The DD estimates are detailed in Table 7 for the four sets of split samples for the third year of program operation. A significant positive impact on healthcare cost is observed in the subsamples with older individuals and FFS health insurance. (In addition, the magnitudes of the cost impacts are significantly larger for individuals with FFS coverage than for individuals enrolled in the HMO, and for older individuals compared with younger individuals. This is consistent with the hypothesis that the EWP is conceptually more likely to impact individuals enrolled in the FFS plan, rather than the HMO, because HMOs may offer some overlapping disease prevention/management features. Alternately, this result may reflect pricing issues.) Statistically significant impacts of participation on testing for diabetes and cholesterol are observed in all of the split samples, even after the Bonferroni correction (i.e., critical t statistic is 2.45 for 5 % level of significance). The impacts of participation on some cancer screenings are concentrated among specific subgroups: the increased probability of prostate screening is concentrated in the subsamples with above-average age, above-median income and HMO enrollment, while the increased probability of colorectal cancer screening is concentrated among older employees and employees with FFS health insurance.

Table 7 Difference in difference estimate of the impact of program participation sub-sample results for year 3

These results indicate that recruitment and program design are reasonably well-aligned for this program, in the sense that individuals with pre-existing diabetes are more likely to participate than individuals who did not have this diagnosis prior to participation, and EWP participation is associated with increased rates of blood sugar testing. However, the fact that the impact of EWP participation is significantly associated with insurance type for colorectal screening and blood sugar testing (in opposite directions) raises questions about interactions between EWP design and the design of the health insurance plans offered to employees. Similarly, the fact that the impact of participation on the most costly screening (for colorectal cancer) is concentrated in the subgroup with above-median income (after controlling for age in the propensity score equation) could indicate a potential program design issue. After reviewing the results, for example, the subject employer indicated that it would use the evaluation results to assess the convenience of program events (both timing and location) for employees who work on diverse shifts and at diverse locations. This strategy, using firm-specific EWP evaluation information to strengthen the program design, is potentially useful: the Aon (2011) survey of 1,000 employers shows that approximately half (52 %) use firm-specific data to influence wellness program design “significantly” or “moderately”, while 19 % do not use firm-specific data for this purpose at all.

Impact of increased propensity to obtain screenings on healthcare expenditures

The results reported above indicate that EWP participation is associated with: (i) significant increases in the screening rates for cancers and chronic conditions and (ii) positive (insignificant) coefficients for the healthcare cost variable. This pair of results raises the question: what proportion of the positive coefficient on the cost variable can be attributed to the new screenings? To address this question, we regressed healthcare cost on the screening variables, controlling for demographic and health characteristics, to estimate the statistical impact of screenings on cost in our data. These estimated coefficients include the cost of both the screening and additional procedures that may be triggered by positive test results. We used ordinary least squares estimation; hence it is possible that these estimated coefficients are statistically biased, due to omitted variables and/or correlation with unobserved (especially time-varying) characteristics. The coefficients of the demographic and health status variables have reasonable signs and significance (see Appendix Table 11). However, none of the screening variable coefficients are significant. Nonetheless, the estimates provide a useful benchmark. The results, reported in Table 8, indicate that the cost of the increased screenings, induced by participation in the EWP, accounts for approximately 46 % of the estimated year-3 coefficient on the healthcare cost variable. This suggests that the initial cost of inducing increased screening compliance is an important consideration when choosing an evaluation metric.
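The attribution arithmetic in Table 8 has a simple structure: multiply each screening's per-screening cost coefficient by the estimated participation effect on that screening rate, sum, and compare with the cost coefficient. All dollar figures and the cost coefficient below are hypothetical placeholders, not the paper's Table 8 values; the rate increases echo the second-year estimates quoted above.

```python
# Back-of-the-envelope attribution with hypothetical inputs: the
# screen_cost values and year3_cost_coef are invented for illustration.
screen_cost = {"prostate": 120.0, "breast": 150.0,
               "diabetes": 40.0, "cholesterol": 30.0}    # $ per screening
rate_increase = {"prostate": 0.110, "breast": 0.132,
                 "diabetes": 0.104, "cholesterol": 0.098}

# Expected cost per participant attributable to induced screenings
induced_cost = sum(screen_cost[s] * rate_increase[s] for s in screen_cost)

year3_cost_coef = 100.0     # hypothetical year-3 DD cost coefficient ($)
share = induced_cost / year3_cost_coef
print(f"induced screening cost per participant: ${induced_cost:.2f} "
      f"({share:.0%} of the cost coefficient)")
```

The useful feature of this decomposition is that it separates the mechanical cost of the screenings themselves from any residual cost change, which is the portion relevant to a long-run ROI argument.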

Table 8 Computation of the estimate of the cost impact of additional screenings

Conclusion

These results indicate that selection of the evaluation criteria may determine the result of the program evaluation—and the decision to continue or terminate the program. Much of the academic literature on wellness programs focuses on ROI as the key metric, and concludes that EWPs can generate net savings. In contrast, employer survey results indicate that many programs do not achieve this result, and some actuarial analysts argue that short-term employer-specific analyses should focus—instead—on the short term targets of employee engagement and behavior change. We analyze data from one mid-size employer, for the first three years of the EWP, using propensity score matching to compute DD estimates of the impact of participation on employee screening decisions. We find that the estimated impact of the program on healthcare cost is insignificant, but positive: participation in the program is associated with an (insignificant) increase in healthcare expenditures. This result is consistent with the actuarial evidence, indicating that preventive screening programs require an initial investment—with the promise of generating savings in later years. However, we also find that the program vendor is managing the program well, in the sense that the program successfully induced 40 % of the employees to participate, and it successfully generated increased screening rates among the participants. Coupled with the USPSTF conclusion that most of these screening tests are likely to generate net benefits (possibly, over a period of years), the vendor’s success in inducing increased screening participation indicates that the EWP is likely to generate benefits over time. The individual’s current employer may enjoy a portion of those benefits, if the organization continues to offer health insurance, and employee turnover is low.

The employer’s decision to either terminate or continue the program, therefore, hinges on the specification of the evaluation criteria. If the employer’s goal is to achieve a positive ROI within a few years after implementing an EWP, these results suggest that the program should be terminated. In contrast, the employer will continue the program if it (i) relies on externally-generated evidence to demonstrate that compliance with wellness and prevention recommendations will generate long-term savings, and (ii) focuses firm-specific analysis on the impact of the program on employee wellness behaviors.