Introduction

Services research findings are most useful to practitioners when they specify the type of person for whom an intervention has been found to be effective (Chambless and Hollon 1998; Kraemer et al. 2002; Miklowitz and Clarkin 1999; Rothwell 2005; Wells 1999). Findings become specific when a research sample is precisely defined or limited to a single diagnostic group, as is the case in many therapy and pharmaceutical trials (e.g., Jensen et al. 1999; McBride et al. 2006). However, heterogeneous samples are also needed to test the efficacy of treatments for complex patients who do not fit clearly into standard demographic or diagnostic categories (Ackerman 1999; Blankertz 1998; Ruscio and Holohan 2006). Sample homogeneity precludes heterogeneity, and yet common sense suggests that a blend of these two approaches would be advantageous for assessing the effectiveness of an intervention for different types of people.

Homogeneous samples, by definition, have minimal variability, so homogeneity increases confidence in findings (internal validity) by minimizing the likelihood that intervention outcome differences can be explained by differences in participant characteristics. However, most services research studies need large heterogeneous samples to maximize generalizability (external validity), so services researchers are legitimately concerned that variation in sample characteristics across experimental conditions could pose a threat to the validity of findings. Random assignment will not assure equitable allocations of characteristics across experimental groups unless a sample is very large (Krause and Howard 2003). To safeguard internal validity, services researchers typically compare intervention samples on commonly measured characteristics, and then statistically control for any detected differences. If a control variable is a significant predictor of outcomes, this means that participants with this particular characteristic had overall better (or worse) outcomes, assuming that all other measured characteristics are held constant. Statistically controlling for outcome-related sample characteristics does not identify for whom each intervention was most effective, but it does allow researchers to assume that sample differences between experimental conditions did not account for overall intervention differences in outcomes.

Unfortunately, the statistical control of participant characteristics can instill false confidence in the validity of research findings, curtailing a search for other possible alternative explanations. Differential effectiveness can also compromise internal validity if it is not taken into account. This is especially true of null findings because an absence of overall significant differences in intervention outcomes can mask the fact that one service was more effective for certain clients, while the comparison service was more effective for others (e.g., Bickman et al. 1999; King et al. 2000). Likewise, it is important to check whether differential effectiveness could explain statistically significant differences in outcomes whenever one experimental intervention is designed to benefit some clients more than others, and that type of client is prevalent (or underrepresented) in the total study sample (Bühringer 2006). For instance, if one experimental program has medical staffing, while a comparison program does not, we would expect the program with medical staff to have better health-related outcomes if the study sample has many individuals with physical health problems. Even if random assignment creates comparable experimental groups, a high percentage of unhealthy individuals in both conditions would favor the medical program. Likewise, screening out applicants with health problems during recruitment would give the non-medical program an unfair advantage if health problems were prevalent in the study population. Neither comparability in sample characteristics across experimental conditions, nor the statistical control of detected sample differences, is sufficient to ensure internal validity. Main outcome findings can also be explained by variations in service effectiveness that reflect with whom and how each program was intended to be effective.

Methods for Testing for Differential Service Effectiveness

A preferred method for testing whether a service was more or less effective for particular types of clients is to enter variable-by-intervention interaction terms as predictors in a multivariate analysis of variance or ‘moderated multiple regression’ analysis (Aiken and West 1991; Stone-Romero and Anderson 1994). For instance, one program might be expected to be more effective than another program for older, physically unhealthy people, and this hypothesis could be tested by adding ‘age-by-intervention’ and ‘health-by-intervention’ interaction terms to the analysis. To test the more specific hypothesis that a program was less effective for older individuals with health problems, an ‘age-by-health-by-intervention’ interaction term would also be needed.
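The interaction terms described above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names (`age`, `health`, `program`), the toy records, and the helper `build_moderator_terms` are hypothetical, and mean-centering is applied as one common convention for continuous moderators.

```python
from statistics import mean

def build_moderator_terms(rows):
    """Add centered age/health scores and their products with program assignment."""
    age_mean = mean(r["age"] for r in rows)
    health_mean = mean(r["health"] for r in rows)
    coded = []
    for r in rows:
        age_c = r["age"] - age_mean          # centered age
        health_c = r["health"] - health_mean  # centered health severity
        prog = r["program"]                   # 1 = one intervention, 0 = the other
        coded.append({
            "program": prog,
            "age": age_c,
            "health": health_c,
            "age_x_program": age_c * prog,
            "health_x_program": health_c * prog,
            "age_x_health": age_c * health_c,
            "age_x_health_x_program": age_c * health_c * prog,
        })
    return coded

# Hypothetical toy records, not study data
toy = [
    {"age": 30, "health": 2.0, "program": 1},
    {"age": 50, "health": 6.0, "program": 0},
    {"age": 40, "health": 4.0, "program": 1},
]
terms = build_moderator_terms(toy)
```

Each resulting record would supply one row of predictors for a regression; the three-way product is meaningful only when the two-way products and main variables are entered alongside it.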

Alternatively, a heterogeneous sample could be disaggregated into relatively homogeneous subgroups based on participant characteristics expected to moderate intervention effectiveness (e.g., King et al. 2000; Pettinati et al. 2000; Uehara et al. 2003). For instance, a sample might be disaggregated into subgroups defined by age and health, so that individuals with commonly associated characteristics are grouped together within each experimental intervention (e.g., ‘older, unhealthy subgroup-by-intervention’ interaction term). Subgroups can be defined using variable cut-points (e.g., median scores), ranked categories, or category combinations (e.g., older women). Alternatively, if a sample is large, statistical techniques, such as cluster analysis, can be used to group individuals who share the same constellation of characteristics.

Variable and subgroup-based analyses are both viable strategies for testing hypotheses about differential service effectiveness when a sample is relatively homogeneous, and/or individuals fall into distinct, comparably sized groups based on one or two key variables related to intervention success. When a sample is very heterogeneous, and especially when each individual can be characterized by several related characteristics, subgroup analyses would appear to be more meaningful and statistically powerful than analyses based on variables (Aguinis and Stone-Romero 1997). This is because complex individuals who have co-occurring characteristics must be depicted using complex ‘higher order’ interaction terms (e.g., ‘age-by-health-by-substance use’), each of which requires the additional inclusion of not only the main variables (e.g., age, health, substance use), but also ‘lower order’ interaction terms that together represent all possible variable combinations (e.g., ‘age-by-health,’ ‘age-by-substance use,’ ‘health-by-substance use’). By contrast, a single interaction term is sufficient for depicting this same level of complexity in a subgroup-based analysis (‘older age, poor health, low substance use’ versus all participants without this combination of attributes), and any number of unique subgroups can be compared as long as each individual is assigned to a single subgroup. For this reason, subgroup analyses appear to be particularly advantageous for service programs designed to serve individuals who have multiple co-occurring disorders or dual diagnoses (Batstra et al. 2002; Bekker 2003; Beutler et al. 1996; Kraemer et al. 2001).
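The difference in model complexity can be made concrete by enumerating the terms each approach requires. The sketch below assumes a fully hierarchical model in which every lower-order combination is entered; the helper name `hierarchical_terms` and the variable labels are illustrative only.

```python
from itertools import combinations

def hierarchical_terms(variables):
    """List every main effect and interaction implied by the highest-order term."""
    terms = []
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append("-by-".join(combo))
    return terms

# Variable-based test of one complex client profile across programs
variable_terms = hierarchical_terms(["age", "health", "substance use", "program"])

# Subgroup-based test of the same profile: one subgroup indicator and the program
subgroup_terms = hierarchical_terms(["subgroup", "program"])

print(len(variable_terms), len(subgroup_terms))
```

With three client variables and a program indicator, the variable-based model needs 15 terms, whereas the subgroup-based model needs 3, only one of which is an interaction.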

Role of Program Theory in Hypothesis-formulation

Tests of differential effectiveness should always be program-specific and designed to refine program theory, rather than pursued through exploratory analyses. Atheoretical analyses that rely on trial-and-error explorations, and statistical methods that capitalize on covariation (e.g., stepwise regression), will almost always identify one or more types of client who did especially well or poorly in a particular intervention, but these findings will very likely be due to chance alone. Hypotheses derived from program theory will yield more practical and valid insights into service effectiveness because they specify and limit the number of planned analyses. Fortunately, service manuals and intervention descriptions abound with assumptions about who should benefit most and why, and these assumptions are easily translated into testable hypotheses prior to data analysis (Howell and Peterson 2004; Stout and Hayes 2005; West and Aiken 1997).

Overview of Present Study

In this article, we use an existing dataset collected for a randomized controlled trial of supported employment to compare the relative sensitivity of four methods of testing for differential service effectiveness: (1) continuous variables, (2) categorical variables, (3) subgroups based on categorical variables, versus (4) cluster analysis-identified subgroups. We then reinterpret our study’s previously published main findings (Macias et al. 2006) in light of these post hoc subgroup analyses to illustrate how tests of predicted variations in service effectiveness can help to refine program theory. In our example, we pay close attention to the relative effectiveness of our two experimental programs for providing supported employment services to adults with severe mental illness who also have physical health problems and/or substance use disorders that might limit job attainment. One intervention was a vocationally integrated program of assertive community treatment (PACT; Allness and Knoedler 1998; Frey 1994; Stein and Test 1980), which is a mobile team providing out-of-office psychiatric care, help with daily living, crisis intervention, substance use treatment, and medical care, in addition to supported employment services. The other intervention was a facility-based clubhouse modeled on Fountain House in Manhattan (Anderson 1998; Beard et al. 1982) that, in keeping with international clubhouse standards (Propst 1992), provided no medical services or substance use treatment, but offered case management, social support, supported housing, supported education, supported employment, and a workplace environment designed to encourage members to relinquish a patient identity and return to a normal life (Propst 1992). 
Based on these service model characteristics, we hypothesized in our original application for grant funding that PACT would be most vocationally effective for adults with severe mental illness who had chronic physical health problems or severe substance use, whereas the clubhouse would be most effective for those who were relatively healthy with no severe substance use.

Methods

Study Design

Data for these analyses were from a long-term services evaluation project conducted in Worcester, Massachusetts, from 1996 to 2001 (Macias et al. 2006). This randomized controlled trial assigned adults with serious mental illness (N = 177) to one of two community-based psychiatric rehabilitation interventions following procedures approved by the McLean Hospital IRB. In both multi-service programs, staff trained in supported employment (Bond et al. 2001; Trach 1990) worked closely with other staff to ensure rapid placement of participants into mainstream jobs not reserved by employers for individuals with disabilities.

Sample Description

Study applicants were recruited in 1996–1998 from 42 local organizations, and through posted flyers, radio, and newspapers. Any individual over age 18 was eligible if she or he was unemployed and had a clinician diagnosis of schizophrenia spectrum disorder, bipolar disorder, or recurrent severe depression, but no diagnosis of severe mental retardation. One enrollee crossed over to the unassigned service, and employment data were unavailable for two others. The remaining sample (N = 174) was heterogeneous and similar to larger epidemiological samples within the same state (Jones et al. 2004) in demographics and health problems (Dickey et al. 2002), as well as in mortality rate (Dembling et al. 1999).

Work-related Grouping Variables

Investigation of PACT and clubhouse differences in vocational effectiveness focused on four potentially disabling factors known to limit employment in psychiatric populations: psychiatric symptom severity (Anthony et al. 1995; Chwastiak et al. 2006; Goldberg et al. 2001; Razzano et al. 2005; Regenold et al. 1999; Slade and Salkever 2001), physical health problem severity (Dixon et al. 2001; Druss et al. 2000; Razzano and Hamilton 2005; Razzano et al. 2005), older age (Burke-Miller et al. 2006; Cook et al. 2001; Goldberg et al. 2001; Mueser et al. 2001; Wewiorski and Fabian 2004), and substance use (Lehman et al. 2002; Razzano et al. 2005). Gender is not predictive of work among adults with serious mental illness (Burke-Miller et al. 2006); ethnicity was restricted in this Massachusetts study sample (98% Caucasian).

These four client variables were measured concurrently with employment across the 1996–2001 data collection period (rather than only at baseline) to allow identification of chronic conditions that might continuously or sporadically prevent or hinder employment, including the onset or worsening of conditions after study enrollment (Batstra et al. 2002; Kraemer et al. 2006). Health conditions, psychiatric symptoms, and substance use tended to be persistent, with no discernable patterns of temporal variation after first incidence that would suggest service-related changes may have mediated program outcomes.

Psychiatric symptoms were measured as total scores on the Positive and Negative Syndrome Scale (PANSS; Kay et al. 1987) averaged across all interviews completed during each participant’s first 30 months in the project (median: 6 scores); subscale scores were equally weighted. Interviewers were trained by Lewis Opler, MD and had high inter-rater reliability (Salyers et al. 2001). Physical health problems were identified through open-ended PANSS probes, as well as Medicaid claims and interviewer observations, and each chronic or permanent health problem was assigned the least severe ICD-9 diagnostic code that fit the medical description. Each condition was then coded for severity using the Chronic Illness and Disability Payment System, which is based on actual treatment costs for a large multi-state sample of Medicaid recipients (Kronick et al. 2000). Physical health problem severity scores were the sum of estimated annual costs for each participant’s most costly physical condition within each of 14 diagnostic categories (Jones et al. 2004). Substance use disorders were identified through clinician reports, interviews, and treatment records, and coded 0 (minimal or none), 1 (moderate), or 2 (severe). A moderate rating indicates any clinician report of severe dependence or treatment lasting more than 5 days; a severe rating indicates recurrent, life-disrupting substance abuse.
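The scoring rule for physical health problem severity (the most costly condition within each diagnostic category, summed across categories) can be sketched as follows. The category names and cost figures are hypothetical placeholders, not actual Chronic Illness and Disability Payment System values, and `severity_score` is an illustrative helper.

```python
def severity_score(conditions):
    """Sum the highest estimated annual cost within each diagnostic category.

    conditions: iterable of (diagnostic_category, estimated_annual_cost) pairs.
    """
    top_cost = {}
    for category, cost in conditions:
        # Keep only the most costly condition seen so far in this category
        top_cost[category] = max(cost, top_cost.get(category, 0.0))
    return sum(top_cost.values())

# Hypothetical participant with two cardiovascular conditions and one pulmonary
participant = [
    ("cardiovascular", 1200.0),
    ("cardiovascular", 800.0),
    ("pulmonary", 500.0),
]
score = severity_score(participant)
```

Under this rule, a second, less costly condition in the same category does not raise the score, so the measure reflects breadth across categories rather than the raw count of diagnoses.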

Methods for Disaggregating the Total Sample into Independent Subgroups

In addition to studying variations in PACT and clubhouse work outcomes for participants high and low on each of these four client characteristic variables, we divided the sample into subgroups that took into account co-variation in the four characteristics. Two methods were compared.

Median Splits on Grouping Variables

We first created sample subgroups by dichotomizing each of the four grouping variables based on median scores, and examining cross-tabulations for these variable groupings. The intent was to assign each individual to a specific category, so that subgroups would be independent. Our procedures were admittedly arbitrary, but logical, and we set a goal of at least 30 individuals per group. We first examined the four categories created by a cross-tabulation of age (older, younger) and physical wellness (healthy, unhealthy) categories. Only 4 individuals fell into the older age, healthy category, so we placed these 4 into the younger, healthy category and relabeled it healthy (n = 35). The remaining older, unhealthy (n = 38) category was adequate in size, but the younger, unhealthy category was large enough (n = 101) to disaggregate based on the other two grouping variables, substance use and psychiatric symptoms. There was a low positive association between youth and substance use, so we created high substance use (n = 41) versus low or no substance use (n = 60) subgroups within the younger, unhealthy category. We then further divided the young, unhealthy, low substance use subgroup based on the median score for psychiatric symptoms, creating a young, psychiatrically ill subgroup (n = 30) that was low on substance use and a young, physically ill subgroup (n = 30) that was low on both substance use and psychiatric symptoms. ANOVA validation tests confirmed that these five subgroups differed (P < .001) in ways reflected in the subgroup labels.
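The assignment rules described above amount to a short decision sequence. The sketch below is one reading of those rules, assuming each participant has already been dichotomized on the four variables; the function names and the `young, high substance use` label are illustrative, and the sample-wide cut-point in `above_median` is a simplification of the stepwise splits.

```python
from statistics import median

def above_median(values):
    """Dichotomize scores: True where a score exceeds the median of the list."""
    cut = median(values)
    return [v > cut for v in values]

def assign_subgroup(older, unhealthy, high_substance_use, high_symptoms):
    """Place one participant in exactly one of five independent subgroups."""
    if not unhealthy:
        # The few older, healthy individuals were folded into this group
        return "healthy"
    if older:
        return "older, unhealthy"
    if high_substance_use:
        return "young, high substance use"
    if high_symptoms:
        return "young, psychiatrically ill"
    return "young, physically ill"
```

Because the checks are applied in a fixed order and each participant exits at the first match, every individual lands in a single subgroup, which is what makes the resulting groups independent.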

Cluster-analysis

Following the examples of James et al. (2006) and Peck (2005), we also identified subgroups using cluster analysis because this statistical strategy would generate homogeneous groupings based on the natural co-occurrence of the four characteristics (Batstra et al. 2002; Rapkin and Dumont 2000). We used Ward's (1963) procedure, a hierarchical agglomeration technique, with squared Euclidean distances (SPSS 1999). The cluster analysis identified five subgroups: very psychiatrically ill (n = 35), very physically ill (n = 31), substance use disorder (n = 31), older, chronically physically ill (n = 25), and relatively healthy (n = 52). As with the subgroups based on variable median splits, ANOVA validation tests confirmed that these five subgroups differed on each of the four client variables at P < .001.
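For readers unfamiliar with Ward's method, the following is a minimal pure-Python sketch of the agglomeration rule, not the SPSS routine used in the study. At each step it merges the two clusters whose union produces the smallest increase in the within-cluster sum of squared Euclidean distances. The function name `ward_clusters` and the toy points are hypothetical, and any rescaling of variables before clustering is left to the analyst.

```python
def ward_clusters(points, k):
    """Agglomerate points into k clusters using Ward's minimum-variance rule."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                sq_dist = sum((x - y) ** 2
                              for x, y in zip(ca["centroid"], cb["centroid"]))
                # Increase in within-cluster sum of squares if a and b merge
                cost = na * nb / (na + nb) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(ca["centroid"], cb["centroid"])],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(c["members"]) for c in clusters]

# Hypothetical two-variable toy data: two tight pairs and one isolated point
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
groups = ward_clusters(toy_points, 3)
```

The number of clusters `k` is supplied by the analyst, which mirrors the subjective judgment involved in selecting a cluster solution.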

Employment Rates

Employment was operationally defined as any job lasting at least 5 days that met the US Department of Labor’s definition of competitive employment: mainstream, integrated work paying at least minimum wage (Department of Labor 1998; Workforce Investment Act of 1998). Clubhouse transitional employment met these criteria, but we did not count TE as an outcome so that our findings would be comparable to the findings of other supported employment studies. The two programs kept identical employment records, which were corroborated by self-report data collected during 6-month and final exit interviews, as well as telephone calls to family members.

Control Variables

Program Preference

To control for participants’ pre-existing attitudes toward either experimental program (Macias et al. 2005), we recorded each applicant’s program preference at the time of application, and then recoded these preferences as match and mismatch to preference versus no prior preference following randomization to experimental conditions.
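The recoding of preference relative to random assignment can be sketched as a three-way classification. The labels and the use of `None` for applicants with no stated preference are assumptions made for illustration.

```python
def preference_status(preferred, assigned):
    """Classify a participant's assignment relative to stated program preference."""
    if preferred is None:
        return "no prior preference"
    return "match" if preferred == assigned else "mismatch"

# Hypothetical examples
examples = [
    preference_status("PACT", "PACT"),
    preference_status("PACT", "clubhouse"),
    preference_status(None, "clubhouse"),
]
```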

Work Interest

Participants’ stated interest in work (1, yes; 0, no or uncertain) was measured during the first research interview, prior to randomization. Work interest is a strong predictor of employment (Drebing et al. 2005; Macias et al. 2001; Regenold et al. 1999) and often used as a screening criterion by supported employment programs.

Receipt of Employment Services

Total hours of help with job searches (logged) were derived from daily service logs kept by all staff from January 1996 through December 2000.

Data Analysis Plan

We tested our research hypotheses using moderated multiple regression (Aguinis 2004), a preferred method for subgroup comparisons (Aiken and West 1991; Lipchik et al. 2005) and risk-adjustment (Hendryx et al. 2001; Hendryx and Teague 2001). Program-by-client characteristic interaction terms (Judd and Kenny 1981; Kenny et al. 2004) were created by multiplying each participant’s variable category (1, high; 0, low), or centered variable score (Aiken and West 1991), by program assignment (1, PACT; 0, clubhouse). Subgroups were compared as dichotomous variables (1, membership in the subgroup; 0, membership in another subgroup), with one of the five subgroups serving as the reference category in each analysis. To control for multiple tests, we conducted hierarchical regression analyses (SPSS 1999) and required each block of conceptually similar variables to reach statistical significance as an omnibus test before interpreting any significant beta within the block.
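The subgroup coding described in this plan can be illustrated with a short sketch. It assumes the five cluster labels reported earlier and the program coding given above (1, PACT; 0, clubhouse); the helper `code_subgroups`, the column naming, and the example records are hypothetical, and the choice of reference category here is only for illustration.

```python
SUBGROUPS = [
    "very psychiatrically ill",
    "very physically ill",
    "substance use disorder",
    "older, chronically physically ill",
    "relatively healthy",
]

def code_subgroups(records, reference):
    """Dummy-code subgroup membership and form program-by-subgroup products.

    records: list of (subgroup_label, program) pairs, program 1 = PACT, 0 = clubhouse.
    The reference subgroup gets no indicator; it is the all-zero pattern.
    """
    non_reference = [g for g in SUBGROUPS if g != reference]
    rows = []
    for label, program in records:
        row = {"program": program}
        for g in non_reference:
            member = 1 if label == g else 0
            row[g] = member
            row[g + " x program"] = member * program
        rows.append(row)
    return rows

# Hypothetical participants, not study data
coded = code_subgroups(
    [("older, chronically physically ill", 1), ("relatively healthy", 0)],
    reference="relatively healthy",
)
```

With five subgroups, this yields four membership indicators and four interaction terms, and each interaction coefficient is read as a contrast with the reference subgroup.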

Results

Preliminary Analyses

T-tests revealed experimental program differences on two of the four key variables expected to moderate service effectiveness: PACT clients had worse (higher) psychiatric symptoms, while clubhouse clients had more severe physical health conditions. Because both variables have correlated negatively with employment in previous studies, the difference in psychiatric symptoms favored the clubhouse, while the difference in physical health favored PACT. The only significant correlation between the four variables was for age and physical health (r = +.29, P < .01). Older individuals tended to be in poorer health. We rephrased our hypotheses to take this correlation into account: PACT should be most vocationally effective for older clients with health problems, and for clients with severe substance use disorders. The clubhouse should be most effective for clients who are younger and relatively healthy without severe substance use.

Aim I. Comparison of Four Methods for Calculating Client-by-program Interaction Terms

We conducted four separate regression analyses (Table 1), each of which measured the four key client characteristics in a specific way. Analysis 1 used continuous variable scores on the four client variables, with health condition severity log-transformed. In Analysis 2, these four variables were dichotomized based on median splits, with the age-by-health interaction term representing a four-category variable. Since every individual was grouped as either high or low on each variable, these groupings were not independent. Analysis 3 compared independent subgroups that were identified through cross-tabulations of the four dichotomized variables used in Analysis 2. In Analysis 4, we identified five independent and homogeneous subgroups using cluster analysis.

Table 1 Comparative sensitivity of four measurement methods for detecting a program-by-subgroup interaction effect (N = 174)

Statistical Sensitivity

We compared the relative sensitivity of each method for testing our hypothesis that older adults with chronic health problems would have a higher employment rate if assigned to PACT. The older, unhealthy subgroup was the reference group in Analyses 3 and 4.

Table 1 presents statistics for the predictor variables in each block at the time the block was entered into the analysis. Block 1 is an uncontrolled test of program effectiveness showing that PACT had a significantly higher overall employment rate. Block 2, which tests the predictive power of each client measure, was not statistically significant as an omnibus test in any analysis, in spite of the relatively strong tendency for psychiatric symptoms to discourage work.

Block 3 statistics illustrate the relative sensitivity of the four analyses for detecting variations in outcomes within programs. The age-by-health-by-program interaction term is significant in Analyses 2, 3, and 4, but this block and the full regression model are both significant as omnibus tests only in Analysis 4. Of the four methods, the comparison of cluster analysis-based subgroups provides the statistically strongest evidence of differential program effectiveness.

Specificity

Analysis 4 also provides the greatest specificity as a statistical test of the percentage differences presented in Table 2: older, unhealthy clients were more likely to work than very physically ill, very psychiatrically ill, and relatively healthy clients if they were assigned to PACT, but less likely to work than very physically ill, very psychiatrically ill, or relatively healthy clients if assigned to the clubhouse. Had this same block been significant in Analysis 3, we could conclude that older, physically ill clients had higher work rates than young, psychiatrically well clients in PACT (67% vs. 39%), but the opposite was true for the clubhouse (22% vs. 53%). Findings for this same block in Analysis 2 would be interpreted simply as better work rates for older, unhealthy PACT clients in comparison to everyone else in the study.
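The crossover pattern in the Analysis 3 percentages quoted above can be summarized descriptively as an interaction contrast. The sketch below uses only those four quoted rates; it is not the regression test reported in Table 1, and the variable names are illustrative.

```python
def odds(p):
    """Convert a proportion to odds."""
    return p / (1 - p)

# Work rates quoted in the text for the Analysis 3 subgroups
pact = {"older_ill": 0.67, "young_well": 0.39}
clubhouse = {"older_ill": 0.22, "young_well": 0.53}

# Subgroup gap within each program, on the percentage scale
gap_pact = pact["older_ill"] - pact["young_well"]
gap_club = clubhouse["older_ill"] - clubhouse["young_well"]
difference_in_differences = gap_pact - gap_club

# The same contrast on the odds scale used by logistic regression
or_pact = odds(pact["older_ill"]) / odds(pact["young_well"])
or_club = odds(clubhouse["older_ill"]) / odds(clubhouse["young_well"])
ratio_of_odds_ratios = or_pact / or_club
```

The subgroup gap is positive in PACT and negative in the clubhouse, so the two gaps differ in sign as well as size, which is the pattern a program-by-subgroup interaction term is designed to detect.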

Table 2 Employment rates for sample subgroups within PACT and clubhouse programs

One reason that Analysis 4 was the most sensitive and specific test of program differential effectiveness is that cluster analysis is designed to maximize subgroup differences in naturally co-occurring characteristics. As a result, the cluster analysis-based subgroups were distinct in several meaningful ways that aid in the interpretation of findings (Table 3). In addition to having favorable scores on all four clustering variables, the relatively healthy subgroup had better work histories and reported the fewest limitations to everyday activity. The very psychiatrically ill subgroup had the highest percentage of schizophrenia spectrum diagnoses and fewest substance use disorders. Individuals in the very physically ill subgroup scored highest on the physical health severity measure and reported the most physical limitations to everyday activity. Individuals in the older, chronically physically ill subgroup were older when first hospitalized and had the highest self-esteem, but they were the most obese and least likely to have worked in 5 years. All individuals in the substance use disorder subgroup had lifestyles of recurrent, disruptive substance abuse, and so were the most frequently homeless or incarcerated. Although each of the cluster analysis-derived subgroups differed (P < .05) from the other four subgroups in these defining ways, the five subgroups were comparable in gender, ethnicity, referral source, and program preference at time of project enrollment. As Table 2 shows, the overall pattern of subgroup differences in work rates remains stable even when the sample is reduced to individuals interested in work at the time they were randomized to the PACT or clubhouse program.

Table 3 Characteristics of participants included in the five cluster analysis-based subgroups (N = 174)

Similar subgroups could be created using various cut-points on the four key variables, but a search for optimal groupings would increase Type I errors. The cluster analysis approach required subjective judgment in the selection of an optimal cluster solution, but was guided by findings from Rubin and Panzano (2002), who identified five similar clusters for a sample of 3,600 adults with serious mental illness. Our replication of their groupings with a smaller sample and different variable measures suggests these five groupings are robust and representative of the population.

Aim II. Tests of the Internal Validity of Main Study Findings

To check on the validity of our main study findings, we repeated Analysis 4 (Table 1) with a preceding block of two attitudinal variables (program preference, work interest) known to predict work outcomes (Macias et al. 2005; Macias et al. 2001). We changed the reference group for this formal outcome analysis to the relatively healthy subgroup, since this would be a logical comparison group to most providers and service planners.

As can be seen in Table 4, the program variable (Block 2) is again significant, even when controlling for the attitudinal variables in Block 1. However, the addition of the fourth block of program-by-subgroup interaction terms shows that the PACT and clubhouse work rates (Table 2) differ significantly for two clusters: PACT was more effective for older individuals who had chronic physical health problems than for relatively healthy clients, while the reverse was true within the clubhouse. With the addition of interaction terms, the beta for program assignment is no longer statistically significant (P = .894), and is substantially reduced (β = −.09, SE = .63), indicating that the strong effectiveness of PACT for older adults with health problems accounts for the overall significantly higher work rates for PACT versus clubhouse (variable change not in table).

Table 4 Logistic regression analyses of work rates for cluster analysis-based subgroups assigned to PACT and clubhouse programs (N = 174)

When total job search hours (log-transformed) is added to the regression model (not shown in table), this dosage variable predicts employment over and above all other variables (β = .28, SE = .08, P = .001), but does not account for program differences in effectiveness because the beta for the program-by-older, chronically physically ill subgroup interaction term remains significant with no decrease in value.

Discussion

The inclusion of a final block of program-by-subgroup interaction terms in our regression analysis of employment rates (Table 4) qualifies the main finding of higher work rates for PACT versus clubhouse, restricting PACT's greater vocational effectiveness to a portion of the study sample. The 83% employment rate (Table 2) attained by PACT for clients with age-related chronic health problems surpassed not only the clubhouse work rate for this subgroup, but also the overall 50–55% benchmark employment rates reported for specialized supported employment teams that usually screen for work interest (Cook et al. 2005; Macias et al. 2006; Twamley et al. 2003). This PACT rate rises to 89% when we consider only those clients who expressed an interest in work at the time they enrolled in the study (n = 121). These findings should be useful for service planning because older adults with chronic health problems are now prevalent in the population of adults with severe mental illness (Jones et al. 2004; Rubin and Panzano 2002) and are likely to increase in number as the baby-boom generation grows older.

Advantages of Cluster Analysis

Use of cluster analysis for subgroup identification not only increased statistical power, but also solved the dilemma of how to blend the advantages of sample homogeneity and heterogeneity. Sample subgroups differed greatly from one another, but each was homogeneous in the sense that individuals within that subgroup shared the same mix of correlated characteristics, regardless of how heterogeneous the particular mix. This balance of heterogeneity and homogeneity allowed rich, complex descriptions of differential program effectiveness. Using variables, we can report only that PACT became more vocationally effective as client age increased, assuming all other client variables are held constant. Using subgroups, we can say that PACT was especially effective for middle-aged and older clients with chronic health problems who became psychiatrically ill later in life.

Study Limitations

Our subgroups were identified after randomization, so our findings require replication in new studies that first identify subgroups and then randomly assign individuals within subgroups to experimental conditions. Hopefully, our subgroup descriptions will also prove useful in the design of stratified sampling to ensure comparably sized subgroups. New studies are needed not only to test the reliability of these particular findings, but also to test more specific hypotheses, such as whether a greater appreciation of staff outreach and monitoring by older, chronically physically ill adults might account for their better employment rates in PACT, or whether older, chronically physically ill adults assigned to the clubhouse found the option of voluntary clubhouse work more appealing than a job in the competitive workforce.

A caution is also warranted: while our cluster definitions may be useful in the design of new studies, they should not become standardized subgroup definitions for the population of adults with severe mental illness. Population groupings found to predict outcomes in one study may not be as meaningful when used to examine other programs or other outcomes. Hypotheses should always be outcome and program specific. Moreover, even if studies have similar designs, target similar populations, and test similar interventions, conceptual replications that measure the same concepts in different ways are more useful than literal replications for the refinement of program theory (Aronson et al. 1990).

Conclusions

Only a handful of service evaluations have disaggregated heterogeneous samples into homogeneous subgroups to test hypotheses about the relative effectiveness of particular interventions for certain types of individuals, and these few studies span several service fields (Abel et al. 2005; Carey et al. 2007; Clark and Rich 2003; Halvorsen and Monsen 2007; Hodges et al. 2004; Maisto et al. 2001; McKendrick et al. 2007; Ogrodniczuk et al. 2007). We hope that these exemplary studies and our own subgroup-based findings will encourage the formulation and testing of theory-based hypotheses about differential service effectiveness. We also hope that the advantages of subgroup analyses for ensuring valid interpretations of whole-sample outcomes will encourage the publication of null findings (Turner et al. 2008) and spur agencies that fund services research, and registries for randomized controlled trials, to routinely require a priori hypotheses that predict the relative effectiveness of experimental interventions for particular sample subgroups.