Introduction

Self-management and patient education programs attempt to promote self-management competencies, empowerment, and participants’ acceptance of their chronic condition(s). This is achieved through health professionals imparting knowledge and insight, and providing participants with training on how to incorporate new behaviors into their lives [1–3]. However, efficacy studies of self-management programs often do not address the aforementioned outcomes. Instead, standard clinical or socio-medical outcomes are measured, for example, somatic parameters, quality of life, or return to work. These more distal outcomes depend on factors that are not directly influenced by self-management programs. For example, distal outcomes may be strongly influenced by the severity of a somatic disease [4, 5]. Accordingly, systematic reviews of the impact of self-management or patient education programs often show little or no change in distal outcomes [6–16].

When only distal outcomes are assessed, the efficacy of self-management programs may be underestimated, and moreover, effects on key early outcomes may be overlooked [17]. Therefore, it is important that researchers and program managers incorporate proximal outcome measures in the evaluation of self-management programs [3, 18]. Proximal outcomes are more directly affected by the intervention than distal outcomes [19] and can be clearly deduced from the contents and goals of self-management programs. As expected, stronger effects in proximal outcomes are often demonstrated empirically [12, 15, 20, 21].

In response to an observed lack of valid measures of the intended proximal outcomes of self-management programs, Osborne and colleagues developed the Health Education Impact Questionnaire (heiQ™) in Australia [22]. Originally, this generic instrument contained 42 items, assessed on a 6-point Likert response scale, to measure eight independent constructs: Positive and active engagement in life, Health-directed activities, Skill and technique acquisition, Constructive attitudes and approaches, Self-monitoring and insight, Health service navigation, Social integration and support, and Emotional distress. The items and scales were developed through careful consultation with patients, healthcare professionals, researchers, healthcare managers, and policymakers; the constructs were subsequently validated using rigorous psychometrics. Studies have demonstrated that the heiQ™ can be used to display the effects of self-management programs in outpatient and community settings [12, 21, 23, 24]. Since its development, the heiQ™ has become widely applied and has required only minor refinements. Analyses during the construction of the heiQ™ showed that some items had disordered thresholds, that is, some respondents were unable to differentiate between the two midpoints “slightly agree” and “slightly disagree.” Further analysis (unpublished) also suggested that two items could be removed without compromising content validity. As a result, the response format was simplified to a 4-point response scale (“strongly disagree” to “strongly agree”) and two items were removed, leaving 40 items. Generally, higher values on the heiQ™ scales indicate better status, except for Emotional distress, in which higher values indicate higher distress. Further information on the heiQ™ can be found elsewhere [2, 22] and on: www.heiQ.org.au.

The eight independent heiQ™ scales were designed to be sensitive to the immediate or proximal outcomes of self-management [17]. Longer-term outcomes of an intervention might be a reduction in disability, improved health-related quality of life, or even prolonging survival. The proximal outcomes were conceptualized as those impacts that are likely to be observable soon after participation in a self-management education program, such as improvements in attitudes associated with the chronic illness (Constructive attitudes and approaches) or particular skills taught in a diabetes or weight loss education program (Skill and technique acquisition). Group-based interventions that promote connectedness between participants are likely to result in immediate improvements in Social integration and support.

Until now, robust and sensitive questionnaires to comprehensively assess outcomes of self-management programs across chronic disorders have been absent in the German language. Therefore, the heiQ™ was translated and culturally adapted. The rigorous analyses of its factorial validity and reliability are reported in this paper. Furthermore, we made a first attempt to test the concurrent validity of the heiQ™. Thus far, no studies have systematically examined correlations between heiQ™ scales and other instruments used in self-management program evaluation.

Methods

The research was undertaken in two phases. First, the heiQ™ was translated and culturally adapted to German, and its comprehensibility was tested. Second, psychometric properties were examined.

Phase 1

The translation and cultural adaptation of the heiQ™ was undertaken using a strict protocol conforming to international standards [25–27]. After translating the questionnaire, cognitive interviews [28] were conducted with members of the target population. Ten patients (35–55 years, eight females, all native German speakers) with either orthopedic conditions or heart disease from two hospitals were interviewed. This procedure checked for semantic equivalence [29], comprehensibility, and content validity.

The translation process included one forward and one backward translation with the aid of two professional translators. The forward translation was checked by bilingual researchers (MS, MS, GM, RK, CG, IE, and HF) and was slightly modified in consultation with the forward translator, a native German speaker. The modified version was translated back to English by a native English speaker (backward translator) who had no knowledge of the original heiQ™. The back translation was then compared with the original heiQ™ by the Australian-based researchers (SN—bilingual) including the author (RHO). A consensus meeting generated a preliminary German translation. While emphasis was placed on the equivalence between the English and the German version, when discrepancies arose, cultural and conceptual adaptations were preferred over the literal translations.

Phase 2

Sample

Patients from seven rehabilitation hospitals with a range of medical conditions (cancer, chronic pain, heart disease, inflammatory bowel disease, obesity disorder, orthopedic condition, and respiratory disease) were included. Patients completed the survey, that is, the heiQ™ as well as other questionnaires to further assess construct validity, at the beginning (T1), at the end (T2), and 3 months after inpatient rehabilitation (T3). A subgroup also completed the heiQ™ 3 weeks before inpatient rehabilitation (T0). Only patients who were able to complete the questionnaires independently were included in the study. All analyses presented in this paper were based on data from T0 and T1.

Factorial validity

Confirmatory factor analyses (CFA) were initially conducted separately for each scale. Evaluation of model accuracy was based on the chi-square test and model fit indices such as Comparative fit index (CFI), Root mean square error of approximation (RMSEA), and Standardized root mean square residual (SRMR) [30]. As small and essentially unimportant discrepancies of the data from postulated models are likely to result in statistically significant chi-square values if sample sizes are large as in the present case [31, 32], a significant chi-square was always interpreted in conjunction with other fit indices. For model fit to be interpreted as “acceptable,” CFI needed to be above 0.95, RMSEA below 0.06, and SRMR below 0.08 [31, 33]. If a model failed to meet one or more of these cutoff values, expected parameter changes (EPCs) and modification indices (MI) were calculated to estimate type and magnitude of model misspecification [34]. Akaike’s Information Criterion (AIC) and Bayes Information Criterion (BIC) were used to compare non-nested models [31].
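The cutoff rule described above can be written down as a small check. The following is only an illustrative sketch: the names FitIndices and violated_cutoffs are hypothetical and not part of any software used in the study, and the example values are the eight-factor fit indices reported in the Results.

```python
from dataclasses import dataclass

# Cutoff values as stated in the text (CFI above, RMSEA and SRMR below).
CFI_MIN = 0.95
RMSEA_MAX = 0.06
SRMR_MAX = 0.08


@dataclass(frozen=True)
class FitIndices:
    """Hypothetical container for the three fit indices used here."""
    cfi: float
    rmsea: float
    srmr: float


def violated_cutoffs(fit: FitIndices) -> list:
    """Return the names of all indices that miss their cutoff."""
    violations = []
    if not fit.cfi > CFI_MIN:
        violations.append("CFI")
    if not fit.rmsea < RMSEA_MAX:
        violations.append("RMSEA")
    if not fit.srmr < SRMR_MAX:
        violations.append("SRMR")
    return violations


def is_acceptable(fit: FitIndices) -> bool:
    """A model counts as acceptable only if no cutoff is violated."""
    return not violated_cutoffs(fit)


# Eight-factor model, total sample (values taken from the Results section).
eight_factor = FitIndices(cfi=0.918, rmsea=0.044, srmr=0.054)
print(violated_cutoffs(eight_factor))
```

A model flagged by such a check would then be inspected with EPCs and MIs, which require the fitted model itself and are not sketched here.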

To test factorial validity, the total sample was divided into a calibration (N = 603) and a validation sample (N = 599) using a stratified randomization procedure whereby the condition was the grouping variable. First, the total sample was used to test the assumed one-factor measurement models. If evaluation of respective model fit was positive, the model was accepted. Otherwise, a modified model was tested in the calibration sample. To modify a model, statistical criteria (EPC, MI) [34] and content-related considerations were used. If model fit was accepted, it was then tested in the validation sample (cross-validation). Eventually, all final one-factor models were again tested in the total sample. After all one-factor models were confirmed, the full eight-factor model was tested in all three samples.
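A stratified split of this kind might look as follows. This is a minimal sketch under stated assumptions: the exact randomization procedure is not described in the text, so the per-stratum shuffle, the function name stratified_split, the seed, and the rule of assigning the extra patient of an odd-sized stratum to the calibration sample are all illustrative choices.

```python
import random
from collections import defaultdict


def stratified_split(patients, stratum_of, seed=0):
    """Split patients into two halves within each stratum.

    patients   : list of arbitrary patient records
    stratum_of : function mapping a record to its stratum (here: condition)
    seed       : fixed seed so the split is reproducible (assumption)
    """
    rng = random.Random(seed)

    # Group the records by stratum.
    by_stratum = defaultdict(list)
    for patient in patients:
        by_stratum[stratum_of(patient)].append(patient)

    calibration, validation = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        # An odd-sized stratum gives its extra member to calibration.
        half = (len(members) + 1) // 2
        calibration.extend(members[:half])
        validation.extend(members[half:])
    return calibration, validation


# Toy data: 7 orthopedic and 4 cancer patients (invented for illustration).
toy = [{"id": i, "condition": "orthopedic"} for i in range(7)]
toy += [{"id": 100 + i, "condition": "cancer"} for i in range(4)]
calib, valid = stratified_split(toy, lambda p: p["condition"])
print(len(calib), len(valid))
```

Splitting within each condition keeps the case mix of the two samples close to that of the total sample, which is the purpose of using the condition as the grouping variable.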

Reliability

Reliability of each scale was estimated using Raykov’s Composite Reliability Coefficient (CRC) score [35, 36]. CRC values can be interpreted like Cronbach’s alpha; the CRC requires only a congeneric measurement model [37] and takes the effects of correlated error variances into account [31]. Based on CRCs, Standard Error of Measurement (SEM) [38, 39] was computed. Furthermore, test–retest reliability [intraclass correlation coefficient (3,1)] of each scale was computed [39] in a subsample of N = 69 patients with orthopedic disease who had completed the heiQ™ at T0 and T1.
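For orientation, the three reliability quantities can be sketched with their commonly used point-estimate formulas. These are assumptions rather than the study's actual computations: the CRC is written for a congeneric model with the factor variance fixed to 1, SEM is taken as the scale standard deviation times the square root of one minus reliability, and ICC(3,1) follows the two-way consistency form of Shrout and Fleiss. All function names and numeric inputs are hypothetical.

```python
import math


def composite_reliability(loadings, error_variances, error_covariances=()):
    """CRC point estimate; factor variance assumed fixed to 1.

    Correlated errors enter the error part twice (once per item pair).
    """
    true_part = sum(loadings) ** 2
    error_part = sum(error_variances) + 2 * sum(error_covariances)
    return true_part / (true_part + error_part)


def standard_error_of_measurement(scale_sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return scale_sd * math.sqrt(1.0 - reliability)


def icc_3_1(scores):
    """ICC(3,1) from an n-subjects x k-occasions table (list of rows)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-subjects mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# Invented standardized loadings and error variances for a 4-item scale.
crc = composite_reliability([0.7] * 4, [0.51] * 4)
print(round(crc, 2), standard_error_of_measurement(0.5, 0.75))
```

Because ICC(3,1) removes systematic differences between occasions, a uniform shift of all patients between T0 and T1 does not lower the coefficient.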

Concurrent validity

To study concurrent validity, the following comparison scales were used: (1) SF-36 [40, 41], a widely used generic instrument for assessing health status with eight subscales divided into Physical and Mental Health scales; (2) IRES-24 [42], a short form of the IRES-3 [43], a widely used instrument in Germany for assessing subjective health in patients with chronic conditions; (3) Illness Perception Questionnaire-Revised (IPQ-R) [44, 45], an instrument based on the Common Sense Self-Regulation Model [46] assessing cognitive and emotional representations of an illness; (4) Patient Health Questionnaire (PHQ-9), a short screening instrument for depression that allows a categorical analysis (no depression, other depression, major depression) based on criteria of the Diagnostic and Statistical Manual of Mental Disorders [47]; and (5) Generalized Anxiety Disorder Scale (GAD-7) [48], a short screening instrument to measure anxiety. The latter two instruments are used worldwide for patients with different chronic conditions [49–53].

We made the following hypotheses:

  1. Overall, heiQ™ scales will have low to moderate associations with the comparator scales, with the majority of correlations expected to be below r = 0.6, given that they were intended to measure something different than most available scales (for exceptions see below).

  2. Most heiQ™ scales will show low to moderate correlations with scales of subjective health, depression, and anxiety; correlations between most heiQ™ scales and mental health scales are expected to be higher than those with physical health scales given the item content.

  3. Most heiQ™ scales will show low to moderate correlations with the following IPQ-R scales: Personal Control, Coherence, Consequences, and Emotional Representation. In particular, Self-monitoring and insight and Skill and technique acquisition will show at least moderate correlations with the IPQ-R scale Personal Control. Furthermore, Emotional distress will show high correlations with the IPQ-R scale Emotional Representation. No hypotheses were formulated about correlations between heiQ™ scales and other IPQ-R scales.

  4. The heiQ™ scales Emotional distress, Constructive attitudes and approaches, and Positive and active engagement in life will show at least moderate to high correlations with depression, anxiety, and mental health.

Statistical analysis

Confirmatory factor analyses were computed using Mplus 6.1 [54] with Robust Maximum Likelihood (MLR-estimator). To handle missing data, the Full Information Maximum Likelihood (FIML) algorithm was used [55]. Computations with manifest variables were conducted with IBM PASW Statistics 18. In these analyses, missing data were estimated using multiple imputations. Five complete data sets were imputed, and the results of each were combined to build the overall results [56]. The amount of missing data per item was low (0.1–3.0 %). A p value <0.05 was regarded statistically significant unless otherwise stated. Effect sizes for between-group effects were estimated using Cohen’s d (with pooled standard deviations of the compared groups as denominator), with d = 0.2/0.5/0.8 indicating small/medium/large effects. Correlation coefficients of 0.1/0.3/0.5 were regarded as small/medium/large [57].
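The effect size conventions above can be illustrated with a short sketch. It assumes the usual sample-size-weighted pooled standard deviation for Cohen’s d, which is one reading of “pooled standard deviations”; the function names, the label “below small”, and the group statistics in the example are hypothetical.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d with a pooled standard deviation as denominator."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def magnitude(value, small, medium, large):
    """Label an effect by its absolute size against three thresholds."""
    size = abs(value)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "below small"  # label chosen for this sketch only


def label_d(d):
    """Thresholds for Cohen's d as given in the text."""
    return magnitude(d, 0.2, 0.5, 0.8)


def label_r(r):
    """Thresholds for correlation coefficients as given in the text."""
    return magnitude(r, 0.1, 0.3, 0.5)


# Invented group statistics: two groups of 50 with equal spread.
d = cohens_d(3.0, 0.5, 50, 2.75, 0.5, 50)
print(d, label_d(d), label_r(0.73), label_r(0.02))
```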

Results

Phase 1

After finalizing the translation process, a preliminary German heiQ™ was established. Cognitive interviews showed that items were generally well understood by interviewees in the intended manner. However, based on the responses, 12 items (30 %) required further refinement. All changes were discussed with all project members and the author of the heiQ™.

Phase 2

Sample

The total sample comprised 1,202 patients from seven clinics. A large proportion had rheumatic/orthopedic conditions (40.9 %) or respiratory conditions (28.4 %); 11.3 % were diagnosed with cancer (rectum, colon, or bladder cancer), 11.8 % with inflammatory bowel disease, 4 % with heart disease, and 4 % with other chronic conditions. Sample characteristics are shown in Table 1. No substantive differences between calibration and validation sample were observed regarding socio-demographic parameters (age, sex, education, and income).

Table 1 Sample characteristics

Factorial validity

Table 2 displays the results of the CFA for the total sample (results of calibration and validation sample are available on request). The postulated measurement models of Positive and active engagement in life, Constructive attitudes and approaches, and Skill and technique acquisition showed good fit. In contrast, the remaining five scales showed inadequate fit in at least one fit index. When freeing an error covariance in respective measurement models, fit indices improved in a way that model fit was acceptable. For the measurement model of Emotional distress, two possibilities for improving model fit were found: A good model fit (χ2 = 24.81, df = 8, p = 0.002; CFI = 0.993; RMSEA = 0.042) was achieved by freeing the error covariance between items 4 and 18, but a superior model fit (χ2 = 5.22, df = 5, p = 0.390; CFI = 1.00; RMSEA = 0.006) was achieved by deleting one item (item 18). Since this item should not be deleted from the scale prematurely, it was maintained in subsequent analyses involving this scale; the error covariance was freed instead. All eight heiQ™ scales showed good factorial properties.
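The RMSEA values for the two Emotional distress models can be approximately reproduced from the reported chi-square values, degrees of freedom, and the total sample size. The sketch below uses the common textbook formula with N − 1 in the denominator; this is an assumption, since the software may apply a different variant (for example N instead of N − 1, or a scaling correction under robust estimation), so agreement is expected only at the reported rounding.

```python
import math


def rmsea(chi_square, df, n):
    """Textbook RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


N_TOTAL = 1202  # total sample size reported in the text

# Model with freed error covariance between items 4 and 18.
print(round(rmsea(24.81, 8, N_TOTAL), 3))

# Model with item 18 deleted.
print(round(rmsea(5.22, 5, N_TOTAL), 3))
```

When the chi-square value is close to its degrees of freedom, as in the model without item 18, the numerator approaches zero and RMSEA becomes very small.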

Table 2 Model fit and reliability indices of the measurement models

Factor loadings of all tested models in the total sample are shown in Table 3. Most loadings were between 0.5 and 0.9, indicating a good representation of the items by the underlying factors. The only exception was Self-monitoring and insight which had some coefficients between 0.4 and 0.5.

Table 3 Item content and factor loadings of original and modified models (total sample)

Based on the results shown above, a full eight-factor model was tested in all three samples, whereby latent factors were allowed to correlate. No additional associations between items or between items and factors (cross-loadings) were allowed. As results of all three samples were similar, only the results of the total sample are reported herein. The model exhibits acceptable fit values (χ2 = 2223.96, df = 670, p < 0.001; CFI = 0.918; RMSEA = 0.044; SRMR = 0.054).

Correlations between heiQ™ factors and those between manifest heiQ™ scales are displayed in Table 4. Positive correlations were observed between all factors, with correlation coefficients ranging from r = 0.17 to r = 0.95. Noticeable are the high correlations between Skill and technique acquisition and Self-monitoring and insight (r = 0.95), and Active engagement in life and Constructive attitudes and approaches (r = 0.85). However, testing alternative models such as allowing cross-loadings between single items and both factors did not lead to a significant reduction in the factor correlations. In a further assessment, an alternative model with only one factor for all items from both scales was tested and compared with the two-factor models. For Active engagement in life and Constructive attitudes and approaches, the one-factor model (CFI = 0.923; AIC = 22,599.82; BIC = 22,752.57) shows worse fit values than the two-factor model (CFI = 0.954; AIC = 22,467.61; BIC = 22,625.45). For Skill and technique acquisition and Self-monitoring and insight, results of the one-factor model (CFI = 0.980; AIC = 22,931.70; BIC = 23,094.64) and the two-factor model are very similar (CFI = 0.980; AIC = 22,933.03; BIC = 23,101.06).
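The information criteria reported above can be compared directly, since lower AIC and BIC values indicate the preferred model. The sketch below only subtracts the reported values; the ModelFit container and the function name are hypothetical.

```python
from typing import NamedTuple


class ModelFit(NamedTuple):
    """Hypothetical container for a model's information criteria."""
    name: str
    aic: float
    bic: float


def ic_differences(one_factor, two_factor):
    """One-factor minus two-factor; positive values favor two factors."""
    return {
        "delta_aic": one_factor.aic - two_factor.aic,
        "delta_bic": one_factor.bic - two_factor.bic,
    }


# Active engagement in life / Constructive attitudes and approaches.
engagement = ic_differences(
    ModelFit("one-factor", 22599.82, 22752.57),
    ModelFit("two-factor", 22467.61, 22625.45),
)

# Skill and technique acquisition / Self-monitoring and insight.
skills = ic_differences(
    ModelFit("one-factor", 22931.70, 23094.64),
    ModelFit("two-factor", 22933.03, 23101.06),
)

print(engagement)
print(skills)
```

For the first pair both differences exceed 100 points in favor of two factors. For the second pair the differences are small (about 1.3 for AIC and 6.4 for BIC) and nominally favor the one-factor model, which is consistent with the two models being described as very similar.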

Table 4 Correlations between heiQ™ factors (italicized values) and between heiQ™ scales (non-italicized values)

Reliability

Reliability estimates using Raykov’s CRC for the accepted models can be classified as moderate (e.g., CRC = 0.71 for Self-monitoring and insight) or good (e.g., 0.87 for Constructive attitudes and approaches) (Table 2). Test–retest reliability coefficients were somewhat lower (r_tt = 0.60 for Health-directed activities to r_tt = 0.83 for Social integration and support).

Concurrent validity

As hypothesized, the heiQ™ scales showed generally low to moderate correlations with most comparator scales. Only one correlation coefficient exceeded 0.6 (see below). Correlations with scales of mental health were slightly higher than those with physical health scales. For example, the range of the correlations between heiQ™ scales and IRES-24 Subjective health was between 0.21 and 0.60, while correlations with IRES-24 Physical health were between 0.11 and 0.36.

Most heiQ™ scales showed low to moderate correlations with IPQ-R scales Coherence, Consequences, and Emotional Representation (Table 5). The highest correlation was seen between Emotional distress and Emotional representation (r = 0.73). However, only very low correlations with the IPQ-R scale Personal control were observed, even no correlations with heiQ™ scales Self-monitoring and insight (r = 0.01) and Skill and technique acquisition (r = 0.02).

Table 5 Correlations between heiQ™ scales and comparator scales

Further, heiQ™ scales showed low to high correlations with PHQ-9 and GAD-7. As shown in Table 6, patients with major depression or other depression (according to PHQ-9) had significantly lower values in heiQ™ scale scores than those without depression. Effect sizes were moderate or high between patients with major depression and patients without depression.

Table 6 Mean differences in heiQ™ scales between persons without depression, with other depression, and with major depression (according to PHQ-9)

As expected, moderate to high correlations were found between heiQ™ scales Emotional distress, Constructive attitudes and approaches, and Positive and active engagement in life and measures of anxiety, depression, and mental health.

Discussion and conclusion

Discussion

In this study, the heiQ™ was translated and culturally adapted to German. Comprehensibility of the items was confirmed using cognitive interviews; comparison with other relevant constructs yielded meaningful associations. Using robust and highly restricted CFA procedures, the heiQ™ was found to be well replicated in the German language; the measurement models showed good fit after only minor adjustment, and reliability and concurrent validity were largely supported. The German heiQ™ is therefore likely to be a useful measure of proximal outcomes of self-management and health education programs in German-speaking countries.

Overall, the translated heiQ™ was found to have good factorial validity. While three of the eight scales could be accepted immediately, the remaining five scales needed minor adjustments (freeing error covariances of distinct items) to achieve good fit indices. Across the entire questionnaire, only one item was considered problematic. In Emotional distress the fit indices were good after freeing the error covariance of two items; however, the deletion of item 18 (“Ich bin sehr beunruhigt, wenn ich über meine Gesundheit nachdenke”) improved the model fit substantially. It may be possible that the core meaning of the original item (“upset”) was not fully captured by our translation (“sehr beunruhigt”). Nonetheless, removing this item may affect the content validity of the scale; thus, the item was retained. Moreover, the reliability of the scale did not substantially improve when the item was removed (see Table 2). Further studies with different translations of the item may clarify this issue.

Although the CFI for the eight-factor model is somewhat lower than our recommended cutoff value, the fit indices for this model are still within the acceptable range for multidimensional questionnaires, particularly when interpreted in the context of the otherwise acceptable fit statistics [58]. In spite of this, the high correlation between Skill and technique acquisition and Self-monitoring and insight on the one hand and Active engagement in life and Constructive attitudes and approaches on the other hand might be problematic. The question arises whether these scales indeed measure conceptually and empirically different constructs. Assessments of the one- and two-factor models indicate that Active engagement in life and Constructive attitudes and approaches should be modeled as two highly correlated factors. In contrast, the one-factor and two-factor models for Skill and technique acquisition and Self-monitoring and insight showed very similar fit. However, the conceptual difference between the two scales is very clear: Patients may have skills to cope with symptoms of their illness (skills and techniques), but at the same time, they may have little understanding of the underlying mechanisms (insight). Therefore, the two-factor model has been chosen for now. More studies are needed to clarify the relationship between these two scales across settings.

Although all tested models show good model fit, the values are somewhat lower than in the validation of the original heiQ™ [22]. Several reasons may explain this discrepancy. First, this may be due to the original 42 items being selected from a large pool of items to generate the best possible model, whereas only the 40 translated items were tested in this study. As there are different possibilities to translate an item, other translation options may have led to better fit values. Second, the sample in this study differed from that of Osborne et al. [22]. For example, they did not include cancer patients or patients suffering from inflammatory bowel disease. Finally, the German translation was based on a heiQ™ version with four-point Likert scales and 40 items, while Osborne and colleagues used the six-point Likert scales and 42 items.

In general, reliability estimates of the heiQ™ scales showed acceptable to good values (0.71–0.87). As expected, retest reliability estimates were found to be slightly lower (0.60–0.83) than estimates in CRC, but they are still acceptable.

Most of our hypotheses concerning concurrent validity were confirmed. With one exception, correlation coefficients were lower than r = 0.6, indicating that the heiQ™ scales capture other constructs than the comparator scales. This finding confirms that the heiQ™ fills a gap in the measurement of outcomes of patient education and self-management programs.

All heiQ™ scales showed at least low to moderate correlations with measures of subjective health; correlations were slightly higher with mental health than with physical health scales. Furthermore, all heiQ™ scales showed at least low to moderate correlations with depression and anxiety. Of all heiQ™ scales, Emotional distress, Constructive attitudes and approaches, and Active engagement in life showed the highest correlations with measures of mental health, depression, and anxiety. This result indicates that these constructs capture elements of a global mental health construct.

The very high correlation (the only one above 0.6) with the IPQ-R scale Emotional representation indicates good convergent validity of the heiQ™ scale Emotional distress. Both scales capture emotional states with clear attribution to the illness of the patient [22, 45]. The moderate correlations between the heiQ™ scales and the IPQ-R scales Consequences and Coherence were also expected. For example, patients who feel as though their illness “doesn’t make sense” or is “a mystery” (IPQ-R scale Coherence) understandably also have fewer skills to cope with the symptoms of the illness (heiQ™ scale Skill and technique acquisition). The surprisingly low correlations between the heiQ™ scales and the IPQ-R scale Personal control may be due to unclear psychometric properties of this particular IPQ-R scale. For example, Glattacker et al. [44] report low correlations between Personal control and other comparator scales (e.g., self-efficacy expectations).

Our findings have shown that the heiQ™ scales can differentiate between patients with and without major depression. Patients with high levels of distress tend to have low values on the heiQ™ constructs. Patients who have little confidence (constructive attitudes) and few self-management competencies are conceivably more likely to become depressed than other patients. An increase in self-management competencies should therefore reduce depression. Conversely, depressed patients may appraise their competencies as lower than patients without depression.

Limitations

Our study has several limitations. Although our sample represents several chronic conditions, many groups are absent. For example, only a few patients suffered from heart disease. Some important chronic conditions, such as diabetes mellitus and common tumors (e.g., breast, prostate, lung, or skin cancer), are not represented in the sample. Further studies may appraise the generalizability of the results for patients with other chronic conditions.

Construct validity of some heiQ™ scales (e.g., Emotional distress) was confirmed by comparisons with related comparator scales (e.g., IPQ-R scale Emotional representation). However, construct validity of some other heiQ™ scales (e.g., Health service navigation) was less well examined since no comparator scales exist. To obtain additional information on concurrent validity, further studies should use validation scales that encompass related constructs such as doctor–patient relationship [59, 60] or patient competence [61]. Furthermore, future studies should focus on the responsiveness of the scales in groups of individuals participating in interventions that have a specific curriculum designed to improve a range of target outcomes. A more complete understanding of the construct validity of the heiQ™ will evolve through longitudinal studies where sensitivity-to-change or predictive validity is examined.

Conclusions

Overall, the German heiQ™ is well understood by patients suffering from different types of chronic conditions; it assesses relevant outcomes of self-management programs in a reliable manner. The constructs measured by the heiQ™ scales capture aspects that differ from those of other outcome measures in use and can be assigned to the defined goals of self-management programs, in particular, empowerment (e.g., Health-directed behavior, Health service navigation), self-management (e.g., Skill and technique acquisition), and acceptance of the chronic illness (e.g., Constructive attitudes and approaches). The heiQ™ constructs may serve as proximal goals of self-management programs to advance outcome assessment in this field. Further studies involving the heiQ™ and its practical application are warranted.