Background

A central goal of cardiovascular care is to improve patients’ health status. In addition to mortality and morbidity outcomes, patient-reported health-related quality of life (HRQoL) is an important measure of health, especially when examining the effects of interventions on cardiovascular health [1]. HRQoL can predict mortality [2], cardiovascular events, hospitalization, and costs of care [1]. Patients with a cardiovascular disease (e.g., congenital heart disease [3], congestive heart failure [4], myocardial infarction [5], or coronary heart disease [6]) have impaired HRQoL compared to the general population.

One common intervention in cardiovascular care is cardiac rehabilitation [7]. In Germany, it is covered by public and private health insurers as well as the German Statutory Pension Insurance Scheme [8]. Methods of treatment comprise preventive cure (Heilverfahren) and—more frequently—follow-up treatment immediately after acute cardiac events (Anschlussheilbehandlung). Though outpatient rehabilitation services were introduced in the 2000s, it is far more common for patients to receive inpatient services for a period of time typically lasting three or more weeks.

The effect of an intervention aimed at improving HRQoL is usually quantified as the difference between measurements taken at baseline and at follow-up. The most common approach only takes the actual measurements of perceived HRQoL into account (e.g., the mean difference posttest minus pretest) and provides a more objective measure, one of observed change. Another approach considers the fact that a person’s perception of the quality in question can change over time, even if the quality itself does not change. Hence, this approach does not use measurements of perceived HRQoL at baseline. Instead, it uses a thentest, a retrospective assessment of baseline HRQoL that is first reported at follow-up (e.g., [9,10,11]). The difference (posttest minus thentest) provides a more subjective measure, one of perceived change, which is more meaningful for understanding the effects of interventions as patients perceive them. Each of these three measurements (pretest, posttest, thentest) is based on its own frame of reference, and every mean difference can be biased. This bias is called response shift and is differentiated into four types: reconceptualization, reprioritization, uniform recalibration, and non-uniform recalibration [12]. Reconceptualization describes a redefinition of the target construct, reprioritization describes a change in the importance of the target components, and recalibration describes a change in the internal standards. Recalibration is called uniform if the change in internal standards can be explained by change in the target construct, and it is called non-uniform if not. One method for analyzing these types of bias is the structural equation modelling (SEM) approach introduced by Oort [13]. Theoretically, response shift can be caused by different mechanisms such as, among others, coping and social comparison [14]. Since it is an aim of cardiac rehabilitation to support active coping, response shift should be expected to occur in cardiac rehabilitation [7].
Dempster et al. [7] showed that response shift does indeed occur during cardiac rehabilitation and concluded that this bias probably leads to an underestimation of the effects of the intervention. Because Schwartz et al. [15] pointed out that the thentest is susceptible to recall bias and potentially contaminated by other influences (e.g., social desirability, effort justification, and implicit theories of change), the question arises of how a person’s recollection differs between pretest and thentest beyond response shift. This question can be answered by combining both approaches (investigating observed and perceived change) in one SEM analysis. The complementary integration of these two approaches is new, because so far either only one kind of change has been examined, or both kinds of change have been compared as competing methods, e.g., Visser et al. [16]. Considering the two approaches as complementary within a single structural equation model, and not as competing approaches, has the advantage that the susceptibility of the thentest approach to memory distortion can be quantified.

In light of these theoretical implications and empirical findings, our aims are as follows: (a) to examine differences in HRQoL between patients undergoing cardiac rehabilitation and the general population, (b) to investigate changes in HRQoL that were observed and that were perceived, and (c) to explore response shift effects and indications of recall bias.

Methods

Study participants

Between February 2015 and April 2016, a group of 479 cardiac rehabilitation inpatients treated in a German rehabilitation clinic administered by the Deutsche Rentenversicherung Westfalen were asked to participate in the study. Inclusion criteria were (1) survival of an acute cardiovascular event, (2) age of 18 years or older, and (3) the absence of any severe cognitive or verbal impairments that would interfere with a patient’s ability to complete questionnaires. Informed consent was obtained from the study participants after they were given an explanation of the purpose of the study and of the data collection and storage methods. Of the 479 patients invited to take part in the study, 356 (74%) consented to participate and filled in the first questionnaire (baseline). Three months later, these patients were sent a packet by mail including a letter, a questionnaire (follow-up), and a stamped addressed return envelope. If they did not respond, they were sent one reminder by mail. In total, data from 282 patients (79%) were available for analysis. The study was approved by the Ethics Committee of the University of Leipzig.

General population

The reference data were taken from two studies that examined representative samples of the German general population [17, 18] (n = 4476). From these, a subsample was selected (1760 males, 343 females) so that the proportion of general population females was identical with that of our rehabilitation patients’ baseline sample (16.3%) and that the mean age of the general population sample was very similar to that of the patients’ baseline sample (M = 55.6 years). The selection was realized by systematically excluding young participants and women from the original general population sample until the distribution of the patients’ sample was reached.

Instruments

The sociodemographic characteristics we accounted for included: gender, age at baseline, education, employment status, and partnership status. The medical characteristics we recorded were diagnosis and time since start of treatment (in weeks).

HRQoL was measured with the functioning scales of the Quality of Life Questionnaire EORTC QLQ-C30, which was developed by the European Organization for Research and Treatment of Cancer (EORTC). Although this is a disease-specific instrument developed for use with cancer patients, it can also be used to assess HRQoL in other populations, including the general population [17, 19,20,21,22] and other patient groups suffering from, for example, chronic pain [23] or cardiac diseases [24]. The instrument contains 30 items distributed across five functioning scales, three symptom scales, six symptom items, and one global health/quality of life scale [25]. All scores are linearly transformed to obtain the range 0–100. Higher values on the functioning scales indicate higher functioning, and higher values on the symptom scales indicate greater levels of burden [26]. A recent study tested the higher order measurement structure [27].
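The linear transformation to the 0–100 range can be sketched as follows. This is a minimal illustration, assuming the standard EORTC scoring convention (items coded 1–4, raw score equal to the item mean, functioning scales reversed so that higher values mean better functioning); the function names and the example responses are hypothetical and not taken from the study.

```python
def raw_score(items):
    """Mean of the item responses that make up one scale."""
    return sum(items) / len(items)


def functioning_score(items, item_range=3):
    """Transform a functioning scale to 0-100 (higher = better functioning).

    item_range is the difference between the highest and lowest possible
    response (3 for items coded 1-4). Assumed convention, see lead-in.
    """
    return (1 - (raw_score(items) - 1) / item_range) * 100


def symptom_score(items, item_range=3):
    """Transform a symptom scale or item to 0-100 (higher = more burden)."""
    return ((raw_score(items) - 1) / item_range) * 100


# Hypothetical responses of one patient on a two-item scale
print(functioning_score([2, 3]))
print(symptom_score([2, 3]))
```

Because the functioning scales are reversed in this convention, the same raw responses map to mirrored values on functioning and symptom scales, which is why the two kinds of scales are interpreted in opposite directions.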

Statistical analyses

Missing values were estimated using the Expectation Maximization procedure [28]. Statistical analyses were performed using IBM SPSS Statistics 23, IBM SPSS Amos 23 (with maximum likelihood estimation), and Microsoft Excel 2010 supplemented by the “Real Statistics Resource Pack” for Excel [29].

Comparisons of means were conducted with t tests for independent groups (comparison with the general population) and with t tests for dependent groups (comparisons between pretest, posttest, and thentest). Furthermore, we computed effect sizes (Hedges’ g) to express the mean score differences in relation to the pooled standard deviation. Hedges’ g is a bias-corrected version of Cohen’s d and is classified as small with g ≥ 0.2, medium with g ≥ 0.5, and large with g ≥ 0.8 [30]. Type I error probabilities (p values) for the effect sizes were computed using their standard errors [31].
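As an illustration of the effect size used here, the following sketch computes Hedges’ g for two independent groups. It assumes the common approximation of the small-sample correction factor, 1 − 3/(4(n1 + n2) − 9); the exact variant used in the study (in particular for dependent groups) is not stated in the text, and all names and example values are hypothetical.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d multiplied by an approximate small-sample correction."""
    d = (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # assumed approximation
    return d * correction


def classify(g):
    """Label the magnitude of g with the thresholds given in the text."""
    size = abs(g)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


# Hypothetical group statistics, not data from the study
g = hedges_g(mean1=60.0, sd1=20.0, n1=50, mean2=50.0, sd2=20.0, n2=50)
print(round(g, 3), classify(g))
```

The classification works on the absolute value, so negative effects (such as lower functioning in patients) are labeled by magnitude only.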

Detection of response shift was conducted with the SEM approach proposed by Oort [13, 32]. First, the measurement model of functioning quality of life according to Gerlich et al. [33] was tested for each measurement (pretest, posttest, and thentest) separately. This model included the respective five EORTC QLQ-C30 functioning scales (physical functioning, role functioning, social functioning, cognitive functioning, and emotional functioning). Then, these three models were combined through introducing between occasion covariances for each scale and additional within occasion covariances between the residuals of physical and role functioning, and between role and social functioning, analogous to Gerlich et al. [33]. The model diagram used for response shift evaluation is presented in Fig. 1. Subsequently the response shift detection process is based on the following three steps that are distinguished through models containing different levels of restriction:

Fig. 1 Diagram of the model for response shift evaluation. Abbreviations: PF physical functioning, RF role functioning, SF social functioning, CF cognitive functioning, EF emotional functioning. Annotations: Rectangles manifest variables, ovals latent variables, circles residuals, straight arrows regression weights, curved arrows covariances. Terminology of the model parameters: r covariance, e residual variance, i intercept, w regression weight, m latent mean, v latent variance, numbers after the underscore indicate the occasion: 1 pretest, 2 posttest, 3 thentest

(1) Unconstrained model. Here the latent variables of all three measurements are fixed to a mean of 0 and a variance of 1 to fully identify the model. This model serves as a baseline model for comparisons with the fully constrained model (next step). If the unconstrained model does not show acceptable fit, reconceptualization response shift between measurements is indicated, and the analysis ends here, because it is not possible to assume comparable concepts in general.

(2) Fully constrained model. This model assumes the null hypothesis (no response shift). Accordingly all parameters (weights, intercepts and residual variances) are constrained to be equal across all three measurements. Here, only the latent variable of the pretest measurement is fixed to a mean of 0 and a variance of 1 to identify the model. Acceptable fit indicates no further types of response shift, and the analysis ends at this step. In the case of poor fit, the following step is taken.

(3) Response shift model. In this model, restrictions from the previous model are freed one after another. A restriction identified as misspecified was released when the release led to a substantial improvement of the model fit. The sequence of releasing begins with residual variances (to detect non-uniform recalibration), followed by the intercepts (uniform recalibration), and then by the weights (reprioritization). The releasing is done until there is no substantial increase in model fit. A released parameter is indicative of response shift, and the type of the released parameter determines the type of response shift (residuals: non-uniform recalibration, intercepts: uniform recalibration, weights: reprioritization).

Unacceptable misspecifications were identified using the combination of the modification index, the power of the MI test, and the expected parameter change [34]. Following the suggestion of Saris et al. [34], we chose the following critical deviations: ten percent of the pretest sample’s variance of the respective functioning scale for residual variances; ten percent of the pretest standard deviation of the scale for regression weights; and for intercepts, we followed the guidelines for interpreting the EORTC QLQ-C30 change scores that were proposed by Cocks et al. [35]. The type I error probability was set to 0.05, and a power of 0.80 was regarded as high. Model fit was assessed with a combinational rule of CFI (comparative fit index) and SRMR (standardized root mean square residual) [36]. Models were rejected if both CFI and SRMR indicated poor fit, that is, if CFI < 0.95 and SRMR > 0.08. To indicate the trade-off between model fit and model complexity, we additionally present AIC (Akaike information criterion). To evaluate model differences, a value of ∆CFI ≥ 0.002 was regarded as substantial model improvement [37].
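The fit criteria described above can be written as two small decision functions. This is only a sketch of the stated thresholds (CFI < 0.95 together with SRMR > 0.08 for rejection, ∆CFI ≥ 0.002 for substantial improvement); the function names are hypothetical, and the fit indices themselves would come from the SEM software.

```python
def model_rejected(cfi, srmr, cfi_cutoff=0.95, srmr_cutoff=0.08):
    """Combinational rule: reject only if both indices indicate poor fit."""
    return cfi < cfi_cutoff and srmr > srmr_cutoff


def substantial_improvement(cfi_new, cfi_old, threshold=0.002):
    """True if releasing a constraint raised CFI by at least the threshold."""
    return (cfi_new - cfi_old) >= threshold


# Hypothetical fit values for a constrained model and a less constrained one
print(model_rejected(cfi=0.90, srmr=0.07))
print(substantial_improvement(cfi_new=0.96, cfi_old=0.90))
```

Under this rule a model with a low CFI is still retained as long as its SRMR stays at or below 0.08, so model comparisons via ∆CFI carry most of the weight in the stepwise release of constraints.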

To judge the share of response shift within the change that occurred between two means, it is helpful to decompose the change into two parts: one part that indicates the difference under the assumption that no response shift occurred, that is, if the parameters had been equal to those of the baseline measurement (called “true change”), and another part that indicates the amount of response shift. The meaning of response shift for the mean of an item can be illustrated by the idea behind the SEM approach. Here a latent variable is assumed to manifest itself in a number of items. As a consequence, the mean of every item can be decomposed into three components: a common, a unique, and a residual quality. However, only the first two are of relevance, because they are shares of response shift that affect the item’s mean.

Oort [13] showed how to decompose the mean difference into three components: the contributions of true change, of uniform recalibration, and of reconceptualization/reprioritization. Note that Oort [13] used the term “observed change” to indicate the change that is not decomposed yet and comprises true change and response shift. Because we use the term “observed change” to indicate the change between the actual baseline (pretest) and follow-up (posttest) measurements, we use the term “observed difference” to indicate the change that is not yet decomposed. Nevertheless, it is also an observed change. The following equation describes the decomposition of response shift effects:

$$ \begin{aligned}
X_{2} - X_{1} \quad & && \text{(observed difference)} \\
&= \left( i_{2} - i_{1} \right) && \text{(uniform recalibration)} \\
&\quad + \left( w_{2} - w_{1} \right) \cdot L_{1} && \text{(reprioritization/reconceptualization without change in } L_{2}\text{)} \\
&\quad + \left( w_{2} - w_{1} \right) \cdot \left( L_{2} - L_{1} \right) && \text{(reprioritization/reconceptualization with change in } L_{2}\text{)} \\
&\quad + w_{1} \cdot \left( L_{2} - L_{1} \right) && \text{(true change)}
\end{aligned} $$

X denotes the mean of the observed score of a manifest variable. It can be decomposed to \( X = i + w \cdot L + e. \) The letter i denotes the intercept (constant), w the regression weight (factor loading, constant) between the latent variable L (factor score, mean) and the manifest variable X, and e indicates the residual factor score (mean), which is set to zero in the equation above, because the residual means are fixed to zero in the model. If the mean of \( L_{1} \) is zero, this equation reduces to that presented by Oort [13, Eq. 8, p. 594]: \( X_{2} - X_{1} = \left( {i_{2} - i_{1} } \right) + \left( {w_{2} - w_{1} } \right) \cdot L_{2} + w_{1} \cdot L_{2} \).
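The decomposition can be checked numerically with a short sketch. It assumes implied means of the form X = i + w · L with residual means fixed to zero, as described above; the class and function names as well as the parameter values are hypothetical and serve only to show that the four components add up to the observed difference.

```python
from dataclasses import dataclass


@dataclass
class Occasion:
    """Model parameters of one scale at one measurement occasion."""
    intercept: float    # i
    weight: float       # w
    latent_mean: float  # L

    def implied_mean(self):
        # X = i + w * L (residual mean fixed to zero)
        return self.intercept + self.weight * self.latent_mean


def decompose(base, comp):
    """Split the difference comp - base into response shift and true change."""
    d_latent = comp.latent_mean - base.latent_mean
    d_weight = comp.weight - base.weight
    return {
        "uniform_recalibration": comp.intercept - base.intercept,
        "reprioritization_without_change": d_weight * base.latent_mean,
        "reprioritization_with_change": d_weight * d_latent,
        "true_change": base.weight * d_latent,
    }


# Hypothetical parameter values, not estimates from the study
pretest = Occasion(intercept=60.0, weight=20.0, latent_mean=0.0)
posttest = Occasion(intercept=58.0, weight=18.0, latent_mean=0.5)

parts = decompose(pretest, posttest)
observed_difference = posttest.implied_mean() - pretest.implied_mean()
print(parts)
print(sum(parts.values()), observed_difference)
```

With the pretest latent mean fixed to zero, the "without change" term vanishes, which corresponds to the reduced form of the equation given above.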

Results

Sociodemographic and medical characteristics

Of the 356 baseline participants, 282 (79%) returned the 3-month follow-up questionnaire. Data from the 74 (21%) participants who dropped out were excluded from the analyses. Table 1 presents the sociodemographic and medical characteristics of both samples (dropout and analysis sample). The proportion of angiopathy patients differed to a statistically significant extent (p < 0.05) between the two samples: 16% in the dropout sample and 7% in the analysis sample. The majority of patients were male (83%), between 50 and 70 years old (73%), had 8 to 10 years of education (71%), and were employed (78%). The most common diagnosis was coronary heart disease (69%). The mean age of the study participants was 56.4 (SD = 8.2) years.

Table 1 Sociodemographic and medical characteristics

Comparison with the general population

Cardiac patients showed worse HRQoL than the general population in all dimensions. Table 2 presents mean scores, standard deviations, and effect sizes (Hedges’ g) for both groups. Hedges’ g for the functioning scales showed only large effects (|g| ≥ 0.80) and ranged from − 0.91 (physical functioning) to − 1.57 (social functioning). Regarding the symptoms, Hedges’ g showed higher levels of burden and ranged from 0.31 (nausea/vomiting) to 1.66 (dyspnoea). Besides dyspnoea, three further symptoms showed large standardized mean differences: fatigue (g = 1.38), financial difficulties (g = 1.25), and insomnia (g = 0.99). The global health/quality of life scale (QL) showed an effect of g = − 0.76.

Table 2 Mean scores and effect sizes

Observed change in HRQoL (posttest–minus–pretest)

Three months after cardiac rehabilitation the means of all five functioning scales were higher than the means that were reported during cardiac rehabilitation (Table 2). The effect sizes of the differences of the functioning scales from pretest to posttest were all positive and statistically significant (p < 0.001). Hedges’ g ranged between 0.18 (cognitive functioning) and 0.29 (social functioning). This increase in HRQoL was accompanied by statistically significant declines in the symptoms dyspnoea (g = − 0.45), fatigue (g = − 0.37), pain (g = − 0.28), and appetite loss (g = − 0.20).

Perceived change in HRQoL (posttest–minus–thentest)

Three months after cardiac rehabilitation the perceived change in physical functioning was greater than the observed change (g = 0.53 perceived vs. g = 0.24 observed, Table 2). All other functioning scales showed lower perceived than observed change. Hedges’ g for these scales ranged from 0.12 (cognitive functioning) to 0.23 (social functioning).

Detection of response shift effects

Using the thentest approach, we found response shift in the physical functioning domain with an effect of g = 0.32 (Table 2, column “pre–then”). To get a more comprehensive picture of the response shift effects, we also used the SEM approach complemented by the thentest. Consequently, we first analyzed the fit of the measurement model for each measurement (pretest, posttest and thentest) separately. The fit of the single models was acceptable: all measurements showed values of CFI > 0.96 and SRMR < 0.04 and therefore no indication of reconceptualization (Table 3). The unconstrained combined model confirmed this assumption with a slightly lower but also acceptable fit (CFI = 0.96 and SRMR = 0.07). The fit of the fully constrained model was marginally acceptable (CFI = 0.90 and SRMR = 0.07), but the decrease of fit (∆CFI = − 0.052) was substantial, indicating other types of response shift. After the release of six constraints, of which each led to a substantial increase in model fit, the final response shift model was found.

Table 3 Response shift detection (n = 282)

The step-by-step procedure revealed all remaining kinds of response shift. The resulting parameters of the response shift model are shown in Table 4. Reprioritization, a change in the importance of an item relative to the others [13], was found in cognitive and emotional functioning. The weights were higher at the pretest measurement. Uniform recalibration, a change in the respondent’s internal standards of measurement [13], was indicated in physical and cognitive functioning. The intercept of physical functioning was lower in the thentest measurement, and the intercept of cognitive functioning was lower in the posttest measurement. Because each of these intercepts differed in only one of the two follow-up assessments, the question arises of why the other follow-up assessment was not affected. It is conceivable that this might be a methodological artefact of the restrictions, and that, if we had freed the intercepts of physical functioning (in pretest and posttest) and the intercepts of cognitive functioning (in pretest and thentest), the uniform recalibration would be distributed across both follow-up assessments. But this hypothesis did not hold. When we freed these parameters, the intercepts changed only minimally (less than one raw point) and the model fit did not increase substantially.

Table 4 Parameters of the response shift model (M6)

Non-uniform recalibration was found in the physical and social functioning domains. While the residual variance of physical functioning was higher in the thentest measurement, the residual variance of social functioning was higher in the pretest measurement. The means of the latent variables (overall functioning quality of life) of the follow-up measurements changed from 0.00 (pretest, fixed) to 0.11 (thentest) and finally to 0.47 (posttest).

Decomposition of response shift effects and recall bias

Table 5 shows the decomposition of the observed differences into true change and contributions of response shift (uniform recalibration and reconceptualization/reprioritization). We present raw differences as well as the pooled standard deviations.

Table 5 Decomposition of response shift effects (raw differences of implied means, range 0–100)

Response shift effects that influenced the observed difference were found in physical, cognitive, and emotional functioning. Regarding physical functioning, we found the influence of the uniform recalibration in the perceived change (lower intercept in the thentest measurement) with + 7.27 points and in the difference of recollection (thentest–minus–pretest) with the opposite sign. This effect increased the perception of change and changed the direction of the systematic deviation between the pretest and its retrospective counterpart. Regarding cognitive functioning, two effects influenced the comparisons. The uniform recalibration (lower intercept in the posttest measurement) reduced the observed and the perceived change by 2.60 points. The reprioritization effect (higher weight in the pretest measurement) lowered the observed change by 2.20 points and the systematic deviation between the thentest and the pretest by 0.51 points. This effect differs between the two comparisons because it depends on the latent mean of the measurement that is compared with the pretest (thentest: 0.11, posttest: 0.47). The effect does not occur in the perceived change comparison because the weights do not differ here. Regarding emotional functioning, the reprioritization effect decreased the observed change by 2.93 points and the systematic deviation between the thentest and the pretest by 0.68 points. When comparing thentest and pretest responses (differences of recollection) and taking response shift into consideration (true change), a systematic deviation (recall bias) in one direction was revealed, that is, the patients seemed to remember their former functioning as having been slightly better than it actually was (raw differences below 2.63 points, effect sizes below 0.1).

Discussion

The first aim of this study was to compare HRQoL of cardiac patients with HRQoL of the general population. At baseline (pretest, during cardiac rehabilitation), the patients’ HRQoL differed significantly from that of the general population on all scales (all effect sizes were above 0.3). Similar results have been reported by other studies, e.g., Juenger et al. [4] for all of the Short Form Health Survey SF-36 scales, and Schweikert et al. [5] for, among others, the usual activities scale of the EQ-5D health states. The two functioning scales that differed the most were role and social functioning. The largest differences in symptoms were found in the scales for fatigue and dyspnoea. Although patients’ HRQoL had increased significantly three months after rehabilitation, their mean scores were still lower than those of the general population.

The second aim was to investigate how changes in HRQoL were observed and perceived. We took a closer look at the functioning scales and found that, over a 3-month period, the patients perceived changes in their HRQoL differently than they were observed. In the physical functioning domain, they perceived more change, and in the role, social, cognitive, and emotional functioning domains, they perceived less change. After taking response shift into consideration, the perceived changes on all functioning scales were lower than observed, whether using the thentest or the SEM approach.

The third aim of this study was to explore response shift effects and indications of recall bias more closely. We identified different kinds of response shift. Uniform recalibration affected physical functioning (thentest) and cognitive functioning (posttest). Regarding physical functioning, the patients judged their actual level of functioning (at pretest and posttest) in the same way, but when they retroactively assessed their former physical functioning (thentest) they reported lower scores. This result is in line with another study on patients undergoing cardiac rehabilitation [7]. It is possible that the social experience of cardiac rehabilitation plays a role, whereby the patients come into contact with other patients with similar levels of physical functioning. They may have learned that their level of physical functioning was actually worse than they thought, started to cope with that, and finally recalibrated their internal values. At follow-up, they reported their recalibrated former physical functioning, but the value of their actual physical functioning did not change. Regarding cognitive functioning, the patients judged their former cognitive functioning equally at the pretest and the thentest, but they judged their current cognitive functioning (posttest) to be lower than before, e.g., 3 months after rehabilitation they felt that they had more difficulties concentrating and remembering. Reprioritization affected cognitive functioning (pretest) and emotional functioning (pretest), indicating that at follow-up (posttest and thentest) the patients attached less importance to cognitive functioning and emotional functioning when they assessed their overall functioning. This means that cognitive and emotional functioning had less impact on their overall assessments. This might be due to diminishing cognitive and emotional strain as a result of new knowledge acquired during cardiac rehabilitation.
The patients may have learned new ways of responding to the emotional challenges of their illness and gained cognitive ability. Consequently, the strains on their cognitive and emotional functioning diminished and lost importance compared to other types of functioning. Non-uniform recalibration affected physical functioning (thentest) and social functioning (pretest). Lower residual variance (social functioning in post- and thentest) indicates less distance to the mean and suggests that the respondents answered in a more differentiated or more precise way in comparison with the pretest measurements. Higher residual variance (physical functioning in thentest) indicates an increase in random error due to less differentiated or less precise answers.

The indication of recall bias was marginal and did not influence the conclusion that was indicated by the thentest approach. The SEM results differed in some effects, but not due to recall bias. A convergent validity study [16] that also took both approaches into account did not find indications of relevant recall bias either. In the thentest approach, response shift is measured with the mean of the difference thentest–minus–pretest. The only statistically significant effect appeared in physical functioning (g = − 0.32), and it was also identified with the SEM approach (raw difference − 7.27 points, g = − 0.37). But the SEM approach revealed an additional recalibration effect, which is in line with another study [38]. The uniform recalibration effect in cognitive functioning of − 2.60 points (g = − 0.10) could not have been detected by the thentest approach, even without the occurrence of any recall bias. This is because this effect appeared only in the posttest measurement of cognitive functioning, which is not reflected in the thentest–minus–pretest difference. Consequently, the two approaches are not equivalent and provide converging results only under very special circumstances (no reconceptualization, no reprioritization, no recalibration in the posttest measurement, minimal recall bias).

Limitations

We analyzed a more or less homogeneous group of patients who underwent cardiac rehabilitation. On the one hand, the generalizability of the results to other kinds of diseases is unclear, but on the other hand all patients had a common catalyst that may explain these findings. A comparison with a control group of patients who are not undergoing cardiac rehabilitation would be needed to attribute the effects to the intervention. Furthermore, the EORTC QLQ-C30 is an instrument developed to assess HRQoL in patients with cancer. Thus, the recorded symptoms are those commonly reported by cancer patients [25]. On the other hand, we based the essential part of our analysis on the functioning scales, which contain the main components that define HRQoL: the functional effects of physical, mental, and social response to disease and treatment [39]. While disease-specific instruments have limited sensitivity in identifying differences between different groups, e.g., in the comparison with the general population, they are still more sensitive to change than generic questionnaires [40].

Conclusions

In summary we found that cardiac patients have markedly worse HRQoL in all dimensions of the EORTC QLQ-C30, even 3 months after cardiac rehabilitation. We found that response shift effects do occur, something that should be taken into account when changes in HRQoL over time are studied. Simple post–pre differences can underestimate real changes. Furthermore, in the case of uniform recalibration (unequal intercepts) a comparison of the latent means of pre-, post- or thentest is not defensible because of the shifted metric of the latent variable. Combining both methods (thentest and structural equation modeling) proved to be essential for detecting more comprehensive evidence of response shift.