Surgery is evolving. The number of procedures is increasing, while hospital stay and the number and severity of complications are decreasing [1]. Patients’ expectations have shifted from merely surviving the operation with manageable complications to recovering their quality of life (QoL) and returning to their baseline level of functioning [2, 3]. Much of this evolution is the result of surgical innovation, including the widespread adoption of laparoscopic surgery [4, 5], the emergence of robotic surgery [6], and the introduction of surgical checklists and care pathways [7, 8]. Nearly all new techniques and processes of care are advocated on the basis that they “improve recovery.” Yet postoperative recovery is a poorly defined and even less well-measured construct [9]. A rapid but transient deterioration in physical capacity is expected immediately after surgery, followed by a more gradual return towards and occasionally beyond baseline [10]. Though this anticipated trajectory is defined, there is no single instrument that has been validated as the gold standard measure of recovery after abdominal surgery [11]. This may reflect the fact that postoperative recovery is in fact a complex and multi-dimensional construct that requires assessment of several interrelated and increasingly complex dimensions [12, 13].

Health-related quality of life (HRQL) instruments are frequently used in research in an effort to capture this complexity and to operationalize the construct of recovery. The Short-Form-36 (SF-36) is one of the most common tools used to measure postoperative recovery [14]. The SF-36 is a generic HRQL questionnaire designed in the early 1990s by the RAND Corporation, and abundant evidence exists to support its validity in a variety of medical contexts [15, 16]. There is a willingness to extrapolate the available validity evidence from the medical to the surgical context. For example, guidelines recommend the use of the SF-36 to measure QoL after laparoscopic surgery [17]. Yet this practice is questionable, and the choice of which instrument to use in a trial of comparative effectiveness as well as the interpretation of the results obtained should be based on the measure’s psychometric properties specifically in the context of interest [18–22]. It should not be assumed that because an instrument is valid in one context (e.g., asthma) it would have similar properties in another context (e.g., recovery after surgery).

The objective of this study was therefore to contribute evidence for the longitudinal (sensitivity to change) and construct (cross-sectional convergent and known groups) validity of the SF-36 as an indicator of postoperative recovery in patients undergoing planned colorectal surgery.

Materials and methods

Participants and setting

We used data collected prospectively within the frame of two separate, previously reported studies approved by the Institutional Research Ethics Board (ethics approval codes REC#02-053 and GEN06-023) [23–25]. The study sample thus consisted of adult patients scheduled to undergo colorectal surgery at one university-affiliated teaching hospital in 2005–2006 or 2009–2010. Exclusion criteria within the frame of these studies were: the presence of a psychiatric condition significantly limiting the patients’ ability to understand and complete the SF-36, baseline mobility restricted by a pre-existing condition, metastatic cancer, contraindications for neuraxial anesthesia, and chronic opioid use. Eligible patients were approached by a research assistant at the time of their visit to the pre-operative clinic, at which point written informed consent was obtained. Participants were evaluated 1 week pre-operatively and at 1 and 2 months postoperatively. At each of these times, they completed the SF-36 and their walking capacity was assessed with the 6-min walk test (6MWT). Baseline demographic characteristics were recorded, and data were also collected on intra- and post-operative parameters, including the occurrence and severity of complications. For the purpose of this validation study, complete case analysis was performed, resulting in the analysis of a subgroup of patients from the combined original datasets.

Measures

The SF-36 (http://www.rand.org/health/surveys_tools/mos/mos_core_36item_survey.html) is a generic self-reported HRQL questionnaire that defines and evaluates three principal health attributes, namely functional status, wellbeing, and general health perceptions and overall QoL [15, 16]. The SF-36 is an instrument used to measure patient-reported outcomes, and will interchangeably be called a questionnaire, an instrument, a measure, or a profile throughout the text. It was developed by the RAND Corporation within the context of the Medical Outcomes Study [26]. This was a 2-year prospective observational study investigating determinants of health outcomes in patients with chronic disease and/or depression and the impact of the health care system on these outcomes. The SF-36 consists of 35 individual questions (items) divided into eight subscales that represent eight domains of health. An additional independent health transition item is present but not included in the scoring algorithm. Patients answer each question using an ordinal scale (0–3 or 0–6, depending on the question). These numerical answers are then recoded according to a pre-specified algorithm to yield scores ranging from 0 (worst health state) to 100 (best health state). The scores on items pertaining to the same dimension are then aggregated to generate a score for each of the eight domains of health (physical functioning, role physical, pain, social functioning, role emotional, vitality, mental health, and general health perception). These subscale-specific scores are then further combined to produce a physical and a mental component summary score (PCS and MCS) [15].
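The two scoring steps described above (recoding each answer onto a 0–100 scale, then aggregating items within a domain) can be illustrated with a minimal Python sketch. It is not the official scoring key: the linear rescaling, the `reverse` flag, the handling of missing answers, and the function names `rescale_item` and `domain_score` are assumptions made for illustration only, and the published algorithm assigns item-specific values.

```python
from statistics import mean


def rescale_item(raw, low, high, reverse=False):
    """Map a raw ordinal response in [low, high] linearly onto 0-100.

    Assumption: linear rescaling; the official key lists item-specific values.
    """
    if not low <= raw <= high:
        raise ValueError("response outside the item's range")
    score = (raw - low) / (high - low) * 100
    # Some items are worded so that a high raw answer means worse health.
    return 100 - score if reverse else score


def domain_score(item_scores):
    """Average the recoded items of one domain, skipping unanswered items (None)."""
    answered = [s for s in item_scores if s is not None]
    return mean(answered) if answered else None


# Hypothetical answers to three items of one domain (not study data).
recoded = [
    rescale_item(3, 0, 3),                # best answer on a 0-3 item
    rescale_item(3, 0, 6),                # midpoint on a 0-6 item
    None,                                 # unanswered item
]
example_domain = domain_score(recoded)
```

Averaging only the answered items is one possible way of handling missing responses; a real analysis would follow the rules in the instrument's scoring manual.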

The 6MWT is a test of performance that evaluates patients’ fitness to sustain an intermediate level of physical activity for a given period of time. Patients are invited to walk along a hospital hallway for 6 min at a pace that should tire them by the end of the 6 min, and the distance covered (in meters) is recorded [27]. This level of fitness is reflective of patients’ ability to perform more strenuous activities of daily living [28]. The 6MWT has been validated as a measure of postoperative recovery after colorectal surgery [25].

Validity evidence

In the context of measurement, validity (or construct validity) is the extent to which a given instrument actually measures what it is intended to measure (the relevant construct). Validity is not absolute, but rather depends on the intended use of the instrument as well as on the target population. In assessing outcomes of an intervention for example, longitudinal validity, or a tool’s sensitivity to expected clinically important changes over time, is also of critical importance. A valid measure of postoperative recovery should reflect the anticipated trajectory of initial deterioration followed by improvement that occurs after an operation. Construct validity, defined above, can be divided into cross-sectional convergent and known-groups validity. The former is the degree of correlation between scores on the instrument and another measure of the same construct. In this case, determining whether scores on the physical functioning domain of the SF-36 are correlated with the distance covered in 6 min would be appropriate. Cross-sectional convergent validity is particularly relevant in the absence of a gold standard metric of postoperative recovery against which to compare the SF-36. Finally, establishing known-groups validity involves determining whether the instrument behaves in a predictable way, allowing differentiation between groups that are expected to be different on substantive grounds. This includes patients with versus without complications, patients versus healthy individuals, and patients undergoing open versus laparoscopic surgery [29].

Our a priori hypotheses for evaluating longitudinal and construct validity were as follows: (1) Longitudinal validity: Scores on selected domains of health will decline significantly from baseline to 3–5 weeks, and improve near baseline at 8–9 weeks. (2) Construct—cross-sectional convergent: At each assessment time, scores on the physical functioning domain of the SF-36 will correlate with scores on the 6MWT. (3) Construct—known groups: Scores on selected domains of health will be lower at 3–5 weeks in patients with complications compared to patients without complications; scores on selected domains of health will be lower at 3–5 weeks in patients compared to a healthy population (Canadian norms [30]); and scores on selected domains of health will be lower at 3–5 weeks in patients undergoing open compared to those having laparoscopic surgery.

Statistical analyses

Standardized response means (SRM), defined as the change in scores divided by the standard deviation of this change, were calculated to determine the evolution of scores on subscales of the SF-36 over time (longitudinal validity). Values between 0.5 and 0.8 are considered moderate, and the sign of the SRM reflects the direction of change. The Wilcoxon signed-rank test was used for significance testing. The magnitude of the change was also considered in relation to the context-specific minimal clinically important difference (MCID) for each of the eight domains of health, which represents the smallest change in an outcome measure that would influence patient management [31, 32]. In previous work, we estimated the MCID for domains of health of the SF-36 to range between 8 (6–9) and 15 (12–18) and between 15 (12–19) and 32 (28–36) points (on a scale of 0–100), depending on the patient’s baseline level of functioning [33]. Spearman’s rank correlation was used to test the cross-sectional convergent validity hypothesis. Known-groups validity was investigated by determining the effect of complications on domain-specific scores, adjusting for age, gender, American Society of Anesthesiologists grade (ASA), and laparoscopic approach. The one sample test for the median was used to compare patients’ scores to Canadian norms. Fisher’s exact test was used to compare the proportion of patients having returned to baseline among individuals with versus without complications.
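As an illustration of the SRM definition given above (mean change divided by the standard deviation of the change), a minimal Python sketch follows. The function name `standardized_response_mean` and the example scores are hypothetical and are not study data; the sample standard deviation (n − 1 denominator) is assumed.

```python
from statistics import mean, stdev


def standardized_response_mean(before, after):
    """SRM = mean of the paired changes / sample SD of those changes."""
    if len(before) != len(after):
        raise ValueError("paired score lists must have equal length")
    changes = [a - b for b, a in zip(before, after)]
    return mean(changes) / stdev(changes)


# Hypothetical subscale scores (0-100) for five patients; not study data.
baseline = [80, 75, 90, 60, 85]
one_month = [60, 70, 65, 55, 70]

srm = standardized_response_mean(baseline, one_month)
# A negative SRM indicates deterioration; its absolute value gives the magnitude.
```

The sign convention matches the text: the sign of the SRM reflects the direction of change, while its absolute value is compared with the conventional thresholds.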

Allowing a 5 % probability of committing a type I error (α = 0.05), 126 patients would have been required to detect an MCID of 15 points with 80 % power.

Statistical significance was defined a priori as p < 0.05. Data analysis was conducted using the statistical program STATA (Version 11.2, StataCorp, College Station, TX, USA). Results are presented as mean (95 % confidence interval), median [25th; 75th percentile], and n (%) where appropriate.

Results

A total of 128 (of 194 available) patients with data at all three time points were included in the analysis. After generating missing PCS and MCS scores using accepted algorithms, 66 patients (34 % of the original sample) were excluded from the complete case analysis. The demographic and operative characteristics were similar between the included and excluded patients. Baseline characteristics of the study sample are presented in Table 1. Patients were mostly men, with a median age of 63 years old [52; 73], mildly overweight with a mean BMI of 27 kg/m² [20, 37], and in relatively good health (with 84 % having ASA I or II). They had predominantly undergone segmental colectomies (41 %) or low anterior resections (32 %). Less than 25 % of patients had received a stoma, and a laparoscopic approach was used in 54 % of all operations. Follow-up clinic visits occurred between 3 and 5 weeks and between 8 and 9 weeks, primarily as a consequence of scheduling conflicts.

Table 1 Patient and operative characteristics of the study sample

Longitudinal validity

Compared with baseline, scores on six of the eight domains of health (physical functioning, role physical, pain, social functioning, role emotional, and vitality) and the PCS had decreased significantly by the first postoperative appointment (p < 0.01). Decreases in scores on these domains ranged from −7 (−11, −3) to −42 (−52, −32). The same six domains of health had subsequently improved significantly between the first and second postoperative visits (p < 0.01), with changes ranging between +6 (2, 9) and +32 (24, 40). Moreover, for the physical functioning, role physical, pain, and social functioning domains, which represent biophysical constructs, changes in scores were equal to or greater than the corresponding MCID, both during the deterioration and recovery phases. SRMs for deterioration were between 0.26 and 0.86 (small to large); those for recovery were between 0.32 and 0.79 (small to moderate).

The change from baseline to 1 month was minimal and not significant for the mental health (p = 0.99) and general health perception (p = 0.13) subscales and for the MCS (p = 0.53).

At 2 months, patients had mostly recovered to baseline, but continued to report some limitations on the role physical subscale, with scores 10 points (0.34, 19) below baseline.

Median scores illustrating this trajectory and changes over time are presented in Table 2.

Table 2 Perioperative changes in scores on measures used to evaluate recovery after colorectal surgery

Defining “return to baseline” as a score within 10 % of the baseline score, the percentage of patients that were below baseline, at baseline, or above baseline is presented in Table 3 for each subscale of the SF-36 at 1 and 2 months. The analysis was repeated using Canadian norms rather than baseline with no substantive changes in the results obtained.
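The classification rule above can be sketched in a few lines of Python. The sketch assumes that "within 10 % of the baseline score" means an absolute difference of at most 10 % of the patient's own baseline value; the function names `recovery_status` and `status_percentages` and the example scores are hypothetical.

```python
from collections import Counter


def recovery_status(baseline, follow_up, tolerance=0.10):
    """Classify a follow-up score relative to the patient's own baseline.

    Assumption: the margin is tolerance * baseline, applied symmetrically.
    """
    margin = tolerance * baseline
    if follow_up < baseline - margin:
        return "below baseline"
    if follow_up > baseline + margin:
        return "above baseline"
    return "at baseline"


def status_percentages(pairs):
    """Percentage of patients in each category for (baseline, follow_up) pairs."""
    counts = Counter(recovery_status(b, f) for b, f in pairs)
    return {status: 100 * n / len(pairs) for status, n in counts.items()}


# Hypothetical (baseline, follow-up) scores for four patients; not study data.
example = status_percentages([(80, 70), (80, 75), (80, 90), (50, 52)])
```

Under this reading, a patient with a baseline score of 0 can only be classified as at or above baseline, which is one reason the interpretation is flagged as an assumption.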

Table 3 Percentage of patients who had returned to baseline at each assessment time

Construct validity: cross-sectional convergent

Significant positive correlations were found between physical functioning scores and 6MWT distance at 1 and 2 months (Spearman’s r = 0.31 and 0.36, respectively, p < 0.01). This correlation only approached significance 1 week pre-operatively (Spearman’s r = 0.16, p = 0.07). A positive correlation was also identified between the PCS and the 6MWT distance at 1 month (Spearman’s r = 0.22, p = 0.02). Small non-significant correlations were observed between the seven other subscales and 6MWT distance.

Within each domain of health, higher baseline scores were found to correlate with higher subsequent scores (Spearman’s r between 0.20 and 0.66, p < 0.01). This was also true for both the PCS and the MCS (Spearman’s r between 0.29 and 0.55, p < 0.01).
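For readers unfamiliar with the statistic reported in this section, Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The following Python sketch shows the computation without external libraries; the function names and example values are hypothetical and are not study data, and no p value is computed.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical physical functioning scores and 6MWT distances (m); not study data.
pf_scores = [45, 60, 60, 85, 90]
walk_m = [310, 400, 380, 450, 520]
rho = spearman_rho(pf_scores, walk_m)
```

In practice a statistical package would be used, since it also provides the significance test reported alongside each coefficient.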

Construct validity: known groups

Twenty-eight patients experienced Clavien-Dindo I or II [34] complications in the postoperative period, with an additional nine experiencing grade III and higher complications.

Baseline scores on seven of the SF-36 domains did not differ between patients with versus without complications (p values from 0.08 to 0.90). Baseline PCS and MCS were also similar between the two groups (p = 0.14 for both PCS and MCS). The only difference was found in baseline general health perception, with lower scores in patients who subsequently developed a complication (−10 (−18, −2), p = 0.01).

At 1 month and after adjusting for age, gender, ASA class, and laparoscopic approach, scores on all eight subscales were lower in patients with complications by 7–18 points. Although this difference was not statistically significant for three of the eight domains, the lower bound of the 95 % confidence interval was nevertheless highly suggestive of a possible clinically relevant negative effect of complications (Table 4). In addition, a greater proportion of patients without complications had recovered to baseline at 2 months when compared to patients with complications. This was statistically significant for three of the four biophysical domains (role physical 47 vs. 26 %, p = 0.03; pain 74 vs. 45 %, p < 0.01; social functioning 81 vs. 45 %, p < 0.01; physical functioning 62 vs. 45 %, p = 0.07) as well as for the PCS (73 vs. 50 %, p = 0.02).

Table 4 Scores on the SF-36 domains at 3–5 weeks postoperatively in patients without complications, and differences in scores in patients with complications

After adjusting for complications, age, and gender, no significant differences were identified between patients having had laparoscopic versus open surgery, with scores in the laparoscopic group ranging from 11 points higher (−4, 26) on the role physical domain to 7 points lower (−15, 1) on the general health perception domain. Similarly, except for a marginal benefit on the role physical domain (in bold in Tables 5, 6), no significant differences were identified between a laparoscopic and an open approach among patients without complications, or among patients with no or with Clavien I or II complications. This lack of a significant difference between laparoscopic and open surgery was demonstrated at both follow-up times (Tables 5, 6).

Table 5 Difference in scores on the SF-36 domains at 3–5 weeks in patients undergoing laparoscopic versus open surgery
Table 6 Difference in scores on the SF-36 domains at 8–9 weeks in patients undergoing laparoscopic versus open surgery

One month after surgery, scores on all subscales were significantly lower in patients when compared to corresponding Canadian norms. The differences between patients and healthy individuals were moderate to large on six of the eight domains and on the PCS and MCS. This is shown in Table 7.

Table 7 Differences in scores on the SF-36 domains in patients 3–5 weeks postoperatively compared to Canadian norms

Discussion

The SF-36 is widely used in the clinical setting and in research studies to operationalize the construct of postoperative recovery [14, 35]. Despite extensive evidence of its validity in multiple settings, including orthopedic and spine surgery [36, 37], no studies have specifically investigated its performance in the context of recovery after digestive surgery. This study contributes evidence for the longitudinal and construct (known groups and cross-sectional convergent) validity of the SF-36 as it was applied to a cohort recovering from planned colorectal surgery. The SF-36 was responsive to clinically meaningful changes and discriminated between patients and healthy individuals as well as between patients with versus without complications. Scores on the physical functioning domain correlated with the 6MWT, a measure of submaximal exercise capacity. However, it did not differentiate recovery after laparoscopic and open surgery. These findings support the use of the SF-36 as a metric of postoperative recovery, but also underscore the limitations inherent to using generic measures of HRQL in this context.

The SF-36 was responsive to physiological postoperative changes, with scores on all subscales but mental health and general health perception being significantly lower than baseline at 1 month and having returned to baseline at 2 months. At 2 months, scores on all domains were back to baseline except for role physical, which remained significantly below baseline. These findings are consistent with the substantive difference between biophysical and emotional parameters. Indeed, patients’ state of mind after surgery may reflect the relief associated with “being cured” or simply coming through the surgery itself, which dissipates over time as they return to their baseline functional level. Emotional domains are consequently not expected to follow the same trajectory of deterioration and improvement as physical function parameters. Previous work reports the MCID for four biophysical domains of the SF-36 [33]. It is interesting to note that the magnitude of the changes observed in the current study was equal to or exceeded these subscale-specific MCIDs, though this may in part be related to the partial overlap between the datasets used in these two studies.

As the SF-36 is a multidimensional generic HRQL questionnaire, it should not be expected that all domains would meet construct validity criteria, as the SF-36 was not designed to specifically target postoperative recovery. Physical functioning at 1 and 2 months and 1-month PCS correlated with distance walked in 6 min. Though the correlations were not strong, they support the substantively relevant hypothesis that subscales that most closely reflect physical performance would correlate with a more objective measure of the same construct. Furthermore, the SF-36 discriminated between patients and healthy individuals on all domains, as well as between patients with versus without complications on five of eight domains. Interestingly, the ability to capture the emotional burden and the impact on general health perception associated with experiencing any degree of complication supports the validity of the SF-36 as a HRQL measure. However, even after appropriate adjustments, the SF-36 did not identify differences between the subgroups of patients undergoing laparoscopic versus open surgery, which is not unexpected given that this questionnaire was neither designed to measure postoperative recovery nor to capture differences between laparoscopic and open approaches. This suggests that such a generic instrument may not be optimal for detecting differences between surgical approaches. This finding is unlikely to be the result of selection bias, as the cohort of patients included in this study was representative, regarding their demographic and operative characteristics, of a typical colorectal surgery population.

A large number of HRQL instruments are currently being used to evaluate patient-centered outcomes during the postoperative recovery period. Strengths of the SF-36 include its generic nature, allowing recovery to be assessed and comparisons to be made between interventions and settings, as well as its ability to capture multiple dimensions of patient outcomes by targeting eight relevant HRQL domains [13, 17]. The SF-36 is also simple to complete, either independently or as administered by an interviewer, in person or by telephone [17]. It was developed based on rigorous measurement methodology, and has since then been shown to be reliable, valid, and sensitive to change in many contexts. It has consequently been broadly translated and adapted to many cultural frameworks. Guidelines from the European Association for Endoscopic Surgery recommend the SF-36 as appropriate in a variety of contexts [17]. Yet these guidelines also highlight the fact that the validity evidence on which some recommendations are based is extrapolated from robust studies that were nonetheless not conducted in surgical populations. Even after adjusting for potential confounders that may overwhelm more subtle differences, the SF-36 fails to detect differences in instances where one might expect them, such as between laparoscopic and open colorectal resections [35, 38–41]. Although the apparent absence of a difference in a trial may have several explanations, including true equivalence, lack of power, and use of an inappropriate instrument [42, 43], generating evidence to rule out the latter two factors is a necessary step toward the useful interpretation of research results. Thus, if researchers use the SF-36 in a surgical population and power comparative studies based on the context-specific MCIDs provided, it is presumed that true differences would be identified. This being said, if the research aims to compare recovery after laparoscopic and open colorectal resection, the SF-36 may not be sufficiently sensitive to discriminate between the two in this context.

Confirmation of the validity of the SF-36 as a measure of recovery after colorectal surgery supports its widespread use in practice, both at ours and at other institutions. Nevertheless, its inability to discriminate outcomes in patients undergoing laparoscopic versus open surgery is concerning and will steer us away from this instrument when comparing these two approaches. An awareness of this limitation is important when planning and interpreting trials, though a gold standard metric of recovery to replace the SF-36 in this context does not yet exist. Work is therefore required towards the development of such a measure, in addition to larger prospective validation studies for the SF-36, as detailed below.

Strengths and limitations

Strengths of this study include its power to detect clinically meaningful differences. Moreover, a relatively homogeneous and representative patient population was used, and validity was assessed using several approaches.

Nevertheless, the principal limitation of this study is the absence of a gold standard measure of postoperative recovery. Criterion validity could consequently not be established, and surrogate validity standards had to be used. This limitation may be partially addressed, however, by the fact that sensitivity to change is perhaps the most important aspect of validity for a measure of outcomes after an intervention [29]. Further studies will be required in other surgical populations to assess the generalizability of these results.

Another limitation is the use of data collected prospectively within the frame of studies other than the current one. The inclusion and exclusion criteria as well as the timeframe for the follow-up visits were selected specifically for the purpose of these other previously published studies. Importantly, though we do not suspect that the overall validity evidence supporting the SF-36 as a measure of postoperative recovery would be affected, data at 2 weeks for example may have revealed a difference between laparoscopic and open surgery. Studies [44, 45] have shown that the benefits of laparoscopic surgery, namely decreased pain and length of stay and a faster return to work, are most pronounced in the immediate postoperative period. These differences tend to disappear over time, as patients return to their baseline functional status and activities. Thus, in designing a prospective study to assess the validity and discriminative properties of a given recovery metric, we would deliberately include a follow-up visit within 2 weeks of surgery in addition to considering slightly different inclusion and exclusion criteria. We would also schedule a follow-up visit further downstream (6 months to a year) to determine the evolution of patients’ function and QoL and whether they have returned to Canadian norms or not.

These limitations underscore the need for a large prospective study specifically designed to assess the validity of the SF-36 as a measure of recovery after abdominal surgery. Such a study would be adequately powered to allow subgroup analyses (by severity of complications, for example) as well as analyses at multiple time points and comparisons between several commonly used patient-reported outcome metrics.

Conclusion

Postoperative recovery is a complex construct for which a gold standard measure is not yet available. The SF-36, a generic HRQL questionnaire, is the most widely used metric in this context. The present study provides evidence of the validity of this instrument to quantify recovery after colorectal surgery in general. It also emphasizes the importance of being aware of the psychometric properties of each instrument in the specific context in which it is used. Only when a valid measure is used in an adequately powered study can the results be interpreted as evidence for or against a true difference in effectiveness.