Background

Lumbar degenerative spondylolisthesis (LDS) is defined as an anterior slip of one vertebra relative to the adjacent one and is caused by degenerative changes in an intact vertebral arch [1]. Most patients have back pain, radiating pain in the lower extremities, and neurogenic claudication due to concomitant lumbar spinal stenosis [2].

The evidence for the effectiveness of surgical treatment of LDS is moderate [3–5], but the relative effectiveness and criteria for the appropriateness of the different surgical treatment options are unclear [6]. There are two dissimilar principal procedures used today: decompression alone and decompression in combination with a fusion procedure. Meta-analyses and systematic reviews conclude that there is moderate evidence of a tendency towards better outcome when fusion is added to decompression [1, 7–9].

In the US, both the rate of patients receiving instrumented fusion and the rate of complex fusion procedures have increased over the last 15 years [10, 11]. These procedures have higher costs and are associated with more severe complications than decompression alone [10, 11]. In addition, more recently published studies provide some evidence against the use of extensive procedures when operating for LDS [12–14]. Hence, it is important to compare clinical outcome after decompression only with outcome after additional fusion procedures.

The purpose of the present observational cohort study was to compare the relative effectiveness of decompression alone and decompression with fusion in daily clinical practice, 3 and 12 months after surgery for LDS.

Materials and methods

We have followed the guidelines of the ‘Strengthening the Reporting of Observational Studies in Epidemiology’ (STROBE) statement [15].

Collection of data

All data were collected from the Norwegian Registry for Spine Surgery (NORSpine) [16]. NORSpine receives no funding from the industry, and 36 of the 40 centres which perform spinal surgery report to the registry. The patient forms consist of self-administered questionnaires completed before surgery (baseline) and at 3- and 12-month follow-up after surgery. The questionnaires contain patient-reported outcome measures and additional questions at baseline about demographics and lifestyle. During their hospital stays, patients’ data concerning comorbidity, radiological classifications, American Society of Anesthesiologists (ASA) grade, perioperative complications, operation time and operation method were recorded by the surgeon. Peri- and postoperative complications and adverse events were reported by patients on the 3-month follow-up form.

Population/interventions

Inclusion criteria: (1) patients registered in NORSpine during the period 2007–2013; (2) patients assessed by the surgeon to have both spinal stenosis and degenerative spondylolisthesis; (3) patients operated with a decompression procedure or with a decompression and a fusion procedure.

Exclusion criteria: (1) former back surgery at the level to be operated on; (2) operated with an anterior approach; (3) operated with an interspinous device.

Based on the data from the surgeon forms, the patients were categorised into two groups: the decompression group, in which only decompression was performed, and the fusion group, in which the decompression was followed by an instrumented or a non-instrumented fusion procedure.

Primary outcome measures

  1. Numeric Rating Scale (NRS 0–10) for leg and back pain.

  2. Oswestry Disability Index (ODI 0–100) V. 2.0. ODI has been found to be an appropriate questionnaire for assessing treatment outcome in patients with degenerative spondylolisthesis [1]. It has been translated and validated for Norwegian patients [17].

Minimal clinically important difference (MCID) is estimated by Copay et al. to be 12.8 points for ODI, 1.2 for back pain and 1.6 for leg pain [18].

Secondary outcome measures

(1) Global Perceived Effect (GPE) scale [19]: patient-rated satisfaction with treatment outcome was assessed using a single question on a seven-point descriptive scale with the answers ‘completely recovered’, ‘much improved’, ‘slightly improved’, ‘unchanged’, ‘slightly worse’, ‘much worse’ and ‘worse than ever’. The patients were trichotomised into ‘substantially improved’ (‘completely recovered’ and ‘much improved’), ‘slight or no change’ (‘slightly improved’, ‘unchanged’ and ‘slightly worse’) and ‘substantially deteriorated’ (‘much worse’ and ‘worse than ever’); (2) duration of surgery; (3) length of hospital stay; (4) perioperative complications and adverse events registered on the surgeon form; (5) patient-reported complications and adverse events.
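The trichotomisation of the GPE answers described above can be written as a simple lookup. The following is a minimal sketch; the names `GPE_TO_CATEGORY` and `trichotomise` are illustrative and do not come from the registry software.

```python
# Mapping of the seven GPE answers to the three categories used in the analysis.
# The answer and category labels are taken from the text; the names are illustrative.
GPE_TO_CATEGORY = {
    "completely recovered": "substantially improved",
    "much improved": "substantially improved",
    "slightly improved": "slight or no change",
    "unchanged": "slight or no change",
    "slightly worse": "slight or no change",
    "much worse": "substantially deteriorated",
    "worse than ever": "substantially deteriorated",
}


def trichotomise(answer):
    """Return the analysis category for one GPE answer (case-insensitive)."""
    key = answer.strip().lower()
    if key not in GPE_TO_CATEGORY:
        raise ValueError(f"unknown GPE answer: {answer!r}")
    return GPE_TO_CATEGORY[key]


# Hypothetical answers, not registry data.
categories = [trichotomise(a) for a in ["Much improved", "unchanged", "worse than ever"]]
```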

Statistical methods

Propensity score matching (PSM) was used to make the distribution of observed baseline patient characteristics in the decompression and fusion groups as similar as possible [20]. Baseline covariates that could potentially bias the allocation to treatment were considered to be: age, gender, ASA grade, body mass index, smoking, ODI, NRS back pain, NRS leg pain, EQ5D, the presence of foraminal stenosis, degenerative disc disease, scoliosis, predominating back pain, number of levels operated on and neurological palsy. Some data were missing in some of these baseline covariates, resulting in varying sample sizes (N varying from 556 to 616). The net sample with intact information on all variables was N = 464. To prevent the loss of cases used for estimation, the multiple imputation method was used to impute information in covariates prior to calculation of the propensity score. The covariates scheduled for the propensity score were used as predictors for the imputation model. The variables were constrained in order to maintain the measurements’ original range. After a maximum of 50 case draws, one value was imputed when data were missing [21]. The propensity scores were derived from a logistic regression model, and reflect a patient’s theoretical baseline probability of being fused. Using the technique of ‘1:1 optimal matching without replacement’, pairs of fused and non-fused patients with a difference in propensity scores of less than 0.2 standard deviations of the logit of the propensity score were formed [20].
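To illustrate the matching step, the sketch below pairs fused and non-fused patients within a caliper of 0.2 standard deviations of the logit of the propensity score. It is a simplification under stated assumptions: it uses greedy nearest-neighbour matching rather than the optimal matching used in the study, it takes already estimated propensity scores as input, and the function names and patient identifiers are hypothetical.

```python
import math
import statistics


def logit(p):
    """Log-odds of a propensity score p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def caliper_match(fused, non_fused, caliper_sd=0.2):
    """Greedy 1:1 matching without replacement (NOT optimal matching).

    fused / non_fused: dicts mapping patient id -> propensity score.
    A pair is accepted only if the logits differ by less than
    caliper_sd times the pooled standard deviation of the logit.
    """
    logits_f = {pid: logit(p) for pid, p in fused.items()}
    logits_n = {pid: logit(p) for pid, p in non_fused.items()}
    pooled_sd = statistics.stdev(list(logits_f.values()) + list(logits_n.values()))
    caliper = caliper_sd * pooled_sd

    available = dict(logits_n)  # non-fused patients not yet matched
    pairs = []
    for fid, lf in sorted(logits_f.items(), key=lambda item: item[1]):
        if not available:
            break
        # Closest remaining non-fused patient on the logit scale.
        nid = min(available, key=lambda k: abs(available[k] - lf))
        if abs(available[nid] - lf) < caliper:
            pairs.append((fid, nid))
            del available[nid]  # without replacement
    return pairs


# Hypothetical propensity scores, not registry data.
fused = {"F1": 0.30, "F2": 0.55, "F3": 0.90}
non_fused = {"N1": 0.32, "N2": 0.52, "N3": 0.20}
pairs = caliper_match(fused, non_fused)
```

In this toy example F3 remains unmatched because no non-fused patient lies within the caliper, which mirrors how matching reduces the analysed sample relative to the full cohort.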

In a previous study from NORSpine no differences in outcome measures were found when comparing compliers and non-compliers at follow-up [16]. We therefore assumed data to be missing at random [21]. The longitudinal difference analyses were based on the listwise deletion method. Analysis of mean group differences with 95 % confidence intervals (95 % CI) and calculation of proportion of responders at 12-month follow-up were performed. A responder was defined as a patient with an improvement in a primary outcome exceeding the MCID.
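The responder definition can be made concrete with a short sketch. It uses the MCID thresholds quoted from Copay et al. [18] and drops patients with a missing baseline or follow-up value, in the spirit of listwise deletion. The function names and the example scores are hypothetical.

```python
# MCID thresholds quoted in the text [18]; the dictionary keys are illustrative names.
MCID = {"odi": 12.8, "nrs_back": 1.2, "nrs_leg": 1.6}


def is_responder(measure, baseline, follow_up):
    """True when the improvement (baseline minus follow-up) exceeds the MCID."""
    improvement = baseline - follow_up  # lower scores are better on all three scales
    return improvement > MCID[measure]


def responder_rate(measure, pairs):
    """Proportion of responders among (baseline, follow_up) pairs.

    Pairs with a missing value (None) are dropped before the calculation.
    """
    complete = [(b, f) for b, f in pairs if b is not None and f is not None]
    if not complete:
        raise ValueError("no complete pairs")
    hits = sum(is_responder(measure, b, f) for b, f in complete)
    return hits / len(complete)


# Hypothetical ODI scores (baseline, 12 months), not registry data.
example = [(40.0, 20.0), (36.0, 30.0), (50.0, None), (28.0, 15.0)]
rate = responder_rate("odi", example)
```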

In order to include data from all three time points and hence make use of all available data, latent growth curve (LGC) models with full information maximum likelihood were used [22]. Thus, information from patients was not excluded if data were missing at one of the follow-ups. A piecewise growth model was estimated, including one linear change from baseline to 3 months and another from 3 to 12 months. It was thereby possible to make predictions concerning two different change periods: the change during the first 3 months and the change during the last 9 months. Two LGC models were estimated; one unconditional, with level and changes for both groups together, and one including the estimation of the group differences.
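The structure of the piecewise model can be sketched through its time coding. The loadings below (0 and 1 for each change factor) are one common way to code two linear pieces over three occasions and are an assumption, not the specification used in Mplus; all parameter names and numeric values are hypothetical.

```python
# Assumed loadings of the two change factors at each measurement occasion:
# (first piece: baseline to 3 months, second piece: 3 to 12 months).
LOADINGS = {
    "baseline": (0, 0),
    "3 months": (1, 0),
    "12 months": (1, 1),
}


def expected_score(occasion, level, change_0_3, change_3_12,
                   group=0, d_level=0.0, d_change_0_3=0.0, d_change_3_12=0.0):
    """Model-implied mean score at one occasion.

    group is 0 or 1; the d_* arguments are the group differences in level and
    in the two changes (the conditional model). With all d_* equal to zero the
    function reduces to the unconditional model for both groups together.
    """
    first, second = LOADINGS[occasion]
    return ((level + group * d_level)
            + first * (change_0_3 + group * d_change_0_3)
            + second * (change_3_12 + group * d_change_3_12))


# Hypothetical parameter values, chosen only to show the mechanics.
params = dict(level=40.0, change_0_3=-18.0, change_3_12=1.0)
trajectory = [expected_score(o, **params) for o in ("baseline", "3 months", "12 months")]
group1_at_12 = expected_score("12 months", group=1, d_change_0_3=-2.0, **params)
```

With this coding, the first change factor carries the whole baseline to 3-month change and the second carries the whole 3 to 12-month change, so a group difference at 12 months is the sum of the differences in level and in both changes.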

SPSS version 22 was used for descriptive statistics, tests for distribution of data, cross-tabulations with χ² test, Student t tests and Mann–Whitney U tests of group differences and linear and logistic regression modelling. Mplus 7.3 was used for analysing LGC models [23].

The sample size was based on the same assumption as an ongoing randomised multicentre trial on the same patient group (ClinicalTrials.gov Identifier: NCT02051374). In a non-inferiority design, with one sided testing, we wanted to detect whether the proportion of responders was more than 15 % higher in the fusion group than in the decompression group. This was to be tested by forming a 95 % confidence interval (CI) for the difference of proportions (i.e., percentage of responders in the fusion group minus percentage of responders in the decompression group) and was to be rejected if the upper limit of the CI was less than 15 %. If rejected we had to conclude that decompression alone was non-inferior, i.e., as good as, decompression with fusion. A 15 % difference in responder rate corresponds to a number needed to treat of seven patients (NNT = 100/15 = 6.67) [24]. Choosing a type 1 error = 0.05 and power = 0.80 gives a sample size of 116. Considering these assumptions, and adding 25 % for possible dropouts, a total of 155 patients were required in each group [25].
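The arithmetic in this paragraph can be checked with a few lines. The sketch assumes that the dropout allowance was applied by dividing the required sample by (1 − 0.25), since this reading reproduces the stated 155 patients per group; that interpretation, and the function names, are assumptions. The sample size of 116 is taken from the text as given, because the inputs to that calculation are not fully reported.

```python
import math


def number_needed_to_treat(difference_pct):
    """NNT for an absolute difference in responder rate given in percentage points."""
    return 100.0 / difference_pct


def inflate_for_dropout(n_required, dropout_fraction):
    """Patients to enrol so that n_required remain after the expected dropout.

    Assumed reading of the dropout allowance: n_required / (1 - dropout_fraction).
    """
    return math.ceil(n_required / (1.0 - dropout_fraction))


nnt_exact = number_needed_to_treat(15)      # 100 / 15
nnt_rounded_up = math.ceil(nnt_exact)       # whole patients
n_per_group = inflate_for_dropout(116, 0.25)
```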

Results

Out of the 721 eligible patients, 616 met the inclusion criteria. Of these, 294 (48 %) had undergone a fusion procedure in addition to decompression. A total of 260 propensity score matched patients from each group were compared. Seventy-three percent returned the forms at 3 months, 85 % at 12 months and 94 % had at least one follow-up registration (Fig. 1). Table 1 presents baseline data before and after propensity score matching. Twenty-four percent of the decompression group and 66 % of the fusion group underwent a laminectomy. Otherwise a midline preserving decompression was performed. In the fusion group, 83 (32 %) patients were operated without use of instrumentation, 129 (50 %) with use of posterolateral pedicle screw instrumentation and 48 (18 %) with pedicle screws in combination with an interbody fusion.

Fig. 1 Flow chart with enrolment and follow-up

Table 1 Baseline characteristics before and after propensity score matching

Primary outcome measures

Figure 2 illustrates the changes in primary outcomes for both groups. Table 2 presents the measures at baseline and 3 and 12-month follow-up, the change in scores from baseline to 12 months and the proportion of patients with an improvement in a primary outcome exceeding the MCID. At 3- and 12-month follow-up, there were statistically significantly lower levels of back and leg pain in the fusion group as compared with the decompression only group (mean difference at 12 months −0.6, 95 % CI −1.2 to −0.05, p = 0.03 for leg pain and mean difference −0.6, 95 % CI −1.1 to −0.1, p = 0.02 for back pain). There were no significant ODI differences between the groups at either 3 or 12 months (mean difference at 12 months −2.3, 95 % CI −5.8 to 1.1, p = 0.18). At 12-month follow-up, we found a statistically significant difference between the proportion of patients in each group with a reduction of back pain exceeding MCID (74 versus 63 %, difference 11 %, 95 % CI 2–20 %, p = 0.01) favouring the fusion group. The corresponding difference in responder rate (fusion as compared to decompression alone) was not statistically significant for leg pain (74 and 67 %, respectively, difference 7 %, 95 % CI −1 to 16 %, p = 0.09) or for ODI (67 and 59 %, respectively, difference 8 %, 95 % CI 0–18 %, p = 0.06).

Fig. 2 Error bars for the propensity score matched cohort representing means and 95 % confidence intervals for ODI, NRS back pain and NRS leg pain at baseline, 3 and 12 months

Table 2 Outcome measures for the propensity score matched cohort at baseline, 3 and 12 months, mean changes from baseline to 12 months and proportion of patients with improvement exceeding MCID from baseline to 12 months

Table 3 shows the results analysed by LGC models. When both groups were analysed together we found a significant (p < 0.01) reduction in ODI and back and leg pain from baseline to 3-month follow-up. For leg pain a significant deterioration was found from 3 to 12 months. At 12-month follow-up a significantly lower level of leg and back pain was noted in the fusion group as compared to the decompression alone group. Estimated differences were −0.55 for leg pain and −0.61 for back pain. We found no significant differences between the groups assessed by change in ODI and back and leg pain from either baseline to 3 months or from 3 to 12 months.

Table 3 Results for the propensity score matched cohort estimated with Latent Growth Curve models

Secondary outcome measures

The mean duration of surgery was 103 min for the decompression group compared to 170 for the fusion group (mean difference 68 min, 95 % CI 58–78, p < 0.01). The hospital stay was 7.1 days for the fusion group and 2.9 days for the decompression group (mean difference 4.2 days, 95 % CI 3.5–4.8, p < 0.01). There were no significant differences in the proportion of perioperative complications [8 % (fusion) versus 5 %, p = 0.29] and patient-reported complications during the first 3 months [13 % (fusion) versus 15 %, p = 0.53] (Table 4).

Table 4 Surgical parameters and data from the hospital stays of the propensity score matched cohort

Table 5 presents the results for Global Perceived Effect score. Seventy percent of the fusion group versus 62 % of the decompression group rated themselves as ‘completely recovered’ or ‘much improved’ (difference 8 %, 95 % CI −11 to 17 %, p = 0.09), and 4 % in both groups rated themselves as ‘much worse’ or ‘worse than ever’ (p = 0.96).

Table 5 Patient-reported benefits of operation at 12-month follow-up according to the global perceived effect score for the propensity score matched cohort

Discussion

The difference in proportion of responders assessed by back pain was 11 %. This corresponds to a number needed to treat of 9 (95 % CI 5–50). Consequently, nine patients need to be fused to achieve one extra patient with reduced back pain exceeding MCID. The difference in responder rate was not statistically significant for leg pain or ODI. The upper limits of the 95 % confidence intervals for differences in responder rate were 20 % (back pain), 18 % (ODI) and 16 % (leg pain) all of which exceed the proposed clinically relevant difference of 15 %. Due to these results we cannot assert non-inferiority for decompression alone.
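The number needed to treat and the non-inferiority reasoning in this paragraph follow directly from the reported differences and confidence limits. The sketch below reproduces them; the function names are illustrative, and the inputs are the percentages reported in the Results.

```python
def nnt_with_ci(diff_pct, ci_low_pct, ci_high_pct):
    """NNT and its 95 % CI from a difference in responder rate (percentage points).

    The NNT interval is obtained by inverting the limits of the difference,
    which is only meaningful when the whole interval is above zero.
    """
    if ci_low_pct <= 0:
        raise ValueError("CI for the difference includes zero; NNT interval undefined")
    return (round(100 / diff_pct), round(100 / ci_high_pct), round(100 / ci_low_pct))


def non_inferior(upper_limit_pct, margin_pct=15):
    """Decompression alone is non-inferior only if the upper CI limit is below the margin."""
    return upper_limit_pct < margin_pct


# Back pain: difference 11 %, 95 % CI 2 to 20 % (values from the Results).
nnt, nnt_low, nnt_high = nnt_with_ci(11, 2, 20)

# Upper 95 % CI limits for the difference in responder rate (fusion minus decompression).
upper_limits = {"back pain": 20, "ODI": 18, "leg pain": 16}
verdicts = {outcome: non_inferior(limit) for outcome, limit in upper_limits.items()}
```

Because every upper limit exceeds the 15 % margin, none of the three outcomes supports a claim of non-inferiority, which is the conclusion stated above.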

The mean difference at 12-month follow-up in leg and back pain (−0.6 for both) in favour of the fusion group is considered to be small and of uncertain clinical importance. For ODI, the difference of −2.3 units in favour of fusion was not statistically significant.

A borderline significant difference in the percentage of patients rating themselves as ‘much improved’ or ‘completely recovered’ (70 % for the fusion group and 62 % for the decompression group) was also noted.

From a health-economic perspective, the considerably longer duration of surgery and hospital stay contribute to more costs for the fusion group. There were no significant differences between the groups in complications or adverse events.

The evidence for fusion in guidelines and meta-analyses [1, 7, 8] is often based on Herkowitz et al.’s paper from 1991 [26], which is considered to be of low quality according to the GRADE system [27]. Fifty consecutive patients were assigned to decompression with or without non-instrumented fusion. Pain reduction was greater in the fusion group. The result was rated good or excellent in 24 out of 25 patients in the fusion group compared to 9 out of 25 in the decompression alone group.

This year two randomised controlled trials were published [28, 29]. Ghogawala et al. compared laminectomy alone with laminectomy plus instrumented fusion (66 patients included) at 2-, 3- and 4-year follow-up. They found the physical health-related quality of life to be slightly better and more clinically meaningful in the fusion group [29]. Försth et al. randomised 68 patients to decompression alone and 67 to decompression and fusion. No statistical or clinical differences were found between the groups at 2-year follow-up. Interestingly, the mean ODI score in the Swedish study corresponds strongly with the present study at baseline as well as at follow-up.

Several registry studies have been published over the past few years. A study of 213 patients (56 operated on with decompression alone) from one centre participating in the Spine Tango Registry showed better patient-reported outcomes when instrumented fusion was added to decompression. In that study, 86 % of the fusion group versus 70 % of the decompression alone group (p = 0.01) were defined as having a good outcome. When controlling for gender, age, comorbidity, number of levels operated on and baseline outcome scores in a multivariate regression analysis, fusion was a significant unique predictor for a ‘good’ global outcome [30]. Conversely, in a study from the Swedish Registry for Spine Surgery (Swespine), 655 patients operated with decompression only were compared with 651 operated with additional fusion. After adjusting for age, gender, smoking habits, use of analgesics and baseline outcome variables, no differences were found in clinical outcomes at 2-year follow-up; back pain 3.5 versus 3.2 (p = 0.17), leg pain 3.5 versus 3.2 (p = 0.12) and ODI 27 versus 27 (p = 0.93). The Swedish group concluded that a less invasive procedure should be recommended for elderly patients [13]. In another recently published study from Swespine, 839 patients (245 with decompression alone) were analysed in relation to preoperative pain patterns. At 1-year follow-up, after adjusting for age, gender, duration of symptoms, earlier surgery and smoking, the fused patients with predominant back pain had significantly greater reduction in pain and ODI than the non-fused patients. At the 2-year follow-up, no differences were found regardless of pain pattern [14].

Strengths and limitations

Although propensity score matching was applied to reduce the confounding effects of differences in observed baseline characteristics between the groups, unobserved and unknown factors may be dissimilar in the treatment groups. For example, radiological biomechanical factors such as the size of the slip and the degree of instability could lead surgeons to choose one type of surgery over the other. The gold standard for evaluating efficacy in clinical research is a randomised controlled trial (RCT). An ongoing Norwegian RCT will provide better evidence about the additional effect of fusion and eliminate selection bias. However, the present study evaluates how the treatments work in daily clinical practice. The pragmatic design using multicentre registry data ensures a high external validity [31] and complements the recently published RCTs [28, 29] as well as the ongoing Norwegian RCT on the same subject.

The data has been analysed using different statistical methods, and results for the primary outcomes were consistent between the analyses. The clinical outcomes were investigated by the use of appropriate [1] and validated [17] PROMs. The proportion of patients missing in both follow-ups was low, which suggests that it is unlikely that missing data created biased results.

The present study does not address the long-term effect of a fusion procedure. The non-fused patients may develop recurrent stenosis or progression of the slip and the fused patients may be more prone to adjacent level degeneration [32, 33]. However, evidence regarding the influence of surgical procedure on long-term clinical results is sparse and inconsistent [5, 12, 14, 34–37].

Conclusion

In the present study, where the decision-making process was affected by physicians’ and patients’ personal preferences and perceptions, we cannot give evidence that decompression alone is as good as decompression with additional fusion. However, the small differences in the groups’ effect sizes suggest that a considerable number of patients can be treated with decompression alone. In the future, well-powered studies should focus on finding the best individual treatment option for patients with degenerative spondylolisthesis.