Introduction

Spondylolisthesis is defined as the forward slippage of one vertebra on another. Surgical fusion is an important method of stabilizing the spine in lumbar spondylolisthesis and is used to reduce pain and decrease disability in patients with chronic low back pain [1]. With the development of modern surgical techniques, different fusion methods are currently available [2–8]. The surgical procedures that have been advocated include anterior interbody fusion, posterior interbody fusion, posterolateral fusion, repair of the pars interarticularis, and reduction and fusion [7–11]. Posterolateral fusion (PLF) and posterior lumbar interbody fusion (PLIF) are common choices among the various techniques available for the treatment of lumbar spondylolisthesis. Many studies have compared these approaches regarding technical demands, radiological outcomes, and clinical results; however, the scientific evidence in favor of PLIF remains weak. Inconsistent outcomes and differing surgeon preferences between the two fusion methods have made it difficult to reach a consensus on the optimal fusion technique. Thus, the goal of this study is to evaluate the effectiveness of PLIF compared with PLF and to determine whether conclusions can be drawn about which of the two fusion methods is better for treating patients with spondylolisthesis.

Materials and methods

Literature search

Relevant randomized controlled trials (RCTs) and comparative studies published from January 1960 to December 2012 were identified by searching the MEDLINE, EMBASE, and Cochrane Central Register of Controlled Trials databases and by manually searching the journals Spine, European Spine Journal, and Journal of Bone and Joint Surgery from January 1990 to December 2012. Keywords and medical subject headings related to the condition and potential treatment were identified prior to initiating the search. The search strings are shown in Table 1. Minor modifications of the search strings were required between different databases. Gray literature, including books and conference papers, was collected, and these studies were included if they met the inclusion criteria. No linguistic restriction was imposed on the search, as recommended by the Cochrane Back Review Group editorial board [12]. Two investigators independently reviewed all subjects, abstracts, and the full text of articles that were potentially eligible based on abstract review. The eligible trials were then selected according to the inclusion criteria.

Table 1 Strings for electronic search

Study eligibility criteria

We systematically reviewed published studies according to the following criteria: (1) subjects who were 18 years or older and who had undergone spinal fusion for lumbar spondylolisthesis; (2) the interventions were PLIF and PLF; studies were excluded if patients accepted posterolateral fusion and interbody fusion simultaneously; (3) the study reported at least one desirable outcome; (4) all included patients were followed up for at least 1 year after surgery; and (5) studies were excluded if more than 5 % of participants had an acute spinal fracture, infection, revision [13], tumor, osteoporosis, rheumatoid arthritis, degenerative kyphosis, or degenerative lumbar scoliosis exceeding 10°.

Risk of bias assessment

The checklist by Furlan [14, 15] was used to evaluate the methodological quality of RCTs. Evaluation of the nonrandomized controlled studies was performed with the checklist used by Cowley [16]. One reviewer assessed the risk of bias of the included studies. The items were scored with “yes,” “no,” or “unsure.” A Furlan score of 6 or more out of a possible 12, or a Cowley score of 9 or more out of a possible 17, was considered to reflect “high methodological quality.”
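The two cut-offs stated above can be written as a small rule. The following Python sketch is only an illustration of that rule; the function name `is_high_quality` and the dictionary layout are hypothetical and are not part of the review's methods.

```python
# Illustrative sketch of the quality thresholds described in the text.
# Names are hypothetical; only the cut-offs (6/12 and 9/17) come from the text.
THRESHOLDS = {
    "furlan": (6, 12),   # RCTs: cut-off 6 out of a possible 12
    "cowley": (9, 17),   # nonrandomized studies: cut-off 9 out of a possible 17
}


def is_high_quality(score: int, checklist: str) -> bool:
    """Return True if the score meets the cut-off for the given checklist."""
    cutoff, maximum = THRESHOLDS[checklist.lower()]
    if not 0 <= score <= maximum:
        raise ValueError(f"score must be between 0 and {maximum}")
    return score >= cutoff
```

For example, a Furlan score of 5 would fall below the cut-off, whereas a Cowley score of 11 would meet it.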

Clinical relevance

Two reviewers independently assessed the clinical relevance of the included studies according to five questions that were recommended by the Cochrane Back Review Group [14]. Each question was scored “Yes” if the clinical relevance item was met, “No” if the item was not met, and “Unclear” if data were not available to answer the question. A 30 % improvement in pain scores [17] and in functioning outcomes from baseline was considered to be clinically important.
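As a worked illustration of the 30 % criterion, the sketch below computes the relative improvement from baseline. It assumes a scale on which lower scores are better (as for pain scores or the ODI); the function name and the example values are hypothetical.

```python
def clinically_important(baseline: float, follow_up: float,
                         threshold: float = 0.30) -> bool:
    """Check whether the relative improvement from baseline meets the threshold.

    Assumes lower scores are better, so improvement = baseline - follow_up.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    relative_improvement = (baseline - follow_up) / baseline
    return relative_improvement >= threshold
```

With a hypothetical baseline score of 60, a follow-up score of 40 is an improvement of about 33 % and would count as clinically important, whereas a follow-up score of 45 (25 %) would not.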

Fig. 1 Flowchart of the study selection process

Data collection

The data were independently extracted by two reviewers who had expertise in spinal diseases, and any disagreement that arose was discussed and resolved by consensus. The data retrieved included the following items: participant characteristics, study characteristics, specific intervention, and resultant outcomes of the comparison results. The desirable outcomes measured in this analysis were classified into primary outcomes and secondary outcomes. The clinical satisfaction was considered as the primary beneficial outcome. The complication rate was considered to be the primary harmful outcome. Secondary outcomes included fusion rate, reoperation rate, operating time, blood loss, and postoperative spinal alignment. The assessment of clinical satisfaction was based on the scores of ODI and the Prolo Economic and Functional scale and the objective evaluation from patients. “Excellent”, “very good”, “much better”, “successful”, “satisfied”, or other similar appraisal was considered a satisfactory clinical result, whereas “fair”, “slightly satisfied”, “poor”, “worse”, “failure”, or other similar appraisal was considered an unsatisfactory clinical result. Complications were categorized into major and minor complications according to the severity and influence on daily life. Major complications included deep infection, permanent nerve deficit, instrument failure, nonunion and revision, pulmonary embolus, and urinary tract infection with bacteremia. Minor complications consisted of superficial infection, transient nerve palsy, pain in the bone graft donor site, nonunion not revision, and urinary tract infection.

Data analysis

A meta-analysis was performed on the extracted data with RevMan 5.0 software (Cochrane IMS) using a random-effects model. For dichotomous variables, the odds ratio (OR) and 95 % confidence interval (CI) were calculated. For continuous variables, the mean difference and 95 % CI were calculated when outcome measurements in all studies were made on the same scale; the standardized mean difference and 95 % CI were calculated when the trials assessed the same outcome but used different measurement methods. The I² statistic was used to assess heterogeneity. An I² value <25 % was considered to be homogeneous, between 25 and 50 % to be of low heterogeneity, between 50 and 75 % to be of moderate heterogeneity, and above 75 % to be of high heterogeneity.
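The quantities named above can be illustrated with a short Python sketch. It is not the RevMan implementation: the single-study OR uses the standard log (Woolf) method for a 2×2 table, I² is derived from Cochran's Q as max(0, (Q − df)/Q) × 100, and the handling of the exact boundary values (25, 50, and 75 %) is an assumption, because the text does not say which category they belong to. All function names are hypothetical.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and 95 % CI for one 2x2 table (log/Woolf method).

    a, b: events and non-events in group 1; c, d: the same in group 2.
    All cells must be non-zero (no continuity correction is applied here).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def i_squared(q: float, n_studies: int) -> float:
    """I² (in %) from Cochran's Q and the number of studies."""
    if n_studies < 2:
        raise ValueError("at least two studies are needed")
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


def classify_heterogeneity(i2: float) -> str:
    """Map I² (%) to the categories used in the text.

    Boundary values are assigned to the lower category (an assumption).
    """
    if i2 < 25:
        return "homogeneous"
    if i2 <= 50:
        return "low"
    if i2 <= 75:
        return "moderate"
    return "high"
```

Under these definitions, an I² of 58 % would be classified as moderate heterogeneity and an I² of 92 % as high.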

Quality of evidence

The quality of the evidence for each outcome was evaluated using a rating system with four levels recommended by the Grading of Recommendations Assessment, Development and Evaluation Working Group [14]. The level of evidence was mainly determined by RCTs. Nonrandomized controlled studies were complementary to the findings of RCTs. The initial strength of the overall body of evidence was considered “high” if the majority of RCTs were of high methodological quality. We downgraded the quality of evidence by one or two levels on the basis of the following five domains [18]: limitations of the study design [19], inconsistency [20], indirectness [21], imprecision (insufficient or imprecise data) of results [22], and publication bias [23]. The evidence was upgraded on the basis of the following criteria: (1) rated up one level when methodologically rigorous observational studies showed at least a twofold reduction or increase in risk, and rated up two levels for at least a fivefold reduction or increase in risk; (2) rated up one level for a dose–response gradient [24]. The level of evidence from nonrandomized controlled studies was upgraded with caution.
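The rating logic described above can be sketched as a walk along an ordered four-level scale. The level names used below are the standard GRADE labels (high, moderate, low, very low); the function name and the clamping at the ends of the scale are illustrative assumptions, not a description of the reviewers' actual procedure.

```python
# Ordered from weakest to strongest evidence (standard GRADE labels).
LEVELS = ["very low", "low", "moderate", "high"]


def rate_evidence(start: str = "high", downgrades: int = 0,
                  upgrades: int = 0) -> str:
    """Move down or up the scale by whole levels, staying within its ends."""
    index = LEVELS.index(start) - downgrades + upgrades
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]
```

For instance, evidence that starts at "high" and is downgraded by one level for imprecision ends at "moderate"; a two-level downgrade ends at "low".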

Results

Search results

The flowchart shows the study selection process (Fig. 1). After the duplicate studies were excluded, a total of 4,394 studies were obtained. Based on the title and abstract, 4,371 reports were excluded because the topic of the article was not relevant to the objective of the review. Among the excluded reports, two possible comparative observational studies were excluded because of a lack of complete information [25, 26]. Finally, we identified a total of nine eligible studies, consisting of four RCTs [27–30] and five comparative observational studies [31–35], that involved a total of 520 patients. The individual sample sizes ranged from 22 to 138 patients. In one comparative study, one patient younger than 18 years of age was enrolled in each group [32]. One patient with traumatic spondylolisthesis was enrolled in the PLF group of one RCT [30]. Both studies were retained because most participants were eligible and the effect of these ineligible cases was considered small.

Five of the studies recruited patients who had been diagnosed with isthmic spondylolisthesis [27, 29, 32–34]. The other four trials selected a mix of patients with degenerative and isthmic spondylolisthesis [28, 30, 31, 35]. Patients with a diagnosis of degenerative spondylolisthesis accounted for 27.3 % of the total patients. All participants had a history of back pain with or without radicular pain. Internal fixation was used in both fusion procedures. Detailed information on the study designs, characteristics of participants, follow-up, interventions, instruments, and outcomes is shown in Table 2.

Table 2 Characteristics of included studies

Risk of bias of included studies

The Furlan scores for the four RCTs ranged from 5 to 9 out of 12 (Table 3). Three RCTs received Furlan scores of six or higher, indicating the overall lower risk of bias of the trials. The most notable methodological shortcomings were uncertainties regarding blinding procedures, with only one “Yes” item for the 4 RCTs. In two studies, there was a clear attempt at concealment of the group allocation and method of randomization [27, 29]. The average follow-up of the trials ranged from 1 to 4 years. The Cowley scores for the five nonrandomized controlled studies ranged from 11 to 13 out of 17 (Table 4). All nonrandomized controlled studies received Cowley scores >9 and represented “high methodological quality”.

Table 3 Risk of bias assessment of included RCTs using the checklist by Furlan
Table 4 Risk of bias assessment of included observational studies using the checklist by Cowley

Clinical relevance

The clinical relevance of the included studies is presented in Table 5. All of the included studies described the interventions and treatment settings in sufficient detail for clinicians to replicate the treatment in clinical practice. Two studies did not report the important clinical end points [29, 35], such as a reduction in pain or improvement in function. In four included trials (one RCT [27] and three controlled clinical studies [31, 34, 36]), the reviewers considered the likely treatment benefits to be worth the potential harm. In five studies [27, 28, 31, 34, 36], the size of the effect was considered to be clinically important. An improvement of more than 30 % in back pain scores was observed in six clinical results. Information about functional outcome was not provided in the study by Dehoux [32].

Table 5 Clinical relevance

Meta-analysis results

No significant difference in demographics, symptoms, level and grade of slip, and preoperative distribution of lifestyle factors was found between the two surgical groups from the included trials. Because of imprecision of results, we downgraded one level of evidence for clinical satisfaction. There was moderate evidence from three RCTs (202 patients) and three observational studies (142 patients) that the PLIF treatment procedure was more effective than the PLF treatment for clinical satisfaction, with an OR of 0.49 (95 % CI 0.28–0.88, P = 0.02). A total of 150 (86.7 %) of 173 patients were satisfied with the surgical outcome in the PLIF group compared to 131 (76.6 %) of 171 patients in the PLF group (Fig. 2). Subgroup analysis revealed inconsistent trends of RCTs and comparative observational studies, with the former showing a comparable outcome (OR 0.59, 95 % CI 0.24–1.44, P = 0.25) and the latter a significantly higher satisfaction rate with PLIF (OR 0.43, 95 % CI 0.20–0.92, P = 0.03). Though no significant difference was found in the subgroup of RCTs, a higher satisfaction rate was found in patients receiving PLIF.
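The satisfaction percentages reported above can be checked directly from the counts, and a crude (unweighted) odds ratio computed from the same totals helps in reading the direction of the pooled estimate. The sketch below assumes the OR is expressed as PLF relative to PLIF, which is inferred from the reported numbers rather than stated in the text. The crude value ignores study weights and is therefore not expected to equal the pooled random-effects OR of 0.49.

```python
# Pooled counts reported in the text.
plif_satisfied, plif_total = 150, 173
plf_satisfied, plf_total = 131, 171

# Satisfaction rates in percent.
plif_rate = 100 * plif_satisfied / plif_total
plf_rate = 100 * plf_satisfied / plf_total

# Crude odds ratio of satisfaction, PLF relative to PLIF (assumed orientation).
plif_odds = plif_satisfied / (plif_total - plif_satisfied)
plf_odds = plf_satisfied / (plf_total - plf_satisfied)
crude_or = plf_odds / plif_odds

print(f"PLIF: {plif_rate:.1f} %, PLF: {plf_rate:.1f} %, crude OR: {crude_or:.2f}")
```

An OR below 1 in this orientation means lower odds of a satisfactory result with PLF, which is consistent with the higher satisfaction rate in the PLIF group.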

Fig. 2 Clinical satisfaction of PLF versus PLIF for the treatment of lumbar spondylolisthesis. The assessment of clinical satisfaction was based on the scores of ODI and the Prolo Economic and Functional scale and the objective evaluation from patients. Significant difference was observed for overall effect, favoring PLIF with higher clinical satisfaction. Subgroup analysis showed inconsistent results between RCTs and observational studies

The indexes applied to assess the improvement in symptoms and function varied among the selected trials and included the 36-Item Short Form Health Survey, Short Form (SF)-12 v2, and SF-6D R2. Statistical analysis was feasible after standardization pooling for comparing functional improvement. Improvement in functional status postoperatively was identified for both interventions, but was more significant in the PLIF group. However, the superiority of PLIF in the reduction of postoperative pain and the improvement of function decreased as the follow-up went on [27, 34]. The level of evidence was downgraded because only one RCT was included in the evaluation of postoperative back pain. There was moderate-quality evidence from one RCT (50 patients) and four observational studies (194 patients) that the PLIF was more effective than the PLF for postoperative back pain with WMD 0.92 (95 % CI 0.48–1.35, P < 0.0001; Fig. 3). Postoperative functional performance was assessed using the Oswestry Disability Index (ODI) questionnaire. The total score ranged from 0 to 100, in which 100 indicates the most severe disability. There was low-quality evidence from two RCTs (202 patients) and three observational studies (142 patients) that the PLIF was more effective than the PLF for improving ODI with WMD 1.30 (95 % CI 0.25–2.35, P = 0.01).

Fig. 3 Postoperative back pain of PLF versus PLIF for the treatment of lumbar spondylolisthesis. Both overall and subgroup analyses showed statistical differences between the two procedures. Relief of back pain was more significant in the PLIF group compared to the PLF group

There was moderate-quality evidence from four RCTs (282 patients) and four observational studies (172 patients) that there was no significant difference in the complication rate [OR 2.28, 95 % CI (0.97, 5.35), P = 0.06], which reflects the primary harm outcome (Fig. 4). The strength of evidence was decreased due to inconsistency between RCTs and observational studies. Subgroup analysis revealed inconsistent trends for RCTs and comparative observational studies, with the former showing a comparable outcome (OR 1.25, 95 % CI 0.33–4.73, P = 0.74) and the latter a significantly lower complication rate with PLIF (OR 4.62, 95 % CI 1.90–11.21, P = 0.0007). The analysis of heterogeneity revealed an overall I² of 58 %, with I² values of 71 and 0 % for the two subgroups, indicating substantial heterogeneity in the RCTs. The study by Inamdar et al. demonstrated a substantially higher complication rate with PLIF, especially postoperatively persistent back pain in four patients. A sensitivity analysis with removal of the study by Inamdar et al. [30] revealed a significantly lower complication rate in the PLIF group (OR 2.50, 95 % CI 1.27–4.95, P = 0.008). With respect to major or minor complications, there was moderate-quality evidence that there was no significant difference between the two procedures [OR 2.65, 95 % CI (0.84, 8.36), P = 0.10; OR 1.62, 95 % CI (0.59, 4.43), P = 0.35].

Fig. 4 Complication rate of PLF versus PLIF for the treatment of lumbar spondylolisthesis. No significant difference was found for the complication rate between the two fusion procedures. I² was >50 %, indicating substantial heterogeneity in the RCTs

In the secondary outcomes, the level of evidence for fusion rate was downgraded because of the imprecision. There was moderate-quality evidence from four RCTs (284 patients) and three observational studies (112 patients) that PLIF was more effective than PLF for improving fusion rate, with an OR of 0.32 (95 % CI 0.17–0.61, P = 0.0006). The fusion rates of the PLIF group and the PLF group were 92.5 and 79.6 %, respectively (Fig. 5). The strength of evidence for reoperation rate was downgraded by one level because of inconsistency between subgroups. There was low-quality evidence from one RCT (50 patients) and four observational studies (172 patients) that the PLIF was more effective than the PLF for the reduction of reoperation rate with OR 5.30 (95 % CI 1.47–19.11, P = 0.01). The reoperation rates of the PLIF group and the PLF group were 3.6 and 17.3 %, respectively (Fig. 6). Subgroup analysis showed inconsistent trends of RCTs and comparative observational studies. Although no significant difference was found in the subgroup of RCTs, patients who accepted PLIF showed a lower reoperation rate.

Fig. 5 Fusion rate of PLF versus PLIF for the treatment of lumbar spondylolisthesis. The fusion rate in the PLIF group was higher than that in the PLF group

Fig. 6 Reoperation rate of PLF versus PLIF for the treatment of lumbar spondylolisthesis. Lower reoperation rate in the PLIF group was observed in this forest plot

There was low-quality evidence from two RCTs that there was no statistically significant difference between the two procedures with regard to blood loss [WMD = 76.52, 95 % CI (−310.68, 463.733), P = 0.70] and operating time [WMD = −1.20, 95 % CI (−40.36, 37.97), P = 0.95] (Figs. 7, 8). The strength of evidence was downgraded by two levels because of imprecision and inconsistency. For blood loss, an I² of 92 % indicated considerable heterogeneity in the included RCTs. The greater blood loss in the PLF group in the study by Musluman et al. [27] may result from the procedure used to collect bone from the iliac wing. For the operating time, an I² of 81 % indicated substantial heterogeneity between the two RCTs. The study by Cheng et al. [28], which was of low methodological quality, reported a longer operating time for the PLF procedure, which could be due to the additional procedure of bone collection from the iliac wing.

Fig. 7 Blood loss of PLF versus PLIF for the treatment of lumbar spondylolisthesis. No significant difference was found between the two groups. The I² was >75 %, indicating considerable heterogeneity in the included studies

Fig. 8 Operating time of PLF versus PLIF for the treatment of lumbar spondylolisthesis. No significant difference was found for operating time between the two groups. Inconsistency was obvious in the included studies

A prospective randomized clinical study with a 2-year follow-up period showed that lumbar lordosis and the segmental angle were restored and maintained in the PLIF group compared to preoperative alignment (P < 0.05) [27]. There was low-quality evidence that PLIF significantly restored the segmental and lumbar lordotic angles, but PLF did not [27].

Discussion

This meta-analysis identified four RCTs and five observational studies that compared PLIF with PLF for lumbar spondylolisthesis. The findings revealed that there was moderate-quality evidence indicating that PLIF had advantages of clinical satisfaction, reduction in postoperative pain, and improvements in fusion rate compared to PLF. Low-quality evidence indicated that there was no significant difference between the two procedures for complication rate, blood loss, and operating time. Patients with PLIF reached better clinical outcomes and satisfaction than patients with PLF.

In the past several decades, the continuous modification and refinement of surgical techniques, such as minimization of the level of neural retraction required and avoidance of broad dissection of the paraspinal musculature during PLIF, have contributed to a reduction in the operative risks, operating time, and blood loss during PLIF [37]. During a PLF procedure, the broad dissection that extends beyond the facet joint may lead to a transient increase in postoperative pain, which can further influence patients’ satisfaction with the procedure. PLIF can overcome these drawbacks and provide anterior column support, which helps to restore lumbar lordosis and intervertebral space height as well as increase fusion rate. A higher fusion rate with PLIF was identified in the current analysis compared to PLF. Some researchers believe that once the unstable segment is successfully fused, mechanical back pain due to a pars defect or facet arthropathy can be reduced, which may contribute to good functional outcomes [38, 39]. Therefore, successful arthrodesis will most likely indicate a satisfactory clinical outcome. However, nonunion and its associated complications may result in postoperative recurrent back pain or even failure of the surgery and reoperation, thus preventing a satisfactory outcome [40–42]. Many studies have postulated that successful fusion status can result in better functional outcome and better satisfaction [43, 44]. This may indicate that PLIF has a clinical advantage over PLF. However, the results from various studies are conflicting [45–48], and a complete correlation between good outcome and fusion rate was not recorded in some studies [33, 49]. Moreover, some studies have indicated that there was no significant difference in the reduction of postoperative pain between the two interventions [50].

The present meta-analysis captured information on the characteristics of included patients, detailed information of fusion procedures, and clinically important end points, which will enable clinicians in the field to determine whether the results apply to their patient population and how to apply these procedures in their clinical practice. For some clinical outcomes, statistical significance does not necessarily mean that the change is clinically important [51]. The minimal important change (MIC) can provide more information and help users evaluate the effect size of interventions. There was wide variation in criteria and statistical techniques used to define MIC in the literature and reviews. In 2008, Ostelo et al. [52] proposed MIC values of 15 for the visual analog scale and 10 for the Oswestry Disability Index. When the baseline score was taken into account, a 30 % improvement was considered to be a useful threshold for identifying clinically meaningful improvements for each of these measures. Thus, we used similar criteria to determine important clinical differences in pain reduction and functional improvement [14]. The effect size in the five studies was considered to be clinically important [27, 28, 31, 34, 36].

The combination of RCTs and comparative observational studies is becoming increasingly common for the evaluation of surgical treatments. Well-designed observational studies are believed to be beneficial complements to the findings of RCTs [53, 54], as they may dilute the selection bias of RCTs produced by the rigorous criteria for selecting participants. In our included observational studies, patients were allocated to a designated treatment group according to specific time sequence, such as the date of treatment. In addition, all of the studies attempted to balance the intervention groups for possibly important prognostic indicators. In general, a meta-analysis based on observational studies results in low-quality evidence [55]. However, in the present analysis, we mainly obtained moderate-quality evidence after comprehensively evaluating the level of evidence.

We also found that the outcomes from RCTs may differ from those in observational studies, which could be a result of different study designs and bias. One of the RCTs was rated as having a “high risk of bias” because it met <6 of the 12 CBRG criteria and had serious flaws. However, a sensitivity analysis with removal of the low-quality study by Cheng et al. [28] showed a similar result, indicating that the heterogeneity was not from the low quality of methodology. A subanalysis found that surgical procedure may be the source of the risk of bias and heterogeneity for special outcomes [27–29]; however, it is possible that the intrinsic nature of different fusions is the source of the discrepancy. Therefore, the results of this study should be interpreted with caution, especially for reoperation rate, blood loss, and operative time.

This study had several limitations. According to our search results and inclusion criteria, four RCTs and five observational studies were included in this meta-analysis. In addition, the number of studies for each of the outcomes varied from one plot to another. The small number of included trials and incomplete data decreased the quality of evidence and the power of the subgroup analyses. There was also potential publication bias, because we only retrieved published literature in peer-reviewed journals. However, we were not able to evaluate this because of the limitation of the included studies. Blinding was also not completely performed in the four RCTs, though randomization and allocation concealment were performed in two trials [14, 27]. Inadequate blinding has been reported to produce a 15 % overestimation of treatment effects [56]. In addition, the indications for surgery and the actual type of surgical procedure varied in the two fusion methods, which may be more obvious in observational studies. Inconsistency between studies increased the risk of selection bias. In this analysis, eight included studies contained <50 subjects in the smallest group [27, 29–32, 35, 36, 57]. Studies with a small sample size can increase heterogeneity and bias. Therefore, the pooled ORs should be treated with caution. A greater number of well-designed RCTs will help provide much stronger evidence for clinical decision-making in the future.

Conclusion

PLIF had advantages of a reduction of postoperative pain as well as improvement of patient satisfaction and function compared to PLF. In addition, a PLIF can increase the fusion rate and decrease the reoperation rate. We identified low-quality evidence showing no significant difference between the two fusion procedures with regard to blood loss and operating time.