Introduction

With more than 400,000 surgical procedures performed annually in the United States alone,106 spinal fusion is an increasingly common procedure used to treat a variety of spinal pathologies including those arising from trauma, degenerative diseases, deformity, infection and tumors. The mainstay of bone graft options in spinal fusion has been autogenic bone from either the iliac crest or local bone such as the laminae or spinous processes.32,54,76 However, autograft bone is often limited in supply and its use has been found to be associated with donor site morbidity.16,31 Furthermore, even when autogenic bone grafts are utilized, non-union, otherwise known as pseudarthrosis, which complicates clinical outcomes and results in significantly higher health care costs, is reported to occur at a rate of around 10%.56 There has thus been much research into bone graft extenders and substitutes as well as investigation of various therapies for improving fusion outcomes.10,29,69,96,131

To prevent pseudarthrosis, a wide variety of treatment options have been explored, including localized and systemic delivery of osteogenic growth factors and osteoporosis therapies such as bone morphogenetic protein-2 (BMP-2)130 and -7 (BMP-7),90 platelet-derived growth factor,29,97 Nel-like protein 1,69 parathyroid hormone,74 bisphosphonates,95 and selective estrogen receptor modulators.101,123 A range of bone graft substitute and extension materials have also been investigated, including demineralized bone matrices,61,124 collagen- and/or calcium phosphate-based scaffolds,10 and osteogenic stem-cell therapy.131 These treatments were originally assessed in animal models, but many have transitioned to use in clinical practice. For instance, the use of mesenchymal stem cells (MSCs) in an animal spinal fusion model was first tested in 1999 in a posterolateral rabbit model by Curylo et al.,20 and was subsequently translated into the use of bone marrow aspirate in human spinal fusion surgery, combined with several different types of bone graft options such as iliac crest autograft, allograft, and demineralized bone matrices.4 More recently, a wide variety of cellular bone matrices, in which allogenic MSCs, osteogenic growth factors, and human cadaveric bone are preserved within 72 h of death, were investigated in an animal model119 and then applied to one- and two-level anterior cervical discectomy and fusion in patients with some promising results.7,82 Hence, the establishment of proper animal models is critical in order to adequately assess the efficacy of novel treatment options at the preclinical level and then smoothly translate them into clinical practice.

Historically, the rabbit posterolateral spinal fusion model with autologous iliac crest bone graft has been the most common and well-established model, due to its similarities to humans in both fusion rates and spinal anatomy.35,74,107 Alternatively, larger animals such as goats, sheep, pigs, and dogs have been utilized for interbody fusion models,36,116 considering their relative immobility in comparison to smaller animals and their closer similarity in size to humans. However, an increasing number of studies on rat posterolateral spinal fusion using autogenic/allogenic bone graft control groups (rat PFABG model) have been published recently. The rat model offers several significant advantages as an experimental model in comparison to other species. First, operation times are shorter and costs are lower.25,127 Smaller animal size also reduces the amount of bone graft substitute materials and/or stem cells required for a given fusion study, thus resulting in additional cost reduction.25,111 Furthermore, rodent models may allow for more detailed study of the mechanisms underlying spinal fusion, as a wider array of cellular and molecular biology tools and reagents are available for rats than for larger animals. For instance, Geuze et al.33 and Lina et al.73 reported that, by transplanting MSCs harvested from luciferase-positive rats or mice into luciferase-negative rats or mice after spinal fusion and tracking luciferase bioluminescence signals at fusion sites, mechanisms relevant to spinal fusion in humans, such as the survival period of donor cells, the immunomodulatory effect of MSCs, and osteoinductive growth factors, could be investigated.

Although approximately 100 articles on rat posterolateral spinal fusion models have been published since 1992,1,5,6,8,9,11–15,17–19,21–24,26,33,34,37–42,45–53,55,57–60,62–65,67–72,75,77,79–81,83,85–94,97,98,100–105,108,109,114,117,120–123,126,129,130,132–134,136,137 the reliability of the rat PFABG model, defined as the capability of reproducing similar fusion rates across studies so that investigators can reliably utilize it as a control group for certain interventions, has not been thoroughly reviewed. It is already known that subtle differences in study design or in the animals themselves, such as the timing of assessment or the animal strain, can frequently affect the reliability of control groups.35,107

Therefore, the objective of this meta-analysis was to investigate the reliability of the rat PFABG control group, as well as to analyze the impact of variables of interest such as rat age and weight, graft type, and graft volume on fusion outcomes, in order to optimize the model for future studies.

Materials and Methods

Literature Search

A literature search for this meta-analysis was performed, based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines,115 using the terms [“Spinal (or spine) fusion rat(s)”] OR [“rat(s) spinal (or spine) arthrodesis”] OR [“posterolateral fusion rat(s)”] OR [“transverse process fusion rat(s)”]. This web-based literature search was conducted using PubMed, Embase, and Web of Science and covered the period from January 1970 to September 2015. The full text of all articles in which posterolateral spinal fusion was performed in the rat model, irrespective of the graft material employed, was obtained, and related articles were additionally collected from the reference lists of those manuscripts. In total, 90 articles1,5,6,8,9,11–15,17–19,21–24,26,33,34,37–42,45–53,55,57–60,62–65,67–72,75,77,79–81,83,85–94,97,98,100–105,108,109,114,117,120–123,126,129,130,132–134,136,137 were reviewed by two independent researchers in our institution.

Inclusion and Exclusion Criteria

The inclusion criteria for this meta-analysis were as follows: (1) randomized controlled trials which included at least one rat PFABG model group, which was defined as a control group, (2) fusion outcomes based on manual palpation were reported, (3) fusion on manual palpation was defined as no motion between the vertebrae of the operated levels, and (4) at least 6 out of 10 variables of interest including age, weight, sex, and strain of rat, graft volume, graft type, decorticated levels, surgical approach, handling, and timing of assessment were specifically identified in the “Materials and Methods” or “Results” sections. Based on these inclusion criteria, 93 articles were further narrowed down to 26 articles,1,9,14,17,22,34,37,46,52,58,64,67,71,100–102,104,117,121,123,129,130,132,134,136,137 which included 40 control groups. The algorithm for this web-based search is shown in Fig. 1. Lastly, Retraction Watch (http://retractionwatch.com/) was utilized to confirm that each study had not been retracted and remained valid in the literature. If a study reported more than one control group, each control group was regarded as independent. One of the control groups in the study by Wang et al.,129 which included only one rat, was excluded in order to avoid substantial bias in this meta-analysis.

Figure 1

Literature search results and screening process, based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

Data Collection

Each of the 26 articles was reviewed and the data regarding institutions, number of control rats, fusion rate, methods of fusion assessment, age, weight, sex, and strain of rats, graft volume, graft type, decorticated levels, surgical approach, handling, and timing of assessment were collected.

Statistical Analyses

Statistical analyses were performed using Comprehensive Meta-Analysis Software version 3 (Biostat, New Jersey, USA). Initially, a meta-analysis on 40 control groups was performed to obtain the overall pooled fusion rate. Next, the fusion rates, stratified by each variable, were further analyzed. A fixed effects model was used and the heterogeneity of the meta-analysis was calculated using the Q and I² statistics.43 Whereas the Q statistic tests the null hypothesis that all of the included studies share a common event rate, the I² statistic corresponds to the percentage of observed variance between studies that is considered to be due to true differences in event rates. I² values less than 25% were considered as low heterogeneity, 25–75% as moderate heterogeneity, and more than 75% as severe heterogeneity.43
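The pooling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Comprehensive Meta-Analysis implementation: the function names, the 0.5 continuity correction for groups with 0% or 100% fusion, and the example counts are all hypothetical and are not data from the included studies.

```python
import math

def logit_and_variance(events, total):
    """Logit event rate and its approximate variance for one control group."""
    non_events = total - events
    if events == 0 or non_events == 0:
        # Assumed 0.5 continuity correction so the logit stays finite.
        events += 0.5
        non_events += 0.5
    return math.log(events / non_events), 1.0 / events + 1.0 / non_events

def fixed_effect_pool(groups):
    """Inverse-variance (fixed-effect) pooling of (fused, total) pairs.

    Returns the pooled fusion rate, Cochran's Q, and I-squared in percent.
    """
    effects = [logit_and_variance(e, n) for e, n in groups]
    weights = [1.0 / var for _, var in effects]
    pooled_logit = sum(w * y for w, (y, _) in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled_logit) ** 2 for w, (y, _) in zip(weights, effects))
    df = len(groups) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    pooled_rate = 1.0 / (1.0 + math.exp(-pooled_logit))
    return pooled_rate, q, i2

# Hypothetical control arms: (rats fused, rats assessed)
rate, q, i2 = fixed_effect_pool([(3, 10), (5, 10), (8, 10)])
print(f"pooled rate = {rate:.3f}, Q = {q:.2f}, I2 = {i2:.1f}%")
```

Under the thresholds given in the text, an I² below 25% from such a calculation would be read as low heterogeneity and one above 75% as severe.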

For variables which potentially affected fusion outcomes, meta-regression analysis was subsequently conducted to clarify the source of heterogeneity between studies. In this regression analysis, the effect size was calculated based on the logit event rate [= log (fusion rate/(1 − fusion rate))]. R² and p values, which test the null hypothesis that the coefficient of each regression analysis is equal to zero, were calculated. A multi-regression model was established so that the fusion rates of control groups were estimated using the variables of interest. The R² value of this multi-regression model was also calculated and the estimated fusion rate of each study using this model was compared with the actual fusion rate.
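As a rough sketch of what a single-covariate meta-regression on the logit event rate involves, the following Python function fits an inverse-variance weighted line and tests the slope against zero. It assumes a fixed-effect model with known within-study variances; the function name and the example values are hypothetical, and the commercial software used in the study may differ in detail.

```python
import math
from statistics import NormalDist

def meta_regress(x, y, var):
    """Fixed-effect meta-regression of effect sizes y on one covariate x.

    x:   covariate per control group (e.g., assessment time in weeks)
    y:   logit event rate per control group
    var: within-study variance of each logit event rate
    Returns (intercept, slope, standard error of slope, two-sided p value).
    """
    w = [1.0 / v for v in var]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # With known variances, the slope's standard error depends only on the weights.
    se_slope = math.sqrt(1.0 / sxx)
    z = slope / se_slope
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return intercept, slope, se_slope, p_value

# Hypothetical data: three control arms assessed at 4, 6, and 8 weeks
b0, b1, se, p = meta_regress([4, 6, 8], [-1.0, 0.0, 1.0], [0.4, 0.4, 0.4])
print(f"intercept = {b0:.2f}, slope = {b1:.3f}, SE = {se:.3f}, p = {p:.3f}")
```

The p value here plays the role described in the text: it tests the null hypothesis that the regression coefficient equals zero.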

In an inter-institution reliability analysis, the event rate with upper and lower 95% confidence intervals was calculated and described using forest plots, and heterogeneity was analyzed using the Q and I² statistics. Furthermore, a sub-analysis of the particular control subgroups was performed to demonstrate that heterogeneity could be eliminated by extraction based on the multi-regression model.

Finally, publication bias was assessed by drawing a funnel plot for the 40 control groups. The classic fail-safe N test was conducted and p values, which tested the null hypothesis that there was no publication bias and the funnel plot was symmetrical, were obtained.99 Additionally, Duval and Tweedie’s trim and fill test27,28 was performed to estimate the number of potentially missing studies and the adjusted fusion rate. All reported p values were two-sided and p < 0.05 was considered to be statistically significant in all cases.

Results

Meta-analysis of Overall Fusion Rate and Its Heterogeneity

None of the 26 identified studies, which included 40 control groups, had been retracted through May 2016. The median number of rats in each control arm was 10 (range: 4–27) and the number of rats in all control groups was 449. The fusion rate of each study varied widely, from 0 to 96%, as assessed by manual palpation. The meta-analysis revealed that the calculated overall fusion rate was 46.1% with an I² value of 62.4, which suggested moderate heterogeneity (Table 1). To scrutinize the reasons for this heterogeneity, a meta-analysis on fusion rate and a subsequent meta-regression analysis were performed for each variable.

Table 1 Summary of meta-analyses.

Meta-analyses and Meta-regression Analyses Stratified by Each Variable

Timing of Assessment

The timing of fusion outcome assessment via manual palpation ranged from 2 to 12 weeks with a median of 6 weeks. Meta-regression analysis on timing of assessment as a continuous value (Supplementary Fig. 1) demonstrated that it was significantly correlated with fusion outcomes (p = 0.015). Assessment at time-points ≥8 weeks led to higher fusion rates, with a trend towards statistical significance in meta-regression analysis (p = 0.070), but moderate heterogeneity was not altered by this stratification (Table 1). Analyzing timing as a categorical variable with other cut-off values, for example, stratifying the control groups into 4 groups using week 4, week 6, or week 8, also did not result in reduced heterogeneity.

Weight and Age

Animal weight ranged from 190 to 500 g with a median of 275 g. Animal age ranged from 3 to 24 weeks with a median of 12 weeks. In meta-regression analysis, animal weight >300 g and animal age >14 weeks resulted in significantly higher fusion rates (61.2 vs. 36.4%, p = 0.002; and 69.6 vs. 43.2%, p = 0.035, respectively). Following stratification by weight and age, moderate heterogeneity remained unchanged, except in rats older than 14 weeks (I² < 0.001) (Table 1). Weight and age, when treated as continuous values (Supplementary Figs. 2, 3), were found to be related to the fusion rate (p = 0.029 and p = 0.042, respectively).

Sex and Strain

Fifteen control arms (128 rats (28.5%)) which used rat strains other than Sprague–Dawley were identified and consisted of nine arms with athymic rats, four arms with Lewis rats, one arm with Fischer rats, and one arm with Wistar rats. None of the categorical sub-groups stratified by sex or strain demonstrated a reduction in heterogeneity (Table 1). As categorical values, sex and strain (Supplementary Figs. 4, 5) were also shown to have significant influence on heterogeneity (male 56.6% > female 41.2%: p = 0.033, Sprague–Dawley rat 53.2% > other strains 24.2%: p < 0.001).

Surgical Procedures and Graft Characteristics

The use of autogenic coccyx grafts was significantly associated with fusion rate improvement in meta-regression analysis. However, none of the graft type sub-groups resulted in improved heterogeneity (Table 1). The proportion of graft weight to body weight varied from 0.0004 to 0.0057 with a median of 0.0016. The proportion of graft weight to body weight as a continuous variable (p = 0.354), or as a categorical variable (with any of the cut-off values), did not affect the fusion outcome with statistical significance on meta-regression analysis. The graft weight ranged from 0.1 to 2.0 g with a median of 0.4 g. Graft weight itself as a continuous variable (p = 0.854), or as a categorical variable (with any of the cut-off values), did not significantly affect the fusion outcomes. Operated levels and surgical approach were not analyzed because only three studies specifically identified a procedure other than single level (L4–L5) fusion via a paraspinal approach.

Other Potential Confounding Factors Leading to Heterogeneity

The publication year influenced the fusion outcomes with marginal statistical significance in meta-regression analysis (publication year ≥2010 vs. <2010, p = 0.052). It was impossible to assess language bias in this study, since only one non-English manuscript was included. Furthermore, since only a small portion of the 26 included studies (five using plain radiography, two utilizing computed tomography, and two employing histology) specifically defined fusion criteria based on other modalities, it was infeasible to compare manual palpation with other fusion assessment methods.

Multi-regression Analysis

Although the fusion rate was influenced by several factors, heterogeneity across the studies remained almost the same by univariate stratification, thus warranting subsequent multi-regression analysis. A multi-regression model was used to assess the best suited model which enabled the prediction of fusion rates of particular rat control groups using the variables discussed above. This regression model was also used to identify a homogeneous control group from among the 26 included studies. Based on the analysis (Table 2), one example of such a regression model is as follows:

$$ \text{Logit event rate} = -2.88 + 0.186 \times \text{timing of assessment (weeks)} + 0.118 \times \text{age of rats (weeks)} + 1.16 \times \text{sex (male = 1, female = 0)}, $$

where the logit event rate equals the natural logarithm of [fusion rate/(1 − fusion rate)].
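To show how the regression equation translates into a predicted fusion rate, the short Python sketch below evaluates the logit and applies the inverse-logit transform. The coefficients are taken from the equation in the text; the function name and the example inputs (12-week-old rats) are illustrative assumptions only.

```python
import math

# Coefficients as reported in the regression equation above
INTERCEPT = -2.88
COEF_TIMING_WEEKS = 0.186
COEF_AGE_WEEKS = 0.118
COEF_MALE = 1.16

def predicted_fusion_rate(timing_weeks, age_weeks, male):
    """Estimated fusion rate (0 to 1) for a PFABG control group."""
    logit = (INTERCEPT
             + COEF_TIMING_WEEKS * timing_weeks
             + COEF_AGE_WEEKS * age_weeks
             + COEF_MALE * (1 if male else 0))
    # Inverse of logit = ln(rate / (1 - rate))
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative inputs, not taken from any single included study
female_6wk = predicted_fusion_rate(timing_weeks=6, age_weeks=12, male=False)
male_8wk = predicted_fusion_rate(timing_weeks=8, age_weeks=12, male=True)
print(f"female, 12 weeks old, assessed at 6 weeks: {female_6wk:.1%}")
print(f"male, 12 weeks old, assessed at 8 weeks: {male_8wk:.1%}")
```

For these illustrative inputs the model estimates roughly 41% and 77%, respectively.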

Table 2 Multi-regression analysis on logit event rate.

As shown in Fig. 2, the estimated fusion rates from the regression model correlated with the actual fusion rates (R² = 0.82).

Figure 2

Comparison between the actual fusion rate and the estimated fusion rate based on the multi-regression analysis in Table 2.

Inter-institution Reliability

Inter-institution reliability was assessed by calculating the heterogeneity among all nineteen institutions. The pooled overall fusion rate was 50.8% [45.4, 56.2%] and the meta-analysis revealed statistically significant differences among fusion outcomes at different institutions (I² of 72.8 and Q value of 66.3, p < 0.001) (Fig. 3).

Figure 3

Meta-analysis on inter-institution reliability. CO: Çanakkale Onsekiz Mart University, UCSD: University of California, San Diego, NW: Northwestern University, UMDNJ: University of Medicine and Dentistry of New Jersey, UCLA: University of California, Los Angeles, CG: Chang Gung University, CN: Chonbuk National University, AIB: Abant Izzet Baysal University, HSS: Hospital for Special Surgery.

Publication Bias

A potential publication bias was suggested (Fig. 4) with the p value of the classic fail-safe N test calculated to be 0.055. Duval and Tweedie’s trim and fill test indicated that six studies, which actually would have demonstrated high fusion rates, i.e., 90–100%, might be missing from this meta-analysis due to publication bias (Table 3). The adjusted fusion rate would be 50.0% [40.2, 59.8%].

Figure 4

Funnel plot of standard error by logit event rate.

Table 3 Duval and Tweedie’s trim and fill test.

Positive Control vs. Negative Control

Given the aforementioned suggestion of publication bias, the data were further analyzed by stratifying the 26 studies into two groups: studies using the rat PFABG model as a positive control group, for example, in studies examining factors that can cause pseudarthrosis, or as a negative control group, for example, in studies of augmentative therapies to improve fusion outcomes. The pooled fusion rates were 70.8% for the positive control group and 40.7% for the negative control group. In the positive control group, heterogeneity was moderately improved with an I² value of 35.6 (Fig. 5). Meta-regression analysis revealed the statistically significant impact of this classification on fusion outcomes (R² = 0.28, p < 0.001).

Figure 5

Regression of logit event rate on comparison between positive controls and negative controls.

Optimizing the Rat PFABG Model as a Control Group

Finally, by utilizing the results of the multi-regression model and extracting sub-groups with less heterogeneity, factors were identified for the optimization of the rat PFABG fusion model. A pooled fusion rate of 38.1% (p = 0.42 and I² < 0.001) was found for 10–14-week-old female rats with an assessment time-point of 6 weeks, which is potentially suitable for a negative control group (Fig. 6a). On the other hand, male rats with an assessment time-point longer than 8 weeks had a pooled fusion rate of 72.4% (p = 0.43 and I² < 0.001), which may be a reliable candidate for a positive control group (Fig. 6b).

Figure 6

(a) Meta-analysis on the articles which included 10–14-week-old female rats with assessment time-point of week 6. SD: Sprague–Dawley rat, AR: athymic rat. UCLA: University of California, Los Angeles. (b) Meta-analysis on the articles which included male rats with assessment time-point longer than week 8. UCSD: University of California, San Diego.

Discussion

A wide variety of animal models of spinal fusion are currently available, including rabbits, rats, mice, pigs, sheep, dogs, monkeys, and goats.25 These models allow for evaluation of the efficacy of novel treatment options for spinal fusion or assessment of the negative impact of certain factors on spinal fusion outcomes. For instance, BMP-2 has been investigated in the rat model, rabbit model, goat model, and sheep model,74,80,118,128 and now it is one of the most indispensable treatment options in the field of spine surgery.110 More recently, the use of adipose derived stem cells, which were cultured and then loaded onto commercially-available scaffolds or 3D printed osteogenic scaffolds, have been extensively tested in several preclinical models,77,111 which may warrant future clinical trials. As a growing number of treatment options for spinal fusion are being assessed all over the world, there is a substantial scientific need for establishing reliable animal models, which can adequately compare novel treatments with pre-existing modalities.

This study sought to examine the reliability of the rat PFABG model, as utilization of this model is becoming more ubiquitous, with a rapid increase in the number of studies: from 38 articles from 1992 to 2009 to 55 articles from 2010 to 2015. To the best of our knowledge, this is the first meta-analysis which solely focused on analyzing and optimizing the rat PFABG model. The most important finding in this study is that reported fusion rates in the examined studies varied substantially and were influenced by a variety of factors.

Weight and age of rats had significant impacts on fusion outcomes in this study. This finding is consistent with a recent meta-analysis on rabbit posterolateral fusion outcomes,107 which identified an initial animal weight ≥3 kg as a positive variable. Considering that rat weight generally ranges from 100 to 150 g at 5 weeks, from 200 to 350 g at 10 weeks and peaks at 15 weeks between 250 and 450 g,113 age may be just a surrogate marker for weight (or vice versa). Clinically, age has been associated with increased risk of pseudarthrosis.2 Age has also been known to correlate with decreased bone repair, vascularization and stem cell properties in both pre-clinical animal models and patients. For example, deteriorated vascularization and impaired fracture repair were demonstrated in an elderly mouse model (18-month-old).78 Similarly, age (particularly in people older than 30) was found to inversely correlate with the frequency of bone marrow derived osteoprogenitor cells, as determined via the in vitro colony-forming unit fibroblast assay.66 However, the age of the rats included in our meta-analysis ranged from 3 to 24 weeks with a median of 12 weeks, which, while skeletally mature, is not likely aged enough that these impacts would be significant. That said, the low heterogeneity observed in rats aged more than 14 weeks (I² < 0.001) merits more discussion. All five control groups in the three studies that utilized rats older than 14 weeks sought to examine the validity of ovariectomy to induce osteoporosis and the efficacy of treatment options to enhance fusion formation in osteoporotic rats. We believe that essentially two factors contributed to this low heterogeneity: (a) these five control groups underwent sham ovariectomy (laparotomy only) 4 weeks before spinal fusion surgery and were compared to those with ovariectomy as positive controls and (b) given the positive correlation between age of rats and fusion rates discovered in our analysis (Fig. 2; Table 2), 14 weeks could be the threshold for relatively successful fusion in the rat PFABG model, as observed in this particular group of studies (pooled fusion rate = 69.6%).

Interestingly, the fusion rate of male rats was significantly higher than that of females, which is the opposite of the results from the meta-analysis on the rabbit model107; however, in that study, the difference was statistically insignificant. Male rats are typically heavier than female rats at a given age.3 As studies in mice have demonstrated delayed bone healing in females in comparison to males,84 further study regarding the impact of sex and hormones on fusion outcomes in rat models should be pursued. Of note, two of the included studies (7.7%) did not report either the weight or age of the rats employed, and seven studies (26.9%) did not specify sex. With regard to differences among strains, Sprague–Dawley rats were most commonly used due to their relatively large size and displayed the highest fusion rate with statistical significance in the regression analysis. Differences in strains or genetic variability have been demonstrated to significantly influence the bone healing process in mouse models; for instance, fracture healing was found to proceed more rapidly in C57BL/6 mice than in DBA/2 and C3H mice.44 However, strain-related skeletal differences in rat models have yet to be reported. Further investigation is thus necessary to determine any differences in bone metabolism among different rat strains.

Bone graft volume and graft type are key factors in spinal fusion outcomes. In the rabbit model, it has been shown that sufficient graft volume is critical to achieving successful solid fusion, although the aforementioned meta-regression analysis used an absolute value of graft volume as a variable.107 In our analysis, we assumed that the proportion of graft weight to animal weight more closely reflected human clinical practice; however, this ratio was not found to be a predictive factor. Regarding graft types, autogenic coccyx grafts demonstrated better outcomes compared to autologous iliac crest grafts and allogenic bone grafts (Supplementary Fig. 6). As the use of local bone graft, such as from spinous processes and/or laminae, has been recently assessed as an alternative to autogenic iliac crest graft in human spinal fusion,96,112 further investigation into which graft type is optimal is warranted both in animal models and in actual patient care.

The surgical procedure performed in the majority of studies was an L4–L5 posterolateral fusion via a paraspinal or Wiltse approach,127 and only three of the 26 studies differed (two studies utilized L4–L6 fusion via a midline approach and one study performed L3–L5 fusion via a midline approach). Due to the limited number of multilevel procedures, it was impossible to perform an adequate sub-analysis regarding surgical approaches. Additional intraoperative techniques such as the extent of decortication can significantly affect fusion outcomes125 but are extremely difficult to quantify or qualify. For instance, in the reviewed studies, authors commonly made statements such as “the transverse process was decorticated until punctate bleeding from cancellous bone was observed” but the amount of decortication was not specifically quantified. Furthermore, the depth of decortication must be precise in the rat model, unlike in larger animal models, because the layer of cancellous bone is so thin that it is easy to drill off the entire layer at once. Based on our measurements in an 8-week-old rat, the cortical layer of a transverse process was approximately 0.5 mm thick and the transverse process measured 8 mm (length) × 4 mm (width) × 2 mm (height). Therefore, it is possible that these subtle differences in surgical technique, in addition to several variables previously discussed, contributed to the high variance observed in this model.

In this study, we specifically concentrated on manual palpation as a fusion evaluation method, because it has been well-established and well-described in the literature in a variety of animal models,135 and actual lack of motion on palpation is the gold standard of fusion assessment. Furthermore, only a small fraction of the included studies defined fusion criteria based on other modalities. Therefore, it was impossible to collect and analyze the data when using alternative fusion assessment methods. Notably, in actual clinical patient care, manual palpation with intraoperative exploration is the gold standard, but it is not routinely used unless reoperation is required. Therefore, there is a reliance on radiographical findings to diagnose pseudarthrosis and there is always a possibility of false-positive or false-negative results. In this context, even in animal models, it is highly recommended to define the fusion criteria based on a combination of modalities due to possible discrepancies between manual palpation and other radiographical assessments.

In this meta-analysis, there was a positive correlation between assessment time-point and fusion rates, with 8 weeks found to be a significant threshold for higher fusion rates. Again, this is consistent with the results presented in the recent rabbit model meta-analysis.107 Based on these results, it appears that 8 weeks may be an optimal time-point for sacrifice and fusion assessment that should be pursued in future rat posterolateral fusion studies.

Handling was reported to be relevant to fusion outcomes in the rabbit model,30 and was investigated in this study. Considering the differences in size, the impact of handling on fusion mass may be more substantial in rats than in rabbits. However, in these 26 studies, minimal details regarding handling procedures were provided, although some inference of handling could be made based on the type of treatments being studied. For instance, daily injections of parathyroid hormone would have required at least daily rat handling, but this process of estimation could potentially be a source of bias. In future studies, it would be ideal to report the handling procedures including frequency and injection site.

This meta-analysis was potentially biased by missing studies which would have had higher fusion rates. Conventionally, the rat PFABG model was utilized as a negative control to demonstrate the efficacy of a new treatment. Therefore, if a researcher selected this model and achieved a fusion rate of 95% in the negative control group, it likely would have been challenging to demonstrate statistical significance for benefit of a new treatment. The other possible explanation for these missing studies is that the application of high-dose BMP as a bone graft substitute has replaced the rat PFABG model as a “positive control,” which can reliably achieve fusion rate of 100%.45,57,86,136 For example, BMP treatment has been utilized as a positive control in a variety of compromised rat spinal fusion models, such as the dioxin-induced pseudarthrosis model.45

Among the included control arms in this meta-analysis, 22 PFABG control arms (55%) failed to demonstrate statistical significance in comparison to any of the other treatment arms. This relatively weak statistical power across all studies stems from the heterogeneity and unpredictability of the rat PFABG model. To overcome this limitation, we propose that “10–14-week-old female rats with an assessment time-point of 6 weeks” be used as a negative control group (pooled fusion rate of 38.1%) and “male rats with an assessment time-point longer than 8 weeks” be used as a positive control group (pooled fusion rate of 72.4%). The number of animals needed to yield sufficient statistical power (power = 0.80, two-sided α = 0.05) is summarized in Table 4. Based on this table, 15–30 animals would be needed in an experimental group in order to detect the beneficial or detrimental effect of a certain intervention if its fusion rate differed from that of the control group by more than 40%. To maximize statistical power, the use of bone morphogenetic protein as an absolute positive control and a decortication-only group or periosteal-dissection-only group as an absolute negative control are still options to be considered. Nevertheless, if we aim to compare intervention groups with a “gold-standard” control group, which should mirror actual patient care, the PFABG control group could be prioritized.

Table 4 Summary of the number of animals needed to treat to produce enough statistical power.
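The article does not state which method produced Table 4, so the following Python sketch is only one plausible way to approximate such numbers: a standard normal-approximation sample size formula for comparing two independent proportions. The function name, the defaults (80% power, two-sided significance level of 0.05), and the example treatment-arm rate are assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate animals per group needed to detect p_control vs. p_treatment.

    Uses the normal approximation for two independent proportions with a
    two-sided test; small-sample corrections are not applied.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2.0
    term_null = z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    term_alt = z_beta * math.sqrt(p_control * (1.0 - p_control)
                                  + p_treatment * (1.0 - p_treatment))
    n = (term_null + term_alt) ** 2 / (p_control - p_treatment) ** 2
    return math.ceil(n)

# Proposed negative control (38.1%) vs. a hypothetical treatment arm 40 points higher
n_per_group = sample_size_two_proportions(0.381, 0.781)
print(f"animals per group: {n_per_group}")
```

Under these assumptions the approximation yields about 23 animals per group, which falls within the 15–30 range cited in the text; exact or continuity-corrected methods would give somewhat different values.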

The general limitations of meta-analyses can be applied to our study. For example, the bias inherent in including original articles may have substantially influenced the results of the meta-analysis, as we might have missed some of the articles which were not on PubMed, Embase, or Web of Science. It is also possible that some of the missing information on materials and methods in certain articles biased this study. Since this study focused on the rat model, every included study was a “randomized control trial” and had high evidence levels, thus minimizing bias due to the nature of studies reviewed in our meta-analysis. Another limitation of this study is the fact that it was not possible to eliminate heterogeneity across all of the studies unless particular groups were stratified by the combination of two or three variables. Therefore, it is difficult to define the genuine fusion rate. Thus, it was concluded that the rat PFABG model was relatively unreliable. However, as discussed above, if study design and variables are adequately considered, it can be reliable enough to be selected as a control arm. To maximize the utility of this important model in future studies, the assessment methods, rat characteristics, surgical techniques, and other relevant variables discussed in this article should be clearly documented. This will improve the quality of research, increase the cost-effectiveness of research, and eventually aid in translation of these results to clinical practice.

Due to the heterogeneity across all studies caused by a variety of variables, the reliability of the rat PFABG model was relatively limited. However, selection of adequate variables can optimize its use as a control group in future studies. It is recommended that precise documentation of each variable be provided in future studies in order to more easily compare results among studies. Further research is required to thoroughly establish and optimize this widely-used model.