Introduction

The burden of musculoskeletal (MSK) disorders is substantial, contributing heavily to worldwide disability [1, 2] and healthcare costs [3, 4]. After lower back and neck pain, osteoarthritis (OA) is the next leading cause of MSK disorder burden and the 11th highest contributor to the global burden of disability worldwide [5]. The hip and knee are the most commonly affected joints, with arthroplasty (i.e., surgical reconstruction or replacement of the joint) often recommended as a primary treatment for the OA patient population. Hundreds of thousands of arthroplasty procedures are performed annually [6] and the volume continues to rise, with an estimated increase of > 70% for hip arthroplasty and > 80% for knee arthroplasty over the next 5 to 10 years [7,8,9]. In Australia, arthroplasty utilization is expected to grow by approximately 250% by 2030, leading to more than 8 billion dollars in healthcare costs for Australia alone [10]. It is essential to identify ways to minimize healthcare spending for arthroplasty treatments while providing optimal value-based care for patients.

Research methodologies that include economic evaluation allow comparison of cost and health outcomes simultaneously between different treatments and can help policymakers make informed decisions on the distribution of scarce healthcare resources [11]. Accordingly, guidelines have been established internationally for conducting economic evaluations [12, 13]. Given the substantial volume of arthroplasty procedures performed globally, it is important to evaluate the overall quality of the economic evaluations in the arthroplasty literature.

The quality of economic evaluations in arthroplasty has previously been studied through systematic reviews [14, 15, 16••]. Overall, study quality has been reported as good. However, despite the rich volume of available arthroplasty literature, previous reviews have summarized only small samples of studies (i.e., n < 25) due to restrictive eligibility criteria, including the exclusion of studies evaluating pre- or post-surgical treatment, prophylaxis treatment, or non-elective arthroplasty (e.g., to treat hip fracture), and the exclusion of studies not conducted in the United States (US). All three reviews focused on full economic evaluations only, and the most recent review included studies only up to 2016. Given the recent changes in the delivery of care, with emphasis on outpatient arthroplasty and comparisons of arthroplasty to non-surgical interventions, an updated review is warranted. We also believe it is important to systematically evaluate arthroplasty economic evaluations globally and to include all components of arthroplasty interventions, including studies that evaluate cost independently, to obtain a more comprehensive assessment of economic evaluations in the field.

To meet the criteria of a full economic evaluation, a study must compare two or more treatment alternatives and include both health benefits and costs in the outcome analysis [11]. A full economic evaluation is crucial for decision-makers to make an appropriate assessment of the cost-effectiveness of an intervention. Partial economic evaluations compare costs independently from health outcomes and therefore do not meet the necessary criteria. To date, no study has evaluated the quality of all forms of economic evaluation globally for arthroplasty interventions of the hip and/or knee.

Therefore, the aims of our study were as follows: (1) to summarize and evaluate the reporting of economic evaluations in the arthroplasty literature for hip and knee interventions; (2) to evaluate the quality of published full economic evaluations; and (3) to identify important areas where study quality can be improved.

Materials and methods

Search strategy and eligibility criteria

We completed a systematic search of the literature from inception to March 1, 2020, using four databases (Medline, EMBASE, AMED, and OVID Health Star). We identified articles published in the English language using combined and/or truncated terms that included economic, economic evaluation, cost, cost-utility, cost-minimization, cost-effectiveness, or cost-benefit as well as hip, or knee. We included studies that met the following three criteria: (1) evaluated hip and/or knee arthroplasty interventions; (2) compared two or more interventions; and (3) reported a cost outcome. We included studies that evaluated a variety of areas in arthroplasty including comparisons of different surgical techniques and/or implants, pre- or post-operative rehabilitation techniques, anesthetics and/or medications, intraoperative tools/equipment, and surgical to non-surgical interventions.

Abstract and full-text screening

Three reviewer pairs independently screened titles and abstracts to establish whether each study met the eligibility criteria to be included for a full-text review. We excluded study protocols, conference abstracts, and duplicates. For abstracts that met the criteria (or those where one or both of the reviewers were uncertain of eligibility), we retrieved the full article and further evaluated whether studies fully met the eligibility criteria. We manually searched the reference lists of systematic reviews and meta-analyses pulled from the initial search strategy to identify any other relevant studies that met the criteria. Four independent reviewer pairs then screened the full-text articles for eligibility. Reviewer pairs extracted data from full texts of eligible studies and once completed, met to discuss the data and resolve any conflicts. We asked a third reviewer to help resolve conflicts for articles where the reviewer pair could not reach consensus.

Data abstraction

We used a custom data abstraction form to extract several variables from the full texts including the following:

  • Year of publication

  • Country of origin

  • Interventions studied

  • Type of economic evaluation (i.e., cost-minimization analysis, cost-utility analysis, cost-benefit analysis, cost-effectiveness analysis, cost analysis, or more than one)

  • Study design (i.e., model- or trial-based)

  • Trial design for trial-based studies (i.e., randomized controlled trial, prospective, retrospective) or model design for model-based studies (i.e., Markov model, decision tree, or decision tree and Markov)

  • Summary measure reported (i.e., mean difference, incremental net benefit [INB], incremental cost-effectiveness, or cost-utility ratios [ICER & ICUR])

  • Uncertainty reported

  • Sensitivity analyses completed

  • Perspective of the analysis

Quality assessment

We used the Quality of Health Economic Studies (QHES) tool (Supplemental Table 1) [17, 18] to evaluate the quality of full economic evaluations. The QHES is a valid and reliable tool that uses 16 questions (binary, yes or no) to assess whether studies report the fundamental components required for an economic evaluation of high quality. The questions are scored from 1 to 9 points [17, 18], where a “yes” answer receives all points for the given question and a “no” answer receives no points. The individual question scores are then tallied to obtain a final summary score (0 to 100) with higher scores indicating better overall study quality.
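The tallying logic described above can be sketched in a few lines. Note that the point weights below are hypothetical placeholders chosen only so they sum to 100; they are not the published QHES per-question values:

```python
# Sketch of QHES-style scoring: 16 binary (yes/no) items, each worth a fixed
# number of points; a "yes" earns the item's full weight, a "no" earns zero.
# NOTE: these weights are illustrative placeholders summing to 100 -- they are
# NOT the actual QHES point values.
QHES_WEIGHTS = {
    f"Q{i}": w
    for i, w in enumerate([7, 4, 8, 1, 9, 6, 5, 7, 8, 6, 7, 8, 7, 6, 3, 8], start=1)
}

def qhes_score(answers: dict) -> int:
    """Sum the weights of all items answered 'yes' to give a 0-100 summary score."""
    return sum(QHES_WEIGHTS[q] for q, yes in answers.items() if yes)
```

For example, a study answering “yes” to every item scores 100, and one answering “no” to every item scores 0.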

We established clearly defined criteria for each of the QHES questions at the beginning of the study, made available to all reviewers. As recommended by previous studies [19], we had the same four reviewer pairs independently complete the quality assessments. In addition to the conventional QHES standards, we used the detailed criteria established by Marshall et al. [20] to supplement question scoring. We also used the criteria from Supplemental Table 2 to further clarify questions. We pilot tested the assessment on five randomly selected studies prior to completing all QHES scoring to ensure consistent interpretation of the questions among reviewers. After completing all of the quality assessments, each reviewer pair met to discuss the data and resolve any conflicts. We asked a third reviewer to help resolve conflicts for articles where the reviewer pair could not reach consensus.

Inter-rater agreement

We used Cohen’s Kappa statistic to evaluate agreement for abstract screening between each of the three reviewer pairs. Kappa statistic values can be interpreted as almost perfect agreement (0.81–0.99), substantial agreement (0.61–0.80), moderate agreement (0.41–0.60), fair agreement (0.21–0.40), or slight agreement (0.01–0.20) [21]. We also evaluated agreement for each of the 16 QHES questions among each of the four full-text reviewer pairs. This was done by calculating the percentage of agreements observed (i.e., both said yes or both said no).
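The interpretation bands above translate directly into a small lookup. This is a convenience sketch for readers, not part of the authors' analysis:

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Cohen's Kappa value to the agreement bands cited in the text
    (Landis-and-Koch-style thresholds)."""
    bands = [
        (0.81, "almost perfect"),
        (0.61, "substantial"),
        (0.41, "moderate"),
        (0.21, "fair"),
        (0.01, "slight"),
    ]
    for lower_bound, label in bands:
        if kappa >= lower_bound:
            return label
    return "no agreement"  # kappa at or below zero
```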

Descriptive analyses

We summarized our results using descriptive analyses. We reported the frequency (percentage [%]) of studies by year of publication, design characteristics, and interventions studied. We reported the mean QHES quality score (standard deviation [SD]) for all studies, by year of publication, by intervention, and by geographical location. Finally, we reported the frequency (%) of studies that addressed each of the 16 QHES questions. We also categorized studies for overall quality by quartile using the final QHES scores [22]. The quartiles are classified as high quality (75–100 points), fair quality (50–74 points), poor quality (25–49 points), or extremely poor quality (0–24 points).
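The quartile classification used here reduces to a simple threshold check, sketched below against the Spiegel et al. [22] bands:

```python
def quality_category(qhes_total: float) -> str:
    """Classify a QHES summary score (0-100) into the quartile-based
    quality bands of Spiegel et al. [22]."""
    if qhes_total >= 75:
        return "high"
    if qhes_total >= 50:
        return "fair"
    if qhes_total >= 25:
        return "poor"
    return "extremely poor"
```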

Results

Studies summary

A total of 477 studies met the eligibility criteria following the initial title and abstract screening. After reviewing full texts, we excluded 107 studies that did not fully meet the criteria. Fourteen additional articles that met the inclusion criteria were identified through systematic reviews. A total of 384 studies [23–405] were ultimately included in the systematic review (Fig. 1).

Fig. 1
figure 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) study flow diagram

Table 1 summarizes all of the included studies (n = 384). The table also provides a summary of studies that specifically completed a full economic evaluation (n = 127). Interestingly, several studies (n = 72; 28%) that did not carry out a full economic evaluation (i.e., cost analyses—only costs compared directly) drew conclusions concerning cost-effectiveness. Overall, the frequency of studies published increased over time between the years of 1980 and 2020 (Fig. 2). Although the number of partial economic evaluations continued to rise over the years, the number of published full economic evaluations peaked between 2010 and 2014 (Fig. 2). Individual summaries for each study are presented in Supplemental Tables 3 and 4.

Table 1 Overall study summary (n = 384), further subdivided with only full economic evaluations (n = 127)
Fig. 2
figure 2

Frequency of published arthroplasty economic evaluations for hip and/or knee interventions (n = 384) over the years. The studies are divided by economic evaluation type. Partial economic evaluations (n = 257) are presented in yellow, while full economic evaluations (n = 127) are presented in blue

Studies most frequently evaluated surgical techniques/implants (n = 120; 31%) or anesthetics/medications (n = 117; 30%). Studies also evaluated pre- or post-operative rehabilitation techniques (n = 59; 15%), intraoperative tools and/or equipment (n = 65; 17%), or compared surgical interventions to non-surgical interventions (n = 23; 6%).

Inter-rater agreement

For the title and abstract screening, inter-rater agreement was strong. Cohen’s Kappa statistics indicated substantial (0.61 to 0.80) to almost perfect (0.81 to 0.99) agreement between each of the three reviewer pairs (0.81, 0.76, and 0.63) [21].

For the quality assessment, the average inter-rater agreement across all 16 questions was also strong for each of the four reviewer pairs (95%, 89%, 89%, and 88%). Agreement was particularly strong for 11 of the 16 questions of the QHES (Q1, Q3–9, Q13 and Q15–16) with agreement exceeding 80% in all four pairings (Supplemental Table 5).

Quality assessment

We assessed the quality of 127 full economic evaluations. The mean QHES was 83.5 (SD = 17.8). According to Spiegel et al.’s [22] quartiles, 96 studies (76%) were considered high quality, while fewer studies were considered fair (n = 22; 17%) and poor (n = 9; 7%) (Supplemental Fig. 1). The overall quality of studies did not show a trend over time (Fig. 3).

Fig. 3
figure 3

The Quality of Health Economics (QHES) total score over time (years) for all full economic evaluations (n = 127)

Studies conducted in North America, the UK or Europe were generally considered of high quality, while the pooled estimate of the remaining countries was considered fair overall (Table 2). Most full economic evaluations compared surgical techniques/implants or anesthetics/medications and the mean quality of these study types was high (Table 2). The mean study quality was also high for studies that compared pre- or post-operative rehabilitation techniques and those that compared surgical interventions to non-surgical interventions (Table 2), despite a smaller volume of published studies. Studies that evaluated intraoperative tools and/or equipment were considered fair quality (Table 2) with a smaller volume of published studies as well.

Table 2 The mean Quality of Health Economics (QHES) total score by geographical location and by study intervention group (n = 127)

QHES questions

Thirteen of the 16 QHES tool questions were addressed frequently (i.e. over 80% of the time) across all full economic evaluations: Q1, Q3–8, and Q10–15 (Fig. 4). Questions that were addressed less frequently were related to whether authors appropriately stated and justified the costing perspective (Q2), reported costing methodology (Q9), and whether a funding statement was provided (Q16).

Fig. 4
figure 4

Percentage of arthroplasty full economic evaluations addressing each of the 16 Quality of Health Economics (QHES) tool questions (n = 127)

Discussion

We summarized data from 384 economic evaluations that assessed arthroplasty interventions of the hip and/or knee. According to international health economic guidelines, clinical and policy decisions should be made based on evidence that evaluates the cost and effect of treatment alternatives simultaneously (i.e., full economic evaluations) [12, 13]. Although there was a large volume of studies identified in our review, approximately two thirds (67%) did not complete a full economic evaluation. Many of these studies (n = 72; 28%) also made conclusions regarding the cost-effectiveness of interventions despite not conducting a full economic evaluation. This is particularly concerning as several of these studies are highly cited. For example, one study (cited 245 times) concluded that tranexamic acid was a cost-effective intervention compared to placebo; however, cost and health outcomes were evaluated independently [193]. To avoid potential confusion and misinterpretation of study findings, authors should avoid describing an intervention as “cost-effective” when only a partial economic evaluation was performed. There was an increasing trend in the number of partial economic evaluations published over the years, while the number of published full economic evaluations peaked between 2010 and 2014. Our results, therefore, highlight the dire need for more high-quality full economic evaluations in the field of arthroplasty to assist with informed policy decision-making.

Nevertheless, evaluation of the identified full economic evaluations (n = 127) showed that overall, studies were typically considered high quality (76%). Over the last 3 years, several studies have even scored a perfect 100 on the QHES [36••, 134••, 141••, 398••] and provide great examples for researchers to reference when designing studies and drafting manuscripts or for critical appraisal of studies. Few studies were considered fair quality (17%) or poor quality (7%) [22]. The overall mean QHES score was 83.5 ± 18 with no observable trend in study quality over time (Fig. 3). Although the volume of published studies is small, these results are encouraging for the future of health economic research in arthroplasty. Studies conducted in North America, the UK, and European countries accounted for 92% of studies and generally scored high on the QHES (Table 2), suggesting economic evaluation methodologies are likely well-established in these countries. Comparatively, studies conducted in other countries were considered fair, on average (Table 2). When evaluating quality by intervention type, studies that evaluated anesthetics and/or medications, pre- or post-operative rehabilitation, or that compared surgical to non-surgical treatment scored very well on the QHES. The lowest scores were shown in studies evaluating surgical techniques/implants, as well as studies evaluating intraoperative tools and equipment. These results highlight the importance of ensuring arthroplasty surgeons follow international guidelines for reporting economic evaluations to help improve the overall quality of studies in these areas.

Previous systematic reviews evaluating full economic evaluations for arthroplasty interventions have also reported good overall study quality. Daigle et al. [14] reported a mean study quality of 5.8 out of 7 (n = 13) using a self-designed scoring tool. Nwachukwu et al. [16••] reported a mean study quality of 86.4 (range, 63–100; n = 23) in US-based studies while also using the QHES tool. Although not the primary goal of their paper, Kamaruzaman et al. [15•] reported individual QHES scores for their review (n = 23) and concluded study quality to be moderate overall. However, there are important differences between the previous reviews and the present study. We included studies globally, which enabled a much larger study sample of full economic evaluations to be included (n = 127). We also assessed all components related to arthroplasty intervention such as prophylaxis, pre- or post-operative rehabilitation, and intraoperative tools and equipment. Importantly, we evaluated both full and partial economic evaluations to get a more comprehensive understanding of the arthroplasty literature and potential pitfalls for readers in the interpretation of study results. When compared with other fields of health, the quality of full economic evaluations in the field of arthroplasty is high (i.e., QHES ≥ 75). Studies evaluating ischemic heart disease [406] and physical therapy [407] have also reported overall good study quality, while other fields of health have reported poor to fair (QHES = 25 to 74) study quality, including the literature for digestive diseases [22], nursing [20], and oncology [408].

Through the QHES tool, we were able to identify important areas to improve overall study quality in the arthroplasty literature. Although 13 of the 16 QHES questions were appropriately addressed by studies, Q2, Q9, and Q16 were addressed less frequently (Fig. 4). The most problematic QHES question was Q2 which addresses whether a study states and justifies the costing perspective of the analysis. A mere 65% of studies addressed the question in this review. Transparency of the costing perspective is crucial to ensure that all of the necessary costs are being included in the analysis for answering the given question. Moreover, Primeau et al. [409] showed the cost-effectiveness of treatments can be dependent on the perspective analyzed. Their results showed that evaluating from a healthcare payer perspective (includes direct costs chargeable to the public payer) compared to a societal perspective (includes direct costs and indirect costs such as time away from work) provided contradictory results on treatment cost-effectiveness. Therefore, the perspective of the analysis is incredibly important for the appropriate interpretation of the study results in making health policy decisions. Guidelines now recommend that economic evaluations be conducted from multiple perspectives (e.g., healthcare payer and societal) to broaden the interpretability of study results. Interestingly, only 10 of the 127 included studies (8%) evaluated their data from multiple perspectives. Many studies evaluated their data from a hospital perspective (i.e., surgical and inpatient costs), which is considered a narrower viewpoint when compared with other perspectives.

The other two questions that were less frequently addressed related to whether appropriate costing methodology was used (Q9) and whether a funding statement was provided (Q16). Authors should be transparent in reporting costing sources and providing specific unit costs, as costs can be quite variable from one healthcare system to another (e.g., private vs. public systems). Funding statements are also important for disclosing any conflicts of interest that could potentially bias the results of the study. For example, it is important to consider the funding source of a study that concludes a knee implant is cost-effective when it was funded by the manufacturer of the implant. On another note, although most studies scored well on Q5 (87%), which addresses whether authors handled uncertainty, we awarded points for the question if the study either (1) accounted for uncertainty through the use of bootstrapping, cost-effectiveness acceptability curves, cost-effectiveness planes, and/or 95% confidence intervals or (2) performed sensitivity analyses to cover a range of assumptions. Most studies completed sensitivity analyses (88%); however, 55 of the 127 studies (43%) did not account for statistical uncertainty (Table 1). All possible levels of uncertainty should be accounted for in a full economic evaluation. As a general recommendation, investigators should follow guidelines for conducting economic evaluations when designing studies [12, 13] to improve study quality. It may be more difficult for reviewers to use full guidelines when critically appraising studies; however, the use of more user-friendly tools such as the QHES is highly encouraged.

Lastly, average cost-effectiveness ratios (ACER) and incremental cost-effectiveness ratios (ICER) are often used interchangeably in economic evaluations. Although both provide summary measures of the relationship between cost and effect, they answer very different questions. The important difference between the two is the ACER provides an overall estimate for each treatment (ACER = Cost/Effect), while the ICER provides an incremental estimate between treatments (i.e., ICER = ΔCost/ΔEffect). The ACER summary measure can be misleading as it distributes the difference in cost across all participants and assumes all health effects are produced at an equivalent cost [410]. Current guidelines recommend the use of the ICER as the summary measure [12, 13]. However, 11 of the 127 studies (9%) in this review reported the ACER in their study results while making inferences on cost-effectiveness. Another problematic finding is that very few studies (n = 12, 9%, full economic evaluations; n = 23, 6%, all studies overall) compared surgical to non-surgical interventions and even fewer included indirect costs in their analysis. It is particularly important to complete full economic evaluations that consider the impact of both direct (e.g., procedure and treatment costs) and indirect costs (e.g., loss in productivity or time away from work) to compare surgical and non-surgical interventions to make informed clinical decisions concerning cost-effectiveness.
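The ACER/ICER distinction above is easiest to see with numbers. The costs and QALY values below are invented purely for illustration:

```python
def acer(cost: float, effect: float) -> float:
    """Average cost-effectiveness ratio: total cost per unit of effect
    for a SINGLE treatment arm (Cost / Effect)."""
    return cost / effect

def icer(cost_new: float, effect_new: float,
         cost_old: float, effect_old: float) -> float:
    """Incremental cost-effectiveness ratio: extra cost per extra unit
    of effect BETWEEN arms (delta-Cost / delta-Effect)."""
    return (cost_new - cost_old) / (effect_new - effect_old)

# Hypothetical comparison: a new treatment costs $12,000 and yields 1.25 QALYs;
# usual care costs $8,000 and yields 1.00 QALY.
# ACERs: 12000 / 1.25 = $9,600/QALY (new) vs 8000 / 1.00 = $8,000/QALY (old).
# ICER: (12000 - 8000) / (1.25 - 1.00) = $16,000 per ADDITIONAL QALY --
# a substantially different figure for a decision-maker weighing the switch.
```

The two arms' ACERs look similar here, yet the ICER shows each additional QALY from the new treatment costs roughly twice the old arm's average, which is exactly why guidelines prefer the incremental measure.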

Indeed, there are limitations to the present study. First, the QHES tool evaluates internal validity, not the generalizability of results (i.e., external validity). Therefore, studies that score well on the QHES may not necessarily provide study results that are clinically relevant or applicable. Policy decision-makers will also need to take this into consideration when assessing studies. Second, there are some restrictions associated with using the QHES tool to assess quality. There is a possibility that studies with great research methodology could have lost points for questions if the explanations that were provided lacked clarity or if specific information was omitted. Finally, there is a degree of subjectivity with using the QHES tool for quality scoring. To account for this, we had reviewer pairs follow additional criteria established by Marshall et al. [20] to supplement their scoring of the QHES and we also provided further clarifications to the question criteria (Supplemental Table 2). Additionally, we had all reviewers pilot test five studies with the QHES tool prior to evaluating all of the included articles to ensure measurement consistency among reviewer pairs. Accordingly, we showed excellent inter-rater agreement in our assessment of study quality (Supplemental Table 5).

Optimization of value-based care and healthcare resource allocation for hip and knee arthroplasty depends heavily on the quality of available literature. Generally, arthroplasty study quality is high; however, the volume of published full economic evaluations is still quite small when compared to studies that evaluate cost independently. Full economic evaluations are required to infer appropriate conclusions concerning cost-effectiveness, yet only 33% of studies from this review met the necessary criteria. There is a dire need for more full economic evaluations in the field of arthroplasty. Research consumers also need to be aware of common pitfalls in the literature including the inappropriate use of the term “cost-effective” in partial economic evaluations as several studies have been cited over 100 times. When designing studies, investigators need to prioritize strong research methodology and follow available criteria to improve study design and quality including guidelines for economic evaluation (e.g., Panel on Cost-effectiveness in Health and Medicine) or scoring instruments such as the QHES tool.