Introduction

Patient-reported outcome measures (PROMs) are used to measure improvements in pain, function, and quality of life after total knee arthroplasty (TKA) [10]. Given the trend towards more patient-centric and value-based care, PROMs, especially when interpreted in the context of the minimal clinically important difference (MCID), are increasingly recognized as key measures of success after TKA [2]. Being able to predict a patient’s likelihood of MCID attainment could facilitate more personalized preoperative counselling and enable more informed decision-making [6].

Using receiver operating characteristic (ROC) analysis, preoperative PROMs have been used to predict MCID attainment, since patients with poorer preoperative PROMs have been observed to experience greater improvement [2, 3]. Berliner et al. found that preoperative PROM thresholds (KOOS: 58, SF-12 PCS: 34) could predict MCID attainment for TKA reasonably well, achieving an area under the ROC curve (AUC) of 0.76 for KOOS and 0.65 for SF-12 PCS [2].

In contrast, machine learning (ML) is a relatively new method of data analysis which uses computer algorithms capable of learning and improving through experience [22]. In simple terms, ML algorithms work by recognizing patterns within real-world data and then using these insights to generate predictions without being explicitly programmed [11]. Recent studies have also demonstrated the feasibility of using ML algorithms to predict postoperative MCID attainment after TKA, achieving AUCs of 0.60–0.89 for outcomes such as the Knee injury and Osteoarthritis Outcome Score (KOOS), Short Form-36 (SF-36) and Oxford Knee Score (OKS) [9, 12, 14].

Although both methods are fairly accurate in predicting MCID attainment, there has been no prior study directly comparing their performance on the same dataset. As such, it is unclear whether ML algorithms offer a significant advantage compared to the use of preoperative PROMs in predicting MCID attainment.

Given the growing interest in identifying accurate and accessible prediction tools that can improve preoperative decision-making, this study aims to evaluate and compare the performance of ML algorithms and preoperative PROM thresholds in predicting MCID attainment for the SF-36 PCS, SF-36 MCS and WOMAC at 2 years after TKA. It is hypothesized that ML algorithms will outperform preoperative PROM thresholds in predicting MCID attainment after TKA.

Materials and methods

Data

Using prospectively collected data from a single institution’s joint replacement registry, 3537 adult patients who had undergone elective primary TKA between 2008 and 2018 were identified, of whom 3021 (85.4%) completed their postoperative 2-year follow-up. Only patients with unilateral TKA due to primary osteoarthritis were included; patients with simultaneous bilateral TKA (n = 138), inflammatory arthritis (n = 36), post-traumatic arthritis (n = 3), malignancy (n = 2) and avascular necrosis (n = 2) were excluded. Data were subsequently collected for the remaining 2840 patients. Ethics approval from the institutional review board (DSRB: 2020/00613) was obtained prior to study initiation, with a waiver of informed consent due to the use of deidentified data.

Outcomes

The two PROMs used in this study are the Short Form-36 (SF-36) and Western Ontario and McMaster University Osteoarthritis Index (WOMAC) at 2 years postoperatively. The SF-36 is a generic health questionnaire that measures an individual’s health-related quality of life, with its 8 domains aggregated into the physical component summary (PCS) and mental component summary (MCS) [15, 25]. In comparison, WOMAC is a disease-specific questionnaire for lower limb osteoarthritis and has 3 main dimensions of pain, stiffness and function [26, 27].

Published anchor-based MCID values were used as they take into account the patient’s perspective and are more likely to represent subjective meaningful change [16, 17]. MCID values of 10.0 points for SF-36 PCS, 5.0 for SF-36 MCS and 15.0 for WOMAC were adopted based on work by Escobar et al. and Quintana et al. [8, 21].
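The outcome definition above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code: the function name and dictionary keys are hypothetical, it assumes that every score is oriented so that higher values mean better health (consistent with the positive mean improvements reported), and it assumes that an improvement exactly equal to the MCID counts as attainment, which the text does not specify.

```python
# Published anchor-based MCID values quoted in the text.
MCID = {"sf36_pcs": 10.0, "sf36_mcs": 5.0, "womac": 15.0}


def attained_mcid(prom, preop, postop):
    """Return True if the 2-year improvement reaches the MCID for this PROM.

    Assumptions (not stated in the source): higher scores mean better
    health, so improvement = postop - preop, and an improvement exactly
    equal to the MCID counts as attainment.
    """
    improvement = postop - preop
    return improvement >= MCID[prom]


# Invented example values, for illustration only.
womac_result = attained_mcid("womac", preop=60.0, postop=80.0)      # improvement of 20
mcs_result = attained_mcid("sf36_mcs", preop=50.0, postop=53.0)     # improvement of 3
```

The resulting binary label is the target that both the ML models and the threshold approach try to predict.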

Statistical analysis

Statistical analysis and predictive modelling were performed using Python 3.7 (Python Software Foundation, Wilmington, DE) and R software version 4.0.3 (R Foundation for Statistical Computing, Vienna, Austria; 2019). First, the 2840 TKAs were randomly split into a training set (80%) and test set (20%). Data in the training set was used to train and develop the various predictive models while evaluation of model performance was performed using the independent test set to assess its ability to generalize to unseen data.
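A random 80/20 split of this kind can be sketched as follows. The helper name, the seed and the use of the standard library are assumptions made for illustration; the text does not say which routine was used or whether the split was stratified by outcome.

```python
import random


def split_indices(n, test_fraction=0.2, seed=0):
    """Randomly partition the indices 0..n-1 into training and test sets.

    The seed is arbitrary and only makes the sketch reproducible.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    # The first n_test shuffled indices form the test set; the rest train.
    return indices[n_test:], indices[:n_test]


# 2840 TKAs, as reported in the text.
train_idx, test_idx = split_indices(2840)
```

Keeping the test indices untouched until the final evaluation is what allows the test set to stand in for unseen data.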

Machine learning algorithms

For the ML approach, four of the most popular supervised ML algorithms were used: (1) random forest (RF), (2) extreme gradient boosting (XGB), (3) support vector machines (SVM) and (4) logistic regression with L1-regularization (LASSO).

Input variables for the ML models included patient demographics, comorbidities and preoperative PROMs (Table 1). Input variables were limited to preoperative data, as we believe the greatest utility of such prediction models would be to aid patients and surgeons in preoperative decision-making. Missing input variables, namely BMI (n = 9) and preoperative pain score (n = 5), were imputed with the corresponding median values in the training or test set. Upsampling (random oversampling) was implemented during model training due to class imbalance, which could potentially affect the predictive performance of some algorithms [1]. All ML models were trained using fivefold cross-validation in the training set to optimize their hyperparameters before the final evaluation of their predictive performance on the independent test set.
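The two preprocessing steps described above, median imputation and random oversampling, can be sketched in plain Python. The function names and toy values are hypothetical, and the sketch assumes binary 0/1 labels and oversampling until both classes are equal in size; the text does not report the target class ratio or the library used.

```python
import random
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced.

    Assumes binary labels coded 0/1 with at least one row per class.
    Intended for the training data only, so the test set keeps its
    original class distribution.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    deficit = len(by_class[1 - minority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return list(rows) + extra, list(labels) + [minority] * deficit


# Invented toy data, for illustration only.
bmi = impute_median([25.0, None, 31.0, 28.0])
X_bal, y_bal = random_oversample([[1], [2], [3], [4]], [1, 1, 1, 0])
```

Oversampling only the training folds avoids leaking duplicated rows into the data used for evaluation.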

Table 1 Input variables for ML models

Preoperative PROM thresholds

A non-parametric ROC analysis, similar to the method described by Berliner et al., was used to determine the preoperative PROM threshold that could best predict MCID attainment in the training set [2]. The optimal threshold value was determined using the Youden method, which corresponds to the point on the ROC curve that maximizes the classifier’s combined sensitivity and specificity [28]. This optimal threshold value was then implemented in the unseen test set to evaluate its ability to predict MCID attainment.
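The Youden method can be illustrated with a short, self-contained sketch that scans every observed preoperative score as a candidate cut-off and keeps the one maximizing J = sensitivity + specificity - 1. The function name and toy data are invented for illustration. The direction of the rule (a score at or below the threshold predicts MCID attainment) is an assumption based on the observation that patients with poorer preoperative scores improve more.

```python
def youden_threshold(preop_scores, attained):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Assumed decision rule: a patient is predicted to attain the MCID
    when the preoperative score is at or below the threshold.
    `attained` holds 1 for MCID attainment and 0 otherwise.
    """
    n_pos = sum(attained)
    n_neg = len(attained) - n_pos
    best_threshold, best_j = None, -1.0
    for t in sorted(set(preop_scores)):
        pairs = list(zip(preop_scores, attained))
        tp = sum(1 for s, y in pairs if s <= t and y == 1)
        tn = sum(1 for s, y in pairs if s > t and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Invented toy training data: lower preoperative scores attain the MCID.
scores = [40, 50, 60, 70, 80, 90]
labels = [1, 1, 1, 1, 0, 0]
threshold, j_stat = youden_threshold(scores, labels)
```

A threshold chosen this way on the training set is then applied unchanged to the test set, which is what keeps the comparison with the ML models fair.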

Evaluation of predictive performance

The area under the ROC curve (AUC) is a commonly used evaluation metric for predicting binary outcomes, measuring the model’s ability to discriminate between different classes. In general, an AUC of 0.7–0.8 indicates fair discriminative ability, 0.8–0.9 indicates good discriminative ability and 0.9–1.0 indicates excellent discriminative ability [24]. Other classification metrics include the F1 score, Brier score, sensitivity and specificity.
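For readers less familiar with these metrics, the sketch below computes each of them from first principles. It is a minimal illustration with invented data and hypothetical function names, not the study's evaluation code. The 0.5 probability cut-off used to obtain class predictions is also an assumption.

```python
def auc(y_true, y_score):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and F1 score from binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Invented outcomes and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 cut-off
metrics = classification_metrics(y_true, y_pred)
```

The AUC and Brier score are computed from predicted probabilities, whereas sensitivity, specificity and the F1 score depend on the cut-off used to turn probabilities into class predictions.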

Results

Baseline patient characteristics of the 2840 TKAs are reported in Table 2. Of these patients, 70.7% (n = 2008) were female. The mean age was 66.3 (SD: 8.2) years and the mean BMI was 28.0 (SD: 5.0).

Table 2 Baseline patient characteristics

Improvements in PROMs are summarized in Table 3. At 2 years postoperatively, mean improvements for SF-36 PCS, SF-36 MCS and WOMAC were +15.4 (SD: 9.2), +3.3 (SD: 8.8) and +26.5 (SD: 13.8) respectively. MCID attainment was 73.6% for SF-36 PCS, 38.5% for SF-36 MCS and 81.2% for WOMAC.

Table 3 PROM improvements and MCID attainment

Predictive performance

The predictive performance of ML algorithms and preoperative PROM thresholds is reported in Table 4. All models were evaluated on the unseen test set. For brevity, only the top two performing ML models are presented. For all outcomes, ML models and preoperative PROM thresholds performed similarly in terms of their discriminative ability (AUC). For the SF-36 PCS, ML models achieved a maximum AUC of 0.77 while the preoperative PROM threshold achieved an AUC of 0.74. For the SF-36 MCS, ML models achieved a maximum AUC of 0.95 while the preoperative PROM threshold achieved a similar AUC of 0.95. For the WOMAC, ML models achieved a maximum AUC of 0.89 while the preoperative PROM threshold achieved an AUC of 0.88.

Table 4 Predictive performance of ML algorithms and preoperative PROM thresholds

Using ROC analysis on the training data, the identified optimal preoperative PROM thresholds are 33.6 for the PCS, 54.1 for the MCS and 72.7 for the WOMAC (Figs. 1, 2, and 3). These threshold values are preoperative cut-offs that can be used to predict MCID attainment for the same PROM. For example, a preoperative threshold of 72.7 for the WOMAC predicts MCID attainment in the test set with 84.7% sensitivity and 71.2% specificity.

Fig. 1 Optimal preoperative threshold for SF-36 PCS in the training set

Fig. 2 Optimal preoperative threshold for SF-36 MCS in the training set

Fig. 3 Optimal preoperative threshold for WOMAC in the training set

Machine learning variable importance

To understand the variables which drive prediction in our ML models, variable importance was assessed using permutation importance, a model-agnostic method which has been shown to generate reliable insights correlating with clinical intuition [4]. The permutation importance scores of the top five variables for each PROM are summarized in Table 5. For each PROM, the single most important predictor of MCID attainment was consistently its corresponding preoperative score. In comparison, other variables such as patient age and operating surgeon played a relatively minor role, as evidenced by their minute permutation importance scores.
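The idea behind permutation importance can be sketched as follows: shuffle one input column at a time and record how much the model's performance drops. This is a simplified illustration, not the study's implementation. It scores the model by accuracy, which is an assumption because the text does not state the scoring metric, and the toy model, data and function names are invented.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn.

    Works with any callable model, which is why the method is described
    as model-agnostic.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only this column's value replaced.
            X_perm = [list(row[:col]) + [v] + list(row[col + 1:])
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Invented toy data: column 0 is a preoperative score, column 1 is age.
X = [[40, 60], [50, 72], [55, 65], [65, 70], [75, 61], [80, 68], [85, 74], [90, 66]]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Hypothetical model that uses only the preoperative score."""
    return int(row[0] <= 70)


importances = permutation_importance(toy_model, X, y)
```

In this toy setting, shuffling age leaves accuracy unchanged because the model ignores it, while shuffling the preoperative score degrades accuracy. This mirrors the pattern reported here, in which the preoperative score dominates the other variables.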

Table 5 Permutation importance scores for ML models

Discussion

The key finding of our study was that ML algorithms did not perform significantly better than preoperative PROM thresholds in predicting MCID attainment after TKA. Our study is the first to directly compare the predictive performance of ML algorithms with preoperative PROM thresholds in predicting MCID attainment after TKA. Our results have shown that simple preoperative PROM thresholds perform just as well as ML algorithms in predicting MCID attainment for SF-36 PCS, MCS and WOMAC after TKA. This is likely because the single most important predictor of MCID attainment after TKA is the patient’s preoperative PROM, with other variables such as patient demographics and comorbidities shown to play a relatively minor role.

However, this is not a denunciation of ML. Rather than saying that ML is not useful, our results simply make the argument that ML algorithms may not perform significantly better than existing methods when applied to this clinical problem and trained using currently available data. Furthermore, current registry data may lack the granularity required by ML algorithms to generate deeper insights beyond those offered by preoperative PROMs.

While artificial intelligence has the potential to transform healthcare, not all clinical problems may warrant the use of ML algorithms or even benefit significantly from them, especially if existing methods work just as well [22]. However, it remains unclear what types of clinical problems and datasets may benefit the most from ML algorithms. Some studies also do not routinely compare their performance with conventional statistical techniques, making it difficult to critically evaluate the comparative advantage of ML algorithms. This issue has been highlighted by Christodoulou et al. in a systematic review of 71 studies, which found that ML algorithms did not outperform conventional logistic regression in predicting various clinical outcomes [5]. A recent study by Pua et al. also found that ML algorithms did not perform better than logistic regression in predicting severe walking limitation after TKA [20]. Given that this is a vital topic that remains poorly understood, future research should routinely evaluate the comparative advantage of ML algorithms over existing methods and determine the types of clinical problems which may benefit the most from them.

In addition, this study has determined optimal preoperative PROM thresholds using ROC analysis: 33.6 for the SF-36 PCS, 54.1 for the MCS and 72.7 for the WOMAC. These findings are consistent with prior work by Berliner et al., who reported a preoperative threshold of 34 points for the SF-12 PCS [2]. Moreover, this is the first study to report preoperative PROM thresholds for predicting MCID attainment in SF-36 MCS and WOMAC after TKA. These findings show that, despite their simplicity, preoperative PROM thresholds are effective and remain a clinically viable tool in predicting postoperative functional improvements, with patients exceeding such thresholds less likely to attain a clinically meaningful change.

With increased ability to predict functional improvements after TKA, there is a greater need to define the utility and role of such prediction tools in clinical practice. Price et al. reported that a preoperative OKS threshold of 41 is predictive of MCID attainment and proposed using the threshold to guide specialist referral by primary care physicians [19]. While this is a step in the right direction towards better resource utilization and individualized patient care, such thresholds remain imperfect prediction tools with significant false positives and negatives. Furthermore, the MCID is derived from population estimates, and using a point value to assess clinically meaningful change may risk misclassifying individuals whose thresholds for perceiving improvement vary [7]. As such, these prediction tools may be better suited for preoperative surgical counselling and shared decision-making, whereby patients can be individually counselled on their expected improvements so as to make more informed decisions regarding TKA [2]. Prior studies have also shown that modifying patient expectations prior to joint arthroplasty can improve postoperative satisfaction, thus highlighting the potential of such decision-making tools to encourage more realistic expectations regarding TKA [13, 18].

There are several limitations in this study. First, since the data used in this study were extracted from a single institution’s joint replacement registry, the extent to which these findings are generalizable to other healthcare institutions is unclear. Next, a potential criticism is that our ML models could have performed better if given additional input data. However, input variables were selected based on current understanding of the predictors of MCID attainment after TKA and are similar to the datasets used by prior studies employing ML algorithms [9, 12, 14]. Furthermore, input variables for ML algorithms were limited to preoperative data, as the greatest utility of such prediction tools is to enhance preoperative counselling and modify patient expectations. While the inclusion of postoperative data may potentially improve their predictive performance, such algorithms would have limited utility in the preoperative period. Lastly, there were other PROMs such as the Oxford Knee Score and Knee Society Score, as well as other subjective outcomes such as patient satisfaction and expectation fulfilment, that were not considered in this study [23].

Conclusion

Machine learning algorithms did not perform significantly better than preoperative PROM thresholds in predicting MCID attainment after TKA. Future research should routinely evaluate the predictive advantage of ML algorithms over existing methods and determine the types of clinical problems which may benefit the most from them.