Introduction

The implementation of bundled payment models as an economic strategy to limit financially undesirable health care costs associated with total joint arthroplasty procedures has become commonplace among hospitals throughout the United States [1, 2]. One consequence of this policy has been an increasing focus on the ability to predict and anticipate patient outcomes, the selective recruitment of patients, and the promotion of preoperative health optimization based on risk factors. Procedure-associated complications after total hip arthroplasty (THA) are a significant potential source of increased health-care expenditures secondary to hospital and emergency room readmissions, in addition to the possible need for revision arthroplasty [2]. Therefore, understanding during the preoperative period which patients are at risk for complications would potentially allow for targeted intervention at a time when health optimization could be performed to decrease this complication risk.

Much of the current knowledge regarding risk factors associated with complications after THA is limited to associations identified in studies that are not hypothesis-driven but found through testing many random variables [3,4,5,6,7,8]. Some of these risk factors include age [7, 9], number of comorbidities [7, 9], body mass index [5, 10], smoking [6, 11], sex [12], and diabetes mellitus [11]. Few studies have sought to develop risk models incorporating sets of such known preoperative risk factors in order to understand and predict which patients are at higher risk for complications in the postoperative period [13], while other prediction models have been limited by their use of intraoperative data and therefore do not allow for preoperative optimization [14]. As such, there is currently a large pool of potential risk factors and a limited number of prediction models utilizing solely preoperative and modifiable risk factors. Developing and cross-validating models with the smallest number of the most important factors would be of great clinical utility to hip and knee surgeons. Such a stratification model would also benefit patients by providing them with the opportunity to optimize their health prior to undergoing THA.

Machine learning is a powerful statistical tool capable of determining patient-specific factors that influence the probability of a patient experiencing a complication after primary THA. Furthermore, machine learning allows for the development of clinical decision-making tools, which can be used in office-based settings to help discuss risk stratification with patients [15,16,17]. This may help orthopaedic surgeons to better determine which patients may need further optimization prior to undergoing THA. The purposes of the current study were to (1) develop and internally validate machine learning algorithms capable of predicting all-cause complications within two years of primary THA, and (2) use these algorithms to determine which preoperative factors are important in predicting all-cause complications after primary THA. The authors hypothesized that the best-performing machine learning algorithm would allow for both excellent prediction and an interpretable explanation of how factors specific to individual patients influenced the model decision making.

Methods

Patient selection

Following institutional review board approval, data were obtained retrospectively from the electronic medical records of patients who underwent primary total hip arthroplasty by one fellowship-trained surgeon at one large academic and two community hospitals. The timeframe for patient inclusion was between January 2014 and January 2016. Exclusion criteria included etiology of degenerative hip osteoarthritis that was inflammatory, infectious, post-traumatic, acute femoral neck fracture, or related to osteonecrosis, patients undergoing revision THA, and patients with less than two-year follow-up. Overall, 616 patients met the inclusion criteria and had a median age of 62 (interquartile range [IQR] 54–70) years. A total of 352 (57.1%) patients were female. Additional demographic and clinical outcome information is displayed in Table 1. A minimum of 100 patients has been demonstrated to be an appropriate sample size for machine learning analyses and associated predictive analytics, and therefore the current sample of 616 patients was deemed valid [18, 19].

Table 1 Characteristics of study population, n = 616

Primary outcome

The primary outcome was all-cause complications within the two-year follow-up period. Complications were considered all events classified to be either medical or orthopaedic (Table 2). Medical complications included post-operative myocardial infarction, pulmonary embolism, deep vein thrombosis, atrial fibrillation, and anemia requiring blood transfusion. Orthopaedic complications included nerve palsy, hematoma formation, heterotopic ossification, hip squeaking, wound abscesses or dehiscence, periprosthetic infection, intra- and post-operative fractures, dislocations, leg-length discrepancy, aseptic loosening, and atraumatic, return to the emergency department or readmission for any complaint related to the operative hip, and reoperations.

Table 2 Surgical and medical complications

Candidate variables

Candidate variables were collected prospectively before THA and stored in a secure clinical repository. Candidate variables are listed in Table 1, with the rates of missing data as follows: preoperative opioid use (n = 2, 0.32%), smoking history (n = 2, 0.32%), diabetes at time of surgery (n = 2, 0.32%), drug allergies (n = 86, 14.0%), presence of one or more comorbidities (n = 2, 0.32%), preoperative health state (n = 26, 4.2%), preoperative modified Harris Hip Score (mHHS) (n = 17, 2.8%), and preoperative hip flexion (n = 197, 32.0%). Hip flexion was the only variable with greater than 30% missing data and was consequently excluded [20, 21]. The patient-reported outcome measures (PROMs) collected were the patient reported health state (PRHS) [22] and the modified Harris Hip Score (mHHS) [23]. Prior to analysis, missingness of data was explored and determined to be missing at random and appropriate for multiple imputation. The current analysis applied multiple imputation and predictive mean matching with the “mice” package in R (R Foundation for Statistical Computing, Vienna, Austria) [24]. Following imputation, recursive feature elimination (RFE) with random forest algorithms was used to determine the combination of variables with the highest predictive value that optimized algorithm performance through a process of backwards elimination.
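The missing-data screen described above can be reproduced as a short sketch. The counts and the 30% threshold come from the text; the function and variable names are hypothetical and chosen here for illustration only, and the sketch does not cover the imputation step itself.

```python
# Missing-value counts reported in the text, out of 616 patients.
N_PATIENTS = 616
MISSING_COUNTS = {
    "preoperative opioid use": 2,
    "smoking history": 2,
    "diabetes at time of surgery": 2,
    "drug allergies": 86,
    "one or more comorbidities": 2,
    "preoperative health state": 26,
    "preoperative mHHS": 17,
    "preoperative hip flexion": 197,
}
EXCLUSION_THRESHOLD = 0.30  # variables missing in more than 30% of patients are dropped


def missing_fraction(n_missing, n_total):
    """Fraction of patients with a missing value for one variable."""
    return n_missing / n_total


def split_by_missingness(counts, n_total, threshold):
    """Return (retained, excluded) variable names for a missingness threshold."""
    retained, excluded = [], []
    for name, n_missing in counts.items():
        if missing_fraction(n_missing, n_total) > threshold:
            excluded.append(name)
        else:
            retained.append(name)
    return retained, excluded


retained, excluded = split_by_missingness(MISSING_COUNTS, N_PATIENTS, EXCLUSION_THRESHOLD)
for name, n_missing in MISSING_COUNTS.items():
    print(f"{name}: {100 * missing_fraction(n_missing, N_PATIENTS):.2f}% missing")
print("excluded:", excluded)
```

Only preoperative hip flexion crosses the threshold, which matches the exclusion reported in the text; the remaining variables would proceed to imputation.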

Algorithm construction and performance assessment

The machine learning algorithm development methodology and data analysis have been previously described in detail [25, 26]. Briefly, five novel algorithms were constructed on a training set of patients (80% of initial cohort) using three iterations of tenfold cross-validation. Standardized metrics of model performance, including (1) calibration (calibration plot, intercept, slope) [27, 28], (2) decision curve analysis [29, 30], (3) Brier score [31], and (4) discrimination (area under the receiver operating characteristic curve [AUC]), were used to comparatively evaluate model performance on both the training and testing sets.
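Two of the metrics named above, the Brier score and the AUC, have simple closed-form definitions that a minimal sketch can make concrete. The toy outcomes and predicted probabilities below are hypothetical and are not study data; the function names are likewise illustrative, and the authors' own implementation is not described in the text.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random event is scored above a random non-event.

    Ties count as one half (Mann-Whitney formulation of the AUC).
    """
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = complication, 0 = no complication.
outcomes = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0.05, 0.10, 0.60, 0.20, 0.15, 0.02, 0.30, 0.80]

print(f"Brier score: {brier_score(outcomes, predicted):.3f}")
print(f"AUC: {auc(outcomes, predicted):.3f}")
```

A lower Brier score indicates better overall accuracy of the predicted probabilities, while an AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of patients with and without complications.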

Exploration of patient-specific model explanations

Local interpretable model-agnostic explanations (LIME) depict the decision-making process of machine learning algorithms and were used to demonstrate how the best performing algorithm explained prediction on a patient-by-patient basis [32]. Using LIME, an open access digital web application was developed with the capacity to provide both predictions and explanations at the individual patient level [33]. This application is freely accessible: https://sorg-apps.shinyapps.io/tha_complication/. Given that external validation was not performed in the present study, the application in its current form merely constitutes an educational tool and open-access source to the developed algorithms.

The Anaconda Distribution (Anaconda, Inc., Austin, Texas), R (The R Foundation, Vienna, Austria), RStudio (RStudio, Boston, MA), and Python (Python Software Foundation, Wilmington, Delaware) were used for data analysis. Predictive model development and testing followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guidelines and the Guidelines for Developing and Reporting Machine Learning Models in Biomedical Research [34, 35].

Results

Final variable selection

The combination of variables identified for algorithm development through recursive feature elimination that optimized predictive performance was comorbidities, preoperative opioid use greater than three months, current smoking, prior hip surgery, drug allergies, and age (Fig. 1B).
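The backward-elimination logic behind recursive feature elimination can be sketched generically. This is an illustration under stated assumptions, not the authors' pipeline: the study ranked variables with random forest algorithms, whereas the ranking and scoring functions, the importance values, and the two "noise" variables below are all hypothetical stand-ins.

```python
def recursive_feature_elimination(features, rank_fn, score_fn):
    """Drop the least important feature each round and keep the
    best-scoring subset seen along the way (backward elimination)."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > 1:
        importances = rank_fn(current)
        weakest = min(current, key=lambda name: importances[name])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # prefer the smaller subset on ties
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Hypothetical importances; a real analysis would refit a model each round.
TOY_IMPORTANCE = {
    "age": 0.30,
    "drug_allergies": 0.25,
    "prior_hip_surgery": 0.20,
    "noise_a": 0.01,
    "noise_b": 0.02,
}
USEFUL = {"age", "drug_allergies", "prior_hip_surgery"}


def toy_rank(subset):
    """Return the (fixed, hypothetical) importance of each remaining feature."""
    return {name: TOY_IMPORTANCE[name] for name in subset}


def toy_score(subset):
    """Hypothetical performance: useful features add, noise features subtract."""
    gain = sum(TOY_IMPORTANCE[name] for name in subset if name in USEFUL)
    penalty = 0.05 * sum(1 for name in subset if name not in USEFUL)
    return gain - penalty


selected, score = recursive_feature_elimination(list(TOY_IMPORTANCE), toy_rank, toy_score)
print(selected, round(score, 2))
```

In this toy setting the two noise variables are removed first and the score then falls once informative variables start to be dropped, so the three informative variables are returned as the selected subset.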

Fig. 1

A Receiver operating characteristic curve demonstrating discrimination of the stochastic gradient boosting algorithm. B Global variable importance plot, with variables ranked in decreasing order of importance

Algorithm selection and model performance

Cross-validation of the training set (n = 494) demonstrated that the AUC ranged from 0.81 to 0.92, the calibration intercept ranged from −0.21 to 5.09, the calibration slope ranged from 0.8 to 3.49, and the Brier score ranged from 0.08 to 0.12 (Table 3).
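For orientation, calibration intercept and slope are conventionally defined through a logistic recalibration model. The notation below (Y for the observed outcome, \hat{p} for the predicted probability, a and b for intercept and slope) is introduced here for illustration; the text does not state which exact fitting procedure was used.

```latex
% Calibration slope b: logistic regression of the observed outcome Y
% on the logit of the predicted probability \hat{p}.
\operatorname{logit}\Pr(Y = 1) = a + b \,\operatorname{logit}(\hat{p}),
\qquad \operatorname{logit}(p) = \ln\frac{p}{1 - p}

% Calibration intercept a: the same model with the slope fixed at 1.
\operatorname{logit}\Pr(Y = 1) = a + \operatorname{logit}(\hat{p})
```

A perfectly calibrated model has a = 0 and b = 1. A slope above 1 suggests predictions that are too moderate, a slope below 1 suggests predictions that are too extreme (as occurs with overfitting), and a positive intercept suggests that risk is underestimated on average.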

Table 3 Algorithm performance on cross-validation of training set, n = 494, mean (95% confidence interval)

In the testing set, the AUC ranged from 0.77 to 0.93, the calibration intercept ranged from −0.50 to 2.11, the calibration slope ranged from 0.89 to 1.22, and the Brier score ranged from 0.08 to 0.16 (Table 4). The algorithm with the best performance was the stochastic gradient boosting model, with AUC 0.88, calibration intercept 0.103, calibration slope 1.22, and Brier score 0.09 (Figs. 1A and 2A). The most important factors for prediction of complications were age, documented drug allergies, prior ipsilateral hip surgery, smoking, and preoperative opioid use (Fig. 1B). The stochastic gradient boosting model resulted in greater net benefit compared to the default strategies of changing management for all patients, for no patients, or based on age alone, as demonstrated by the decision curve analysis (Fig. 2B).

Table 4 Algorithm performance in independent testing set (95% confidence interval), n = 122

Fig. 2

A Calibration plot for stochastic gradient boosting algorithm. B Decision curve analysis of stochastic gradient boosting algorithm. The decision curve analysis shows the net benefit of the model (blue line) relative to the default strategies of changing management for all patients (“all”) or for no patients (“none”). The “all” line represents the net benefit from changing management for all patients. The line slopes down because at a threshold of zero, false positives are given no weight relative to true positives; as the threshold increases, false positives gain increased weight relative to true positives and changing management for all patients results in decreasing net benefit. The horizontal line (“none”) represents the default strategy of changing management for no patients (net benefit is zero at all thresholds)
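The net benefit quantity plotted in a decision curve can be written out as a short sketch using its standard definition (true positives per patient minus false positives per patient, weighted by the odds of the threshold probability). The toy data and function names are hypothetical and are not taken from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of changing management when predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # weight of a false positive relative to a true positive
    return tp / n - (fp / n) * weight


def net_benefit_all(y_true, threshold):
    """Net benefit of the default strategy of changing management for all patients."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def net_benefit_none():
    """Changing management for no patients yields zero net benefit at every threshold."""
    return 0.0


# Hypothetical toy data: 1 = complication, 0 = no complication.
outcomes = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0.05, 0.10, 0.60, 0.20, 0.15, 0.02, 0.30, 0.80]

for threshold in (0.0, 0.25, 0.5):
    model = net_benefit(outcomes, predicted, threshold)
    treat_all = net_benefit_all(outcomes, threshold)
    print(f"threshold={threshold:.2f} model={model:.3f} all={treat_all:.3f} none={net_benefit_none():.3f}")
```

At a threshold of zero the model and the “all” strategy coincide at the event prevalence, because false positives carry no weight; as the threshold rises, the “all” strategy loses net benefit faster than a model that discriminates well.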

Potential application and utility of machine learning using patient specific explanations

An individual patient-specific risk explanation demonstrating the potential utility of using this risk stratification model towards preoperative health optimization applications is depicted in Fig. 3.

Fig. 3

Example of individual patient-level explanation for prediction of postoperative complications by stochastic gradient boosting algorithm. A Scenario 1 depicts an 80-year-old patient who presents to a total joint arthroplasty clinic with no prior history of hip surgery and more than one medical comorbidity, has used opioids for greater than three months to manage their current hip pain, and is a current smoker. For this particular patient, the risk of experiencing a complication after primary THA is 6.0%. B In scenario 2, following a period of preoperative health optimization in which they quit smoking, stopped using opioid medications, and worked with a primary physician to improve their medical comorbidities, this patient reduced their risk of a postoperative complication to 1.0%. Features in red contradict the prediction of a complication (decrease the risk), while features in green support it (increase the risk)

Discussion

The main finding of the current proof-of-concept study was that the best performing machine learning algorithm conferred good predictive capability with regard to the risk of experiencing a complication after primary THA at the senior author’s institution. This model incorporated various modifiable and patient-specific risk factors that can be targeted for optimization during the preoperative period, which may decrease the risk of complications after surgical intervention. Furthermore, the development of a clinical decision-making tool which depicts and calculates patient-specific risk for complications can theoretically augment patient care by providing teachable and real-time data in clinic settings which may be used towards health optimization purposes. Such data may also be critical for establishing tiers for alternative payment models so as not to promote the potential for “cherry-picking” behavior and access to care problems.

There are several limitations that should be considered within the context of the current study results. Although the current machine learning algorithms had good prediction capabilities, the incidence of complications was not high enough to perform prediction of individual complication categories such as periprosthetic joint infection or myocardial infarction. The definition of complications was also broad so as to capture a wide variety of potential postoperative events, which may have inflated the complication rate. However, we believe that the included complications represent primarily major events which would be relevant to associated penalties in alternative payment models. This study is retrospective in design and therefore is subject to biases inherent in such data collection methods; however, there was a high completion rate of included data and multiple imputation methods were employed to mitigate the effect of missing data. In addition, the relatively small sample size limited the holdout (testing) dataset for assessment of model performance to only 122 patients. As such, the present study represents a proof-of-concept design and the open-access tool presented herein is merely for educational purposes until rigorous external validation is performed. It is yet to be known if the same predictive value in assessing risk will hold in other patient populations. Finally, this analysis is limited in that, although we investigated the predictive importance of all variables available in this specific institutional repository, other variables important for predicting complications likely exist. Future studies are warranted that incorporate other clinically relevant variables into the current model to determine whether or not these variables confer beneficial changes to the overall performance of the model in predicting all-cause complications.

The current study demonstrated that the stochastic gradient boosting algorithm was the best performing machine learning algorithm of the five that were developed and internally validated. This particular model had an AUC of 0.88, which is considered good discriminatory capability, and demonstrated appropriate predictive probabilities relative to observed events as the model did not overfit the data (Fig. 2A). This is particularly important in reference to standard AUC values, as in real-life practice complications are better described at the patient-level as probabilities of experiencing an event as opposed to a binary all-or-nothing event. Although the random forest (Supplement 1) and elastic-net penalized logistic regression (Supplement 2) also performed well, they had inferior calibration compared to the stochastic gradient boosting model. Furthermore, the decision-curve analysis in the current study (Fig. 2B) demonstrated that using the stochastic gradient boosting model for risk stratification of postoperative complications was superior to considering all patients as high risk, none of the patients as high risk, and when considering age alone as a risk factor. Put simply, for patients undergoing primary THA, the stochastic gradient boosting model conferred greater utility in terms of preoperative risk stratification in comparison to alternate strategies of determining complication risk. This model provides concise assessment of complication risk in patients undergoing primary THA based on synthesized preoperative patient data. In addition, the model was incorporated into a proof-of-concept application that is user friendly and patient-specific while requiring fewer variables than prior prediction risk models [13, 14].

The current study found that the most important patient-specific factors contributing to complications in the institutional data set under consideration were age, medication allergies, opioid use, smoking, comorbidities, and prior hip surgery. It is of note, however, that the primary outcome in this study was all-cause complications. Using this all-encompassing definition of complications was necessary because individual complications, such as myocardial infarction, were too infrequent to develop a meaningful model to predict each specific complication. In this context, some of the model interpretability is lost. For example, though a greater patient age may be predictive of all-cause complications, it is possible that increased age was simultaneously protective against some specific complications, such as dislocation. Few studies have used this all-encompassing definition of complications in attempting to determine associations between preoperative variables and complications after THA. Harris et al. [36] used the American College of Surgeons-National Surgical Quality Improvement Program (ACS-NSQIP) database and least absolute shrinkage and selection operator (LASSO) methods to predict 30-day complications and mortality after total hip and knee arthroplasty procedures. The authors also found that age and various comorbidities were important contributors to experiencing complications after total knee or total hip arthroplasty. However, limitations to their study include: (1) combining total hip and knee arthroplasty patients, which are representative of potentially different patients with distinct risk profiles; (2) using LASSO methodology with less than excellent discriminatory capabilities (all models with AUCs less than 0.8, and for all complications, equal to 0.68); and (3) utilization of a national database with inherent limitations such as overrepresentation of specific populations and the limitation of 30-day complication data. Although the current study was not able to externally validate the best performing algorithm as Harris et al. did on the Veterans Affairs Surgical Quality Improvement Program (VASQIP) database, the model in the current study has the benefits of (1) using institutional data from a single surgeon; (2) capturing complications within two years of primary THA; and (3) having rigorously tested five independent classification-type machine learning models. Nonetheless, there remain limitations to both the current study and that performed by Harris et al. [36] which will need to be improved upon prior to creating a meaningful tool amenable to confidently predicting complications in diverse populations.

The model in the current study provides a rapid method for combining various pertinent clinical data points to accurately quantify patient risk at the individual level. Although risk stratification has been previously investigated in elective THA, research has primarily focused on global assessment of risk. The novelty of the methodology developed in the present study is the ability to efficiently determine risk at the individual patient-level and receive real-time feedback on the patient factors influencing the risk calculation. As the dataset in the current study is small and requires external validation, the presented patient scenario (Fig. 3) represents a proof-of-concept for machine learning capabilities and how machine learning can potentially impact clinical workflow and patient outcomes in the future. Though the clinical utility of this algorithm remains questionable, it is important to demonstrate how the machine learning algorithm and online application function. Future studies are warranted to externally validate the model in the current study and determine if additional variables could be of clinical utility, as well as to determine if patients undergoing total knee arthroplasty require a separate risk model. This open-source tool is for educational purposes only and should not be used in clinical settings at this time due to its generalizability being unknown outside of the authors’ institution.

Conclusion

The stochastic gradient boosting algorithm demonstrated good discriminatory capacity for identifying patients at high risk of experiencing a postoperative complication and provided proof-of-concept for creating office-based applications from machine learning that can perform real-time prediction. However, the clinical utility of the current algorithm is unknown and the definition of complications was broad. Further investigation on larger data sets and rigorous external validation are necessary prior to the assessment of clinical utility with respect to risk-stratification of patients undergoing primary THA.