Introduction

Total hip arthroplasty (THA) is one of the most common orthopedic procedures performed in the United States, with approximately 114,000 THA procedures performed each year [1]. Despite the high success rate of THA, approximately 4.3% of patients require revision THA within 10 years [2]. Revision THA is associated with increased surgical complexity, longer operating time, and a higher risk of complications compared to primary THA [3]. The cost of revision THA is also markedly higher, with the average cost per episode of care reported to be $87,000 in 2014 [2]. The total length of stay (LOS) after revision THA is an important factor that influences both patient outcomes and hospital expenditure [4,5,6]. Previous studies have predicted an extra day in the hospital to increase the cost burden by $2,000–$3,000 [7]. The ability to predict prolonged LOS in individual patients can encourage proactive measures and allocate resources for those patients, thereby improving treatment efficiency and reducing care costs [8].

Traditional statistical models have been historically used to predict prolonged LOS following total joint arthroplasty [9,10,11,12]. Yet, the model performances were inherently limited by the model linearity and simplification of the variable interrelationship. Recently introduced machine learning (ML) models have outperformed such statistical approaches in terms of accuracy and predictive performance in a variety of contexts [13,14,15,16]. As a result, ML models are increasingly being used to predict outcomes in a clinical setting [17]. Previous studies have reported excellent ML model performance in identifying patients at high risk of prolonged LOS after primary total joint arthroplasty [18,19,20,21]. With rising interest in ML models, there has been an increased call for action to broaden the applicability of the model by incorporating multi-center patient data as a way to establish the model’s generalizability while maintaining a high level of accuracy [22, 23].

Large national datasets have been recommended due to their accessibility and availability of a variety of clinical and demographic data on surgical cases across the United States. The American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) is one such database that aggregates patient data from multiple sites in the United States [15]. This study aimed to develop ML models using the ACS-NSQIP dataset and evaluate their performance across three domains: discrimination, calibration, and clinical utility for the prediction of prolonged LOS after revision THA.

Materials and methods

Patient cohort, variables, and study outcomes

The ACS-NSQIP databases from 2013 to 2022 were reviewed to acquire the data of patients who underwent revision THA. The CPT codes 27134, 27137, and 27138 were used to identify our research population. The exclusion criteria were age under 18 or above 100, body mass index (BMI) higher than 100, emergency surgery, bilateral arthroplasty, and incomplete/unclear hospital records. Data were also excluded if a negative value was recorded in any preoperative blood test. For example, a white blood cell count of -99 thousand/mm^3 was considered a faulty entry as the value was physiologically impossible. This study was reviewed and approved by the institutional review board. The procedure for ML modeling in this study was reported following an established publication guideline [24].

The prediction target of the ML models was prolonged LOS following revision THA. In line with prior research [10, 25], we used the 75th percentile of all LOSs (3 days) as the cut-off value to divide the study cohort into two classes: normal LOS and prolonged LOS. Inputs to the ML models included sociodemographic variables (age, sex, race, ethnicity, BMI, and smoking history), comorbidities (dyspnea, diabetes, hypertension, bleeding disorders, congestive heart failure, etc.), and perioperative variables (American Society of Anesthesiologists (ASA) score, blood test results, total operation time, blood loss, transfusion, anesthesia type, etc.).
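The dichotomization step can be illustrated with a minimal sketch. The function name, the toy LOS values, and the choice to treat stays strictly longer than the cut-off as prolonged are assumptions made for illustration; the text specifies only that the 75th percentile (3 days) was the cut-off.

```python
import statistics

def prolonged_los_labels(los_days, percentile=75):
    """Label each stay as prolonged (1) or normal (0) using a percentile cut-off."""
    # quantiles(n=100) returns the 1st..99th percentile cut points in order.
    cut_points = statistics.quantiles(los_days, n=100, method="inclusive")
    cutoff = cut_points[percentile - 1]
    # Assumption: only stays strictly longer than the cut-off are "prolonged".
    labels = [1 if days > cutoff else 0 for days in los_days]
    return cutoff, labels

# Hypothetical LOS values in days, for illustration only.
los = [1, 1, 2, 2, 2, 3, 3, 4, 7]
cutoff, labels = prolonged_los_labels(los)
print(cutoff, labels)
```

Because the cut-off is derived from the cohort itself, applying the same labeling to another dataset would generally yield a different threshold.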

Model development and performance analysis

The study cohort was split into training and testing datasets at a ratio of 8:2 using the stratified split technique [26]. Continuous variables were standardized and no imputation was performed for unknown feature values. ML models included in the study were: artificial neural network (ANN), random forest (RF), histogram-based gradient boosting (HGB), and k-nearest neighbors (KNN). These models were selected based on their previous performance on similar classification tasks [18, 26]. We applied recursive feature elimination using a rudimentary RF model (constructed by passing default values of the hyperparameters) to streamline the feature list while maintaining the model’s discrimination capacity, which was indicated by the area under the receiver operating characteristic curve (AUC). All patient variables were first fit to the model and ranked based on their contribution to the prediction, after which the least important variable was removed. The procedure was repeated until only one variable remained. The model’s performance in each iteration was recorded to determine the optimal number of patient variables. We found that the model’s AUC value was steady at the beginning of elimination and gradually dropped once the number of remaining features fell below 30. Therefore, the 30 most important patient variables were retained for the subsequent model development. Important hyperparameters of each ML model were tuned using a coarse-grained grid-search method. In brief, the value of each hyperparameter was allowed to vary within a predefined range based on our previous studies on similar topics [23, 27]. A “grid” comprised dots that represented all possible combinations of the hyperparameter set. A “coarse-grained” method tested a subset of the dots and identified the combination that produced the best prediction accuracy of the model.
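The recursive elimination loop described above can be sketched as follows. This is an illustrative skeleton, not the study's code: `rank_importance` and `score_model` are hypothetical callbacks standing in for fitting the default RF model and computing its AUC, and the toy weights are invented.

```python
def recursive_feature_elimination(features, rank_importance, score_model):
    """Drop the least important feature one at a time, logging the score each round."""
    remaining = list(features)
    history = []  # one (number of features, score) pair per iteration
    while remaining:
        history.append((len(remaining), score_model(remaining)))
        if len(remaining) == 1:
            break  # stop once only one variable remains
        importances = rank_importance(remaining)  # dict: feature -> importance
        weakest = min(remaining, key=lambda name: importances[name])
        remaining.remove(weakest)
    return history

def smallest_stable_set(history, tolerance=0):
    """Smallest feature count whose score stays within `tolerance` of the best score."""
    best = max(score for _, score in history)
    return min(n for n, score in history if score >= best - tolerance)

# Toy stand-ins (hypothetical): integer "importance" weights; the score is their sum.
weights = {"hematocrit": 4, "operation_time": 3, "age": 2, "noise_a": 0, "noise_b": 0}
history = recursive_feature_elimination(
    list(weights),
    rank_importance=lambda feats: {f: weights[f] for f in feats},
    score_model=lambda feats: sum(weights[f] for f in feats),
)
optimal = smallest_stable_set(history)
print(history, optimal)
```

In the toy run, the score is flat while uninformative features are removed and drops once informative ones go, mirroring the plateau-then-decline pattern reported for the AUC.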
The hyperparameters and their corresponding ranges for each ML model are as follows: ANN: learning rate: 0.0001, 0.001 … 0.1, the number of hidden layers: 3 to 5, the number of neurons: 50 to 100 for each layer, and maximum epochs: 30; RF: the number of trees: 5 to 100 at the interval of 5, and minimal sample leaf: 2 to 50 at the interval of 2; HGB: learning rate: 0.0001, 0.001 … 0.1, the maximum number of iterations: 60 to 160 at the interval of 20, and leaves: 21 to 46 at the interval of 5; KNN: the number of neighbors: 50 to 450 at the interval of 50, and distance metrics: weights: uniform or distance, and p: 1 or 2. Fivefold cross-validation with five repetitions was applied during model development. The training dataset was divided into five subsets, with each subset serving as a validation set once while the remaining four were used for training. This process was repeated five times with different data splits. The repetition of cross-validation was implemented following the final feature selection to mitigate the risk of overfitting and reduce variances in the models’ performance metrics. After the training was completed, the models were applied to the testing dataset. The average computing time to develop the ML models ranged from 72 to 434 s on a computer running Microsoft Windows 10 Pro (Microsoft Corp., Redmond, Washington, USA), equipped with an Intel i7-13700F CPU (Intel Corp., Santa Clara, California, USA), an NVIDIA GeForce RTX 3060 GPU (NVIDIA Corp., Santa Clara, California, USA), and 32 GB RAM.
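To make the grid and the repeated cross-validation concrete, the sketch below enumerates the RF grid from the ranges stated above and generates fivefold splits repeated five times. The key names, the random choice of 50 grid points, and the seeds are assumptions for illustration; the text does not state how the tested subset was selected, and this simple splitter is not stratified.

```python
import itertools
import random

# RF ranges as stated in the text; the key names are illustrative labels.
rf_grid = {
    "n_trees": list(range(5, 101, 5)),          # 5 to 100 at intervals of 5
    "min_samples_leaf": list(range(2, 51, 2)),  # 2 to 50 at intervals of 2
}

def full_grid(grid):
    """Every combination of hyperparameter values (the 'dots' of the grid)."""
    names = list(grid)
    return [dict(zip(names, values)) for values in itertools.product(*grid.values())]

def coarse_subset(combos, size, seed=0):
    """Assumption: the coarse search tests a random subset of the grid points."""
    return random.Random(seed).sample(combos, size)

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (train, validation) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a different partition on every repetition
        folds = [order[i::n_splits] for i in range(n_splits)]
        for i, validation in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, validation

combos = full_grid(rf_grid)
subset = coarse_subset(combos, size=50)
splits = list(repeated_kfold_indices(n_samples=20))
print(len(combos), len(subset), len(splits))
```

With these ranges the full RF grid holds 500 combinations, and each candidate evaluated under repeated cross-validation is fitted 25 times, which is why testing only a subset of grid points keeps the computing time manageable.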

The model’s performance was assessed using several metrics. The first metric was AUC, which measures the model’s discrimination. An AUC value greater than 0.80 indicates that a model has excellent discrimination [22, 23]. The second metric was the calibration plot, which graphically represents the agreement between the actual outcomes and the model-predicted probability. A well-calibrated model has a slope of 1 and an intercept of 0. The third metric was the Brier score, which measures the mean squared difference between the predicted probabilities and the actual outcomes of an event. A Brier score approaching 0 indicates that a model has few prediction errors [28]. Lastly, decision curve analysis was used to evaluate the benefit of using the model compared to treating all or none of the patients across a range of threshold probabilities [29]. The model’s interpretability was explained globally and locally. The plot of feature importance identified the patient factors with the greatest influence on the model prediction, while a local explanation was provided for a representative patient to demonstrate the weight of each variable on the final prediction of the machine learning model. Code for ML modeling and computing performance metrics is accessible at https://github.com/tlwchen/ML-models-for-event-prediction.
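Two of these metrics reduce to short formulas, sketched below with invented outcomes and probabilities. The Brier score is the mean of (p - y)^2, and net benefit at threshold probability t is TP/N - (FP/N) * t / (1 - t), the standard decision-curve definition. The function names are illustrative and this is not the code from the repository cited above.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on every patient whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Baseline strategy that treats every patient as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical outcomes (1 = prolonged LOS) and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]

bs = brier_score(y_true, y_prob)
nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
nb_none = 0.0  # treating no one yields zero net benefit by definition
print(bs, nb_model, nb_all, nb_none)
```

A decision curve repeats the net-benefit calculation over a range of thresholds; the model shows clinical utility wherever its curve lies above both baselines.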

We anticipated that there may be differences in sex ratio between the normal LOS and prolonged LOS groups. As several measures of the patient characteristics, such as the hematocrit level, can differ between females and males, skewness in data distribution might bias the model performance. We therefore stratified the study cohort by sex and performed secondary modeling for each subgroup using the ML model that demonstrated the best predictive metrics for prolonged LOS. The model performance was then compared between the subgroups to test the conjecture of sex-specific influences on prediction accuracy.

Data analysis

Baseline patient characteristics between the normal LOS and prolonged LOS groups were compared. Continuous variables were analyzed using either the independent-samples Student’s t-test or the Mann-Whitney U test, depending on whether the assumptions of parametric tests were violated. The Chi-square test was used to examine nominal variables. Cohen’s d, the rank-biserial correlation coefficient, and Cramér’s V were calculated as effect sizes for the corresponding statistical tests. Effect sizes were interpreted using the Cohen convention of negligible (< 0.20), small (0.20–0.49), medium (0.50–0.79), and large (≥ 0.80) values [30]. Statistical analyses were carried out using Anaconda (version 2.5.4, Anaconda Inc., Austin, TX, USA), Python (version 3.11.4, Python Software Foundation, Wilmington, DE, USA), and SPSS (version 18.0, IBM Corp., Armonk, NY, USA). P < 0.05 was considered statistically significant.
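As an illustration of two of the effect sizes named above, the sketch below computes Cohen's d with a pooled standard deviation and Cramér's V from a contingency table without continuity correction, then applies the interpretation bands from the text. The function names and toy numbers are assumptions for illustration; the exact settings used in the study are not stated.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = (
        (n1 - 1) * statistics.variance(group_a)
        + (n2 - 1) * statistics.variance(group_b)
    ) / (n1 + n2 - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)

def cramers_v(table):
    """Cramér's V from a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(row_totals), len(col_totals)) - 1)))

def interpret(effect_size):
    """Interpretation bands as listed in the text."""
    magnitude = abs(effect_size)
    if magnitude < 0.20:
        return "negligible"
    if magnitude < 0.50:
        return "small"
    if magnitude < 0.80:
        return "medium"
    return "large"

# Hypothetical data, for illustration only.
d = cohens_d([2, 4, 6], [1, 3, 5])
v_none = cramers_v([[10, 10], [10, 10]])
v_full = cramers_v([[10, 0], [0, 10]])
print(d, interpret(d), v_none, v_full)
```

With large cohorts, very small effect sizes can still reach statistical significance, which is why the text reports both.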

Results

Patient characteristics

A total of 11,749 patients were included in the analysis, of which 26.8% had extended LOS (N = 3153). The percentage of male patients was slightly higher in the normal LOS group than in the prolonged LOS group (45.53% vs. 44.21%, p < 0.001). Statistics showed that patients with prolonged LOS were older (68.55 years vs. 65.92 years, p < 0.001), had a higher ASA score (ASA level 3 or above: 78.64% vs. 57.94%, p < 0.001) and comorbidity rates (hypertension, COPD, diabetes, etc. p < 0.005) compared to those with normal LOS (Table 1). A greater percentage of patients in the prolonged LOS group were smokers (17.69% vs. 13.91%, p < 0.001) and ethnic minorities (12.98% vs. 11.35%, p = 0.07). Patients from the prolonged LOS group also presented suboptimal blood test results (higher leukocyte counts and reduced hematocrit, p < 0.001), had longer total operation time (174.62 min vs. 136.47 min, p < 0.001), and were more likely to receive transfusion before surgery (Table 1). A larger fraction of the revision surgeries were caused by infection (33.11% vs. 9.31%, p < 0.001) in the prolonged LOS group. Despite the statistical significance, the effect sizes of the differences between the two patient groups were generally small across the patient variables (Table 1).

Table 1 Baseline characteristics of patients undergoing revision total hip arthroplasty in the study cohort

Assessment of model performance

All models showed excellent discrimination and calibration performance in the training session. The fivefold cross-validation for all models reported an AUC of 0.83 to 0.88, a calibration slope of 0.84 to 1.32, a calibration intercept of -0.08 to 0.03, and a Brier score of 0.087 to 0.132 (Table 2). A similar level of performance was retained across the models with the testing dataset (Table 3). ANN delivered the best results in predicting prolonged LOS after revision THA (AUC: 0.82, calibration slope: 0.90, calibration intercept: 0.02, Brier score: 0.140, Fig. 1A–B). The decision curve analysis demonstrated that ANN produced higher net benefits than the default strategies that assumed all or no observations were positive (Fig. 1C). ANN was therefore selected for secondary modeling for each sex subgroup. Following a similar modeling procedure, the results showed comparable performance metrics of ANN between sexes (Table 3).

Table 2 Discrimination and calibration performance of the machine learning models for predicting LOS during training
Table 3 Discrimination and calibration performance of the machine learning models for predicting LOS during validation
Fig. 1

Plots of performance metrics for artificial neural network. (A) The receiver operating characteristic curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) of the model prediction at different classification thresholds. With a lower threshold, more items will be classified as positive, which increases both TPR and FPR. The area under the curve (AUC) is an aggregate measure of the model performance across all possible classification thresholds. A good model that effectively distinguishes positive cases from negative cases will achieve a high TPR relative to FPR at any threshold and therefore produce a high AUC value. (B) The calibration plot shows the agreement between the model predictions and observations in different percentiles of the predicted values. A complete match between predictions and observations will generate a diagonal line (slope: 1, intercept: 0). (C) The decision curve analysis compares the net benefit (a trade-off between the benefits and harms of a particular decision) of using a predictive model to those of two baseline strategies: “treating all” and “treating none” at different probability thresholds. A model showing higher net benefit than the baseline strategies possesses clinical utility.

Model interpretation

The plot of feature importance revealed that the patient factors that most contribute to prolonged LOS after revision THA were preoperative blood tests (hematocrit < 36.81%, platelet count > 253.54 thousand/mm^3, and white blood cell count > 7.51 thousand/mm^3), preoperative transfusion, operation time (> 135.26 min), indications for revision THA (infection), and age (> 76 years) (Fig. 2). A local explanation of an individual who stayed 2 days following revision THA featured a male aged 65 years with an ASA level of 3 and normal laboratory test results. The patient required transfusion before surgery and underwent revision THA for 145 min. ANN predicted that his probability of experiencing prolonged LOS was 38.33%.

Fig. 2

Global feature importance plot of the machine learning models for predicting prolonged length of stay after revision total hip arthroplasty

Discussion

The study developed four ML models to predict prolonged LOS using a comprehensive national dataset containing 11,749 revision THA patients. Our findings indicated that all models had strong prediction performance during training and testing sessions. ANN provided the most accurate predictions of patients subject to lengthened hospital stays, as reflected by its AUC, calibration parameters, and Brier score. ANN also showed clinical utility by producing net benefits against varying probability thresholds in the decision curve analysis. Important predictors of prolonged LOS, as indicated by ANN, were preoperative laboratory results, preoperative transfusion, operation time, indications for the revision surgery, and age.

The application of ML models to predict complications after revision arthroplasty is a relatively recent field of research [31]. Our findings were similar to those of a previous study [32] based on a single institution database, which reported an AUC of 0.84–0.87, calibration slope of 0.85–1.12, and calibration intercept of 0.14–0.21 for ML models that predicted LOS after revision total knee arthroplasty. These results of previous studies and the current study indicated that the ML model was able to retain a high level of performance across different arthroplasty types as it consistently required several key variables to predict LOS, such as age, BMI, operation time, ASA score, and comorbidities [33]. In a retrospective study including 1,278 patients undergoing primary THA, Farley et al. [34] found that increasing age, high BMI, and comorbidities contributed to increased LOS. Roger et al. [35] identified older age, high ASA score, comorbidities, and long operation time as the risk factors of prolonged LOS following total joint arthroplasties. In addition to these previously established determinants, our models also highlighted the role of laboratory tests (total leukocyte count, hematocrit, and platelet count) in making accurate predictions, which was consistent with a previous report on using ML models to identify patients predisposed to prolonged LOS after primary THA [26]. Despite the correlation between sex and the level of these blood biomarkers, the result of secondary modeling in our study did not support the influence of sex-specific data skewness on the ML model’s performance. The reliance of the model prediction on an isolated factor appeared to be insensitive to sex. Another explanation may be that the difference in sex ratio between the two LOS groups was small, so class imbalance had little opportunity to bias the model decision during model development.

Our study also found indication for revision THA to be an important determinant of LOS after surgery, with infection being a significant contributor to prolonged LOS. This finding is in concordance with reports by Klemt et al. [32]. Periprosthetic infection is one of the most commonly seen indications for revision TKA and oftentimes results in patient dissatisfaction and poor surgical outcomes [36]. Infection is also associated with increased complexity of the revision surgery and a higher risk of postoperative complications [37,38,39], which entails extended monitoring and care support during recovery [36]. This finding substantiated the clinical utility of the ML models in decision curve analyses as it provided useful information for preoperative patient counseling based on the type of indications for revision THA. It is worth noting that the small effect sizes of between-group differences at baseline should not be used equivalently to interpret the clinical significance of the model’s prediction. The risk of prolonged LOS on the individual level is not solely reliant on the value of isolated patient features. Our results underscore the ML model’s strength in detecting the hidden pattern across various data domains and deriving predictions from the collective effects of multiple pertinent factors. This advantage persists when comparing ML models to conventional logistic regression analysis. ML models excel at capturing complex interactions among variables without assuming linearity between the predictors and outcomes, which is the premise for regression analyses but usually does not hold true in high-dimensional data. Although logistic regression offers better interpretability due to its simplified model structure, it is less likely to outperform ML models in predicting prolonged LOS given the large scale and complexity of the dataset in this study.

As indicated by the ML models in our study, the list of important predictors of prolonged LOS included a combination of both modifiable and unmodifiable patient factors. The components of the laboratory tests are modifiable factors that had a major contribution to the model prediction. Various clinical management and supplement strategies are available to optimize white blood cell counts and hematocrit levels before surgery, thereby mitigating the risk of prolonged LOS [40]. For instance, increased leukocyte count can be addressed by identifying the underlying infection and treating it through targeted antibiotic therapies [41]. Preoperative screening for infections allows timely intervention, which is crucial in preventing postoperative complications [42]. Chronic conditions such as diabetes are also potential causes of abnormal white blood cell counts [43]. Hyperglycemia can impair leukocyte function, increasing the susceptibility to infections and prolonging hospital stays. Effective glycemic control through adjustments in diabetic medication and lifestyle modifications, including dietary changes, can improve immune function and the white blood cell level. Unmodifiable predictors, on the other hand, underpin the applicability of the ML models by consolidating the accuracy and reliability in identifying individuals at risk of extended hospitalization. Incorporating the models into clinical routine has the potential to allow clinicians to better stratify patients based on their risk profiles. This stratification can facilitate preoperative counseling regarding expectations of potential hospital arrangements and, more importantly, inform discharge planning, ensuring that appropriate resources and support are available for patients at risk to reduce costs associated with delayed transitions and patient dissatisfaction.
The positive net benefits produced by ML models in decision curve analysis also support their clinical utility in helping with decision-making.

There are several limitations to consider when interpreting the outcomes of this study. First, the study was based on retrospective data from the ACS-NSQIP database. The accuracy of the model prediction might be affected by selection bias and data misrepresentation during manual information entry. The study cohort only included patients who had revision THA between 2013 and 2020, so the findings may not be applicable to a different time period. Second, we used a dichotomization method to categorize the outcome of the hospital stay. This method facilitates discrete class labeling in ML modeling but has the limitation of reducing the data granularity and discarding the nuanced information of variability in an LOS spectrum. The cut-off method of using the 75th percentile value may not generalize to other clinical settings or patient populations due to possibly different data distribution. This threshold is empirical and does not necessarily reflect the latest clinically meaningful distinctions. Finally, the ML models were created using a limited number of patient characteristics. Other patient factors, such as the surgical approach and ambulation protocol, have been previously reported to influence LOS but were not included in our study as they were not recorded in the ACS-NSQIP database. The actual benefits of the ML models in predicting patient outcomes warrant further investigation.

In conclusion, this study utilized a national-scale patient cohort to develop ML models that accurately predicted prolonged LOS following revision THA. ANN yielded the best performance in outcome discrimination and calibration during both training and testing sessions. All models showed great clinical utility in the decision curve analyses. Important predictors of LOS included preoperative laboratory tests, preoperative transfusion, operation time, indications for the revision surgery, and age. The integration of ML models into clinical workflows may assist in optimizing patient-specific care coordination, discharge planning, and cost containment after revision THA surgery.