Introduction

Up to 60% of hospitalized patients will eventually require surgical intervention, making the operating room (OR) simultaneously a significant source of both revenue and overhead [1, 2]. One factor that contributes significantly to overhead cost is the inefficient use of OR time [3]. OR utilization is a metric often used to gauge the efficiency of OR use, compared against known benchmarks [4]. Inaccurate estimation of case duration can lead to both under- and overutilization of OR time.

The process of surgical scheduling begins by compiling a list of cases and their predicted durations. If cases consistently run longer than anticipated, OR overutilization results in costly overtime pay and staff dissatisfaction. Conversely, if case times are consistently shorter than expected, OR underutilization results in increased staff idle time, which is associated with up to a 60% higher cost [5]. Furthermore, scheduling inefficiencies often have downstream effects on various performance metrics (e.g., length of hospital stay, patient satisfaction), which can have large ramifications for patient outcomes and hospital reimbursement.

One common approach to predicting case duration places the responsibility upon the surgeon, who personally reserves a block of OR time based on surgical approach, patient comorbidities, and clinician expertise. With this method, surgeons overestimate case duration up to 32% of the time and underestimate it up to 42% of the time [6]. Another common approach uses the electronic health record (EHR) to calculate a case duration based on historical data for a given procedure and/or surgeon. When these two approaches are compared, EHR-generated case times have been shown to have modestly higher accuracy [7]. However, commercially available EHRs only generate case durations for the average patient – they do not take patient factors (e.g., age, sex, body mass index, allergies, ASA Physical Status classification and associated comorbidities), procedure-specific considerations (e.g., implant type, use of invasive monitoring, anesthesia type), hospital logistics (staff, equipment, time of day, day of the week), or prior milestones (e.g., case delays, cancellations, turnover times) into account, despite studies showing that these factors can affect total case duration by up to 30% [8, 9].

A third, rarely employed approach combines the first two, pairing clinician input with an EHR-generated predicted case duration (pCD), defined as the predicted time between a patient's entry into the OR and their exit. A more novel approach is to use machine learning with natural language processing to leverage pre-existing data in the EHR and tailor the pCD to each patient. Our study compares the performance of one such novel algorithm to the conventional method of using historical means. We hypothesized that there would be a significant improvement in pCD accuracy. If so, this new approach to predicting case duration may represent an additional means of improving OR efficiency and facilitating case scheduling.

Methods

Feature selection

Formation of the algorithm began with feature selection. To do so, data at the institutional level were pushed through a series of extraction, transformation, and loading (ETL) processes in order to correlate and optimize the data for consumption by the machine-learning engine. For each case, the original dataset entering this ETL process included patient information, the providers involved in the case, facility details, the procedure being performed, and prior events in the surgical suite. Table 1 summarizes some of the information processed in this step. All of the data points input into the model came from existing documentation about surgical cases in an EHR system. Given that these data can be in different formats across different EHR systems (or even different versions of the same EHR system), the Leap Rail platform is designed to extract this information from a variety of input formats and systems. The raw data are then staged, transformed into a tab-separated values file, and fed into the machine-learning model. The output is a simple numeric prediction of surgical case length, which can be published in a variety of formats depending on the needs of the consuming system.

Table 1 Examples of OR case-related factors (patient, providers, facility/room, procedure, prior events) included in the machine-learning algorithm
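The ETL staging described above can be sketched in a few lines. The column names, raw export format, and staging path below are hypothetical illustrations (Leap Rail's actual pipeline is proprietary); the sketch shows only the general flow from an EHR export to a tab-separated training file with a derived duration label:

```python
import pandas as pd
from io import StringIO

# Hypothetical raw EHR export; real inputs vary by EHR system and version
raw = StringIO(
    "case_id,patient_age,asa_class,procedure,surgeon_id,in_room,out_of_room\n"
    "1001,64,III,Total knee arthroplasty,S17,2018-01-05 08:02,2018-01-05 10:14\n"
    "1002,41,II,Laparoscopic cholecystectomy,S03,2018-01-05 08:10,2018-01-05 09:25\n"
)

# Extract: parse the export, converting timestamp columns
cases = pd.read_csv(raw, parse_dates=["in_room", "out_of_room"])

# Transform: derive the training label (actual case duration, in minutes)
cases["duration_min"] = (
    (cases["out_of_room"] - cases["in_room"]).dt.total_seconds() / 60
)

# Load: stage as tab-separated values for consumption by the learning engine
cases.drop(columns=["in_room", "out_of_room"]).to_csv(
    "staged_cases.tsv", sep="\t", index=False
)
```

The key point is that the label (actual duration) and all candidate features come from documentation that already exists for every case, so no additional data entry is required.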

Appropriate numeric representations were assigned to non-numeric informative attributes, such as free-text surgeon comments in the EHR. This information was analyzed in order to build a corresponding set of relevant, individually measurable properties, or features [10]. Choosing informative, discriminating, and independent features is a crucial step in building an effective algorithm. This is an iterative process, and the resulting features can vary with the dataset used for training. For the machine-learning algorithm used in this study, over 1500 features were identified and subsequently fed into various machine-learning algorithms such as gradient-boosted tree regression, decision trees, and random forests [11,12,13].
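As a toy illustration of this encoding step (the engine's actual scheme is not specified here), free-text comments can be mapped to numeric feature vectors with a simple bag-of-words approach; the example comments are invented:

```python
from collections import Counter

# Hypothetical free-text surgeon comments from the EHR
comments = [
    "possible adhesions, anticipate longer case",
    "straightforward case, standard equipment",
]

# Build a shared vocabulary of tokens across all comments
vocab = sorted({tok.strip(",.").lower() for c in comments for tok in c.split()})

def to_feature_vector(text):
    """Map a comment to token counts over the shared vocabulary."""
    counts = Counter(tok.strip(",.").lower() for tok in text.split())
    return [counts.get(term, 0) for term in vocab]

# Each comment becomes one numeric vector usable by a tree-based learner
vectors = [to_feature_vector(c) for c in comments]
```

In practice, this kind of representation would be combined with the structured patient, provider, facility, and scheduling attributes to form the full feature set.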

Model development and training

The proprietary Leap Rail® engine uses a combination of supervised learning algorithms. A supervised learning algorithm makes predictions based on a set of examples [14], each of which is labeled with the value of interest. For this study, the examples given to the supervised learning algorithms were historical surgical cases performed at the institution (n ≈ 15,000), and the value of interest was the actual case duration.

A supervised learning algorithm looks for patterns that relate the features to those labels, and it can draw on any of the features provided in the training data. Each algorithm looks for different types of patterns. Once an algorithm finds the best patterns it can, it uses those patterns to make predictions for future cases.

In order to select the best algorithm for the cases performed at the institution, historical case data were split into training and test sets. The Leap Rail engine was exposed to the training data in order to create multiple models using different algorithms and feature sets, and their performance was subsequently measured against unseen surgical cases using the test set. Model performance was objectively compared, and the most accurate model was selected for use at the institution.
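The selection procedure can be illustrated with a minimal sketch: candidate predictors are fit on a training split and scored on a held-out test split by mean absolute error (MAE). The procedure names, base durations, and noise model below are invented for illustration, and the two candidates (a global historical mean versus a per-procedure historical mean) are deliberately simple stand-ins for the actual models compared:

```python
import random
from statistics import mean

random.seed(0)

# Synthetic stand-in for historical cases: (procedure, actual duration in minutes)
procs = ["chole", "tka", "hernia"]
base = {"chole": 75, "tka": 130, "hernia": 55}
cases = [(p, base[p] + random.gauss(0, 15)) for p in random.choices(procs, k=500)]

# Split historical data into training and test sets, as in the study
train, test = cases[:400], cases[400:]

# Candidate 1: always predict the global historical mean
global_mean = mean(d for _, d in train)

# Candidate 2: predict the per-procedure historical mean
per_proc = {p: mean(d for q, d in train if q == p) for p in procs}

def mae(predict):
    """Mean absolute error of a predictor on the held-out test cases."""
    return mean(abs(predict(p) - d) for p, d in test)

mae_global = mae(lambda p: global_mean)
mae_per_proc = mae(lambda p: per_proc[p])
# The candidate with the lower held-out error would be selected for deployment
```

Measuring each candidate only on cases it has never seen is what makes the comparison an honest estimate of future performance.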

The winning model was deployed and used to predict the duration of unseen future surgical cases. In this study, we evaluated the accuracy of the pCD generated by the Leap Rail engine on these unseen cases against the EHR pCD predictions.

Study data/variables collected

All data were received from Leap Rail, Inc. (San Francisco, CA, USA). All potential patient or clinician identifiers were removed prior to data transmission and analysis. For this study, we examined all operative cases performed at NorthBay Healthcare, a Level II trauma center located in Solano County, California, from January to March 2018. Raw data were collected and subsequently provided by Leap Rail, Inc., the developer of the machine-learning algorithm used in this study. Variables collected from each case included primary procedure, subspecialty, EHR-predicted duration, and actual case duration. EHR-predicted case times were generated, using historical means, by the Cerner system (North Kansas City, MO, USA) currently in use at the study facility. Actual case duration was defined as the time from patient in-room to patient out-of-room. Each procedure name was determined by its corresponding ICD-10 code, while the primary specialty for each case was determined from the booking surgeon's credentials. Performing surgeons were aware of the Leap Rail pCD as well as the pCD generated by the existing EHR (Cerner).

Surgical cases were excluded from the study if the Leap Rail engine was unable to generate a predicted case time due to insufficient training data. There were no further exclusion criteria for this study. Microsoft Excel 2016 (Redmond, WA, USA) was used for table and figure production.

Statistical methods

The primary outcome was the absolute difference between the actual duration and the predicted value (i.e., the prediction error). It was summarized using the median and interquartile range (Q1–Q3) and compared between the Leap Rail and EHR groups using the Wilcoxon rank-sum test. Additionally, the prediction error for each subspecialty was compared using the same test, and unadjusted p values were reported. With a prediction error ≤ 15 min defined as acceptable, a chi-square test was performed to compare the number of acceptable predictions between the two groups. All analyses were performed using R software version 3.4.1. Statistical tests were two-sided with α = 0.05.
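A minimal sketch of these two comparisons on synthetic error data follows. The study itself used R 3.4.1; the SciPy calls and simulated gamma-distributed error distributions below are illustrative assumptions, not the study's data:

```python
import numpy as np
from scipy.stats import ranksums, chi2_contingency

rng = np.random.default_rng(42)

# Synthetic absolute prediction errors (minutes); not the study's data
leap_rail_err = rng.gamma(shape=2.0, scale=12.0, size=300)  # smaller errors
ehr_err = rng.gamma(shape=2.0, scale=16.0, size=300)        # larger errors

# Primary outcome: compare prediction error with the Wilcoxon rank-sum test
stat, p_ranksum = ranksums(leap_rail_err, ehr_err)

# Secondary outcome: chi-square test on counts of "acceptable" (≤ 15 min)
# versus unacceptable predictions in each group
table = [
    [np.sum(leap_rail_err <= 15), np.sum(leap_rail_err > 15)],
    [np.sum(ehr_err <= 15), np.sum(ehr_err > 15)],
]
chi2, p_chi2, dof, _ = chi2_contingency(table)
```

The rank-sum test is a sensible choice here because case-duration errors are right-skewed, so medians and rank-based comparisons are more robust than t-tests on means.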

Results

A total of 1059 surgical cases were performed at NorthBay Healthcare during the three-month study period. Leap Rail was able to generate a predicted time for 990 of these cases, representing approximately 93.5% of total surgical volume. No further cases were excluded. The most heavily represented subspecialties were General Surgery (n = 240, 24.2%), Gastroenterology (n = 207, 20.9%), and Orthopedics (n = 166, 16.8%). The least represented were Pulmonary (n = 1, 0.1%), Oral/Maxillofacial (n = 2, 0.2%), and Interventional Radiology (n = 4, 0.4%).

For all cases, the median absolute difference between actual and predicted case times was 20 min (Q1–Q3, 9–40) for Leap Rail and 27 min (Q1–Q3, 13–50) for the EHR (Cerner) (Table 2), demonstrating a 7-min improvement for Leap Rail predictions (p < 0.0001). A density plot (Fig. 1) comparing the frequency of pCD error by group further illustrates this improvement: the curve for Leap Rail predictions is shifted to the left, indicating both an increased likelihood of smaller errors and a decreased likelihood of larger errors compared to the EHR. Figures 2 and 3 graph these data as scatterplots, with the straight line representing a theoretically perfect relationship between predicted and actual case durations. Here, Leap Rail had a higher Pearson correlation than the EHR, again indicating increased accuracy. These figures also show the tendency of EHR pCDs to fall into the same time groups regardless of actual duration, indicating a degree of inflexibility in the generated case times.

Table 2 Median value of absolute difference, with interquartile ranges, between predicted values and actual values by groups
Fig. 1 Density plot of the absolute difference between actual and predicted values, by group

Fig. 2 Scatterplot of actual duration versus Leap Rail prediction, with the Pearson correlation and the correlation = 1 line added

Fig. 3 Scatterplot of actual duration versus EHR prediction, with the Pearson correlation and the correlation = 1 line added

When broken down by subspecialty, Leap Rail was more accurate for 14 of the 16 subspecialties; however, of those 14, only the findings for Gastroenterology, General Surgery, Orthopedics, and Urology were statistically significant (Table 2). Fig. 4 shows box plots of the prediction error for these four specialties as well as for all cases. Here, the median absolute differences are consistently higher in the EHR group than in the Leap Rail group. However, both groups still had a substantial number of outliers, indicating intraoperative variability that was difficult to account for regardless of prediction modality.

Fig. 4 Box plot of the absolute differences, by subspecialty and for all cases

With the threshold for a clinically significant prediction error for a single case defined as 15 min, 41.1% of Leap Rail predictions and 31.2% of EHR predictions were accurate (p < 0.0001) (Table 3). Looking at case duration in aggregate over the 3-month period, the 990 observed cases amounted to 110,130 min of operating room time. The EHR-predicted total surgery time was 79,435 min, compared to the Leap Rail-predicted total of 98,554 min. Overall, this resulted in a 70% reduction in scheduling inaccuracy for the operating suite (Table 4).

Table 3 A 2-by-2 table of prediction counts, with prediction error ≤ 15 min defined as acceptable
Table 4 Cumulative differences between predicted vs actual case times (in minutes)
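The cumulative figures above can be checked with simple arithmetic on the reported totals (both prediction methods under-predicted total OR time at this site):

```python
# Reported totals (minutes) over the 3-month study period
actual_total = 110_130
ehr_predicted_total = 79_435
leap_rail_predicted_total = 98_554

# Cumulative scheduling shortfall: how far each method's total fell
# below the actual OR time used
ehr_shortfall = actual_total - ehr_predicted_total              # 30,695 min
leap_rail_shortfall = actual_total - leap_rail_predicted_total  # 11,576 min

# Leap Rail narrowed the cumulative gap by roughly 19,000 min
improvement = ehr_shortfall - leap_rail_shortfall
```

Note that these aggregate totals allow over- and under-predictions on individual cases to offset one another; Table 4 should be read alongside the per-case error distributions in Table 2.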

Discussion

In this study, we compared the accuracy of Leap Rail surgical case time prediction to that of the conventional EHR approach. When examining all cases during the study period, our results showed a statistically significant improvement of approximately 7 min per case. When broken down by subspecialty, the most prominent improvement was seen in cardiology (although this effect was not significant), followed by orthopedics and urology, which had up to a 15-min improvement in accuracy. The conventional EHR outperformed Leap Rail for two subspecialties (neurology and maxillofacial), but these differences were not statistically significant.

A distinction between statistical and clinical significance needs to be addressed; one may reasonably be skeptical of the clinical significance of a 7-min improvement in accuracy per case. However, even with a threshold of 15 min defined as a clinically significant error (a more typical unit of time in OR scheduling), Leap Rail still represented a 10% improvement in accuracy over conventional means. Furthermore, as some ORs can run 6–7 cases per day, the cumulative effect may ultimately represent a significant financial advantage if it allows an additional case to be scheduled [15]. This is best reflected by the compounding effect over all OR cases in the 3-month study period: Leap Rail represented a reduction of approximately 19,000 min, or 70%, in overall scheduling inaccuracy.

Previous studies have estimated the cost of OR time to the hospital at between $22 and $133 per minute [16]. More appropriate utilization of the ORs would therefore translate into a significant reduction in costs over the 3 months of this pilot study. This estimate represents direct savings in OR staffing costs, and it likely underestimates the cumulative downstream effect throughout the hospital system, which includes fewer delays and cancellations and correspondingly shorter hospital stays. Therefore, despite a seemingly insignificant time savings per case, the cumulative effect could certainly lead to a significant increase in revenue.

While this study focuses on total OR case duration, it is notable that Leap Rail is able to sub-categorize each case into different surgical, anesthesia, and turnover times. This ability to account for variance within each sub-segment is one of Leap Rail’s biggest advantages over current scheduling modalities. However, we did not directly assess Leap Rail’s performance in this capacity, as the EHR did not generate discrete values for each of these sub-segments. Furthermore, it is ultimately total case duration that is the main determinant of surgical scheduling. Nonetheless, we expect this capability to be reflected in this study via the improvements seen with Leap Rail’s pCD accuracy.

With this in mind, we believe that future studies are warranted to further validate our results. In particular, it would be reasonable to apply the Leap Rail algorithm more broadly among several institutions, and for a longer period of time. The benefits of this would be twofold – one could expect to see more generalizability as Leap Rail is exposed to a wider array of cases, and one could also expect to see more accurate predictions as additional training data are incorporated into the machine learning process. Furthermore, future studies could further investigate the accuracy of Leap Rail predictions for anesthesia and turnover time. By doing so, one may be able to determine additional avenues in which OR inefficiency can be minimized.

There have been several prior studies investigating potential means of optimizing surgical scheduling. Many of these studies have established the use of historical means as a superior method of predicting surgical case duration [7, 8]; other studies have investigated using mathematical modeling to improve case sequencing [17, 18] or to predict case durations when historical data are sparse [19]. However, none of these studies utilized a similar multimodal approach to predicting surgical case durations. Furthermore, none of these models were built using a machine-learning algorithm that would be expected to improve with the incorporation of additional training data.

There are some limitations to our methodology. A potential weakness of the Leap Rail engine is that it relies on a training set in order to generate a predicted case time; it may therefore be unable to do so for procedures that are rare or difficult to classify (e.g., multi-specialty surgeries or combined procedures). Nonetheless, these types of procedures are in the minority within any hospital system and are unlikely to represent a significant proportion of OR utilization. Furthermore, this weakness applies equally to conventional EHR case time prediction, which also relies on historical case data. This is best reflected in Leap Rail's ability to generate predictions for 93.5% of cases, demonstrating that it is applicable to the large majority of surgical volume.

Another weakness of the study design was the relatively small sample size. While there was a statistically significant improvement in many subspecialties, there were others where the benefit of using Leap Rail was less well validated. This was particularly magnified in specialties such as Pulmonary, Oral/Maxillofacial, and Interventional Radiology, which had very few cases available for comparison. However, this study was not intended to be an exhaustive comparison between EHR- and Leap Rail-generated surgical case times; rather, it was designed as a proof of concept to demonstrate the potential value of a more multimodal approach to case time prediction.

Machine-learning models are by definition biased towards their training data, which for the model in this study included information about patients, procedures, healthcare providers, and the facility itself. The model is designed to make high-accuracy predictions for new patients at that facility. Leap Rail regularly retrains its models at each facility as new case data accumulate, so that updated models can account for new team members or even changes in the physical topology of the facility or its clinical workflows. Moreover, the model reviewed in this study was custom-made for a single organization using training data about its patients, procedures, physicians, nurses, and facility; the exact same model is therefore not meant to be, and would not perform well if, simply applied to a different organization. However, the same methodology and know-how can be applied to training data from other organizations with the expectation of achieving similar results.

In summary, our study compared the accuracy of pCDs generated by an EHR with that of a novel machine-learning algorithm. Our results indicated a statistically significant median improvement of 7 min per case, which represented an aggregate 70% reduction in scheduling inaccuracy over the three-month study period. In trialing this approach, we found that the Leap Rail engine represents a significant improvement in reducing prediction error for surgical case duration. Application of such an algorithm could potentially lead to significant savings for hospitals.