1 Introduction

The growth of personalized data from Electronic Health Records (EHR) and biomarkers has motivated us to investigate how to harness this individualization for problems in healthcare where the modeling of uncertainty is also critically required. We present a framework that uses quantile regression forests (QRF) to generate individualized distributions that can be integrated into three optimization paradigms. We demonstrate the effectiveness of our individualized optimization approach in terms of basic theory and practice. Specifically, we focus on operating room scheduling because it is exactly the type of problem that we expect can benefit greatly from our framework.

The operating room (OR) is well known to be the most significant source of both revenue and cost for hospitals [46]. As a major driver of hospital operations and their efficiency, ORs are often affected by significant schedule volatility as they swing from low utilization to high levels of overtime [15, 37, 40]. Scheduling surgeries remains a challenge because surgical durations suffer from a high degree of variability due to patient-, procedure-, and surgeon-related factors [27]. Studies on surgery scheduling, accordingly, often entail stochastic durations characterized by specified distributions as probabilistic estimators. These studies commonly defer to distributions at a population level, where each patient is treated equally as coming from a single “generating” pool or a limited number of categories/types [14, 15, 35, 41]. The reason is that it can be difficult to model the distribution of case times for every procedure type because there are a large number of distinct procedures, and many are known to have few observed instances [17, 36, 47]. Other studies have proposed Bayesian methods to obtain estimators even when data is limited [16, 18, 34]. These approaches still do not include, among others, patient-specific factors (e.g., comorbidities, body mass index). As a result, to the best of our knowledge, highly personalized case time distribution estimates have not been used for optimization in this domain.

Recently, decision-making problems like portfolio optimization [1, 21], assortment planning [26], and medical dosing [4] have introduced the use of individualized information by leveraging machine learning (ML) methods. Following this idea, we introduce individualized estimates via ML for use in optimization under uncertainty. Unlike the works above, we require the ML model to give distributional predictions to quantify the uncertainty individual to each case. Our proposed use of QRF accomplishes this because the method naturally outputs a prediction of the entire distribution. The benefit of ML methods like QRF is that they allow us to include patient, procedure, surgeon, and hospital factors for prediction, even for unobserved feature values, and in a nonlinear manner beyond classical statistical tools [3]. In this vein, we mention that, in the robust optimization paradigm, the closest work to ours is [50], which suggested several ML approaches to create uncertainty sets. However, in contrast to [50], our chosen ML approach of QRF accommodates not only robust optimization but also other optimization modeling methodologies.

We show that the common paradigms of optimization under uncertainty, specifically sample average approximation (SAA), robust optimization (RO), and distributionally robust optimization (DRO), are all amenable to incorporating QRF. These optimization approaches depend on a decision-maker’s risk tolerance, and QRF can be integrated into the respective formulations to enhance efficacy. Among these paradigms, SAA represents a risk-neutral decision process as a stochastic optimization method that approximates an expectation objective by sampling [28, 45]. RO, which requires only a support set for the random variables, is risk-averse as it focuses on the worst-case outcomes [5, 7]. DRO lies somewhere between SAA and RO in that it considers the worst-case performances among partially informed probability distributions [6, 13, 53].

As a demonstration of our approach, we consider the problem of setting surgery start times. We formulate SAA, RO, and DRO scheduling models using QRF to navigate the difficult trade-off that arises from the stochasticity of patient case durations. Namely, surgeries occur in sequence, so if a job’s start time is set too late, then the job before it may finish with too much time left over. Such an event creates idling and can lead the last surgical case in the block to run into overtime. On the other hand, a job scheduled too early likely cannot start on time because it needs to wait for the previous job to finish first. This delay can cause patient dissatisfaction and adverse health outcomes.

Theoretically, we illustrate how the statistical consistency of our formulations (with respect to their target guarantees) follows from the known properties of QRF. In the RO and DRO settings, we also study uncertainty sets naturally deduced from QRF and their tractability for the surgery problem. To complement these results, we conduct empirical experiments in collaboration with Memorial Sloan Kettering Cancer Center (MSKCC) in New York, NY, USA. We test our formulated models using real-patient data and numerically highlight the strengths of our QRF-optimization integrated framework.

The remainder of this paper is organized as follows: Section 2 introduces our setting and notation for surgery scheduling. Section 3 reviews QRF and Section 4 demonstrates how we integrate QRF into the three optimization approaches as well as their corresponding consistency results. Section 5 provides a numerical example to test our proposed methodology. We conclude with Section 6.

2 The surgery scheduling model

We consider a single operating room scheduling problem where the number of planned operations and their sequence for a given day are fixed. A common practice in surgery departments like those in cancer centers is that appointments for elective surgeries are made a few days before the surgery. Accordingly, we assume knowledge of the biometric characteristics of the incoming patients (namely the “features” in the predictive model). Our optimization model uses predicted distributions of the surgery durations, built from available patient features, to determine the best starting time for each surgery to account for the uncertainty of surgical durations. The model’s objective captures the previously mentioned trade-off between the minimization of patient wait times and the occurrence of overtime. We chose to employ a more basic model because its analysis will show that QRF is suitable for use with the three optimization methodologies considered in this paper. This is aligned with other works in the scheduling literature (e.g., [20, 30, 35]) for clearer presentation of a proposed methodology. Moreover, we share a similar justification with the previous literature. Our objective can include idle times without significant deviation of notation if their cost rates are identical to those of the waiting times. Overall, the model setting is general enough to apply to problems in other areas where the adjustment of time allowances for a sequence of jobs is used for improved operational behavior or cost containment [42, 52].

We follow similar notation to [14]. The model’s decision is to select the start time of each surgery for a set of n elective surgeries in a given sequence on a given day for a given operating room. Let T be the scheduled closing time of the day and \(z_{i}\) denote the random duration (respectively, the outcome) of the ith surgery. The decision maker has to set the surgery time allowance of the ith case, \(x_{i}\), such that the first surgery starts at time zero, the second at \(x_{1}\), the third at \(x_{1} + x_{2}\), and so on. That is, assuming that the start time for the first surgery is zero, the start time of any subsequent surgery is scheduled at the sum of the time allowances of all previous surgeries. We also denote by \(Z=(z_{i})_{i=1,\ldots ,n}\) and \(X=(x_{i})_{i=1,\ldots ,n}\) the vectors of surgery durations and surgery time allowances. Based on these assumptions, the waiting time \(w_{i}\), defined as the difference between the actual starting time (which occurs whenever the previous procedure ends) and the scheduled starting time, and the overtime l can be represented as:

$$\begin{aligned} w_{i} = \max \{0,w_{i-1} + z_{i-1} - x_{i-1}\}, \qquad i = 2, \ldots, n\end{aligned}$$
(1)
$$\begin{aligned} l = \max \{0,w_{n} + z_{n} - x_{n}\} \end{aligned}$$
(2)

We assume that the first surgery always starts on time and, hence, \(w_{1} =0.\) Through rescaling, without loss of generality, we assume each unit of waiting time costs one unit for any of the subsequent surgeries. Therefore, the waiting time cost for each patient is precisely how long that patient has to wait from their appointed time until the start of their surgery. Lastly, an overtime cost of \(\phi\) per unit time is incurred if we run later than time T. Given the cost function \(f(X,Z) = \sum ^{n}_{i=2} w_{i} + \phi l\) and the definitions above, a surgery scheduling model, assuming hypothetically that Z is known, is constructed as:

$$\begin{aligned} \min _{x \in \mathcal {X}} \quad&f(X,Z) = \sum ^{n}_{i=2} w_{i} + \phi l \nonumber \\ s.t. \quad&w_{i+1} =\max \{ 0,w_{i} +z_{i} - x_{i}\} \qquad i= 1, ..., n-1\nonumber \\&l =\max \{0, w_{n} +z_{n} - x_{n}\} \end{aligned}$$
(3)

where \(\mathcal {X}=\{x_{i}\ge 0 \ \forall i, \sum ^{n}_{i=1} x_{i} \le T\}\) and \((x_{1},...,x_{n})\) are the surgery time allowance decision variables. By a monotonicity argument on the objective function in terms of \(w_{i}\)’s and l, it is standard to see that Eq. 3 can also be reformulated as

$$\begin{aligned}&\underset{\underset{w,l}{x \in \mathcal {X}}}{\min } \quad f(X,Z) \nonumber \\ s.t. \quad&w_{2} \ge z_{1} - x_{1} \nonumber \\ \quad&w_{i+1} \ge w_{i} +z_{i} - x_{i} &&\qquad i= 2, ..., n-1\nonumber \\&l \ge w_{n} +z_{n} - x_{n}\nonumber \\&w_{i}, l \ge 0 &&\qquad i= 2, ..., n \end{aligned}$$
(4)

where \((w_{2},...,w_{n},l)\) are introduced as auxiliary decision variables. Note that the closing time T is used in the feasible region \(\mathcal X\). For completeness, we show in Appendix 1 that idle times can be considered in this model by adding \(\sum _{i=1}^{n-1} (x_{i} - z_{i}) + w_{n}\) to the objective function when their cost rates are identical to those of the waiting times. When Z is stochastic, we can replace the objective of Eq. 3 with either \(E[f(X,Z)]\) or \(\min \{q:P(f(X,Z)\le q)\ge 1-\delta \}\), where \(E[\cdot ]\) and \(P(\cdot )\) are the expectation and probability taken with respect to Z. The former is an expected value formulation, and the latter is a percentile formulation. Common approaches like SAA and RO provide approximate solutions to these formulations, as we describe in Section 4.
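To make the recursion in Eqs. 1 and 2 concrete, the following minimal Python sketch (with illustrative numbers only, not values from our data) evaluates the waiting times, overtime, and cost \(f(X,Z)\) for a given allowance vector X and realized duration vector Z:

```python
import numpy as np

def schedule_cost(x, z, phi):
    """Waiting times, overtime, and cost f(X, Z) following Eqs. 1 and 2.

    x   : array of surgery time allowances (x_1, ..., x_n)
    z   : array of realized surgery durations (z_1, ..., z_n)
    phi : overtime cost per unit time (waiting cost is normalized to 1)
    """
    n = len(x)
    w = np.zeros(n)                       # w_1 = 0: the first surgery starts on time
    for i in range(1, n):                 # Eq. 1: w_i = max{0, w_{i-1} + z_{i-1} - x_{i-1}}
        w[i] = max(0.0, w[i - 1] + z[i - 1] - x[i - 1])
    l = max(0.0, w[-1] + z[-1] - x[-1])   # Eq. 2: overtime of the last case
    return w, l, w[1:].sum() + phi * l    # f(X, Z) = sum_{i>=2} w_i + phi * l

# Illustrative numbers (minutes), not from the data set:
x = np.array([60.0, 90.0, 120.0])
z = np.array([75.0, 80.0, 150.0])
w, l, cost = schedule_cost(x, z, phi=1.5)
print(w, l, cost)                         # waiting times, overtime, total cost
```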

3 Conditional distributions and quantile regression forests

Under stochasticity, we need the distribution of each \(z_{i}\) to be tailored to the characteristics of each patient to ensure model Eq. 4 is solving a problem that considers the subtle differences across patient cases. Letting \(\Xi\) denote the patient’s feature vector (e.g., gender, age, etc.) in the feature space \(\mathcal B\), we approximate \(F(z|\xi )=F(z|\Xi =\xi )\), the distribution function of the surgery duration given \(\Xi =\xi\).

Studies on machine learning (ML) methods for patient-specific surgical duration estimates have mainly focused on point estimators (i.e., the mean response) (e.g., [3, 44, 48]), and such models are typically not directly capable of predicting the probability distributions needed for optimization models involving uncertainty. Although bootstrap-type sampling can generate such predictions in basic linear models (see Section 3.5 of [23]), we consider more elaborate quantile-based ML methods to be a more direct way of accomplishing this in complicated modeling environments. The simplest quantile method to estimate \(F(z|\xi )\) is linear quantile regression (LQR) (see [29] for details), but it relies on linearity assumptions. Thus, we are motivated to study QRF based on its ability to handle more complex, high-dimensional data settings [38].

QRF extends the widely known ensemble method of random forests (RF) (see [9] for details) and shares a similar procedural design. RF models are built to output mean predictions by bootstrap aggregating (or bagging) decision trees, which is known to achieve stable prediction [19] and possess variance reduction properties [10]. Leveraging this structure, rather than taking the mean of the responses within a leaf as the output of a tree, QRF takes their empirical distribution [33]. We refer to [38] for a more complete overview of QRF.

We close this section by stating that QRF recovers the true conditional distribution as the number of observations in the training data increases under appropriate regularity conditions. That is, the conditional distribution of z constructed from QRF is \(\hat{F}(z|\xi )\), and it is known to converge in probability to \(F(z|\xi )\) (see Theorem 1 of [38] for specific details). The next section will show how this consistency property can be leveraged to provide asymptotic guarantees for various optimization under uncertainty formulations that take into account individualized information.
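As a rough illustration of how such a conditional distribution can be obtained in practice, the sketch below fits an ordinary random forest and computes Meinshausen-style leaf weights by hand; the feature matrix, the duration-generating model, and all parameter choices are synthetic placeholders rather than the ones used later in this paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# A minimal sketch in the spirit of Meinshausen's QRF [38]: fit an ordinary random
# forest, then form the conditional distribution of a new patient's duration as the
# weighted empirical distribution of training durations sharing leaves with that
# patient.  The features and duration model below are synthetic placeholders.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))                                        # hypothetical patient features
y_train = np.exp(4.5 + 0.4 * X_train[:, 0] + 0.2 * rng.normal(size=500))  # durations in minutes

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=10, random_state=0)
forest.fit(X_train, y_train)
train_leaves = forest.apply(X_train)          # (n_train, n_trees) leaf membership

def conditional_quantiles(x_new, probs):
    """Weighted empirical quantiles of y_train given the feature vector x_new."""
    new_leaves = forest.apply(x_new.reshape(1, -1))[0]        # leaf id per tree
    weights = np.zeros(len(y_train))
    for t, leaf in enumerate(new_leaves):
        in_leaf = train_leaves[:, t] == leaf
        weights[in_leaf] += 1.0 / in_leaf.sum()               # Meinshausen-style weights
    weights /= weights.sum()
    order = np.argsort(y_train)
    cdf = np.cumsum(weights[order])                           # estimated F(z | x_new)
    return np.interp(probs, cdf, y_train[order])

print(conditional_quantiles(X_train[0], probs=np.array([0.1, 0.5, 0.9])))
```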

4 Individualized optimization under uncertainty

This section presents our integration of QRF into three optimization formulations that capture the stochasticity of the surgery duration in an individualized fashion. We first consider an expected value formulation and SAA in Section 4.1. Then we move to a percentile formulation and RO in Section 4.2, and finally revisit the expected value formulation and DRO in Section 4.3.

We aim to provide theoretical justification for the validity of our approach by showing asymptotic guarantees for each method, as is the standard for algorithms of this nature. Through our analysis, we also hope to convey our reasoning on how ML methods can be incorporated into existing optimization modeling paradigms to enhance the quality of decisions, in particular, with respect to individualization. Moreover, our theoretical results suggest that the proposed approach can be widely applied to problems beyond just the one we consider in this paper. On the other hand, an admitted limitation of our analysis is that it characterizes only asymptotic behaviors. Therefore, our numerical results in the following section aim to offer empirical support for the practical use of our approach in the OR scheduling setting, where limited sample sizes commonly exist.

4.1 Expected value optimization and sample average approximation

Our first considered formulation is Eq. 3 but with an expected value objective function

$$\begin{aligned} E[f(X,Z)|\xi _{1},\ldots ,\xi _{n}]=\int f(X,Z)\prod _{i=1}^{n}F(dz_{i}|\xi _{i}) , \end{aligned}$$
(5)

which minimizes the average overall scheduling cost conditional on the feature \(\xi _{i}\) of each patient, whose surgery duration is \(z_{i}\). In this objective, we assume the durations of all patients are conditionally independent, and we approximate the distribution \(F(z|\xi _{i})\) by QRF, namely \(\hat{F}(z|\xi _{i})\).

For convenience, we denote by \(\hat{E}[\cdot |\xi _{1},\ldots ,\xi _{n}]\) the conditional expectation under independent \(\hat{F}(z|\xi _{i})\)’s. Exact computation of \(\hat{E}[f(X,Z)|\xi _{1},\ldots ,\xi _{n}]\) in this setting can be demanding. Thus, we use sample average approximation (SAA) to approximate the problem. In particular, we generate a set of scenarios under \(\prod _{i=1}^{n}\hat{F}(z_{i}|\xi _{i})\), denoted \(\mathcal {S} = \{s_{1}, ..., s_{K}\}\). Every scenario \(s \in \mathcal {S}\) is associated with a realization \(Z(s)=(z_{i}(s))\), with \(w_{1}(s) = 0\),

$$\begin{aligned} w_{i}(s) = \max \{0,w_{i-1}(s) + z_{i-1}(s) - x_{i-1}\}, \qquad i = 2, \ldots, n \\ l(s) = \max \{0,w_{n}(s) + z_{n}(s) - x_{n}\}. \end{aligned}$$

From Eq. 4 we can approximate the expected value problem via SAA as

$$\begin{aligned} \underset{\underset{w,l}{x \in \mathcal {X}}}{\min } \quad \sum ^{K}_{k=1}\frac{1}{K}\big (\sum ^{n}_{i=2} w_{i}(s_{k}) + \phi l(s_{k})\big ) \end{aligned}$$
(6)
$$\begin{aligned} s.t. \quad w_{i+1}(s_{k}) \ge w_{i}(s_{k}) + z_{i}(s_{k}) - x_{i} \qquad i= 1, \ldots, n-1,\ k= 1, \ldots, K \end{aligned}$$
(7)
$$\begin{aligned} l(s_{k}) \ge w_{n}(s_{k}) +z_{n}(s_{k}) - x_{n} \qquad k= 1, ..., K \end{aligned}$$
(8)
$$\begin{aligned} w_{i}(s_{k}),\ l(s_{k}) \ge 0 \qquad i= 2, \ldots, n,\ k= 1, \ldots, K \end{aligned}$$
(9)

where K is the number of scenarios independently generated in the SAA. Formulation Eq. 6 naturally combines the distributional prediction of QRF into the expected value minimization. As noted previously, although modeling case durations by distribution fitting is standard, utilizing the variety of data to capture the particularities across different case durations is more challenging. In this regard, unlike other approaches that may also consider nonidentical distributions in surgery scheduling, the QRF generates a distribution for every case unique to each patient as it is conditional on all their available health information. Our numerical study in Section 5.2 shows this is highly beneficial to solution quality. Under standard conditions, the SAA problem Eq. 6 has a solution and optimal value that converge to those of Eq. 5, as we formally state next.

Theorem 1

Let H(X) be the objective function Eq. 5 (suppressing \(\xi _{1},\ldots ,\xi _{n}\) for convenience), and \(H^{*}\) be the optimal value when solving Eq. 3 with Eq. 5 as the objective function. Let \(\tilde{X}^{*}\) be an optimal solution to Eq. 6 where the scenarios are drawn from the distribution \(\prod _{i=1}^{n}\hat{F}(z_{i}|\xi _{i})\). Assume that z and \(\xi\) satisfy the conditions in Appendix 1, and moreover that z is bounded a.s. within \(\mathcal A\subset \mathbb R_{+}\). We have \(H(\tilde{X}^{*})\overset{p}{\rightarrow }H^{*}\) as \(K,N\rightarrow \infty\), where N is the observation size in building the QRF and K is the scenario size in SAA.

For proof of Theorem 1 see Appendix 1.
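As an illustration of the SAA step, the following sketch builds and solves the linear program in Eqs. 6-9 from a matrix of sampled scenarios; it assumes the cvxpy modeling package purely for convenience (any LP solver would do), and the scenario-generating distributions shown are synthetic stand-ins for draws from \(\prod _{i=1}^{n}\hat{F}(z_{i}|\xi _{i})\):

```python
import numpy as np
import cvxpy as cp

def solve_saa(samples, phi, T):
    """Solve the SAA formulation of Eqs. 6-9.

    samples : (K, n) array of scenario durations z_i(s_k), e.g. drawn from the
              QRF conditional distributions of the day's n patients
    phi     : overtime cost per unit time
    T       : scheduled closing time, so that sum_i x_i <= T
    """
    K, n = samples.shape
    x = cp.Variable(n, nonneg=True)            # time allowances
    w = cp.Variable((K, n), nonneg=True)       # w[:, i] plays w_{i+1}; w[:, 0] is w_1 = 0
    l = cp.Variable(K, nonneg=True)            # overtime per scenario

    cons = [cp.sum(x) <= T, w[:, 0] == 0]
    for i in range(n - 1):                     # Eq. 7: w_{i+1} >= w_i + z_i - x_i
        cons.append(w[:, i + 1] >= w[:, i] + samples[:, i] - x[i])
    cons.append(l >= w[:, n - 1] + samples[:, n - 1] - x[n - 1])     # Eq. 8
    obj = cp.Minimize(cp.sum(w[:, 1:]) / K + phi * cp.sum(l) / K)    # Eq. 6
    cp.Problem(obj, cons).solve()
    return x.value

# Hypothetical example: 3 patients, 1000 synthetic scenarios.
rng = np.random.default_rng(1)
scenarios = rng.lognormal(mean=[4.2, 4.6, 4.9], sigma=0.3, size=(1000, 3))
print(solve_saa(scenarios, phi=1.5, T=scenarios.sum(axis=1).mean()))
```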

4.2 Robust optimization

Robust optimization (RO) provides an alternative paradigm to account for the uncertainty in the surgery duration. In the RO paradigm, we replace the stochasticity with a so-called uncertainty set or ambiguity set, which is a deterministic set that, intuitively, captures the likely realization of the surgery duration (other interpretations are possible, e.g., [5], page 33 discussion point B).

Formulation Eq. 4 bears resemblance to single-server queues, for which [2] introduced an RO approach to estimate relevant quantities. Following their framework, we introduce \(\underline{\Gamma }_{k,i}\) and \(\overline{\Gamma }_{k,i}\) and assert that the surgery times belong to the uncertainty set

$$\begin{aligned} \tilde{\mathcal {U}} = \left\{ (z_{1}, z_{2}, ..., z_{n}) \biggm | \underline{\Gamma _{k,i}} \le \sum _{j=k}^{i-1} z_{j} \le \overline{\Gamma _{k,i}} \text { ,} \quad \forall k < i \le n \right\} \end{aligned}$$
(10)

The reason why we consider a set on \(\sum _{j=k}^{i-1} z_{j}\), instead of other possible candidates (e.g., merely \(z_{k}\) themselves), is that the objective function, upon rewriting, depends explicitly on these quantities which gives rise to easy bounds. However, this convenient choice can be plausibly replaced by others, which result in convex optimization problems that can still be handled by standard solvers.

RO considers

$$\begin{aligned} \min _{x \in \mathcal {X}} \max _{z \in \tilde{\mathcal {U}}} \left( \sum _{i=2}^{n} w_{i} + \phi l\right) \end{aligned}$$
(11)

where \(w_{i}\) and l satisfy Eqs. 1 and 2. Note that the waiting time of the ith patient can be expressed as

$$\begin{aligned} w_{i} = \max \{w_{i-1} + z_{i-1} - x_{i-1}, 0\} = \max _{1 \le k < i} \left( \sum _{j=k}^{i-1}(z_{j} - x_{j}), 0\right) \end{aligned}$$
(12)

and it is also trivial to see that \(l=w_{n+1}\). Putting these into Eq. 11 gives

$$\begin{aligned}&\min _{x \in \mathcal {X}} \max _{z \in \mathcal {U}} \left( \sum _{i=2}^{n} \max _{1 \le k< i} \left( \sum _{j=k}^{i-1}(z_{j} - x_{j}), 0\right) \right. \nonumber \\&\left. + \phi \max _{1 \le k < n+1} \left( \sum _{j=k}^{n}(z_{j} - x_{j}), 0\right) \right) . \end{aligned}$$
(13)

Switching the order of maximizations, Eq. 13 is upper bounded by

$$\begin{aligned}&\min _{x \in \mathcal {X}} \left( \sum _{i=2}^{n} \max _{1 \le k< i} \max _{z \in \mathcal {U}} \left( \sum _{j=k}^{i-1}(z_{j} - x_{j}), 0\right) \right. \nonumber \\&\left. + \phi \max _{1 \le k < n+1} \max _{z \in \mathcal {U}} \left( \sum _{j=k}^{n}(z_{j} - x_{j}), 0\right) \right) . \end{aligned}$$
(14)

The innermost maximization can be easily seen to be attained at the upper bounds imposed in \(\mathcal {U}\), so that Eq. 14 is equivalent to

$$\begin{aligned}&\min _{x \in \mathcal {X}} \left( \sum _{i=2}^{n} \max _{1 \le k< i} \left( \overline{\Gamma }_{k,i-1} - \sum _{j=k}^{i-1} x_{j}, 0\right) \right. \nonumber \\&\left. + \phi \max _{1 \le k < n+1} \left( \overline{\Gamma }_{k,n} - \sum _{j=k}^{n} x_{j}, 0\right) \right) . \end{aligned}$$
(15)

Now, denoting \(Q_{i} = \max _{1 \le k < i} \left( \overline{\Gamma }_{k,i-1} - \sum _{j=k}^{i-1} x_{j}, 0\right)\), we can reformulate Eq. 15 into the following linear program

$$\begin{aligned} \min _{x \in \mathcal {X}} \quad & \sum _{i=2}^{n} Q_{i} + \phi Q_{n+1}\nonumber \\ s.t. \quad & Q_{i} \ge \overline{\Gamma }_{k,i-1} - \sum _{j=k}^{i-1} x_{j} \qquad i = 2, \ldots ,n+1,\ 1 \le k < i \nonumber \\ & Q_{i} \ge 0 \qquad \qquad \qquad \qquad \ \ i= 2, \ldots , n+1. \end{aligned}$$
(16)

The question remains how to calibrate \(\overline{\Gamma }_{k,i}\)’s. Using the idea of data-driven RO (e.g., [8]), one can set \(\overline{\Gamma }_{k,i}\)’s so that

$$\begin{aligned} \hat{P}\left( \underline{\Gamma }_{k,i} \le \sum _{j=k}^{i-1} z_{j} \le \overline{\Gamma }_{k,i} \text { ,} \forall k < i \le n \Bigg |\xi _{1},\ldots ,\xi _{n}\right) \ge 1-\delta , \end{aligned}$$
(17)

where \(\hat{P}(\cdot |\xi _{1},\ldots ,\xi _{n})\) refers to the probability under \(\prod _{i=1}^{n}\hat{F}(z_{i}|\xi _{i})\). Notice that in fact only the \(\overline{\Gamma }_{k,i}\)’s are used; the \(\underline{\Gamma }_{k,i}\)’s can be dropped by adapting our subsequent analysis to a one-sided bound instead of a two-sided one in a straightforward manner. The guarantee of Eq. 17 can be translated to the optimal value of Eq. 11, and hence Eq. 16 provides an upper bound to

$$\begin{aligned} \min _{x \in \mathcal {X}}\min \{q: \hat{P}\left( f(X,Z) \le q|\xi _{1},\ldots ,\xi _{n}\right) \ge 1-\delta \} \end{aligned}$$

Namely, the optimal \((1-\delta )\)-quantile of \(f(X,Z)\) under \(\prod _{i=1}^{n}\hat{F}(z_{i}|\xi _{i})\). Theorem 2 below details a more elaborate version of this claim, taking into account the Monte Carlo noise that we discuss next.

To find \(\underline{\Gamma }_{k,i}\) and \(\overline{\Gamma }_{k,i}\), we find \(\hat{\Gamma }\) such that

$$\begin{aligned} \hat{P}\left( \left| \frac{\sum _{j=k}^{i-1} z_{j}-\sum _{j=k}^{i-1} \hat{\mu }_{j}}{\sqrt{\sum _{j=k}^{i-1}{\hat{\sigma }_{j}^{2}}}}\right| \le \hat{\Gamma }\text {,}\ \forall k < i \le n \Bigg |\xi _{1},\ldots ,\xi _{n}\right) \ge 1-\delta \end{aligned}$$
(18)

where \(\hat{\mu }_{j}\) and \({\hat{\sigma }_{j}^{2}}\) are the mean and variance from \(\hat{F}(\cdot |\xi _{j})\). The quantity inside the absolute value in Eq. 18 is a centered and normalized version of \(\sum _{j=k}^{i-1}z_{j}\) of the kind that appears in the central limit theorem. The choice of \(\hat{\Gamma }\) in Eq. 18 then implies that one can choose

$$\begin{aligned} \underline{\Gamma }_{k,i}=\sum _{j=k}^{i-1} \hat{\mu }_{j}-\hat{\Gamma }\sqrt{\sum _{j=k}^{i-1}{\hat{\sigma }_{j}^{2}}} \end{aligned}$$

and

$$\begin{aligned} \overline{\Gamma }_{k,i}=\sum _{j=k}^{i-1} \hat{\mu }_{j}+\hat{\Gamma }\sqrt{\sum _{j=k}^{i-1}{\hat{\sigma }_{j}^{2}}} \end{aligned}$$
(19)

Now, to find the \(\hat{\Gamma }\) that satisfies Eq. 18, we can use quantile estimation (e.g., [22]). Simulate, say, K i.i.d. copies of Z, and for each copy calculate

$$\begin{aligned} \max _{k<i\le n}\left| \frac{\sum _{j=k}^{i-1} z_{j}-\sum _{j=k}^{i-1} \hat{\mu }_{j}}{\sqrt{\sum _{j=k}^{i-1}{\hat{\sigma }_{j}^{2}}}}\right| \end{aligned}$$
(20)

Find the \(\lfloor K(1-\delta )\rfloor\)-th order statistic of the K copies of Eq. 20, call it \(\tilde{\Gamma }\). When K is large, \(\tilde{\Gamma }\) will roughly satisfy Eq. 18. Note that, desirably, this calibration approach based on quantile estimation does not require a large n in Eq. 18, which could actually be relatively small in applications.
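A minimal sketch of this calibration and of the resulting linear program Eq. 16 is given below. The indexing is written in 0-based Python slices rather than the 1-based notation of Eqs. 16-20, the order statistic is approximated with a sample quantile, and cvxpy is again assumed only as a convenient LP interface:

```python
import numpy as np
import cvxpy as cp

def calibrate_gamma_bar(samples, mu, sigma2, delta):
    """Calibrate Gamma-tilde via Eqs. 18-20 and return the upper bounds of Eq. 19.

    samples    : (K, n) i.i.d. draws of Z under the product of QRF distributions
    mu, sigma2 : per-patient means and variances under F-hat(.|xi_j)
    Returns gamma_bar[(s, e)], an upper bound on z[s] + ... + z[e-1] (slice s:e).
    """
    K, n = samples.shape
    stats = []
    for s in range(n):                         # statistic (20) for every partial sum
        for e in range(s + 1, n + 1):
            psum = samples[:, s:e].sum(axis=1)
            m, v = mu[s:e].sum(), sigma2[s:e].sum()
            stats.append(np.abs((psum - m) / np.sqrt(v)))
    # approximately the floor(K(1-delta))-th order statistic of the max in (20)
    gamma_tilde = np.quantile(np.max(stats, axis=0), 1 - delta)
    return {(s, e): mu[s:e].sum() + gamma_tilde * np.sqrt(sigma2[s:e].sum())
            for s in range(n) for e in range(s + 1, n + 1)}

def solve_ro(gamma_bar, n, phi, T):
    """Solve the linear program Eq. 16 given the calibrated upper bounds."""
    x = cp.Variable(n, nonneg=True)
    Q = cp.Variable(n, nonneg=True)            # Q[e-1]: bound on w_{e+1} (e < n) or overtime (e = n)
    cons = [cp.sum(x) <= T]
    for e in range(1, n + 1):
        for s in range(e):                     # Q >= bound on partial sum of z minus the matching allowances
            cons.append(Q[e - 1] >= gamma_bar[(s, e)] - cp.sum(x[s:e]))
    obj = cp.Minimize(cp.sum(Q[:-1]) + phi * Q[-1])
    cp.Problem(obj, cons).solve()
    return x.value

# Hypothetical usage with QRF means/variances of 3 patients (minutes):
rng = np.random.default_rng(2)
mu, sigma2 = np.array([70.0, 95.0, 130.0]), np.array([15.0, 20.0, 30.0]) ** 2
draws = rng.normal(mu, np.sqrt(sigma2), size=(5000, 3))
gb = calibrate_gamma_bar(draws, mu, sigma2, delta=0.05)
print(solve_ro(gb, n=3, phi=1.5, T=mu.sum()))
```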

Using the above formulations and procedure gives the following:

Theorem 2

Assume that z and \(\xi\) satisfy the conditions in Appendix 1. Let \(\tilde{H}^{*}\) be the optimal value of formulation Eq. 16, with \(\overline{\Gamma }_{k,i}\) calibrated from Eq. 19 and \(\hat{\Gamma }\) approximated by \(\tilde{\Gamma }\), the \(\lfloor K(1-\delta )\rfloor\)-th order statistic of Eq. 20 among the K samples of Z generated under \(\prod _{i=1}^{n}\hat{F}(z_{i}|\xi _{i})\). Assume that the true conditional distribution \(F(z|\xi )\) is continuous and strictly increasing in z. Let \(H^{*}\) be the optimal value for \(\min _{x \in \mathcal {X}}\min \{q: P\left( f(X,Z)\le q|\xi _{1},\ldots ,\xi _{n}\right) \ge 1-\delta \}\). Then

$$\begin{aligned} \liminf _{N\rightarrow \infty ,K\rightarrow \infty }\tilde{H}^{*}\ge H^{*} \end{aligned}$$

The proof of Theorem 2 is in Appendix 1.

Note that \(\min _{x \in \mathcal {X}}\min \{q: P\left( f(X,Z)\le q|\xi _{1},\ldots ,\xi _{n}\right) \ge 1-\delta \}\) means minimizing the \((1-\delta )\)-quantile of \(f(X,Z)\). Theorem 2 stipulates that the RO formulation can be viewed as a conservative approximation of this optimization. Also, note that when the continuity assumption on \(F(z|\xi )\) is removed, we can modify the above procedure slightly by inflating our obtained \(\tilde{\Gamma }\) by an arbitrarily small constant, i.e., we use \(\tilde{\Gamma }+\epsilon\) for some small \(\epsilon\), and our argument (detailed in the proof) will carry through to give the same guarantee as in Theorem 2.

4.3 Distributionally robust optimization

We next consider distributionally robust optimization (DRO). This approach targets an expected value objective, as in the case of SAA, but under only partial information about the distributions. More specifically, it optimizes the worst-case expected value among all distributions in an uncertainty set or ambiguity set that represents the partial information. In our scheme, we assume the quantiles of each patient’s surgery duration distribution at a list of given probability levels are known. This information can be drawn from the QRF, achieving individualization.

The motivation for using DRO is that its solution can be more robust to some hidden uncertainty. In our circumstance, for instance, imposing enough quantile information means we know each patient’s marginal surgery duration distribution, but we do not assume any dependency structure among the patients. The DRO solution is thus robust against this hidden stochasticity that is not revealed by the individualization, which focuses only on the prediction for each patient.

More concretely, for each patient i, we choose a sequence \(q_{i1}<q_{i2}<\cdots <q_{im}\), and set \(r_{ij}=\hat{F}(q_{ij}|\xi _{i})\), which can be read off from the QRF. Roughly speaking, \(q_{ij}\) is the \(r_{ij}\)-th quantile of the duration distribution given \(\xi _{i}\). For convenience, we assume that \(r_{im}=1\), so that \(q_{im}\) is the upper limit of the support of the data. We consider the uncertainty set

$$\begin{aligned} \hat{\mathcal U}=\{P\in \mathcal P:P(z_{i}\le q_{ij})=r_{ij},\ i=1,..,n, j=1,..,m\} \end{aligned}$$
(21)

where \(\mathcal P\) is the set of all probability distributions supported on \(\mathbb R_{+}^{n}\). Here, the constraints indicate that the marginal \(r_{ij}\)-th quantiles for patient i are known to be \(q_{ij}\) under the QRF. They also include the information that the largest possible duration under \(\hat{F}(\cdot |\xi _{i})\) is estimated to be \(q_{im}\).
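A small sketch of how such a quantile grid could be assembled is given below; it only assumes access to some conditional quantile predictor, such as the hypothetical helper sketched in Section 3, and the probability levels are illustrative:

```python
import numpy as np

def quantile_grid(features, probs, cond_quantiles):
    """Assemble q[i, j] = r_{ij}-th conditional quantile of patient i's duration (Eq. 21).

    features       : (n, p) array with the feature vectors xi_i of the day's n patients
    probs          : increasing probability levels r_{i1}, ..., r_{im} (last level = 1)
    cond_quantiles : callable (xi, probs) -> conditional quantiles under F-hat(.|xi)
    """
    return np.vstack([cond_quantiles(xi, probs) for xi in features])

# Hypothetical usage with the QRF helper sketched in Section 3:
# probs = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
# q = quantile_grid(X_schedule, probs, conditional_quantiles)
```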

We seek to solve

$$\begin{aligned}&\min _{x \in \mathcal {X}} \max _{P \in \hat{\mathcal {U}}} E_{P} \big [f(X,Z)\big ]. \end{aligned}$$
(22)

We approach Eq. 22 using the technique in [35], which first replaces \(f(X,Z)\) with the optimal value of a linear program (LP) given by

$$\begin{aligned} \begin{aligned} \max _{y} \quad&\sum ^{n}_{i=1} (z_{i} - x_{i})y_{i}\\ s.t. \quad&y_{i} - y_{i-1} \ge -1&&\qquad 2\le i \le n \\&y_{n} \le \phi \\&y_{i} \ge 0&&\qquad \forall i=1,...,n. \end{aligned} \end{aligned}$$
(23)

This can be derived from the definition of \(w_{i}\) and l in Eqs. 1 and 2 by considering the dual of the resulting LP, where \(y_{1},\ldots ,y_{n}\) are the dual variables. A slight change of notation to (23) also leads to the dual formulation (see Appendix 1) when considering idle times. Then problem Eq. 22 can be rewritten as

$$\begin{aligned} \min _{x \in \mathcal {X}} \max _{P \in \hat{\mathcal {U}}} E_{P} \big [\max _{y \in \Omega } \sum ^{n}_{i=1} (z_{i} - x_{i})y_{i} \big ], \end{aligned}$$
(24)

where \(\Omega\) refers to the constraint set in Eq. 23, and \(y=(y_{1},\ldots ,y_{n})\). We now focus on the inner maximization problem, \(\max _{P \in \hat{\mathcal {U}}} E_{P} \big [\max _{y \in \Omega } \sum ^{n}_{i=1} (z_{i} - x_{i})y_{i} \big ]\). The following lemma transforms it into a more manageable form:

Lemma 1

The dual representation of the optimization \(\max _{P \in \hat{\mathcal {U}}} E_{P} \big [\max _{y \in \Omega } \sum ^{n}_{i=1} (z_{i} - x_{i})y_{i} \big ]\) is

$$\begin{aligned}&\min _{\rho } \max _{y \in \Omega } \sum ^{n}_{i=1} \max _{j=1,\ldots ,m}\bigg \{(q_{ij} - x_{i})y_{i} - \sum ^{m}_{k=j} \rho _{ik} \bigg \}\nonumber \\&+ \sum ^{n}_{i=1} \sum ^{m-1}_{j=1} r_{ij} \rho _{ij} + \sum ^{n}_{i=1} \rho _{im}, \end{aligned}$$
(25)

where \(\rho _{11},\ldots ,\rho _{nm}\) are the dual variables corresponding to the quantile constraints.

For proof of Lemma 1 see Appendix 1.

By Lemma 1, problem Eq. 22 can be represented by

$$\begin{aligned}&\min _{x \in \mathcal {X}, \rho } \max _{y \in \Omega } \sum ^{n}_{i=1} \max _{j=1,\ldots ,m}\bigg \{(q_{ij} - x_{i})y_{i} - \sum ^{m}_{k=j} \rho _{ik} \bigg \} \nonumber \\&+ \sum ^{n}_{i=1} \sum ^{m-1}_{j=1} r_{ij} \rho _{ij} + \sum ^{n}_{i=1} \rho _{im}. \end{aligned}$$
(26)

We want to transform Eq. 26 to a linear program. Note that \(\max _{y \in \Omega } \sum ^{n}_{i=1} \max _{j=1,\ldots ,m}\bigg \{(q_{ij} - x_{i})y_{i} - \sum ^{m}_{k= j} \rho _{ik} \bigg \}\) is a convex maximization problem over the polyhedron \(\Omega\), and it suffices to consider its extreme points for an optimal solution. This allows us to follow the method discussed in Proposition 2 of [35] to reduce it to solving an LP relaxation of an integer program using the following proposition.

Proposition 1

Problem Eq. 22 can be reformulated as the following linear program

$$\begin{aligned} \min _{x,\rho ,\lambda } \quad & \sum ^{n}_{i=1} \lambda _{i} + \sum ^{n}_{i=1} \sum ^{m-1}_{j=1} r_{ij} \rho _{ij} + \sum ^{n}_{i=1} \rho _{im}\nonumber \\ s.t. \quad & \mu _{io} \ge q_{ij}\pi _{io} - \sum ^{m}_{k= j} \rho _{ik} \qquad 1\le i \le n,\ i \le o \le n+1,\ 1 \le j \le m \nonumber \\ & \sum ^{\min \{o,n\}}_{i=g} \mu _{io} \le \sum ^{\min \{o,n\}}_{i=g} x_{i}\pi _{io} + \lambda _{i} \qquad 1\le g \le n,\ g\le o \le n+1 \nonumber \\ & \sum ^{n}_{i=1} x_{i} \le T\nonumber \\ & x_{i} \ge 0 \qquad 1\le i \le n, \end{aligned}$$
(27)

where \((\lambda _{1},...,\lambda _{n})\) are the dual variables, \(\pi _{io}=y_{i}\) for all \(i \in [g,o]\) and any \(g\le o\). Also, \(\mu _{io}= \max _{j=1,\ldots ,m} \{q_{ij}\pi _{io} - \sum ^{m}_{k = j} \rho _{ik}\}\) for \(o=1,...,n+1\).

For proof of Proposition 1 see Appendix 1.
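The reduction above rests on the LP representation of \(f(X,Z)\) in Eq. 23. As a quick numerical sanity check of that building block, the following sketch evaluates the LP directly (again assuming cvxpy merely as a solver interface) so that its value can be compared against the recursive computation of Section 2 on arbitrary illustrative inputs:

```python
import numpy as np
import cvxpy as cp

def cost_via_lp(x, z, phi):
    """Evaluate f(X, Z) as the optimal value of the dual LP in Eq. 23."""
    n = len(x)
    y = cp.Variable(n, nonneg=True)
    cons = [y[i] - y[i - 1] >= -1 for i in range(1, n)] + [y[n - 1] <= phi]
    prob = cp.Problem(cp.Maximize((z - x) @ y), cons)
    prob.solve()
    return prob.value

# Arbitrary illustrative numbers (minutes); the value should agree with the
# recursive computation of w_i and l in Eqs. 1 and 2 (e.g., the schedule_cost
# sketch in Section 2).
x = np.array([60.0, 90.0, 120.0])
z = np.array([75.0, 80.0, 150.0])
print(cost_via_lp(x, z, phi=1.5))
```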

The following shows that the DRO under the quantile information drawn from the QRF gives an asymptotic bound for the true expected value:

Theorem 3

Assume that z and \(\xi\) satisfy the conditions in Appendix 1. Suppose that the true conditional distribution function \(F(z|\xi )\) is continuous and strictly increasing in z. Let \(H^{*}\) be the optimal value of Eq. 3 with objective being Eq. 5, and \(\hat{H}^{*}\) be the optimal value of Eq. 27. Then for any \(\epsilon >0\), we have

$$\begin{aligned} P\left( H^{*}\le \hat{H}^{*}+\epsilon \right) \rightarrow 1 \end{aligned}$$
(28)

as \(N\rightarrow \infty\), where N is the data size in constructing the QRF.

For proof of Theorem 3 see Appendix 1.

Note that Theorem 3 does not require strong duality of the dual formulation in Eq. 25, as the upper bound in Theorem 3 can be obtained as long as Eq. 25 upper bounds its primal moment problem. However, we do need strong duality of an associated moment problem when the quantiles are set to be the true quantities, and this property is guaranteed by the continuity and strict monotonicity of \(F(z|\xi )\).

Finally, also note that, unlike SAA and RO discussed before, our DRO approach does not require running simulation from the QRF.

5 Numerical study

Table 1 Features (covariates) in the MSKCC dataset grouped into four main categories that were used for training the predictive models

To examine the effectiveness of our approach, a numerical study was conducted using real-patient data from Memorial Sloan Kettering Cancer Center (MSKCC), New York, NY, USA, one of the leading cancer treatment and research institutions in the world. The center has surgical operations across thirteen different services and in a total of forty different operating rooms. Our data contains the recorded surgeries for all services from 2010 to 2016 with 129,742 distinct patients, but for our analysis, we only used those in Urology (URO). Within the data, we considered hospital, patient, operational, and temporal factors as done by [39] to achieve highly accurate case duration predictions. Table 1 details the features corresponding to these factors; note that they include patient-specific information like age, gender, race, BMI, treatment history, and comorbidities to achieve individualization.

The surgical durations used in our analysis are defined as the recorded times measured from wheels-in to wheels-out. While no outliers were removed, each case in our data has time stamps similar to those described by [34] and, likewise, observations with discrepancies between recorded patient and operation times were removed. Our final dataset contained 23,176 observations after also excluding patients with any missing time records. Table 2 summarizes the number of surgeries, unique primary surgeons, and unique current procedural terminology (CPT) codes, along with the number and proportion of inpatient surgeries and the historical durations of surgeries in the final dataset. Table 3 summarizes key demographics of the patients in the data set.

Table 2 Characterization of operational data for URO service in MSKCC
Table 3 Demographics of the patient data from the URO service

5.1 Test setup

The individualized distributions from QRF were evaluated for each optimization model in Section 4 by comparing their performance against equivalent models using alternatively constructed distributions. Three distribution-construction benchmarks were considered for our experiments and analysis. The first was a single distribution fitted to all case durations in the training set. The second benchmark stratified the data and fitted a distribution for each primary surgeon, since durations for the same procedure can vary significantly between surgeons. Linear quantile regression (LQR) was used as the third benchmark so that the QRF can be compared to a similar method. Fitting distributions by CPT codes was not considered as a benchmark because this approach follows closely MSKCC’s planned times, which are a benchmark considered in Section 5.4. MSKCC’s scheduling system estimates these hospital planned times (H-PT) by using the median durations of the 20 most recent historical cases filtered by the CPT, then surgeon, and then operating room. The system will relax the filtering requirements for the operating room and surgeon if they do not yield any matches. Specific samples may also be excluded based on the system’s defined outlier threshold and, on occasion, these planned times can be overridden by a lead surgeon. We constructed predictions using the median from log-normal distributions fitted for each CPT (all had p-values > 0.1 using the Anderson-Darling test) and compared them with the H-PT. For CPTs with single instances, the lone value was taken as the prediction. Our analysis presented in Table 13 of Appendix B found no statistical difference between the H-PT and the CPT approach (see Appendix Table 14). Hence, we proceeded without considering CPT-fitted distributions, given that we can infer their performance by examining the H-PT. We set aside \(80\%\) of the total available data for training/fitting each distribution approach and saved the remaining \(20\%\) for use in evaluating the performance of the optimization methods. For the split, we used the most recently recorded observations in our data as the test set.

For the possible distributions to use, we considered the normal and log-normal distributions as they have historically been found to be good models for surgery durations [36, 56]. Note that these studies constructed distributions by conditioning on procedure type, while our distributions are not. For this reason, we fit a normal, log-normal, Weibull, and gamma distribution using all cases in the training set to verify whether our setting is consistent with previous findings. Table 4 summarizes the estimated parameters for each model and their p-values from the Anderson-Darling (AD) test. The results indicate that each distribution was a poor fit, which is not unexpected when the sample size is large and the true distribution is not in the considered family. Since the log-normal model had the best p-value, we further checked how well this log-normal model would fit the test set with the AD test and found it to be an acceptable fit (p-value = 0.056). Lastly, we followed the procedure of [49] and obtained comparable results in that log-transformed data was considered a good fit in \(71.1\%\) of the Shapiro-Wilk tests while it was only \(62.3\%\) for non-transformed data. As a result, we proceeded with using the log-normal distribution to fit over all cases and over cases grouped by the primary surgeon. The QRF model was built using 10-fold cross-validation for hyper-parameter tuning, while LQR requires no tuning. Both models used the same set of features described in Table 1, and no additional preprocessing was applied to the training set beyond encoding categorical variables for LQR.
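For readers who wish to reproduce a similar setup on their own data, the sketch below outlines the chronological train/test split and the log-normal fitting step described above; the column names are placeholders for the corresponding fields in the MSKCC extract, and scipy's Anderson-Darling routine reports a statistic and critical values rather than a p-value:

```python
import numpy as np
import pandas as pd
from scipy import stats

def chronological_split(df, frac_train=0.8):
    """Keep the most recent 20% of cases as the test set, as in our experiments."""
    df = df.sort_values("case_end")            # "case_end" is a placeholder column name
    cut = int(frac_train * len(df))
    return df.iloc[:cut], df.iloc[cut:]

def lognormal_fit_and_ad(durations):
    """Fit a log-normal by fitting a normal to log-durations; AD test on the logs."""
    logs = np.log(durations)
    mu, sigma = logs.mean(), logs.std(ddof=1)
    ad_result = stats.anderson(logs, dist="norm")   # statistic and critical values
    return (mu, sigma), ad_result

# Hypothetical usage on a raw case table with "case_end", "duration_min", "surgeon":
# train, test = chronological_split(raw_cases)
# params_all, ad_all = lognormal_fit_and_ad(train["duration_min"].to_numpy())
# params_ps = {s: lognormal_fit_and_ad(g["duration_min"].to_numpy())[0]   # L-Norm-PS
#              for s, g in train.groupby("surgeon")}
```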

Table 4 Parameter estimates and goodness of fit summary

5.2 Numerical results

We break down the results for the URO surgery schedules into regimes that correspond to the number of cases performed on a surgical day. Specifically, we organized the cases each day in the data set based on their operating room to yield schedules with the number of surgeries ranging from 2 to 11. A breakdown of the total number of surgeries corresponding to each case size is presented in Table 12 of Appendix B. The tables we show in this section further bin the schedule sizes into pairs of two to generate five different regimes to summarize our results. For a breakdown of waiting and overtime costs for schedules of each case size separately, we refer to Appendix Figures 5, 6, 7, 8, 9 and 10. One should keep in mind that because OR time is extremely costly, most surgical days have as many cases as can fit into the day. Therefore, regimes that on average consist of more surgeries in a day reflect lower average case durations.

Fig. 1 Box-plots for the SAA method comparing the L-Norm, L-Norm-PS, LQR and QRF models’ out-of-sample error for the different surgery regimes organized by the number of cases scheduled each day

Table 5 Weighted percentage improvement of the out-of-sample performance error when the QRF is used rather than the L-Norm, L-Norm-PS, LQR models in mean and percentiles for the Individualized Sample Average Approximation Optimization framework
Fig. 2 Box-plots for the RO method comparing the L-Norm, L-Norm-PS, LQR and QRF models’ out-of-sample error for the different surgery regimes organized by the number of cases scheduled each day

Table 6 Weighted percentage improvement of the out-of-sample performance error when the QRF is used rather than the L-Norm, L-Norm-PS, LQR models in mean and percentiles for the Individualized Robust Optimization framework

The results presented in this section are based on the out-of-sample performance cost computed using Eqs. 1 and 2 and the realized surgery durations from the test set. Consistent with [35], all results assume equal penalties for waiting and overtime, which is sufficient to illustrate the type of behavior each approach exhibits. For the intended OR surgical day length, T, we did not use MSKCC’s official hours. Their hours include planned late rooms, which extend the official hours each day for a varying set of rooms, and this data is not collected consistently enough to be available. Instead, we chose T to be the sum of the actual surgical durations for a given day because this is the minimal possible time a room would run. MSKCC considers any minute past what is necessary for the OR to be waste, so treating everything past this defined T as overtime remains consistent with their view. While the total planned time in each model’s solution will be exactly equal to T, for our experiments, setting it this way still allows the realized objective costs to capture the accumulated inaccuracy of the distributions provided by a prediction approach for the individual cases. In Section 5.3, we use the median of historical cases for T to establish that our methodology is practical.

Fig. 3 Box-plots for the DRO method comparing the L-Norm, L-Norm-PS, LQR and QRF models’ out-of-sample error for the different surgery regimes organized by the number of cases scheduled each day

We start with the SAA framework discussed in Section 4.1. As a preliminary, we solved the problem repeatedly with sample sizes ranging from 10 to 6000, each with 5 replications, to evaluate the convergence of the SAA model’s solutions. We use \(90\%\) of the training data to train our QRF model and select four instances of sizes 3, 5, 7, and 9 from the remaining \(10\%\) of the training data for validation. The validation set was split in the same manner as our test set, based on the chronological ordering of the cases. Appendix Figure 4 displays the trade-off between the size of samples drawn from QRF distributions and the consistency of the solution cost for the SAA model. At 5000 samples, the model’s objective cost converges with minimal variance, and its solution is good compared to the actual case durations from the validation data. Moreover, it was observed that 5000 samples was a generous quantity for use with the surgery distributions derived from the benchmark approaches. Therefore, to define the schedule of start times for each problem instance, we implemented SAA by drawing, for every surgery in that surgical day, 5000 samples from the surgery duration distributions derived from each approach considered (i.e., L-Norm, L-Norm-PS, LQR, and QRF).

Table 7 Weighted percentage improvement of the out-of-sample performance error when the QRF is used rather than the L-Norm, L-Norm-PS, LQR models in mean and percentiles for the Individualized Distributionally Robust Optimization framework

In Figure 1, the distribution of overall costs for each of the four models, stratified by the number of surgeries in the surgical day, is shown. Our individualized approach using QRF yields lower scheduling costs (in minutes) than the other three models. More concretely, we can see from Table 5 that the improvement in the solution cost tends to increase for instances with a larger number of surgeries. For a global comparison, we computed a weighted average (by the number of instances) of the performance of each model across all the categories of cases per day. We found QRF yields a \(55.1\%\), \(56.6\%\), and \(23.1\%\) lower average cost than L-Norm, L-Norm-PS, and LQR, respectively, across all case sizes.

Analogous to the approach above for SAA, we next consider the RO framework discussed in Section 4.2. For each schedule, we implemented RO using uncertainty sets based on Eq. 10 derived from L-Norm, L-Norm-PS, LQR, and QRF. We set \(K = 5000\) and \(\delta = 0.05\) for each model. Figure 2 depicts the distributions of overall costs for each of the four models grouped by the number of surgeries each schedule has in the test set. Here, we see that the RO approach’s conservative nature incurs a higher cost than the SAA approach. Table 6, however, still shows QRF again noticeably outperforming the L-Norm and L-Norm-PS models. The percentage improvement of QRF over these two models is also even more significant than under the SAA framework as the surgery size grows. On the other hand, LQR is a closer competitor to QRF under RO, where for schedules with two to three surgeries it outperforms QRF by 1.2% on average. This is not necessarily unexpected, given that both LQR and QRF are ML approaches capable of predicting individualized distributions. The fact that QRF is only outperformed slightly by LQR for one regime but outperforms LQR for all other, larger-size regimes indicates our approach remains favorable, especially when handling larger-size schedules. Altogether, QRF leads to \(64.2\%\), \(68.8\%\), and \(2.0\%\) lower total cost than L-Norm, L-Norm-PS, and LQR based on weighted averages over all regimes.

Finally, we consider the DRO framework in Section 4.3. We implemented DRO with constraints dictated by percentiles \(\{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,\) \(0.9,0.95\}\) of the surgery duration distributions derived from L-Norm, L-Norm-PS, LQR and QRF. Figure 3 once again shows the improvement of QRF over the other three models regarding the distributions of overall costs. In particular, Table 7 shows that, over all cases, QRF obtains \(67.1\%\), \(62.8\%\), and \(22.8\%\) lower out-of-sample cost than L-Norm, L-Norm-PS, and LQR, respectively.

Table 8 Setting parameter T with historic median based on schedule case size. The results show the weighted percentage improvement of the out-of-sample performance error when the QRF is used rather than the L-Norm, L-Norm-PS, LQR models for each optimization framework

5.3 Sensitivity analysis of T

To understand the influence of the T parameter in our models, we conduct a sensitivity analysis to determine its effect on the out-of-sample performance. First, we rerun our models with the original T values used in our numerical experiments scaled by \(\{ 0.90, 0.95, 1.05, 1.10 \}\). In Appendix B, Tables 20, 21 and 22 detail the results corresponding to this scaling for the SAA, RO, and DRO approaches. The results indicate a generally positive correlation between the QRF model’s improvement over the other benchmark models and the scaling value. The association was nonlinear only for the RO framework with LQR. Even these results, however, show that the QRF approach performs better than LQR for larger values of T. Averaging across all case sizes, we find by varying T that QRF improves upon L-Norm, L-Norm-PS, and LQR for all considered optimization methodologies by no fewer than \(35.0\%\), \(33.8\%\), and \(0.8\%\), respectively.

The findings indicate that QRF will still be beneficial so long as MSKCC can set T to reasonably reflect the cases each day. Next, we provide results to show that MSKCC can realistically see the benefits of utilizing QRF in practice. In particular, each model was rerun with T set to the historical median sum of actual surgical durations in the training data based on their case size. Table 23 in Appendix B presents the median values found from the training and test set.

Table 8 presents the percentage improvement in using the QRF approach over the other benchmarks in each optimization framework with our median-based T. As in previous results, QRF strongly improves upon the standard distribution-fitting benchmarks across all optimization frameworks. The QRF demonstrated a better out-of-sample performance than L-Norm overall by \(33.9\%\) for SAA, \(54.4\%\) for RO, and \(61.9\%\) for the DRO framework. For L-Norm-PS, the QRF approach averaged across regimes was \(30.2\%\), \(59.0\%\), and \(54.6\%\) better under the SAA, RO, and DRO frameworks, respectively. LQR was the closest approach to QRF, having performed \(1.1\%\) better in the RO framework and \(9.2\%\) and \(16.7\%\) worse in the SAA and DRO frameworks. Even if MSKCC used the simple idea of medians to set T, the QRF approach can still typically provide superior results to other benchmarks. The results suggest that using a more sophisticated idea to estimate T more accurately will only lead to better performance with QRF.

Table 9 MAE (RMSE) in minutes of the median, minimum, and maximum values of the samples drawn from QRF, LQR, L-Norm, and L-Norm-PS
Table 10 Using the median corresponding to distributions from QRF, L-Norm, L-Norm-PS, LQR, and H-PT to Predict Case Durations

5.4 Insights

For insight into the improvement numbers, we analyzed each of the four approaches in terms of prediction accuracy, separately from the optimization task. This is because for each of the percentage improvements shown in Tables 5, 6, and 7, all models were equal aside from their parameters. Note that the mean absolute error (MAE) and root-mean-square error (RMSE) presented in this section are in units of minutes and are generally referred to as error(s).

Table 9 displays the MAE and RMSE from the true durations in the testing data for the median, min, and max values of the 5000 samples drawn as described in Section 5.2. The results displayed in Table 9 show that even the extremes of the drawn samples (i.e., the largest sampled value for each case) of the QRF are, on average, significantly closer to the true durations than those of L-Norm and L-Norm-PS. We see, then, that QRF gives a distribution of the surgery time that has less variability, and this smaller variability, in turn, means smaller uncertainty in the optimization. As a result, the solution of the model needs to “hedge” against a smaller variability of scenarios and is less conservative. This is the opposite of fitting a single or limited number of distributions from the population. Although every patient’s durations, when aggregated, do form a “pooled” distribution, patients and procedures correlate with the cases’ specifics. Thus, it is intuitive to see why having some patients share the same duration distribution with identical parameters can lead to more inaccurate modeling and hence poorer optimization solutions. For additional analysis of each distribution model on the test data, see Tables 18 and 19 in Appendix B.

We further extended our analysis to include MSKCC’s planned times (H-PT) so that five approaches (QRF, LQR, L-Norm, L-Norm-PS, and H-PT) can be compared to the true durations in the test set. The median was taken as the prediction to be consistent with how MSKCC’s planned times are derived when evaluating for accuracy. In addition, the median was also found to produce the lowest mean absolute error (MAE) for QRF and LQR. Table 10 provides the correlation, MAE, and RMSE associated with each approach. We also provide in Table 10 the MAE from the subset of predicted times that were greater than their observed duration (over-predicted) and the subset of predicted times that were less than their observed duration (under-predicted). By ordering the MAE corresponding to each approach (excluding H-PT) from smallest to largest, a structure similar to the results in Tables 5, 6, and 7 can be seen in that QRF performs best, followed by LQR, L-Norm-PS, and L-Norm. Table 10 also details the number of non-identical distributions constructed over our entire dataset to highlight the association between individualization and accuracy. Note that these are based on the quantiles for QRF and LQR, the distribution parameters for the log-normal distributions, and assumed for H-PT. The errors of the over-predicted and under-predicted groups further support the notion that QRF produces less variable distributions.

Table 11 Breakdown of QRF vs. H-PT predictive power for procedures with 10 or less total instances in the entire data

As discussed previously, a known challenge in using CPT codes is that there exists a large number of them, a majority of which have limited or no historical data [16, 34]. The results of Table 10 show QRF outperforms the H-PT, which comes from a model stratified by CPT code. To understand the reason more precisely, we examined whether QRF is better suited than H-PT for predicting the duration of procedures with little to no historical data. Before proceeding, we note that if a procedure has zero past observations, MSKCC’s planning system will use similar surgeries performed by surgeons to impute a value for the H-PT. Likewise, ML methods can extrapolate for procedures with no historical observations by examining other relevant features. For analysis, we focused on CPT codes associated with ten or fewer historical observations and compared the predictive power of QRF versus H-PT in the two breakdowns presented in Table 11. The first breakdown is predicted durations for procedures with at least one instance in the training data, while the second is for procedures with zero instances present in the training data.

The QRF slightly underperforms the H-PT when predicting the durations of previously unseen procedures but noticeably outperforms the H-PT if just a few records of a procedure are available. Interestingly, we find that all procedures with zero instances in the training data were part of secondary services scheduled through the URO department. Such events can occur when patients have urologic cancer that metastasized to regions like the thorax. As mentioned in Section 5.1, MSKCC’s scheduling system primarily filters by CPT to predict surgery durations. This means secondary URO services could have jointly used CPTs with another department. Therefore, we reason that H-PT is likely better than QRF at predicting previously unseen procedures because H-PT uses historical records outside the URO department. It is still promising, however, that QRF could achieve statistically the same MAE as H-PT with fewer observations (t-test: p-value \(= 0.242\)).

Among the important/significant variables (see Table 17 in Appendix B) that showed up in both QRF and LQR were the relative value unit (RVU) measurements. RVUs are standard scale measures currently used in the U.S. by the Centers for Medicare and Medicaid Services to determine physician fees (Hsiao et al., 1992). They are based mainly on the relative time, including pre-procedure time, that surgeons have previously taken to complete a specified service. As a result, RVUs are a proxy for the expected procedure duration that our QRF model can use without explicit knowledge of CPT codes. While the LQR model considered the primary surgeon as a significant variable, it is interesting that the QRF model did not. QRF’s ability to accommodate complex, nonlinear associations makes it possible that our model also identifies surgeons implicitly through trends in the number and value of RVUs along with other possible feature interactions. Thus, we believe our model incorporates the essential characteristics of the procedure and surgeon in large part through RVUs, and this allows it to make estimates even for cases without CPTs in the historical data.

6 Conclusion

The integration of decision-making and individualization is a critically important area of research for advancing healthcare delivery. In the context of surgery scheduling, the duration of each case in a given day varies due to its distinct characteristics, which leads to significant uncertainty. Previous studies have commonly captured this uncertainty by fitting case duration distributions at the aggregate level. However, this approach cannot generate individualized distributions that characterize the subtle differences in uncertainty across cases. Individualized distributions require conditioning on a patient’s unique features.

This paper shows that QRF is an alternative to distribution fitting that yields duration distributions individualized to every case. We present a framework to incorporate the tailored distributions generated from QRF into SAA, RO, and DRO models. As theoretical justification, reformulations and consistent statistical guarantees are derived for each optimization-under-uncertainty approach. We further conduct a case study using MSKCC data for empirical support of our framework. The numerical results show that QRF and its individualized duration distributions can lead to optimization model solutions that significantly outperform those based on distributions fitted at an aggregate or stratified level.

Our primary objective is to show the value of the QRF prediction method and the potential benefit individualized uncertainty modeling can bring to decision-making problems. Based on our numerical experiments, an alternative ML method to QRF can perform slightly better on specific case sizes depending on the optimization framework. In searching for the best ML model for a given hospital, we note that bootstrapping is a potentially viable way to obtain individualized prediction distributions from non-quantile-based ML models. This approach, however, may not always be appropriate. For instance, the formalized approach to using bootstrap samples with linear regression models is limited in requiring random errors to be homoscedastic [12]. One benefit of a quantile method like LQR is that it is known to outperform least-squares linear models when errors are non-Gaussian [29]. Even more appealing than LQR is QRF, given that it does not require any assumptions on the error distribution or the functional relationship between the outcome variable and its predictors. Ultimately, whether it is QRF, LQR, or bootstrap sampling, choosing which approach to use is a multi-faceted modeling decision, and we are not advocating that QRF is always the best ML method to use. This paper strongly suggests that QRF, as a theoretically compatible approach for various optimization modeling frameworks, strikes a nice balance between model flexibility, computation, and performance. As a result, we believe any hospital not currently using individualized surgery schedules can readily obtain value in practice from individualized schedules through a QRF.

We foresee several potential directions to follow based on our study. It would be interesting to further understand how QRF predictions can impact solutions when the objective has waiting, idle, and overtime costs that are non-identical and how choosing the quantiles can trade off between idle and waiting times. Other penalty and cost types, and their sensitivities, can also be considered. We also plan on a follow-up study to understand the effectiveness of each optimization approach, their comparisons, and trade-offs in practice as we continue our collaboration with MSKCC. In this regard, we are excited by the future direction of researching how our individualized framework can be further enhanced to provide practical value in building decision-support tools. For surgery scheduling, this could include additional ML methods for setting parameters in the optimization models that depend on the complexity of the surrounding cases. As another direction, we could extend our approach to other healthcare problems such as scheduling chemotherapy infusion treatment sessions [11] or operating room planning [24]. Precision medicine and personalized, prescriptive analytics are evolving fields that can use sophisticated data-driven decision-making methods. We hope our framework and results encourage further exploration of individualized optimization under uncertainty in healthcare.