
12.1 Introduction

The term Ranking and Selection (R&S) broadly refers to solution methods developed to solve the R&S problem. The R&S problem is a stochastic optimization problem in which the decision-maker wishes to choose the “best” among a finite set of design points, or “systems,” when the performance of each system can only be observed with error. The R&S problem can be considered a special case of the more general simulation optimization (SO) problem, which is a (usually) nonlinear optimization problem whose objectives and constraints, if present, can only be observed with error as output from a stochastic simulation (see, e.g., Pasupathy and Ghosh 2013; Fu 2015 for overviews). Among SO problems, the R&S problem is unique in the sense that it engenders interesting research questions only in the stochastic context: The deterministic black-box analog of the R&S problem is complete enumeration. In contrast, when the system performance measures are defined implicitly through a black-box stochastic simulation model, the decision-maker can only observe each system’s performance by constructing an estimator whose precision depends on the simulation budget expended. In this context, interesting methodological questions arise. For example, two key questions are (a) how does one guarantee that at the end of simulating, the estimated best system is truly the best system with high probability and (b) how does one allocate a finite simulation budget across systems to efficiently identify the best system? For more than 60 years, researchers have sought answers to these questions, resulting in a large body of R&S literature. For overviews and entry points into this literature, see Bechhofer et al. (1995), Gupta and Panchapakesan (2002) for origins, and Goldsman and Nelson (1998), Kim and Nelson (2006b), and Branke et al. (2007) for the stochastic simulation perspective.

R&S was originally developed by the statistics community during the 1950s, 1960s, and 1970s (Bechhofer 1954; Paulson 1964; Fabian 1964; Dudewicz and Dalal 1975; Rinott 1978), but following its appearance at the Winter Simulation Conference (WSC), the nexus of research shifted from the statistics community to the stochastic simulation community sometime in the 1980s and 1990s. There were two key reasons for this shift: (a) optimizing a function embedded in a stochastic simulation was a natural goal for simulation practitioners, and early SO researchers borrowed existing methods from the statistics community; and (b) efficiency in R&S often comes from sequential sampling and comparison of systems, and the barrier to sequential algorithms was far lower in computer experiments than in the industrial and biostatistics applications for which R&S was invented. Further, while the statistics community was often concerned with having procedures for different (non-normal) populations, the emphasis in the simulation community was on designing procedures for larger and larger numbers of alternatives, with normality being plausible due to averaging, e.g., via batch means (Schmeiser 1982).

During this shift, WSC became a key venue joining the statistics and simulation communities. To the best of our knowledge, R&S first appeared at WSC in a 1976 session entitled "Statistical Basis for Selection Among Alternatives," which contained the work of Turnquist and Sussman (1976) and Dudewicz (1976). R&S became an increasingly popular topic after Goldsman published a survey paper in the 1983 WSC Proceedings (Goldsman 1983). Since then, WSC has served as the initial publication venue for many key advances in the R&S literature, with the area still active at WSC 2016 (e.g., Dong and Zhu 2016).

The significant advances in computing power over the last 40 years, and particularly the recent proliferation of parallel computing platforms, are one reason why R&S is still an active research topic at WSC today. Originally developed with serial computing platforms in mind, R&S procedures are now being redesigned for deployment as parallel procedures—a surprisingly nontrivial endeavor. Serial R&S procedures of the 1990s and early 2000s measure efficiency as the total number of simulation observations required on a single processor, ensure efficiency by being fully sequential, and tend to work well when the number of systems is small. On a parallel platform, appropriate efficiency measures include processor utilization, wall-clock time, and monetary cost to rent processors. Fully sequential procedures may require bottleneck-inducing synchronization. Further, while today's computing power ensures we can solve small problems fast, it should also enable us to solve much bigger problems. Thus, parallel computing platforms require procedures that are efficient under these new measures, guard against the bottlenecks that can arise in a parallel setting, and can handle a large number of systems. A handful of such parallel R&S procedures exist; all have made their debut at WSC.

Since R&S was originally designed to select among a small number of categorical or unordered alternatives, one may wonder, when do large R&S problems arise? First, large problems with categorical choices arise naturally in some applications, such as drug discovery and plant breeding. The respective goals in these applications are to find the best drug molecule among many potential drug molecules (Negoescu et al. 2011), and to find the best plant breeding pairs to produce a progeny population with desirable properties (Hunter and McClosky 2016). Second, problems with a very large but finite number of alternatives on an ordered space would seem to be most naturally solved by using algorithms that can exploit the spatial structure, such as R-SPLINE (Wang et al. 2013) or COMPASS (Hong and Nelson 2006; Xu et al. 2010). However, because of its simplicity and ability to provide a statistical guarantee that the selected system is truly the global best, R&S is often a go-to method for practitioners. For example, R&S can be applied to large problems that are created by considering all feasible combinations of a set of decision variables. Xu et al. (2010) describe an SO problem with \(500^6\) feasible solutions obtained by considering all 500 possible values of each of 6 order-up-to levels in a supply chain problem. A characteristic of such problems is that many (most) of the feasible solutions are substantially inferior to the better ones, and R&S procedures can exploit this tendency. Even when a search algorithm such as R-SPLINE or COMPASS is employed first, R&S can be used to provide a statistical guarantee as to which of the visited solutions is the best (Boesel et al. 2003).

In this chapter, we discuss the current state of the art in R&S for large problems solved on parallel computing platforms. We assume the reader is familiar with serial R&S procedures, at a broad level. In rethinking R&S procedures for parallel implementation, we provide a new stylized model for representing parallel R&S methods (Sect. 12.2), discuss mathematical and computational formulations of existing serial R&S procedures under the stylized model (Sects. 12.3 and 12.4, respectively), discuss design principles for efficiency and validity of parallel R&S procedures (Sect. 12.5), discuss existing parallel R&S procedures (Sect. 12.6), and speculate on the future of parallel R&S procedures (Sects. 12.7 and 12.8).

12.1.1 Problem Setting and Notational Conventions

R&S addresses the following SO problem: Let the true expected performances of the k competing systems be denoted

$$\begin{aligned} \mu _1\le \mu _2\le \cdots \le \mu _{k-1} \le \mu _k, \end{aligned}$$

where a larger mean is better, and let \(\mathcal {S}=\{1,2,\ldots ,k\}\) denote the set of indices of all systems. We refer to system k, or any system tied with system k, as the best. Often, the best system is assumed to be unique, in which case \(\mu _{k-1}<\mu _k\). Recall that we are unable to observe the true expected performances directly. Then suppose we are given a simulation oracle that can provide us with random variables \(Y_{i1},Y_{i2},\ldots ,Y_{in}\), where \(Y_{i r}\) is a random variable representing the performance of system i on the rth simulation replication, \(r=1,2,\ldots ,n\), \(i\in \mathcal {S}\). For all systems \(i\in \mathcal {S}\), we estimate the value of \(\mu _i\) with a consistent estimator such as the sample mean \(\bar{Y}_i(n)\mathrel {\mathop :}=\mathop {\textstyle \sum }_{r=1}^n Y_{i r}/n\).

An R&S procedure is an algorithm that attempts to return the best system using only the estimators of the expected system performances. In this chapter, the estimators of the expected system performances after obtaining \(n_i\ge 1\) simulation replications from each system \(i\in \mathcal {S}\) are \(\{\bar{Y}_i(n_i):i\in \mathcal {S}\}\). We assume an R&S procedure returns the system with the largest estimated mean as the estimated best system, so that \({\hat{K}}\mathrel {\mathop :}=\arg \!\max _{i\in \mathcal {S}}\{\bar{Y}_i(n_i)\}\) is the estimated best system. (See, e.g., Peng et al. 2016 for an example in which a system other than the one with the largest estimated mean is returned.) Since we may only assess each system with a finite computational budget, there is always a positive probability that an R&S procedure will return some system other than the best. Thus, R&S procedures are usually created to satisfy some mathematical or statistical objective, which we discuss in Sect. 12.3.
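
To make the setting concrete, the following minimal sketch shows the basic estimation step just described: sample each system, form the sample means \(\bar{Y}_i(n)\), and return the index of the largest as \({\hat{K}}\). The Gaussian oracle and the particular true means are illustrative assumptions only, not part of any procedure discussed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical simulation oracle: replication r of system i returns its unknown
# true mean mu_i plus noise (normality is an assumption made for illustration).
true_means = np.array([0.0, 0.3, 0.5, 0.9, 1.0])   # mu_1 <= ... <= mu_k

def simulate(i, n):
    """Return n i.i.d. replications Y_{i1}, ..., Y_{in} from system i."""
    return true_means[i] + rng.normal(scale=1.0, size=n)

k = len(true_means)
n = 50                                              # replications per system
Y_bar = np.array([simulate(i, n).mean() for i in range(k)])   # sample means

K_hat = int(np.argmax(Y_bar))                       # estimated best system
print("sample means:", np.round(Y_bar, 3), "estimated best (0-indexed):", K_hat)
```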

12.1.2 Scope

We classify as R&S any procedures that include three key ingredients: (a) they are applied to a finite number of systems whose expected performance can only be observed with error as simulation output, (b) they simulate all of the systems and construct consistent estimators of their expected performance, and (c) the decision-maker wishes to select the best by comparing these systems to each other. While we consider only the single-objective problem formulation, we note that stochastically constrained and multiobjective versions of the R&S problem exist. For example, see Andradottir and Kim (2010) and Pasupathy et al. (2014) for the stochastically constrained case, and Lee et al. (2010), Feldman et al. (2015), Feldman and Hunter (2016), and Hunter et al. (2017) for the multiobjective case.

R&S is closely related to some versions of best-arm identification in stochastic multi-armed bandit (MAB) problems; see Bubeck and Cesa-Bianchi (2012) and Jamieson and Nowak (2014). However, there are differences: MAB is often concerned with online decision-making so as to accumulate the most reward, while R&S is always an offline optimization problem. Perhaps more critically, the key assessment of an MAB algorithm is its "big-O" computational complexity (convergence rate) when selecting the best, while R&S is concerned with finite-time performance, even when asymptotic methods are used in the analysis. As a result, MAB algorithms tend to be simple, make few distribution-specific assumptions, and have computational complexity that is determined only up to unknown constants; R&S procedures, by contrast, exploit specific distributional assumptions to gain efficiency and to assure validity even when unknown constants must be estimated. We focus exclusively on R&S.

12.2 A Stylized Computational Model for R&S

Throughout this chapter, we use a stylized computational model to facilitate our discussion of both serial and parallel R&S procedures. In this section, we define and discuss the stylized model.

We formulate a stylized computational model for parallel R&S procedures by breaking all simulation and calculation tasks that must be completed during an R&S procedure into jobs. All R&S procedures contain two primary tasks for processors to complete: (a) performing simulation replications, and (b) performing calculations after simulation replication output is obtained, such as comparing the performances of systems to each other to select the estimated best. Thus we define job j as the ordered list comprising the simulation replications to obtain and the calculations to perform,

$$\begin{aligned} J_j\mathrel {\mathop :}=\{(\mathcal {Q}_j,\varDelta _j,\mathcal {U}_j),(\mathcal {P}_j, \mathcal {C}_j)\}, \end{aligned}$$

where

  • \(\mathcal {Q}_j\subseteq \mathcal {S}\) is a set containing the indices of systems to be simulated;

  • \(\varDelta _j = \{\varDelta _{ij}\}\) specifies how many samples to take from each system \(i\in \mathcal {Q}_j\);

  • \(\mathcal {U}_j\) is the assigned block of random numbers with which to perform the simulation replications;

  • \(\mathcal {P}_j\) is a list of jobs whose termination must precede the calculation \(\mathcal {C}_j\), if any, and

  • \(\mathcal {C}_j\) is a list of non-simulation calculations or operations to perform.

We allow \((\mathcal {Q}_j,\varDelta _j,\mathcal {U}_j)\) or \((\mathcal {P}_j, \mathcal {C}_j)\) to be null, so that a job can consist of just simulations or just calculations. Since \(J_j\) is ordered, we assume that the simulation replications in \((\mathcal {Q}_j,\varDelta _j,\mathcal {U}_j)\) are completed before the calculation \(\mathcal {C}_j\) begins. When there is only one processor, the list of jobs is usually created and executed dynamically by that processor. In the presence of multiple processors, the list of jobs must be coordinated to preserve precedence requirements. For example, some simulation replications must be obtained from each system before their performances can be compared to each other.
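
As a purely illustrative rendering of this stylized model, a job \(J_j\) could be represented as follows; the field names mirror \((\mathcal {Q}_j,\varDelta _j,\mathcal {U}_j)\) and \((\mathcal {P}_j, \mathcal {C}_j)\) and are our own choices, not part of any existing implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

@dataclass
class Job:
    """Stylized job J_j = {(Q_j, Delta_j, U_j), (P_j, C_j)}."""
    systems: Set[int] = field(default_factory=set)      # Q_j: indices of systems to simulate
    reps: Dict[int, int] = field(default_factory=dict)  # Delta_j: replications per system in Q_j
    rng_block: Optional[int] = None                     # U_j: assigned block of random numbers (a seed here)
    preceding: List[int] = field(default_factory=list)  # P_j: ids of jobs that must finish before C_j
    calculation: Optional[Callable] = None               # C_j: non-simulation calculations to perform

    @property
    def simulation_only(self) -> bool:
        return self.calculation is None

    @property
    def calculation_only(self) -> bool:
        return not self.systems

# Example: a job that simulates systems 1 and 2 (10 replications each), and a
# calculation-only job that must wait for it before selecting the larger mean.
sim_job = Job(systems={1, 2}, reps={1: 10, 2: 10}, rng_block=7)
calc_job = Job(preceding=[0], calculation=lambda results: max(results, key=results.get))
```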

When the number of processors \(p\ge 2\), we broadly assume that parallel algorithms operate in what is known as a master–worker framework. In this framework, one master processor coordinates the activities of one or more worker processors. The workers execute jobs determined by the master, and report results back to the master. Communication may occur through shared memory or via message passing. A master–worker framework can also be implemented in multiple tiers, in which each worker acts as a master to, and coordinates the tasks of, one or more sub-workers. For simplicity and ease of exposition, we assume only one such tier for now.

Remark 12.1

We acknowledge the existence of various parallel computing architectures and frameworks for parallel algorithm design (see, e.g., Barlas and Kaufman 2015). We take a higher level approach that enables us to focus on broad R&S procedure design concepts, instead of the details related to the underlying parallel computing architecture.

When a master sends a job to a worker, we assume that all data required to do the job is also transferred, or is otherwise accessible by shared memory. Likewise, when a worker completes a task, we assume relevant data is transferred back to the master. Ensuring efficient data transfer is an important part of designing parallel algorithms; however, for ease of exposition, we suppress data transfer information in our framework. Thus while a worker's job may entail performing simulation replications and calculating statistics such as a sample mean, we do not explicitly denote whether the worker transfers just the sample mean back to the master, or the sample mean and all data used to compute the sample mean.

In the master–worker framework, we assume the (possibly dynamic) list of jobs

$$\begin{aligned} \mathcal {J}\mathrel {\mathop :}=\{J_j: 1\le j\le M\}, \end{aligned}$$

is created and maintained by the master processor, where \(M\), \(1\le M \le \infty \), indexes some (possibly random) terminal job; \(M=\infty \) denotes the list of jobs for a non-terminating algorithm. When a worker processor completes a job or becomes idle, it communicates any results back to the master processor and requests a new job. Henceforth, let \(0<T_j< \infty \) be the wall-clock time at which job \(J_j\) finishes, so that

$$\begin{aligned} T_e(\mathcal {J}) = \max _{j=1,2,\ldots , M}T_j \end{aligned}$$

is the (possibly random) ending time of the procedure.
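
A minimal master–worker sketch of this bookkeeping, using Python's standard concurrent.futures in place of a real distributed framework: the master submits simulation-only jobs to p workers, records the wall-clock finish time \(T_j\) of each, and reports \(T_e(\mathcal {J})=\max _j T_j\). The dummy oracle and the job sizes are assumptions for illustration.

```python
import time
import numpy as np
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_job(job_id, system, reps, seed):
    """Worker-side execution of a simulation-only job; returns summary statistics."""
    rng = np.random.default_rng(seed)
    y = rng.normal(loc=0.1 * system, scale=1.0, size=reps)   # dummy oracle (assumption)
    return job_id, system, y.mean(), y.var(ddof=1)

if __name__ == "__main__":
    k, n0, p = 20, 10_000, 4
    start = time.perf_counter()
    finish_times, results = {}, {}
    with ProcessPoolExecutor(max_workers=p) as workers:      # p worker processors
        futures = [workers.submit(run_job, j, j, n0, seed=j) for j in range(k)]
        for fut in as_completed(futures):                    # master collects as workers finish
            job_id, system, ybar, s2 = fut.result()
            finish_times[job_id] = time.perf_counter() - start   # T_j
            results[system] = (ybar, s2)
    T_e = max(finish_times.values())                         # T_e(J) = max_j T_j
    print(f"all {k} jobs finished; T_e = {T_e:.3f} seconds")
```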

Remark 12.2

We assume the master creates jobs that can be sent to the workers for execution. Some jobs, particularly jobs containing only calculations, may be executed by the master. Since only the master creates jobs, we do not consider the creation of jobs to be a job.

In modern computing environments, R&S procedures may be completed by purchasing processing power from a service. Since cores may often be purchased in increments such as 4, 8, 16, 48, or 64, with the price per hour varying by the type of processing power provided, we formulate the general cost to purchase p processors for s time units as a function \(c(p,s)\). For a total budget b, we require \(c(p,s)\le b\). Define the function \(t(p,b)\) as the maximum amount of time we purchase on p processors, so that

$$\begin{aligned} t(p,b)\mathrel {\mathop :}=\max \{s:c(p,s)\le b\}. \end{aligned}$$
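
For instance, under a simple linear pricing model \(c(p,s)=p\cdot \text {rate}\cdot s\) (an assumption; real cloud pricing is typically tiered), \(t(p,b)\) has the closed form \(b/(p\cdot \text {rate})\):

```python
def cost(p: int, s: float, rate: float = 0.05) -> float:
    """Assumed linear pricing c(p, s): rent p processors for s hours at rate $/processor-hour."""
    return p * rate * s

def max_time(p: int, b: float, rate: float = 0.05) -> float:
    """t(p, b) = max{s : c(p, s) <= b} under the linear pricing assumption."""
    return b / (p * rate)

# Example: a $100 budget buys 31.25 hours on 64 processors at $0.05 per processor-hour.
print(max_time(p=64, b=100.0))
```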

12.3 Mathematical Formulations of Existing R&S Procedures

Recall that because we cannot simulate every system infinitely often, upon termination, R&S procedures have some positive probability of selecting a system other than the true best. However, most R&S procedures are designed to control this error probability. In this section, we formulate the common goals of existing R&S procedures using the stylized model in Sect. 12.2.

First, we note that most R&S procedures are in some way concerned with the optimality gap between the true best system and the estimated best system, \(\mu _k - \mu _{\hat{K}}\). We say that a correct selection (CS) event occurs if this optimality gap is zero, and \(\mu _k = \mu _{\hat{K}}\). An ideal R&S procedure would always deliver a CS for any computational budget \(n\ge k\). Since this ideal is impossible in the presence of noise, compromises are made, and the chosen compromise affects the nature of the procedure. R&S procedures may be classified by a number of different approaches and compromises, although these boundaries are not always sharp (see also Pasupathy and Ghosh 2013; Dong and Zhu 2016):  

Fixed-precision versus fixed-budget guarantee:

Fixed-precision procedures execute until some form of guarantee holds, usually on the optimality gap between the selected and true best systems. Fixed-budget procedures attempt to allocate a fixed computational budget in a way that minimizes a loss function that penalizes an incorrect selection event.

Finite-sample versus asymptotic validity:

Finite-sample procedures provide a provable guarantee, such as a guaranteed probability of correct selection (PCS), that holds for a finite sample size. Asymptotically valid procedures achieve their guarantees only in some meaningful limit.

Frequentist versus Bayesian guarantee:

Frequentist probabilistic guarantees are averaged over (conceptually) repeated applications of the procedure. Bayesian probabilistic guarantees are conditioned on the data and averaged over the sources of parameter uncertainty.

 

In the next two sections, we discuss some of the standard compromises and approaches for creating R&S procedures. Later, we argue that the relevant compromises may be affected by the decision to implement the procedure in a parallel computing environment. In our discussion, we group procedures by whether they are fixed-precision or fixed-budget procedures, which often, but not always, determines the computational formulation of the R&S procedure, as we discuss in Sect. 12.4.

12.3.1 Mathematical Formulation of Fixed-Precision Guarantees

Ideally, fixed-precision R&S procedures are guaranteed to deliver the optimal solution with a pre-specified frequentist probability, which we denote by \(1-\alpha \) for \(1-\alpha \in (1/k,1)\). This guarantee is called the probability of correct selection (PCS) guarantee, and is expressed as

$$ \mathbb {P}\{\mu _{\hat{K}}=\mu _k \}\ge 1-\alpha . $$

If there are multiple optima, or several solutions with close performance, delivering this guarantee can be computationally infeasible. As a result, making one of the following additional compromises is typical.

  • One can assume that the best is unique, and accept the possibility of substantial computation before termination, as in Fan et al. (2016).

  • One can allow for a practically significant difference \(\delta >0\), also called an indifference-zone (IZ) parameter, and instead require

    $$ \mathbb {P}\{{\hat{K}}=k \mid \mu _k-\mu _{k-1}\ge \delta \}\ge 1-\alpha . $$

    The IZ compromise has been widely adopted.

  • One can be satisfied with returning a good solution with optimality gap no larger than a user-specified \(\delta \):

    $$ \mathbb {P}\{\mu _k - \mu _{\hat{K}}\le \delta \}\ge 1-\alpha . $$

    Some of the procedures that deliver a guaranteed PCS also deliver a guaranteed probability of good selection (PGS), but this is not always the case; the distinction between PCS and PGS is illustrated in the sketch following this list.

  • One can be satisfied with

    $$ \mathbb {P}\{{\hat{K}}\in \{k, k-1, k-2,\ldots , k-m+1\} \} \ge 1 - \alpha . $$

    That is, one can be satisfied with selecting a top-m solution based on rank order, irrespective of the actual optimality gap. This is the compromise behind ordinal optimization (see, e.g., Chen and Lee 2010).

  • One can be satisfied with a subset \(\hat{\mathcal {S}}\subseteq \mathcal {S}\) such that

    $$ \mathbb {P}\{k\in \hat{\mathcal {S}}\}\ge 1-\alpha . $$

    Subset procedures are closely related to multiple comparison procedures that provide simultaneous confidence intervals on some set of differences, and in particular to multiple comparisons with the best (MCB, Hsu 1984). Subset guarantees can often be delivered with weak assumptions, but the conclusion may also be weak if the subset is large. Subset procedures may be used within other R&S procedures for screening, that is, removing from consideration systems that are estimated to be inferior.
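
The distinction between PCS and PGS under an indifference-zone parameter \(\delta \) can be checked empirically with a quick Monte Carlo experiment. The sketch below uses the naive rule "take n replications per system and select the largest sample mean," together with a slippage-like configuration of means; both are illustrative assumptions, not one of the procedures cited above.

```python
import numpy as np

def empirical_pcs_pgs(mu, n, delta, trials=5000, seed=1):
    """Estimate PCS and PGS for naive select-the-largest-sample-mean after n reps/system."""
    rng = np.random.default_rng(seed)
    best = int(np.argmax(mu))
    cs = gs = 0
    for _ in range(trials):
        ybar = rng.normal(loc=mu, scale=1.0 / np.sqrt(n))   # sample means, unit-variance outputs
        pick = int(np.argmax(ybar))
        cs += (pick == best)                                # correct selection
        gs += (mu[best] - mu[pick] <= delta)                # good selection (gap at most delta)
    return cs / trials, gs / trials

# Slippage-like configuration: several systems within delta of the best, so a correct
# selection is hard to guarantee but a good selection is easy.
mu = np.array([0.0] * 5 + [0.45] * 4 + [0.5])
pcs, pgs = empirical_pcs_pgs(mu, n=100, delta=0.1)
print(f"estimated PCS = {pcs:.3f}, estimated PGS = {pgs:.3f}")
```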

While all R&S procedures strive to be efficient, fixed-precision procedures require statistical guarantees to hold. Thus we formulate the objective of fixed-precision procedures by placing a hard constraint on the guarantee, while choosing the number of processors p to purchase and a job schedule \(\mathcal {J}\) so as to minimize the expected (scaled) completion time of the procedure plus the (scaled) monetary cost of the R&S procedure. Under the stylized model in Sect. 12.2, we formulate this problem as

$$\begin{aligned} \text{ minimize }_{p,\mathcal {J}}\quad&\mathbb {E}[\beta _t T_e(\mathcal {J}) + \beta _c c(p,T_e(\mathcal {J}))] \quad \text{ s.t. } \quad \mathbb {P}\{G\}\ge 1-\alpha , \end{aligned}$$

where \(\beta _t\ge 0\) and \(\beta _c\ge 0\) are scaling coefficients, and the event G denotes a “good event” upon termination of the procedure, in whatever form. For example, \(G= ({{\hat{K}}=k} \mid \mu _k-\mu _{k-1}\ge \delta )\) for an IZ compromise, and \(G = (k\in \hat{\mathcal {S}})\) for a subset selection compromise. Usually, \(\beta _t\in \{0,1\}\) and \(\beta _c=1-\beta _t\), so that only expected wall-clock time or only expected cost is minimized, depending on the cost structure of the parallel computing environment. To ensure the probabilistic guarantee constraint is satisfied, we require purchasing as many processor hours as the procedure requires to terminate at time \(T_e(\mathcal {J})\); thus, the monetary budget for purchasing processor hours should be \(b=\infty \).

12.3.2 Mathematical Formulation of Fixed-Budget Guarantees

In contrast with fixed-precision procedures, in which the simulation budget is determined in part by the required precision, the goal of most fixed-budget procedures is to identify the best system efficiently under a fixed simulation budget. Thus most fixed-budget procedures provide an “efficiency guarantee,” which we formulate as

$$\begin{aligned} \text{ minimize }_{p,\mathcal {J}}\quad&\mathbb {E}[\mathcal {L}(G^c,\mathcal {J})] \quad \text{ s.t. } \quad t(p,b) \le t^*, \end{aligned}$$

where the function \(\mathcal {L}\) is some type of loss function that depends on an undesirable event (which, loosely speaking, we denote as \(G^c\)) upon termination of the procedure, and \(t^*\) is a fixed limit on the processor hours we purchase. This formulation implies that we wish to choose the processors and the job configuration to minimize the expected loss associated with an incorrect decision, subject to a hard budgetary constraint on the amount of processor hours. This constraint differs from the traditional constraint on the total number of simulation replications and provides a more accurate way to measure cost in a parallel computing setting. Note that equivalently, we could formulate the constraint in terms of monetary cost instead of time.

Several prominent fixed-budget guarantee methods include those provided by OCBA (Optimal Computing Budget Allocation) (Chen et al. 2000), the Bayesian Expected Value of Information (EVI) (Chick et al. 2010) and Knowledge Gradient (KG) (Frazier et al. 2008) methods, and the frequentist SCORE (Sampling Criteria for Optimization using Rate Estimators) framework (Pasupathy et al. 2014), which generalizes the work of Glynn and Juneja (2004) and has a close relationship with OCBA and EVI (Ryzhov 2016).

12.3.3 Guarantees Require Standard Assumptions

Whether assuring a desired PCS or minimizing an expected loss, there is an underlying output distribution with respect to which the PCS or expected loss is evaluated. This underlying distribution may be derived from a strong assumption about the simulation output data, or hold asymptotically under weaker conditions. Establishing that these probability guarantees hold in small samples usually requires strong distribution assumptions. Asymptotic analysis (e.g., as \(\delta \rightarrow 0\)) can establish attainment in a large-sample sense. In either case, the actual distribution depends on both (a) the simulation model itself and (b) the sequence of jobs executed. Dependence on (b) is typically not a concern when there is only a single processor, but as discussed in Sect. 12.5, it is critical when jobs are executed in parallel.

To ensure the guarantees from the previous two sections hold, we define the standard output assumptions as follows. Recall that \(Y_{ir}\) is a random variable representing the performance of the ith system on the rth simulation replication, for each \(r=1,2,\ldots \) and all \(i\in \mathcal {S}\).

Definition 12.1

The standard output assumptions comprise the following:

  1.

    (Within) for all systems \(i\in \mathcal {S}\), the random variables \(Y_{ir},\ r=1,2,\ldots \) are i.i.d. normally distributed with finite variance, and

  2.

    (Between) for all pairs of systems \(i,i'\in \mathcal {S}\), the random variables \(Y_{ir}\) and \(Y_{i'r'}\) are independent for all \(r=1,2,\ldots \) and all \(r'=1,2,\ldots \).

The validity of a serial R&S procedure can usually be established under the standard output assumptions. However, these assumptions may be overly stringent. We now provide several common relaxations to the standard output assumptions.  

Within Relaxations:

for all systems \(i\in \mathcal {S}\), the random variables \(Y_{ir},\ r=1,2,\ldots \), (a) are i.i.d. with finite variance; (b) are stationary with finite variance; or (c) appropriately standardized, satisfy a Functional Central Limit Theorem.

Between Relaxations:

for all pairs of systems \(i,i'\in \mathcal {S}\), the random variables \(Y_{ir}\) and \(Y_{i'r}\) are positively correlated for all \(r=1,2,\ldots ,\) where the positive correlation is induced by the use of common random numbers (CRN).

 

CRN is a rule for assigning a set of jobs \(\{J_j,\ j \in \mathcal {B}^{(b)}\}\) a "common" block of random numbers \(\mathcal {U}_j = \mathcal {U}^{(b)}\) for all \(j \in \mathcal {B}^{(b)}\), so that the blocks \(b=1,2,\ldots \) exhaust all jobs that require simulation replications in \(\mathcal {J}\). The use of CRN across systems to induce a positive correlation and thereby reduce the variance of the difference \(\bar{Y}_i(n) - \bar{Y}_{i'}(n)\) has long been a staple of R&S methods to improve statistical efficiency; see for instance Nelson (2013). Because CRN can, in fact, increase variance if simulation outputs are not appropriately paired with equal numbers of observations across systems, the use of CRN imposes an additional coordination problem when there are multiple processors. As a result, CRN has not yet been central to parallel R&S procedures. Therefore, we assume independent blocks of random numbers for each job from here on unless specifically indicated. For simplicity in our stylized model, whenever the blocks of random numbers are independent, we drop the specification of \(\mathcal {U}_j\) and instead write

$$\begin{aligned} J_j\mathrel {\mathop :}=\{(\mathcal {Q}_j,\varDelta _j),(\mathcal {P}_j, \mathcal {C}_j)\}. \end{aligned}$$

12.4 Computational Formulations of Existing Serial R&S Procedures

Once we have a mathematical formulation of the goals of the R&S procedure, we require a computational formulation of the procedure that can be implemented on one or more processors. To naïvely implement existing serial R&S procedures in a parallel computing setting, we require an assignment of jobs to processors such that the standard assumptions from the original serial procedure, in whatever form they exist, are still satisfied. The simplest way to accomplish this goal is to parallelize only the parts of the procedure that can be completed in an embarrassingly parallel fashion, and complete all other tasks in the original sequence. As is common in the parallel computing literature, we use the term embarrassingly parallel to refer to jobs that are trivially implemented in parallel and require no coordination or synchronization.

Before we provide naïve parallel computational formulations of existing serial fixed-precision and fixed-budget R&S procedures, we define the concepts of coupled operations and stages, which assist with ordering jobs. First, recall that \(\mathcal {C}_j\) is a list of calculations that are performed as part of a job j. Typical calculations that arise in R&S procedures include

  • determining the sample mean and sample variance of the ith system, \(\bar{Y}_i(n)\) and \(S^2_i(n)\mathrel {\mathop :}=(n-1)^{-1}\textstyle \sum _{r=1}^{n} (Y_{ir}-\bar{Y}_i(n))^2,\) respectively,

  • performing a pairwise comparison \(\bar{Y}_i-\bar{Y}_{i'}\) for two systems i and \(i'\),

  • determining the paired sample variance

    $$S^2_{i,i'}(n)\mathrel {\mathop :}=(n-1)^{-1}\textstyle \sum _{r=1}^n (Y_{ir}-Y_{i'r}-(\bar{Y}_i(n)-\bar{Y}_{i'}(n)))^2,$$
  • updating a sample allocation rule \(\mathfrak R\) in a fixed-budget procedure like OCBA, and

  • updating a posterior distribution in a Bayesian R&S procedure.

Since operations like pairwise comparisons and calculating paired sample variances require the simulation output of two or more systems, we refer to these operations as coupled.
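
For two systems i and \(i'\) simulated with equal sample sizes (as the paired variance requires), these bookkeeping calculations reduce to a few lines; the synthetic data below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
Y_i = rng.normal(1.0, 1.0, size=n)       # output of system i (synthetic)
Y_ip = rng.normal(0.8, 1.0, size=n)      # output of system i' (synthetic)

ybar_i, ybar_ip = Y_i.mean(), Y_ip.mean()            # sample means
S2_i = Y_i.var(ddof=1)                               # marginal sample variance S_i^2(n)
diff = Y_i - Y_ip
# Paired sample variance S_{i,i'}^2(n): the sample variance of the pairwise differences.
S2_pair = np.sum((diff - diff.mean()) ** 2) / (n - 1)

print("pairwise comparison:", ybar_i - ybar_ip)
print("marginal variance:", S2_i, "paired variance:", S2_pair)
```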

Definition 12.2

We define the following:

  • A coupled operation or coupling is an operation or calculation in which the simulation output of two or more systems is required.

  • A fully coupled operation or full coupling is an operation or calculation that requires the simulation output of all systems still in contention at that point in the procedure.

Thus coupled operations occur when the estimated system performances must be compared to each other, as in a pairwise comparison, or when some key quantity must be calculated that requires the compilation of simulation output from multiple systems. For example, calculating the estimated best system \({\hat{K}}= \arg \!\max _{i\in \mathcal {S}}\{\bar{Y}_{i}(n_i)\}\) is a fully coupled operation. Coupling is distinct from the concept of synchronization in parallel algorithms, since coupling is across systems, and synchronization is usually across processors. However, when there is a cost to switch from simulating one system to another, it may make sense to assign processors to simulate particular systems, in which case a coupled operation may require synchronization of simulation output across processors.

Since R&S procedures consist of simulations and comparisons, they are usually implemented in what are called stages. While the definition of the term “stage” has not always been consistent in the R&S literature, in the context of our stylized model, we define a stage as follows:

Definition 12.3

A stage is a portion of an R&S procedure that begins with the first simulation output obtained after initialization or after the last fully coupled operation, and ends when the next fully coupled operation terminates.

While the individual calculations required in the full coupling may be split into jobs that are carried out by multiple processors, when the final calculation of the full coupling is complete, then the stage is over. When variances are unknown, the minimum number of stages is two (Dudewicz and Dalal 1975); the variances are estimated in the first stage. Thus, the first stage almost always consists of obtaining \(n_0\ge 2\) observations from each system, and ends with a fully coupled calculation of key information for implementing the next stage.

In the computational formulations that follow, our goal is to demonstrate the coupling structure of naïve parallelization of each type of serial procedure. Thus, we provide only a straightforward formulation of jobs \(J_j\). We acknowledge that many such formulations exist; some are more efficient than others.

12.4.1 Computational Formulation of Fixed-Precision Procedures

To create a computational formulation for fixed-precision procedures, we begin by formulating existing serial procedures using the stylized model described in Sect. 12.2. We provide a basic formulation of two prominent versions of fixed-precision procedures that have different coupling structures: two-stage procedures, which have exactly two stages with two full couplings, and fully sequential procedures, which have many stages and frequent full couplings.

The first stage of a two-stage procedure usually begins with obtaining \(n_0\ge 2\) simulation replications from each system, and ends with fully coupled operations that use rules to screen systems and to calculate second-stage sample sizes such that desired statistical guarantees hold. The screening and sampling rules, which we denote as rules \(\mathfrak R\), are often functions of the user-specified parameters \(\alpha \) and \(\delta \), and the variances of the system performances. The simulation replications in each of the two stages can be farmed out to worker processors in an embarrassingly parallel fashion, while the master completes all coupled operations. The full coupling at the end of each stage requires a full synchronization across all processors. A naïve two-stage fixed-precision procedure with an optional subset selection step is provided in Algorithm 1. Prominent two-stage procedures include Rinott (1978) and NSGS (Nelson et al. 2001). Ni et al. (2014) provide a parallel version of NSGS that is slightly different from Algorithm 1, called NSGS\(_p\).

(Algorithm 1: a naïve parallel two-stage fixed-precision procedure with an optional subset-selection step; figure not reproduced.)
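
Because the original figure is not reproduced, we include a rough sketch in the spirit of Algorithm 1: a naïvely parallelized two-stage (Rinott-style) procedure without the optional subset-selection step. The dummy oracle and the constant h are placeholders; choosing h so that the PCS guarantee actually holds requires Rinott's constant, which we do not compute here.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def simulate(i, n, seed):
    """Simulation job: n replications of system i (dummy Gaussian oracle; an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.05 * i, scale=1.0, size=n)

def two_stage(k=10, n0=20, delta=0.2, h=3.0, p=4):
    # Stage 1 (embarrassingly parallel): n0 replications from every system.
    with ProcessPoolExecutor(max_workers=p) as ex:
        stage1 = list(ex.map(simulate, range(k), [n0] * k, range(k)))
    # Full coupling on the master: first-stage variances and second-stage sample sizes.
    S2 = np.array([y.var(ddof=1) for y in stage1])
    N = np.maximum(n0, np.ceil((h * np.sqrt(S2) / delta) ** 2).astype(int))
    # Stage 2 (embarrassingly parallel): the remaining N_i - n0 replications.
    extra = [int(Ni) - n0 for Ni in N]
    with ProcessPoolExecutor(max_workers=p) as ex:
        stage2 = list(ex.map(simulate, range(k), extra, range(k, 2 * k)))
    # Final full coupling: overall sample means and selection of the estimated best.
    ybar = [np.concatenate([a, b]).mean() for a, b in zip(stage1, stage2)]
    return int(np.argmax(ybar))

if __name__ == "__main__":
    print("estimated best system:", two_stage())
```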

While two-stage procedures can be completed using mostly embarrassingly parallel computation with little synchronization, they are less efficient than fully sequential procedures in terms of the expected total number of simulation replications required. Fully sequential procedures gain sampling efficiency by frequent comparisons and screening. Arguably, fully sequential procedures have the maximum number of stages and hence a "maximal" coupling structure. In each stage after the first, one simulation replication is obtained from each system still in contention, sample means and paired variances are updated, and inferior systems are screened out. Since the number of simulation replications per system is equal across surviving systems in each stage of the procedure, that is, \(n_{i}=n_{i'}\) for all \(i,i'\in \mathcal {Q}\), some fully sequential procedures such as \(\mathcal {K}\mathcal {N}\) can be implemented with CRN, further enhancing efficiency. However, screening is an inherently coupled operation—especially when screening requires all pairwise comparisons between systems. Thus when adapting R&S procedures to a parallel computing platform, there exists a tension between the sampling efficiency gained by frequent screening, and the potential inefficiency of attempting to perform frequent coupled screening operations across many processors. A generic, naïvely parallelized fully sequential procedure is provided in Algorithm 2. Prominent fully sequential procedures include \(\mathcal {K}\mathcal {N}\) (Kim and Nelson 2001).

(Algorithm 2: a generic, naïvely parallelized fully sequential procedure; figure not reproduced.)
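
Again in place of the missing figure, the following serial-logic sketch conveys the flavor of Algorithm 2 for a \(\mathcal {K}\mathcal {N}\)-like procedure: a first stage of \(n_0\) replications per system, then one replication per surviving system per stage followed by fully coupled pairwise screening. The continuation-region constant follows the form given by Kim and Nelson (2001), but the sketch is ours and readers should consult the original for the exact procedure; simulation calls are shown serially even though Algorithm 2 would farm them out to workers.

```python
import numpy as np

def kn_like(mu, n0=20, delta=0.2, alpha=0.05, max_reps=100_000, seed=3):
    """Rough KN-flavored fully sequential selection; serial logic for clarity."""
    rng = np.random.default_rng(seed)
    k = len(mu)
    sample = lambda i: rng.normal(mu[i], 1.0)          # dummy oracle (assumption)
    # Screening constant in the form given by Kim and Nelson (2001).
    eta = 0.5 * ((2 * alpha / (k - 1)) ** (-2.0 / (n0 - 1)) - 1.0)
    h2 = 2.0 * eta * (n0 - 1)

    Y = {i: [sample(i) for _ in range(n0)] for i in range(k)}           # first stage
    S2 = {(i, j): np.var(np.array(Y[i]) - np.array(Y[j]), ddof=1)       # paired variances
          for i in range(k) for j in range(k) if i != j}
    Q, r = set(range(k)), n0                                            # systems in contention
    while len(Q) > 1 and r < max_reps:
        for i in Q:
            Y[i].append(sample(i))                                      # one more replication each
        r += 1
        means = {i: np.mean(Y[i]) for i in Q}
        # Fully coupled screening: i is eliminated if it falls below some j's boundary.
        out = {i for i in Q for j in Q if j != i and
               means[i] - means[j] <
               min(0.0, -(delta / (2 * r)) * (h2 * S2[(i, j)] / delta ** 2 - r))}
        Q -= out
    return Q

print("surviving system(s):", kn_like(mu=np.array([0.0, 0.1, 0.2, 0.3, 0.6])))
```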

12.4.2 Computational Formulation of Fixed-Budget Procedures

Fixed-budget procedures often take a similar computational structure, outlined in Algorithm 3. As in the fixed-precision procedures, the first stage begins by obtaining \(n_0\ge 2\) simulation replications from each system, and ends with a (usually) fully coupled operation that determines how to allocate the \(\varDelta \) simulation replications in the next stage, using a sampling rule \(\mathfrak R\). This process of obtaining \(\varDelta \) simulation replications per stage and updating the sampling rule is repeated until the total simulation time has been exhausted. Since the frequency of the coupling and the number of stages is determined by the parameter \(\varDelta \), these procedures tend to have a flexible coupling frequency. We note that some procedures are designed for myopic sampling, such that \(\varDelta =1\), while other procedures are more flexible in the choice of \(\varDelta \).

(Algorithm 3: a generic, naïvely parallelized fixed-budget procedure; figure not reproduced.)
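
In place of the missing figure, the following sketch conveys the structure of Algorithm 3 using OCBA-style allocation fractions (Chen et al. 2000) as the sampling rule \(\mathfrak R\). The ratio formulas are the standard ones, but several simplifying assumptions are ours: a total replication budget stands in for the processor-hour budget, each stage's \(\varDelta \) replications are split in proportion to the fractions rather than targeting cumulative sample sizes, and replications are drawn serially rather than farmed out to workers.

```python
import numpy as np

def ocba_fractions(ybar, s2):
    """OCBA-style allocation fractions based on the standard ratios (Chen et al. 2000)."""
    k, b = len(ybar), int(np.argmax(ybar))
    gap = ybar[b] - ybar                                  # optimality gaps (zero for the best)
    others = [i for i in range(k) if i != b]
    ratio = np.zeros(k)
    ref = others[0]                                       # express ratios relative to one system
    for i in others:
        ratio[i] = (s2[i] / gap[i] ** 2) / (s2[ref] / gap[ref] ** 2)
    ratio[b] = np.sqrt(s2[b] * np.sum(ratio[others] ** 2 / s2[others]))
    return ratio / ratio.sum()

def fixed_budget_loop(mu, n0=10, Delta=20, budget=2000, seed=7):
    rng = np.random.default_rng(seed)
    k = len(mu)
    data = [list(rng.normal(mu[i], 1.0, size=n0)) for i in range(k)]   # Stage 1
    spent = n0 * k
    while spent + Delta <= budget:
        ybar = np.array([np.mean(d) for d in data])
        s2 = np.array([np.var(d, ddof=1) for d in data])
        frac = ocba_fractions(ybar, s2)                   # sampling rule R
        new = rng.multinomial(Delta, frac)                # split Delta replications by fraction
        for i in range(k):
            data[i].extend(rng.normal(mu[i], 1.0, size=new[i]))
        spent += Delta
    return int(np.argmax([np.mean(d) for d in data]))

print("estimated best system:", fixed_budget_loop(mu=np.array([0.0, 0.2, 0.4, 0.5, 0.55])))
```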

12.5 Parallelization: Efficiency and Validity

Having presented fairly straightforward parallel computational frameworks for existing serial R&S procedures that should preserve the standard assumptions from the original serial procedure, one may wonder, why not simply use these procedures? While such procedures surely can be implemented, they are unlikely to scale well to larger problem instances and to achieve the levels of speedup and efficiency we would like to see from a parallel R&S algorithm. The concepts of speedup and efficiency (or scalability) are defined in Barlas and Kaufman (2015) as follows. Suppose we are handed a parallel algorithm that requires \(t_p\) wall-clock time to be run on p identical processors in parallel, and \(t_s\) wall-clock time to be run on only one of the processors. Then the speedup is defined as the ratio of the sequential time to the parallel time, \( speedup \mathrel {\mathop :}=t_s/t_p\). Given \(p\ge 1\) processors, the efficiency is defined as the scaled speedup, \(\textit{efficiency}\mathrel {\mathop :}= speedup /p = t_s/(p\, t_p)\). Thus speedup gives a measure of how beneficial it is to execute the algorithm in parallel, while efficiency measures the utilization of the available processors. An efficiency of 1 corresponds to linear speedup, in which case the speedup is equal to the number of processors, p. Embarrassingly parallel jobs that are appropriately load-balanced across cores tend to achieve almost linear speedup.

Since embarrassingly parallel implementations achieve almost linear speedup, it seems that two-stage procedures would perform the best in parallel. However, recall that two-stage procedures require more simulation replications, on average, than procedures that have more frequent coupled operations like screening. Thus it may benefit the procedure to introduce more frequent coupled operations. Two potential forms of idleness then arise: (a) the master may be idle waiting for every simulation replication to complete before it performs any coupled operations—especially if simulation replication completion times are random—and (b) if screening and fully coupled operations take significant computational effort on the master, many worker processors must wait for the master to create new jobs.

We may therefore wish to design a procedure that does not require the processors to wait for each other. Unfortunately, the standard assumptions are most easily assured by one-job-at-a-time execution. Estimators can become biased if we do not wait for all parallelized simulation replications to complete; such bias was investigated by Heidelberger (1988) and Glynn and Heidelberger (1990, 1991). Further, serious violations of the standard assumptions can occur if jobs are executed in parallel but output data are used as available and without enforcing conformance with single-processor execution. These violations include the following, as described in Luo et al. (2015), Ni et al. (2013, 2017).

Random sample size:

The number of observations from system i when the nth overall observation is obtained may be random if job execution time is variable.

Not i.i.d.:

The \(n_i\) observations from system i may not be i.i.d. if the order in which jobs complete is not the order in which they were dispatched, and there is dependence between the returned value and the execution time.

Dependence across systems:

A difficult-to-characterize dependence across systems’ outputs can be induced if elimination of system i by system \(i'\) frees processors that affect the number of observations obtained from other systems.

These issues suggest that we must employ output coordination strategies that ensure all calculations \((\mathcal {P}_j, \mathcal {C}_j)\) across jobs j are executed as they would be if there were only a single processor. However, this still leads to a potential degradation of efficiency and speedup from the two forms of idleness: the master waiting for simulation replications, and the workers waiting for the master's calculations.

Based on this analysis, efficient parallel R&S requires procedures with one or more of the following characteristics: (a) they implement careful load balancing to retain the standard assumptions without significant idling and overwhelming communication; (b) they are valid under weaker assumptions than the standard ones; or (c) they use a combination of these strategies. We discuss existing parallel R&S procedures of these types in the next section.

12.6 Existing Parallel Ranking and Selection Procedures

Existing parallel R&S procedures overcome some of the shortcomings of the naïve parallelization of existing serial R&S procedures. We now discuss the state of the art in both fixed-precision and fixed-budget parallel ranking and selection procedures.

12.6.1 Parallel Fixed-Precision Procedures

We describe four parallel fixed-precision procedures and formulate each procedure in terms of the types of jobs created and deployed to the workers. First, Luo and Hong (2011) and Luo et al. (2015) extend the \(\mathcal {KN}\) procedure (Kim and Nelson 2001, 2006a), which provides a PCS guarantee, to the parallel setting in two distinct ways: a conservative vector-filling procedure (VFP) that strictly enforces the standard assumptions, and an aggressive asymptotic parallel selection (APS) procedure that is valid under weaker assumptions. Both algorithms resemble Algorithm 2 in that they use elimination at every stage; their key difference is in how they define the completion of a stage. Then, we discuss a simple divide-and-conquer approach by Chen (2005) for when the number of processors is small. Finally, a substantial extension of the divide-and-conquer approach is provided by the good selection procedure (GSP) of Ni et al. (2017), which provides a PGS guarantee.

Of these procedures, we highlight two for their strategies related to enhancing efficiency and maintaining validity. First, APS never allows the master to idle waiting for simulation replications, but maintains its validity under conditions weaker than the standard ones. Second, GSP maintains validity under the standard assumptions, but performs careful load balancing to maintain efficiency.

12.6.1.1 VFP: Vector-Filling Procedure

In VFP, the master creates/executes three types of jobs:

  1.

    Initialization jobs:

    $$ \mathcal {J}_0 = \left[ \{(1, n_0), (\emptyset )\}, \{(2, n_0), (\emptyset )\} \ldots , \{(k, n_0), (\emptyset )\}, \{(\emptyset ), (\mathcal {P}_0, \mathcal {C}_0)\} \right] $$

    where the set \(\mathcal {P}_0\) includes the k preceding simulation jobs, and the calculations \(\mathcal {C}_0\) include computing the variance of all pairwise differences, making pairwise comparisons of the sample means of all k systems, and possibly eliminating some systems.

  2.

    Round-robin simulation jobs: Conceptually, there is an infinite set of sets of simulation jobs

    $$ \mathcal {J}_\ell = \left[ \{(1, 1), (\emptyset )\}, \{(2, 1), (\emptyset )\} \ldots , \{(k, 1), (\emptyset )\}\right] , \ell = 1,2,\ldots $$

    that obtain one additional replication from each system. However, if at stage \(\ell ^\prime \) system \(i^\prime \) is eliminated, then all \(\{(i^\prime , 1), (\emptyset )\}\) jobs are eliminated from the unexecuted simulation job set. Upon completion of a simulation job, a worker pulls the next simulation job in the sequence to execute.

  3.

    Elimination jobs: Stage \(\ell \) is defined by an elimination job \(\{(\emptyset ), (\mathcal {P}_\ell , \mathcal {C}_\ell )\}\), where \(\mathcal {P}_\ell \) contains all simulation jobs \(\mathcal {J}_\ell \), and \(\mathcal {C}_\ell \) performs pairwise comparisons of all systems that have not been eliminated at an earlier stage. The elimination jobs are executed by the master.

The VFP terminates when there is only one system that has not been eliminated. The term “vector filling” is appropriate because the VFP enforces the standard assumptions by associating each simulation output with its job set \(\mathcal {J}_\ell \), and only performing a full coupling for stage \(\ell \) when all jobs in \(\mathcal {J}_\ell \) have completed. For this reason, outputs from later job sets, say \(\mathcal {J}_{\ell +1}\), that complete before jobs in \(\mathcal {J}_\ell \) must be held in a vector for later elimination calculations.

12.6.1.2 APS: Asymptotic Parallel Selection

The APS procedure is superficially similar to the VFP, but a small change makes its computational profile quite different.

  1.

    Initialization jobs:

    $$ \mathcal {J}_0 = \left[ \{(1, n_0), (\emptyset )\}, \{(2, n_0), (\emptyset )\} \ldots , \{(k, n_0), (\emptyset )\}, \{(\emptyset ), (\mathcal {P}_0, \mathcal {C}_0)\} \right] $$

    where the set \(\mathcal {P}_0\) includes the k preceding simulation jobs, and the calculations \(\mathcal {C}_0\) include computing the marginal variance of each of the k systems, making pairwise comparisons of the sample means of all k systems, and possibly eliminating some systems. (Apart from the variance calculation, these initialization jobs have the same structure as those in VFP.)

  2.

    Round-robin simulation jobs: Conceptually, there is an infinite set of sets of simulation jobs

    $$ \mathcal {J}_\ell = \left[ \{(1, 1), (\emptyset )\}, \{(2, 1), (\emptyset )\} \ldots , \{(k, 1), (\emptyset )\}, \{(\mathrm {phantom}, 0), (\emptyset )\} \right] , \ell = 1,2,\ldots $$

    that obtain one additional replication from each real system, and no replications from a "phantom" system. Again, if at stage \(\ell ^\prime \) real system \(i^\prime \) is eliminated, then all \(\{(i^\prime , 1), (\emptyset )\}\) jobs are eliminated from the unexecuted simulation job set. Upon completion of a simulation job, a worker pulls the next simulation job in the sequence to execute, which could be a phantom.

  3.

    Elimination jobs: Stage \(\ell \) is defined by an elimination job \(\{(\emptyset ), (\mathcal {P}_\ell , \mathcal {C}_\ell )\}\), where \(\mathcal {P}_\ell \) is the \(\ell \)th phantom job. The calculation \(\mathcal {C}_\ell \) updates marginal variances and performs pairwise comparisons of all systems that have not been eliminated at an earlier stage using all available simulation output data. The elimination jobs are executed by the master.

The APS procedure defines a stage as the completion of a phantom job, but otherwise makes no attempt to process simulation jobs in any order. Thus, it is aggressive in that the master never idles waiting for a particular real simulation job to complete, but it is subject to all of the violations of standard assumptions described in Sect. 12.5. The validity of APS is asymptotic, as \(\delta \rightarrow 0\), with the key insight being that, because there are \(p < \infty \) processors, the simulation jobs can only be out of order by p jobs, which is asymptotically negligible.

12.6.1.3 Simple Divide-and-Conquer

An early paper by Chen (2005) describes a simple approach that is sensible when the number of processors p is small; GSP below can be considered a substantial extension of this idea. There are two types of jobs:

  1.

    Group R&S jobs: The k systems are divided as evenly as possible into p nonoverlapping groups of systems, say \(\mathcal {G}_1, \mathcal {G}_2, \ldots \mathcal {G}_p\), and p jobs are formed

    $$ J_j = \{(\mathcal {G}_j, \varDelta _j), (\mathcal {G}_j, \mathcal {C}_j)\},\ j = 1,2,\ldots , p $$

    where each job j is a complete R&S procedure that returns a group-best selected system \(\widehat{i}_j\) along with its accumulated output data.

  2.

    Final R&S job: Let \(\mathcal {Q}= \{\widehat{i}_1, \widehat{i}_2,\ldots ,\widehat{i}_p\}\), the group bests. Then the final job is

    $$ J_{p+1} = \{(\mathcal {Q}, \varDelta _{p+1} ), (\mathcal {Q}, \mathcal {C}_{p+1})\} $$

    which performs an R&S procedure on the group-best systems \(\mathcal {Q}\), starting with their previously accumulated data, where \(\mathcal {C}_{p+1}\) computes the sample means and selects the best.

Chen (2005) suggests some specific R&S procedures for each type of job, but the framework is flexible. The simplicity of this strategy is appealing, but it will lose effectiveness when \(k \gg p\) so that the group R&S jobs themselves are challenging.
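
A compact sketch of this divide-and-conquer strategy, with the naive "fixed sample size, pick the largest sample mean" rule standing in for the group and final R&S procedures that Chen (2005) actually plugs in; the oracle and sample sizes are arbitrary assumptions.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def group_rs(group, n, seed):
    """Group R&S job: simulate each system in the group and return the group best with its data."""
    rng = np.random.default_rng(seed)
    data = {int(i): rng.normal(0.01 * i, 1.0, size=n) for i in group}   # dummy oracle (assumption)
    best = max(data, key=lambda i: data[i].mean())
    return best, data[best]

if __name__ == "__main__":
    k, p, n = 100, 4, 200
    groups = np.array_split(np.arange(k), p)            # divide the k systems among p workers
    with ProcessPoolExecutor(max_workers=p) as ex:
        winners = list(ex.map(group_rs, groups, [n] * p, range(p)))
    # Final R&S job on the p group bests, reusing their accumulated data plus a second batch.
    rng = np.random.default_rng(99)
    final_means = {}
    for i, data in winners:
        more = rng.normal(0.01 * i, 1.0, size=n)
        final_means[i] = np.concatenate([data, more]).mean()
    print("selected best:", max(final_means, key=final_means.get))
```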

12.6.1.4 GSP: Good Selection Procedure

The GSP procedure of Ni et al. (2017) (also see Ni et al. 2013, 2014, 2015) provides a PGS guarantee, instead of the usual PCS guarantee, under the standard output assumptions. GSP exhibits good speedup and efficiency using careful load-balancing and reduced computation. Several key strategies of GSP include: (a) distributing screening tasks to the workers in a divide-and-conquer fashion to avoid overwhelming the master with screening calculations; (b) using only a reduced number of pairwise comparisons instead of completing all pairwise comparisons; and (c) carefully constructing load-balanced jobs of large-enough size to prevent overwhelming the master with communication. As a result, when implemented in a high-performance computing (HPC) environment in C with MPI, the master is idle most of the time. However, because it is mostly idle, the master is usually ready to communicate, which helps ensure that the workers are rarely idle.

GSP has three stages and one “phase,” which is a sequential portion of the algorithm containing multiple stages. GSP’s stages and phases are: an optional load-balancing stage; an initialization stage with screening; a sequential phase that contains multiple stages and is somewhat similar in structure to Algorithm 2 after initialization; and a Rinott stage, similar in structure to Algorithm 2. The sequential phase is intended to harness the efficiency of sequential screening to create a subset of contender systems likely to contain the best. In the Rinott stage, the appropriate sample sizes for the remaining systems are calculated, and the simulations for remaining competitive systems are completed in an embarrassingly parallel fashion.

In this section, we assume we have p worker processors, indexed 1 through p, in addition to the master, which we index as the zeroth processor. In each stage or phase, the master creates the following types of jobs:

  1.

    Optional load-balancing jobs: The master randomly permutes the systems in \(\mathcal {S}\) and assigns an approximately equal number of systems to groups \(\mathcal {G}_0^1,\ldots ,\mathcal {G}_0^p\), for each processor. Then the master creates jobs

    $$\begin{aligned} \mathcal {J}_{0}=\{(\mathcal {G}_0^w,n^*_0),(\emptyset ,\{\bar{T}_i: i\in \mathcal {G}_0^w\})\}_{w=1}^p, \end{aligned}$$

    where \(\bar{T}_i\) is the average simulation completion time across all replications \(n^*_0\) from simulating system i. After calculating statistics \(\bar{T}_i\), the simulation output is thrown away, due to potential dependence between the output random variable \(Y_{ir}\) and the simulation replication completion time \(T_{ir}\).

  2.

    Initialization jobs: Using information from the optional load-balancing step if available, the master partitions the systems in \(\mathcal {S}\) into load-balanced simulation groups \(\mathcal {G}_1^1,\ldots ,\mathcal {G}_1^p\) for each processor. Then the master creates jobs

    $$\begin{aligned} \mathcal {J}_{1}=\{(\mathcal {G}_1^w,n_0),(\emptyset ,\{(\bar{Y}_{i}(n_0),S^2_i(n_0),\mathcal {C}_1):i\in \mathcal {G}_1^w\})\}_{w=1}^p, \end{aligned}$$

    where \(\mathcal {C}_1\) is a screening calculation that only reports the surviving systems and their sufficient statistics to the master. The master updates the surviving systems \(\mathcal {Q}\).

  3.

    Sequential phase jobs: The master divides the systems into approximately load-balanced screening groups \(\mathcal {G}_2^1,\ldots ,\mathcal {G}_2^p\) using rule \(\mathfrak R^{\text { GSP}}_1\), so that each processor always screens the same set of systems. The master also uses rule \(\mathfrak R^{\text { GSP}}_2\) to determine an appropriate "batch size" \(b_i\) of simulation replications to obtain from each system \(i\in \mathcal {Q}\) in each simulation job, so that the master is not overwhelmed with communication. The sequential phase ends when a pre-determined maximum number of batches has been simulated, or when \(|\mathcal {Q}|=1\).

    a.

      Simulation jobs: The master creates and maintains an ordered list of batched simulation jobs for each system \(i\in \mathcal {Q}\). Whenever a worker becomes idle and the master indicates that some systems in its screening group are not ready for screening, the worker requests the next simulation job in the list, for any system \(i\in \mathcal {Q}\). For each system still in contention \(i\in \mathcal {Q}\), the \(\nu \)th simulation job is

      $$\begin{aligned} J_{i,\nu }=\{(i,b_i),(\emptyset , \bar{Y}_i(b_i))\}, \nu =1,2,\ldots . \end{aligned}$$
    b.

      Within-group and best-across-processor screening jobs: Whenever a processor becomes idle and the \(\nu \)th simulation batch has completed for all systems in its screening group, the processor pulls the “screening job” for its group from the master,

      $$\begin{aligned} J^w_{\nu }=\{\emptyset , (\{J_{i,\nu }:i\in \mathcal {G}_2^w\},\mathcal {C}^w_\nu )\}, \end{aligned}$$

      where \(\mathcal {C}^w_\nu \) is an all-pairwise screening calculation within the group \(\mathcal {G}_2^w\), as well as against the best systems from the other screening groups that have completed batch \(\nu \). The processor then reports the indices of eliminated systems to the master, who updates the set of systems still in contention, \(\mathcal {Q}\). Note that the \(\nu \)th screening must occur before the \((\nu +1)\)th screening, and so on. Per Definition 12.3, each within-group screening that uses the best systems from the screening groups on all the other processors constitutes the end of a stage.

  4.

    Rinott stage jobs: If \(|\mathcal {Q}|>1\), the master uses a rule \(\mathfrak R\) to determine the Rinott stage sample sizes \(N_{i,4}\) for all remaining systems \(i\in \mathcal {Q}\). Let \(N_i\) be the total number of simulation replications observed from each system \(i\in \mathcal {Q}\) so far before the Rinott stage, and define \(N_{i,4}^+\mathrel {\mathop :}=\max \{0,N_{i,4}-N_i\}\) as the number of additional simulation replications required from system i. The master then arranges the required additional simulation replications for each system into load-balanced “batched” jobs; for ease of exposition, we omit the batching notation in this stage. Then for all \(i\in \mathcal {Q}\) such that \(N_{i,4}^+>0\), the master creates the jobs

    $$\begin{aligned} J_{4,i} = \{(i, N_{i,4}^+),(\emptyset ,\bar{Y}_i(N_{i,4}))\}. \end{aligned}$$

    After all simulation replications terminate, the master updates the sample means with the latest data and returns the estimated best system \({\hat{K}}\).

12.6.2 Parallel Fixed-Budget Procedures

Luo et al. (2000) is the first reported effort to parallelize an R&S procedure, specifically OCBA in the fixed-budget setting. Their base algorithm resembles Algorithm 3, and they assume a master–worker environment with a small number of workers (\(p \le 3\) in their experiments).

The master creates/executes three types of jobs:

  1.

    Initialization jobs:

    $$ \left[ \{(1, n_0), (\emptyset )\}, \{(2, n_0), (\emptyset )\} \ldots , \{(k, n_0), (\emptyset )\}, \{(\emptyset ), (\mathcal {P}_0, \mathcal {C}_0)\} \right] $$

    where the set \(\mathcal {P}_0\) includes the k preceding simulation jobs, and the calculations \(\mathcal {C}_0\) include computing the marginal sample means and variance of the k systems.

  2.

    Simulation jobs: In the \(\ell \)th stage, p jobs \(\{(i_j, \varDelta ), (\emptyset )\}\) for \(j=1,2,\ldots ,p\) are created, where \(i_j \in \{1,2,\ldots ,k\}\) denote p distinct systems, each allocated the same number of replications, \(\varDelta \). These jobs are executed by the p workers in parallel.

  3.

    OCBA jobs: \(\{(\emptyset ), (\mathcal {P}_\ell , \mathcal {C}_\ell )\}\), where \(\mathcal {P}_\ell \) contains all of the simulation jobs from the \(\ell \)th simulation stage, and \(\mathcal {C}_\ell \) performs the OCBA optimization to find the p systems for whom an allocation of \(\varDelta \) additional replications would most rapidly increase an approximate posterior PCS expression. Simulation jobs are then created for these p systems.

By having the OCBA job wait for the return of all of the ongoing simulation jobs, this algorithm enforces the single-processor assumptions behind OCBA at each stage. As noted by the authors, there is a loss of statistical efficiency in simulating the top-p OCBA systems at each stage, rather than simulating the single most promising system and then reevaluating, but there is a gain in computational efficiency. The algorithm terminates when a fixed number-of-replications budget is expended. A related paper by Yoo et al. (2009) also applies OCBA in a parallel search setting where not all systems are expected to be simulated.

12.6.3 Available Implementations of Parallel R&S Procedures

To the best of our knowledge, only one commercial simulation product, Simio (http://www.simio.com), has implemented R&S procedures that exploit parallel computing. Simio has implemented two fixed-precision procedures: \(\mathcal {KN}\) (Kim and Nelson 2001, 2006a) which uses multiple processors on a local PC, and GSP (Ni et al. 2017) which is specifically designed to use high-performance or cloud computing. \(\mathcal {KN}\) gains efficiency by obtaining replications in parallel; in every other sense it is the single-processor algorithm and it implements full synchronization at every stage.

There are also public code repositories that contain parallel versions of R&S procedures. In this paragraph, the citations provide links to code repositories that are publicly available at the time of writing. GSP has been implemented in MapReduce (Ni 2015a), MPI (Ni 2015b), and Spark (Ni 2015c). Code for a parallel version of OCBA is available (Li 2017). The simulation optimization community repository at http://www.simopt.org (Henderson 2016) also contains test problems for a variety of problem types, as well as an algorithms library with publicly available code.

12.7 A Future Research Agenda

Effective and efficient parallel R&S procedures of the future seem likely to be obtained by a careful coordination of a number of ideas. Here is a part of the roadmap as we see it.

Assignment of jobs to processors is clearly a type of stochastic parallel-machine scheduling problem as addressed by the operations research literature (see for instance Pinedo 2015). The objective in such problems is often to minimize makespan, which is analogous to our objective in the fixed-precision formulation, and sequencing constraints are similar to our dependence of certain computations on the completion of particular jobs, \((\mathcal {P}_j, \mathcal {C}_j)\). A key difference is that the jobs that need scheduling in parallel R&S may evolve based on the simulation outputs obtained from earlier jobs, rather than being all available in advance or arriving according to some exogenous stochastic process. Nevertheless, this is a deep literature whose lessons should not be ignored.

Strategies that avoid full coupling seem critical, as the number of all pairwise comparisons grows as \(O(k^2)\). Thus, as k increases, it becomes computationally prudent to simulate more outputs than are strictly needed for, say, correct selection in order to avoid coupling. This can be done from at least two directions:

  1.

    Distributed screening: Couplings of \(k^\prime \ll k\) systems to screen out inferior systems and pass competitors to full couplings, thereby reducing the comparisons to \(O((k^\prime )^2)\).

  2.

    Distributed killers: Obtaining a high-precision estimate of an apparently good solution and distributing it to all systems, or to groups of systems, to screen out inferior ones; this type of screening is O(k), as sketched after this list.
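
The following sketch illustrates the second ("distributed killer") idea: a high-precision estimate of an incumbent's mean is broadcast to the workers, and each worker screens its own block of systems against that single standard, so only O(k) comparisons are made. The Welch-style threshold and the known-variance simplification are for illustration only and carry no formal guarantee.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def screen_group(group_stats, incumbent, z=2.0):
    """Keep only systems whose sample mean is not clearly below the incumbent's."""
    inc_mean, inc_var, inc_n = incumbent
    survivors = []
    for i, ybar, var, n in group_stats:
        halfwidth = z * np.sqrt(var / n + inc_var / inc_n)   # Welch-style bound (assumption)
        if ybar >= inc_mean - halfwidth:
            survivors.append(i)
    return survivors

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    k, n, p = 1000, 30, 8
    # Cheap statistics for every system (variance treated as known and equal to 1 for simplicity),
    # plus one high-precision estimate of an apparently good incumbent.
    stats = [(i, rng.normal(0.002 * i, 1.0, size=n).mean(), 1.0, n) for i in range(k)]
    incumbent = (rng.normal(0.002 * (k - 1), 1.0, size=10_000).mean(), 1.0, 10_000)
    groups = [stats[w::p] for w in range(p)]                 # one block of systems per worker
    with ProcessPoolExecutor(max_workers=p) as ex:
        kept = [i for part in ex.map(screen_group, groups, [incumbent] * p) for i in part]
    print(f"{len(kept)} of {k} systems survive the O(k) screen")
```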

The fixed-budget formulation, when expressed as a limit on the number of simulation replications, has always been somewhat artificial. A fixed monetary or time budget for parallel computation, on the other hand, is both concrete and relevant. To us, the joint choice of a number of processors p and jobs to execute \(\mathcal {J}\) to minimize expected loss with respect to a monetary budget looks very challenging indeed. We suspect that a strategy that chooses p based on a priori problem characteristics, and then treats it as fixed when optimizing over \(\mathcal {J}\), will be the most productive avenue.

Finally, parallel R&S for very large numbers of systems should cause us to revisit the standard R&S objectives as described in Sects. 12.3.1 and 12.3.2. For very large k a PGS guarantee seems more relevant and easier to obtain than a PCS guarantee, as it seems likely there are many close competitors. More critically, any objective that returns a single system \(\widehat{K}\) without additional inference about the others seems questionable. Consider an alternative objective:

Suppose, based on previous experience with similar problems, a known standard \(\mu ^\star \) for "good" performance can be established. Finding, with high probability, the subset of systems with \(\mu _i \ge \mu ^\star \) is a fully uncoupled problem that is embarrassingly parallel. A related approach by Singham and Szechtman (2016) defines inclusion of inferior systems in the subset as a "false discovery" and sets as the objective bounding the false discovery rate. In terms of both conservatism of the inference and growth of computation, these ideas scale better than the traditional objectives.

12.8 WSC 2017

At the time of writing, we are aware of at least one paper on parallel R&S under review for WSC 2017. Thus parallel R&S continues to be an active research area at WSC.