1 Introduction

TensorFlow [1] is one of the most used machine learning (ML) frameworks in industry [10] and shares similar functionality with other solutions such as PyTorch [19] or MXNet [5]. While TensorFlow supports different types of ML applications, this paper focuses on supervised learning because of the two phases that characterize its lifecycle: training and inference. In the former phase, algorithms like logistic regression, decision trees, and deep neural networks create prediction models starting from known input-output pairs (e.g., pictures and the objects they contain), called the training set. In the latter phase, the generated prediction models are used as oracles to infer results on new, unknown inputs. The first phase makes these applications batch ones, while the second phase requires that they be interactive.

Both phases are characterized by highly parallel operations (e.g., matrix calculus) that can exploit multi-core architectures. TensorFlow eases the use of multi-core CPUs and also of GPUs, which provide hundreds of cores and very fast executions. Oftentimes, these applications are executed in the cloud, where virtual machines (VMs) equipped with GPUs and dedicated execution frameworks can easily be rented from many cloud providers.

TensorFlow (similarly to other ML frameworks) does not allow users to define constraints on response times (Service Level Agreements, or SLAs) for these applications, and resource management is driven by user experience or by simple default policies that do not take actual application needs into account. Training calls for deadlines, that is, constraints on the maximum span of a batch computation [21], while inference calls for average response times, computed over a number of subsequent invocations within a predefined time window.

Several approaches in the literature focus on the resource management of ML training [3, 12], while the inference phase calls for new studies and approaches. Existing solutions for interactive web applications [2, 7] cannot be reused since they do not consider the heterogeneity introduced by GPUs but only different types of virtual machines. CPUs and GPUs are interdependent resources, while different VMs are not: GPUs are faster than CPUs, but they also need CPUs to load and write data and to be activated. Moreover, they have different scaling capabilities: CPUs can be scaled precisely by allocating fractions of cores to single applications, whereas GPUs can only be time-shared among applications. Since GPUs alone are usually not enough to serve realistic workloads, the coordinated use of CPUs and GPUs becomes mandatory to offer reasonable execution times.

On the other hand, solutions that combine the management of CPUs and GPUs target the training phase (or long-lasting computations), focus on scheduling and load-balancing algorithms, and do not consider dynamic resource provisioning [17, 18]. Finally, in inference mode the distributed heterogeneous execution of multiple concurrent ML applications is still not fully supported in TensorFlow (as in other similar tools), and users are required to configure their deployments manually.

This paper presents ROMA, an extension of TensorFlow that eases the deployment and oversees the inference phase of multiple concurrent ML applications running on a shared cluster of nodes that offer both CPUs and GPUs. ROMA manages containerized TensorFlow models, automates their deployment using Kubernetes, a well-known container orchestrator, and allows users to define SLAs as constraints on response times. ROMA enacts control at three different levels. A centralized component exploits heuristics to prioritize the scheduling of application requests on GPUs or CPUs according to their needs. Distributed control-theoretical planners allocate the amount of CPU cores needed by each application by taking into account the boost introduced by GPUs. An intermediate level handles the resource contentions that can arise when the system saturates.

The evaluation, based on four real-world applications, shows that ROMA: i) enables the distributed concurrent execution of multiple applications on heterogeneous resources, ii) minimizes the number of SLA violations (a \(75\%\) reduction) compared to static and rule-based solutions and to a simplified control-theoretic approach, and iii) optimizes the use of cluster resources by avoiding unneeded allocations (\(24\%\) resource saving on average).

The rest of the paper is organized as follows. Section 2 introduces ROMA, its architecture and deployment model. Section 3 presents how the schedulers work, and Sect. 4 explains the employed control-theoretical planners. Section 5 shows the empirical evaluation we carried out to assess ROMA. Section 6 discusses the related work and Sect. 7 concludes the paper.

2 ROMA

ROMA is a comprehensive resource management solution that eases the deployment and operation of multiple interactive ML applications. ROMA can be useful both to users interested in running their ML applications and to service providers. In the former case, ROMA helps users manage resources efficiently and meet set response times. In the latter case, ROMA allows the service provider to allocate fewer resources to each application and offer a higher-level solution to users (ML as-a-service).

ROMA is an extension of TensorFlow, but it can easily be integrated into other ML platforms. TensorFlow, as other similar frameworks, does not provide any dedicated support to distribute the inference of new results on trained models, nor does it specifically take concurrent executions into account. An extension called TensorFlow Serving (TF Serving for brevity) lets users expose a trained model by means of a built-in web server and a dedicated REST API, but distributed deployment is not supported. ROMA wraps TF Serving instances into containers using Docker. Docker also provides means to allocate and share CPU cores among multiple processes through CPU quotas. GPUs can be mounted on Docker containers by using external tools, such as the NVIDIA Container Toolkit.
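As an illustration, the following minimal sketch submits an inference request to a running TF Serving container through its standard REST API; the model name, port, and input shape are placeholders, not part of ROMA.

```python
import requests

# Minimal sketch: submit one inference request to a TF Serving container.
# Assumptions: the container exposes the standard REST API on port 8501 and
# serves a model named "resnet"; "instances" carries the model-specific input.
SERVING_URL = "http://localhost:8501/v1/models/resnet:predict"

def predict(instances):
    # TF Serving expects a JSON body with an "instances" list.
    response = requests.post(SERVING_URL, json={"instances": instances})
    response.raise_for_status()
    return response.json()["predictions"]

if __name__ == "__main__":
    # A single dummy input; real inputs depend on the trained model's signature.
    print(predict([[0.0] * 224]))
```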

The deployment of TF Serving containers is enacted using Kubernetes. Kubernetes manages Pods, that is, groups of co-located containers and volumes, which bind ephemeral containers to persistent data stores. Deployments govern the creation of pods, the number of needed replicas, and how pods can be upgraded and configured. Services enable communication among related pods by adding shared networking, load balancing, and external access. Kubernetes also offers dedicated plugins for AMD and NVIDIA boards (the NVIDIA Container Toolkit is then required) to exploit GPUs [15], but a single GPU cannot be associated with more than one container, and fractions of GPUs cannot be requested (they can only be allocated as complete units).
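For illustration, the sketch below (based on the official Kubernetes Python client; image, names, and namespace are placeholders) builds a Deployment whose container requests one whole GPU through the NVIDIA device plugin resource, reflecting the whole-unit constraint just mentioned.

```python
from kubernetes import client, config

# Sketch: a Deployment that binds one whole NVIDIA GPU to a TF Serving container.
# Assumptions: the NVIDIA device plugin is installed; names are placeholders.
def gpu_deployment(name="model-exec-gpu", image="tensorflow/serving:latest-gpu"):
    container = client.V1Container(
        name=name,
        image=image,
        # GPUs are requested as whole units: "1" is valid, fractions are not.
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": name}),
        template=template,
    )
    return client.V1Deployment(
        api_version="apps/v1", kind="Deployment",
        metadata=client.V1ObjectMeta(name=name), spec=spec,
    )

if __name__ == "__main__":
    config.load_kube_config()  # assumes a reachable cluster
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=gpu_deployment())
```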

Fig. 1. ROMA.

2.1 Architecture

Figure 1 shows the architecture of ROMA while managing three ML models. ROMA uses a centralized node, called dispatcher, and multiple distributed nodes, called workers. Dispatcher allows users to add trained models (applications), receives inference (execution) requests, and uses schedulers to distribute these executions on workers’ devices. Each worker provides one or more devices, that is, at least one CPU and zero or more GPUs.

ROMA deploys model executables, that is, containers wrapping a TF Serving instance loaded with one or more models, as Kubernetes pods onto workers. For each managed model, multiple model executables (i.e., replicas) can be deployed onto different workers to handle intense workloads. Each model executable can be instructed to process a request on CPUs or GPUs. Moreover, model executables are deployed onto workers along with a dedicated control-theory-based controller (CT Controller) in charge of the fine-grained allocation of CPU cores.

Gateway accommodates requests in dedicated execution queues, one for each application (i.e., trained model). Requests are kept in the queues waiting for execution, that is, waiting for a GPU or CPU to become available. Requests are removed from the queues and assigned for execution to model executables by two different schedulers, one for GPUs and one for CPUs. The two schedulers exploit different heuristics to prioritize requests and instruct model executables to process them on either a GPU or a CPU.
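To fix ideas, the per-model queues can be sketched as follows (an illustration, not ROMA's actual code); recording the arrival time of each request allows the schedulers to compute the accumulated queue time used in Sect. 3.

```python
import time
from collections import deque
from dataclasses import dataclass, field

# Illustrative sketch of the gateway's per-model queues (not ROMA's actual code).
@dataclass
class Request:
    payload: dict
    arrival: float = field(default_factory=time.time)

    def queue_time(self) -> float:
        # Accumulated tau_Q for this request, used by the schedulers.
        return time.time() - self.arrival

class Gateway:
    def __init__(self, model_names):
        # One FIFO queue per deployed model (i.e., per application).
        self.queues = {name: deque() for name in model_names}

    def submit(self, model: str, payload: dict) -> None:
        self.queues[model].append(Request(payload))

    def pop(self, model: str):
        q = self.queues[model]
        return q.popleft() if q else None
```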

GPU Scheduler extracts requests from the queue of the model with the greatest difference between expected and measured performance (see Sect. 3) to boost executions. CPU Scheduler works together with CT Controllers. It removes requests from queues by using a fair round-robin policy and instructs the proper model executables to use CPU cores to process them. CT Controllers accelerate or decelerate these executions by continuously modifying the CPU cores allocated to model executables. Their control loop is extremely fast (control period of 1 s) and allocated resources are changed on the fly, without restarting model executables (vertical scalability).

When GPU Scheduler instructs a model executable to process a request by using a GPU, the average time needed to serve requests of that model abruptly decreases. Distributed CT Controllers handle this sudden change and react by decreasing the number of allocated CPU cores. Note that allocated cores may not decrease even when GPUs operate because of other external factors (e.g., workload fluctuations).

Given that multiple CT Controllers work on the same worker node, their combined resource demand can be greater than the actual capacity of the node: a Supervisor deployed onto each worker oversees demands and manages contentions. Collected data on resource demand, contention, and execution times can then be used to deploy new model executables and new workers, but this is out of the scope of this paper. Both schedulers and supervisors exploit lightweight heuristics so as to remain reactive and manage incoming requests properly.

In the case of extremely high workloads, the dispatcher can easily be replicated to accommodate a higher level of parallelism without any changes to the underlying control strategies. In this case, clients connect either directly to one of the available replicas or to an additional load balancer that in turn distributes the traffic to the dispatchers. Each dispatcher can then work independently of the others by only scheduling the portion of traffic it receives. Local CT Controllers just need to be informed of the number of requests executed by the GPUs, without any additional knowledge of the deployment of the other components. Workers can be managed by a single designated dispatcher or shared among multiple ones. In the latter case, the multiple schedulers do not interfere with one another since their algorithms only use application-level performance data that each dispatcher measures locally.

2.2 Deployment

As soon as a user submits a trained model, along with its SLA, ROMA Launcher generates or updates required Kubernetes deployments and services to let the system deploy and manage the model executables.

ROMA uses two strategies to deploy model executables. The user can set the number of to-be-deployed replicas for each model. Replicas can also be added and removed dynamically according to application needs. The placement of model executables can either balance their number on the different nodes or deploy them onto the same worker until a predefined number of replicas is reached. Note that ROMA does not allow one to deploy multiple replicas of the same model on the same worker node for the same device. If model executables need more resources on the fly, CT Controllers take care of it without creating new replicas.

To exploit the different devices, each model executable is bound to a specific device. In particular, given m models selected to be deployed onto a worker node, ROMA provisions: (i) m model executables, each containing one model, bound to the node's CPU(s), (ii) one model executable, containing all models, for each GPU, and (iii) one container that includes the CT Controllers of all models, the Supervisor, and one actuator implemented as a Kubernetes volume. Since the worker depicted in Fig. 1 is assumed to come with two GPUs and manages three models, ROMA deploys six containers on it in total.

This deployment allows ROMA to exploit the means provided by Kubernetes for using GPUs on each model and also to exploit the CPUs when needed. As already said, the Supervisor and the models' CT Controllers manage CPU cores. As for CPUs, ROMA deploys a different container for each model because resources can be allocated to them independently. Since GPUs cannot be shared among multiple containers, nor can their cores be allocated to different models, a single container per GPU with all models is enough. The GPU Scheduler is in charge of electing the model that can exploit the GPU to serve the next inference request (this is done by calling an internal, model-specific TensorFlow Serving endpoint). At each control step, ROMA uses an actuator based on Docker out of Docker (DooD) to provide on-the-fly reconfiguration of running containers. DooD is a volume that provides means to launch Docker commands (e.g., to re-configure a container) from within another container.
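The effect of such an actuator can be sketched with the Docker SDK for Python as follows (a simplified illustration, not ROMA's actual actuator; the container name is a placeholder). Docker encodes fractional cores through the cpu_quota/cpu_period pair, which is what enables the fine-grained, restart-free reallocation mentioned above.

```python
import docker

# Sketch of a DooD-style actuator: reconfigure the CPU share of a running
# model executable without restarting it. The container name is a placeholder.
CPU_PERIOD_US = 100_000  # default CFS period (100 ms)

def set_cpu_cores(container_name: str, cores: float) -> None:
    client = docker.from_env()
    container = client.containers.get(container_name)
    # cpu_quota/cpu_period encode fractional cores: e.g., cores=1.5 -> quota=150000.
    container.update(cpu_period=CPU_PERIOD_US,
                     cpu_quota=int(cores * CPU_PERIOD_US))

# Example: allocate 1.5 cores to a (hypothetical) "skyline-cpu" model executable.
# set_cpu_cores("skyline-cpu", 1.5)
```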

3 Schedulers and Supervisors

The goal of ROMA is to fulfill constraints on response times. While in the following we constrain the average response time, more conservative metrics (e.g., high percentiles) would only require a stricter set point and more resources, and would provide additional tail-latency guarantees. However, our evaluation (see Sect. 5) shows that even by only constraining the average response time, ROMA provides a lower maximum response time than competing approaches (e.g., rule-based ones).

Given a model m, the average response time computed over a given time window w can be formulated as follows.

$$\begin{aligned} \tau _{R_m} = \frac{\sum _{g=1}^{G} (\tau _{Q_g} + \tau _{P_g}) + \sum _{c=1}^{C} (\tau _{Q_c} + \tau _{P_c})}{G+C} \end{aligned}$$
(1)

where G and C are the numbers of requests executed on the GPUs and CPUs, respectively, in w, \(\tau _{Q_i}\) is the time spent by a request i in the queue, while \(\tau _{P_i}\) is the time spent by a GPU or a CPU to process request i. An SLA on \(\tau _{R_m}\) can state that:

$$\begin{aligned} \tau _{R_m} \le \alpha \cdot \tau _{SLA_m} = \tau _{R_m}^{\circ } \end{aligned}$$
(2)

where \(\tau _{SLA_m}\) is the threshold on the response time defined in the SLA for model m and \(\alpha \) is a parameter, ranging between 0 and 1, that defines the set point \(\tau _{R_m}^{\circ }\) for model m. If \(\alpha = 1\), the set point matches \(\tau _{SLA_m}\); lower values are more conservative and let the system tolerate more imprecision.
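As a small numeric illustration (a sketch, not ROMA's code), Eqs. (1) and (2) amount to the following.

```python
# Sketch: windowed average response time (Eq. 1) and set point (Eq. 2).
def avg_response_time(gpu_reqs, cpu_reqs):
    # Each request is a (queue_time, processing_time) pair measured in window w.
    times = [q + p for q, p in gpu_reqs] + [q + p for q, p in cpu_reqs]
    return sum(times) / len(times) if times else 0.0

def set_point(tau_sla: float, alpha: float = 0.8) -> float:
    # alpha in (0, 1]; lower values are more conservative (alpha = 0.8 in Sect. 5).
    return alpha * tau_sla

# Example: an SLA of 1.0 s with alpha = 0.8 yields a set point of 0.8 s.
```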

As already said, ROMA distributes the processing of requests to the different devices in the cluster by means of two dedicated schedulers. Their goal is to select both which request to execute next and on which device. The rationale is that GPU Scheduler always selects the request of the model with the "highest" needs (see below). To complement GPUs, requests are also scheduled for processing on CPUs by means of a round-robin policy where the (non-empty) queues to serve are selected randomly. Note that CPU Scheduler may find the queues empty if the GPUs are fast enough to process the whole workload alone.

GPU Scheduler is activated in an event-based fashion. Function freeGPU (Algorithm 1) is executed as soon as a GPU (parameter gpu) becomes free, that is, at system startup and whenever a GPU completes the execution of a request. In particular, we designed a heuristic that, for each model m, takes into account a weighted average (\(\tau _{W_m}\)) of the measured response times (\(\tau _{R_m}\)) and of the estimated response times of the requests that are waiting in the queues to be processed (\(\tau _{E_m}\)). The estimation is computed by using the accumulated queue time of each request (\(\tau _Q\) in Eq. 1) and the profiled processing time on GPUs \(\tau _{PG_m}\). Parameter \(\beta \), which ranges in the interval [0, 1], defines the weights associated with \(\tau _{R_m}\) and \(\tau _{E_m}\): a higher value of \(\beta \) gives more importance to the requests in the queue and makes the system more responsive to workload bursts. Given the computed weighted response time \(\tau _{W_m}\), the distance from the set point \(\tau _{R_m}^{\circ }\) is computed as \(\epsilon _m\) (lines 21–25). The selected model \(m_S\) is the one with the highest \(\epsilon _m\), and the first request in queue \(m_S\) is processed by gpu using the proper model executable (lines 28–30).
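A simplified rendition of this heuristic, reusing the Request objects sketched in Sect. 2.1, is shown below; it is illustrative only, and the helper names as well as the exact form of the weighted average and of the queue-time estimation are assumptions based on the description above.

```python
# Simplified sketch of Algorithm 1 (freeGPU): pick the model whose weighted
# response time is farthest above its set point and run its oldest request on gpu.
def free_gpu(gpu, queues, measured_rt, profiled_gpu_time, set_points, beta=0.5):
    best_model, best_eps = None, float("-inf")
    for model, queue in queues.items():
        if not queue:
            continue
        # Estimated response time of queued requests: accumulated queue time
        # plus the profiled GPU processing time for this model (an assumption).
        est = sum(r.queue_time() + profiled_gpu_time[model] for r in queue) / len(queue)
        # Weighted average of measured (tau_R) and estimated (tau_E) response times.
        weighted = (1 - beta) * measured_rt[model] + beta * est
        eps = weighted - set_points[model]  # distance from the set point
        if eps > best_eps:
            best_model, best_eps = model, eps
    if best_model is not None:
        request = queues[best_model].popleft()
        gpu.execute(best_model, request)  # forward to the proper model executable
```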

The actual allocation of CPU cores is managed by the CT Controllers associated with the different model executables. For this reason, CPU Scheduler dispatches requests to CPU devices using a round-robin policy: it repeatedly removes a request from a randomly selected queue and schedules it for CPU execution on a randomly selected model executable. This way, the load sent by CPU Scheduler to each model executable is homogeneous, and the burden of managing CPU allocation is handled locally by the CT Controllers. Each worker is associated with a Supervisor in charge of refining the resource allocation computed by CT Controllers in case of contention. At each control step (1 s), a CT Controller computes the amount of CPU cores \(u_{C}\) (core allocation demand) needed by its model executable, which embeds model m, to meet the set response time \(\tau _{R_m}^{\circ }\) (as described in Sect. 4). Each CT Controller computes its \(u_{C}\) independently of the others, that is, they do not communicate.
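The CPU scheduling policy can be summarized by the following sketch (illustrative only; the replica objects and their execute_on_cpu method are placeholders).

```python
import random

# Sketch: one iteration of the CPU Scheduler loop. A request is drawn from a
# randomly chosen non-empty queue and sent to a randomly chosen CPU-bound
# replica (model executable) of that model.
def cpu_schedule_once(queues, cpu_replicas):
    non_empty = [m for m, q in queues.items() if q]
    if not non_empty:
        return  # GPUs may be fast enough to drain all queues on their own
    model = random.choice(non_empty)
    request = queues[model].popleft()
    replica = random.choice(cpu_replicas[model])
    replica.execute_on_cpu(request)
```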

Supervisors use the heuristic shown in Algorithm 2 to compute a feasible core allocation \(u_C'\) for each CT Controller deployed on a worker. First, all the core allocation demands \(u_{C}\) are gathered in a vector \(\mathcal {U}_C\) (lines 1–6). Let MC be the total number of cores provided by the worker and GC the number of CPU cores statically allocated to support GPU execution; the difference between MC and GC is the actual amount of cores (AC) that can be allocated to model executables on that worker (line 7). As mentioned before, GPUs and CPUs are interdependent since the former consume the processing power of the latter to load data in memory and to be activated. Note that if GC is set to 0, GPUs slow down requests running on CPUs; CT Controllers see this as another disturbance that is naturally mitigated by the control logic (described in Sect. 4). Moreover, \(\eta \) is the ratio between AC and the sum of all demanded cores, that is, the sum of all \(u_{C}\) (line 8). Given \(\eta \), each \(u_C'\) is computed as follows. If \(\eta \) is less than 1, the demand cannot be fulfilled since the demanded cores exceed the available ones (under-provisioning); each \(u_C'\) is then computed by multiplying the corresponding \(u_C\) by \(\eta \) (line 12). If \(\eta \) is equal to 1, the amount of demanded cores matches the available ones (AC), and \(u_C' = u_C\). If \(\eta \) is greater than 1, the available cores are over-provisioned; in this case we introduce parameter \(\gamma \) to maximize resource utilization (line 14). The default value (\(\gamma = 0\)) implies that \(u_C' = u_C\); if \(\gamma \) is between 0 and 1, we allocate more cores and obtain more responsive models; \(\gamma = 1\) means that all AC cores are always used. Finally, the state of each CT Controller is updated using \(u_C'\), and the computed core allocation is actuated.
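The core of this heuristic can be rendered with the following sketch (illustrative, not ROMA's code; in particular, the proportional split of the surplus when \(\eta > 1\) is an assumption that is consistent with the behaviour of \(\gamma \) described above).

```python
# Sketch of Algorithm 2: scale the controllers' core demands to the worker capacity.
def supervise(demands, total_cores, gpu_cores, gamma=0.0):
    # demands: {model: u_C requested by its CT Controller}
    available = total_cores - gpu_cores            # AC = MC - GC
    requested = sum(demands.values())
    eta = available / requested if requested > 0 else float("inf")
    allocations = {}
    for model, u_c in demands.items():
        if eta < 1:
            # Under-provisioning: shrink every demand proportionally.
            allocations[model] = u_c * eta
        else:
            # Over-provisioning: optionally hand out the surplus (gamma in [0, 1]);
            # gamma = 0 keeps u_C' = u_C, gamma = 1 uses all AC cores.
            surplus = (available - requested) * (u_c / requested) if requested else 0.0
            allocations[model] = u_c + gamma * surplus
    return allocations
```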

4 Controllers

To design the CT Controller, we need a dynamic model of the relationship between the CPU and GPU allocations (\(u_C\), \(u_G\)) and the response time \(\tau _R\): \(u_C\) and \(u_G\) jointly modify the output rate \(r_o\) from the queue, the input rate \(r_i\) being an exogenous disturbance. CT Controllers do not require any knowledge of the application structure (i.e., of the operations to execute on input data), and the same dynamic model is general enough to support, with proper profiling, different kinds of compute- and GPU-intensive interactive applications (e.g., machine learning inference, scientific calculus, graph-based computations). This is possible because the proposed controllers are grey-box, that is, their model does not include all aspects of the system but just the ones that describe its physics. The employed fast feedback loop (control period equal to 1 s) is in charge of correcting the imperfections of the model at runtime. Here we represent the compound of the above in a simplified manner (yet adequate, as the reported tests will show) as an additive perturbation, and we set:

$$\begin{aligned} \left\{ \begin{array}{rcl} \tau _R(t) &{}=&{} \tau _Q(t)+\tau _P(t) \\ \tau _Q(t) &{}=&{} \frac{\ell (t-\tau _Q(t))}{r_o(t)} \\ \frac{d\ell (t)}{dt} &{}=&{} r_i(t)-r_o(t) \\ r_o(t) &{}=&{} r_{on}\big (u_C(t),u_G(t)\big )+d_o(t) \\ \tau _P(t) &{}=&{} \tau _{Pn}\big (u_C(t),u_G(t)\big )+d_P(t) \end{array} \right. \end{aligned}$$
(3)

where \(\tau _Q\) is the time spent in the queue, \(\tau _P\) is the processing time downstream of the queue, depending on \((u_C,u_G)\) through a nominal relationship \(\tau _{Pn}(\cdot ,\cdot )\) with an additive disturbance \(d_P\), \(r_{on}(\cdot ,\cdot )\) is the \((u_C,u_G)\rightarrow r_o\) relationship in some “nominal” condition, and \(d_o(t)\) is the combined effect of all the disturbances.

Model (3) explains the physics of the system, but it is not suitable as is for control design owing to the contextual presence of a differential equation and an implicit one with delay. It however evidences that, under the above assumptions, response time control boils down to queue length control. From Eq. (3) one notices that (i) at steady state \(r_o\) has to balance \(r_i\), but this can happen for any queue length \(\ell \); hence (ii) a steady-state variation of \(\tau _R\) is obtained by transiently causing an input/output rate imbalance via \(u_C\) and/or \(u_G\), and then restoring the balance once the desired \(\tau _R\) is achieved, since the new queue length divided by the (balanced) through rate gives the necessary \(\tau _Q\), hence \(\tau _R\). We therefore adopt the simplified model:

$$\begin{aligned} \left\{ \begin{array}{rcl} \frac{d\ell (t)}{dt} &{}=&{} r_i(t)-r_o(t) \\ r_o(t) &{}=&{} \mu _C u_c(t) + \mu _G u_G(t) \\ \tau _R(t) &{}=&{} \frac{\ell (t)}{r_o(t)} \end{array} \right. \end{aligned}$$
(4)

where the gains \(\mu _C\) and \(\mu _G\) account for the processing speed of CPUs and GPUs, respectively, and the delay is considered negligible with respect to the control time scale. Linearised in the vicinity of an operating point described by nominal values of the throughput and of the required response time, \(\overline{r}_o\) and \(\overline{\tau }_R\), the compound of the above gives rise to the continuous-time transfer function description:

$$\begin{aligned} \varDelta \tau _R(s) = G_{\tau _RC}(s) \varDelta u_C(s) + G_{\tau _RG}(s) \varDelta u_G(s) \end{aligned}$$
(5)

where uppercase letters denote the Laplace transform of the corresponding lowercase variables and:

$$\begin{aligned} G_{\tau _RC}(s) = -\frac{\mu _C}{\overline{r}_o}\,\frac{1+s\overline{\tau }_R}{s}, \quad G_{\tau _RG}(s) = -\frac{\mu _G}{\overline{r}_o}\,\frac{1+s\overline{\tau }_R}{s} \end{aligned}$$
(6)

where s is the Laplace transform complex variable. Transforming (6) to discrete time, we conclude that a physically grounded \(\mathcal {Z}\)-transform model (denoting by z the corresponding complex variable, i.e., the one-step advance operator) takes the form:

$$\begin{aligned} \varDelta \tau _R(z) = G^*_{\tau _RC}(z) \varDelta u_C(z) + G^*_{\tau _RG}(z) \varDelta u_G(z) \end{aligned}$$
(7)

where

$$\begin{aligned} G^*_{\tau _RC}(z) = -\,k_C\,\frac{z-b}{z-1}, \quad G^*_{\tau _RG}(z) = -\,k_G\,\frac{z-b}{z-1} \end{aligned}$$
(8)
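The parameters appearing in (8) can be identified from profiled data. The following sketch is an illustration (not ROMA's actual identification code) under the assumption that profiling provides sampled sequences of \(\varDelta \tau _R\), \(\varDelta u_C\), and \(\varDelta u_G\) at the control period; it rewrites (8) as a difference equation that is linear in \(k_C\), \(k_G\), \(k_C b\), and \(k_G b\), and estimates these products by least squares.

```python
import numpy as np

# Sketch: estimate k_C, k_G, b of Eq. (8) from profiled sequences sampled at
# the 1 s control period. In difference-equation form, (8) reads
#   dtau[t+1] - dtau[t] = -k_C*du_C[t+1] + k_C*b*du_C[t]
#                         -k_G*du_G[t+1] + k_G*b*du_G[t],
# which is linear in theta = [k_C, k_G, k_C*b, k_G*b].
def fit_model(d_tau_r, d_u_c, d_u_g):
    d_tau_r, d_u_c, d_u_g = map(np.asarray, (d_tau_r, d_u_c, d_u_g))
    y = d_tau_r[1:] - d_tau_r[:-1]
    X = np.column_stack([-d_u_c[1:], -d_u_g[1:], d_u_c[:-1], d_u_g[:-1]])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    k_c, k_g, cc, cg = theta
    b = 0.5 * (cc / k_c + cg / k_g)  # both ratios estimate the common zero b
    return k_c, k_g, b
```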

Parameters \(k_C\), \(k_G\), and b can be obtained online by profiling the applications of interest and fitting measured responses to those of the dynamic model, for instance as in the sketch above. In this work we assume that when a GPU takes on part of the work, represented as a step-like behaviour of \(u_G\), the CPU attempts to restore the required \(\tau _R\) so as to free the GPU as soon as possible. This means requiring that the closed-loop transfer function from \(u_G\) to \(\tau _R\) has a zero in \(z=1\). The said transfer function then becomes:

$$\begin{aligned} F_o(z) = \frac{z-1}{z-p} \end{aligned}$$
(9)

where parameter \(p\in [0,1]\) governs the required response speed: \(p\rightarrow 0\) means a faster response, \(p\rightarrow 1\) a slower one. This gives the controller

$$\begin{aligned} G_c(z) = \frac{(k_C-1)z^2+(2-k_Cp-k_Cb)z+k_Gbp-1}{k_C(z-1)(z-b)} \end{aligned}$$
(10)

i.e., a real PID. However, to further reduce computational complexity, we decided to employ a PI controller, that is,

$$\begin{aligned} G_c(z) = K \frac{z-a}{z-1} \end{aligned}$$
(11)

and prescribe the closed-loop poles to coincide at \(z = q\), where q is interpreted as p above. This is achieved by setting:

$$\begin{aligned} K = \frac{4(a-1)(b-1)}{k(b-a)^2}, \quad a = \frac{(2-b)q-b}{q-2b+1} \end{aligned}$$
(12)

while the presence of integral action ensures zero steady-state errors.
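Putting (11) and (12) together, a discrete-time implementation might look like the following sketch. It is an illustration, not ROMA's code: the gain k is taken to be the CPU-channel gain identified above, the error is taken as the measured response time minus the set point so that a model running above its set point receives more cores (consistent with the negative plant gain in (8)), and the saturation handled by the Supervisor is omitted.

```python
# Sketch of the PI controller of Eq. (11), tuned as in Eq. (12).
def tune_pi(k, b, q):
    # Place both closed-loop poles at z = q (q plays the role of p above).
    a = ((2 - b) * q - b) / (q - 2 * b + 1)
    K = 4 * (a - 1) * (b - 1) / (k * (b - a) ** 2)
    return K, a

class PIController:
    def __init__(self, K, a, set_point):
        self.K, self.a, self.set_point = K, a, set_point
        self.prev_error = 0.0
        self.u = 0.0  # current CPU-core demand u_C

    def step(self, measured_tau_r):
        # Incremental form of K*(z - a)/(z - 1):
        #   u(k) = u(k-1) + K*(e(k) - a*e(k-1)).
        error = measured_tau_r - self.set_point
        self.u += self.K * (error - self.a * self.prev_error)
        self.prev_error = error
        return max(self.u, 0.0)  # a core demand cannot be negative
```

The integral action embedded in the pole at \(z=1\) is what produces the zero steady-state error mentioned above; the Supervisor of Sect. 3 then clips the demands returned by step to the worker capacity.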

5 Evaluation

This section describes the experiments we carried out to evaluate the feasibility and benefits of ROMA.

To run the experiments, we deployed ROMA on a cluster of three virtual machines on Microsoft Azure: one VM of type HB60rs, with a 60-core CPU and 240 GB of memory, for the dispatcher, and two VMs of type NV6, each equipped with an NVIDIA Tesla M60 GPU, a 6-core CPU, and 56 GB of memory, as worker nodes. We also used an additional instance of type HB60rs to generate the client workload.

The experiments exploited four existing ML applications: Skyline Extraction [9], ResNet [11], GoogLeNet [20], and VGG16 [22]. The first application uses a combination of computer-vision algorithms to extract the horizon skyline from a set of images, while the others perform classification tasks. In particular, ResNet exploits a residual neural network, while GoogLeNet (G.Net) and VGG16 employ two different deep convolutional neural networks. All four models were trained and then used in inference mode with companion sample images.

ROMA (\(\alpha = 0.8\), \(\beta = 0.5\), and \(\gamma = 0\)) was set to use a static deployment strategy, and we deployed all applications onto the two worker nodes. We statically reserved \(GC = 0\) cores for the GPUs, meaning that the additional disturbances introduced by the usage of CPUs for loading and operating GPUs were handled by the CT Controllers. These controllers were manually tuned: \(K = 0.15\) and \(a = 0.11\).

We compared ROMA against four exemplar systems that we implemented by using a different heuristic for the GPU Scheduler and/or another type of controller instead of the CT Controllers. All these systems used a round-robin scheduler (RR) for GPUs. System RR+rules used a rule-based controller that allocated one additional CPU core to a model executable if the response time was greater than or equal to \(0.8*SLA\), and de-allocated a core if the response time was less than or equal to \(0.2*SLA\); its control period was set to 15 s. System RR+CT used the same CT Controllers as ROMA for managing CPU resources, with a control period of 1 s. System RR+max statically allocated all cores (6 per worker), fairly distributed among applications. System RR+min statically allocated a minimum amount of cores (1 per worker), equally distributed among applications.
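For reference, a minimal sketch of the RR+rules allocation logic follows (illustrative only; the lower and upper bounds on the allocation are assumptions, not part of the description above).

```python
# Sketch of the RR+rules baseline: threshold rules evaluated every 15 s.
def rule_based_step(allocated_cores, response_time, sla, max_cores):
    if response_time >= 0.8 * sla:
        return min(allocated_cores + 1, max_cores)  # scale up by one core
    if response_time <= 0.2 * sla:
        return max(allocated_cores - 1, 1)          # scale down by one core
    return allocated_cores
```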

We tested the systems by running two concurrent applications at a time (test 2-Apps): i) GoogLeNet and Skyline Extraction, and ii) ResNet and VGG16. We repeated each test 3 times for a total of 60 executions (5 systems, 4 applications, and 3 executions). Table 1 shows the SLAs (in seconds) and the workloads (in incoming requests per second) used in the experiments. Each experiment lasted 300 s, and the workload of each application was changed with a different step (shown in column Workload) every 37 s (8 times). Table 2 shows the average (\(\tau _{R}\)) and maximum response times (\(\tau _{R_{M}}\)) in seconds, along with the standard deviation (\(\tau _{R_{\sigma }}\)), the number of SLA violations (V), and the amount of allocated CPU resources (Res) measured in \(cores*seconds\).

Table 1. SLA and workloads.
Table 2. Comparison.

With the first application pair, ROMA produced 0 violations and a resource allocation equal to 748 (lower is better). RR+rules allocated 1.5 times the resources used by ROMA without avoiding SLA violations and obtained longer response times. RR+CT performed similarly to ROMA, but ROMA allocated GPUs in a smarter way (i.e., lower average and maximum response times), thus relying on CPUs less frequently and saving a greater amount of resources. The allocation of all cores made RR+max the fastest system, but at the cost of using more than 5 times the CPU resources utilized by ROMA. Finally, RR+min consumed fewer resources than the other systems at the cost of 40 SLA violations.

With the second application pair, ROMA obtained 10 SLA violations and a resource allocation of 1633 \(cores*seconds\). Once again ROMA outperformed the other systems, showing a better balance between violations and resource usage, and lower average and maximum response times. Given the presence of VGG16, the use of GPUs was fundamental to serve the incoming workload. Results show that a round-robin scheduling of GPUs was not sufficient to avoid SLA violations even when all the CPU cores were statically allocated (RR+max produced 40 violations). Compared to ROMA, RR+rules showed a higher response time and 120 SLA violations while allocating almost the same amount of CPU resources. Even with a smarter allocation of CPUs (RR+CT), the obtained response time was almost double the one measured with ROMA, and the number of SLA violations was 70. RR+min violated the SLAs 180 times and also presented an average response time greater than 1.5 s (almost three times the set SLAs).

Fig. 2. System experiments - All Apps.

As a final experiment, we ran the four applications concurrently (test All Apps) for a total of 60 additional executions (4 applications, 5 systems, 3 repetitions each). Table 2 presents the obtained results, and the charts in Fig. 2 show the response times obtained with ROMA and with RR+rules (the best competitor) using the workloads and SLAs reported in Table 1. ROMA was always able to keep the response time under the SLAs (0 violations), with an overall average response time equal to 0.167 s and a maximum response time of 0.427 s, and it allocated 1767 \(cores*seconds\). In contrast, RR+rules frequently violated the SLAs while executing VGG16 and ResNet, and resulted in slower executions (average and maximum response times equal to 0.409 and 1.913 s, respectively). RR+CT obtained 90 violations and higher response times than ROMA, while RR+max obtained 0 violations but allocated 3600 \(cores*seconds\). The combined use of the heuristic that favors GPU execution for resource-hungry applications and of the control-theory-based CPU allocation made ROMA not only faster but also able to use fewer resources than all the other systems (except RR+min, which violated the SLA 170 times).

6 Related Work

Several solutions deal with the management of heterogeneous resources at the node level, but they do not consider GPUs. For example, the solution presented by Lakew et al. [16] exploits control theory to provision multiple resources dynamically to satisfy SLAs. Similarly to ROMA, they exploit containers and can reconfigure resources dynamically. Farokhi et al. [8] present a fuzzy control approach that coordinates the autoscaling of CPU cores and memory. They show that the coordinated control of multiple resources outperforms the same system with independent controllers.

These approaches manage complementary resources: CPUs use memory (and also disks) to complete a task. In contrast, ROMA exploits competing resources, since a request can be executed on either CPUs or GPUs. This means that ROMA must consider both scheduling and resource provisioning, while the aforementioned works focus only on the latter.

Different approaches focus on the management of GPUs and CPUs. For example, Khadil et al. [13] present OSched, a resource-aware scheduler for OpenCL jobs that aims to maximize the throughput of the hosting infrastructure. Chen et al. [4] propose a solution for improving the performance of MapReduce applications by scheduling map and reduce tasks on CPUs and GPUs using heuristics. They compared their approach with CPU-only and GPU-only versions of the system, obtaining an improvement between 20% and 110%.

Compared to these works, ROMA differs from both the control and the application-domain points of view. First, the mentioned approaches focus on the scheduling of computing tasks on GPUs and CPUs, while ROMA combines both scheduling and fine-grained resource allocation in a comprehensive solution. ROMA's scheduling heuristics cooperate with control-theoretical planners in order to minimize constraint violations while optimizing resource usage. Second, existing solutions focus on the management of GPUs in the context of long-lasting, compute-intensive applications (e.g., machine learning training jobs), while ROMA focuses on interactive ML applications. To the best of our knowledge, ROMA is the first solution that provides an architecture, a deployment model, and a comprehensive resource management approach for ML inference.

7 Conclusions and Future Work

The paper presents ROMA, an extension of TensorFlow that eases the management and operation of ML applications executed in inference mode on a cluster of heterogeneous resources (GPUs and CPUs). ROMA allows users to constrain application execution times and exploits scheduling heuristics and control-theory-based resource provisioning to run applications efficiently. The assessment of the work uses four real-world applications and shows promising results.