1 Introduction

This paper focuses on the problem of inaccurate job runtime estimates as provided by users. Building on existing results  [6, 10, 13], we try to understand the impact that this inaccuracy has on various aspects of job scheduling performance. Importantly, we study whether a technique improving runtime estimates has a significant impact on the system’s behavior. For this purpose, we use a simple runtime predictor which we have developed on our own. This predictor uses historic data to generate runtime estimates for newly arriving jobs. Although we developed it independently in 2018, we have recently learned that it uses an idea similar to a predictor used in the past  [12].

The main contribution of this paper is as follows. First, we demonstrate that even a simple predictor can significantly improve inaccurate job runtime estimates (Sect. 2). Second, we use detailed simulations to analyze the impact of refined estimates on job scheduling performance, namely on the number of backfilled jobs and on job wait times (Sect. 3). Last but not least, in Sect. 4 we analyze in depth whether refined runtime estimates can improve the predictability of system behavior. To achieve this goal we use two different scenarios. In the first one, we analyze the accuracy of job wait time predictions (Sect. 4.1). This scenario focuses on system users, who naturally want to know how long their jobs will have to wait before being processed by the system; here the question is “when will a job start?”. In the second scenario (Sect. 4.2) we analyze the ability of the scheduler to correctly predict (in advance) which node(s) will be selected for each waiting job. In this case, instead of the question “when?” we rather try to answer the question “where?”.

The motivation here comes from jobs requiring either large amounts of data and/or special pre-processing, e.g., an ad hoc, independent local file system. In both cases, the time needed to stage the data and/or set up the file system can cause temporarily low CPU utilization. Our goal is to determine whether it is possible to stage the data and/or deploy the local file system in advance, thus limiting idle CPU time. Obviously, to make this advance staging/setup possible, the scheduler must provide rather accurate advance job allocations (ahead of the actual job start). We illustrate the benefit of this approach in Fig. 1 and describe it in full detail in Sect. 4.2.

Fig. 1. In the normal scenario, CPU cycles are wasted during data staging (top). The considered solution, relying on accurate node predictions, stages data onto the compute node(s) in advance of the job start, reducing idle CPU time (bottom).

2 Job Walltimes, User Estimates and Predictor

In this paper, job walltime (or job runtime) denotes the time it takes to execute the job on the computing node(s). This time is not known in advance. Instead, the user is requested to provide an estimate for each job. This walltime estimate is used as an upper bound by the resource manager, i.e., a job is killed when its actual runtime exceeds the walltime estimate. It is therefore not surprising that in practice these walltime estimates are very inaccurate, being overestimated in order to prevent jobs from being killed  [3, 8]. Moreover, scheduling systems frequently classify jobs according to default runtime limits. For example, there can be different job queues with different maximum job runtime defaults. Out of convenience, many users simply keep these default values. As a result, most jobs in the system use only a few common estimates and therefore “look similar” to the scheduling algorithm (e.g., backfilling).

2.1 Workload Traces

In the remainder of this paper we use four different real-life workload traces that come from the Karlsruhe Institute of Technology in Germany (FH1 and FH2), the Cornell Theory Center (CTC SP2) and the San Diego Supercomputer Center (SDSC SP2). These traces can be obtained from the Parallel Workloads Archive  [2]. We begin our analysis by showing how inaccurate and overestimated the user-based estimates are in all four considered workloads.

This is captured in Fig. 2, which shows the cumulative distribution functions (CDF) of actual runtimes and user-estimated job walltimes (blue and red lines, respectively). Clearly, user-provided walltime estimates are (very) inaccurate and overestimated. Therefore, we introduce a simple walltime predictor, which tries to replace these overly long estimates with more accurate values.

Fig. 2. Cumulative distribution functions (CDF) of actual runtimes, user-estimated walltimes and predicted job walltimes for all four workloads (Color figure online).

2.2 Walltime Predictor and Its Performance

The considered walltime predictor works on a per-user basis, i.e., a new runtime estimate for a given job of a user is computed using information about the previous jobs of that user.

The predictor is an extended version of the predictor used in our previous work  [4]. It measures the ratio of a job’s actual runtime to the user’s estimate (see \(usage_{wall}\) in Formula 1), i.e., it measures to what extent the estimated walltime (\(est\_walltime\)) was actually used. Since the user’s estimate is the upper bound of the job runtime, \(usage_{wall}\) falls between 0.0 and 1.0, representing the relative usage of the requested walltime. In other words, the technique measures by how much a user overestimates a job’s runtime, which is similar to what has been used in  [12].

$$usage_{wall}(job_i) = \frac{runtime(job_i)}{est\_walltime(job_i)}$$
(1)
$$predicted\_wall(job_i) = est\_walltime(job_i) \cdot \max_{i-5 \le k \le i-1} usage_{wall}(job_k)$$
(2)

Once the \(usage_{wall}\) is computed, it is stored to be used in the future, i.e., once a new job of this user arrives in the system. When this happens, the five most recent \(usage_{wall}\) values are considered and their maximum is chosen (see Footnote 1). The job’s walltime estimate is then multiplied by this maximum (see Formula 2) and the resulting \(predicted\_wall\) is the predictor’s output. This represents a conservative strategy, where the predicted walltime is calculated using the known relative accuracy of the user’s recent estimates. By choosing the maximum \(usage_{wall}\) (i.e., choosing the job where the difference between actual and estimated runtime was minimal), this technique aims to minimize the number of cases where the new predicted walltime is underestimated. At the same time, by ignoring older jobs it reflects aging and orients itself more toward the user’s recent workload characteristics. When a job turns out to be underestimated (\(predicted\_wall(job_i) < runtime(job_i)\)), the predictor increases its initial estimate by a factor of two. It does so (over time) until the estimate is sufficient (and the job completes) or until the estimate meets the original user estimate (this is still a hard limit which cannot be exceeded).
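For clarity, the following minimal Python sketch captures the predictor’s logic as described above; the class and method names are our own illustration, not the actual implementation.

```python
from collections import defaultdict, deque

class WalltimePredictor:
    """Per-user walltime predictor (sketch of Formulas 1 and 2)."""

    def __init__(self, history_len=5):
        # Keep only the most recent usage_wall values per user (aging).
        self.history = defaultdict(lambda: deque(maxlen=history_len))

    def record_completion(self, user, runtime, est_walltime):
        # Formula 1: fraction of the requested walltime actually used.
        self.history[user].append(runtime / est_walltime)

    def predict(self, user, est_walltime):
        # Formula 2: scale the user's estimate by the maximum recent usage.
        # The maximum is the conservative choice: it minimizes the chance
        # that the predicted walltime is an underestimate.
        if not self.history[user]:
            return est_walltime  # no history yet; fall back to the estimate
        return est_walltime * max(self.history[user])

    def on_underestimate(self, predicted_wall, est_walltime):
        # A job that outlives its prediction gets the prediction doubled,
        # never exceeding the user's original estimate (the hard limit).
        return min(predicted_wall * 2, est_walltime)
```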

Fig. 3. Avg. absolute errors of runtime predictions (per user) with respect to the walltime being used (user estimate, predictor).

The impact of the predictor can be observed in Fig. 2 (green line). Clearly, the overestimated user-based walltimes are now distributed closely to the real distribution of actual runtimes (blue line). The improving effect is especially visible where the user-based estimates are very bad, which is the case for the FH1 and FH2 workloads.

We also analyzed how our simple predictor performs with respect to individual users. Therefore, we computed the average absolute errors of both user-based estimates and predictor-based walltimes for each user in the system and plotted these pairs in the line charts shown in Figs. 3 and 4 (users on the x-axis are ordered according to the average user estimate error).

Fig. 4. Avg. absolute errors of runtime predictions (per user) with respect to the walltime being used (user estimate, predictor).

These figures illustrate that for some users our simple predictor is not able to reduce the average error very well. Still, compared to the user-based walltimes, it decreases the errors quite successfully. Clearly, some users are really poor at judging the duration of their jobs, providing estimates that are (on average) several hours or even days longer than necessary. From this point of view, even a simple predictor like the one presented here makes good sense to use.

3 Impact on Job Scheduling Performance

So far, we have demonstrated using several existing workload traces that the predictor can improve the accuracy of walltime estimates. This section uses detailed simulations to analyze the impact of the walltime predictor on the performance of the system. For this purpose, each of the four systems is modeled in the Alea simulator  [5] and its workload is replayed. For comparison, simulations use either perfect estimates (i.e., exact job runtimes), user-provided estimates or predictor-generated estimates to build the job schedule.

3.1 Scheduling Policy

In all experiments we use conservative backfilling  [8] as it heavily relies on provided estimates. The schedule is built using the available runtime estimates (perfect, user-provided or predictor-based, respectively). In case a job finishes earlier than expected, the schedule is updated using the schedule compression algorithm  [8]. During the update, jobs are checked one by one and the start time of each job is adjusted, i.e., it is moved into the earliest possible time slot with respect to previously adjusted jobs (compression phase). Similarly, if a job’s runtime is underestimated, it is first prolonged (see the discussion in Sect. 2.2) and then the existing schedule is updated, i.e., jobs are reinserted using the same mechanism as in the aforementioned compression phase.
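To make the compression phase concrete, the following simplified Python sketch re-inserts jobs one by one at their earliest feasible start on a single homogeneous cluster. It is our own illustration under simplifying assumptions (no node topology, all jobs treated as movable), not the actual Alea code.

```python
def used_at(placed, t):
    # Processors busy at time t; placed entries are (job, start, end, procs).
    return sum(p for (_, s, e, p) in placed if s <= t < e)

def earliest_start(placed, total, need, runtime, now):
    # Utilization only changes at job starts/ends, so it suffices to try
    # event times as candidate starts and to test the events that fall
    # inside the candidate window [t, t + runtime).
    events = sorted({now} | {s for (_, s, _, _) in placed}
                          | {e for (_, _, e, _) in placed})
    events = [t for t in events if t >= now]
    for t in events:
        window = [t] + [x for x in events if t < x < t + runtime]
        if all(total - used_at(placed, x) >= need for x in window):
            return t  # enough processors stay free for the whole runtime

def compress(schedule, total, now):
    # Compression phase: jobs are revisited in schedule order and each is
    # shifted to its earliest feasible slot w.r.t. already re-placed jobs.
    placed = []
    for job, start, end, procs in schedule:
        t = earliest_start(placed, total, procs, end - start, now)
        placed.append((job, t, t + (end - start), procs))
    return placed
```

Because jobs are re-inserted in their existing order, no job can jump ahead of another; it can only shift earlier when a predecessor finished sooner than estimated.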

3.2 Metrics and Results

As our performance indicators we measure the percentage of backfilled jobs and the distribution of job wait times. Let us first discuss the impact of improved estimates on the backfilling ratio. Intuition suggests that with more accurate estimates (and even more so with perfectly known runtimes) the ratio of backfilled jobs should increase. Conversely, when using inaccurate, user-provided and overestimated walltimes, the backfilling ratio should be significantly lower since most jobs “appear to be too long” for the existing gaps. The results we obtained (see Fig. 5) seem to follow this expectation, but the differences are not very dramatic.

Fig. 5. The percentage of backfilled jobs with respect to the walltime being used (exact runtime, user estimate, predictor).

Figure 5 shows the percentage of backfilled jobs with respect to the accuracy of the estimates being used. With the exception of the FH1 workload, there is only a slight improvement in the backfilling ratio when accurate or predictor-based walltimes are used. The difference between FH1 and the remaining workloads is most likely caused by the very bad original user-based estimates in FH1. As can be seen in Fig. 2 (top), FH1 has the worst user-based estimates among all workloads. It is important to understand that improved walltimes do not guarantee a higher backfill ratio. In fact, with better walltime estimates, not only do individual jobs “look shorter” but all available holes in the schedule also become “shorter”, as job runtimes are less overestimated. Therefore, the probability that a job will be backfilled within existing holes is only slightly higher when estimates are better (see Footnote 2).

Fig. 6. Distribution of job wait times with respect to the walltime being used (exact runtime, user estimate, predictor).

We also measured the impact of improved walltime estimates on the distribution of job wait times, which is captured in Figs. 6 and 7. The main general difference between these two figures is the large number of jobs that start immediately (see Fig. 6), caused by a relatively smaller backlog of waiting jobs compared to the CTC and SDSC workloads (see Fig. 7). Other than that, most of the workloads show that inaccurate user-based estimates are associated with a worse distribution of job wait times, i.e., more jobs fall into the categories representing long waits. As soon as the predictor is used, we see a common tendency of decreasing job wait times.

Fig. 7. Distribution of job wait times with respect to the walltime being used (exact runtime, user estimate, predictor).

The only exception is the CTC workload (see Fig. 7 (top)), where the use of the predictor does not produce a better distribution of job wait times; instead, it stays very close to the one dictated by user-provided estimates. This phenomenon has been observed in the past, e.g., in  [1, 3], and explained in depth in  [13] (using the same CTC workload trace). Apart from the explanation provided in  [13], we would like to add that metrics like wait time can be easily influenced by even subtle changes in the job processing order. The job execution order can be easily manipulated either purposely (e.g., via a fair-sharing mechanism) or “accidentally”, e.g., as a side effect of using the predictor. The predictor’s estimates may shuffle the order in which jobs are executed, as backfilling tends to prefer shorter jobs. Clearly, this shortest-job-first-like scheduling reduces the average wait time.

Figure 8 shows a hypothetical example of three different schedules composed of the same set of jobs. All three schedules have the same makespan (\(C_{max} = 10\) time units) but exhibit very different job wait times. This figure illustrates the impact of job ordering at a somewhat extreme scale, yet we believe it nicely shows that metrics like wait time may oscillate quite wildly.

Fig. 8. The impact of job execution ordering on job wait times.
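A tiny numerical illustration of the same effect (our own example, not the one from Fig. 8): the two orderings below have an identical makespan on a single processor, yet their average wait times differ by more than a factor of two.

```python
# Three jobs with runtimes 1, 4 and 5 on a single processor. Both
# orderings yield the same makespan (10), yet very different wait times.
def wait_times(order):
    clock, waits = 0, []
    for runtime in order:
        waits.append(clock)  # each job waits until the machine frees up
        clock += runtime
    return waits

print(wait_times([1, 4, 5]))  # [0, 1, 5] -> average wait 2.0
print(wait_times([5, 4, 1]))  # [0, 5, 9] -> average wait ~4.7
```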

4 Impact on Accuracy of Predictions

This section focuses on the ability of the scheduling system to provide predictions concerning job wait times and future job-to-node allocations. First, we analyze the accuracy of wait time predictions in Sect. 4.1. Next, Sect. 4.2 focuses on the impact that (in)accurate walltime estimates have on the capability of the job scheduler to provide advance node predictions for waiting jobs, i.e., the ability to correctly predict where a waiting job will be executed.

4.1 Wait Time Predictions

It is quite convenient when the scheduling system is able to provide information on when a given job is likely to start executing. This is especially useful for interactive jobs. Although the system can use a scheduler-independent solution like QBETS  [9], we use this section to analyze the accuracy of predictions that the scheduler can provide on its own. In this case, the scheduler uses the schedule (built by conservative backfilling) to estimate how long a job will wait before its execution starts. Since these wait times can be continuously refined (shortened) as jobs complete earlier and the schedule is compressed, we use the initial estimate returned by the scheduler at the moment of job arrival. This can be seen as the first “response” the user gets when he or she submits the job. The accuracy of predictions is measured by computing the absolute error of the predicted wait time with respect to the actual job wait time.
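In code, this metric reduces to the following (function and parameter names are illustrative):

```python
def wait_time_prediction_error(arrival, planned_start, actual_start):
    # The prediction is taken once, at submission time: it is the wait
    # implied by the job's first position in the schedule.
    predicted_wait = planned_start - arrival
    actual_wait = actual_start - arrival
    return abs(predicted_wait - actual_wait)
```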

Fig. 9. Absolute errors in predicted wait times of jobs when using either user-estimated walltimes or the predictor.

Figure 9 shows the absolute errors in predicted job wait times when using either user-provided or predictor-based walltime estimates. It nicely illustrates the ability of the predictor to decrease the prediction error. This is most visible in the case of the FH1 and FH2 workloads, which have the worst user-based estimates (see Fig. 2 and the related discussion). From this point of view, predictor-based estimates can deliver much better predictions of job wait times, thus giving the users more optimistic responses concerning when their workloads will be executed.

4.2 Node Allocation Predictions

The ability to predict the node(s) for a waiting job in advance can be very useful. If the target node(s) are known for a reasonably long time (e.g., at least a few minutes ahead), this time can be used to stage all job-related data in advance, thus saving CPU cycles. Also, if needed, a local ad hoc file system can be set up for a waiting job in advance. We have illustrated this approach in Fig. 1.

This section analyzes whether it is feasible to predict the target node(s) ahead of time. More specifically, we measure for how long the target node has been known prior to the job start. We call this time period the Valid Node Allocation Time Period and denote it \(T_{node}\). For each job, \(T_{node}\) is computed as the difference between the job start time and the time when the final accurate prediction was made (i.e., the one that turned out to be correct). In case a job starts immediately after its submission, \(T_{node}\) is equal to zero. Otherwise, the job waits and the scheduler adds it into the schedule. If perfect walltime estimates (exact runtimes) are used, the schedule being built is accurate and can be used to correctly predict target node(s) for jobs in advance. However, since job walltime estimates are inaccurate, the schedule constantly adapts to jobs finishing at different times than previously predicted. This impedes the effort of the scheduler to provide reasonably accurate node predictions, as planned node and time allocations can change virtually at any time. As we have shown in our earlier work  [11], with inaccurate estimates only a small portion of jobs can achieve accurate predictions, i.e., only a fraction of jobs gets a reasonably high (useful) \(T_{node}\).
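One straightforward way to obtain \(T_{node}\) from a simulation, assuming the simulator logs every (re)planned node allocation of a job, is sketched below (our own illustrative code, not the paper’s implementation):

```python
def t_node(prediction_log, start_time, started_nodes):
    # prediction_log: time-ordered (time, predicted_nodes) pairs, recorded
    # whenever the scheduler (re)plans the job's allocation.
    # T_node spans from the moment the final, correct prediction was made
    # up to the job's start; a job that starts immediately has T_node = 0.
    t_final = start_time
    for time, nodes in reversed(prediction_log):
        if nodes != started_nodes:
            break          # earlier predictions pointed elsewhere
        t_final = time     # prediction was already correct at this time
    return start_time - t_final
```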

Fig. 10. An example of backfilling distorting previous node allocations for job #4 as a result of its “gap filling” approach.

\(T_{node}\) is also negatively influenced (reduced) by the backfilling approach  [11]. As demonstrated in Fig. 10, jobs being backfilled can distort previous node allocations, thus decreasing \(T_{node}\) for the affected jobs. Therefore, instead of backfilling, we use the simple First Come First Served (FCFS) policy to build the job schedule in this use case. This means that existing holes in the schedule are not filled with later-arriving jobs, which removes the negative effect observed with backfilling. The FCFS policy simply finds the earliest available free slot at the end of the schedule, without searching for possible gaps within the existing schedule. For example, if FCFS were used in the scenario presented in Fig. 10 instead of backfilling, it would place the new job behind job #5, leaving all preceding gaps unused.

Moreover, we have extended the approach used in our previous paper  [11] and developed a significant extension of the scheduling policy, which allows us to “pin” jobs to their planned nodes if their predicted start time is near enough. In other words, if a job is about to start soon (e.g., within 1 h), we do not allow the scheduler to change the planned nodes for this job. Thus the job’s node allocation is fixed and its \(T_{node}\) cannot decrease. Without node-pinning, jobs can be shifted to different nodes, e.g., as a result of an early job completion and the following schedule compression procedure. This reshuffling of the whole schedule then destroys existing predictions, resetting all corresponding \(T_{node}\) values to zero. Therefore, pinning helps to keep predictions valid subject to inaccuracies in the job schedule. In our implementation, node-pinning is only activated when the job’s planned start time falls within a given “pinning interval” x, i.e., when the job is planned to be executed within the next x minutes.
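The pinning rule itself is simple; a minimal sketch (our own names, with the pinning interval x expressed in seconds) could look as follows:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    job_id: int
    planned_start: float  # planned start time (seconds)
    nodes: tuple          # planned node allocation

def is_pinned(plan: Plan, now: float, pin_interval: float = 3600) -> bool:
    # A job planned to start within the next `pin_interval` seconds
    # (x = 1 h here) keeps its node allocation fixed; the scheduler may
    # still shift its start time, but not the chosen node(s).
    return plan.planned_start - now <= pin_interval

# During schedule updates, reallocation is then skipped for pinned jobs:
#   if not is_pinned(plan, now):
#       plan.nodes = choose_nodes(...)  # hypothetical reallocation step
```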

Fig. 11. Distribution of durations of valid node predictions (\(T_{node}\)) with respect to the pinning interval (none, 1 h, 2 h) and the walltime being used (exact runtime, user estimate, predictor).

In the following text, using a series of experiments, we analyze how the \(T_{node}\) values are distributed with respect to the varying accuracy of walltime estimates (exact, user-estimated, predictor). We also measure the effect of the newly developed node-pinning functionality. For this purpose, we analyze the effect of either no node-pinning, pinning with a short interval (x = 1 h) or pinning with a long interval (x = 2 h). Since the absence of backfilling and the use of node-pinning can potentially lead to poor performance, we also measure this impact by comparing the job wait times of all considered scenarios.

Fig. 12. Distribution of durations of valid node predictions (\(T_{node}\)) with respect to the pinning interval (none, 1 h, 2 h) and the walltime being used (exact runtime, user estimate, predictor).

Figures 11 and 12 show the durations of valid node predictions (distribution of \(T_{node}\) values) with respect to the pinning interval (none, 1 h, 2 h) and the walltime being used (exact runtime, user estimate, predictor). From these distributions we can quickly identify the positive effect that node-pinning has on the ability of the scheduler to provide reasonable advance node predictions. The effect of the predictor, on the other hand, is rather moderate. It does show a positive effect in the case of the FH1 workload (the one having the worst user-based estimates), as it significantly increases the fraction of jobs that have \(T_{node} > 0\) s. A moderate positive effect can be seen for the FH2, CTC and SDSC workloads (comparing estim vs. predictor). However, in most cases the largest benefit is evidently achieved through node-pinning rather than through improved walltime estimates.

Sadly, the node-pinning functionality developed in this work has some obvious drawbacks. Since we do not use backfilling, nodes can easily become underutilized when waiting jobs are already pinned to different (busy) CPUs. To measure the impact, we provide Figs. 13 and 14, which show the wait time distributions corresponding to the experiments focusing on node predictability.

Fig. 13. Distribution of job wait times with respect to the pinning interval and the walltime being used.

Fig. 14. Distribution of job wait times with respect to the pinning interval and the walltime being used.

Clearly, the FCFS policy (further amplified by node-pinning) heavily influences the wait time distributions, as can be seen by comparing these distributions to those observed for “pure” backfilling in Figs. 6 and 7. FCFS worsened the wait times, while node-pinning added even more delays. It is also worth noticing that the accuracy of estimates plays a smaller role in this use case, since the FCFS policy is less dependent on the quality of estimates than the conservative backfilling used in Sect. 3.

4.3 Summary

To sum up this use case, we have shown that the proposed node-pinning extension of the FCFS policy, along with the predictor, helps to increase node predictability (higher \(T_{node}\) values), but at the expense of deteriorated wait times and poorer utilization. This clearly is not a desirable outcome and it leaves the problem open for further research. At the same time, we have seen that with accurate walltimes reasonable node predictions are achievable, and for many jobs \(T_{node}\) is sufficiently long to allow for advance data staging. There are several classes of computations where job runtimes can be predicted very accurately and advance data staging is very useful due to the large size of the input data. One such type of computation is, e.g., the processing (converting) of raw video data.

Also, not all jobs require such special treatment, i.e., only truly data-demanding jobs benefit from advance data staging. Therefore, it would be interesting to pin only these data-heavy jobs and use the remaining jobs to efficiently “fill the holes”, i.e., use selective backfilling. Our results show that the benefits of plain FCFS do not outweigh its poorer performance.

5 Conclusion and Future Work

In this work we have studied walltime estimates and the use of a simple walltime predictor with respect to their impact on several different aspects of job scheduling. This analysis focused on the predictor’s accuracy and the effect it has on system performance and predictability. Our main findings are summarized in the following list:

  • Even a simple predictor improves the accuracy of walltimes

  • Better accuracy improves backfilling opportunities but the effect is not dramatic

  • Wait times can be slightly reduced with better estimates (but other effects play a role)

  • Significant improvement can be achieved in wait time predictions

  • With the existing predictor, node allocations cannot be predicted very well

  • Better node predictions can be achieved by using a heavily modified FCFS-based scheduler with “node-pinning”; however, job wait times then deteriorate heavily.

In the future, we want to implement more advanced predictors and test them in practice using the PBS Pro system deployed in the Czech national computing infrastructure MetaCentrum  [7]. We would also like to further improve our node-pinning policy and introduce selective backfilling, which should reduce the negative impact on job wait times and utilization. Another promising, yet more demanding, direction is to design a “node-aware” scheduling algorithm which takes node predictions into account. For example, the new algorithm can try to place each job so that it overlaps with the minimal number of previous jobs, thus limiting constant job re-allocations upon early job completions. In the example of Fig. 10, this would suggest placing job #4 after job #1 from the outset, instead of having it use nodes previously assigned to both job #1 and job #2.