1 Introduction

With the ubiquitous availability of clusters of commodity machines and the ease of configuring them in the Cloud, there is growing interest in executing data analytics workloads in distributed environments such as Hadoop MapReduce and Spark. For effective use of resources, a job needs to be partitioned into tasks running in parallel on different workers. We will use the term worker to refer to a single processing unit, i.e., a single physical or virtual core. Hence a c-core machine would support up to c concurrent workers.

Given an analytics operator in a data-intensive computation, our goal is to minimize its total execution time by determining (1) a partitioning of its work, (2) the number of tasks these partitions are mapped to, and (3) the degree of parallelism for task execution. We equivalently refer to this execution time as the makespan of the corresponding set of tasks.

In contrast to previous work, our goal is to include operator-specific partitioning parameters in the optimization process. This is important because user-defined data processing operators are common in Hadoop and Spark dataflows, and it is often difficult to determine which partitioning-parameter values will result in the fastest job execution time. For illustration, consider a user who wrote a MapReduce program for dense matrix multiplication based on the well-known block partitioning. It partitions the left matrix into \(B_0\)-by-\(B_1\) blocks, and the right one into \(B_1\)-by-\(B_2\) blocks. (See Sect. 4.4 for details.) In addition to the number of tasks and the degree of parallelism during execution, the user now also has to choose the best values for \(B_0\), \(B_1\), and \(B_2\). To do so with state-of-the-art approaches, she essentially has two options.

First, she could train a blackbox machine learning model to predict makespan from a variety of features [2], including the block sizes, input size, output size, number of tasks, degree of parallelism, task size variance, and so on. This approach is convenient for the user, because it does not require deep understanding of distributed system interactions. The model can be trained automatically on labeled data, obtained from an appropriate benchmark that measures makespan for a variety of configurations. Unfortunately, finding the minimum-makespan configuration in a blackbox model requires exhaustive trial-and-error probing of the model. While a single prediction might only take a microsecond, exploring all combinations of just 10 different values for 10 parameters would take \(10^{10}\) microseconds, i.e., almost 3 h.

Second, she could explore DBMS cost models, which estimate the cost of an operator as the sum of the number of operations performed, weighted by per-operation cost. These models require a fairly complete understanding of system-level details, e.g., the number of random and sequential I/O operations performed. These depend on implementation details of the underlying system and are difficult to specify for makespan estimation in distributed systems. Furthermore, DBMS cost models do not take resource bottlenecks into account.

To address the shortcomings of existing techniques, we propose to generally follow the machine-learning approach, but to do so with the simplest possible model type. The model’s structure should enable fast makespan optimization, while at the same time being flexible enough to capture a distributed execution “sufficiently” accurately.

Arguably the simplest approach with any hope for being practically useful is to estimate task execution time as a linear combination of its input size (I), output size (O), and computation complexity (C) as \(c_0 + c_1 I + c_2 O + c_3 C\). The parameters intuitively represent fixcosts (\(c_0\)), data transfer rates (\(c_1\), \(c_2\)), and processing speed (\(c_3\)). This model is abstract in the sense that it reflects algorithm properties, not implementation or system aspects. Since the parameters are estimated based on training data obtained from actual benchmark executions on the same cluster, they represent averages over a large number of low-level processing steps and thus automatically account for underlying processing complexities [5].

To use an abstract model like \(c_0 + c_1 I + c_2 O + c_3 C\) for makespan optimization, the user has to express I, O, and C as functions of the partitioning parameters of interest. This requires human expertise, but is strictly easier than for traditional DBMS cost models. Note that the resulting function might not be linear in the partitioning parameters. Consider the first map phase of matrix multiplication, for which in Sect. 4.4 we derive map task duration as \(c_{1_0} + c_{1_1} (N_0 N_1 + N_1 N_2) / n_1 + c_{1_2} (N_0 N_1 B_2 + N_1 N_2 B_0) / n_1\). All that was needed to obtain this formula were (1) input size per task (\((N_0 N_1 + N_1 N_2) / n_1\)) and (2) output size and computation complexity per task (\((N_0 N_1 B_2 + N_1 N_2 B_0) / n_1\)). We believe that this represents a relatively small burden, because the program designer has to understand the algorithmic impact of partitioning choices anyway, in order to design an effective distributed program.
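To make this concrete, the following minimal sketch (in Python) shows how such user-supplied input/output expressions plug into the abstract per-task model for this map phase. The coefficient values are made-up placeholders, not trained values; they would come from the training procedure described later.

```python
# Sketch: plugging the user-supplied I/O expressions for the matrix-multiplication
# map phase (Sect. 4.4) into the abstract task-time model. The coefficients c are
# hypothetical placeholders; real values are learned from benchmark runs.

def map_task_time(N0, N1, N2, B0, B2, n1, c):
    """Estimated duration of one round-1 map task."""
    inp = (N0 * N1 + N1 * N2) / n1            # input read per task
    out = (N0 * N1 * B2 + N1 * N2 * B0) / n1  # output written per task (blocks duplicated)
    return c[0] + c[1] * inp + c[2] * out     # c_{1_0} + c_{1_1}*I + c_{1_2}*O

# Example call with assumed coefficients (c_{1_0}, c_{1_1}, c_{1_2}):
print(map_task_time(N0=20000, N1=20000, N2=20000, B0=2, B2=2, n1=100,
                    c=(1.0, 2e-8, 1e-8)))
```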

This relatively small additional effort for the programmer to reveal high-level algorithm properties to the optimizer pays big dividends in optimization time, compared to simply providing the operator as a blackbox. For example, matrix multiplication has 10 partitioning parameters (Sect. 4.4), requiring exploration of a 10-dimensional space of combinations. Our approach reduces complexity to three dimensions, because for the other seven our model can derive optimal settings analytically. Assuming 10 values explored in each of those 7 dimensions, this reduces optimization cost by a factor of \(10^7\)!

But can an abstract makespan model capture the complexities of a distributed system, in particular task interactions and resource bottlenecks? Fortunately, any function can be approximated with multiple linear pieces. Our experiments show that for a piecewise linear model (Fig. 1), it only takes a small number of pieces to be sufficiently accurate. The reason for this lies in the way resources are consumed. Consider a network link that can transmit data at a certain rate. Ideally, transmitting twice the amount of data should take twice as long. However, in practice greater competition for resources typically increases overhead cost and hence the effective transmission rate may drop. Figure 2 shows a typical observation for a MapReduce program, where the time for shuffling data across the network increases more rapidly after about 600MB. The model can capture this behavior by using a different slope for larger data.

Fig. 1 Schematic illustration of piecewise linear models for a 2-round computation with homogeneous tasks. The model for round 1 is partitioned on task input size only; the model for round 2 is partitioned on both parallelism degree and task output size

Fig. 2 Shuffle time versus data size (MB) for round 2 of the matrix product algorithm

Piecewise linear models also offer two additional benefits. First, the program designer does not need to specify the dependency of I, O, and C on the partitioning parameters overly accurately, as long as the formula captures the dominating terms. For instance, for a program whose computation cost is \(C = n \log n + n + \sqrt{n}\), it suffices to specify \(C = n \log n\)—something the programmer is familiar with from traditional O-notation complexity analysis. The only downside is that the model may potentially need more linear pieces to be sufficiently accurate. As a second benefit, the model pieces provide insights about bottlenecks. For example, for the reduce phase of sorting (Sect. 4.3), model training for a cluster of quad-core machines determined that three pieces were needed when all four cores were used. Input coefficient \(c_1\) had values 5.5, 9.9, and 12 for “small”, “medium”, and “large” input size, respectively. For executions using only two cores per machine, the model created only two such pieces, with \(c_1\) equal to 4.4 for “small” and 4.9 for “large” inputs. Hence it automatically captured the I/O-dominated nature of sorting. With four cores competing for data access, larger input size stresses the I/O and memory bus more than when only two cores are used.

This work makes the following main contributions:

  1. We propose a linear makespan model for the rounds of a data-intensive computation (Sect. 2) and show how it can account for bottlenecks through domain partitioning into a piecewise linear model. For makespan optimization, we show how model structure can be exploited to prune the optimization-parameter search space (Sect. 3).

  2. In Sect. 4, we introduce an instantiation of the general model for problems with homogeneous tasks. It enables us to prove even stronger results, significantly reducing the dimensionality of the optimization-parameter search space and thus decreasing optimization cost by orders of magnitude.

  3. We present a framework for model training in Sect. 5.

  4. We show through extensive experiments (Sect. 6) that the proposed models are sufficiently accurate, i.e., capture the relative makespan behavior for different optimization-parameter settings. This is explored for essential data analytics operators (join, sort, matrix product).

Related work is discussed in Sect. 7, and we conclude in Sect. 8.

2 General model

Fig. 3 Distributed data-intensive computation as sequence of rounds, consisting of shuffle followed by local computation. Each box in a column symbolizes a worker

Despite the diversity of analytics operators, at the system level every distributed data-intensive computation relies on the same basic building block: local data processing on multiple worker machines in parallel, preceded by global data exchange to get the appropriate input to each worker. In line with nomenclature of modern big-data processing platforms Hadoop MapReduce and Spark, we will refer to the latter as shuffle phase; and we will use the term round (of computation) for the building block (see Fig. 3). We are interested in the simplest possible, but “sufficiently accurate” cost model for the running time of a round, which we refer to as makespan model.

Our proposed function for modeling running time T of a round is defined as

$$\begin{aligned} T = \beta _0 + \beta _1 {\mathcal {I}} + \beta _2 I_m + \beta _3 O_m + \beta _4 C_m + \beta _5 \tau _m. \end{aligned}$$
(1)

\(\beta _0\) accounts for the fixcosts of starting up the round, which can be significant in a distributed setting. Term \(\beta _1 {\mathcal {I}}\) captures the impact of the total amount of input \({\mathcal {I}}\) to be shuffled and transferred to the workers. For the remaining four terms, notice that computation on the different workers happens in parallel. Hence makespan, in contrast to traditional DBMS cost optimization, is determined by the most loaded worker, a.k.a. “straggler”. Consequently, the model does not depend on the total input, output, and computation on all workers, but only on the input (\(I_m\)), output (\(O_m\)), and computation (\(C_m\)) on that one straggler. Term \(\beta _5 \tau _m\) accounts for the fixcost for starting up and shutting down the \(\tau _m\) tasks assigned to this worker. This and other important notation is shown in Table 1.

Table 1 Important notation
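As a minimal illustration, Eq. 1 can be evaluated directly once the \(\beta \)-coefficients are known. The coefficient and variable values in the sketch below are hypothetical placeholders; the coefficients are obtained by the training procedure described next and in Sect. 5.

```python
# Sketch: evaluating Eq. 1 for one round. Beta values are assumed placeholders,
# obtained in practice from model training (Sect. 2.1 and 5).

def round_time(beta, total_input, input_max, output_max, comp_max, tasks_max):
    """T = b0 + b1*I + b2*I_m + b3*O_m + b4*C_m + b5*tau_m."""
    b0, b1, b2, b3, b4, b5 = beta
    return (b0 + b1 * total_input + b2 * input_max + b3 * output_max
            + b4 * comp_max + b5 * tasks_max)

# Example call with hypothetical coefficients and variable values:
print(round_time((30.0, 1e-8, 4e-8, 2e-8, 1e-9, 0.5),
                 total_input=1e9, input_max=5e7, output_max=5e7,
                 comp_max=5e7, tasks_max=2))
```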

2.1 General practical aspects

Model training happens offline, i.e., before the model can be used for optimization of a given job. It follows the standard approach of supervised learning in general, and linear regression in particular. First, a suite of benchmark jobs is executed, covering a variety of values for model variables \({\mathcal {I}}\), \(I_m\), \(O_m\), \(C_m\), and \(\tau _m\). For each job, round execution time T is recorded, resulting in a 6-tuple \(({\mathcal {I}}, I_m, O_m, C_m, \tau _m, T)\). Given a set of such tuples, standard least-squares estimation produces the best-fit values of the \(\beta \)-coefficients. Due to the small number of variables, overfitting is not a concern and hence we do not apply regularization. (Our experiments confirm similar prediction accuracy on both training and withheld test data.)
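A minimal sketch of this fitting step is shown below, assuming training records are stored as the 6-tuples described above; NumPy's least-squares solver stands in for any standard implementation.

```python
# Sketch: least-squares estimation of the beta-coefficients of Eq. 1 from
# benchmark tuples (I, I_m, O_m, C_m, tau_m, T).
import numpy as np

def fit_betas(records):
    X = np.array([[1.0, I, I_m, O_m, C_m, tau]       # leading 1.0 -> intercept beta_0
                  for (I, I_m, O_m, C_m, tau, _) in records])
    y = np.array([T for *_, T in records])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)     # no regularization needed
    return betas
```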

For more details about the model training process, refer to Sect. 5. Training of piecewise linear models is analogous, with the additional step of data-driven determination of a domain partitioning when needed (see Sect. 2.2).

Model use In order to use the proposed model, the programmer has to express variables \({\mathcal {I}}\), \(I_m\), \(O_m\), \(C_m\), and \(\tau _m\) as functions of the partitioning parameters she would like to tune. We demonstrate this for three diverse operations below, showing that often the model can also be simplified. It will become clear that our abstract model is much easier to determine than a corresponding DBMS-style cost model based on low-level operations.

In addition to the automatically learned \(\beta \)-coefficients and the user-provided functions for variables \({\mathcal {I}}\), \(I_m\), \(O_m\), \(C_m\), and \(\tau _m\), the optimizer only needs traditional selectivity estimation, identical to the same functionality in a DBMS, in order to estimate intermediate result size for data processing pipelines consisting of multiple rounds. Then it can estimate the values of \({\mathcal {I}}\), \(I_m\), \(O_m\), \(C_m\), and \(\tau _m\) for each round, and simply plug them into Eq. 1 to estimate round time.

Model realism Will this abstract model be sufficiently accurate to be useful? Indeed, it is more powerful than it may at first seem. To see this, consider the following possible concerns.

The shuffle phase does not transfer all data in bulk before the local computation phase—both are usually interleaved. This means that after completing a task belonging to a round, the worker might later process another task of the same round. In that case, the worker requests input data for the new task after the completion of the previous one. Equation 1 still applies, because, mathematically, it simply captures the fact that total input \({\mathcal {I}}\) to a round is essential for capturing shuffle cost, no matter the actual interleaving of data transfer and local computation. If data transfer is spread over multiple waves of tasks, then the model might automatically determine a lower value of \(\beta _1\), i.e., lesser impact of larger total input on makespan.

When input is not evenly balanced across workers, then total input \({\mathcal {I}}\) might not suffice to explain variations in shuffle time. In that case, term \(\beta _2 I_m\) can pick up some of the effects. An analogous argument applies to other low-level system operations. For instance, a map task might have to spill buffer content to disk. The corresponding reading and writing time will be accounted for by the \(I_m\) and \(O_m\) terms. When larger output causes more frequent buffer spilling, the model can capture this automatically by learning that a larger \(\beta _3\) value is needed for a model piece covering larger \(O_m\) values. (See discussion of piecewise linear models below.) In general, since the \(\beta \)-parameters are estimated from actual benchmark executions, they represent averages over a large number of low-level processing steps. This agrees with recent results by Duggan et al. [5], who showed that a single variable can account for underlying processing complexities in their performance prediction approach.

A MapReduce combiner is treated like any other local computation functionality in a round. It affects the user-specified functions for \(O_m\) and \(C_m\), as well as the value of \({\mathcal {I}}\) for the following reduce phase.

We next discuss the two major challenges in making the linear model practically applicable: accounting for interaction effects and determining \(I_m\), \(O_m\), \(C_m\), and \(\tau _m\).

2.2 Accounting for task interactions and bottlenecks

Interaction effects occur when tasks executed in parallel on a multicore processor compete for resources, e.g., memory bus and local disk(s). They also occur when multiple machines compete for access to shared network links or switches, slowing down data transfer and local computation. This can be captured by partitioning our model into \(k \ge 1\) ranges \((p_0,p_1]\), \((p_1,p_2]\),..., \((p_{k-1},p_k]\) of degrees of parallelism. Bottlenecks appear not only when multiple tasks compete for resources. The local computation of a task might also get delayed by I/O wait time caused by its own I/O operations, requiring different model coefficient values for different ranges of input and output size.

The result of partitioning the design space is a family of piecewise linear models, each with its own combination of values for \((\beta _0, \beta _1, \beta _2, \beta _3, \beta _4, \beta _5)\). We say that a model covers the corresponding partition defined by a range of parallelism degrees (p), total input size (\({\mathcal {I}}\)), and input size (\(I_m\)) and output size (\(O_m\)) on the most loaded worker. The partitioning can be determined in a fully data-driven manner from the training data, e.g., by minimizing the residual sum of squares [30] or by using a model tree [23]. For parallelism degree, we ensure that the number of cores per CPU is considered as follows: for a cluster consisting of w/c c-core machines, all interval endpoints that are multiples of the number of machines, i.e., all values in \(\{ i \cdot w / c:\, i=1,2,\ldots , c\}\), are explicitly considered as possible split points for a parallelism-degree range. Intuitively, these values correspond to a degree of parallelism of 1 to c per physical machine. Figure 1 illustrates the overall structure of the proposed models. It shows a stylized example for the homogeneous task case, discussed in Sect. 4. For each round of the computation, there is a separate piecewise linear model.
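The sketch below illustrates this structure: each piece carries its own \(\beta \)-vector and is looked up by the ranges it covers, and the candidate split points for parallelism degree are the multiples of the machine count. Piece boundaries and coefficients are assumed placeholders, and the data-driven splitting procedure itself (RSS minimization or a model tree) is not shown.

```python
# Sketch: piece lookup in a piecewise linear model, plus candidate split points
# for parallelism degree. Piece definitions are hypothetical examples.

def parallelism_split_candidates(w, c):
    """Split points i*(w/c) for i = 1..c, i.e., 1..c cores used per machine."""
    machines = w // c
    return [i * machines for i in range(1, c + 1)]

def find_piece(pieces, p, total_input, input_max, output_max):
    """Return the beta-vector of the piece whose ranges cover the configuration."""
    for piece in pieces:
        if (piece["p"][0] < p <= piece["p"][1]
                and piece["I"][0] < total_input <= piece["I"][1]
                and piece["Im"][0] < input_max <= piece["Im"][1]
                and piece["Om"][0] < output_max <= piece["Om"][1]):
            return piece["beta"]
    raise ValueError("no piece covers this configuration")

print(parallelism_split_candidates(w=36, c=4))  # e.g. [9, 18, 27, 36] for 9 quad-core machines
```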

2.3 Estimating max load

Estimating \(I_m\), \(O_m\), \(C_m\), and \(\tau _m\), i.e., the input size, output size, computation, and number of tasks on the most loaded worker can be challenging. If the number of tasks in a round is less than or equal to the number of workers, w, then \(\tau _m=1\) and it suffices to identify the “heaviest” individual task. When the number of tasks exceeds w, then some workers will receive multiple tasks and \(I_m\), \(O_m\), \(C_m\), and \(\tau _m\) will depend on the actual scheduling policy used for assigning tasks to workers.

Notice that schedulers in distributed data-processing systems like MapReduce and Spark actively attempt to balance load at runtime by assigning tasks incrementally. In particular, initially just w tasks will be scheduled—one task per worker. Only after a worker reports completion of a task, will it receive the next. Hence when the number of tasks is “sufficiently” large and load between tasks does not vary “too much”, then each worker will receive a similar share of the total load. This implies that one could estimate \(I_m\) as total input divided by w, \(O_m\) as total output divided by w, and \(C_m\) as total computation time divided by w.

Unfortunately, when task load is highly skewed, e.g., one of the tasks accounts for half of the total load, then those averages would result in significant under-estimation. We propose the following general technique for addressing this problem through lightweight simulation. Given a set of tasks, one can simply execute the task assignment algorithm used by the scheduler. Task running time is estimated using the model parameters, i.e., for task i with input \(I_i\), output \(O_i\), and computation complexity \(C_i\), the estimate is \(\beta _2 I_i + \beta _3 O_i + \beta _4 C_i + \beta _5\).
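The following sketch illustrates such a simulation under the simplifying assumption that the scheduler always hands the next task to the worker that frees up first; the actual system's assignment algorithm should be substituted where it differs.

```python
# Sketch: estimating the load of the most loaded worker by simulating incremental
# scheduling. Assumes a greedy "next free worker" policy as a stand-in for the
# real scheduler; per-task times use the learned beta-coefficients.
import heapq

def simulated_max_load(tasks, beta, p):
    """tasks: list of (I_i, O_i, C_i) per task; beta: (b0,...,b5); p: parallel workers."""
    _, _, b2, b3, b4, b5 = beta
    loads = [0.0] * p
    heap = [(0.0, k) for k in range(p)]          # (accumulated load, worker id)
    heapq.heapify(heap)
    for I_i, O_i, C_i in tasks:
        load, k = heapq.heappop(heap)            # worker that finishes first gets the task
        load += b2 * I_i + b3 * O_i + b4 * C_i + b5
        loads[k] = load
        heapq.heappush(heap, (load, k))
    return max(loads)
```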

3 Makespan optimization using the general model

The structure of a linear model provides valuable insights about the importance of the different terms. In particular, the larger the value of a coefficient, the greater the term’s impact on makespan. In addition to insights, simple model structure can be exploited to reduce optimization cost. We show this for the general model in this section, then discuss even stronger results in Sect. 4.2 for operators with homogeneous tasks.

3.1 Search space exploration

The optimization-parameter search space consists of all combinations of possible values for the tuning parameters of interest. We focus on parameters controlling problem partitioning into tasks, and their parallel execution:

  • number of tasks: n,

  • degree of parallelism during execution: p (\(p \le w\)),

  • a set Z of operator-specific parameters controlling problem partitioning.

Fig. 4 Relationship between partitioning parameters

Figure 4 illustrates the relationship between the parameters. Notice that changing the value of a parameter \(z \in Z\) may affect total input, output, and computation cost of the round. For example, a more fine-grained partitioning may require additional input duplicates. On the other hand, different values for n or p do not affect total input size (\({\mathcal {I}}\)), total output size (\({\mathcal {O}}\)), or total computational complexity (\({\mathcal {C}}\))—they only control how the different partitions are “packaged” into tasks and how many tasks are executed concurrently, respectively. We will explain this in more detail for an example in Sect. 3.3.

3.2 Search space pruning

The following lemma will enable us to limit the search space of operator-specific parameters controlling problem partitioning, by establishing a lower bound for makespan of a round. Intuitively, this lower bound corresponds to the (possibly unattainable) ideally balanced load assignment where each worker receives the same number of tasks and the same share of total input, output, and computation. The \(\min \{p,n\}\) term accounts for scenarios when the number of tasks (n) is smaller than the degree of parallelism (p): then at most n of the p workers can receive a task.

Lemma 1

No matter how n tasks of a round with total input \({\mathcal {I}}\), total output \({\mathcal {O}}\), and total computation \({\mathcal {C}}\) are assigned to p concurrent workers, round time T is lower-bounded by \({\bar{T}} = \beta _0 + \beta _1 {\mathcal {I}} + \beta _2 \frac{{\mathcal {I}}}{\min \{p,n\}} + \beta _3 \frac{{\mathcal {O}}}{\min \{p,n\}} + \beta _4 \frac{{\mathcal {C}}}{\min \{p,n\}} + \beta _5 \frac{n}{\min \{p,n\}}\).

Proof

Let \(I_i\), \(O_i\), and \(C_i\) denote input, output, and computation for task i, \(1 \le i \le n\). Then \({\mathcal {I}} = \sum _{i=1}^{n} I_i\), \({\mathcal {O}} = \sum _{i=1}^{n} O_i\), and \({\mathcal {C}} = \sum _{i=1}^{n} C_i\). This implies for the total load induced by all tasks:

$$\begin{aligned} \sum _{i=1}^{n} \left( \beta _2 I_i + \beta _3 O_i + \beta _4 C_i + \beta _5\right) \;=\; \beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}} + \beta _5 n. \end{aligned}$$

The load on the most loaded worker must be greater than or equal to the average load per worker, i.e., \((\beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}} + \beta _5 n)/p\). Also notice that the “heaviest” task even by itself will induce a load at least as high as the average over all tasks, i.e., \((\beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}} + \beta _5 n)/n\). The load of the worker receiving that task will therefore be lower-bounded by the per-task average as well. This immediately implies the same lower bound for the most loaded worker in the system (whose total assigned load is at least as high as that of the worker receiving the heaviest task). Putting these lower bounds together, we obtain

$$\begin{aligned}&\beta _2 I_m + \beta _3 O_m + \beta _4 C_m + \beta _5 \tau _m \ge \\&\quad \quad \max \left\{ \frac{\beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}} + \beta _5 n}{p}; \frac{\beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}} + \beta _5 n}{n}\right\} , \end{aligned}$$

and hence \(T \ge \frac{\beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}} + \beta _5 n}{\min \{p, n\}}\), completing the proof of the lemma. \(\square \)

Lemma 1 can be exploited for search space pruning as follows. For given task number n and degree of parallelism p, we immediately obtain a lower bound \({\bar{T}}\) based on total inherent input size, inherent output size, and inherent computation of a round. The inherent values are those for the unpartitioned execution of the operator. Partitioning can never decrease them, but will typically increase them, e.g., require additional input copies or additional computation steps for post-processing.

Whenever more fine-grained partitioning increases \({\mathcal {I}}\), \({\mathcal {O}}\), or \({\mathcal {C}}\), the corresponding value of lower bound \({\bar{T}}\) will increase accordingly. Hence exploration of even more fine-grained partitioning can be terminated as soon as the lower bound exceeds the best makespan found so far. We illustrate this for joins in Sect. 3.3. Even when more fine-grained partitioning does not increase the lower bound, it still provides valuable information. Knowing that the best makespan found so far is within a small factor of the lower bound, the user may decide to stop exploration early.

Note that when applying Lemma 1 to piecewise linear models, each piece could return a different lower bound. The entire model’s lower bound is the minimum of the per-piece lower bounds, over all pieces that could still be reached during the optimization-parameter space exploration. (For instance, if application parameter settings are explored in increasing order of total input size \({\mathcal {I}}\), then model pieces for smaller input size ranges do not need to be considered for the lower bound.)
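A minimal sketch of this pruning rule is shown below for a single linear piece; `estimate_totals` and `predict_makespan` are hypothetical operator-specific callbacks (e.g., the join estimates of Sect. 3.3), and candidates are assumed to be enumerated in order of non-decreasing totals. A piecewise model would take the minimum bound over all still-reachable pieces, as discussed above.

```python
# Sketch: Lemma 1 lower bound and safe early termination of partitioning exploration.
# Callback names are hypothetical stand-ins for operator-specific estimators.
import math

def lower_bound(beta, total_in, total_out, total_comp, n, p):
    b0, b1, b2, b3, b4, b5 = beta
    k = min(p, n)
    return (b0 + b1 * total_in
            + (b2 * total_in + b3 * total_out + b4 * total_comp + b5 * n) / k)

def explore(candidates, beta, n, p, estimate_totals, predict_makespan):
    best = math.inf
    for z in candidates:                          # finer and finer partitionings
        totals = estimate_totals(z)               # (I, O, C) for this partitioning
        if lower_bound(beta, *totals, n, p) >= best:
            break                                 # safe stop: no finer z can win (Lemma 1)
        best = min(best, predict_makespan(z))
    return best
```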

3.3 Example: equi-join

Consider equi-join \(R \bowtie S = \{(r, s) \in R \times S:\, r.A = s.A\}\) and let \(R_a = \{r \in R:\,r.A = a\}\) and \(S_a = \{s \in S:\, s.A = a\}\) be the subsets of tuples from R and S, respectively, with join attribute value a. We will refer to \(R_a \cup S_a\) as the group for join attribute value a. Then the equi-join can be expressed as \(R \bowtie S = \bigcup _{a \in A} R_a \times S_a\), i.e., the union of Cartesian products for each group.

For skewed input, some groups are significantly larger than others, causing load imbalance and hence a delay in job completion. Skew can be addressed by splitting large groups into smaller sub-groups, e.g., using rectangular partitioning. More formally, the set Z of operator-specific partitioning parameters is defined as the set of integer pairs \(\{(r_a, s_a):\, r_a \ge 1, s_a \ge 1, a \in A\}\), or \([(r_a,s_a)]_{a \in A}\) for short. The best partitioning algorithm to date by Li et al. [19] explores Z by greedily incrementing the \(r_a\) or \(s_a\) that maximizes a benefit-cost ratio based on load variance reduction versus additional input duplication due to subgroup partitioning (Algorithm 1). Note that the algorithm relies on a simplified version of Eq. 1: There the term for \(C_m\) is omitted, because computation cost is linear in input and output; and the \(\beta _5 \tau _m\) term collapses into \(\beta _0\), because the number of tasks is set equal to the degree of parallelism and hence \(\tau _m = 1\).

Algorithm 1

Li et al. [19] determine the values for \({\mathcal {I}}\), \(I_m\), and \(O_m\) in line 5 by executing a load assignment strategy such as random or LLD when packing the (sub)groups into tasks. Since the makespan model in Algorithm 1 is a special case of our general model (Eq. 1), we can leverage Lemma 1 to terminate the loop in a principled way. More precisely, it is easy to see that with increasing \(r_a\) and \(s_a\), the lower bound \({\bar{T}}\) will increase because \({\mathcal {I}}\) keeps increasing due to the additional input duplicates. Hence the while-loop can be terminated safely (i.e., with the guarantee that no better makespan can be found for more fine-grained partitioning) as soon as the lower bound exceeds the predicted makespan for the best partitioning found so far.

4 Homogeneous tasks

This section presents analytical results that enable a significantly greater reduction in optimization cost for a class of problems where all tasks have “similar” load. We refer to these as homogeneous tasks. Task homogeneity occurs frequently in practice, typically by design, because the programmer attempts to distribute load evenly over the workers. For example, for distributed sorting, input is range-partitioned based on (approximate) quantiles, so that each partition receives about the same amount of data. Even for equi-joins, hash partitioning often distributes load fairly evenly as long as groups are not “overly skewed.”

4.1 Makespan model for homogeneous tasks

In the homogeneous model, each of the n tasks handles approximately 1 / n of the total input, output, and computation. Schedulers also can easily balance load across workers, assigning about n / p tasks to each of the p workers. When n is not divisible by p, the most loaded worker will receive \(\lceil n/p \rceil \) of the tasks. Together with Eq. 1, we obtain makespan \(\beta _0 + \beta _1 {\mathcal {I}} + \lceil \frac{n}{p}\rceil \left( \beta _2 \frac{{\mathcal {I}}}{n} + \beta _3 \frac{{\mathcal {O}}}{n} + \beta _4 \frac{{\mathcal {C}}}{n} + \beta _5\right) \).

We propose to further simplify this formula by dropping the first two terms, resulting in the following makespan model H for homogeneous tasks:

$$\begin{aligned} H = \left\lceil \frac{n}{p}\right\rceil \left( \beta _2 \frac{{\mathcal {I}}}{n} + \beta _3 \frac{{\mathcal {O}}}{n} + \beta _4 \frac{{\mathcal {C}}}{n} + \beta _5\right) \end{aligned}$$
(2)

Notice that dropping terms does not make the model “less correct,” but simply reduces its flexibility in capturing real-world behavior. For instance, \(\beta _0\) is a per-job fixcost, while \(\beta _5\) represents per-task fixcost. Without \(\beta _0\) in the formula, the model can implicitly account for the effect of \(\beta _0\) by increasing \(\beta _5\). The same applies to \(\beta _1 {\mathcal {I}}\) and \(\beta _2 {\mathcal {I}}/n\), which both model a dependency of makespan on input \({\mathcal {I}}\). Our experiments will show that the resulting models are sufficiently accurate for makespan optimization.

Analogously to the discussion in Sect. 2.2, we propose a piecewise linear model to account for task interactions and bottlenecks. For Eq. 2, partitioning is considered for parallelism degree (exactly as for the general model), per-task input size, and per-task output size. Figure 1 shows a stylized example, with 1-dimensional lines as stand-in for a plane in 3-dimensional space.

4.2 Makespan optimization for homogeneous tasks

The following powerful lemma enables us to derive the optimal task number and parallelism degree for all applications with homogeneous tasks. Recall that in a piecewise linear model, each piece covers some range \((p_l, p_h]\) for parallelism degree, range \((i_l, i_h]\) for input, and range \((o_l, o_h]\) for output. Intuitively, the lemma states that parallelism degree should be set to the largest value possible for the linear piece, i.e., \(p_h\). And the number of tasks, n, should be set to the smallest possible multiple of \(p_h\) that is allowed based on the input and output range constraint for the linear piece. Note that changing the number of tasks affects the input and output per task, i.e., for some values of n, \({\mathcal {I}}/n\) or \({\mathcal {O}}/n\) might not be inside range \((i_l, i_h]\) and \((o_l, o_h]\), respectively. If n cannot be set to a multiple of \(p_h\), then it should be set to the largest possible value allowed for this linear piece. In the special case of the model being a single linear piece, the lemma implies that both parallelism degree p and number of tasks n should be set to w. This makes perfect sense, because in the absence of non-linear behavior, it is best to use all machines and to achieve this with the smallest number of tasks possible (i.e., one task per worker).

Lemma 2

Let \(H = \lceil \frac{n}{p}\rceil \left( \beta _2 \frac{{\mathcal {I}}}{n} + \beta _3 \frac{{\mathcal {O}}}{n} + \beta _4 \frac{{\mathcal {C}}}{n} + \beta _5\right) \) be a makespan model covering range \((p_l, p_h]\) for parallelism degree, range \((i_l, i_h]\) for input, and range \((o_l, o_h]\) for output. Then H is minimized by setting \(p = p_h\) and \(n = \min \{\lceil n_l/p_h \rceil p_h; n_h \}\), where \(n_l = \max \{ \lceil {\mathcal {I}} / i_h \rceil ; \lceil {\mathcal {O}} / o_h \rceil \}\) and \(n_h = \min \{ \lfloor {\mathcal {I}} / (i_l+1) \rfloor ; \lfloor {\mathcal {O}} / (o_l+1) \rfloor \}\).

Proof

First consider the constraints on the number of tasks, n, imposed by the range for input and output size. For input, per-task input \({\mathcal {I}}/n\) has to fall into range \((i_l, i_h]\), which implies \({\mathcal {I}}/i_h \le n < {\mathcal {I}}/i_l\); and analogously \({\mathcal {O}}/o_h \le n < {\mathcal {O}}/o_l\). Together, and taking into account that n has to be integer, this yields \(n \in [n_l, n_h]\), where \(n_l = \max \{ \lceil {\mathcal {I}}/i_h \rceil ; \lceil {\mathcal {O}}/o_h \rceil \}\), and \(n_h = \min \{ \lfloor {\mathcal {I}}/(i_l+1) \rfloor ; \lfloor {\mathcal {O}}/(o_l+1) \rfloor \}\).

To find the value of p that minimizes \(H = \lceil n/p \rceil (\beta _2 {\mathcal {I}}/n + \beta _3 {\mathcal {O}}/n + \beta _4 {\mathcal {C}}/n + \beta _5)\), notice that none of the terms other than \(\lceil n/p \rceil \) is affected by the choice of p, and that \(\lceil n/p \rceil \) is monotonically non-increasing in p. Hence the optimal choice for p is the largest value possible, i.e., \(p = p_h\). This implies that we are left to determine the value of n that minimizes \(H(p=p_h) = \lceil n/p_h \rceil (\beta _2 {\mathcal {I}}/n + \beta _3 {\mathcal {O}}/n + \beta _4 {\mathcal {C}}/n + \beta _5)\). To deal with the ceiling function, we separate the problem into two cases.

Case 1: the range of possible values for n contains a multiple of \(p_h\). We show that the smallest such multiple minimizes H. Formally, the case condition states that there exists an integer \(k \ge 1\) such that \(n_l \le k p_h \le n_h\). For any such k, consider all \(n \in [n_l, n_h]\) with \(\lceil n/p_h \rceil = k\), i.e., all n that satisfy \((k-1)p_h < n \le k p_h\). For these values of n, let \(H_k = k (\beta _2 {\mathcal {I}}/n + \beta _3 {\mathcal {O}}/n + \beta _4 {\mathcal {C}}/n + \beta _5)\). Note that \({\mathcal {I}}\), \({\mathcal {O}}\), and \({\mathcal {C}}\) are not affected by the choice of n. The problem partitioning is controlled by the operator-specific partitioning parameters in set Z; the choice of n only determines how these partitions are grouped into tasks. As a consequence, \(H_k\) is minimized by choosing the largest n in \((k-1)p_h < n \le k p_h\), i.e., \(n = k p_h\).

We now determine the optimal choice for k. For \(n = k p_h\), \(H = \lceil k p_h/p_h \rceil (\beta _2 {\mathcal {I}}/(k p_h) + \beta _3 {\mathcal {O}}/(k p_h) + \beta _4 {\mathcal {C}}/(k p_h) + \beta _5)\), which simplifies to \(H = (\beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}})/p_h + k \beta _5\). Since \((\beta _2 {\mathcal {I}} + \beta _3 {\mathcal {O}} + \beta _4 {\mathcal {C}})\) is not affected by the choice of n, this function is minimized for the smallest possible k, i.e., for \(k = \lceil n_l/p_h \rceil \) and hence \(n = \lceil n_l/p_h \rceil p_h\).

Case 2: the range of possible values for n does not contain a multiple of \(p_h\). Then there exists an integer \(k' \ge 1\) such that \((k'-1) p_h< n_l \le n_h < k' p_h\). This implies \(\lceil n / p_h \rceil = k'\) for all values of n in \([n_l, n_h]\). As in case 1, this function is minimized for the largest possible choice of n, i.e., \(n_h\).

To combine the solutions derived for both cases, note that in case 1, \(\lceil n_l/p_h \rceil p_h \le n_h\). For case 2, the case condition implies \(\lceil n_l/p_h \rceil = \lceil n_h/p_h \rceil \). Together with \(\lceil n_h/p_h \rceil \ge n_h/p_h\) (by definition of the ceiling function), this implies \(\lceil n_l/p_h \rceil \ge n_h/p_h\) and hence \(\lceil n_l/p_h \rceil p_h \ge n_h\). Hence setting \(n = \min \{\lceil n_l/p_h \rceil p_h; n_h \}\) will minimize H, no matter which of the two cases applies. \(\square \)
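The closed-form solution of Lemma 2, applied per model piece and followed by a minimum over all pieces (mirroring the per-piece optimization described for Algorithm 2 in Sect. 4.3), can be sketched as follows; piece boundaries and coefficients are assumed placeholders.

```python
# Sketch: optimal (p, n) per linear piece via Lemma 2, then the best piece overall.
# Each piece is a dict with ranges "p" = (p_l, p_h], "I" = (i_l, i_h], "O" = (o_l, o_h]
# and coefficients "beta" = (b2, b3, b4, b5); all concrete values are hypothetical.
import math

def optimize_piece(piece, total_in, total_out, total_comp):
    (p_l, p_h), (i_l, i_h), (o_l, o_h) = piece["p"], piece["I"], piece["O"]
    b2, b3, b4, b5 = piece["beta"]
    n_lo = max(math.ceil(total_in / i_h), math.ceil(total_out / o_h))
    n_hi = min(math.floor(total_in / (i_l + 1)), math.floor(total_out / (o_l + 1)))
    if n_lo > n_hi:
        return None                                  # no task count fits this piece
    p = p_h                                          # Lemma 2: largest parallelism degree
    n = min(math.ceil(n_lo / p) * p, n_hi)           # Lemma 2: smallest feasible multiple of p
    H = math.ceil(n / p) * (b2 * total_in / n + b3 * total_out / n
                            + b4 * total_comp / n + b5)
    return H, p, n

def optimize(pieces, total_in, total_out, total_comp):
    results = (optimize_piece(pc, total_in, total_out, total_comp) for pc in pieces)
    return min((r for r in results if r is not None), default=None)
```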

4.3 Example: sorting

Sorting plays a central role in data analysis; we therefore first demonstrate how to apply abstract piecewise linear makespan models to the classic sort algorithm in Hadoop MapReduce. Consider a user who is satisfied with the Hadoop defaults for the map phase (one map task per file chunk, assign tasks to all workers), but would like to optimize the reduce phase. She performs the following analysis to leverage our approach.

Let N denote the size of the input data. The map phase only shuffles the input, hence reduce phase input size is \({\mathcal {I}} = N\). Its output is the sorted data set, resulting in \({\mathcal {O}} = N\). Reduce tasks simply merge pre-sorted runs they receive from the mappers, therefore \({\mathcal {C}} = N\) as well. Note that each reduce task is responsible for a range of keys. A good implementation creates q ranges based on (approximate) q-quantiles, therefore \(Z = \{q\}\) is the set of operator-specific parameters. Note that in order to generate \(n_r\) reduce tasks, one simply sets \(q = n_r\). With \(p_r\) denoting parallelism degree in the reduce phase, this analysis implies for Eq. 2:

$$\begin{aligned} H_\mathrm {r} = \left\lceil \frac{n_r}{p_r}\right\rceil \left( \beta _2 \frac{N}{n_r} + \beta _3 \frac{N}{n_r} + \beta _4 \frac{N}{n_r} + \beta _5\right) = \left\lceil \frac{n_r}{p_r}\right\rceil \left( c_{r_0} + c_{r_1}\frac{N}{n_r}\right) , \end{aligned}$$

where \(c_{r_0} = \beta _5\) and \(c_{r_1} = \beta _2+\beta _3+\beta _4\). Note how terms for variables with the same function collapse in the linear model.

Overall, the user only had to select the homogeneous-task case and specify \({\mathcal {I}} = {\mathcal {O}} = {\mathcal {C}} = N\). Then our approach automatically solves \({{\mathrm{argmin}}}_{n_r, p_r} H_\mathrm {r}\). Using Lemma 2, the optimizer immediately derives the optimal settings of \(p_r\) and \(n_r\) for each piece of the piecewise linear model, selecting the pair with the lowest predicted makespan as the global winner. Details are shown in Algorithm 2. Instead of exhaustively exploring many \((n_r, p_r)\) combinations, optimization cost is linear in the number of model pieces. Using a larger number of linear pieces improves model accuracy, but increases optimization cost—a directly tunable tradeoff.

Algorithm 2

To appreciate how the optimization process takes task interactions and bottlenecks into account, consider first the special case where the model consists of a single linear piece covering parallelism degrees (0, w], input size (0, x], and output size (0, x], for some sufficiently large \(x > N\). The for-loop in Algorithm 2 would be executed once, returning \(p_r=w\) and \(n_r=\min \{w; N\} = w\). (Note that \(N/x < 1\) and we assume \(N \ge w\), i.e., the number of workers does not exceed the number of input records.) Stated differently, the algorithm determines that the problem should be partitioned into w tasks—one per worker—and all tasks should be executed in a single wave in parallel.

Now consider a cluster of w/2 dual-core machines and assume that when using both cores on a worker, the memory bus on the worker slows down the data transfer rate from memory to core, causing the cores to wait for data. During model training, our approach would automatically determine from the training data that two different linear models are needed: one covering parallelism degree \(p_r \in (0,w/2]\), and the other \(p_r \in (w/2, w]\). The for-loop in Algorithm 2 now compares predicted makespan for two configurations of \((p_r,n_r)\): (w/2, w/2) for the model covering \(p_r \in (0,w/2]\) and (w, w) for the model covering \(p_r \in (w/2, w]\). Stated differently, if the memory-bus bottleneck leads to a severe slowdown, the optimal solution may be to use only half of the cores—one per machine—and execute the reduce phase in a single wave of w/2 concurrently executed tasks. This perfectly captures the intuition that if the memory bus is the bottleneck (and not the CPU), then it may be better to only use one of the two cores per machine.

4.4 Example: dense matrix product

Dense matrix multiplication represents a more challenging workload with high data transfer costs, but also significant CPU load in some rounds due to the large number of multiplications and additions. Furthermore, matrix partitioning increases total cost due to data replication. Dense matrix multiplication was identified as an important computation problem in a recent UC Berkeley survey on the parallel computing landscape [3].

Fig. 5 Block-wise parallel matrix multiplication in 4 rounds. U is partitioned into \(2\times 2\) blocks, V into upper and lower half, i.e., \((B_0,B_1,B_2)=(2,2,1)\)

Consider a programmer who implemented the classic block-partitioning algorithm for dense matrix-matrix multiplication in MapReduce. As illustrated in Fig. 5, input matrix U with dimensions \(N_0\times N_1\) is partitioned into \(B_0 \cdot B_1\) blocks, each of size \(N_0 / B_0\) by \(N_1 / B_1\); V (with dimensions \(N_1\times N_2\)) is partitioned into \(B_1 \cdot B_2\) blocks, each of size \(N_1 / B_1\) by \(N_2 / B_2\). Each block from U will be multiplied with the \(B_2\) corresponding blocks from V, for a total of \(B_{0} \cdot B_1 \cdot B_{2}\) block-pair multiplication tasks. Note that each U block is duplicated \(B_2\) times, each V block \(B_0\) times. The data duplication (map: round 1) and local multiplication (reduce: round 2) form the multiplication job (m-job). If \(B_1 > 1\), then each block-pair product represents only a partial result. In that case an aggregation job (a-job) needs to read and re-shuffle these partial results (map: round 3) and sum them up (reduce: round 4).

Based on this understanding of the computation, the programmer now proceeds as discussed for the sort program, expressing input, output, and computation in terms of the partitioning parameters. In addition to \(p_i\) and \(n_i\), \(i \in \{1,2,3,4\}\)—the parallelism degree and number of tasks in each of the four rounds—this includes the operator-specific partitioning parameter set \(Z = \{B_0, B_1, B_2\}\). Note that \(n_2 = B_0B_1B_2\) according to the above analysis.

The resulting per-round makespan models are as shown below. For readability, we present the version with collapsed terms for variables with identical functions. (Note that rounds 3 and 4 are executed if and only if \(B_1 > 1\).)

$$\begin{aligned}&H_1 = (c_{1_0} + c_{1_1} (N_0 N_1 + N_1 N_2) / n_1 + c_{1_2} (N_0 N_1 B_2 + N_1 N_2 B_0) / n_1) \cdot \lceil {n_1} / p_1 \rceil , \\&H_2 = (c_{2_0} + c_{2_1} (\frac{N_0N_1}{B_0B_1} + \frac{N_1N_2}{B_1B_2}) + c_{2_2} \frac{N_0N_2}{B_0B_2} + c_{2_3} \frac{N_0N_1N_2}{B_0B_1B_2}) \cdot \lceil B_0 B_1 B_2 / p_2 \rceil , \\&H_3 = (c_{3_0} + c_{3_1} N_0 N_2 B_1 / n_3) \cdot \lceil {n_3} / p_3 \rceil , \\&H_4 = (c_{4_0} + c_{4_1} N_0 N_2 B_1 / n_4 + c_{4_2} N_0 N_2 / n_4) \cdot \lceil {n_4} / p_4 \rceil . \end{aligned}$$

The problem partitioning that minimizes estimated makespan is defined as \({{\mathrm{argmin}}}_{B_0, B_1, B_2, p_1, p_2, p_3, p_4, n_1, n_3, n_4} H_1 + H_2 + H_3 + H_4\). With traditional cost models, this would require trial-and-error exploration of a 10-dimensional search space. Using our approach, we can again leverage Lemma 2 to derive optimal settings for all parallelism degrees and task numbers. Hence the optimization problem simplifies to

$$\begin{aligned} \mathop {{{\mathrm{argmin}}}}\limits _{B_0, B_1, B_2} H_1 + H_2 + H_3 + H_4. \end{aligned}$$
(3)

This reduces optimization cost by orders of magnitude, from search in 10 dimensions to 3 dimensions. (Note that optimization cost is linear in the total number of linear pieces, across all rounds.)
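For illustration, a minimal sketch of this reduced search is given below, simplified to a single linear piece per round (so Lemma 2 gives \(p_i = n_i = w\) for rounds 1, 3, and 4, while \(n_2 = B_0 B_1 B_2\) is fixed and \(p_2 = w\)); all coefficient values are hypothetical placeholders.

```python
# Sketch: 3-dimensional grid search over (B0, B1, B2) of Eq. 3, assuming a single
# linear piece per round. Coefficient tuples c1..c4 are made-up placeholders.
import itertools, math

def predicted_makespan(N0, N1, N2, B0, B1, B2, w, c1, c2, c3, c4):
    n1 = n3 = n4 = w                                  # Lemma 2, single-piece case
    H1 = (c1[0] + c1[1] * (N0*N1 + N1*N2) / n1
          + c1[2] * (N0*N1*B2 + N1*N2*B0) / n1) * math.ceil(n1 / w)
    H2 = (c2[0] + c2[1] * (N0*N1/(B0*B1) + N1*N2/(B1*B2))
          + c2[2] * N0*N2/(B0*B2)
          + c2[3] * N0*N1*N2/(B0*B1*B2)) * math.ceil(B0*B1*B2 / w)
    if B1 == 1:
        return H1 + H2                                # aggregation job not needed
    H3 = (c3[0] + c3[1] * N0*N2*B1 / n3) * math.ceil(n3 / w)
    H4 = (c4[0] + c4[1] * N0*N2*B1 / n4 + c4[2] * N0*N2 / n4) * math.ceil(n4 / w)
    return H1 + H2 + H3 + H4

def best_blocking(N0, N1, N2, w, coeffs, b_max=8):
    grid = itertools.product(range(1, b_max + 1), repeat=3)
    return min(grid, key=lambda b: predicted_makespan(N0, N1, N2, *b, w, *coeffs))
```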

5 Model training

Recall the basic approach to model training as introduced in Sect. 2.1: A suite of profiling benchmarks is executed on the cluster, producing a labeled set of training records. Each record is a tuple of values for the model variables (input, output, computation, number of tasks), and its label is the corresponding observed makespan. The values of the \(\beta \)-parameters are determined from the labeled data using the standard least squares approach. While this follows standard machine learning practice, there are a few subtleties in the specific context of our problem.

Representative training data Mainstream supervised learning methods, including linear regression, assume that training data is drawn from the same distribution as the “unseen” data, i.e., inputs for which the model will be used to predict the label. For our makespan model, this means not only should the profiling benchmarks include a variety of input sizes, output sizes, and computation costs, but they also need to be executed for a variety of parallelism degrees. Furthermore, we propose on-the-fly model fitting based on the following two observations: First, due to the small number of variables and the simple model structure, it does not take more than a few dozen training records to estimate the model coefficients. Second, we are dealing with a relatively simple function surface. Makespan is (approximately) monotonically increasing in input size, output size, and computation cost. Hence the best training records for predicting the makespan of an unseen data point are those “closest” to it. When predicting makespan for a given configuration at runtime, we therefore select the 30 most similar training records and determine the \(\beta \)-coefficients on-the-fly from these 30 training records. On-the-fly model fitting takes only milliseconds.
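A minimal sketch of this on-the-fly fitting step is given below; the normalized Euclidean distance used to define "most similar" is one plausible choice rather than a prescribed one.

```python
# Sketch: on-the-fly fitting of the betas from the ~30 most similar training records.
import numpy as np

def predict_on_the_fly(train_X, train_y, query, k=30):
    """train_X: rows (I, I_m, O_m, C_m, tau_m); train_y: observed makespans; query: one row."""
    scale = train_X.std(axis=0) + 1e-9                  # normalize each model variable
    dist = np.linalg.norm((train_X - query) / scale, axis=1)
    idx = np.argsort(dist)[:k]                          # k most similar benchmark runs
    X = np.hstack([np.ones((len(idx), 1)), train_X[idx]])
    betas, *_ = np.linalg.lstsq(X, train_y[idx], rcond=None)
    return float(betas @ np.concatenate(([1.0], query)))
```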

A surrogate measure for makespan In a distributed execution, determining start and end time of a round requires careful measurement and a solid understanding of system internals. (Notice that this is a challenge for the system admin, not for users and application developers!) On the other hand, it is fairly straightforward to measure running times of individual tasks. We are therefore interested in exploring whether our makespan models remain usable if actual makespan measurements in the training data are replaced by a surrogate measure based on individual task running times. In particular, for the homogeneous-task case we propose as a surrogate measure the product of the number of task waves \(\lceil n/p \rceil \) and the average task running time. Our experiments in Sect. 6.2 show that it indeed works very well.
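Concretely, the surrogate label used in place of a measured round makespan can be computed as follows (a minimal sketch).

```python
# Sketch: surrogate makespan label = number of task waves times average task time.
import math

def surrogate_makespan(task_times, n, p):
    return math.ceil(n / p) * (sum(task_times) / len(task_times))
```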

Determining the value of \(C_m\) The values for total input size \({\mathcal {I}}\) of a round, as well as input size \(I_m\), output size \(O_m\), and number of tasks \(\tau _m\) on the most loaded worker, are easy to observe during benchmark execution. For computation cost \(C_m\), we have to apply the user-provided function to the observed variables the function depends on. Consider an operator where a partition’s computation complexity is \(n \log n\) for input of size n. If the most loaded worker received two partitions whose input sizes are \(n_1\) and \(n_2\), respectively, then we record \(C_m = n_1 \log n_1 + n_2 \log n_2\).
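The bookkeeping for this example could be sketched as follows, with the \(n \log n\) complexity function as an assumed user-provided input.

```python
# Sketch: deriving C_m from the user-provided per-partition complexity function,
# here assumed to be f(n) = n*log(n), applied to the partitions on the most loaded worker.
import math

def compute_C_m(partition_sizes, complexity=lambda n: n * math.log(n)):
    return sum(complexity(n) for n in partition_sizes)

# Two partitions of sizes n1 and n2 give n1*log(n1) + n2*log(n2).
```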

Simpler model for collapsed terms Recall from the analysis of sorting and matrix product how model terms collapse when variables are expressed by the same function. To exploit this, one can fit a model in the lower-dimensional space of collapsed terms. This enables the use of smaller training sets.

6 Experiments

We implemented all algorithms in Hadoop MapReduce or Spark, and conducted experiments on eleven different systems with diverse properties. They include in-house clusters (9h36, 2h24, 6s12, 7h14), a research cluster (20h160) provided by CloudLab [32], and EMR clusters of various sizes (EmrX, where \(X=10,20,30,40,50\), and Emr12s) on Amazon Web Services. For details see Table 2.

Table 2 Cluster specifications

For simplicity, in most experiments on Hadoop, the number of map tasks is left at the default value, i.e., total map input size divided by the Hadoop Distributed File System (HDFS) block size. Only for small data sets, whose size is smaller than the product of the desired parallelism degree and the HDFS block size, do we set the number of map tasks equal to the desired parallelism degree.

The design of the profiling benchmark for model-parameter fitting is an open challenge. For applications where the same queries are periodically re-executed, e.g., social network analysis as the network evolves, it would be beneficial to execute a specialized benchmark containing only those queries, covering a narrow range of input and output sizes based on current and near-future graph properties. On the other hand, for an infrequently used custom operator, it might not be feasible to include it in the benchmark at all. Optimal profiling-benchmark design is beyond the scope of this paper. For a proof of concept of our approach, we need to show that for some “reasonable” benchmark, the resulting model is effective for makespan optimization. Note also that the system can collect free training data each time an operator is executed during data analysis. Since join, sorting, and matrix product are common operators, it is reasonable to assume that dozens of training records for them exist. In the experiments, we then fit the collapsed models (e.g., \(H_\mathrm {r} = \lceil n_r / p_r \rceil (c_{r_0} + c_{r_1} N / n_r)\) for the reduce phase of sorting as discussed in Sect. 4.3) to the training records from the same operator. To avoid overly optimistic results, we ensure that there are no more than about 100 training records and that they cover a wide variety of values for the model variables. Furthermore, only simple synthetic data is used for profiling. This way, predictions made for other synthetic data distributions and for real data are a true test of how well the model extrapolates to distributions it has never seen before.

6.1 General model

We first study the general model (Eq. 1), applied to join and sorting.

Queries JOIN computes the full equi-join on the join key, emitting all result tuples. JOIN-AGG computes an equi-join whose results are aggregated on-the-fly as they are generated. (We compute the sum over a non-join attribute.) Only a single output tuple is emitted for each join group. Sorting sorts a set of integers in increasing order.

Data For joins, we use both synthetic and real data sets. Zipf-\(n-z\) denotes a pair of synthetic data sets with Zipf-distributed join attribute, with skew parameter z. If the two inputs have different z, we list both, e.g., Zipf-5m-[1,0] indicates that one data set has \(z=1\), the other \(z=0\). For real-world data sets, skew often falls between 0.25 and 1.0, where \(z=0\) results in a uniform distribution. cloud-5m denotes a pair of real data sets containing 5 million tuples randomly sampled from a set of cloud reports [12]. They are joined on latitude, which was quantized into 10 equi-width bins to model a climate-zone based correlation analysis. ebird-all is another real data set containing 1.89 million bird sightings, each with 1657 attributes describing properties of the observation event, climate, landscape, etc. [22]. ebird-basic is the same set, but with only the 953 most important columns. For both eBird data sets, we compute the self-join on three Boolean attributes, capturing presence (yes or no) of the top-3 most frequently reported bird species in North America. This was motivated by correlation studies exploring habitat properties based on species appearance patterns.

These data sets cover a wide range of values of \({\mathcal {I}}\), \(I_m\) and \(O_m\). Specifically, for JOIN, we have \({\mathcal {I}} \in [10^6, 2 \times 10^9]\), \(I_m \in [10^5, 5\times 10^8]\) and \(O_m \in [9\times 10^7, 2\times 10^9]\); for JOIN-AGG, we have \({\mathcal {I}} \in [5\times 10^8, 2\times 10^{11}]\), \(I_m \in [2\times 10^8, 4\times 10^9]\), and \(O_m \in [10^{10}, 5\times 10^{11}]\). The profiling benchmark consists of 100 join queries (20 per parallelism degree; covering degrees 10, 20, 30, 40, and 50), executed on simple synthetic data, each generated as follows. First, we draw a value for \({\mathcal {I}}\), \(I_m\), and \(O_m\) uniformly at random from the above ranges. Then we create a pair of join inputs with the following properties: There are p different join keys such that the largest group has input and output size \(I_m\) and \(O_m\), respectively. The other \(p-1\) join groups are equal to each other in input and output per group, selected so that total input and output size reach the target values. Despite being trained on these simple data distributions, the makespan model is very accurate for the Zipf and the real data as discussed below.

For Sorting, we create data sets consisting of 100 million to 1.2 billion randomly generated numbers of type Long (8 bytes per record) for training and testing on EmrX, \(X \in \{10,20,30,40,50\}\). These data sets result in \({\mathcal {I}} \in [10^8, 1.2\times 10^9]\) and \(I_m = O_m \in [2\times 10^6, 1.2\times 10^8]\). The profiling benchmark is generated as discussed for joins by randomly drawing 50 configurations from the ranges (10 per parallelism degree).

6.1.1 Prediction accuracy

To test our approach, we apply it to real data and to synthetic data not used for training. Figure 6 shows that our model can accurately predict makespan, for both training and test data, and on both the local cluster 7h14 and in the cloud on Emr30. Not surprisingly, the simpler sort operation is easier to predict. Relative error tends to be around 1% and never exceeds 5%. For JOIN, the root mean squared error (RMSE) for training and test data are 332.97 and 148.66, respectively. For JOIN-AGG, they are 157.27 and 158.67, respectively. For Sorting, they are 15.81 and 6.57, respectively.

Fig. 6 General model: predicted versus measured makespan. Both training (red dots) and test cases (green triangles) are near the blue dotted perfect-prediction line

6.1.2 Optimizing operator-specific parameters

We apply the general model to determine the optimal values of operator-specific parameters for Join and Sorting on Amazon EMR clusters consisting of virtual machines with a single core each. To isolate the effect of these parameters, we set parallelism degree and number of tasks to the number of partitions created based on the operator-specific parameters.

For Join, more fine-grained partitioning increases \({\mathcal {I}}\), while it may or may not decrease \(\beta _2 I_m + \beta _3 O_m + \beta _4 C_m\). The challenge for the optimizer is to determine the right tradeoff between these two factors. Figure 7a shows results for the Cartesian product of two data sets, each containing 660,000 tuples with 1000 integer attributes. Our model predicted the best makespan for 12 partitions, and Fig. 7a indicates that this indeed is the winning setting. Similarly, for Sorting, the model suggested using 50 partitions on all 50 workers, independent of input size. Figure 7b confirms that this indeed minimizes makespan.

Fig. 7 Optimal operator-specific partitioning

6.1.3 Safe pruning for distributed join

Figure 8 reports load and predicted makespan as Algorithm 1 greedily explores join-group partitionings. Max load first decreases due to improved load balancing for smaller partitions, quickly approaching average load. Hence Lemma 1 is very effective in determining a safe stopping point beyond which no better makespan can be found. In Figure 8a, in step 11, the average load per worker (blue dots) reaches \(1.41\times 10^9\), which is greater than the minimal max load (red dots) found before (\(1.40\times 10^9\) in step 3). Hence it is safe to terminate the greedy algorithm in step 11. Similarly, in Fig. 8b, the average load in step 51 (\(8.39\times 10^8\)) surpasses the minimal max load (\(8.38\times 10^8\) in step 30) and hence the greedy algorithm can safely terminate. Note that if the user is satisfied with a partitioning within 10% of optimal, then Lemma 1 makes it possible to determine that iterations can be terminated already after 16 steps.

Fig. 8 Number of greedy partitioning steps versus load and predicted makespan (running time of the join) for JOIN-AGG

Our experiments generally show that it is practical and efficient to use the safe termination condition. Table 3 lists the number of extra steps (from the step where the optimal partitioning was found) until safe termination was detected based on Lemma 1. Note that 40 steps take only about half a second of computation time.

Table 3 Additional steps after finding the optimum, until safe termination

6.2 Homogeneous-task model

The main purpose of these experiments is to provide a proof of concept that the simplified homogeneous makespan model with a “small” number of linear pieces is accurate enough to rank “good” above “bad” partitionings. In all experiments, the piecewise linear model for a round had between 1 and 7 pieces. To explore the feasibility of the easier-to-obtain surrogate makespan measure, we replaced the true makespan with the surrogate measure (Sect. 5) in all training records. This makes the model less accurate, but as our experiments will demonstrate, it is still effective for makespan optimization.
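As discussed in Sect. 5, the surrogate label is the number of waves multiplied by the average task time. The sketch below shows how such a training label could be derived from per-task measurements; the function and its example values are illustrative, not the exact profiling code.

```python
# Hedged sketch of the surrogate makespan label used for training the
# homogeneous-task model: number of waves times average task time.
from math import ceil
from statistics import mean

def surrogate_makespan(task_times, parallelism):
    """Approximate round time assuming homogeneous tasks executed in waves."""
    waves = ceil(len(task_times) / parallelism)
    return waves * mean(task_times)

# Example: 120 tasks of roughly 30 s each on 36 workers -> 4 waves of ~30 s.
print(surrogate_makespan([30.0] * 120, parallelism=36))
```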

6.2.1 Sorting

Fig. 9 Sorting: measured value of H versus input size on 9h36 for Map (left) and Reduce (right) phase

Fig. 10 Sorting: predicted versus measured value of H on 9h36 for Map (left) and Reduce (right) phase

We present measurements on clusters 9h36 and Emr10. All piecewise models for 9h36 are partitioned into ranges (0, 18] and (18, 36] on parallelism degree. Partitioning on task input and output size varies. We created 54 queries, each defined by a data set and a number of waves for execution. Data sets are drawn from a pool of 15 sets, each with a cardinality selected randomly between 100 million and 2.7 billion, containing random numbers of type Long (8 bytes per record). The number of waves is selected randomly between 1 and 10. We randomly select 41 of these queries to fit the regression-model parameters, while the other 13 are used for testing.

Figure 9 presents the relationship between input size and the value of H (Eq. 2). The y-axis reports H computed from observed \({\mathcal {I}}\), \({\mathcal {O}}\), and \({\mathcal {C}}\). (Degree of parallelism was set to the number of workers for all runs.) The dotted green line shows a piecewise linear model fitted to the data.
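For illustration, the following sketch fits such a piecewise linear model with a single, pre-chosen breakpoint on input size. The breakpoint, the synthetic data, and the two-segment form are assumptions for this example only; our actual models are partitioned on parallelism degree and on task input/output size as described above.

```python
# Hedged sketch: fit one linear segment below and one above a fixed breakpoint.
# Breakpoint and data are illustrative; this is not the fitted 9h36 model.
import numpy as np

def fit_piecewise_linear(x, y, breakpoint):
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo, hi = x <= breakpoint, x > breakpoint
    return np.polyfit(x[lo], y[lo], 1), np.polyfit(x[hi], y[hi], 1)

def predict_H(model, x, breakpoint):
    (a_lo, b_lo), (a_hi, b_hi) = model
    x = np.asarray(x, float)
    return np.where(x <= breakpoint, a_lo * x + b_lo, a_hi * x + b_hi)

# Synthetic data where H grows with a steeper slope for large inputs.
xs = np.linspace(1e8, 2.7e9, 40)
ys = np.where(xs <= 1e9, 2e-8 * xs + 5.0, 5e-8 * xs - 25.0)
model = fit_piecewise_linear(xs, ys, breakpoint=1e9)
print(predict_H(model, [5e8, 2e9], breakpoint=1e9))   # approximately [15.0, 75.0]
```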

Figure 10 compares predicted and measured values of H for the map and reduce phases of sorting on cluster 9h36. The red dots are for training cases, while the green triangles are for test cases. All individual times and the overall trend are captured very accurately: the relative errors are mostly around 1% and never exceed 5%. For the map phase, the RMSEs for training and testing are 0.98 and 1.16, respectively; for the reduce phase, they are 3.29 and 1.81, respectively.

Table 4 shows that accurate estimation of H can still result in significant underestimation of makespan. This is caused by the use of the surrogate measure (number of waves times average task time, instead of true makespan, for model training), which does not capture delays caused by stragglers. However, this bias is consistent, allowing the model to capture the trend correctly, regardless of whether all cores or only half of them are used per machine. For large inputs, it identifies the I/O-related bottleneck: doubling the number of cores used per machine results in virtually no improvement in makespan once the data size reaches 1.6 billion records.

6.2.2 Matrix multiplication

All models are partitioned into parallelism-degree ranges based on multiples of the number of machines in the cluster; partitioning on input and output size varies. The training set consists of 104 problem instances, covering 12 different matrix-size combinations (square matrices from \(10k\times 10k\) to \(30k\times 30k\), as well as extremely rectangular ones up to \(200\times (4\times 10^6)\)), each with 3 to 20 \((B_0,B_1,B_2)\)-combinations. We randomly pick the matrix sizes and \((B_0,B_1,B_2)\)-combinations from these ranges. We then test the model on 57 independent problem instances drawn from the same distribution. As Fig. 11 shows, predicted and true values of H are again very close. The training RMSEs for the four rounds (2 MapReduce jobs) are 6.99, 9.78, 1.69 and 6.82, respectively; the testing RMSEs are 4.95, 5.86, 1.11 and 3.00, respectively.
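To illustrate how a per-round model is used for choosing a blocking, the sketch below enumerates candidate \((B_0,B_1,B_2)\) combinations for an \(n_0\times n_1\) times \(n_1\times n_2\) product and ranks them by a predicted round time. The per-task predictor, its coefficients, the assumed 36 workers, and the restriction to the block-multiplication round are all illustrative assumptions; the actual implementation runs two MapReduce jobs (four rounds), each with its own fitted piecewise linear model.

```python
# Hedged sketch: rank (B0, B1, B2) blockings by predicted time of the
# block-multiplication round. Coefficients and worker count are placeholders.
from itertools import product
from math import ceil

def block_task_stats(n0, n1, n2, b0, b1, b2):
    """Input, output, and compute size of one block-pair multiplication task."""
    in_size = (n0 / b0) * (n1 / b1) + (n1 / b1) * (n2 / b2)
    out_size = (n0 / b0) * (n2 / b2)
    compute = (n0 / b0) * (n1 / b1) * (n2 / b2)
    return in_size, out_size, compute

def predicted_task_time(stats, coeffs=(1.0, 1e-7, 1e-7, 1e-9)):
    c0, c_in, c_out, c_comp = coeffs            # placeholder linear model
    in_size, out_size, compute = stats
    return c0 + c_in * in_size + c_out * out_size + c_comp * compute

def rank_blockings(n0, n1, n2, candidates, workers=36):
    scored = []
    for b0, b1, b2 in candidates:
        tasks = b0 * b1 * b2                    # one task per block pair
        waves = ceil(tasks / workers)
        time = waves * predicted_task_time(block_task_stats(n0, n1, n2, b0, b1, b2))
        scored.append((time, (b0, b1, b2)))
    return sorted(scored)

candidates = list(product([1, 2, 3, 6], repeat=3))
print(rank_blockings(20_000, 20_000, 20_000, candidates)[:3])
```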

Table 4 Degree of parallelism versus measured and predicted makespan on 9h36
Fig. 11 Matrix product: predicted versus measured value of H on 9h36. The test cases (red dots) are near the perfect-prediction line (blue dotted line) (Color figure online)

As for sorting, our model underestimates true makespan due to the use of the surrogate measure, but it can still correctly separate “good” from “bad” partitionings. In all cases our approach would find a near-optimal configuration. Table 5 confirms this for both synthetic and real data sets (from the UCI Machine Learning Repository [20]); there our technique is applied to the step where the data matrix is multiplied by its own transpose. Table 6 confirms that this observation also holds for Spark.

Note that for both real data sets (Table 5d, e), our model correctly discovers that setting \((B_0, B_1, B_2)\) to (1, 18, 1) results in lower makespan than (1, 36, 1). We confirmed that, due to I/O bottlenecks, it is better to use only half of the available cores per machine, even though round 2 performs a huge number of arithmetic operations (more than \(11 \times 10^9\) for the Census data).

Table 5 Ranking quality: predicted versus true makespan (in seconds) for matrix product (Hadoop MapReduce; (a)–(c) are synthetic data, (d) and (e) are real data)
Table 6 Ranking quality: predicted versus true makespan (in seconds) for matrix product (synthetic data, Spark)

7 Related work

Structured cost models that capture execution details are essential for query optimization in relational DBMS [24], and they can be highly accurate when tuned [33]. Recent work has shown that DBMS-style optimization can also be applied to other workloads, e.g., gradient descent computation that commonly occurs in machine learning [16]. Li et al. [18] rely on DBMS optimizers, and hence low-level cost models, to determine asymmetric data partitioning for heterogeneous clusters. When applied to homogeneous clusters of equally capable machines, e.g., on the Amazon cloud, these models assign the same input share to each worker, but do not optimize the partitioning parameters discussed in this article.

DBMS cost models motivated similar approaches for MapReduce and other distributed data analysis systems [14, 21, 29, 31, 34]. Simplified cost models for Hadoop and Spark have also been proposed [10, 11], but they focus on the impact of adding more worker machines, ignoring the impact of operator-specific partitioning parameters. Our work is orthogonal to research on lowering the cost of MapReduce programs by minimizing the number of rounds [9, 17].

As an alternative to structured cost models, blackbox-style machine learning techniques have been explored for a variety of performance prediction problems [2, 5, 6, 8, 13]. For all previous cost models, the effect of partitioning parameters on makespan is relatively complex, and hence makespan minimization would have to rely on trial-and-error-style exploration of possible parameter settings. For dense matrix multiplication, this corresponds to a 10-dimensional space of \((B_0, B_1, B_2, p_0, p_1, p_2, p_3, n_1, n_3, n_4)\) combinations. (Note that Ernest [29] could possibly derive optimal settings for all \(p_i\), \(i=0,\ldots , 3\), reducing complexity to 6 dimensions.) In contrast, our approach sacrifices some prediction accuracy to simplify model structure. This enables analytical derivation of optimal settings for most parameters, reducing complexity to 3 dimensions for dense matrix multiplication.

Shi et al. [25] identify four key system parameters to optimize MapReduce makespan. While similar in spirit to our approach, they do not include operator-specific partitioning parameters in their analysis. Moreover, due to the complexity of their model, there are no results comparable to our lemmas that would enable more efficient optimization of the key parameters.

We use dense matrix multiplication to showcase model design and makespan optimization for an analytics operator with a demanding I/O and CPU profile. Previous work explored a variety of performance-related aspects for matrix multiplication on parallel architectures. This includes load balancing [28], minimizing communication cost [1, 4, 15, 26], and optimizing for memory hierarchy [7, 27].

8 Conclusions

Starting with the goal of minimizing makespan for distributed data-intensive computation, we set out to identify the “simplest possible”, “sufficiently accurate” model to predict makespan of data analytics operators. To this end, we proposed abstract models that are piecewise linear functions depending only on input, output, and computation complexity. Our approach has two main benefits. First, it simplifies tying problem-partitioning parameters to model variables (input, output and computation) for user-defined operators, e.g., programs written in MapReduce or Spark. Second, we showed that the linear structure can be exploited for more efficient optimization algorithms. It enabled pruning of values from the optimization-parameter search space and even a significant reduction of search space dimensionality (for homogeneous tasks). For instance, optimization complexity was reduced from a search process in ten dimensions to only three for matrix product; for sorting the optimal solution was directly obtainable in closed form.

Our experiments indicated that a small number of pieces achieves sufficient prediction quality, enabling us to find near-optimal problem partitionings very efficiently. In future work, we will explore tuning partitioning parameters together with system parameters external to user programs, by integrating our ideas into optimizers like Starfish [13].