1 Introduction

Data parallelism is the most common parallel decomposition strategy, by which an application's domain is decomposed into as many partitions as there are workers assigned to the computation. Such a strategy neglects the cache hierarchy and hence, in many cases, does not harvest the benefits provided by the (consistently growing amount of) cache hardware available in current computers, from laptops to high-end server nodes.

An adequate mapping of a computation onto the underlying memory hierarchy is crucial to fully harness the computational power of modern architectures. However, cache memory management is completely transparent to user-level programming. That responsibility typically falls upon the hardware, whose only function is to guarantee that recently accessed data are kept closer to the computing unit(s) than the remainder, since they will likely be accessed again. Cache replacement algorithms are based upon heuristics, such as least recently used, that do not always serve the application's best interest, given that they base their decisions on historic information rather than on information about future accesses. This limitation has a major impact on the performance of computations that are sensitive to temporal locality, such as stencil computations and computations over matrices.

To overcome this problem, it is up to the compiler, the run-time system or, ultimately, the programmer to implement efficient application-specific mappings that maximize the cache hit ratio. Cache-guided optimizations in mainstream compilers consist essentially of loop transformations [12] directed at sequential loops, which, in the context of parallel computing, may only be applied to the internal execution of tasks. In fact, the data locality issue in parallel computing has been mostly addressed at the language level, via linguistic constructions for the explicit programming of the memory hierarchy [3, 7, 10, 17–19]. However, these place a heavy burden on the programmer, requiring in-depth knowledge of parallel programming and computer architecture. Moreover, they apply divide-and-conquer strategies that cannot be used in frameworks that cleanly separate the problem decomposition stage from the execution stage, such as MapReduce [5]. The same reasoning may be applied to cache-oblivious algorithms [8], which oblige the programmer to design divide-and-conquer algorithms that do not depend on the specificities of a particular cache hierarchy.

In this work, we are particularly interested in the aforementioned class of computations, where an initial decomposition stage generates a set of partitions, upon which a computation stage is subsequently applied in parallel. More precisely, we are interested in exploring how this decomposition stage may enhance the data locality properties of the ensuing computation, so that the latter may transparently benefit from a good mapping onto the underlying memory hierarchy. To carry out this endeavour, we have been researching how to address some key challenges, namely how to deduce mappings for specific memory hierarchy configurations from the same source code, and how to ensure performance portability across a wide range of configurations. Overall, the main contributions of this paper are:

  1. domain decomposition principles and algorithms that take into consideration the complex memory hierarchies of current computer architectures, and cleanly separate the concern of cache-conscious decomposition from the remainder,

  2. implementation of these algorithms in the context of the Elina framework [15] for Java parallel computing, and

  3. a study that compares our cache-conscious approach against the classical, cache-neglectful strategy.

The presented approach stands out from the state-of-the-art because it requires no source code modifications to enable cache-conscious decomposition. The programmer only has to supply generic (non-application-specific) information on how to partition the data structures in use.

With regard to the comparative study, to the best of our knowledge, no such study exists. Works such as [3, 7, 19] simply present speedup analyses against sequential versions of the benchmarks. There is no evidence that the effort required from the programmer (to explicitly map the computation onto the memory hierarchy) actually delivers performance gains when compared to simpler strategies. Conversely, our study attests to the validity of our proposal, showing that it delivers significant speedups to computations that are particularly sensitive to temporal locality.

The remainder of this paper is structured as follows: Sects. 2 and 3 present the principles and implementation of our cache-conscious decomposition of data parallel computations; Sect. 4 evaluates our proposal from a performance perspective, with a particular focus on the comparison against the classical strategy for domain decomposition; Sect. 5 positions our approach relative to the current state-of-the-art; finally, Sect. 6 presents our conclusions and prospective future work.

2 Cache-conscious domain decomposition

Domain decomposition in distributed memory environments is a two-stage operation: first, the domain is partitioned among the nodes that compose the distributed system and, second, within each node, it is further partitioned among the worker threads locally assigned to the computation. We argue that these two stages must be clearly decoupled, so that stage-specific optimizations may be devised. For instance, handling node heterogeneity only makes sense at cluster level, while cache awareness only makes sense at node level.

The focus of this paper is on how to exploit domain decomposition (at node level) to enhance the locality properties of data parallel computations. To that end, we leverage the cache hardware available at each node and apply a cache-conscious strategy that takes into account the characteristics of some (or all) levels of the target memory hierarchy, and not only the number of worker threads assigned to the computation (the standard horizontal approach). As a result, the number of partitions becomes a function of the target machine's cache hierarchy.

Figures 1 and 2 highlight the differences between the horizontal decomposition approach and our cache-conscious proposal. In the latter case, the domain is decomposed in such a way that each partition fits a given target cache level (TCL), i.e. its size is a function of that level's capacity. Additionally, the behaviour of each worker assigned to the computation is modified so that it iteratively applies the user-defined computation upon a stream of partitions, rather than upon a single one. The number of workers is preserved, and the amount of work performed by each worker is also generally preserved, but the granularity of the data upon which the user-defined computation is individually applied is (potentially) much smaller, thus enhancing locality.

Fig. 1 Application of a user-defined computation over a horizontally decomposed domain

Fig. 2 Application of a user-defined computation over a cache-consciously decomposed domain

To successfully apply our strategy, we have to address two distinct challenges: (i) how to decompose a computation’s domain so that such computation may make better use of the available cache hardware and (ii) how to efficiently schedule the resulting tasks onto the available set of workers, while also leveraging data locality.

2.1 Decomposition

For the sake of generality, we allow a computation’s input data set (our domain to decompose) to be built from multiple sub-domains, typically individual data structures, each with its own decomposition strategy. For example, in the classic matrix multiplication, one can implement a single decomposition strategy that splits the 3 matrices involved, or decompose the 3 matrices individually and have a way of combining the resulting individual partitions. In this latter strategy, a partition of the input data set must comprise a partition of each of its sub-domains.

Naturally, the number of indivisible units of each sub-domain may not be a multiple of the number of desired partitions. This may be easily addressed by distributing the remainder units among the regular-sized partitions, causing an imbalance of, at most, one indivisible unit. However, problem-specific constraints may impose further restrictions upon the number of partitions and/or the geometry of the decomposition as a whole. Stencil computations present this kind of restriction with regard to the number of elements and their relative positions. For example, consider a computation over a two-dimensional grid, where at instant \(t_{i+1}\), the value of each of the grid's elements is a function of its own value and of its 8 adjacent elements at instant \(t_i\). To meet these restrictions, each partition of the domain must comprise \(3 \times n\) elements, with \(n \ge 3\), organized in grids of, at least, \(3\ \text {rows} \times 3\ \text {columns}\).

This problem-specific information must be conveyed in the decomposition algorithms supplied by the programmer, which we refer to as distributions, and must, therefore, be included in the interface that regulates the implementation of such algorithms (Table 1). Additionally, to perform our cache-conscious systematic decomposition, we require an extra set of functions, whose purpose we will unveil as we progress along this section.

Table 1 The Distribution<T> interface

2.1.1 Determining the number and size of the partitions

Given a domain D, composed of d sub-domains (\(D= \bigcup _{i=0}^{d-1} D_i\)), its decomposition into a set of partitions \(P_D\), such that each partition fits a given TCL, requires the calculation of a value for the number of partitions (np) such that

$$\begin{aligned} \forall p \in P_D,\quad size(TCL) \ge size(p) = \sum _{i=0}^{d-1} \frac{size(D_i)}{np} \end{aligned}$$

where size() denotes the size in bytes of either the enclosed cache level or partition. Recall that a partition of a domain comprises a partition of each of its sub-domains.

Algorithm 1 Validation of a candidate number of partitions (pseudocode not reproduced here)

The proposed value of np must be validated by each of the distribution algorithms involved (one per sub-domain); this verification logic is defined in the validate function. Algorithm 1 presents the procedure for assessing whether a given number of partitions (np) is valid: it ensures that each sub-domain may be split into that many partitions, according to its distribution algorithm (line 3), and that the summation of the sub-domain partitions' sizes, i.e. the size of a partition of the entire domain (totalPartitionSize), fits in the TCL (line 7). The algorithm depends on an estimation of how many bytes a partition will occupy in the TCL. To enable experimentation with different heuristics, the estimation is delegated to function \(\varphi \) (line 5), supplied as a parameter. The outcome of the algorithm is the validation, or invalidation, of the candidate np value. In the latter case, information concerning values higher than the candidate is also supplied. This information is used to delimit the search space, as subsequently explained.
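For concreteness, the following Java sketch mirrors the validation procedure just described. The Distribution method names follow Table 1, but the exact signatures, the Phi interface and the boolean return type (which drops the feedback about values higher than the candidate) are simplifying assumptions of ours, not Elina's actual API.

```java
import java.util.List;

// Assumed subset of the Distribution<T> interface of Table 1 (names only).
interface Distribution<T> {
    boolean validate(int np);               // can this sub-domain be split into np partitions?
    long getElementSize();                  // size of one element, in bytes
    double getAveragePartitionSize(int np); // average number of elements per partition
}

// phi(cacheLineSize, dist, np): estimated bytes a sub-domain partition occupies in the TCL.
interface Phi {
    long estimate(long cacheLineSize, Distribution<?> dist, int np);
}

final class NpValidator {
    // Mirrors Algorithm 1: np is valid only if every sub-domain accepts the split
    // and the summed per-sub-domain estimates fit in the TCL.
    static boolean isValid(List<Distribution<?>> subDomains, int np,
                           long tclSize, long cacheLineSize, Phi phi) {
        long totalPartitionSize = 0;
        for (Distribution<?> dist : subDomains) {
            if (!dist.validate(np)) {                                    // line 3
                return false;
            }
            totalPartitionSize += phi.estimate(cacheLineSize, dist, np); // line 5
        }
        return totalPartitionSize <= tclSize;                            // line 7
    }
}
```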

To compute the optimal size of a partition that fits in a TCL, we apply a binary search: the value of np begins at the number of workers assigned to the execution (nWorkers) and doubles in every iteration until a valid solution is found or all values larger than np are invalid (according to Algorithm 1). From then onwards, the search interval is continuously narrowed to find the smallest valid np value. Given that the size of each individual partition is inversely proportional to np, such a solution is optimal for the provided input parameters. The nWorkers lower bound guarantees that the algorithm generates at least as many partitions as available workers, to fully exploit the designated resources.
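A minimal sketch of this search, assuming that the validity check of Algorithm 1 is available as a predicate and that validity is monotone in np; the actual implementation also exploits Algorithm 1's feedback on larger values to stop when no valid np exists.

```java
import java.util.function.IntPredicate;

final class PartitionSearch {
    // Doubling phase followed by a narrowing (binary-search) phase, as described above.
    static int smallestValidNp(int nWorkers, IntPredicate isValid) {
        int hi = nWorkers;
        while (!isValid.test(hi)) {   // double until a valid np is reached
            hi *= 2;
        }
        int lo = Math.max(nWorkers, hi / 2 + 1);
        while (lo < hi) {             // narrow down to the smallest valid np
            int mid = lo + (hi - lo) / 2;
            if (isValid.test(mid)) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return hi;
    }
}
```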

2.1.2 The \(\varphi \) function

The definition of the \(\varphi \) function implies a trade-off between accuracy, computational overhead, and wasted cache space. A simple approach (\(\varphi _{s}\)) is to take into consideration neither the target architecture's cache line size (CACHE_LINE_SIZE in the algorithm) nor the partition's geometry. Thus, the function simply computes the number of bytes the partition takes:

$$\begin{aligned}&\varphi _s(cacheLineSize, dist, np) \\&\quad =\, dist.\texttt {getElementSize}() \times \left\lfloor {dist. \texttt {getAveragePartitionSize}(np)+0.5}\right\rfloor \end{aligned}$$

where the result of getAveragePartitionSize is rounded to the nearest integer to better suit the most common expected partition size.

A more conservative estimate (\(\varphi _{c}\)) considers to some extent the two previously neglected dimensions, at the expense of more computational overhead:

$$\begin{aligned}&\varphi _c(cacheLineSize, dist, np) \\&\quad = cacheLineSize \\&\qquad \times \left\lceil \frac{dist.\texttt {getAveragePartitionSize}(np) \times dist. \texttt {getElementSize}()}{dist.\texttt {getAverageFirstDimSize}(np)} \right\rceil \\&\qquad \times \left( \left\lceil \frac{dist.\texttt {getAverageFirstDimSize}(np)}{cacheLineSize} \right\rceil + 1\right) \end{aligned}$$

Function getAverageFirstDimSize returns the average length (in number of elements) of the first dimension of the domain. This information is particularly important for the decomposition of multi-dimensional domains, especially multi-dimensional arrays. We are assuming a row-major order memory layout and, in such a context, the output of the getAverageFirstDimSize function is crucial to understand the breakdown of a partition into cache lines. The use of the average value conveys some extra information to the system when the size of the partitions is not uniform, as happens when the size in bytes of the sub-domains is not a multiple of np.

In \(\varphi _{c}\), the size of the partition's first dimension is adjusted to the boundaries of the cache line. Furthermore, an extra cache line is added to account for the eventual misalignment of the partition with such boundaries. This approach is likely to ensure that the entire working set fits the TCL, but its conservative nature will generally waste more space than the first approach. Table 2 illustrates, for both approaches, the estimated number of bytes that a partition of size size(p) will take in a cache with cache line size size(cl).
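A direct transcription of the two estimators into Java could look as follows; the parameters correspond to the Distribution methods of Table 1, and this is an illustrative sketch rather than Elina's code.

```java
final class PartitionSizeEstimators {
    // phi_s: element size times the (rounded) average number of elements per partition.
    static long phiS(long elementSize, double avgPartitionSize) {
        return elementSize * (long) Math.floor(avgPartitionSize + 0.5);
    }

    // phi_c: transcribed verbatim from the formula above; avgPartitionSize and
    // avgFirstDimSize are in elements, the other arguments in bytes.
    static long phiC(long cacheLineSize, long elementSize,
                     double avgPartitionSize, double avgFirstDimSize) {
        long middleFactor = (long) Math.ceil((avgPartitionSize * elementSize) / avgFirstDimSize);
        long lineFactor = (long) Math.ceil(avgFirstDimSize / cacheLineSize) + 1;
        return cacheLineSize * middleFactor * lineFactor;
    }
}
```

For the matrix multiplication example discussed below (elementSize = 4, avgPartitionSize = 64 × 64 × 3, avgFirstDimSize = 64, cacheLineSize = 64), phiS yields 49,152 bytes and phiC yields 98,304 bytes.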

Table 2 Estimated number of bytes that a multi-dimensional partition of size size(p) bytes (whose first dimension comprises F bytes) occupies in a cache with line size size(cl) bytes
Fig. 3 Block decomposition for the matrix multiplication problem

As an example, consider the block decomposition for the parallel computation of the classic matrix multiplication problem illustrated in Fig. 3. A partition of the domain must comprise a block of each input matrix and space for the computed result block, to be placed in the output matrix. Note, however, that every block partition of the first matrix (A) must be paired with all the block partitions that compose a line of the second (B). Consider now a concrete instance of the problem where both A and B are \(1024 \times 1024\) square matrices of 4-byte-long integers, and a TCL with 64 KB. For tentative np value \(256 = 16 \times 16\) the estimation given by \(\varphi _{s}\) is \((\frac{1024}{16})^2 \times 3~\hbox {matrices} \times 4~\hbox {bytes} = 49{,}152~\hbox {bytes}\), while the one given by \(\varphi _{c}\) is \(64 \times \frac{(\frac{1024}{16})^2 \times 3~\mathrm{matrices} \times 4~ \mathrm{bytes}}{\frac{1024}{16}} \times (\lceil \frac{\frac{1024}{16}}{64} \rceil + 1) = 64 \times 64 \times 3~\hbox {matrices} \times 4~\hbox {bytes} \times (1 + 1) = 98{,}304~\hbox {bytes}\). Thus, \(np = 256\) is valid when using \(\varphi _{s}\) but not when using \(\varphi _{c}\).

Neither of the presented \(\varphi \) functions takes set associativity into consideration. The actual location of the data in the process address space is not made available in many programming languages, e.g. Java. Moreover, the computational complexity required to take such knowledge into consideration would, most likely, outweigh the eventual benefits. Nonetheless, considering a cache replacement algorithm of the least recently used family, keeping a partition's size within the TCL's capacity greatly contributes to having the delimited data fully loaded (minus conflicting cache lines) at such cache level.

2.2 Scheduling

The scheduling stage assigns pairs (instance of the computation, associated partition)—our tasks—to a set of workers. The problem diverges from the common scheduling of data parallel computations because the number of tasks generated by the cache-conscious decomposition approach largely exceeds the number of cores available in the machine.

Revisiting the application of the matrix multiplication algorithm of Fig. 3 to matrices of dimension \(1024 \times 1024\), each block of matrix A will have to be combined with 16 blocks of matrix B. Applying a one-to-one mapping from partitions to tasks will result in a total of \(16 \times 16 \times 16 = 4096\) tasks.

Spawning as many workers as tasks is not viable in this context, as having the number of execution flows far exceed the number of computing resources penalizes performance. In addition, given the small granularity of each task, a purely dynamic work-stealing-based scheduling policy would lead to considerable overheads, since the worker threads would spend a non-negligible percentage of their time fetching work rather than executing it. Thus, performance in this context is highly dependent on an efficient and locality-aware mapping between tasks and workers. Our solution is to perform an initial static scheduling that assigns a cluster of tasks to each worker, which sequentially iterates over the stream of said tasks (recall Fig. 2). We advocate that this static work distribution increases the system's overall performance, since workers do not have to search and compete for work. Nonetheless, in the presence of irregular computations, dynamic scheduling techniques may also be useful to balance the load across the workers. We do not yet explore such techniques, even though we are aware of work in the field that already embeds some hierarchical concerns [13, 21].

In the scope of this work, we present two distinct task clustering strategies. The challenge is to trade off the schedule's efficiency against the overhead (temporal and spatial) that the determination of the next task to execute might impose on the overall execution. More complex clustering strategies are likely to perform more calculations and require more memory, thus stealing cache space from the actual computation's data.

2.2.1 Contiguous clustering (CC)

Assigns an equally sized contiguous cluster of tasks to each worker, according to its unique identifier. Given a set of n workers and a set of m tasks, the worker with rank i is assigned the following cluster of tasks: \([(i\times \frac{m}{n}), ((i+1)\times \frac{m}{n})[\). Whenever m is not a multiple of n, the first r workers (for \(r = m \bmod {} n\)) receive one extra task. Figure 4 illustrates the application of the CC strategy to the scheduling of 14 tasks among 4 workers.

Fig. 4 Contiguous clustering: scheduling of 14 tasks among 4 workers

The rationale behind this strategy is twofold: (1) introduce minimal overhead during scheduling, and (2) exploit spatial locality between tasks operating upon consecutive partitions.
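A minimal sketch of the range computation, applying the remainder rule stated above (class and method names are ours, not Elina's):

```java
final class ContiguousClustering {
    // Returns {firstTask, lastTaskExclusive} for the worker with the given rank,
    // applying the remainder rule of Sect. 2.2.1: the first (nTasks % nWorkers)
    // workers receive one extra task.
    static int[] taskRange(int rank, int nWorkers, int nTasks) {
        int quotient = nTasks / nWorkers;
        int remainder = nTasks % nWorkers;
        int start = rank * quotient + Math.min(rank, remainder);
        int size = quotient + (rank < remainder ? 1 : 0);
        return new int[] { start, start + size };
    }
}
```

For the scenario of Fig. 4 (14 tasks, 4 workers), this yields the ranges [0, 4[, [4, 8[, [8, 11[ and [11, 14[.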

2.2.2 Sibling round-robin clustering (SRRC)

Builds on the fact that the Last Level Cache (LLC) is shared by multiple computing cores. Thus, if two or more workers running on such cores share data, the number of LLC misses will decrease, reducing the number of accesses to main memory. Once again, a matrix multiplication example is paradigmatic, as multiple partitions share blocks of both input matrices.

The scheduling algorithm comprises two distinct assignment levels: the cluster assignment level assigns clusters of tasks to groups of workers, whilst the task assignment level assigns tasks within a cluster to workers within a worker group. At the first level, the size of the task clusters is ruled by the ratio between the LLC and TCL sizes:

$$\begin{aligned} clusterSize = \frac{size({LLC})}{size({TCL})} + \left( cores({LLC}) - \left( \frac{size({LLC})}{size({TCL})} \bmod {} cores({LLC})\right) \right) \end{aligned}$$

The second term simply ensures a proper distribution of the work in the presence of a remainder. It uses the number of cores that share an LLC, denoted cores(LLC), as the distribution unit.

Fig. 5 SRRC cluster assignment: assignment of clusters of tasks among four groups of workers running on sibling cores that share an LLC

Let us represent the resulting set of clusters (C) as \(C = \{c_0,\ldots , c_{n_c-1}\} \), where \(c_i\) denotes a particular cluster and \(n_c\) the number of clusters. Consider now that the set of workers (W) may itself be grouped according to the affinity of the cores (on which the workers carry out their execution) to an LLC. Accordingly, W may be represented as \(W=\{w_0,\ldots ,w_{n_w-1}\}\), where \(w_i\) denotes a particular worker group and \(n_w\) the number of such groups. Given these definitions, the scheduling of C among W follows a round-robin strategy that assigns, to each group of workers \(w_i \in W\), the following subset of clusters:

$$\begin{aligned} C_i = \{c_j \in C | j \bmod {} n_w = i \wedge j < (n_c - (n_c \bmod {} n_w))\} \end{aligned}$$

To guarantee a schedule as even as possible, when the number of clusters is not a multiple of the number of worker groups, the remainder clusters are merged into a special cluster, named the CC Cluster. This cluster also comprises the tasks that could not form a cluster (given by \({\textit{clusterSize}} \times (n_t \bmod {} ({\textit{clusterSize}} \times n_w))\)) and is scheduled according to the CC strategy.
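The following sketch transcribes the cluster size formula and the round-robin assignment of clusters to worker groups; the handling of the CC Cluster is omitted, and the names are ours rather than Elina's.

```java
import java.util.ArrayList;
import java.util.List;

final class SrrcScheduler {
    // clusterSize as defined in Sect. 2.2.2 (transcribed verbatim from the formula).
    static long clusterSize(long llcSize, long tclSize, int coresPerLlc) {
        long ratio = llcSize / tclSize;
        return ratio + (coresPerLlc - (ratio % coresPerLlc));
    }

    // Round-robin assignment: group i receives every cluster whose index j satisfies
    // j % nGroups == i, excluding the trailing clusters that do not divide evenly
    // (those would be merged into the CC Cluster, omitted here).
    static List<List<Integer>> assignClusters(int nClusters, int nGroups) {
        int evenLimit = nClusters - (nClusters % nGroups);
        List<List<Integer>> perGroup = new ArrayList<>();
        for (int i = 0; i < nGroups; i++) {
            perGroup.add(new ArrayList<>());
        }
        for (int j = 0; j < evenLimit; j++) {
            perGroup.get(j % nGroups).add(j);
        }
        return perGroup;
    }
}
```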

Figure 5 illustrates the cluster assignment for a machine with four groups of sibling cores sharing four different LLCs, along with the assignment of the resulting clusters to the worker groups. The task assignment within a cluster (illustrated in Fig. 6) is performed in a round-robin fashion among the workers that compose the group.

Fig. 6 SRRC task assignment: assignment of the tasks within a cluster among the workers that compose the target group

2.3 Worker-core affinity

The thread scheduling policies of modern operating systems are aware of the threads' cache footprints, keeping threads on the same core as much as possible. Nonetheless, in some cases we need to explicitly restrict the set of cores on which a thread may execute, namely to properly apply the SRRC strategy. This strategy assumes that the workers operating over a given task cluster execute on cores sharing an LLC. Therefore, it is important to map the affinity between workers and cores accordingly.

So as not to be too restrictive, we allow worker threads to be freely scheduled among the cores under the lowest shared cache level, a strategy we have aptly named Lowest Level Shared Cache affinity mapping. As an example, in a quad-core architecture with dedicated L1 caches, an L2 cache shared by every two cores, and a single L3 cache, the Lowest Level Shared Cache affinity mapping allows the operating system to freely schedule worker threads between every two cores that share an L2 cache.

2.4 Synchronization-free execution engine

To leverage the devised cache-conscious domain decomposition strategy, the underlying execution engine must provide efficient access from each worker thread to the tasks assigned to it by the scheduling policy in place. Our approach is to allow the workers to directly access the tasks generated by the decomposition stage, which are stored contiguously in a vector. This avoids the performance penalty of moving tasks to worker-local data structures. Accordingly, each thread iterates through the shared vector to sequentially fetch and execute each task scheduled to it.

All accesses to the shared task vector are synchronization free. This is possible because each worker receives a disjoint set of task clusters, and the associated index sets are locally computable. From its unique rank identifier, a worker is able to determine the index of the first task to execute and the relative positions of all the remainder. The computational complexity of this operation depends on the scheduling policy: the SRRC approach requires two loops to iterate over the whole set of tasks assigned to a worker and needs to deal with remainders within and across clusters, whereas the CC counterpart requires a single loop over a contiguous slice of the vector.
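A minimal sketch of a CC worker under these assumptions, with tasks modelled as Runnable instances stored in a shared array (the class layout is ours, not Elina's):

```java
final class Worker implements Runnable {
    private final int rank;
    private final int nWorkers;
    private final Runnable[] tasks; // shared, populated once by the decomposition stage

    Worker(int rank, int nWorkers, Runnable[] tasks) {
        this.rank = rank;
        this.nWorkers = nWorkers;
        this.tasks = tasks;
    }

    @Override
    public void run() {
        // CC scheduling: a single loop over a contiguous slice of the shared vector,
        // computed locally from the worker's rank; no locks are required because
        // the slices assigned to different workers are disjoint.
        int quotient = tasks.length / nWorkers;
        int remainder = tasks.length % nWorkers;
        int start = rank * quotient + Math.min(rank, remainder);
        int end = start + quotient + (rank < remainder ? 1 : 0);
        for (int i = start; i < end; i++) {
            tasks[i].run();
        }
    }
}
```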

3 Implementation details

We have prototyped our proposal in the Elina Java parallel computing framework [6, 9, 14] for distributed and shared-memory environments. Elina supports both embarrassingly parallel data parallel computations and MapReduce computations. Moreover, it is a very modular framework that cleanly separates most of the system-level functionalities into independent modules, whose concrete implementations are specified via a configuration file. As a result, equipping the framework with a new implementation of a given module, such as the decomposition or scheduling strategy, requires only implementing a pre-defined interface and altering a configuration file. This option allowed us, on the one hand, to have programming and execution models close to what may be found in the most used MapReduce-based frameworks for the processing of large data sets, such as Hadoop [1], and, on the other hand, to evaluate multiple system solutions by simply modifying the framework's configuration, without having to modify the framework's core or the application's source code.

Our implementation efforts were thus directed to the development of multiple adapter modules, namely for the domain decomposition strategy (cache-conscious and horizontal), the scheduling algorithm (Contiguous Clustering and Sibling Round-Robin Clustering), the Lowest Level Shared Cache affinity mapping, among others. Here, we will just briefly review how we uniformly represent memory hierarchies in the framework, and how we have implemented the thread-to-cache affinities.

3.1 Platform-independent representation of a memory hierarchy

There is no standard format for storing hardware-related information across operating systems, nor is there a single tool that provides such information for (at least) the best-known operating systems, such as Windows, Linux and Mac OS X. Accordingly, we had to develop our own platform-independent representation of a memory hierarchy. For that purpose, we use the JSON data representation format. A memory hierarchy representation is hence defined as a set of nested JSON objects comprising the following fields:

  • size—size of each individual memory element that composes the current memory level (in bytes), such as the RAM memory associated to a given CPU or the size of a particular cache memory.

  • cacheLineSize—size of the cache coherency line (in bytes). This field is present only if the memory level represents a cache level.

  • siblings—array of arrays of sibling cores sharing each copy of the memory level.

  • child—object containing the memory level information of the lower (child) level (is null if the current level is the bottom most in the hierarchy).

As a proof of concept we implemented a tool that automatically generates a platform-independent JSON representation of a node’s memory hierarchy from the information stored in the /sys/devices/system/cpu/ directory of a Linux installation. Listing 1 showcases the result obtained for a NUMA (Non-Uniform Memory Access) architecture comprising two quad-core CPUs and 8 GB of RAM.

Listing 1 JSON representation of the memory hierarchy of a NUMA node comprising two quad-core CPUs and 8 GB of RAM (not reproduced here)
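Since Listing 1 is not reproduced here, the snippet below illustrates what such a representation could look like for a node resembling System A of Sect. 4.1 (two quad-core CPUs with per-core L1 and L2 caches, a per-processor L3 cache, and 8 GB of RAM). Only the field names follow the format described above; the concrete sizes, the 64-byte cache line, and the core numbering are illustrative assumptions.

```json
{
  "size": 8589934592,
  "siblings": [[0, 1, 2, 3, 4, 5, 6, 7]],
  "child": {
    "size": 6291456,
    "cacheLineSize": 64,
    "siblings": [[0, 1, 2, 3], [4, 5, 6, 7]],
    "child": {
      "size": 524288,
      "cacheLineSize": 64,
      "siblings": [[0], [1], [2], [3], [4], [5], [6], [7]],
      "child": {
        "size": 65536,
        "cacheLineSize": 64,
        "siblings": [[0], [1], [2], [3], [4], [5], [6], [7]],
        "child": null
      }
    }
  }
}
```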

3.2 Thread-to-cache affinities in Java

The Java standard API does not provide means for establishing affinities between threads running in the Java virtual machine and the cores of the underlying processor(s). Ergo, to implement the Lowest Level Shared Cache affinity mapping, we had to resort to external commands, namely jstack, to obtain the correspondence between the Java virtual machine's and the operating system's thread identifiers, and taskset, to set the affinity of the threads according to the Lowest Level Shared Cache policy. Note that while the jstack tool is cross-platform, taskset is only available in (most) Linux distributions.
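As an illustration of this mechanism (not Elina's actual code), the sketch below obtains the JVM's pid, parses jstack's output to recover each worker thread's native id (the hexadecimal nid field), and invokes taskset on it. The worker-* thread naming and the chosen CPU list are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.lang.management.ManagementFactory;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AffinitySetter {
    // Matches jstack thread-dump lines such as:  "worker-3" ... nid=0x1a2b ...
    private static final Pattern THREAD_LINE =
            Pattern.compile("^\"(worker-\\d+)\".*nid=0x([0-9a-f]+)");

    public static void main(String[] args) throws IOException, InterruptedException {
        // Common HotSpot trick to obtain the JVM's pid (the runtime name is "pid@host").
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        Process jstack = new ProcessBuilder("jstack", pid).start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(jstack.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                Matcher m = THREAD_LINE.matcher(line);
                if (m.find()) {
                    long nativeTid = Long.parseLong(m.group(2), 16);
                    // Hypothetical policy: pin every worker to cores 0 and 1
                    // (e.g. an L2-sharing pair under the Lowest Level Shared Cache mapping).
                    new ProcessBuilder("taskset", "-p", "-c", "0,1",
                            Long.toString(nativeTid)).inheritIO().start().waitFor();
                }
            }
        }
        jstack.waitFor();
    }
}
```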

4 Evaluation

Our experimental evaluation aims to characterize which kind of applications may benefit from the devised cache-conscious decomposition approach, and to quantify such benefit. For that purpose, we carried out a comparative performance analysis against the pre-existing horizontal work distribution featured in Elina. Elina’s portable programming model and run-time system allowed us to use the same source code on both settings. We simply deployed the run-time system with different instances of several modules, namely of the decomposition strategy, the scheduling strategy, and the \(\varphi \) function.

4.1 Test infrastructure

All experiments were conducted on two different types of nodes, with the following characteristics:

  • System A—2 Quad-Core AMD Opteron™ Processor 2376 with three cache levels: a 64 KB L1 data cache per core, a unified 512 KB L2 cache per core, and a unified 6 MB L3 cache per processor.

  • System I—2 Dual-Core Intel(R) Xeon(R) CPU X30 hyperthreaded with three cache levels: a 32 KB L1 data cache per core, a unified 256 KB L2 cache per core, and a unified 8 MB L3 cache per processor.

All nodes run the Debian Linux distribution with Linux kernel version 2.6.26-2-amd64. The installed Java platform is OpenJDK 7 (version 1.7.0_21).

4.2 Benchmarks

To conduct our study we chose seven benchmarks. The first two are widely used operations over matrices, namely matrix multiplication (MatMult) and matrix transpose (MatTrans). The problem class indicator for both benchmarks represents the dimension of the matrices involved (only square matrices were considered).

Gaussian Blur (GaussianBlur) blurs an image (represented as a matrix) by convolving it with a Gaussian function. The benchmark requires two parameters: the image to blur and the radius of the blurring window for each pixel. The problem class indicator follows notation S-R, where S stands for the target image’s dimension and R for the blurring radius.

Crypt, SOR and Series are adapted from the JavaGrande benchmark suite. Crypt was adapted to cipher and decipher files instead of messages. The problem class indicator is bound to the size of the file to cipher/decipher. SOR computes the Successive Over Relaxation algorithm for a matrix of dimension \(N \times N\), whilst Series computes the first N Fourier coefficients of the function \(f(x) = (x+1)^x\) on the interval [0, 2]. On both these benchmarks, the problem class indicator denotes the value of parameter N.

Finally, WordCount is an implementation of the classic word count MapReduce example. As in Crypt, the problem class indicator is bound to the size of the file to process.

4.3 Implementing the Distribution interface

We argue that the effort that our approach demands from the programmer is relatively small, when compared with the potential performance gains—a claim that cannot be sustained by systems that promote the explicit programming of the memory hierarchy, such as Sequoia [7], due to the programming labour involved.

Listing 2 Implementation of the Distribution interface for the decomposition of two-dimensional integer arrays (not reproduced here)

To justify this claim, we present in Listing 2 a concrete implementation of the Distribution interface for the decomposition of two-dimensional integer arrays. The focus is on the modifications imposed on the pre-existing distribution algorithm for it to comply with the new interface. For that reason, we omit the implementation of method partition, given that it is independent of the decomposition strategy. Method getElementSize is also absent, but only for code reuse reasons: its implementation is inherited from the base class and simply returns the size (in bytes) of an element of the array, 4 in this case. The implementation of the remaining methods is fully depicted in the listing.

Note that the distribution forces the array to be partitioned into as many blocks per column as per row, which is why the validate method forces np to be a perfect square.
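Since the listing itself is not reproduced here, the sketch below illustrates what such an implementation could look like; the method names follow Table 1, while the class structure, field names and exact signatures are assumptions of ours rather than Elina's actual API.

```java
// Illustrative Distribution for a rows x cols two-dimensional integer array,
// decomposed into sqrt(np) x sqrt(np) blocks (hence np must be a perfect square).
class IntMatrixDistribution {
    private final int rows;
    private final int cols;

    IntMatrixDistribution(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
    }

    // Size of one element in bytes (in Listing 2 this is inherited from a base class).
    long getElementSize() {
        return 4;
    }

    // np is valid only if it is a perfect square and each block spans at least one row and column.
    boolean validate(int np) {
        int side = (int) Math.round(Math.sqrt(np));
        return side * side == np && side <= rows && side <= cols;
    }

    // Average number of elements in a block partition.
    double getAveragePartitionSize(int np) {
        return ((double) rows * cols) / np;
    }

    // Average length (in elements) of a block's first dimension, assuming row-major layout.
    double getAverageFirstDimSize(int np) {
        return cols / Math.sqrt(np);
    }
}
```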

4.4 Performance evaluation

4.4.1 Cache-conscious versus horizontal decomposition

Tables 3 and 4 depict the maximum performance gains relative to the canonical horizontal decomposition and to a sequential version of the benchmarks. A colour scale allows the best and worst case scenarios to be quickly identified: the darker the background colour, the better the result. The results correspond to the best (TCL, \(\varphi \) function) parametrization for each benchmark configuration. The sensitivity of the results to these parameters is discussed later in Sect. 4.4. The comparisons with the sequential version of the benchmarks are presented only to give a better perspective of the overall gains delivered by either approach.

Note that the implementation of the benchmarks and of the partition method (of the selected Distribution) is the same for the cache-conscious and horizontal configurations. The difference lies in the deployed configuration of the Elina framework, which, to achieve a horizontal decomposition, uses only the partition method (of the selected Distribution), while it requires all of the interface's methods (see Table 1 and Listing 2) to perform the cache-conscious decomposition. The sequential version is a simple single-threaded Java implementation.

Table 3 Best cache-conscious parametrization versus horizontal decomposition and versus the sequential version: benchmarks sensitive to temporal locality
Table 4 Best cache-conscious parametrization versus horizontal decomposition and versus the sequential version: benchmarks not sensitive to temporal locality

The benchmarks can be clearly classified into two groups: Table 3 contains the ones where cache-conscious decomposition brings considerable performance gains over horizontal decomposition, and Table 4 the ones where both decomposition strategies are on a par. These results are clearly bound to the cache locality properties of the benchmarks. As could be expected, only the ones featuring both spatial and temporal locality—MatMult, MatTrans, GaussianBlur and SOR—really benefit from the cache-conscious decomposition. The charts in Figs. 7 and 8 graphically substantiate this statement, presenting the speedup obtained by both decomposition approaches in the two systems. The results are better for System A than for System I because, in the former, each core runs a single hardware thread, while, in the latter, cores are hyperthreaded and thus run two hardware threads. Consequently, in System A, caches are shared by fewer threads and hence there are fewer conflict and capacity misses.

Fig. 7 Speedup of both decomposition approaches vs the sequential implementation in System A

Fig. 8 Speedup of both decomposition approaches vs the sequential implementation in System I

The performance gains for MatMult and MatTrans are considerable (up to 6–7 times), especially as the size of the problem increases. This is a major performance boost, given that it comes from a data locality optimization, not from extra parallelization. The results obtained for MatMult 1000 in System A deviate from the norm and thus deserve a more careful explanation. Given the size of the matrix, the number of workers, and the size of the L1 cache (64 KB), the horizontal decomposition produces partitions not much bigger than the L1 cache. Consequently, for this particular parameterization, the computation is already cache-friendly and there is not much room for cache-related optimizations. The same does not apply to System I because the cache is smaller (32 KB) and is shared by two hardware threads.

The gains for GaussianBlur and SOR are not as impressive, but still very good: up to approximately 3 times better than the horizontal decomposition. Note that these two benchmarks are stencil computations that, by operating over two-dimensional neighbourhoods, already exhibit data access patterns that leverage the cache to some extent. Naturally, as the size of the matrices and/or the size of the neighbourhood increases, this property fades and the gains of cache-conscious decomposition become more noticeable. Also note that in GaussianBlur the computation is somewhat unbalanced due to the processing of the image's borders. This can be further observed in the speedups against the sequential version, which are relatively lower when compared with the remaining benchmarks. This is true for both decomposition strategies.

Fig. 9 Speedups of the CC strategy in System A, varying the size of the TCL

The benchmarks in the second set do not benefit from temporal locality and are thus ideal for determining the overhead imposed by the cache-conscious decomposition, namely by the generation of a large number of tasks and their subsequent scheduling. In the tested benchmarks, no significant performance gains or losses can be found. Crypt and Series iterate over the data sequentially (benefiting from spatial locality), without revisiting previously accessed data. Hence, no benefits are attained from enforcing temporal locality by keeping the partitions' size within the TCL. Conversely, WordCount features temporal locality in the access to the map that stores the number of word occurrences. However, the random access pattern to such data precludes any attempt to predict, in advance, which data to place in the TCL. A cache-aware implementation of such a map is a challenge to overcome. In sum, the presented results show that the overhead of our approach is negligible, attesting that it may be used in a wide range of applications, without concerns about the type of locality.

4.4.2 Sensitivity to the chosen TCL

Next, we perform a sensitivity analysis of the results with respect to the chosen TCL and its size. To that end, we vary the size of the TCL from the L1 to the L3 cache sizes. Given that the results are similar across all systems and scheduling strategies, we limit our discussion to just one configuration, the CC scheduling strategy in System A (Fig. 9), and present the essence of the analysis in Table 5. The results are very insightful, as the optimal TCL value lies somewhere between the sizes of the L1 and the L2 caches and is benchmark and architecture dependent. We attribute this to the fact that: (a) the JVM itself has state that competes for space in the cache, and (b) we are not considering all the dimensions of the cache hardware, particularly the number of ways, and hence remain susceptible to conflict misses.

Table 5 Speedups of the CC and SRRC strategies varying the TCL size: summary table

4.4.3 Sensitivity to the scheduling strategies and to the \(\varphi \) function

Table 5 also shows that, in most cases, the impact of employing the CC or the SRRC scheduling strategy on the benchmarks' absolute performance is not substantial. The TCL for which a benchmark attains its performance peak does depend on the scheduling strategy employed, but the absolute performance of such peaks is very close—the essential gains result from the base cache-conscious decomposition strategy. GaussianBlur is the only benchmark to consistently benefit from one of the strategies, SRRC. The gains are not considerable, peaking at \(8~\%\). However, mostly in System A, these gains are directly proportional to the increase in the radius. This behaviour results from the fact that larger blur windows increase the amount of data that is shared by tasks operating over contiguous windows.

We also evaluated the benchmarks using the two \(\varphi \) functions presented in Sect. 2.1. The results showed that cache-line awareness did not improve performance for any of the assessed cache sizes. In fact, the results were worse, given the overhead introduced and the cache space wasted by the conservative approach.

4.4.4 Breakdown

To better evaluate the overhead imposed by the proposed cache-conscious approach (denoted CacheCons in Fig. 10), we broke down the execution of the MatMult benchmark, in System A, for \(N=2000\) and a TCL size of 128 KB (the best-performing configuration). Decomposition and Scheduling denote the time spent in the decomposition of the domain and the subsequent scheduling of the tasks, Execution denotes the time spent in the actual matrix multiplication, and Reduction represents the time spent in the reduction of the partial results (see Fig. 3).

As depicted in the figure, the weight of the stages other than Execution is one order of magnitude higher in the cache-conscious than in the horizontal approach. This impact is mostly visible in the Reduction stage, which in the cache-conscious cases reaches almost 5 % of the whole execution time, given that the number of results to reduce at node level is much higher. The remaining Decomposition and Scheduling stages are fully optimized for either clustering strategy, accounting for less than 2 % of the total execution time. In this particular example, the cache-conscious decomposition generates 8000 tasks (1000 per each of the 8 workers), which are created and scheduled in less than 0.1 s.

Despite the overhead imposed in the three aforementioned stages, the overall performance of the cache-conscious case is much better. This is due to the substantial gains obtained in the heavier Execution stage, which takes more than \(90~\%\) of the total execution time in all three cases.

Fig. 10 Breakdown of MatMult \(N=2000\) (System A), logarithmic scale

Fig. 11 Speedup of the MatMult benchmark (cluster environment)

4.5 Impact at cluster level

To illustrate the impact of our approach at cluster level, we present the speedup of two instances of the MatMult benchmark relative to the sequential execution, when varying the number of nodes: 1, 2, 4 and 8 (each featuring 8 hardware threads). Figure 11 depicts the curves for the horizontal and cache-conscious decompositions. For the 2000 problem class, both approaches scale up to 8 nodes (64 hardware threads). Naturally, the node-level gains of the cache-conscious approach transpire in the overall execution time. The performance peak of the cache-conscious decomposition delivers a 147-fold speedup over the sequential execution, and a 3.7-fold speedup over the horizontal approach. In the horizontal approach, as the number of nodes increases, the data partition assigned to each node, and subsequently to each worker thread, diminishes and approximates the size of the higher cache levels. This performance increase is most noticeable from the 2- to the 4-node configuration. As such, one can also leverage the impact of cache locality in horizontal decompositions. However, these gains are ephemeral and do not scale, given that they are bound to the assignment of small data partitions to nodes. Conversely, cache-conscious decomposition delivers higher gains for large data sets. In this sense, it is particularly suitable for cluster environments, where the scale of the problems to solve is, by definition, large.

5 Related work

5.1 Hierarchical data parallelism

Our work is close to the hierarchical data parallelism field. However, contrary to our proposal, hierarchical data parallelism models place on the programmer's shoulders the burden of decomposing the domain according to the traits of the memory hierarchy. Of these models, Sequoia [7] is the most prominent. It provides a programming language for the explicit programming of the memory hierarchy. The language's main program building block is the task, a side-effect-free function with call-by-value parameter passing semantics. Through tasks the programmer is able to express parallelism, explicit communication and locality, isolation, algorithmic variants, and parameterization. These properties allow programs written in Sequoia to be portable across machines without sacrificing the possibility of tuning the application for each one. Tasks run in isolation and hence may be executed concurrently without requiring synchronization between cooperating threads. Parameter passing during task launching is the only communication mechanism available, which increases the complexity of expressing cooperative computations. Hierarchical awareness is achieved by providing different implementations (variants) of a given task. Inner variants reflect intermediate nodes in the memory hierarchy and have the purpose of decomposing the input dataset; leaf variants perform the actual computation, operating directly on working sets residing within leaf levels of the memory hierarchy. Mapping a hierarchy of tasks onto the hierarchical representation of memory requires the creation of task instances for every machine level involved. The programmer is required to provide the compiler with the task mapping specification for the machine where the algorithm will be compiled and executed.

Parallelism in Sequoia is assumed to be regular. Additional constructs to support irregular parallelism are proposed in [2], namely call-up and spawn. call-up allows subtasks to access their parent's heap, which can be used to modify its data structures. spawn provides for the dynamic generation of parallelism, being able to launch an arbitrary number of subtasks of the provided task.

Hierarchically Tiled Arrays (HTA) [3] is a programming paradigm that relies on an object type named tiled array for expressing parallelism and locality. HTAs are arrays partitioned into tiles, which can in turn be either conventional arrays or further tiled arrays, enabling hierarchical decomposition. The components of an HTA can be accessed in a way analogous to conventional array indexing. Once more, the hierarchical decomposition is explicit in the program: the level function can be used to obtain, at run-time, the location of the argument within the tile hierarchy.

The Hierarchical Place Trees (HPT) [19] model combines concepts from the Sequoia and X10 languages. From the latter, it borrows the concepts of place and activity (task). It abstracts each memory module as a place, and therefore a memory hierarchy is abstracted as a place tree. Places are tagged with annotations that indicate their memory type and size. Moreover, a processor core is abstracted as a worker thread which, in the HPT model, can only be attached to leaf nodes in the place tree. In contrast to Sequoia, HPT supports three different types of communication: implicit access, explicit in–out parameters, and explicit asynchronous transfer.

Less prominent works that also address hierarchical decomposition are the studies by Wang et al. [18] and Kamil and Yelick [10]. These systems perform horizontal decompositions at both the cluster and the node level, first decomposing the domain at hand among the multiple nodes that compose the distributed environment, and then among the pool of workers available at each node.

There are several fundamental differences between our work and hierarchical data parallelism. The most evident is (as previously stated) that we propose a systematic approach to the memory hierarchy-aware decomposition of data sets, and not programming abstractions for the programmer to do so. We do ask for some help from the programmer in the implementation of the Distribution interface, but the information required is architecture agnostic. Moreover, none of these models can be applied to the class of computations that we are aiming for: computations where the decomposition stage is cleanly separated from the computation stage. Hierarchical data parallel models follow a divide-and-conquer approach that blends the two stages together. Finally, in rigour, we do not follow a hierarchical approach, as we do not iteratively partition a domain so that it first fits the highest memory level and, from then on, each of the inferior levels. We delegate such an enterprise to the scheduling strategy. For instance, the SRRC strategy tries to keep in each LLC the stream of partitions that will be fed to workers running on the cores bound to that cache level.

5.2 In-memory MapReduce

There has been quite some work on optimizing the MapReduce execution model for the intra-node reality. These efforts, commonly referred to as in-memory MapReduce, show some concern for data locality. For instance, Phoenix [14] adjusts the size of the input and output data of a map task so that these data can fit in the L1 cache, which reveals a concern about cache utilization. However, no attention is paid to the layout of the data in memory, nor to the overall organization of the memory hierarchy, namely the core-to-cache affinities and the degree of sharing of cache levels among cores. It is just an optional user-tuned parameter that conveys information about the preferred size of each partition.

Another system that presents a rationale close to ours is Tiled-MapReduce [4]. It employs a pipeline of map and reduce tasks that make use of the same memory spaces, reducing the idle time of the processing units and promoting locality. Furthermore, it also makes use of tiling strategies for the partitioning of the domain. However, this tiling process does not take the memory hierarchy into consideration, a key contribution of our work.

Other data-locality-related concerns have been addressed in in-memory MapReduce implementations. Phoenix++ [16] and Metis [11] have directed their focus to the efficient implementation of the data structures that harbour the intermediate results of the map stage. Additionally, Phoenix++ also tries to take advantage of locality by applying the combiner stage every time a new intermediate result is emitted. The motivation is to benefit, in an ad hoc manner, from the likely possibility of the result still residing in a lower-level cache.

In turn, HJ-Hadoop [20] extends Hadoop with features of the Habanero Java framework for multi-core parallelism. Data locality is explored by reducing the number of Java virtual machines created per node, allowing for a more efficient parallel execution of the map stage and for the buffering of the data fed to the user-defined functions.

6 Conclusions

In this paper, we defined and implemented a systematic cache-conscious strategy for decomposing the domain of an application according to the traits of the target machine's memory hierarchy. A performance evaluation demonstrates the advantage of the approach when targeting computations that feature temporal locality in the access to data. We have obtained up to a 7.7-fold speedup relative to the standard horizontal decomposition in this class of computations. In the remaining cases, no performance penalties were observed, foreseeing a wide applicability of the solution.

Another important conclusion drawn from this work is that the best clustering strategy and TCL size configuration is computation and architecture dependent. It is not possible to systematically use the same execution settings across applications and architectures, which compromises performance portability. To mitigate this problem, we are currently addressing the automatic inference of these configurations. The goal is to design and implement an auto-learning stage that, over time, progressively learns the best configurations to apply to each problem and its input sizes, applying these settings upon a request to execute the given problem.

Finally, we have also presented initial evidence that cache-conscious decomposition is of particular usefulness in cluster environments, as it improves the data locality of algorithms that manipulate large data sets. Future work will assess its applicability to the area of Big Data analytics.