1 Introduction

Numerous scientific and engineering compute-intensive applications spend most of their execution time in loop nests that are amenable to high-level optimizations. Typical examples include dense linear algebra codes (e.g. [4, 6, 57]) and stencil-based iterative methods (e.g. [19, 40, 50]). Those applications are typically executed on multi-core architectures, where the data access cost is hidden behind complex memory hierarchies. High-level loop transformations are critical to achieving high performance in such a context, both to correctly exploit the various levels of parallelism available (coarse-grained versus fine-grained) and to leverage the program’s data locality potential. However, the best loop optimization sequence is program-specific and depends on the features of the target hardware. Thus, tuning the high-level loop transformations is critical to reach the best possible performance, as illustrated by Pouchet et al. [44–46].

Although significant advances have been made in developing robust and expressive compiler optimization frameworks, identifying the best high-level loop transformations for a given program and architecture remains an open problem. Manually constructed heuristics are used to identify good transformations, but they rely on overly simplified models of the machine. These simple static models are unable to characterize the complex interplay between all the hardware resources (e.g., caches, TLBs, instruction pipelines, hardware prefetch units, SIMD units, etc.). Moreover, optimization strategies often have conflicting objectives: for instance, maximizing thread-level parallelism may hamper SIMD-level parallelism and can degrade data locality.

In the quest for performance portability, the compiler community has explored research based on iterative compilation and machine learning to tune the compiler optimization flags or optimization passes to find the best set of optimizations for a given combination of benchmarks and target architectures. Although significant performance improvements have been demonstrated [1, 27, 37, 39], the performance obtained has generally been limited by the optimizations selected for automatic tuning and by the degrees of freedom available for exploration.

The polyhedral optimization framework has been demonstrated as a powerful alternative to traditional compilation frameworks. Polyhedral frameworks can optimize a restricted, but important, set of loop nests that contain only affine array accesses. For loops that are amenable to polyhedral compilation, these frameworks can model an arbitrarily complex sequence of loop transformations in a single optimization step within a powerful and unified mathematical framework [22, 23, 31, 32, 35, 59]. The downside of this expressiveness is the difficulty in selecting an effective set of affine transformation coefficients that result in the best combination of tiling, coarse- and fine-grain parallelization, fusion, distribution, interchange, skewing, permutation and shifting [13, 28, 46, 53].

Past and current work in polyhedral compilation has contributed algorithms and tools to expose model-driven approaches for various high-level transformations, including (1) loop fusion and distribution to partition the program into independent loop nests, (2) loop tiling to partition (a sequence of) loop nests into blocks of computations, (3) thread-level parallelism extraction, and (4) SIMD-level parallelism extraction. There has been some recent, limited success at developing analytical cost models to select good complex optimization sequences in the polyhedral model. For example, Bondhugula et al. [8, 9] proposed the first integrated heuristic for parallelization, fusion, and tiling in the polyhedral model, subsuming all the above optimizations into a single, tunable cost model. Individual objectives such as the degree of fusion or the application of tiling can implicitly be tuned by minor ad-hoc modifications of Bondhugula’s cost model. Nevertheless, it has been shown that these simple static models are ineffective at systematically selecting the most effective transformation on a range of numerical applications [46]. Previous work on iterative compilation based on the polyhedral framework showed that there are opportunities for large performance improvements over native compilers [3, 44–46, 53], significantly outperforming compilation flag tuning, optimization pass selection, or optimization phase-ordering. However, directly tuning the polyhedral transformation in its original abstract representation remains a highly complex problem, because the search space is usually infinite. Despite progress in understanding the structure of this space and how to bound its size [47], this problem remains largely intractable in its original form.

We now summarize the contributions of the current article. We address the problem of effectively balancing the trade-off between data locality and various levels of parallelism in a large set of high-level optimizations to achieve the best performance. As a direct benefit of our problem formalization, we integrate the power of iterative compilation schemes with the expressiveness and efficiency of high-level polyhedral transformations. Our technique relies on a training phase where numerous possibilities to drive the high-level optimizer are tested using a source-to-source polyhedral compiler on top of a standard production compiler. We show how the problem of selecting the best optimization criteria can be effectively learned using feedback from the dynamic behavior of various possible high-level transformations. By correlating hardware performance counters to the success of a polyhedral optimization sequence, we are able to build a model that predicts very effective polyhedral optimization sequences for an unseen program. Our results show it is possible to achieve solid performance improvements by using the high-level transformation that was predicted best by our model, improving performance on average from \(2\times \) up to \(7\times \) over the native compiler. To the best of our knowledge, this is the first effort that demonstrates very effective discovery of complex high-level loop transformations within the polyhedral model using machine learning models. A performance close to the search-space optimum can be attained by evaluating no more than 6 optimized program versions, using an iterative compilation approach. We explore different learning algorithms for building our models and report their ability to predict good polyhedral transformations. We observe that while no single algorithm is systematically the best for all benchmarks, by combining models we can reach \(8.7\times \) average performance improvement over the native compiler, by testing no more than 6 program versions. Finally, we study feature reduction on our set of performance counters and show that only a handful of counters is required to characterize programs for this problem.

In Sect. 2, we first present details on the optimization space we consider before analyzing the performance distribution of the considered search space in Sect. 3. We present our machine learning approach to select good optimizations in Sect. 4. Experimental results are presented in Sect. 5. We discuss related work in Sect. 6.

2 Optimization Space

We now present the optimization search space we consider in this work. Any candidate optimization in this space can be automatically computed and applied on the input program, thus producing a transformed variant to be executed on the target machine. Deciding how to select an optimization in this space is deferred to Sect. 4.

2.1 Overview of the Approach

High-level loop transformations are crucial to effectively map a piece of code onto a particular processor. Effective mapping typically requires the partitioning of the computation into disjoint parts to be executed on different cores, and the transformation of those partitions into streams to be executed on each SIMD unit. In addition, the way data is accessed by the code may need to be reorganized to better exploit the cache memory hierarchy and reduce communication costs. Addressing these challenges for compute-intensive programs has been demonstrated to be a strength of the polyhedral optimization framework. Several previous studies have shown how tiling, parallelization, vectorization, or data locality enhancement can be efficiently addressed in an affine transformation framework [9, 23, 31, 35, 49, 54].

High-level optimization primitives, such as tiling or parallelization, often require a complex sequence of enabling loop transformations to be applied while preserving the semantics. As an example, tiling a loop nest may require skewing, fusion, peeling, and shifting of loop iterations before it can be applied. A limitation of previous approaches, whether polyhedral-based [28, 36] or syntactic-based [12], was the challenge of computing the enabling sequence required to apply the main optimization primitives. This limited the applicability of most previous work: the enabling transformations were not considered separately, so that transformations such as tiling and coarse-grained parallelization could not be applied in the most effective fashion.

We address this issue by decoupling the problem of selecting a polyhedral optimization into two steps: (1) select a sequence of high-level optimizations, from the set shown in Table 1, this selection being based on machine learning and feedback from hardware performance counters; and (2) for the selected high-level optimizations, use static cost models to compute the appropriate enabling transformations. We thus keep the expressiveness and applicability of the polyhedral model, while focusing the selection decision only on the main transformations.

Table 1 shows the various high-level optimizations we consider and their parameter ranges; each is described in the following sections. For each of these high-level optimizations, we rely on a polyhedral-based transformation framework to compute any required enabling loop transformations. If such a sequence of enabling loop transformations exists, it will be found by the static model.

Table 1 High-level primitives considered in this work

We remark that high-level transformations alone are far from sufficient to achieve optimal performance. Numerous low-level optimizations are also required, and chip makers such as Intel have developed extremely effective closed-source compilers for their processors.

We consider such compilers as black boxes, because of the difficulty in precisely determining which optimizations are implemented and when. Technically, the loop optimization stages of those compilers may interact with our source-level transformations in a detrimental way, either by altering the loop structure we generated or by becoming unable to apply a profitable loop optimization on the code we have generated. A typical example is SIMD vectorization, which may or may not be successfully applied by the back-end compiler on the transformed program we generated, even if the resulting program is indeed vectorizable. Consequently, our optimization scheme may result in sub-optimal performance if, for instance, the compiler is unable to apply SIMD vectorization on our transformed program while it was able to apply it on the original program or some other variants. A precise and fine-grain tracking of the back-end compiler optimizations applied would be required to avoid this potential issue, but we have not addressed this problem in the present work. We also highlight in Sect. 3 that the optimal high-level transformation we generate must indeed be compiler-specific. Our approach considers the back-end compiler as part of the target machine, and we focus exclusively on driving the optimization process via high-level source-to-source polyhedral transformations.

2.2 Polyhedral Model

Sequences of (possibly imperfectly nested) loops amenable to polyhedral optimization are called static control parts (SCoPs) [23, 28], roughly defined as a set of (possibly imperfectly nested) consecutive statements such that all loop bounds and conditionals are affine functions of the surrounding loop iterators and global variables (constants that are unknown at compile time but invariant in the loop nest). Relaxation of these constraints to arbitrary side-effect-free programs has recently been proposed [7], and our optimization scheme is fully compatible with this extended polyhedral model.
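
As an illustration, the following loop nest is a SCoP (a hypothetical fragment written for illustration, not taken from a specific benchmark): all loop bounds and the conditional are affine expressions of the iterators i, j and of the global parameters N and M.

    for (int i = 0; i < N; i++)
      for (int j = i; j < M; j++)
        if (i + j <= N - 1)              /* affine conditional */
          A[i][j] = A[i][j] + B[j][i];   /* affine array accesses */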

Polyhedral program optimization involves the analysis of the input program to extract its polyhedral representation, including dependence information and array access patterns. These are defined at the granularity of the statement instance, that is, an executed occurrence of a syntactic statement.

A program transformation is represented by an affine multidimensional schedule. This schedule specifies the order in which each statement instance is executed. A schedule captures in a single step what may typically correspond to a sequence of loop transformations [28]. Arbitrary compositions of affine loop transformations (e.g., skewing, interchange, multi-level distribution, fusion, peeling and shifting) are embedded in a single affine schedule for the program. Every static control program has a multidimensional affine schedule [23], and tiling can be applied by extending the iteration domain of the statements with additional tile loop dimensions, in conjunction with suitable modifications of the schedule [28].
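
For instance, for a statement \(S\) enclosed in two loops with iterators \(i\) and \(j\), the following are minimal illustrative schedules (not drawn from the article): \(\Theta_S(i,j) = (j,\, i)\) encodes a loop interchange, while \(\Theta_S(i,j) = (i,\, i+j)\) encodes a skewing of the inner loop by the outer one. Statement instances are executed in the lexicographic order of their schedule values, and compositions of such affine forms, one per statement, capture arbitrary sequences of these transformations.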

Finally, syntactic code is regenerated from the polyhedral representation on which the optimization has been applied. We use the state-of-the-art code generator CLooG [5] to perform this task.

We used the open-source polyhedral compiler PoCC for this paper, and we have extended it for the purposes of enabling more effective model-based program transformations.

2.3 Loop Tiling

Tiling is a crucial loop transformation for parallelism and locality. It partitions the computation into rectangular blocks that can be executed atomically. When tiling is applied on a program, we rely on the Tiling Hyperplane method [9] to compute a sequence of enabling loop transformations to make tiling legal on the generated loop nests.

Two important performance factors must be considered for the profitability of tiling. First, tiling may be detrimental, as it may introduce a more complex loop structure whose overhead is not compensated by the locality improvement. This is particularly the case for computations where data locality is not the performance bottleneck. Second, the size of the tiles can have a dramatic impact on the performance of the generated code. To obtain good performance with tiling, the data footprint of a tile should typically reside in the L1 cache. The problem of selecting the optimal tile size is known to be difficult, and empirical search is often used for high-performance codes [55, 58, 60]. To limit the search space while preserving significant expressiveness, we allow the specification of a limited number of tile sizes to be considered for each tiled loop. In our experiments, we use only two possible sizes for a tile dimension: either \(1\) (i.e., no tiling along this loop level) or 32. The total number of possibilities is a function of the depth of the loop nest to be tiled: for instance, for a doubly-nested loop we test rectangular tiles of size \(1 \times 1\) (no tiling), \(1 \times 32,\,32 \times 1\) and \(32 \times 32\).
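
As a simplified sketch (the code actually emitted by the polyhedral code generator also handles partial tiles and more general loop bounds), a \(32 \times 32\) tiling of a doubly-nested loop over an \(N \times N\) iteration space, with a hypothetical statement body, has the following shape:

    for (int ti = 0; ti < N; ti += 32)                  /* tile loops */
      for (int tj = 0; tj < N; tj += 32)
        for (int i = ti; i < ti + 32 && i < N; i++)     /* intra-tile loops */
          for (int j = tj; j < tj + 32 && j < N; j++)
            C[i][j] += A[i][j] * B[j][i];               /* hypothetical statement */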

2.4 Loop Fusion/Distribution

In the framework used in the present paper, there is an equivalence between (i) maximally fusing statements, (ii) maximizing the number of tilable loop levels, (iii) maximizing locality and (iv) minimizing communications. In his seminal formulation, Bondhugula proposed to find a transformation that maximizes the number of fused statements on the whole program using an Integer Linear Program (ILP) encoding of the problem [8]. However, maximally fusing statements may prevent parallelization and vectorization, and the trade-off between improving locality and reducing parallelization possibilities is not captured. Second, fusion may interfere with hardware prefetching. Also, after fusion, too many data spaces may contend for use of the same cache, reducing the effective cache capacity for each statement. Conflict misses are also likely to increase. Obviously, systematically distributing all loops is generally not a better solution, as it may be detrimental to locality.

The best approach clearly depends on the target architecture, and the performance variability of an optimizing transformation across different architectures creates a burden in devising portable optimization schemes. Pouchet et al. showed that iterative search among the possible fusion structures can provide significant performance improvement [46, 47]. However, to control the size of the search space, we rely on three specific fusion/distribution schemes that have proved effective for a wide variety of programs. The three high-level fusion schemes we consider in this paper are: (1) no-fuse, where we do not fuse at all; (2) smart-fuse, where we only fuse together statements that carry data reuse and have similar loop nesting depths; and (3) max-fuse, where we maximally fuse statements. These three cases are easily implemented in the polyhedral framework, simply by restricting the cost function of the Tiling Hyperplane method to operate only on a given (possibly empty) set of statements.
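
The following hypothetical example, with two statements reusing the array A, illustrates the two extreme schemes; smart-fuse would also fuse these two statements, since they carry reuse of A and have the same nesting depth.

    /* no-fuse: each statement keeps its own loop nest */
    for (int i = 0; i < N; i++)
      B[i] = A[i] + 1.0;
    for (int i = 0; i < N; i++)
      C[i] = A[i] * 2.0;

    /* max-fuse: a single loop exploits the reuse of A[i] */
    for (int i = 0; i < N; i++) {
      B[i] = A[i] + 1.0;
      C[i] = A[i] * 2.0;
    }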

Interaction with tiling. The scope of application of tiling directly depends on the fusion scheme applied on the program. Only statements under a common outer loop may be grouped in a single tile. Maximal fusion results in tiles performing more computations, while smart fusion may result in more tiles to be executed, but with fewer operations in them. The cache pressure is thus directly driven by the fusion and tiling scheme.

2.5 Wavefronting

When a loop nest is tiled, it is always possible to execute the tiles either in parallel or in a pipeline-parallel fashion. Wavefronting is the transformation creating a valid, pipeline-parallel schedule for the tiled execution. It is useful only to expose coarse-grain parallelism between tiles, when the outer-most tile loop is not already sync-free parallel. When wavefronting is turned on, the tile loops are modified to expose parallelism at the expense of increasing the distance between reused elements. So, there is a trade-off to the application of this transformation, and as such is part of our high-level optimization choices. Additionally, we remark that wavefronting will not be useful for a program where thread-level parallelism is not also useful.
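
As a rough sketch (assuming a 2D grid of NTI x NTJ tiles whose inter-tile dependences only point towards increasing ti and tj, and a hypothetical compute_tile body), a wavefront execution of the tile loops has the following shape:

    #define MAX(a, b) ((a) > (b) ? (a) : (b))
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* tiles on the same anti-diagonal w = ti + tj are independent and can run
       in parallel; successive wavefronts execute in a pipelined order */
    for (int w = 0; w < NTI + NTJ - 1; w++) {
    #pragma omp parallel for
      for (int ti = MAX(0, w - NTJ + 1); ti <= MIN(w, NTI - 1); ti++) {
        int tj = w - ti;
        compute_tile(ti, tj);
      }
    }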

2.5.1 Pre-Vectorization

The tiling hyperplane method finds a loop transformation that pushes dependences to the inner loop levels, naturally exposing coarse-grain parallelism. Nevertheless, for effective SIMD execution it is desirable to expose at least one level of inner parallelism. The pre-vectorization stage modifies the affine schedule of the program (that is, the transformation to be applied) so that the inner-most parallel loop dimension is sunk into the inner-most loop dimension.

This approach can guarantee that the inner-most loop(s) of a program are sync-free parallel, when the dependences permit. It has the advantage of enforcing the parallelism of all loops at a given depth in the generated loop nest, for the entire program. However, this simple model shows significant limitations: no information is taken into account regarding the contiguity of data accesses, a critical concern for effective SIMD execution. Also, it works as a pre-pass before the code is transformed, i.e., it does not take into account the changes in the loop structure that will be generated by CLooG (our polyhedral code generator). Nevertheless, this model has shown potential to increase the success of the new SIMD-level parallelization pass that we have implemented in PoCC (detailed next).

2.5.2 SIMD-Level Parallelization

Our approach to vectorization extends recent analytical modeling results by Trifunovic et al. [54]. We take advantage of the polyhedral representation to restructure imperfectly nested programs, allowing us to expose vectorization opportunities in inner loops. The most important part of the transformation to enable vectorization comes from the selection of which parallel loop is moved to the innermost position. The cost model selects a synchronization-free loop that minimizes the memory stride of the data accessed by two contiguous iterations of the loop [54]. This is a more SIMD-friendly approach than simple pre-vectorization. We note however that this interchange may not always lead to the optimal vectorization, because of the limitations of the model, or may simply be useless on a machine that does not support SIMD instructions. In addition, we have implemented this vectorization transformation pass on the generated code, after applying the tiling hyperplane method. Our algorithm operates on individual loop nests generated after the separation phase of CLooG and does not require all loops at a given depth in the program to be vectorized in the same fashion. To achieve this, we benefit from the fact that programs generated by CLooG are also polyhedral programs, that is, we can re-analyze the transformed code and extract a polyhedral representation from it. When SIMD vectorization is turned on, we mark the vectorizable loops with ivdep and vector always pragmas to facilitate the back-end compiler's auto-vectorization.
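
For instance, a loop selected by this pass would be emitted with annotations of the following shape (a sketch of the generated code, with a hypothetical statement; the j loop carries stride-1 accesses and no dependence):

    for (int i = 0; i < N; i++) {
    #pragma ivdep
    #pragma vector always
      for (int j = 0; j < N; j++)       /* inner-most, sync-free parallel, stride-1 */
        A[i][j] = A[i][j] + B[i][j];
    }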

2.5.3 Thread-Level Parallelization

Thread-level parallelism is not always beneficial, e.g., with small kernels that execute few iterations or when it prevents vectorization. It is thus critical to be able to disable thread-level parallelism in some instances. We have designed a specific optimization pass in PoCC that analyzes the generated code, in a similar fashion to SIMD-level parallelization. It finds the outer-most parallel loop in each loop nest in the generated code using automated scalar privatization techniques. When this optimization is turned on, we use OpenMP to generate parallel code and insert a #pragma omp parallel for above the outer-most parallel loop that was found.
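
A minimal sketch of the output of this pass, for a hypothetical loop nest whose outer loop is found to be parallel while the inner loop carries a dependence:

    /* outer-most parallel loop annotated with an OpenMP work-sharing directive */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      for (int j = 1; j < N; j++)
        A[i][j] = A[i][j] + A[i][j-1];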

2.5.4 Register Tiling

Loop unrolling is known to help expose instruction-level parallelism. Tuning the unrolling factor can influence register pressure in a manner that is compiler and machine-dependent. Register tiling, or unroll-and-jam, combines the power of unrolling multiple permutable loops in the inner loop body. In our framework, register tiling is performed as a post-pass considering only the inner-most two loops in a loop nest, if they are permutable. We expose four different sizes for the unroll factor, which is the same for both loops to be unroll-and-jammed: 1 (no unrolling), 2, 4 and 8.
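
As an illustration (a hypothetical statement, assuming for simplicity that N is a multiple of the unroll factor), unroll-and-jam with factor 2 applied to the two inner-most permutable loops produces:

    for (int i = 0; i < N; i += 2)
      for (int j = 0; j < N; j += 2) {
        /* the four unrolled instances are jammed into a single inner body */
        C[i][j]     += A[i][j]     * B[j][i];
        C[i][j+1]   += A[i][j+1]   * B[j+1][i];
        C[i+1][j]   += A[i+1][j]   * B[j][i+1];
        C[i+1][j+1] += A[i+1][j+1] * B[j+1][i+1];
      }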

2.6 Putting it All Together

A sequence of high-level optimizations is encoded as a fixed-length vector of bits, referred to as \(T\). To each distinct value of \(T\) corresponds a distinct combination of the above primitives. Technically, on/off primitives (e.g., SIMD-level parallelization and thread-level parallelization) are encoded using a single bit.

Non-binary optimizations such as the unroll factor or the tile sizes are encoded using a “thermometer” scale. As an illustration, to model unroll-and-jam factors we use three binary variables \((x,y,z)\). The triple \((0,0,0)\) denotes no unroll-and-jam, an unroll factor of 2 is denoted by \((0,0,1)\), an unroll factor of 4 is denoted by \((0,1,1)\), and an unroll factor of 8 by \((1,1,1)\). Different tile sizes are encoded in a similar fashion. In our experiments, we only model the tile size on the first three dimensions (leading to 9 possibilities), and use a constant size for \(T\). Thus, for programs where the tiles have a lower dimensionality, some bits in \(T\) have no impact on the transformation.
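
The sketch below (a hypothetical helper, not part of PoCC) makes the decoding of this thermometer scale explicit for the unroll-and-jam factor:

    /* Decode the thermometer-encoded unroll-and-jam factor from the three bits
       (x, y, z) of T: (0,0,0) -> 1, (0,0,1) -> 2, (0,1,1) -> 4, (1,1,1) -> 8. */
    static int decode_unroll_factor(int x, int y, int z) {
      int level = x + y + z;   /* number of bits set on the scale */
      return 1 << level;       /* 1, 2, 4 or 8 */
    }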

To generate the polyhedral transformation corresponding to a specific value of \(T\), we proceed as follows.

1. Partition the set of statements according to the fusion choice (one of no-fuse, smart-fuse or max-fuse);

2. Apply the Tiling Hyperplane method [9] locally on each partition to obtain a schedule for the program that (a) implements the fusion choice, (b) maximizes the number of parallel loops, and (c) maximizes the number of tilable dimensions [8] on each individual partition;

3. Modify the schedule according to the pre-vectorization model, if pre-vector is set, to expose inner parallel loops;

4. Modify the schedule to generate a wavefront parallel schedule, if wavefronting is set;

5. Tile all tilable loop nests, if any, if tile is set; the tile sizes to be used are encoded in \(T\);

6. Extract the polyhedral representation of the generated code, for subsequent analysis;

7. Apply the per-loop-nest SIMD-level parallelization pass, if set;

8. Apply the per-loop-nest thread-level parallelization pass, if set;

9. Perform register tiling, if register-tiling is set; the unroll factors to be used are encoded in \(T\).

Candidate Search Space. The final search space we consider depends on the program. For instance, not all programs exhibit coarse-grain parallelism or are tilable. For cases where an optimization has no effect on the final program because of semantic considerations, multiple values of \(T\) lead to the same candidate code version. Nevertheless, the applicability of those optimizations directly derives from the expressiveness of the polyhedral model, which significantly improves over other existing frameworks.

Considering only values of \(T\) that lead to distinct transformed programs, the search space ranges from 91 to 432 program versions in our experiments, out of the 1152 possible combinations that can be encoded in \(T\).

3 Analysis of the Performance Distribution

We now present an extensive quantitative analysis of the performance distribution of the search space of polyhedral transformations that we have built. Section 3.1 presents the machines and benchmarks we experimented with. We then extensively discuss the performance distribution for numerous benchmarks, machines and compilers, providing experimental evidence of the specificity of the optimal transformation to each of these three aspects.

3.1 Experimental Setup

We provide experimental results on two multi-core systems: Nehalem, a 2-socket 4-core Intel Xeon 5620 (16 H/W threads) with 16 GB of memory, and Q9650, a single socket 4-core Intel Quad Q9650 with 8 GB of memory. The back-end compilers used for the baseline and all candidate polyhedral optimizations are Intel ICC with option -fast and GCC with option -O3. ICC version 11.1 and GCC version 4.5 were used for Nehalem, and ICC version 10.1 and GCC version 4.5 were used for Q9650. Thus, we present results for four different machine-compiler configurations: (1) Nehalem-ICC11.1 (2) Nehalem-GCC4.5 (3) Q9650-ICC10.1 and (4) Q9650-GCC4.5.

Our benchmark suite is PolyBench v2.1 [30] composed of 30 different kernels and applications containing static control parts. The list of programs in PolyBench is shown in Table 2. We used the reference datasets [30].

Table 2 The 30 programs in PolyBench V2.1 were used for our training and testing of each prediction model

3.2 Overview of the Performance Distribution

We first propose to quantify the performance distribution, from the perspective of determining the fraction of the search space that achieves better performance than the original code, and the fraction that achieves nearly optimal performance in the considered space. Figure 1 plots, for all considered architectures and compilers, the relative speedup (compared to the original program) achieved by each variant, for the covariance benchmark. Figure 2 shows a similar plot for the fdtd-2d benchmark.

Fig. 1 Performance distribution for covariance

Fig. 2 Performance distribution for fdtd-2d

First, we observe that the fraction of the search space which improves performance is dependent upon the benchmark; covariance and fdtd-2d are two representative benchmarks. For covariance, the majority of the search space improves performance, for both architectures and compilers. This means that a simple random approach is likely to succeed in improving performance for this benchmark. For fdtd-2d and the Nehalem architecture, the opposite pattern is observed: only a marginal fraction of the space improves performance. This latter pattern is representative of the majority of benchmarks, for which a significant fraction of the space usually degrades performance. Indeed, this result confirms that the search space we consider is very expressive and that many of the high-level optimizations shown in Table 1 can actually decrease performance if not parameterized properly.

More importantly, we observe for all cases that the fraction of the space which achieves nearly optimal performance is extremely small, usually below 1 % of the space. This pattern was observed for all benchmarks. It severely limits the ability of any naive statistical model to discover the best performance in this space, thus motivating our use of more powerful machine learning algorithms instead.

3.3 Variability Across Compilers

We now provide observations about the relative performance of similar transformations when used for the same machine and benchmark, but with two different compilers. Figures 3 and 4 plot, for the two machines, the performance distribution for gemm and seidel using GCC and ICC, sorted by increasing performance for GCC.

Fig. 3 Performance distribution for gemm and seidel for the E-5620 machine

Fig. 4 Performance distribution for gemm and seidel for the Q-9650 machine

The most prominent feature of these plots is that the best performing optimized variants for GCC (on the far right of the plots) are not the best performing variants for ICC, and conversely. That is, the sequences of high-level optimizations achieving the best performance differ for each compiler. This is particularly visible in Fig. 3 for gemm and in Fig. 4 for seidel. This pattern is not systematic for all benchmarks, but is dominant in our test suite; it can also be observed, for instance, in Figs. 1 and 2. One of the main reasons for these differences comes from the very fragile optimization heuristics implemented in each compiler. Using our tool-chain to generate potentially effective transformations (e.g., tiling or register tiling) can break the application of high-performance scalar optimizations by the native compiler. This is because the transformed code to be compiled becomes syntactically more complex, and fragile optimization cost models in the native compiler can be challenged by its structure. Another reason comes naturally from the differences in the optimizations implemented in each compiler and in when they are applied. For instance, we have manually observed instances where the cost model of GCC applies register tiling more aggressively than ICC, thus making the application of register tiling in our framework useless, if not detrimental.

3.4 Variability Across Machines

We now present experimental evidence of the sensitivity of transformation sequences to the target machine. In Fig. 5 we plot, for the same benchmark and the same compiler, the performance distribution for both machines sorted by ascending performance for Nehalem.

Fig. 5 Performance distribution for adi and syr2k using GCC 4.5

It is extremely interesting to observe that for both a stencil (adi) and a simple linear algebra kernel (syr2k), the best performing variant for one processor has significantly lower performance on the other. We remark that despite being two Intel x86-64 processors, they significantly differ in design. The Q9650 is a Yorkfield dual-die quad-core design based on the Core 2 microarchitecture, with 12 MB of L2 cache. The E5620 is a Westmere processor using the Nehalem microarchitecture, with hyper-threading and 12 MB of L3 cache. This machine has two processors, and thus has a total of 16 H/W threads instead of 4 for the Q9650. Most notably, Intel made significant changes in the E5620, adding a second-level branch predictor and translation lookaside buffer, and using the Intel QuickPath Interconnect for data transfers in place of the slower Front Side Bus.

As a consequence, the relative impact of transformations such as vectorization, parallelization, or data locality enhancement is significantly changed. For instance, when there is a trade-off between data locality and parallelization, the better and faster data cache hierarchy of the Nehalem diminishes the need for our framework to focus on data locality improvement (e.g., tiling), while the larger number of available threads makes effective parallelization critical on the Nehalem.

3.5 Sensitivity to Different Datasets

The best set of optimizations often depends on the program data, in particular when the input data can significantly change the instructions executed by the program. For instance, the sparsity of a matrix affects the number of floating point operations in a sparse-matrix computation. Previous work investigated the use of standardized datasets for arbitrary programs [14, 26], and their impact on compiler optimization.

The present work focuses exclusively on static control programs, where the control flow (and therefore the program instructions to be executed) can be fully determined at compile-time. That is, SCoPs are not sensitive to the dataset content: for a given input dataset size, the best optimization will be the same for any value of the elements of this dataset. This is a strong benefit of focusing exclusively on SCoPs, as we are not required to fine-tune the optimization sequence for different sets of inputs of identical size.

3.5.1 Dataset Size

Despite a lack of sensitivity to the input data content, the total number of instructions the program will execute still heavily depends on the dataset size, that is the number of input data elements. We distinguish two features immediately depending on the dataset size: (1) the ability to keep (most of) the data in cache, for smaller datasets; and (2) the profitability of parallelization, when the cost of spawning threads is to be compared with the total number of instructions to be executed.

We illustrate these two features with a matrix-multiply example. For the first case, we multiply matrices of size \(8 \times 8\). The total dataset size for the three matrices (double precision) is 1.5 kB, and the total number of floating point operations is 1024. The data fits fully in the L1 data cache, and 512 SSE3 SIMD operations are required to execute the vector operations corresponding to the complete computation. In this case, program transformations such as tiling (for locality) and thread-level parallelization will clearly not improve the performance, because they address performance issues that do not arise for this program. Consider now the same program, but operating on matrices of size \(1024 \times 1024\). The dataset size is now 25 MB, and does not fit in the on-chip data cache. The overhead of spawning threads is negligible in comparison to the total number of operations to be executed. Therefore, optimizations such as tiling and thread-level parallelization become highly profitable.
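
For reference, the figures above follow directly from the kernel dimensions (assuming 8-byte double-precision elements and two-wide SSE3 operations): \(3 \times 8 \times 8 \times 8\,\mathrm{B} = 1536\,\mathrm{B} \approx 1.5\,\mathrm{kB}\) of data and \(2 \times 8^3 = 1024\) floating point operations, i.e., \(1024 / 2 = 512\) SIMD operations; for \(1024 \times 1024\) matrices, \(3 \times 1024^2 \times 8\,\mathrm{B} \approx 25\,\mathrm{MB}\) of data.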

The results presented in the current section are limited to a single dataset size, the reference one for PolyBench 2.1 [30]. Depending on the benchmark, the dataset size can be L2-resident, L3-resident, or, most of the time, larger than L3. That is, the benchmark set we have used spans most of the typical dataset size cases that can arise. The workload also differs vastly from benchmark to benchmark, as shown in Sect. 3.6, where for several benchmarks thread-level parallelization is not the optimal transformation.

We also remark that the machine learning techniques described in Sect. 4 use a hardware counter characterization of the input program to select the best transformation. Variations in the dataset size, and their implications in terms of data footprint and total workload to be executed, are fully captured by hardware counters. Our approach to determining the best transformation is therefore robust to different dataset sizes, provided we use a training set that correctly spans the various key problem sizes associated with the profitability of each optimization. We remark that in the present work, we have limited our study to only one problem size per benchmark, focusing exclusively on out-of-L1 dataset sizes. These dataset sizes are representative of programs that may benefit from aggressive loop transformations, the target of the present article. Complementary techniques such as versioning can be used to find the optimal transformation for a few typical dataset sizes such as L1-resident, L2-resident, L3-resident and larger than L3. Such a technique has been used successfully in ATLAS, for instance [57], and is compatible with our approach.

3.6 Analysis of the Optimal Transformations

We conclude our analysis of the search space we consider by reporting, for both machines, the high-level optimizations that are used for the best performing optimized program variant obtained. Table 3 reports, for each benchmark and compiler, the optimizations turned on for the best performance on Nehalem. Table 4 shows the same data for the Q9650.

Table 3 Summary of the high-level optimizations used to achieve the best performance on Intel E-5620, when using GCC 4.5 (g) or ICC 11.1 (i)

We observe interesting correlations between transformations. For instance, the pre-transformation for vectorization is always beneficial (or at least, not detrimental) when using the SIMD-level parallelization pass. Wavefronting is useful for some stencils which are tiled, but not all. For instance, wavefronting is not beneficial on Nehalem for fdtd-apml. In contrast, wavefronting is always used in the best optimization for stencils on the Q9650 machine, together with tiling.

Table 4 Summary of the high-level optimizations used to achieve the best performance on Intel Q9650, when using GCC 4.5 (g) or ICC 10.1 (i)

The benefit of rectangular tiling is demonstrated in numerous benchmarks, such as adi and dynprog for Q9650 or correlation and covariance for Nehalem. We observe that contrary to what is expected, tiling all loops which carry data reuse does not necessarily lead to the best performance. For instance, jacobi-1d-imper is not tiled, and for Nehalem, strip-mining the inner loop is the best choice for correlation and covariance. This result is counter-intuitive, and its observation is a contribution of our extensive analysis.

Building cost models for tiling is challenging, as illustrated by the difference in dimensions to be tiled between compilers and between architectures. On the other hand, as expected for Nehalem, and for all but two benchmarks, thread-level parallelization is part of the optimal transformation, and a naive heuristic could simply always apply this optimization. This was expected, as the number of available threads on Nehalem is large (16 total), thus making coarse-grain parallelization a critical optimization for performance. Looking at the Q9650 shows a very different trend, where for 7 out of 30 benchmarks, thread-level parallelization is not optimal when using GCC. In addition, GCC 4.5 does not have automatic OpenMP parallelization turned on at -O3, so it is clear that the compiler did not parallelize outer loops. This also confirms our hypothesis that on the Q9650, data locality and instruction-level parallelism are more critical performance factors.

4 Selecting Effective Transformations

In this paper, we focus on the search for the polyhedral optimizations with the highest impact, as described in Sect. 2. When considering a space of semantics-preserving polyhedral optimizations, the optimization space can lead to a very large set of possible optimized versions of a program [44]. We achieved a tremendous reduction of the search space size using expert pruning when compared to these methods, but we still have hundreds of sequences to consider. We propose to formulate the selection of the best sequence as a learning problem, and use off-line training to build predictors that compute the best sequence(s) of polyhedral optimizations to apply to a new program.

4.1 Characterization of Input Programs

In this work, we characterize the dynamic behavior of programs by means of hardware performance counters. Using performance counters abstracts away the specifics of the machine, and overcomes the lack of precision and information of static characteristics. Also, models using performance counter characteristics of programs have been shown to outperform models that use only static features of programs [12].

A given program is represented by a feature vector of performance counters collected by running the program on the particular target machine. We use the PAPI library [38] to gather information about memory management, vectorization, and processor activity. In particular, for all cache and TLB levels, we collect the total number of accesses and misses, the total number of stall cycles, the total number of vector instructions, and the total number of issued instructions. All counter values are normalized using the total number of instructions issued. Table 5 lists the PAPI counters we have used for this work.

Table 5 Performance counters (PC): We collected 56 different performance counters available using PAPI library to characterize a program

4.2 Speedup Prediction Model

A general formulation of the optimization problem is to construct a function that takes as input features of the program being optimized and generates as output one or more optimization sequences predicted to perform well on that program. Previous work [12, 21] has proposed to model the optimization problem by characterizing a program using performance counters. We use a prediction model originally proposed by Cavazos et al. [11, 20], slightly adapted to support polyhedral optimizations. We refer to it as a speedup prediction model.

This model takes as input a tuple \((F, T)\), where \(F\) is the feature vector of all hardware counters collected when running the original program and \(T\) is one of the possible sequences of polyhedral primitives. Its output is a prediction of the speedup \(T\) should achieve relative to the performance of the original code. Figure 6 illustrates the speedup prediction model.

Fig. 6 Our speedup prediction model takes in two inputs. The first input is a characterization of the program consisting of a feature vector F of performance counters. The second input is a set of possible optimizations from the polyhedral optimization space. The target value for this predictor is the speedup of a specific optimization set T over baseline

We implemented the speedup prediction model with the six different machine learning algorithms shown in Table 6, using Weka V3.6.2 [10]. For each machine learning algorithm, we used the default settings, except for support vector machines (SVM) and linear regression (LR), for which we conducted experiments to tune the parameters.

Table 6 The six machine learning algorithms we evaluated in this paper

A regression-based model captures the relationship between dependent and independent variables, and can be used to predict the dependent variable as the independent variables change. We used linear regression to fit a predictive model whose dependent variable is the speedup of a program variant, and whose independent variables are the performance counters and the polyhedral optimization sequence. Support vector machines (SVM) form a supervised machine learning algorithm, used for both classification and regression, that can apply linear techniques to non-linear problems. An SVM first maps the training data, using kernel functions, into a space where it can be separated linearly, and then separates the data with a hyperplane in that space. SVMs do not find just any separating hyperplane, but the maximum-margin hyperplane, i.e., the one among all hyperplanes that gives the largest separation of the training examples. IBk and K* are instance-based learning algorithms that predict the classification of new instances based on instances already classified in memory. These types of algorithms assume that similar instances belong to a similar class. IBk first selects the K nearest neighbors of a new instance, then selects the class of the closest neighbor amongst them [2]. We used the Euclidean distance over all features to find the nearest neighbors, but other distance functions can also be used, e.g., the Manhattan distance. K* also selects from instances already classified and chooses the class of the predominant instance, but uses an entropic distance, defined as the complexity of transforming one instance into another [15]. M5P generates M5 model trees, which look like conventional decision trees but have linear regression functions at the leaf nodes. MLP (MultiLayer Perceptron) is a neural network classifier, and we used back-propagation to train the network.

4.3 Model Generation and Evaluation

We train a specific model for each target architecture, as the specifics of a machine (e.g., cache miss cost, number of cores, etc.) significantly influence which transformations are effective for it. In addition, to evaluate the quality of the machine learning algorithms used, we train one specific model per machine learning algorithm for each machine.

A model is trained as follows. For a given program \(P\) in the training set, (1) we compute its execution time \(E\) and collect its performance counters \(F\); (2) for all possible sequences of polyhedral optimizations \(T_i\), we apply the transformation to \(P\) and execute the transformed program on the target machine, which gives an execution time \(E_{T_i}\) and the associated speedup \(S_{T_i} = E / E_{T_i}\); (3) we train the model with the entry \((F, T_i)\) and the target value \(S_{T_i}\). This is repeated for all programs in the training set. This is illustrated in Fig. 7.

For a new input program, we first collect a feature vector of performance counters from several runs of the program. Then, we use the model to predict the expected speedup of each set of optimizations \(T_i\). By predicting the performance of each possible set, it is possible to rank them according to their expected speedup and select the set(s) with the greatest speedup. This is illustrated in Fig. 8.

Fig. 7 To train the model, we collect performance counters F from each program in the training set. We also collect the speedup for each optimization set \(T_i\). We build the prediction model with F, \(T_i\), and the associated speedup over baseline for \(T_i\)

Fig. 8 To use the model, we collect performance counters F for a given new program. We give F and an optimization set T as input to the model. The model produces a predicted speedup for each optimization set \(T_i\). This prediction can be used in both non-iterative and iterative scenarios

Each of our models must predict optimizations to apply to unseen programs that were not used in training the model. To do this, we need to feed as input to our models a characterization of the unseen program. We then ask the model to predict the speedup of each possible optimization set \(T_i\) in our polyhedral optimization space, given the characteristics of the unseen program. We order the predicted speedups to determine which optimization set is predicted best, and we apply the predicted best optimization set(s) to the unseen program.

Note that in the experiments presented below, we use a leave-one-benchmark-out cross-validation procedure for evaluating our models. That is, the six models (LR, SVM, IBk, K*, M5P, and MLP) are trained on all program variants of the \(N-1\) remaining benchmarks and evaluated on the program variants of the benchmark that was left out. This procedure is repeated for each benchmark to be evaluated, that is, we construct a different model for each program in our training set.

4.4 1-Shot and Multi-Model Evaluation

The models presented above output a single optimization sequence for an unseen program. For the rest of the paper, we refer to this approach as a 1-shot model.

It is worth considering an empirical evaluation of several candidate optimization sets, since a given model may not correctly predict the actual best optimization set for a program. A typical source of misprediction comes from the back-end compiler: depending on the input source code, the back-end compiler may perform different optimizations. For example, we observed in our experiments that for the benchmark 2mm (computing two matrix multiplications \(tmp = A.B; output = tmp.C\)), the best performance when using Intel ICC 11.1 is achieved when no tiling is applied by our polyhedral compiler. We suspect this is because ICC performs specific optimizations on this particular computation (matrix-multiply), since in this setup tiling 2mm to make it L1-resident decreases the performance. However, another program with similar hardware counter features may be compiled entirely differently by ICC, and as shown by our experiments even the same program is handled differently by ICC 11.1 and ICC 10.1 on two different machines.

We propose to also evaluate an approach that combines the output of multiple individual models. In contrast to the 1-shot model, which does not require running the transformed variant on the machine, we propose to resort to a small number of iterative compilation steps. We call this second scheme multi-model, since we use each optimization set that was predicted best by each individual 1-shot model. The optimization set which has the best observed performance is retained as the output of the compilation process.

5 Experimental Results

We now present extensive experimental results, using the platforms and benchmarks detailed in Sect. 3.

5.1 Training and Testing Time

We report in Table 7 the training and testing time for the 6 ML algorithms we have used. We recall that we have used Weka V3.6.2 [10], running on a Core 2 Duo. The training operates on all the program variants generated for a set of \(N-1\) benchmarks, out of the \(N\) benchmarks in our test suite. The testing is done on all the variants of the benchmark which was left out during training. Because the number of variants differs across benchmarks, we report the range of training and testing times in Table 7.

Table 7 Weka V3.6.2 Training and Testing time, for the various ML algorithms we have used

5.2 Evaluation of the Machine Learning Models on Nehalem

We show in Tables 8 and 9 the performance of the six different machine learning models we have evaluated using the 1-shot approach on the Nehalem machine. For each benchmark, we report the maximal performance improvement over the original code in the search space (Opt column). This improvement is relative to the performance obtained by running the native compiler on the original code. We also report the improvement obtained by a simple random method of generating an optimization set, averaging the performance of 100 distinct random draws (Random column). Finally, we report the performance improvement achieved by the optimization set in our search space that is predicted best by each model (columns LR to MLP). The fraction (in percentage) of the optimal improvement is also reported.

Table 8 This table shows performance improvement on the Intel Xeon E5620 (baseline: GCC 4.5 -O3) for the 1-shot model
Table 9 This table shows performance improvement on the Intel Xeon E5620 (baseline: ICC 11.1 -fast) for the 1-shot model

5.2.1 Analysis

We observe that a very important performance improvement can be achieved with the polyhedral optimization sets we consider, on average a speedup of \(9\times \) for GCC and \(5.6\times \) for ICC, with peaks up to 28\(\times \). The range of the performance improvement is wide: with GCC (ICC) as the back-end compiler, 7 (12) benchmarks show an improvement below \(2\times \), while 10 (7) show an improvement above \(10\times \). We also observe that for all benchmarks, there exists at least one polyhedral optimization set that improves the performance.

Note that we have shown in Sect. 3 that for many benchmarks a vast majority of optimization sets can decrease performance. This is confirmed using Random: choosing optimization sets randomly, we decrease performance compared to the original code for 11 benchmarks when using GCC (17 for ICC) as the back-end compiler. Still, a simple random strategy increases performance on average by up to \(2.5\times \) using GCC. This is explained by the very large performance improvement that can be obtained for some benchmarks where the majority of transformed variants provide a solid improvement, such as 2mm or 3mm. Nevertheless, the fraction of possible improvement achieved by a simple strategy such as Random remains very low, i.e., around 25 % for both compilers.

We observe that each of the six machine learning models we evaluated outperforms Random, with K* reaching almost \(7\times \) performance improvement on average with GCC, and IBk reaching \(3.73\times \) with ICC. On average, K* and IBk are the two best models for both compilers. A very interesting observation is that none of the models, including K*, performs consistently best for all benchmarks. Table 9 shows the ability of each model to successfully predict the best optimization set for at least some benchmarks, while other models fail. That is, each model produces the best improvement on at least one benchmark. As an illustration, K* using GCC on gramschmidt does not provide an improvement, while LR produces an almost optimal variant reaching \(25.04\times \) improvement (98 % of the maximal improvement possible for this benchmark). Even MLP, the worst-performing model on average for GCC, succeeds in finding the optimal variant for gemver, while all other models fail to do so. The effectiveness of the models on average differs depending on the back-end compiler used. For example, MLP using ICC performs well (it is the third best model on average), while it is the worst model for GCC. Nevertheless, LR and SVM perform on average consistently worse than IBk and M5P, and K* performs consistently best. We conclude this analysis by observing that for both compilers, no model is able to guarantee, in one shot, that it will not decrease the performance of the original code. K* significantly decreases the performance for 5 benchmarks with GCC and 6 with ICC, while SVM decreases the performance for 11 benchmarks with both compilers.

This motivates the multi-model approach evaluated in Sect. 5.4, which combines the predictions of multiple models to select a transformation. We show in Sect. 5.4 that our multi-model approach decreases the performance for only one benchmark on Nehalem, for both compilers, while providing substantial performance improvements for the other benchmarks.

5.3 Evaluation of the Machine Learning Models on Core 2 Quad

Tables 10 and 11 report, for all benchmarks, the performance of the models we have evaluated in a similar fashion as in the previous section, but for the Q9650 machine.

Table 10 The table describes the performance improvement for the Intel Quad Q9650 (baseline: GCC 4.4 -O3) for our 1-Shot model
Table 11 The table describes the performance improvement for the Intel Quad Q9650 (baseline: ICC 10.1 -fast) for our 1-Shot model

5.3.1 Analysis

Similarly to the Nehalem machine, we observe that for most benchmarks and compilers there exists at least one polyhedral transformation that achieves performance significantly better than the original code. Nevertheless, the average maximal performance improvement in our search space is lower on the Q9650 machine than on the Nehalem. The Q9650 has fewer computing units available (fewer cores), and thus the maximal improvement obtained by exploiting parallelism is reduced. Still, a significant improvement is observed, from 3.7\(\times \) on average for GCC to 4.6\(\times \) for ICC, with peaks up to 23\(\times \). The number of benchmarks for which there is little to no improvement increased relative to Nehalem, with up to 13 benchmarks having a speedup lower than 2\(\times \) for GCC.

Interestingly, in contrast to the Nehalem, K* is not the best-performing model on average for the Q9650. M5P and IBk perform consistently better on average, and they also reduce the number of benchmarks for which the performance is decreased by the selected transformation. We also notice for the Q9650 that each model performs well on a specific benchmark while the other models fail for that benchmark. For example, LR chooses an optimization set that reaches 90 % of the maximal improvement for 2mm, while all other models fail to discover more than 34 % of the maximal improvement for GCC. Similarly, MLP is the only model that finds the optimal optimization set for fdtd-2d. This pattern is observed for several benchmarks.

We conclude this analysis by observing that, as with the Nehalem results, the models can decrease the performance of several benchmarks relative to not optimizing them with PoCC. For example, M5P decreases the performance of seven benchmarks for both compilers. In contrast, the multi-model approach (discussed next) significantly reduces the number of benchmarks on which it obtains worse performance than not using PoCC.

5.4 Multi-Model Evaluation

Table 12 reports the performance achieved by the multi-model approach for each benchmark on each architecture/compiler pair. In contrast to the 1-shot approach, which is a compile-time-only approach, our multi-model approach requires six optimized variants of the code to be executed on the target machine, corresponding to the optimization configurations predicted by each of the six individual models. That is, we optimize a benchmark with the optimization configuration predicted to give the best speedup by each machine learning model. We then evaluate each of these optimized variants of the benchmark and return the variant that gives the best speedup, as sketched below.
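The following is a minimal sketch of this selection loop, assuming hypothetical helpers (Predictor, predictBestConfig, measureSpeedup) that stand in for the actual prediction and timing infrastructure; none of these names come from PoCC or Weka.

```java
import java.util.List;

// Minimal sketch of the multi-model selection step. Predictor, predictBestConfig()
// and measureSpeedup() are hypothetical placeholders, not part of PoCC or Weka.
interface Predictor {
    String predictBestConfig(double[] perfCounters); // predicted PoCC optimization configuration
}

public class MultiModelSelection {
    // Placeholder: apply the configuration with PoCC, run the variant once on the
    // target machine, and return the baseline time divided by the variant's time.
    static double measureSpeedup(String benchmark, String config) {
        return 1.0; // stub
    }

    public static String selectBest(String benchmark, double[] counters, List<Predictor> models) {
        String bestConfig = null;
        double bestSpeedup = Double.NEGATIVE_INFINITY;
        for (Predictor m : models) {                    // the six trained models
            String cfg = m.predictBestConfig(counters); // compile-time prediction
            double s = measureSpeedup(benchmark, cfg);  // one run per model, six in total
            if (s > bestSpeedup) { bestSpeedup = s; bestConfig = cfg; }
        }
        return bestConfig;                              // keep the fastest of the six variants
    }
}
```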

Table 12 This table shows the performance improvement of using the multi-model approach for all configurations

5.4.1 Analysis

We observe in Table 12 that our multi-model approach significantly outperforms the 1-shot approach for all architecture/compiler pairs we experimented with. In terms of average performance improvement, the multi-model consistently discovers more than 70 % of the optimal improvement, up from the 58 % achieved with the 1-shot approach for ICC on the Q9650. More importantly, for the majority of the benchmarks on Nehalem, the optimal (or close to optimal) variant is discovered by the multi-model. For the Q9650, the optimal variant is discovered by the multi-model approach for more than a third of the benchmarks.

The multi-model approach also achieves very significant improvements in terms of performance guarantees: for every architecture/compiler pair, the multi-model approach degrades performance on at most two benchmarks. However, some benchmarks still see a performance degradation with the multi-model approach. For example, dynprog sees its performance consistently degraded for all architecture/compiler pairs; this is a benchmark for which there is very little performance improvement to be discovered in the search space. We also see a degradation for trmm on the Q9650 using GCC.

In addition, we observe that our multi-model approach on Nehalem using ICC fails to discover good performance for fdtd-2d and, to a lesser extent, for jacobi-2d. This is consistent with the 1-shot results shown before. We suspect this is because, for these two stencil codes, in contrast to the other benchmarks, tiling is not part of the optimal transformation (shown in Table 3). It is therefore likely that the models output a tiled variant, which for this very specific architecture/compiler configuration does not lead to the best performance.

Table 13 This table shows the average performance improvement of the multi-model approach and the 6-shot models for all configurations

Table 13 shows that the average performance improvement of our multi-model approach can exceed that of any single model given six evaluations, an approach we term 6-shot. That is, using our multi-model approach (which uses six predicted optimization configurations, one from each model), we achieve better performance improvements than by using the top six optimization configurations predicted by any single model.

5.5 Evaluation of Feature Selection

In this section, we evaluate our models using a subset of the performance counters deemed to be the most predictive. The main benefit of using a subset of counters is that we can reduce the number of application runs it takes to characterize an application. Collecting the full set of performance counters available on an architecture can take a large number of program executions, especially if multiplexing is not used. In addition, training time is reduced when fewer performance counters are collected. To find an effective subset of performance counters, we analyzed the output of LR models trained using greedy attribute selection. The output of the model lists the performance counters that were most informative in building the model. For each architecture/compiler configuration, we used the subset of performance counters that was used to build the LR model. As a result, we used 2 to 5 performance counters depending on the architecture/compiler configuration, as shown in Table 14. Each configuration had a unique subset of important performance counters, including counters for different cache levels, TLB statistics, and instruction types.
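As a rough illustration of this step, the following sketch trains a Weka linear regression model with greedy attribute selection (Weka option -S 2) and prints the resulting regression equation, from which the retained counters can be read off; the file name counters.arff and the ARFF layout are assumptions for the example, not part of our infrastructure.

```java
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class CounterSelection {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file: one row per sample, performance counters as
        // attributes, measured speedup as the last attribute.
        Instances data = DataSource.read("counters.arff");
        data.setClassIndex(data.numAttributes() - 1);

        LinearRegression lr = new LinearRegression();
        lr.setAttributeSelectionMethod(new SelectedTag(
            LinearRegression.SELECTION_GREEDY, LinearRegression.TAGS_SELECTION)); // -S 2
        lr.buildClassifier(data);

        // The printed equation only mentions the counters kept by the greedy selection.
        System.out.println(lr);
    }
}
```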

Table 14 This table shows the 2 to 5 performance counters that were picked by the LR model for each architecture/compiler configuration

Using the six machine learning algorithms evaluated in this research, we retrained our models with the subset of performance counters deemed important by the LR model, and we compared these models to models trained with the full set of performance counters available on each architecture. Table 15 shows results from evaluating each model in a 1-shot scenario, and Table 16 shows results for our multi-model scenario using the subset of performance counters.
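The retraining step can be sketched as follows, keeping only the selected counter columns before rebuilding a model; the attribute indices shown ("3,8,11") are purely illustrative, and any of the six learners could be substituted for K* here.

```java
import weka.classifiers.lazy.KStar;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RetrainOnSubset {
    public static void main(String[] args) throws Exception {
        Instances full = DataSource.read("counters.arff");  // same hypothetical file as above
        full.setClassIndex(full.numAttributes() - 1);

        Remove keep = new Remove();
        keep.setAttributeIndices("3,8,11,last");  // retained counters plus the speedup target
        keep.setInvertSelection(true);            // keep the listed columns, drop the rest
        keep.setInputFormat(full);
        Instances reduced = Filter.useFilter(full, keep);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        KStar kstar = new KStar();                // retrain one of the six learners on the subset
        kstar.buildClassifier(reduced);
        System.out.println(kstar);
    }
}
```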

Table 15 This table shows the performance of 1-Shot models using a subset of performance counters
Table 16 This table shows the performance improvement of Multi-Models using the subset of performance counters found when using the LR models

5.5.1 Analysis

Table 15 shows the results of building our models using a subset of performance counters and using them in a 1-shot scenario. For our 1-shot models, we observed that the LR models give the same performance for both the subset and the full set of performance counters, except for Q9650-GCC. However, even this exceptional case shows only a slight degradation, less than 0.05\(\times \), compared to using the full set of counters. We found that the SVM, IBk, and MLP models trained with our performance counter subset gave smaller improvements than when using the full set. However, some architecture/compiler configurations and machine learning algorithms did achieve improvements using the subsets of performance counters. For example, Nehalem-ICC with IBk achieved 60.7 % of the optimization space optimal (OPT), while the same model trained with the full set of counters achieved only 53.8 %. We also observed that the K* and M5P models often achieve better average performance improvements with the subset of performance counters than with the full set, especially the K* model on the Nehalem-GCC configuration: there, we achieved 81.1 % of OPT with the subset of performance counters, versus 77 % of OPT with the full set. Thus, we are able to build a model with better prediction quality using a smaller set of performance counters than using the full set.

Table 16 shows our results when using the multi-model approach with the subset of performance counters. For the Q9650 architecture, the models using the subset of counters slightly outperformed those using the full set. Although the improvements are not substantial, they show the potential of using a subset of performance counters, since we are able to build a model that achieves performance similar to using the full set. For the Nehalem configurations, we observed the opposite of the Q9650 results: models trained with the full set of performance counters outperformed those trained with the subset.

5.6 Evaluation of ML Algorithms Parameters

In this section, we discuss the impact of the various parameters of the machine learning algorithms we tested and their relative impact on the quality of the predictors we build. Table 17 summarizes the results, in terms of average performance, for the numerous parameter values we tested. We use the Weka option designations, and we mark with an asterisk the parameter configuration selected for the experiments presented in Sect. 5.2. We show in bold the best average performance improvement obtained for a given model category (i.e., MLP, SVM, etc.) and a given machine \(+\) compiler pair (e.g., Nehalem-ICC, Nehalem-GCC).

Table 17 Average performance improvement with different machine learning parameter configurations for 1-Shot PC Model

For LR, we evaluated different attribute selection methods: S = 0 selects the M5\(\prime \) method, S = 1 uses no attribute selection, and S = 2 selects a greedy method. The default setting in Weka is S = 0. For all machine-compiler configurations except Q9650-GCC, we achieved the best prediction results with the greedy method (S = 2). For Q9650-GCC, the model with no attribute selection gives the best results, but only by a marginal fraction over the greedy method; since the greedy method provides the best results for three out of four configurations, we selected S = 2.

For SVM, we evaluated the GaussianKernel and the NormalizedPolyKernel kernel functions. We evaluated a range of gamma values (G) for the GaussianKernel and a range of exponent values (E) for the NormalizedPolyKernel. For both kernels, we evaluated different complexity values (C). We tried G = 0, 0.01 (default), 25, 30, 50, 75; C = 1 (default), 2, 4; and E = 1, 2 (default), 4, 8, 16. The best performance improvement is achieved with the NormalizedPolyKernel, using C = 1 and E = 8, for all machine-compiler configurations.

For IBk, we tested several numbers of neighbors: K = 1 (default), 2, 5. The default value for K in Weka is 1, but we observe that K = 5 gives the best prediction results for both compilers on the Nehalem. Although this configuration does not give the best prediction for the Q9650, its results are very close to the best. Thus, we selected K = 5 for our experiments.

For M5P, we evaluated different values of M, the minimum number of instances. We tried M = 1, 2, 4 (default), 10, 50. M = 4 gives the best result for three of the four machine-compiler configurations; M = 2 gives the best result for Q9650-ICC, but only by 0.02\(\times \) over M = 4. We therefore selected M = 4, which is also the default Weka setting.

For the K* algorithm, we tuned B and M, where B is the global blending parameter and M determines how missing attribute values are handled. We evaluated B = 0, 20 (default), 25, 50, 75, 100 and M = a (default), t, where \(a\) averages the column entropy curves and \(t\) normalizes over the attributes. B = 20, M = a achieves the best prediction for both machines with the ICC compiler and is fairly close to the best for both machines with GCC.

For MLP, we tuned three parameters: the learning rate (L), the number of training iterations (N), and the hidden layer size (H). We evaluated L = 0.05, 0.1, 0.3 (default), 0.4, 0.5, 0.9; N = 500 (default), 1000, 1500; and H = a (default), t. For H, \(a\) is defined as \((\text{number of attributes} + \text{number of classes})/2\), and \(t\) is defined as \(\text{number of attributes} + \text{number of classes}\). Each machine-compiler configuration requires different parameter values for the best performance improvement with MLP. We selected the configuration L = 0.5, N = 500, H = t.
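For reference, the following sketch instantiates the six Weka learners with the parameter settings discussed above (the asterisked rows of Table 17), using Weka's standard option strings; training-data loading and model building are omitted.

```java
import weka.classifiers.Classifier;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMOreg;
import weka.classifiers.lazy.IBk;
import weka.classifiers.lazy.KStar;
import weka.classifiers.trees.M5P;
import weka.core.Utils;

public class SelectedConfigurations {
    public static Classifier[] build() throws Exception {
        LinearRegression lr = new LinearRegression();
        lr.setOptions(Utils.splitOptions("-S 2"));          // greedy attribute selection

        SMOreg svm = new SMOreg();
        svm.setOptions(Utils.splitOptions(                  // NormalizedPolyKernel, C = 1, E = 8
            "-C 1 -K \"weka.classifiers.functions.supportVector.NormalizedPolyKernel -E 8\""));

        IBk ibk = new IBk();
        ibk.setOptions(Utils.splitOptions("-K 5"));          // 5 nearest neighbors

        M5P m5p = new M5P();
        m5p.setOptions(Utils.splitOptions("-M 4"));          // minimum of 4 instances per leaf

        KStar kstar = new KStar();
        kstar.setOptions(Utils.splitOptions("-B 20 -M a"));  // global blend 20, average entropy curves

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setOptions(Utils.splitOptions("-L 0.5 -N 500 -H t"));

        return new Classifier[] { lr, svm, ibk, m5p, kstar, mlp };
    }
}
```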

6 Related Work

In recent years, considerable research has been performed on iterative compilation, and its benefits have been reported in several publications [1, 17, 18, 24, 29, 33]. Iterative compilation has been shown to regularly outperform the most aggressive compilation settings of commercial compilers, and it has often reached performance comparable to hand-optimized library functions [25, 48, 56, 57].

Machine learning and search techniques applied to compilation have been studied in many recent projects [16, 34, 37, 51, 52, 60, 61]. These studies have developed machine learning-based algorithms to efficiently search for the best selection of optimizing transformations, the best values for transformation parameters, or the best sequences of compiler optimizations. Generally, these studies customize optimizations for each program or local code segment, some based on code characteristics.

For example, Monsifrot et al. [37] used decision trees to decide whether to enable or disable loop unrolling; this was one of the early efforts in using machine learning to tune a high-level transformation. They showed an improvement of 3 % over a hand-tuned heuristic and of 2.7 % over g77's unrolling strategy on the IA64 and UltraSPARC, respectively. Stephenson et al. [52] used genetic programming to tune heuristic priority functions for three compiler optimizations within Trimaran's IMPACT compiler. For one of the optimizations, register allocation, they were only able to achieve on average a 2 % increase over the manually tuned heuristic. The results in these papers highlight the diminishing returns obtained when controlling only a single optimization. In contrast, the research in this paper controlled the numerous optimizations available in the PoCC compiler.

Kulkarni et al. [34] introduced a system that used databases to store previously tested code, thereby reducing running time. They also disabled some optimizations that did not seem to improve the running time of the kernel. These techniques are very expensive and therefore only effective when programs are extremely small, such as those used in embedded domains. Cooper et al. [17] used genetic algorithms to address the compilation phase-ordering problem. They were concerned with finding “good” compiler optimization sequences that reduced code size. Their technique was successful in reducing code size by as much as 40 %. However, their technique is application-specific, i.e., a genetic algorithm had to be retrained for each new program to decide the best optimization sequence for that program.

An innovative approach to iterative compilation was proposed by Parello et al. [41], who used performance counters at each stage to propose new optimization sequences. An application was run and performance counter information was measured in order to identify performance anomalies that could be resolved by applying certain optimizations. The anomalies and the optimizations proposed to resolve them were encoded in a manually constructed decision tree. Even though this was a very systematic approach, manually constructing this decision tree took weeks for each benchmark and was specific to a particular target architecture. In contrast, our technique does not need to collect performance counters during each iteration of optimizing the program; instead, a model is produced that can predict the best optimization sequences for a program. Also, we use machine learning to automatically construct the models used to predict the optimization configurations.

Cavazos et al. [12] address the problem of predicting good compiler optimizations by using performance counters to automatically generate compiler heuristics. That work was limited to the traditional optimizations found in the PathScale compiler. Despite the numerous transformations considered, the complexity is not comparable to the restructuring transformations available in state-of-the-art polyhedral frameworks, such as the one we used in this work.

Park et al. [42] propose a novel program characterization technique, i.e., graph-based characterization, to use as input to a model that predicts optimizations to apply to a program. In this work, the authors characterize programs using the program’s control flow graph (CFG), and they construct prediction models using SVMs with a shortest path graph kernel. These specialized graph kernels take as input characterizations that preserve the graph-based topology of the program, in contrast to previous characterization techniques that are represented as fixed-length vectors. The authors show that this method of characterizing programs is competitive with previous characterization techniques.

Chen et al. [13] developed the CHiLL infrastructure, a polyhedral loop transformation and code generation framework. Tiwari et al. [53] coupled the Active Harmony search engine to CHiLL to automatically tune some high-level transformation parameters, such as tile sizes. In this paper, we target quite a different search space, i.e., we balance the trade-off between several possibly contradictory objectives, such as parallelization, data locality enhancement, and vectorization, demonstrating our results on a variety of benchmarks and machines.

Bondhugula et al. [8, 9] proposed the first integrated heuristic for parallelization, fusion, and tiling in the polyhedral model subsuming all the above optimizations into a single tunable cost-model. Individual objectives such as the degree of fusion or the application of tiling can be implicitly tuned by minor ad-hoc modifications of Bondhugula’s cost model. Pouchet et al. [45] performed empirical search to directly find the coefficients of the affine scheduling matrix in a polyhedral framework. While these results showed significant improvements on small kernels, the empirical search needed up to a thousand runs for larger benchmarks [44]. In this work, we have abstracted the scheduling matrix behind high-level polyhedral primitives and the associated cost models for selecting the enabling transformations, reducing the search space to as little as a few hundred possibilities in place of the billions of possible schedules. This enabled us to achieve on average up to 80 % of the optimization space optimal performance in no more than six runs.

7 Conclusion

The problem of improving performance of applications through compiler optimizations has been extensively studied, in particular, to improve the portability of the optimization process across a variety of architectures. Iterative compilation and machine learning techniques have been demonstrated as powerful mechanisms to automatically compute good compiler flags, improving the speed of the generated program and automatically adapting compilers to each new target architecture.

However, in the multi-core era with increasingly complex hardware, very advanced high-level transformation mechanisms are required to efficiently map a program onto the target machine. Complex combinations of loop transformations are needed to implement the most effective orchestration of tiling, parallelization, and vectorization. While all of these optimizations have been studied independently, in practice they must be evaluated together to achieve the best possible performance.

A modern loop nest optimizer faces the challenge of sometimes contradictory cost models, simply because there is no single solution that may maximize parallelism, vectorization, data locality and still achieve the best performance. Very little work has been done to date in using learning models for selecting high-level transformations, to drive a loop nest optimizer that operates on a very rich and complex search space. Our work is the first to propose the use of learning models to compute effective loop transformations in the polyhedral model, encompassing tiling, parallelization, vectorization, and data locality improvement.

In this work, we leverage the power of the polyhedral transformation framework to automatically build very complex sequences of transformations, enabling tiling and parallelization on a wide range of numerical codes. To select an effective optimization in this space, we have implemented a speedup predictor that correlates the run-time characteristics of a program (modeled with performance counters) with the speedup expected from a given polyhedral optimization configuration. We evaluated our approach using several machine learning algorithms, on a variety of benchmarks, and on two different multi-core machines. For the test suite, the best points in our optimization search space yield an average 9\(\times \) speedup (with peaks of up to 28\(\times \)) with GCC on an Intel Xeon E5620, and 3.6\(\times \) on an Intel Core 2 Quad Q9650. Using the predictive machine learning models and testing at most six candidate optimization configurations on the target machine, we achieve an average speedup of \(8.7\times \) on the E5620 and 3.2\(\times \) on the Q9650 with GCC as the backend compiler, and average speedups of 4.8\(\times \) and 3.6\(\times \), respectively, using Intel ICC as the backend compiler.