
1 Introduction

Loop tiling is a loop transformation that exploits locality of data accesses in loop nests; the reused data stay in the cache and thus the number of cache misses is reduced. Although loop tiling does not always translate into performance gains, it is one of the key optimizations for memory-bound loop kernels. The selection of an efficient tile size is of paramount importance, as tiles of different sizes can lead to significant variations in performance. In this paper, we define a tile size as efficient if it achieves a reduced number of cache misses.

The two main strategies to address the tile size selection problem are analytical [16] and empirical [24]. The former refers to static approaches in which the tile size is selected based on static code analysis of the loop kernel and the memory configuration (number of caches, cache sizes, associativity, line size). Typically, the analytical model outputs the cache misses as a function of the tile sizes, the input size (of the executed kernel) and the cache characteristics. The second strategy refers to empirical (experiment-based) approaches that rely on auto-tuning. In auto-tuning, the input program is executed multiple times, assuming different tile sizes, until the best solution is found. The input program is considered a black box and no information about the source code is extracted.

In this paper, we first demonstrate two important inefficiencies of current analytical models and provide the theoretical background on how current models can address these inefficiencies. Second, we propose a new, more accurate analytical model for loop tiling for single-threaded programs.

The first drawback of current analytical models is that they do not accurately calculate the tile sizes and, as a consequence, additional unforeseen cache misses occur (not captured by the model). The second drawback is that, in most cases, the tiles cannot remain in the cache due to the cache modulo effect. This is because the cache line size, the cache associativity and the data reuse of tiles are not efficiently taken into account. Therefore, current models cannot accurately calculate the number of cache misses for each tile size, leading to sub-optimal tile sizes. In contrast, the proposed method provides efficient tile sizes by accurately estimating the number of cache misses for each tile size.

Our experimental results show that by using our method it is possible to estimate the number of cache misses with an accuracy of about \(1\%\) using simulation and about \(3\%\) and \(5.5\%\) by using the processor’s hardware counters on L1 data cache and L3 cache, respectively, leading to more efficient tile sizes for static loop kernels.

The remainder of this paper is organized as follows. In Sect. 2, the related work is reviewed. The proposed methodology is presented in Sect. 3 while experimental results are discussed in Sect. 4. Finally, Sect. 5 is dedicated to conclusions.

2 Related Work

In [20], an analytical model for loop tile selection is proposed for estimating the memory cost of a loop kernel and for identifying the optimal tile size. However, cache associativity is not taken into account. In [8], the authors combine loop tiling with array padding in order to improve the tile size selection process for specific array sizes. In [4], the authors use Presburger formulas to express cache misses, but they fail to accommodate the high set associativity values of modern caches. In [16], an improved analytical model is proposed in which the associativity value is taken into account; however, the cache hardware parameters (cache line size and associativity) and data reuse are still not exploited efficiently.

As we showcase in this work, there is ample room for improvement in existing analytical approaches, as the cache line size, the associativity and the arrays' memory access patterns are not fully exploited.

Because the problem of finding the optimum tile size is very complex and involves a vast exploration space [9], in addition to general methods, a large number of algorithm-specific analytical models also exist for Matrix-Matrix Multiplication (MMM) [12, 14], Matrix-Vector Multiplication [13], tensor contractions [15], Fast Fourier Transform [10], stencil [23] and other algorithms, but the proposed approaches cannot be generalized. In particular, regarding stencil applications, there has been a long thread of research and development tackling data locality and parallelism, where many loop tiling strategies have been proposed, such as overlapped tiling [5, 26], diamond tiling [2] and others.

The second line of techniques for addressing the tile size selection problem relies on empirical approaches. A successful example is the ATLAS library [25] which performs empirical tuning at installation time, to find the best tile sizes for different problem sizes on a target machine. The main drawback in empirical approaches is the enormous search space that must be explored.

Moreover, there are several frameworks able to generate tiled code with parameterized tiles, such as PrimeTile [7] and PTile [1]. Parameterized tiling refers to the application of the tiling transformation without employing predefined tile sizes, instead inserting symbolic parameters that can be fixed at runtime [19]. In [1], a compile-time framework is proposed for tiling affine nested loops whose tile sizes are handled at runtime. In [19], the authors present a formulation of the parameterized tiled loop generation problem using a polyhedral set. Pluto [3] is a popular polyhedral code generator including many additional optimizations such as vectorization and parallelization.

In [6], a thorough study of the major known tiling techniques is presented. In [21], the authors use an auto-tuning method to find the tile sizes when the outermost loop is parallelised. In [11], loop tiling is combined with cache partitioning to improve performance in shared caches. Finally, in [22], a hybrid model is proposed by combining an analytical with an empirical model. However, this model ignores the impact of set associativity in caches.

3 Proposed Methodology

3.1 Inefficiencies of Current Analytical Models

A. Current analytical models do not accurately calculate the tile sizes. Current methods, such as [16, 20], calculate the number of cache lines occupied by a tile by using the following formula:

$$\begin{aligned} number.lines = \lceil \frac{tile.size.in.bytes}{line.size.in.bytes} \rceil \end{aligned}$$
(1)

However, Eq. 1 is not accurate, as different tiles (of the same size) occupy a varied number of cache lines. Let us give an example (Fig. 1). Consider a one-dimensional (1-d) array of 200 elements and non-overlapping tiles consisting of 25 elements each. Also consider that each array element is of 4 bytes and the cache line size is 64 bytes. The array elements are stored in consecutive main memory locations and thus in consecutive cache locations. Current methods assume that each tile occupies two cache lines (\(\lceil \frac{25 \times 4}{64} \rceil = 2\)) (Eq. 1); therefore, just two cache misses are assumed when loading the tile into the cache. However, as shown in Fig. 1, half of the tiles occupy two cache lines and the other half occupy three cache lines.

Fig. 1. A 1-d array is partitioned into tiles; 25-element tiles occupy a varied number of cache lines
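The varying line count can be verified with a few lines of C (a minimal sketch; the constants mirror the example above and are our illustration, not part of the model):

```c
#include <stdio.h>

/* Counts the cache lines touched by each tile in the Fig. 1 setting:
 * a 1-d array of 200 4-byte elements, 25-element tiles, 64-byte lines. */
int main(void) {
    const int elem = 4, tile = 25, line = 64, n = 200;
    for (int start = 0; start < n; start += tile) {
        int first = (start * elem) / line;              /* first line index */
        int last = ((start + tile) * elem - 1) / line;  /* last line index  */
        printf("tile [%3d..%3d): %d cache lines\n",
               start, start + tile, last - first + 1);
    }
    return 0;
}
```

Running this prints an alternating count of two and three cache lines per tile, i.e., \(a=0\) and \(a=1\) in Eq. 2 below.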

The number of cache lines occupied by a tile is given by Eq. 2, where \(a=0\) or \(a=1\), depending on the tile size and cache line size values.

$$\begin{aligned} number.cache.lines = \lceil \frac{tile.size.in.bytes}{line.size.in.bytes} \rceil + a \end{aligned}$$
(2)

There are cases where the tiles occupy a varied number of cache lines (e.g., in Fig. 1, \(a=0\) holds for some tile sizes and \(a=1\) holds for others) and cases where the tiles occupy a constant number of cache lines.

To ascertain that the tiles remain in the cache, in Subsect. 3.2 we show that the cache size allocated must equal the largest tile size value.

B. The tiles proposed by current analytical models cannot remain in the cache. Related works such as [20] assume that if the aggregated size of the tiles is smaller than the cache size, then the reused tiles will remain in the cache; however, this holds true only in specific cases because even the elements of a single tile might conflict with each other due to the cache modulo effect. An improved model is proposed in [16], where the cache associativity value is taken into account, but the tiles still cannot remain in the cache in many cases, leading to a significant number of unforeseen cache misses.

Let us showcase the above problem with another example, the well-known Matrix-Matrix Multiplication (MMM) algorithm (Fig. 2). Although different tiles of A and B are multiplied by each other, the tile of C is reused N/Tile times (data reuse), where Tile is the tile size and N is the array size in each dimension. Current analytical models, such as [16], consider this data reuse and include it in their cache miss calculations; therefore, they assume that the tile of C is loaded into the cache just once, not N/Tile times. However, the tile of C cannot remain in the cache unless all three of the following conditions hold (current analytical models satisfy only the first):

Fig. 2. An example: loop tiling for the MMM algorithm
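For reference, a tiled MMM kernel consistent with Fig. 2 can be sketched as follows (a hypothetical reconstruction, not the paper's exact code; it assumes T divides N, and the tile iterators ii, jj, kk are referenced again in Step.6):

```c
/* Tiled MMM sketch: ii, jj, kk are the tile iterators. For fixed (ii, jj),
 * the tile of C is reused across all N/T iterations of kk, while different
 * tiles of A and B are streamed through the cache. */
void mmm_tiled(int N, int T, float A[N][N], float B[N][N], float C[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)   /* C tile reused N/T times */
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++)
                        for (int k = kk; k < kk + T; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```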

Fig. 3. An illustration of how tiles might be allocated to the cache, for the example shown in Fig. 2. On the top, \((Tilei, Tilej, Tilek)=(112,32,32)\) is shown, while on the bottom \((Tilei, Tilej, Tilek)=(64,64,32)\). Each tile is shown in a different colour. (Color figure online)

  • Each tile must contain consecutive memory locations

    The sub-rows of the tile of C are not stored in consecutive main memory locations and therefore cache conflicts occur due to the cache modulo effect.

    A solution to this problem is the array copying transformation: an extra loop kernel is added before the studied loop kernel, copying the input array to a new one in a tile-wise format; thus, the tile elements are stored in consecutive main memory locations.

  • A cache way must not contain more than one tile, unless the tiles are stored in consecutive memory locations.

    Assume a 32 KB, 8-way set associative L1 data cache and \((Tilei, Tilej, Tilek)=(112,32,32)\); the sizes of the tiles of C, A and B are 14336 (\(Tilei \times Tilej \times 4\) bytes), 14336 and 4096 bytes, respectively (32768 bytes in total), and they occupy (3.5, 3.5, 1) cache ways, respectively (each way is 4096 bytes), assuming that each element is 4 bytes. Therefore, one cache way will be used to store parts of the tiles of C and A (Way-0 in Fig. 3). In this case, Way-0 will store part of the tile of C and part of the tile of A; when the next tiles of A are loaded into the cache, they will be stored into different cache lines and therefore part of the C tile will be evicted from the cache due to the cache modulo effect. This problem does not occur when \((Tilei, Tilej, Tilek)=(64,64,32)\), as the tiles occupy (4, 2, 2) cache ways, respectively (Fig. 3).

    For the remainder of this paper, we will write that a tile is written in a separate cache way if an empty cache line is always granted for each different modulo (with respect to the size of the cache) of the tile memory addresses, e.g., in Fig. 3, the tile in red is written in two 'separate' cache ways, as an empty cache line is always granted for each different cache modulo value.

  • Extra cache space must be granted for the non-reused tiles

    Even if the two aforementioned conditions hold, it is false to assume that the C tile will remain in the cache just because the aggregated size of the three tiles is smaller than the cache size. This is because no cache space is allocated for the next tiles of A and B; therefore, when the next tiles of A and B are loaded into the cache, they will evict cache lines of the tile of C (an LRU cache replacement policy is assumed). However, if \((Tilei, Tilej, Tilek)=(64,64,16)\) is selected instead of \((Tilei, Tilej, Tilek)=(64,64,32)\), then cache space for two tiles of A and B is allocated and therefore the tile of C will remain in the cache.

We evaluated the above on a PC (see Sect. 4) using the Cachegrind tool [17] (simulation); the tile sets \((Tilei, Tilej, Tilek) = (112, 32, 32), (64, 64, 32), (64, 64, 16)\) give (10.2, 9.8, 5.2) million dL1 misses and (3.1, 3.3, 7.4) Gflops, respectively (square matrices of size \(N=1344\)).

Algorithm 1

3.2 The Proposed Analytical Model

Our approach is given in Algorithm 1. The proposed method generates the iterators to be tiled, their order as well as their tile sizes, for a given cache memory.

Step.1: The iterators to which loop tiling is applicable are manually provided; not all the loops are eligible for loop tiling, mainly because of dependencies.

Step.2: The next step is to specify the iterators that loop tiling will be applied to, as well as their nesting level values. For example, in a loop kernel with three iterators (i, j, k) eligible for loop tiling, such as the original (non-tiled) version of MMM in Fig. 2, the following 15 loop tiling implementations will be generated: (i), (j), (k), (ij), (ik), (ji), (jk), (ki), (kj), (ijk), (ikj), (jik), (jki), (kij), (kji). All different orderings are processed so as not to exclude any efficient implementation; a small routine enumerating these choices is sketched below.
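The sketch (our illustration, not part of Algorithm 1) prints every non-empty ordered selection of the three iterators, 15 in total:

```c
#include <stdio.h>

/* Enumerates the 15 tiling choices of Step.2 for iterators (i, j, k):
 * every non-empty ordered selection (3 singles + 6 pairs + 6 triples). */
static void extend(const char *iters, int n, int used, char *buf, int len) {
    if (len > 0) {
        buf[len] = '\0';
        printf("(%s) ", buf);        /* print the current ordered selection */
    }
    for (int x = 0; x < n; x++)
        if (!(used & (1 << x))) {
            buf[len] = iters[x];
            extend(iters, n, used | (1 << x), buf, len + 1);
        }
}

int main(void) {
    char buf[4];
    extend("ijk", 3, 0, buf, 0);
    printf("\n");
    return 0;
}
```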

Step.3: Steps.3–6 constitute the main part of the proposed loop tiling algorithm. First, a mathematical inequality is constructed that holds the tile sizes for which the tiles fit and remain in the cache:

$$\begin{aligned} m \le \lceil \frac{Tile_1}{L_i/assoc} \rceil + \lceil \frac{Tile_{1\_next}}{L_i/assoc} \rceil + ... + \lceil \frac{Tile_n}{L_i/assoc} \rceil + \lceil \frac{Tile_{n\_next}}{L_i/assoc} \rceil \le assoc \end{aligned}$$
(3)

where \(Tile_{i}\) is the tile size in bytes, \(L_i\) is the cache size in bytes, n is the number of tiles, assoc is the \(L_i\) associativity and m defines the lower bound of the tile sizes; m equals the number of arrays in the loop kernel. The tile sizes not included in Eq. 3 are discarded, as they cannot remain in the cache.

In Eq. 3, a separate tile exists for each array reference (in the loop kernel) and thus an array might have multiple tiles. Furthermore, for each tile, we grant cache space for its next tile too (to address the third condition in Subsect. 3.1.B). Note that the overlapping tiles are merged in Step.4. All the tiles contain consecutive memory locations (first condition in Subsect. 3.1.B). The value of \(\lceil \frac{Tile_1}{L_i/assoc} \rceil \) is an integer representing the number of \(L_i\) cache ways used by \(Tile_1\) or, equivalently, the number of \(L_i\) cache lines with identical cache addresses used for \(Tile_1\). Equation 3 ensures that the array tiles directed to the same cache subregions do not conflict with each other, as the number of cache lines with identical addresses needed for the tiles is not larger than the assoc value (second condition in Subsect. 3.1.B).
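To make the check concrete, the following C sketch evaluates the way-occupancy requirement for the three MMM tile sets of Subsect. 3.1, assuming (per the third condition) that next-tile space is granted only for the non-reused tiles of A and B:

```c
#include <stdio.h>

/* Eq. 3 feasibility sketch for the MMM example (32 KB, 8-way L1; each of
 * the 8 ways holds 4096 bytes). The reused C tile needs one copy in cache;
 * the non-reused A and B tiles are granted space for their next tile too. */
static int ways(long bytes, long way_bytes) {
    return (int)((bytes + way_bytes - 1) / way_bytes);   /* ceiling */
}

int main(void) {
    const long way = (32 * 1024) / 8;   /* L1 size / associativity */
    const int assoc = 8;
    const int sets[3][3] = {{112, 32, 32}, {64, 64, 32}, {64, 64, 16}};
    for (int s = 0; s < 3; s++) {
        int Ti = sets[s][0], Tj = sets[s][1], Tk = sets[s][2];
        long c = 4L * Ti * Tj, a = 4L * Ti * Tk, b = 4L * Tk * Tj;
        int need = ways(c, way) + 2 * ways(a, way) + 2 * ways(b, way);
        printf("(%3d,%2d,%2d): %2d ways needed (assoc = %d) -> %s\n",
               Ti, Tj, Tk, need, assoc, need <= assoc ? "fits" : "conflicts");
    }
    return 0;
}
```

The sketch reports that only \((64,64,16)\) fits within the eight ways, in line with the measurements reported in Subsect. 3.1.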

A tile \(Tile_{i}\), which contains consecutive memory locations, is given by Eq. 4:

$$\begin{aligned} Tile_{i} = max.number.cache.lines \times cache.line.size \times element.size \end{aligned}$$
(4)

where cache.line.size is the size of the cache line in elements, element.size is the size of the array’s elements in bytes and the max.number.cache.lines gives the maximum number of cache lines occupied by the tile (Eq. 2).

Step.4: In this step, the overlapping tiles in Eq. 3 are merged into one, normally bigger, tile which consists of their union; if the tiles coincide, the new tile's size remains unchanged. Step.4 is needed so that there are no duplicate tiles in the cache. For the rest of this paper, we will write that two tiles overlap if their memory locations overlap.

Consider the example where the two array references \(A[i][j-2]\) and \(A[i][j+2]\) exist in the loop body and the j loop spans from 2 to N-2. By applying loop tiling to the j loop with tile size T, the first tile of the first array reference spans (0, T) and the first tile of the second array reference spans (4, T + 4). These tiles are merged and a single bigger tile of size (\(T+4\)) is created.
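In code form, the merge is a simple interval union (a minimal sketch; the interval type is our illustration):

```c
/* Step.4 merging for the example above: the two overlapping j-tiles of
 * A[i][j-2] and A[i][j+2] (tile size T) are replaced by their union.
 * Interval bounds are in array elements; a tile spans [lo, hi). */
typedef struct { int lo, hi; } interval;

static interval merge_tiles(interval a, interval b) {
    interval u;
    u.lo = a.lo < b.lo ? a.lo : b.lo;
    u.hi = a.hi > b.hi ? a.hi : b.hi;
    return u;   /* for a = (0, T) and b = (4, T+4): u = (0, T+4) */
}
```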

Step.5 in Algorithm 1: In Step.5, all the remaining tiles that do not contain consecutive memory locations are either discarded, as they cannot remain in the cache, or the array copying transformation is applied.

It is common practice to apply the array copying transformation before loop tiling so that all the tiles contain consecutive memory locations. An extra loop kernel is added before the studied loop kernel, copying the input array to a new one in a tile-wise format. This adds extra overhead, which is why it is performance-efficient only in a limited number of loop kernels.
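A minimal sketch of such a copy for a 2-D array is given below (assuming, for simplicity, that the tile size T divides N):

```c
/* Array copying transformation sketch: B is copied tile-wise into Bt
 * before the tiled kernel runs, so each T x T tile of Bt occupies
 * consecutive main memory locations. */
void copy_tilewise(int N, int T, float B[N][N], float *Bt) {
    long p = 0;
    for (int kk = 0; kk < N; kk += T)        /* tile row    */
        for (int jj = 0; jj < N; jj += T)    /* tile column */
            for (int k = kk; k < kk + T; k++)
                for (int j = jj; j < jj + T; j++)
                    Bt[p++] = B[k][j];
}
```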

Step.6 in Algorithm 1: In Step.6, the number of cache misses is approximated theoretically, considering the cache hardware parameters, the array memory access patterns of each loop kernel and the problem’s input size. To do so, we calculate how many times the selected tiles (whose dimensions and sizes are known) are loaded/stored from/to the cache.

We are capable of approximating the number of cache misses because the number of unforeseen misses has been minimised (the reused tiles remain in the cache). This is because only the proposed tiles reside in the cache, the tiles are written in consecutive memory locations, an empty cache line is always granted for each different modulo, and cache space is granted for two consecutive tiles instead of one (when needed). Additionally, we refer to CPUs with an instruction cache; in this case, the program code typically fits in the L1 instruction cache and thus it is assumed that the shared or unified cache (if any) is dominated by data.

The number of cache misses is estimated by Eq. 5.

$$\begin{aligned} Num\_Cache\_Misses = \sum \nolimits _{i=1}^{i=sizeof(Tiles.List)} (repetition\_i \times cache.lines\_i) \end{aligned}$$
(5)

where \(repetition\_i\) gives how many times the array of this tile is loaded/stored from/to this cache memory (given by Eq. 7), \(cache.lines\_i\) is the number of cache lines accessed when this tile traverses the array (given by Eq. 6) and Tiles.List contains all the tiles that contribute to Eq. 5.

The Tiles.List is initialised with all the tiles specified in Eq. 3, after the merging process (Step.3b in Algorithm 1); the 'next' tiles are not included (the only reason they exist in Eq. 3 is to grant extra cache space). There are cases where not all the tiles contribute to Eq. 5, which is why some tiles might be deleted from the Tiles.List. This happens when an array has multiple array references (in the loop body) and therefore multiple tiles. Different tiles of the same array might access memory locations that have been accessed just before, so the data already reside in the cache; in this case, accessing the tile leads to cache hits, not misses.

The cache.lines value in Eq. 5 is given by

$$\begin{aligned} cache.lines = {\left\{ \begin{array}{ll} \sum \nolimits _{t=1}^{t=tiles} Ty \times \left( \lceil \frac{t \times Tx}{line} \rceil - \lfloor \frac{(t-1) \times Tx}{line} \rfloor \right) &{} \text {no array copying} \\ \sum \nolimits _{t=1}^{t=tiles} \left( \lceil \frac{t \times Tx \times Ty}{line} \rceil - \lfloor \frac{(t-1) \times Tx \times Ty}{line} \rfloor \right) &{} \text {array copying applied} \end{array}\right. } \end{aligned}$$
(6)

where \((Tx, Ty)\) are the tile sizes of the iterators in the (x, y) dimensions of the array's subscript, respectively, \((N, M)\) are the corresponding iterators' upper bounds (for 1-D arrays \(Ty=1\)), line is the cache line size in elements and tiles is the total number of the array's tiles, where \(tiles=N/Ty \times M/Tx\) for 2-D arrays and \(tiles=M/Tx\) for 1-D arrays.

Let us give an example for the first branch of Eq. 6: consider a 2-D floating point array and a tile of size \((10 \times 10)\) traversing the array along the x-axis, with \(line=16\) array elements. The first tile occupies \(10 \times (\lceil \frac{10}{16} \rceil - \lfloor \frac{0}{16} \rfloor )=10\) cache lines, while the second tile occupies \(10 \times (\lceil \frac{20}{16} \rceil - \lfloor \frac{10}{16} \rfloor )=20\) cache lines. Although the array's tiles are of equal size, they occupy a different number of cache lines. The second branch of Eq. 6 gives the number of cache lines occupied in the case where array copying has been applied and the array is therefore written tile-wise in memory; in this case, the first tile lies between (0, 100), the second between (100, 200), etc.
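The worked example can be checked with a short C program (a minimal sketch using integer forms of the ceiling and floor terms):

```c
#include <stdio.h>

/* Reproduces the worked example of the first branch of Eq. 6: a 10x10
 * tile traversing a 2-D array along the x-axis, line = 16 elements. */
int main(void) {
    const int Tx = 10, Ty = 10, line = 16;
    for (int t = 1; t <= 4; t++) {                  /* first four tiles */
        int last = (t * Tx + line - 1) / line;      /* ceil(t*Tx/line)  */
        int first = ((t - 1) * Tx) / line;          /* floor((t-1)*Tx/line) */
        printf("tile %d: %d cache lines\n", t, Ty * (last - first));
    }
    return 0;
}
```

The program prints 10, 20, 10, 20 cache lines for the first four tiles, matching the calculation above.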

The repetition value in Eq. 5 is given by

$$\begin{aligned} repetition = \prod \nolimits _{j=1}^{j=U} \frac{(up_j-low_j)}{T_j} \times \prod _{k=1}^{k=Q} \frac{(up_k-low_k)}{T_k} \end{aligned}$$
(7)

where U is the number of new/extra iterators (generated by loop tiling) that a) do not exist in the corresponding array's subscript and b) lie above the iterators of the corresponding array, e.g., regarding the B tile in Fig. 2, this is the ii iterator. Q is the number of new/extra iterators that a) do not exist in the array's subscript and b) lie between the iterators of the array, e.g., regarding the A tile in Fig. 2, this is the jj iterator. The ii iterator forces the whole array B to be loaded N/Tile times, while the jj iterator forces the whole array A to be loaded N/Tile times.

4 Experimental Results

The experimental results are obtained on a host PC (Intel i7-4790 CPU at 3.60 GHz, Ubuntu 18.04) and the codes are compiled using the gcc 7.5.0 compiler.

The benchmarks used in this study consist of six well-known memory-bound loop kernels taken from the PolyBench/C 4.1 suite [18]: gemm, mvm, gemver, doitgen, bicg and gesumv. The input size of the loop kernels is specified with the letter 'N' (square matrices of size \(N \times N\) are used).

Table 1. The error in cache misses is measured for five different tile sizes using Eq. 8 and the maximum value is shown.

4.1 Validation of the Proposed Methodology

In this sub-section we showcase that i) the tiles generated by the proposed methodology fit and remain in the cache and ii) the proposed equations (Step.6) can accurately estimate the number of cache misses. To validate the proposed method, we have applied the proposed methodology to L1 data cache (dL1) (32 KB, 8-way) and L3 cache (8 MB, 16-way). The tile sizes and the iterators to be tiled are given by Algorithm 1.

The number of cache misses is measured for five tile sizes and the maximum error value is calculated (Eq. 8) using i) the Cachegrind tool [17] (simulation) and ii) the Perf tool with the 'l1d.replacement', 'LLC-load-misses' and 'LLC-store-misses' hardware counters.

$$\begin{aligned} error \% = \frac{\mid cache.misses.measured - Eq.\,5.misses \mid }{Eq.\,5.misses} \times 100 \end{aligned}$$
(8)

Cachegrind and Perf give different cache miss values because Perf measures the cache misses of all running processes, not just the process of interest.

In Table 1, we compare the dL1 and L3 misses as extracted from Eq. 5 against the measurements from Cachegrind and Perf. As Table 1 indicates, the proposed equations provide roughly the same number of cache misses as Cachegrind. This means that, first, the proposed tiles fit and remain in the cache and, second, the proposed equations give a very good approximation of the number of misses.

Regarding dL1, the error values are higher (about \(3\%\)) when using the dL1 hardware counter (Table 1), as other processes load/store data from/to this memory too. Note that Table 1 shows only the tile sizes that need roughly the size of seven out of the eight cache ways, or less; the tiles that use more cache space give a much higher error value, up to \(20\%\). Given that this inconsistency holds only for the Perf measurements and not for Cachegrind, it is valid to assume that it is caused by other processes using the dL1. In this case, each dL1 access of another process leads to an unforeseen miss.

For the same reason, on the right of Table 1, we show the tile sizes that need roughly the size of 9 out of 16 L3 cache ways, or less. mvm, doitgen, bicg, gesumv and gemver give a small L3 error value as their arrays fit and remain in L3 even without using loop tiling. This is not the case for gemm and this is why the error value in gemm is higher.

Table 2. Comparison over gcc on Intel i7-4790.

4.2 Evaluation over Gcc Compiler and Pluto

In all cases, the six studied loop kernels are compiled using the 'gcc -O2 -floop-block -floop-strip-mine' command and the generated binaries are those against which the proposed methodology is compared. The '-floop-block -floop-strip-mine' options enable gcc to apply the loop tiling transformation. The C codes of the proposed method are compiled using the 'gcc -O2' command.

On the left of Table 2, the proposed methodology has been applied to dL1 only. The proposed method provides significant dL1 miss gains in all cases, but performance gains only for gemm, doitgen and gemver. Reducing the number of dL1 misses does not always align with performance; here, the selected tile sizes for mvm, bicg and gesumv (which minimize dL1 misses) slightly increase the number of L3 misses, which is why performance is degraded. Note that the dL1 miss gain is higher in gemm and doitgen compared to the other loop kernels, as all their tiles achieve data reuse; the tiles remain in L1 and are loaded many times from L1, greatly reducing the number of L1 misses.

It is important to note that the baseline binary code that we compare our method against for mvm, bicg and gesumv in Table 2 does not include loop tiling (although the loop tiling option has been enabled, gcc does not apply it to gesumv, bicg and mvm, considering it not performance-efficient).

On the right of Table 2, the proposed methodology has been applied first to dL1 and then to L3. Applying loop tiling for mvm, gemver, bicg and gesumv just for the L3 cache is pointless, as their arrays fit and remain in the cache even for very large input sizes and, as a consequence, the number of L3 misses cannot be reduced. However, applying loop tiling for L3 to the implementations shown on the left of Table 2 is beneficial, as these implementations give a higher number of L3 misses than the non-tiled implementations. Regarding doitgen, applying loop tiling to L3 cannot give any gain, as the arrays fit in the cache. The '*' in Table 2 indicates that these iterators are interchanged.

The proposed methodology has also been evaluated against Pluto [3] (version 0.11.4). For a fair comparison, only the loop tiling phase of Pluto is activated. Pluto applies square tiles of size 32 in all cases, which is why gcc performs better. Pluto is a powerful tool that is not limited to loop tiling; if we enable all its phases, it provides higher speedup values than gcc.

5 Conclusions and Future Work

In this article, we first demonstrate two important inefficiencies of current analytical loop tiling models and provide insight on how current models can overcome these inefficiencies. Second, we propose a new model in which the number of cache misses is accurately estimated for each generated tile size. This is achieved by leveraging the target memory hardware architecture and the data access patterns.

As far as our future work is concerned, the first step includes the validation and evaluation of the proposed method to other CPUs. Second, we plan to work towards correlating the number of cache misses with execution time.