
1 Introduction

Tiling is an important iteration reordering transformation for both improving data locality and extracting loop parallelism. Tiling for locality groups loop statement instances in the iteration space into smaller blocks (tiles), allowing reuse when a block fits in local memory. On the basis of a valid schedule of tiles, coarse-grained parallel code can be generated.

To the best of our knowledge, well-known tiling techniques are based on linear or affine transformations of program loop nests [6, 9, 10, 13, 20]. In paper [5], we describe the limitations of affine transformations and present how the free schedule of loop nest statement instances can be formed by means of the transitive closure of program dependence graphs. In this paper, we demonstrate how the approach presented in [5] can be adapted to form the free schedule of valid tiles. To generate both valid tiles and their free schedule, we apply the transitive closure of dependence graphs. The proposed approach allows generation of parallel tiled code even when there exists no affine transformation producing a fully permutable loop nest. The approach results from a combination of the polyhedral model and the iteration space slicing framework.

2 Background

The considered approach uses the dependence analysis proposed by Pugh and Wonnacott [16], where dependences are represented by dependence relations whose constraints are expressed in Presburger arithmetic.

A dependence relation is a tuple relation of the form [input list]\(\rightarrow \)[output list]: formula, where input list and output list are lists of variables and/or expressions used to describe the input and output tuples, and formula describes the constraints imposed upon input list and output list; it is a Presburger formula built of algebraic constraints combined with logical and existential operators. A dependence relation is a mathematical representation of a data dependence graph whose vertices correspond to loop statement instances and whose edges connect dependent instances. The input and output tuples of a relation represent dependence sources and destinations, respectively; the relation constraints identify the instances that are dependent.

Standard operations on relations and sets are used, such as intersection (\(\cap \)), union (\(\cup \)), difference (\(-\)), domain (dom R), range (ran R), and relation application (\(\textit{S}^\prime = \textit{R}(\textit{S}): \textit{e}^\prime \in \textit{S}^\prime \) iff there exists e s.t. \(\textit{e} \rightarrow \textit{e}^\prime \in \textit{R},\textit{e} \in \textit{S}\)). A detailed description of these operations is presented in papers [11, 16].

The positive transitive closure for a given relation R, \(R^+\), is defined as follows [11]: \(R^+=\{e\rightarrow e':\ e\rightarrow e'\in R \vee \exists e''\ s.t.\ e\rightarrow e'' \in R \wedge e''\rightarrow e'\in R^+\}\).

It describes which vertices \(e'\) in a dependence graph (represented by relation R) are connected directly or transitively with vertex e.

Transitive closure, \(R^*\), is defined as follows [12]: \(R^*=R^+\cup I\), where I is the identity relation. It describes the same connections in a dependence graph (represented by R) that \(R^+\) does, plus the connection of each vertex with itself.

The composition of given relations \(R_1 = \{x_1 \rightarrow y_1 | f_1(x_1, y_1)\}\) and \(R_2 = \{x_2 \rightarrow y_2 | f_2(x_2, y_2)\}\), is defined as follows [11]: \(R_1 \circ R_2 = \{ x \rightarrow y | \exists z\ s.t.\ f_1(z,y) \wedge f_2(x,z)\}\).
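The operations above can be illustrated on finite relations materialized as sets of pairs. The sketch below is our own illustrative code under that assumption; real tools (ISL, Omega) perform the same operations symbolically on parametric Presburger formulas, and the function names here are ours, not part of any tool's API.

```python
def compose(r1, r2):
    """R1 o R2 = {x -> y : exists z s.t. (z, y) in R1 and (x, z) in R2}."""
    return {(x, y) for (x, z2) in r2 for (z1, y) in r1 if z1 == z2}

def positive_transitive_closure(r):
    """R+ : all pairs connected by a path of length >= 1."""
    closure = set(r)
    while True:
        # grow every known path by one more dependence edge
        new_pairs = compose(closure, r) | closure
        if new_pairs == closure:
            return closure
        closure = new_pairs

def apply_relation(r, s):
    """R(S) = {e' : exists e in S with (e, e') in R}."""
    return {y for (x, y) in r if x in s}

# toy dependence graph: 0 -> 1 -> 2
R = {(0, 1), (1, 2)}
print(positive_transitive_closure(R))  # {(0, 1), (1, 2), (0, 2)}
print(apply_relation(R, {0}))          # {1}
```

Note that, following the paper's definition, `compose(r1, r2)` applies `r2` first and `r1` second.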

3 Finding Free Scheduling

The algorithm presented in our paper [5] allows us to generate fine-grained parallel code based on the free schedule representing time partitions: all statement instances of a time partition can be executed in parallel, while partitions are enumerated sequentially. The free schedule function is defined as follows.

Definition 1

[7, 8]. The free schedule is the function that assigns a discrete execution time to each loop nest statement instance as soon as its operands are available, that is, it is the mapping \(\sigma :\textit{LD}\rightarrow \mathbb {Z}\) such that

$$\begin{aligned} \sigma (p) = \left\{ \begin{array}{l} 0,\ \text {if there is no}\ p_{1}\in LD\ \text {s.t.}\ p_{1}\rightarrow p; \\ 1+\max (\sigma (p_{1}),\sigma (p_{2}),\ldots ,\sigma (p_{n})),\ \text {if}\ p,p_{1},p_{2},\ldots ,p_{n}\in LD\ \text {and} \\ \quad p_{1}\rightarrow p,\ p_{2}\rightarrow p,\ \ldots ,\ p_{n}\rightarrow p, \end{array}\right. \end{aligned}$$

where \(p, p_{1},p_{2},...,p_{n}\) are loop nest statement instances, LD is the loop nest domain, \(p_{1}\rightarrow p, p_{2}\rightarrow p, ..., p_{n}\rightarrow p\) mean that the pairs \(p_{1}\) and p, \(p_{2}\) and p, ..., \(p_{n}\) and p are dependent, p represents the destination, \(p_{1},p_{2},...,p_{n}\) represent the sources of the dependences, and n is the number of operands of statement instance p (the number of dependences whose destination is p).

The free schedule is the fastest legal schedule [8]. In paper [5], we presented fine-grained parallelism extraction based on the power k of relation R.

The idea of the algorithm is the following [5]. Given relations \(R_{1}, R_{2}, ..., R_{m}\), representing all dependences in a loop nest, we first calculate \(R = {\bigcup \limits _{i=1}^m} R_{i}\) and then \(R^k\), where \(R^k = \underbrace{R \circ R \circ \ldots \circ R}_{k}\) and “\(\circ \)” is the composition operation. Techniques for calculating the power k of relation R are presented in publications [12, 17] and are out of the scope of this paper. Let us only note that, given the transitive closure \(R^+\), we can easily convert it to the power k of R, \(R^k\), and vice versa; for details see [17].

Given set UDS (Ultimate Dependence Sources) comprising all loop nest statement instances that are ready for execution at time \(k=0\), each vertex represented by the set \(S_k = R^k (UDS) - R^+ \circ R^k(UDS)\) is connected in the dependence graph, defined by relation R, with some vertex(ices) represented by set UDS by a path of length k. Hence at time k, all statement instances belonging to set \(S_k\) can be scheduled for execution, and it is guaranteed that k is as small as possible.
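On a finite dependence graph, the formula \(S_k = R^k(UDS) - R^+ \circ R^k(UDS)\) can be evaluated directly. The following sketch is ours and purely illustrative (TRACO evaluates the same formula symbolically on parametric relations via ISL); it computes all time partitions of the free schedule for a graph given as a set of (source, destination) pairs.

```python
def free_schedule(r, domain):
    """Time partitions S_0, S_1, ... where S_k = R^k(UDS) - R+(R^k(UDS))."""
    succ = {}
    for (x, y) in r:
        succ.setdefault(x, set()).add(y)

    def image(s):                                    # R(S)
        return {y for v in s for y in succ.get(v, set())}

    def r_plus(s):                                   # R+(S)
        seen, frontier = set(), set(s)
        while frontier:
            frontier = image(frontier) - seen
            seen |= frontier
        return seen

    uds = domain - {y for (_, y) in r}               # no incoming dependences
    partitions, rk = [], uds                         # rk = R^k(UDS), k = 0
    while rk:
        partitions.append(rk - r_plus(rk))           # S_k
        rk = image(rk)                               # advance k by one
    return partitions

# dependence chain 0 -> 1 -> 2 plus an independent instance 3
print(free_schedule({(0, 1), (1, 2)}, {0, 1, 2, 3}))
# [{0, 3}, {1}, {2}]
```

Instance 3 has no operands, so it is scheduled together with the ultimate dependence source 0 at time \(k=0\), exactly as Definition 1 prescribes.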

4 Loop Nest Tiling Based on the Transitive Closure of Dependence Graphs

In this paper, to generate valid tiled code, we apply the approach presented in paper [4], which is based on the transitive closure of dependence graphs. Below, we briefly present the steps of that approach.

First, we form set \(\textit{TILE}({\varvec{II}}, {\varvec{B}})\), including the iterations belonging to a parametric tile, as follows: \(\textit{TILE}({\varvec{II}}, {\varvec{B}}) = \{[{\varvec{I}}] | {\varvec{B}}\text {*}{\varvec{II}} + {\varvec{LB}} \le {\varvec{I}} \le \min ( {\varvec{B}}\text {*}({\varvec{II}} +\mathbf 1 ) + {\varvec{LB}} - \mathbf 1 , {\varvec{UB}}) \;\text {AND}\; {\varvec{II}} \ge 0\}\), where vectors LB and UB include the lower and upper loop index bounds of the original loop nest, respectively; diagonal matrix B defines the size of a rectangular original tile; elements of vector I represent the original loop nest iterations contained in the tile whose identifier is II; 1 is the vector all of whose elements equal 1; here and further on, the notation \(x \ge (\le ) y\), where x, y are two vectors in \(\mathbb {Z}^n\), denotes the component-wise inequality, that is, \(x \ge (\le ) y\Longleftrightarrow x_i \ge (\le ) y_{i}\), i = 1, 2, ..., n.
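For a one-dimensional iteration space, the parametric-tile definition reduces to a simple interval computation. The sketch below is an illustration of this definition under that simplification; the names are ours and not part of any tool's API.

```python
def tile(ii, b, lb, ub):
    """Iterations of TILE(ii, B) for tile size b on a 1-D space:
    B*II + LB <= I <= min(B*(II + 1) + LB - 1, UB)."""
    lo = b * ii + lb
    hi = min(b * (ii + 1) + lb - 1, ub)
    return list(range(lo, hi + 1))

# loop for i = 1 .. 10 tiled with B = 4:
# tile 0 covers 1..4, tile 1 covers 5..8, tile 2 covers 9..10 (partial)
print([tile(ii, 4, 1, 10) for ii in range(3)])
```

The min with UB clips the last, partial tile so that no iteration outside the original loop bounds is generated.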

Next, we build sets TILE_LT and TILE_GT that are the unions of all the tiles whose identifiers are lexicographically less and greater than that of TILE(II, B), respectively:

TILE_LT = {[I] | exists \({\varvec{II}}^\prime \) s.t. \({\varvec{II}}^\prime \prec {\varvec{II}}\) AND \({\varvec{II}} \ge 0\) AND \({\varvec{B}}\text {*}{\varvec{II}}+{\varvec{LB}} \le {\varvec{UB}}\) AND \({\varvec{II}}^\prime \ge 0\) AND \({\varvec{B}}\text {*}{\varvec{II}}^\prime +{\varvec{LB}} \le {\varvec{UB}}\) AND I in \(\textit{TILE}({\varvec{II}}^\prime , {\varvec{B}})\)},

TILE_GT = {[I] | exists \({\varvec{II}}^\prime \) s.t. \({\varvec{II}}^\prime \succ {\varvec{II}}\) AND \({\varvec{II}} \ge 0\) AND \({\varvec{B}}\text {*}{\varvec{II}}+{\varvec{LB}} \le {\varvec{UB}}\) AND \({\varvec{II}}^\prime \ge 0\) AND \({\varvec{B}}\text {*}{\varvec{II}}^\prime +{\varvec{LB}} \le {\varvec{UB}}\) AND I in \(\textit{TILE}({\varvec{II}}^\prime , {\varvec{B}})\)},

where “\(\prec \)” and “\(\succ \)” (here and further on) denote the lexicographical relation operators for two vectors. Then, we calculate set

$$\begin{aligned} \textit{TILE}\_\textit{ITR} = \textit{TILE} - \textit{R}{^+} (\textit{TILE}\_\textit{GT}), \end{aligned}$$

which does not include any invalid dependence target, i.e., any target of a dependence whose source is within set TILE_GT. The following set

$$\begin{aligned} \textit{TVLD}\_\textit{LT} = (\textit{R}{^+}(\textit{TILE}\_\textit{ITR}) \cap \textit{TILE}\_\textit{LT}) - {R^+}(\textit{TILE}\_\textit{GT}) \end{aligned}$$

includes all the iterations that (i) belong to the tiles whose identifiers are lexicographically less than that of set TILE_ITR; (ii) are targets of dependences whose sources are contained in set TILE_ITR; and (iii) are not targets of any dependence whose source belongs to set TILE_GT. Target tiles are defined by the set TILE_VLD = TILE_ITR \(\cup \) TVLD_LT.

Lastly, we form set TILE_VLD_EXT by inserting (i) into the first positions of the tuple of set TILE_VLD the elements of vector II: \(ii_1, ii_2, ..., ii_d\); and (ii) into the constraints of set TILE_VLD the constraints defining tile identifiers, \({\varvec{II}} \ge 0\) and \({\varvec{B}}\text {*}{\varvec{II}}+{\varvec{LB}} \le {\varvec{UB}}\). Target code is generated by applying any code generator allowing for scanning elements of set TILE_VLD_EXT in lexicographic order, for example, CLooG [1].
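The correction formulas above can be checked on a small finite example. The following sketch is our own illustration on explicitly materialized sets (ISL performs the same operations on parametric sets); the dependence \((i, j) \rightarrow (i+1, j-1)\) crosses tiles backwards in j, so the original rectangular tiles are invalid and an iteration must migrate between tiles.

```python
def r_plus_image(r, s):
    """R+(S): all iterations transitively reachable from set S via r."""
    seen, frontier = set(), set(s)
    while frontier:
        frontier = {y for (x, y) in r if x in frontier} - seen
        seen |= frontier
    return seen

def valid_tile(tile_cur, tile_lt, tile_gt, r):
    """TILE_VLD = TILE_ITR | TVLD_LT for one tile."""
    tile_itr = tile_cur - r_plus_image(r, tile_gt)
    tvld_lt = (r_plus_image(r, tile_itr) & tile_lt) - r_plus_image(r, tile_gt)
    return tile_itr | tvld_lt

def deps(n):
    """Dependences (i, j) -> (i+1, j-1) on an n x n iteration space."""
    return {((i, j), (i + 1, j - 1)) for i in range(n - 1) for j in range(1, n)}

def tile_iters(ii, jj, b, n):
    """Iterations of the rectangular tile with identifier (ii, jj)."""
    return {(i, j) for i in range(b * ii, min(b * (ii + 1), n))
                   for j in range(b * jj, min(b * (jj + 1), n))}

# 4 x 4 space, 2 x 2 tiles; for tile (0,0): TILE_LT is empty and
# TILE_GT is the union of tiles (0,1), (1,0), (1,1)
R = deps(4)
t00, t01 = tile_iters(0, 0, 2, 4), tile_iters(0, 1, 2, 4)
t1x = tile_iters(1, 0, 2, 4) | tile_iters(1, 1, 2, 4)
v00 = valid_tile(t00, set(), t01 | t1x, R)
v01 = valid_tile(t01, t00, t1x, R)
print(sorted(v00))  # iteration (1, 1) is removed from tile (0, 0) ...
print(sorted(v01))  # ... and migrates into target tile (0, 1) via TVLD_LT
```

Iteration (1, 1) is a target of the dependence from (0, 2), which lies in a lexicographically greater tile, so keeping it in tile (0, 0) would violate the dependence; the correction moves it to the target tile of tile (0, 1).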

5 Free Scheduling for Tiles

The algorithm presented in this paper combines the approaches presented in the two previous sections. First, we generate tiled code as described in Sect. 4; then we find the free schedule for the tiles of that code. For this purpose, we first form relation \(R\_TILE\), which describes dependences among tiles, as follows

R_TILE := {[II] \(\rightarrow \) [JJ] : exist I, J s.t. (II, I) in \(\textit{TILE}\_\textit{VLD}\_\textit{EXT}({\varvec{II}})\) AND (JJ, J) in \(\textit{TILE}\_\textit{VLD}\_\textit{EXT}({\varvec{JJ}})\) AND J in R(I)},

where II, JJ are the vectors representing tile identifiers; vectors I, J comprise iterations belonging to tiles whose identifiers are II, JJ, respectively.

The following step is to calculate set UDS, including the tile identifiers which represent tile ultimate dependence sources and/or independent tiles, as follows: UDS = II_SET \(-\) range(\(R\_TILE\)), where set \(\textit{II}\_\textit{SET} =\{[{\varvec{II}}] | {\varvec{II}} \ge 0\;\text {and}\;{\varvec{B}}\text {*}{\varvec{II}}+{\varvec{LB}} \le {\varvec{UB}}\}\) represents all tile identifiers.

Now, we apply the algorithm presented in paper [5] to form the free schedule for tiles of the tiled code. For this purpose, we calculate the transitive closure and the power k of relation \(R\_TILE\) and next calculate set \(S_k\), representing the free schedule, as follows: \(S_k = R\_TILE^k(UDS) - (R\_TILE^+\circ \textit{R}\_\textit{TILE}^k (\textit{UDS}))\). Finally, we extend the tuple of set \(S_k\) with variable k and with the variables representing statement instances of a parametric target tile (together with the corresponding constraints), and generate code applying any code generator, for example CLooG, to scan iterations within set \(S_k\) in lexicographic order. Algorithm 1 presents the idea discussed above in a formal way.

[Algorithm 1]
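The whole pipeline of Algorithm 1 can be sketched end-to-end on finite sets. The code below is our illustrative approximation, not the paper's implementation: it assumes uniform forward dependences, so the original rectangular tiles are already valid and the correction of Sect. 4 is a no-op; TRACO performs all of these steps symbolically with ISL.

```python
def tile_of(point, b):
    """Identifier of the rectangular b x ... x b tile containing point."""
    return tuple(c // b for c in point)

def build_r_tile(r, b):
    """R_TILE: inter-tile dependences induced by iteration relation r."""
    return {(tile_of(i, b), tile_of(j, b))
            for (i, j) in r if tile_of(i, b) != tile_of(j, b)}

def free_schedule_tiles(r_tile, all_tiles):
    """Time partitions S_0, S_1, ... of tiles (steps 3-4 of Algorithm 1)."""
    succ = {}
    for (x, y) in r_tile:
        succ.setdefault(x, set()).add(y)

    def r_plus(s):                                   # R_TILE+(S)
        seen, frontier = set(), set(s)
        while frontier:
            frontier = {y for v in frontier for y in succ.get(v, set())} - seen
            seen |= frontier
        return seen

    uds = all_tiles - {y for (_, y) in r_tile}       # UDS = II_SET - ran R_TILE
    parts, rk = [], uds                              # rk = R_TILE^k(UDS)
    while rk:
        parts.append(rk - r_plus(rk))                # S_k
        rk = {y for v in rk for y in succ.get(v, set())}
    return parts

# 4 x 4 iteration space with uniform forward dependences, 2 x 2 tiles
n, b = 4, 2
R = ({((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)}
     | {((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)})
r_tile = build_r_tile(R, b)
tiles = {(ii, jj) for ii in range(n // b) for jj in range(n // b)}
print(free_schedule_tiles(r_tile, tiles))
# [{(0, 0)}, {(0, 1), (1, 0)}, {(1, 1)}] -- a wavefront over tiles
```

The resulting partitions form the familiar wavefront: at time k, all tiles on the k-th anti-diagonal can be executed in parallel.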

6 Illustrative Example

In this section, we illustrate the steps of Algorithm 1 by means of the following loop nest:

[Listing: illustrative loop nest]

We use the ISL library to carry out the operations on relations and sets required by the presented algorithm. The dependence relation returned by Petit, the Omega project dependence analyzer, is the following

[Listing: dependence relation R returned by Petit]

where here and further on “6” stands for the statement identifier, represented via the corresponding line number in the original loop nest.

The algorithm presented in paper [4] returns the following set \( TILE\_VLD\_EXT\), representing both tile identifiers and statement instances within each target tile.

[Listing: set TILE_VLD_EXT]

Using relation R and set \( TILE\_VLD\_EXT\), we form relation \(R\_TILE\), which is of the form below.

[Listing: relation R_TILE]

Set UDS is the following \(\{[0,\textit{ jj}, 6]: \textit{jj} \le 2\) and \(\textit{jj} \ge 0\}\).

Using the appropriate functions of the ISL library to calculate relations \(R\_TILE^k\) and \(R\_TILE^+\), we calculate set \(S_k\) according to the formula in step 4 of Algorithm 1 and extend set \(S_k\) as presented in step 5 of Algorithm 1, to get:

[Listing: extended set S_k]

Finally, we apply the CLooG code generator to set \(S_k\) and postprocess the code returned by CLooG to yield the following OpenMP C code,

[Listing: generated OpenMP C code]

where line 1 contains the serial for loop enumerating time partitions; line 2 represents the two OpenMP directives (parallel for) pointing out that the iterations of the for loop in line 3 can be executed in parallel; the for loops in lines 1 and 3 enumerate tile identifiers, whereas the for loops in lines 4 and 5 scan iterations within a tile. Figure 1 presents the original tiles, while Fig. 2 shows the target tiles returned by the algorithm presented in paper [4] (depicted by dashed lines) and the three time partitions (k = 0, 1, 2) for the illustrative example.

Fig. 1. Original tiles

Fig. 2. Target tiles and time partitions

7 Experimental Study

The presented algorithm has been implemented in the optimizing compiler TRACO, publicly available at http://traco.sourceforge.net. For calculating \(R^+\) and \(R^k\), TRACO uses the corresponding functions of the ISL library [17]. To evaluate the effectiveness of the proposed approach, we have experimented with the NAS Parallel Benchmarks 3.3 (NPB) [14].

Of the 431 loops of the NAS benchmark suite, Petit is able to analyse 257 loops, and dependences are present in 134 of them (the remaining 123 loops do not expose any dependence). For these 134 loop nests, ISL is able to calculate \(R\_TILE^k\) for 58 of them, and accordingly TRACO is able to generate parallel tiled code for those programs. This limitation is not a limitation of the algorithm; it is a limitation of the corresponding ISL function.

To assess the performance of the parallel tiled code produced with TRACO, the following criteria were taken into account when choosing NAS programs: (i) a loop nest must be computationally intensive (many NAS benchmarks have constant upper bounds of loop indices, hence their parallelization is not justified); (ii) the structures of the chosen loops must be different (there are many loops of a similar structure).

Applying these criteria, we have selected the following five NAS loops: BT_rhs_1 (Block Tridiagonal Benchmark), FT_auxfnct.f2p_2 (Fast Fourier Transform Benchmark), UA_diffuse_5, UA_setup_16 and UA_transfer_4 (Unstructured Adaptive Benchmark).

To carry out the experiments, we used a computer with an Intel i5-4670 3.40 GHz processor (Haswell, 2013), 6 MB cache, and 8 GB RAM. Source and target codes of the examined programs are available at http://sourceforge.net/p/issf/code-0/HEAD/tree/trunk/examples/fstile/.

Table 1. Speed-up of parallel tiled loop nests for 4 CPU cores.

Table 1 presents execution times and speed-ups for the studied loop nests. Speed-up is the ratio of sequential to parallel program execution time, i.e., S = T(1)/T(P), where T(P) is the parallel program execution time on P processors; speed-ups were computed against the execution time of the serial original code. Experiments were carried out for 4 CPU cores. Analysing the data in Table 1, we may conclude that positive speed-up is achieved for all parallel tiled loops. It depends on the problem size, defined by loop index upper bounds, and on the tile size. It is worth noting that for the FT_auxfnct.f2p_2 and UA_transfer_4 programs, super-linear speed-up is achieved, i.e., the speed-up is greater than 4, the number of CPU cores used. This phenomenon can be explained by the fact that the data size required by the original program is greater than the cache size when executed sequentially, but fits in the available caches when executed in parallel, i.e., program locality increases.

8 Related Work

There has been a considerable amount of research into tiling, demonstrating how to aggregate a set of loop iterations into tiles, with each tile acting as an atomic macro statement, starting with the pioneering paper [10] and continuing with papers presenting advanced techniques [6, 9, 19].

One of the most advanced reordering transformation frameworks is based on the polyhedral model. Let us recall that “Restructuring programs using the polyhedral model is a three steps framework. First, the Program Analysis phase aims at translating high level codes to their polyhedral representation and to provide data dependence analysis based on this representation. Second, some optimizing or parallelizing algorithm uses the analysis to restructure the programs in the polyhedral model. This is the Program Transformation step. Lastly, the Code Generation step returns back from the polyhedral representation to a high level program” [3].

All three steps above are present in the approach presented in this paper. However, there is the following difference in step 2: in the polyhedral model, “a (sequence of) program transformation(s) is represented by a set of affine functions, one for each statement” [3], while the presented approach does not find or use any affine function. It applies the transitive closure of a program dependence graph to specific subspaces of the source loop iteration space. From this point of view, the program transformation step is rather within the Iteration Space Slicing framework introduced by Pugh and Rosser [15], where the key step is calculating the transitive closure of a program dependence graph.

Papers [10, 18] are seminal works presenting the theory of tiling techniques based on affine transformations. They present techniques consisting of two steps: first transform the original loop nest into a fully permutable loop nest, then transform the fully permutable loop nest into tiled code. Loop nests are fully permutable if their loops can be permuted arbitrarily without altering the semantics of the source program. If a loop nest is fully permutable, it is sufficient to apply a tiling transformation to it [18].

Papers [2, 5] demonstrate how coarse- and fine-grained parallelism can be extracted by applying different Iteration Space Slicing algorithms; however, they do not consider any tiling transformation.

Wonnacott and Strout review implemented and proposed techniques for tiling dense array codes in an attempt to determine whether the techniques provide scalability. They write [19]: “No implementation was ever released for iteration space slicing”. This allows us to state that TRACO, which implements the algorithm presented in this paper, is the first compiler where Iteration Space Slicing is applied to produce parallel tiled code based on the free schedule of tiles.

9 Conclusion

In this paper, we presented a novel approach based on a combination of the polyhedral model and the Iteration Space Slicing framework. It allows generation of parallel tiled code that demonstrates significant speed-up on shared memory machines with multi-core processors. Using the free schedule of tiles instead of that of loop nest statement instances allows us to adjust the parallelism grain size to match the inter-processor communication capabilities of the target architecture. In the future, we plan to present an extended approach allowing for tiling with parallelepiped original tiles.