1 Introduction

In this paper, we deal with automatic parallelization of sequential programs by means of an optimizing compiler. The parallel program is executed on a computer with two or more processing units. Process synchronization is required to guarantee that a parallel program produces correct results. Synchronization is the coordination of parallel tasks in real time, often implemented by establishing a synchronization point within an application where a task may not proceed until one or more other tasks reach the same point.

Synchronization has considerable impact on parallel program overhead, granularity, and load balancing. It usually involves waiting by at least one task, and can therefore cause a parallel application’s wall-clock execution time to increase, i.e., it introduces parallel program overhead. Any time one task spends waiting for another is considered synchronization overhead. Minimizing its cost is a very important part of making a program efficient. Since synchronization overhead tends to grow rapidly as the number of tasks in a parallel job increases, it is the most important factor in obtaining good scaling behavior for the parallel program.

In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. A parallel application can be coarse- or fine-grained: coarse means that relatively large amounts of computational work are done between communication or synchronization events; fine means that relatively small amounts of computational work are done between communication events. For current multicore CPUs with support for SMT (Simultaneous Multithreading), coarse-grained parallelism is strictly required, independently of the parallel technology used (pthreads, OpenMP, MPI). Usually, decreasing the number of synchronization events allows for increasing parallel code granularity.

Load balancing is important to parallel programs for performance reasons: to get good parallel program performance, all threads should have the same volume of work to be executed. If all tasks are subject to a barrier synchronization point, the largest task will determine the overall performance.

Summing up, we may conclude that to decrease synchronization overhead, increase parallel program granularity, and improve load balancing we need to minimize the number of synchronization events in the parallel program. The ideal situation is when there is no synchronization in the parallel program, i.e., when an optimizing compiler discovers synchronization-free parallelism that requires no synchronization.

Synchronization-free parallelism can be considered on the statement instance level or on the tile level when multiple threads run independent fragments of the program comprising statement instances or tiles, respectively.

Well-known techniques for discovering synchronization-free parallelism are based on the affine transformation framework [10, 21]. Limitations of affine transformations in extracting synchronization-free parallelism are discussed in paper [3]: its authors demonstrate that, although synchronization-free parallelism on the statement instance level exists for some classes of loop nests, there does not exist any affine transformation allowing for extracting such parallelism. The authors show how the transitive closure of dependence graphs can be used to extract synchronization-free parallelism on the statement instance level for such problematic loop nests, but they do not consider extracting synchronization-free parallelism on the tile level.

In this paper, we present a way to extract synchronization-free parallelism on the tile level applying the transitive closure of dependence graphs.

Tiling [10, 13, 17, 21, 29, 35] is a very important iteration reordering transformation for both improving data locality and extracting loop nest parallelism. Tiling for improving locality groups loop statement instances into smaller blocks (tiles), allowing reuse when a block fits in local memory. In parallel tiled code, tiles are considered indivisible macro statements. This coarsens the granularity of parallel applications, which often improves the performance of applications running on parallel computers with shared memory.

One well-known class of tiling techniques is based on affine transformations of program loops. To generate tiled code, affine transformations producing a band of fully permutable loops are first formed; then this band is transformed into tiled code.

Papers [5, 6] introduce a novel approach for automatic generation of tiled code for nested loops which is based on the transitive closure of a loop nest dependence graph. This technique produces tiled code even when there does not exist any affine transformation allowing for producing a band of fully permutable loops. According to that approach, we first form fixed rectangular original tiles and next examine whether all loop nest dependences are respected under the lexicographic order of tile enumeration. If so, we conclude that all original tiles are valid, hence code generation is straightforward. Otherwise, we correct original tiles so that all target tiles are valid, i.e., the lexicographic enumeration order of target tiles respects all dependences available in the original loop nest. The final step is code generation representing target (corrected) tiles.

Paper [6] deals with extracting slices on the tile level so that each slice contains valid tiles generated by means of applying the transitive closure of dependence graphs. In Sect. 5, we demonstrate that such a way of generating synchronization-free tiled code can lose synchronization-free parallelism on the tile level even though synchronization-free parallelism exists on the statement instance level. In this paper, we show how this problem can be resolved, i.e., how to generate synchronization-free tiled code provided we are able to calculate the exact transitive closure of a loop nest dependence graph and synchronization-free parallelism exists on the statement instance level.

The contributions of this paper over previous work are as follows:

  • a concept and an algorithm demonstrating how the Iteration Space Slicing framework can be combined with the Polyhedral Model to generate parallel synchronization-free tiled code on the tile level when there exists synchronization-free parallelism on the statement instance level and the exact transitive closure of a dependence graph can be calculated;

  • development and presentation of the publicly available source-to-source TC compiler implementing the introduced algorithms using the ISL library;

  • evaluation of the effectiveness of the introduced algorithms and the speed-up of tiled code produced by means of the presented approach.

The rest of the paper is organized as follows. Section 2 contains background. Section 3 describes how synchronization-free slices can be extracted on the statement instance level. Section 4 summarizes how tiled code can be generated by means of the transitive closure of dependence graphs. Section 5 introduces a concept and an algorithm to generate parallel synchronization-free code on the tile level. Section 6 discusses related work. Section 7 highlights the results of experiments. Section 8 concludes our work and outlines plans for future work.

2 Background

In this paper, we deal with affine loop nests [11]. A statement instance \(\mathrm {S}[{I}]\) is a particular execution of a loop statement \(\mathrm {S}\) for a given iteration vector I.

Given a loop nest with q statements, we transform it into its polyhedral representation, including: an iteration space \(IS_i\) for each statement \(\mathrm {S}i,i=1,...,q\), read/write access relations (\( RA \)/\( WA \), respectively), and global schedule S corresponding to the original execution order of statement instances in the loop nest.

The loop nest iteration space \(IS_i\) is the set of statement instances executed by a loop nest for statement \(\mathrm {S}i\). An access relation maps an iteration vector \(I_i\) to one or more memory locations of array elements. Schedule S is represented with a relation which maps an iteration vector of a statement to a corresponding multidimensional timestamp, i.e., a discrete time when the statement instance has to be executed.

In this paper, we use two types of iteration spaces: (i) the original one, where instances are identified by a statement identifier and a sequence of integers, and (ii) the global one, where instances are identified by their execution order. We obtain the global iteration space by applying the global schedule to the original iteration space. A global iteration vector I represents statement instances in the global iteration space.

Further on, under I and IS, we mean the global iteration vector and global iteration space, respectively.

Two statement instances \(\mathrm {S1}[{I}]\) and \(\mathrm {S2}[{J}]\) are dependent if both access the same memory location and if at least one access is a write. \(\mathrm {S1}[{I}]\) and \(\mathrm {S2}[{J}]\) are called the source and target of a dependence, respectively, provided that \(\mathrm {S1}[{I}]\) is executed before \(\mathrm {S2}[{J}]\). The sequential ordering of statement instances, denoted \(\mathrm {S1}[{I}] \prec \mathrm {S2}[{J}]\), is induced by the global schedule.

The algorithms presented in this paper use a dependence relation, which is a tuple relation of the form \(\{\, \mathrm {}[{\textit{input list}}] \rightarrow \mathrm {}[{\textit{output list}}] \mid {\textit{formula}} \,\}\), where input list and output list are the lists of variables and/or expressions used to describe input and output tuples, and formula is a Presburger formula describing the constraints imposed upon the input and output lists; it is built of constraints represented by algebraic expressions combined with logical and existential operators.

A dependence relation, describing all the dependences in a loop nest, can be computed as a union of flow, anti, and output dependences according to the following formula [32]:

$$\begin{aligned} R = (( RA ^{-1} \circ WA ) \cup ( WA ^{-1} \circ RA ) \cup ( WA ^{-1} \circ WA )) \cap (S \prec S), \end{aligned}$$
(1)

where \( RA \)/\( WA \) are read/write access relations, respectively, mapping an iteration vector to one or more referenced memory locations, and S is the original schedule represented with a relation which maps an iteration vector of a statement to a corresponding multidimensional timestamp (i.e., a discrete time when the statement instance has to be executed). \(S \prec S\) denotes a strict partial order of statement instances: \(S^{-1} \circ (\{\, \mathrm {}[{e}] \rightarrow \mathrm {}[{e'}] \mid e \prec e' \,\} \circ S)\).
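As an illustration, formula (1) can be evaluated on finite access relations represented as sets of pairs. The sketch below assumes a hypothetical one-dimensional loop (for i = 1..4: a[i] = a[i-1] + 1); the helpers inverse and compose are ours, not part of any library:

```python
def inverse(rel):
    return {(b, a) for (a, b) in rel}

def compose(r2, r1):
    # (r2 o r1)(x) = r2(r1(x)) on relations given as sets of pairs
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

WA = {(i, i) for i in range(1, 5)}        # iteration i writes a[i]
RA = {(i, i - 1) for i in range(1, 5)}    # iteration i reads a[i-1]

flow   = compose(inverse(RA), WA)   # write followed by a read
anti   = compose(inverse(WA), RA)   # read followed by a write
output = compose(inverse(WA), WA)   # write followed by a write

# Intersect with the strict execution order S ≺ S (here simply i < i').
R = {(a, b) for (a, b) in flow | anti | output if a < b}
print(sorted(R))   # flow dependences i -> i+1
```

For this toy loop only the flow dependences survive the order filter; the anti and output candidate pairs violate i < i'.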

A dependence relation is a mathematical representation of a data dependence graph whose vertices correspond to loop statement instances while edges connect dependent instances. The input and output tuples of a relation represent dependence sources and destinations, respectively; the relation constraints point out instances which are dependent.

In the presented algorithm, standard operations on relations and sets are used, such as intersection (\(\cap \)), union (\(\cup \)), difference (−), composition (\(\circ \)), domain (\(\text {domain}(R)\)), range (\(\text {range}(R)\)), relation application (\(R(S) = \{\, \mathrm {}[{e'}] \mid \exists e \in S : \mathrm {}[{e}] \rightarrow \mathrm {}[{e'}] \in R \,\}\)).

The positive transitive closure of a given relation R, \(R^+\), is defined as follows:

$$\begin{aligned} R^{+} = \bigcup _{i=1}^{\infty }{R^i}, \end{aligned}$$
(2)

where \(R^i\) is the i-th power of R, defined inductively by \(R^1=R\), and for \(i>1\), \(R^i=R \circ R^{i-1}\), where \(\circ \) denotes the composition of relations.
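On a finite relation, formula (2) can be computed by a fixpoint iteration, since the union of powers stabilizes. A minimal Python sketch (toy chain relation assumed for illustration):

```python
def compose(r2, r1):
    # (r2 o r1)(x) = r2(r1(x)) on relations given as sets of pairs
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

def transitive_closure(rel):
    # R^+ = R ∪ R^2 ∪ ... ; on a finite relation the union stabilizes.
    closure, power = set(rel), set(rel)
    while True:
        power = compose(rel, power)      # R^(i+1) = R o R^i
        if power <= closure:
            return closure
        closure |= power

R = {(1, 2), (2, 3), (3, 4)}             # toy chain 1 -> 2 -> 3 -> 4
print(sorted(transitive_closure(R)))
```

Note that real dependence graphs are parametric, so the exact \(R^+\) cannot in general be enumerated this way; the sketch only illustrates the definition.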

A relation R is reflexively closed on a set D if the identity relation \({Id}_D\) is a subset of R. The reflexive closure of R on D is \(R \cup {Id_D}\). The reflexive and transitive closure of R on D is:

$$\begin{aligned} R^* = R^{+} \cup {Id_D}. \end{aligned}$$
(3)

It describes the same connections in a dependence graph (represented by R) that \(R^+\) does plus connections of each vertex with itself.

Techniques aimed at calculating the transitive closure of a dependence graph, which in general is parametric, are presented in papers [4, 18, 33]; they are out of the scope of this paper. In general, it is impossible to represent the exact transitive closure using affine constraints [18]. Existing algorithms return either the exact transitive closure or an approximation of it. The exact transitive closure or an over-approximation of it can be used in the presented algorithms, but if we use an over-approximation, the tiled code will not be optimal: it exposes less code locality and/or parallelism.

Paper [4] presents the time of transitive closure calculation for the NPB benchmarks [23]. It depends on the number of dependence relations extracted for a loop nest and can vary from milliseconds to several minutes (when the number of dependence relations reaches hundreds or thousands).

3 Extraction of synchronization-free slices

To extract synchronization-free parallelism on the statement instance level applying the transitive closure of dependence graphs, we use the approach presented in paper [3]. Let us recall basic definitions related to iteration space slicing on the statement instance level.

Definition 1

(Slice) Given a dependence graph, a slice is a weakly connected component of this graph, i.e., a maximal subgraph such that each pair of its vertices is connected by some path, ignoring the edge directions.

Definition 2

(Ultimate dependence source) An ultimate dependence source is a source that is not the destination of another dependence.

Definition 3

(Representative source) The representative (source) of a slice is its lexicographically minimal ultimate dependence source in the global iteration space of a loop nest.

In order to extract parallelism represented with synchronization-free slices, we need to carry out the following steps:

  1. Find a set, \({ REPR}\), including representatives of slices;

  2. Reconstruct slices from their representatives as independent execution flows.

Given relation R found as the union of all dependence relations extracted for a loop nest, we start with forming a set of statement instances, \({ UDS}\), describing all ultimate dependence sources, as the difference between the domain of R and the range of R:

$$\begin{aligned} UDS = \text {domain}(R) - \text {range}(R). \end{aligned}$$
(4)

Subsequently, to find which elements of set \({ UDS}\) are representatives of slices, we construct a relation, \(R\_USC\), that describes all pairs \((e, e')\) of the ultimate dependence sources contained in set \({ UDS}\) that are connected by some path, ignoring the edge directions. Formally, relation \(R\_USC\) is defined as shown below:

$$\begin{aligned} R\_{USC} = \{\, \mathrm {}[{e}] \rightarrow \mathrm {}[{e'}] \mid e,e' \in {UDS} \ \wedge \ e \prec e' \ \wedge e' \in (R \cup R^{-1})^+(\{\mathrm {}[{e}]\})\}. \end{aligned}$$
(5)

The inequality (\(e \prec e'\)) in the constraints of relation \(R\_USC\) means that e is lexicographically less than \(e'\). Such a condition guarantees that the lexicographically smallest element will be represented only with the input tuple, and as a result the set \(\text {range}(R\_USC)\) will contain all but the lexicographically smallest sources of synchronization-free slices. \(R^{-1}\) denotes the inverse relation of R, i.e., \(R^{-1} = \{\, \mathrm {}[{e}] \rightarrow \mathrm {}[{e'}] \mid \mathrm {}[{e'}] \rightarrow \mathrm {}[{e}] \in R \,\} \,\). The condition \(e' \in (R \cup R^{-1})^+(\{\mathrm {}[{e}]\})\) implies that there exists some path between e and \(e'\) when the edge directions are ignored.

In order to illustrate the presented idea, let us consider the following relation: \(R := \{\, \mathrm {}[{1}] \rightarrow \mathrm {}[{5}]; \mathrm {}[{2}] \rightarrow \mathrm {}[{4}]; \mathrm {}[{3}] \rightarrow \mathrm {}[{4}]; \mathrm {}[{3}] \rightarrow \mathrm {}[{5}] \,\}.\)

The graph, described with relation R, is the single weakly connected component, presented in Fig. 1 (solid lines).

Set \({ UDS}\) computed over relation R includes vertices belonging to the set \(\{ \, \mathrm {}[{1}]; \mathrm {}[{2}]; \mathrm {}[{3}] \, \}\), i.e., all the vertices that have no incoming edges and thus satisfy the definition of an ultimate dependence source.

According to Definition 1, each weakly connected component constitutes a synchronization-free slice. We would therefore like to find its lexicographically smallest element, serving as the representative of a slice. First, we compute the inverse relation of R, \(R^{-1} := \{\, \mathrm {}[{4}] \rightarrow \mathrm {}[{2}]; \mathrm {}[{4}] \rightarrow \mathrm {}[{3}]; \mathrm {}[{5}] \rightarrow \mathrm {}[{1}]; \mathrm {}[{5}] \rightarrow \mathrm {}[{3}] \,\}.\)

Let us highlight that the forward (solid) and backward (dashed) edges in Fig. 1 form paths connecting ultimate dependence sources contained in the same slice, i.e., if there exists a pair of ultimate dependence sources described with the relation \((R \cup R^{-1})^+\), then they both belong to a single synchronization-free slice. For the working example, relation \(R\_USC\) is as follows: \(R\_USC := \{\, \mathrm {}[{1}] \rightarrow \mathrm {}[{2}]; \mathrm {}[{1}] \rightarrow \mathrm {}[{3}]; \mathrm {}[{2}] \rightarrow \mathrm {}[{3}] \,\}.\) The paths connecting the ultimate dependence sources for the working example, being described with relation \(R\_USC\), are presented in Fig. 1 (dotted lines).

Fig. 1 Connections described with relations R, \(R^{-1}\) and \(R\_USC\)

As already mentioned, the range of relation \(R\_USC\) contains all but the lexicographically smallest sources of all slices. Following this observation, in order to find set \({ REPR}\) comprising representatives of slices, we carry out the below computation:

$$\begin{aligned} REPR = UDS - \text {range}(R\_USC). \end{aligned}$$
(6)

Set \({ REPR}\) is very important: its cardinality is equal to the number of synchronization-free slices, and each of its elements is a slice representative, which is enough to reconstruct the corresponding slice in the way discussed below. For the working example, the set of representatives contains a single element, \(REPR := \{\, \mathrm {}[{1}] \,\}\), which means there exists a single slice.
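Formulas (4) to (6) can be replayed on the working example relation R with a few lines of Python, implementing the relation operations directly on sets of pairs (illustrative only, not the ISL implementation):

```python
def inverse(rel):
    return {(b, a) for (a, b) in rel}

def compose(r2, r1):
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

def transitive_closure(rel):
    closure, power = set(rel), set(rel)
    while True:
        power = compose(rel, power)
        if power <= closure:
            return closure
        closure |= power

R = {(1, 5), (2, 4), (3, 4), (3, 5)}                   # working example
UDS = {a for a, _ in R} - {b for _, b in R}            # formula (4)
undirected = transitive_closure(R | inverse(R))        # (R ∪ R^-1)^+
R_USC = {(e1, e2) for e1 in UDS for e2 in UDS
         if e1 < e2 and (e1, e2) in undirected}        # formula (5)
REPR = UDS - {e2 for _, e2 in R_USC}                   # formula (6)
print(UDS, R_USC, REPR)
```

The computed values agree with the discussion above: all three ultimate dependence sources are connected through backward edges, so only the lexicographically smallest one remains as the representative.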

After finding the representatives of slices and a relation, describing the connection of each representative with other ultimate dependence sources contained in the same slice, given a representative, rpr, we can reconstruct the corresponding synchronization-free slice of the original graph using the following formula:

$$\begin{aligned} SFS(rpr) = R^*(R\_USC^*(rpr)). \end{aligned}$$
(7)

It is worth noting that if \(R\_USC = \varnothing \) then \(R\_USC^*(rpr) = rpr\).
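Formula (7) can likewise be traced for the working example. The sketch below applies a reflexive-transitive closure to a set of vertices by breadth-first reachability; relation R_USC is hard-coded with the value derived for the working example in this section:

```python
def rstar_apply(rel, seeds):
    # Apply the reflexive-transitive closure R^* to a set of vertices:
    # the seeds themselves plus everything reachable from them via rel.
    result, frontier = set(seeds), set(seeds)
    while frontier:
        nxt = {b for (a, b) in rel if a in frontier} - result
        result |= nxt
        frontier = nxt
    return result

R = {(1, 5), (2, 4), (3, 4), (3, 5)}       # working example
R_USC = {(1, 2), (1, 3), (2, 3)}           # derived for the working example

# formula (7): SFS(rpr) = R^*(R_USC^*(rpr)), here for rpr = 1
SFS = rstar_apply(R, rstar_apply(R_USC, {1}))
print(sorted(SFS))
```

The single slice indeed covers all five vertices of the graph.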

To generate valid code representing synchronization-free parallelism and executing all statement instances that reside in the loop nest iteration space, we also need to calculate a set, IND, including independent statement instances, as follows:

$$\begin{aligned} IND = IS - (\text {domain}(R) \cup \text {range}(R)), \end{aligned}$$
(8)

where set IS represents the global iteration space of a loop nest.
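Formula (8) is a plain set difference. In the sketch below we assume a global iteration space IS = {1, ..., 6}, i.e., the working example extended with one hypothetical instance that takes part in no dependence:

```python
R = {(1, 5), (2, 4), (3, 4), (3, 5)}     # working example dependences
IS = set(range(1, 7))                    # assumed global iteration space

# formula (8): instances that are neither sources nor targets
IND = IS - ({a for a, _ in R} | {b for _, b in R})
print(IND)
```

Instance 6 belongs to no slice and may be scheduled on any thread without synchronization.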

4 Loop tiling

In papers [5, 6], algorithms based on the transitive closure of a dependence graph allowing for loop nest tiling are introduced. They correct original rectangular tiles so that target tiles are valid under lexicographical order.

Let us consider the following example:

Example 1

[Example 1: loop nest listing]

A data dependence analysis over the read and write accesses, based on formula (1), results in the following dependence relation:

$$\begin{aligned} R:= \{\, \mathrm {S1}[{i, j}] \rightarrow \mathrm {S1}[{i + 1, j - 1}] \mid 0< i \le 3 \wedge 2 \le j \le 4 \,\} \, , \end{aligned}$$

which describes data dependences between the instances of statement \(\mathrm {S1}\). Figure 2a shows dependences and synchronization-free slices for Example 1.

Fig. 2 Illustrations for Example 1: a dependences, UDS, independent iterations, slice representatives, and slices, b original tiles, c target tiles, d slices generated without splitting tiles including slice representatives, e slices generated with relation \(T\_RPR\), presented in Sect. 5.1, f slices generated with the affine transformations \(i'=i\), \(j'=i+j\)

In general, for each statement \(\mathrm {S}i,i=1,...,q\), surrounded by \(d_i\) loops, we form set \(TILE_i({II_i})\) including iterations belonging to a parametric original rectangular tile as follows:

$$\begin{aligned} \begin{aligned} TILE_i({II_i}) = \mathrm {}[{II_i}] \rightarrow \{\, \mathrm {}[{I_i}] \mid&\ {B_i} * {II_i} + {LB_i} \le I_i \le \mathrm {min}({B_i} * ({II_i} + {1_i}) \ + \\&{LB_i} - {1_i}, {UB_i}) \wedge {II_i} \ge {0_i} \,\}, \end{aligned} \end{aligned}$$

where vectors \({LB_i}\) and \({UB_i}\) include the lower and upper bounds, respectively, of the indices of the loops surrounding statement \(\mathrm {S}i\); diagonal matrix \({B_i}\) defines the size of original rectangular tiles; elements of vectors \({I_i}\) and \({II_i}\) represent the original indices of the loops enclosing statement \(\mathrm {S}i\) and the identifiers of tiles, respectively; \({1_i}\) and \({0_i}\) are vectors all of whose \(d_i\) elements have the value 1 and 0, respectively.

Additionally, with each set \(TILE_i,i=1,...,q\), we associate another set, \(II\_SET_i,i=1,...,q\), that includes the tile identifiers of all tiles represented with set \(TILE_i\):

$$\begin{aligned} II\_SET_i = \{\, \mathrm {}[{II_i}] \mid {II_i} \ge {0_i} \wedge {B_i} * {II_i} + {LB_i} \le {UB_i} \,\}. \end{aligned}$$

For Example 1, sets \(TILE_1\) and \(II\_SET_1\) represent tiles of size \(2 \times 2\) in space \(IS_1\):

$$\begin{aligned} TILE_1:= & {} \mathrm {}[{ii, jj}] \rightarrow \{\, \mathrm {S1}[{i, j}] \mid 0 \le ii \le 1 \wedge 0 \le jj \le 1 \wedge i> 2ii \wedge 0< i \le 4 \ \wedge \\&i \le 2 + 2ii \wedge j > 2jj \wedge 0 < j \le 4 \wedge j \le 2 + 2jj \,\},\\ II\_SET_1:= & {} \{\, \mathrm {}[{ii, jj}] \mid 0 \le ii \le 1 \wedge 0 \le jj \le 1 \,\}. \end{aligned}$$

Figure 2b illustrates original tiles of the size \(2 \times 2\) (\(T00, T01, T10, T11\)) defined by the above sets.
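The parametric set TILE can be enumerated directly for Example 1 with B = diag(2, 2), LB = (1, 1), UB = (4, 4). A minimal sketch under these assumptions:

```python
def tile(ii, jj, B=(2, 2), LB=(1, 1), UB=(4, 4)):
    # Instances (i, j) of the parametric original tile TILE(ii, jj):
    # B*II + LB <= I <= min(B*(II + 1) + LB - 1, UB)
    lo_i = B[0] * ii + LB[0]
    hi_i = min(B[0] * (ii + 1) + LB[0] - 1, UB[0])
    lo_j = B[1] * jj + LB[1]
    hi_j = min(B[1] * (jj + 1) + LB[1] - 1, UB[1])
    return {(i, j) for i in range(lo_i, hi_i + 1)
                   for j in range(lo_j, hi_j + 1)}

# II_SET: non-negative tile identifiers whose tile is non-empty
II_SET = {(ii, jj) for ii in range(2) for jj in range(2)
          if 2 * ii + 1 <= 4 and 2 * jj + 1 <= 4}
print(sorted(tile(0, 0)))   # the four instances of tile T00
```

The four tiles T00, T01, T10, T11 partition the 4 x 4 iteration space, matching Fig. 2b.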

The approach, discussed in this paper, is applicable to both perfectly and imperfectly nested loops. We form a global iteration space for instances of all loop nest statements by means of applying global schedule S, computed by the Polyhedral Extraction Tool (PET) [34], to sets \(TILE_i, IS_i\) used in a tiling algorithm. We call this procedure normalization. To normalize dependence relation R, we apply global schedule S to both the domain and range of this relation.

To compare tile identifiers lexicographically in the global iteration space and generate valid code, we also normalize sets \(II\_SET_i\) in the same way.

For the reader's convenience, we present the tiling algorithm in “Appendix A”; it is a slight modification of the one presented in paper [6]. The modification concerns the way of set and relation normalization, which is carried out using the global schedule of loop nest statement instances returned by PET [34]. The first step of the algorithm transforms a loop nest into its polyhedral representation. The second one prepares data to be used for tile correction. The third step carries out a dependence analysis for the loop nest. Step 4 normalizes the sets and relations formed in steps 1 to 3. Steps 5 to 7 generate set \(TILE\_VLD\), which results from correcting the original rectangular tiles and represents target tiles valid under lexicographic order. The inter-tile dependence graph whose vertices are represented with set \(TILE\_VLD\) is acyclic, so there exists a schedule for those vertices [6]. In the next section, we show how we use set \(TILE\_VLD\) to generate synchronization-free code on the tile level.

Figure 2c shows valid target tiles for Example 1, generated according to Algorithm A.

5 Generation of synchronization-free tiled code

In this section, we demonstrate how the techniques presented in Sects. 3 and 4 can be combined to generate parallel synchronization-free tiled code on the tile level when there exist synchronization-free slices on the statement instance level.

Extracting synchronization-free parallelism on the tile level is a more complex task than that on the statement instance level. The techniques to extract slices on the tile level, discussed in papers [6, 25], are based on the following steps: (i) valid (corrected) tiles are generated; (ii) a relation describing all the dependences among valid tiles (inter-tile dependences) is derived; (iii) techniques, presented in paper [3], are applied to the relation, obtained in step (ii), to generate synchronization-free code on the tile level.

Applying the approach described in the previous paragraph to Example 1, we get the target tiles shown in Fig. 2c. The inter-tile dependence graph for Example 1 is shown in Fig. 3a. As we can see, tiles \(T\_VLD01,\ T\_VLD10,\ T\_VLD11\) belong to the same slice, so we lose synchronization-free parallelism on the tile level even though synchronization-free parallelism exists on the statement instance level. In the following sub-section, we discuss how this problem can be resolved.

Fig. 3 Inter-tile dependence graphs for a Example 1, and b Example 2

5.1 Basic concept

Three cases are possible concerning the number of slice representatives within a valid tile: (i) no representative, (ii) a single representative, (iii) two or more representatives. When two or more representatives are contained within a valid tile, it may be reasonable to split this tile into several sub-tiles so that each sub-tile includes at least one representative. The reason is the following: the number of slice representatives within a tile impacts the parallelism degree and granularity of tiled code. Increasing the number of representatives in a tile decreases the parallelism degree but increases parallel program granularity. We illustrate this trade-off in the following sub-section by means of two examples.

In this paper, we consider the following three cases of splitting a tile including slice representatives: (i) a target tile including multiple representatives is not sliced, i.e., all slice representatives extracted on the statement instance level and contained in some target tile are included in a single slice on the tile level; (ii) a valid tile is sliced into several sub-tiles so that each one includes only a single representative, i.e., each slice representative extracted on the statement instance level and contained in some target tile is included in a separate slice on the tile level; (iii) a set including slice representatives is tiled so that each tile, except possibly the last one, includes the same number of representatives, N, \(N \ge 2\).

To generate target code, corresponding to each of those cases, we need corresponding mappings from statement instances or slice representatives to abstract identifiers. For this purpose, we form the following two relations.

Relation, SLC, mapping each statement instance, i, to the corresponding slice representative, rpr, is formed as follows:

$$\begin{aligned} SLC = \{\, \mathrm {}[{i}] \rightarrow \mathrm {}[{rpr}] \mid i \in {SFS}({rpr}) \wedge {rpr} \in REPR \,\} \, , \end{aligned}$$

where SFS(rpr) is the set of instances within the slice whose representative is instance rpr, \({ REPR}\) is the set including slice representatives; sets SFS(rpr) and \({ REPR}\) are defined in Sect. 3.

Relation, T, mapping each statement instance, i, to the corresponding tile identifier, II, is built as follows:

$$\begin{aligned} T = \{\, \mathrm {}[{i}] \rightarrow \mathrm {}[{II}] \mid i \in TILE\_VLD(II) \wedge II \in II\_SET \,\} \, , \end{aligned}$$

where \(TILE\_VLD(II)\) represents instances within the valid tile with identifier II, this set is formed by means of Algorithm A; \(II\_SET\) is the set including the identifiers of valid tiles, this set is defined in step 4.4 of Algorithm A.

The third relation, \(T\_RPR\), is user-provided; it maps each slice representative, rpr, to the corresponding identifier, ID, of the tile generated as the result of tiling a set including all slice representatives. It is of the form:

$$\begin{aligned} T\_RPR = \{\, \mathrm {}[{rpr}] \rightarrow \mathrm {}[{ID}] \mid rpr \in REPR \wedge constraints\ on\ ID \,\}. \end{aligned}$$

It is worth noting that relation \(T\_RPR\) represents tiles different from the valid target tiles generated with Algorithm 1: its tiles include only slice representatives.

In this paper, we suppose that relation \(T\_RPR\) is provided by an expert; it is an input of Algorithm 1 discussed below. If it is not provided on input, the algorithm skips its usage.

For Example 1, such a relation, provided that each tile, except the last one, should include 2 representatives, can be the following:

$$\begin{aligned} \begin{aligned} T\_RPR&:= \{\, \mathrm {}[{rpr1, rpr2}] \rightarrow \mathrm {}[{id1, id2}] \mid rpr1=1 \wedge id1=0 \\&\quad \wedge 2*id2+1 \le rpr2 \le 2*id2+2 \vee rpr2=4 \wedge id1=1 \\&\quad \wedge 2*id2+2 \le rpr1 \le \mathrm {min}(2*id2+3, 4) \wedge 0 \le id1,id2 \le 1 \,\}. \end{aligned} \end{aligned}$$

Tiles ID00, ID01, ID10, and ID11, formed according to the relation \(T\_RPR\) above, are shown in Fig. 2e in green.
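The grouping encoded by the above relation can be checked by enumeration. The sketch below evaluates the constraints of \(T\_RPR\) over the elements of set \(S\_REPR\_IND\) (cf. Sect. 5.2) and collects, for each identifier (id1, id2), the representatives it covers:

```python
# Representatives and independent instances of Example 1 (set S_REPR_IND).
S_REPR_IND = {(1, j) for j in range(1, 5)} | {(i, 4) for i in range(2, 5)}

groups = {}
for (r1, r2) in S_REPR_IND:
    for id1 in range(2):
        for id2 in range(2):
            # the two disjuncts of relation T_RPR, transcribed literally
            first = (r1 == 1 and id1 == 0
                     and 2 * id2 + 1 <= r2 <= 2 * id2 + 2)
            second = (r2 == 4 and id1 == 1
                      and 2 * id2 + 2 <= r1 <= min(2 * id2 + 3, 4))
            if first or second:
                groups.setdefault((id1, id2), set()).add((r1, r2))
print(groups)
```

Each of the tiles ID00, ID01, ID10 receives two representatives and ID11 receives the remaining one, as intended.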

To generate code without splitting tiles including slice representatives, we take into account that representative rpr can be found as \(rpr=SLC(i)\), identifier II of the tile including instance i is formed as \(II=T(i)\), and identifier \(II'\) of the tile including representative rpr is calculated as \(II'= T(rpr)=T(SLC(i))\); we form the following set:

$$\begin{aligned} CODE\_UNSLICED = \{\, \mathrm {}[{II', II, i}] \mid II' = T(SLC(i)) \wedge II=T(i) \,\}. \end{aligned}$$

To generate code with splitting tiles including slice representatives, taking into account that representative rpr can be found as \(rpr=SLC(i)\), identifier II of the tile, including instance i, is calculated as \(II=T(i)\), we form the following set:

$$\begin{aligned} CODE\_SLICED = \{\, \mathrm {}[{rpr, II, i}] \mid rpr = SLC(i) \wedge II=T(i) \,\}. \end{aligned}$$

When relation \(T\_RPR\) is given on input, taking into account that \(rpr=SLC(i)\), we form the following set:

$$\begin{aligned} CODE\_SLICED = \{\, \mathrm {}[{ID, II,i}] \mid ID = T\_RPR(SLC(i)) \wedge II=T(i) \,\}. \end{aligned}$$

Next, we apply the code generator of the Integer Set Library [14] to generate pseudo-code scanning elements of set \(CODE\_UNSLICED\) or \(CODE\_SLICED\) and finally postprocess this pseudo-code to get the parallel pseudo-code of the following structure:

parfor scanning elements \(II'\) or rpr or ID

   for scanning elements II

      for scanning elements \(i\)
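To make the structure of the generated code concrete, the sketch below builds set \(CODE\_SLICED\) for Example 1. Two simplifications are assumed: relation SLC is written in closed form (the slices of Example 1 run along the anti-diagonals i + j = const), and relation T uses the original rectangular 2 x 2 tiles rather than the corrected tiles produced by Algorithm A:

```python
from itertools import groupby

def slc(i, j):
    # Relation SLC in closed form for Example 1: the representative of
    # anti-diagonal i + j = c is its lexicographically minimal element.
    c = i + j
    return (1, c - 1) if c <= 5 else (c - 4, 4)

def t(i, j):
    # Relation T (assumed original rectangular tiles): instance -> tile id
    return ((i - 1) // 2, (j - 1) // 2)

IS = [(i, j) for i in range(1, 5) for j in range(1, 5)]
CODE_SLICED = sorted((slc(i, j), t(i, j), (i, j)) for (i, j) in IS)

# The outer "parfor" dimension: each group is an independent execution flow
# scanning its tiles II and, within each tile, its instances i.
slices = {rpr: [inst for _, _, inst in grp]
          for rpr, grp in groupby(CODE_SLICED, key=lambda x: x[0])}
print(len(slices))
```

Sorting the tuples lexicographically groups instances first by slice representative, then by tile identifier, which is exactly the parfor/for/for structure shown above.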

Algorithm 1 lists the steps of the procedure for generation of synchronization-free parallel tiled pseudo-code. It includes the following three steps. The first step is responsible for the calculation of slice representatives and independent statement instances according to the technique presented in Sect. 3. The second step is to extract synchronization-free slices on the tile level by means of the concept presented in this section and Algorithm A. The last step generates synchronization-free pseudo-code on the tile level.

It is worth noting that to generate compilable code, we need a postprocessor transforming pseudo-code into compilable code. The postprocessor organization depends on the target platform and the programming API or library used to represent and then compile parallel programs. In Sect. 7, we clarify how the postprocessor of the TC compiler generates target compilable code.


5.2 Illustrative examples

In this sub-section, we illustrate extracting synchronization-free slices on the tile level by means of two examples. All calculations were carried out by means of the iscc calculator [31].

Let us start with Example 1. For this loop nest, set UDS, calculated according to formula (4), is as follows:

$$\begin{aligned} UDS := \{\, [i, 4] \mid 2 \le i \le 3 \,\} \cup \{\, [1, j] \mid 2 \le j \le 4 \,\}. \end{aligned}$$

Relation \({R\_USC}\), calculated with formula (5), is empty, hence we conclude that each element of set UDS is the representative of a slice, that is \({REPR}={UDS}\). For Example 1, Fig. 2a shows dependences, ultimate dependence sources (blue points), independent iterations (green points), and slices on the statement instance level (points within red parallelograms).

Consequently, set \(S\_REPR\_IND\) is as follows:

$$\begin{aligned} S\_REPR\_IND := \{\, [i, 4] \mid 2 \le i \le 4 \,\} \cup \{\, [1, j] \mid 1 \le j \le 4 \,\}. \end{aligned}$$

Slice representatives and independent statement instances are depicted in Fig. 2a. After applying Algorithm 1, we obtain the following set representing valid target tiles:

$$\begin{aligned} \begin{aligned} TILE\_VLD := [ii, jj] \rightarrow&\ \{\, [i, j] \mid i > 2ii \wedge i > 0 \\&\wedge \ ((jj = 0 \wedge i \le 4 \wedge 0 < j \le 3 + 2ii - i) \\&\vee \ (jj = 1 \wedge ii \le 1 \wedge i \le 2 + 2ii \wedge 4 + 2ii - i \le j \le 4)) \,\}. \end{aligned} \end{aligned}$$
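The constraints above can be checked by brute-force enumeration over the \(4 \times 4\) iteration space of Example 1 (a toy Python sketch with names of our choosing; TC manipulates this set symbolically with ISL):

```python
# Enumerate TILE_VLD for tile identifiers ii, jj in {0, 1} over the
# 4x4 iteration space (brute-force re-check of the symbolic set).
def tile_vld(ii, jj):
    return {(i, j)
            for i in range(1, 5) for j in range(1, 5)
            if i > 2 * ii and
               ((jj == 0 and i <= 4 and 0 < j <= 3 + 2 * ii - i) or
                (jj == 1 and ii <= 1 and i <= 2 + 2 * ii
                 and 4 + 2 * ii - i <= j <= 4))}

# All four (ii, jj) combinations yield non-empty target tiles, and the
# tiles partition the 16 iterations of the space.
tiles = {(ii, jj): tile_vld(ii, jj) for ii in (0, 1) for jj in (0, 1)}
```

Enumerating this way confirms that the four target tiles are disjoint and together cover all 16 iterations.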

Figure 2c shows four target tiles defined by the above set. Assuming that tiles including multiple slice representatives should not be sliced and that a slice representative is denoted with variables i, j, set SFS is as follows:

$$\begin{aligned} \begin{aligned} SFS := [i, j] \rightarrow&\{\, [i0, 4 + i - i0] \mid j = 4 \wedge 2 \le i \le 4 \wedge i < i0 \le 4 \wedge i0 \le 3 + i \,\} \ \\&\cup \{\, [i0, 1 + j - i0] \mid i = 1 \wedge 0 < j \le 4 \wedge 2 \le i0 \le 4 \wedge i0 \le j \,\} \ \\&\cup \{\, [i, 4] \mid j = 4 \wedge 2 \le i \le 4; [1, j] \mid i = 1 \wedge 0 < j \le 4 \,\}. \end{aligned} \end{aligned}$$

Let us note that tile \(T\_VLD11\) is divided into two sub-tiles, \(T\_VLD11'\) and \(T\_VLD11''\). The former belongs to slice SFS1, while the latter is within slice SFS2 (see Fig. 2d).

Set CODE includes the following elements:

{ [1, 1, 1, 1, 4, 4]; [1, 1, 1, 1, 3, 4]; [0, 1, 0, 1, 2, 4]; [0, 1, 0, 1, 1, 4]; [1, 1, 1, 1, 4, 3]; [0, 1, 1, 1, 3, 3]; [0, 1, 0, 1, 2, 3]; [0, 1, 0, 1, 1, 3]; [0, 1, 1, 1, 4, 2]; [0, 1, 1, 0, 3, 2]; [0, 1, 0, 1, 2, 2]; [0, 0, 0, 0, 1, 2]; [0, 1, 1, 0, 4, 1]; [0, 1, 1, 0, 3, 1]; [0, 0, 0, 0, 2, 1]; [0, 0, 0, 0, 1, 1] }.

The roles of the elements of set CODE are the following: the first pair represents the identifier of the tile including slice representatives (0,0 or 0,1 or 1,1); the second pair stands for the identifier of the tile including elements of the slice defined with the slice representatives determined with the first pair (from 0,0 to 1,1); the third pair defines the iterations of the tile with the identifier represented with the second pair (from 1,1 to 4,4).
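This layout can be sketched in Python (a toy decoder; `decode` and `slices` are our names, not part of TC):

```python
# Set CODE for Example 1, as listed in the text.  Each element is a
# 6-tuple: the first pair identifies the tile holding the slice
# representatives, the second pair the tile whose iterations are
# listed, and the third pair an iteration of that tile.
CODE = [
    [1, 1, 1, 1, 4, 4], [1, 1, 1, 1, 3, 4], [0, 1, 0, 1, 2, 4], [0, 1, 0, 1, 1, 4],
    [1, 1, 1, 1, 4, 3], [0, 1, 1, 1, 3, 3], [0, 1, 0, 1, 2, 3], [0, 1, 0, 1, 1, 3],
    [0, 1, 1, 1, 4, 2], [0, 1, 1, 0, 3, 2], [0, 1, 0, 1, 2, 2], [0, 0, 0, 0, 1, 2],
    [0, 1, 1, 0, 4, 1], [0, 1, 1, 0, 3, 1], [0, 0, 0, 0, 2, 1], [0, 0, 0, 0, 1, 1],
]

def decode(elem):
    """Split a CODE element into its three pairs (a hypothetical helper)."""
    return tuple(elem[0:2]), tuple(elem[2:4]), tuple(elem[4:6])

# The distinct first pairs identify the synchronization-free slices.
slices = {tuple(e[0:2]) for e in CODE}
```

The three distinct first pairs, (0,0), (0,1), and (1,1), correspond to the three slices of this example.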

Figure 2d illustrates three slices SFS0, SFS1, and SFS2 represented with set CODE.

Applying relation \(T\_RPR\), presented in the previous sub-section, we get the tiles shown in blue in Fig. 2e. They are the same as those obtained by applying the affine transformations \(i'=i\), \(j'=i+j\) to the original iteration space and then applying Algorithm A to the transformed iteration space; see Fig. 2f.

Let us now consider another example.

Example 2


Dependences in this loop are described by the following relation:

$$\begin{aligned} \begin{aligned} R :=&\{\, [i, j] \rightarrow [i + 1, j + 1] \mid 0 < i \le 3 \wedge 0 < j \le 3 \,\} \ \cup \\&\{\, [i, j] \rightarrow [i + 1, j - 1] \mid 0 < i \le 3 \wedge 2 \le j \le 4 \,\}. \end{aligned} \end{aligned}$$

Figure 4a illustrates dependences for this example. As we can see, there are two synchronization-free slices: red arrows depict the first one, while the black ones form the second.

Set \(UDS\) computed over relation R includes the following iterations: \(\{\, [1, 1]; [1, 2]; [1, 3]; [1, 4] \,\}\). Figure 4a presents those four ultimate dependence sources; however, only two of them are slice representatives. In order to exclude non-representative sources, we apply formula (5) to get the following relation: \(R\_USC := \{\, [1,1] \rightarrow [1,3]; [1,2] \rightarrow [1,4] \,\}\). Subsequently, we form set \(S\_REPR\) as \(S\_REPR := UDS - \text{range}(R\_USC) = \{\, [1, 1]; [1, 2] \,\}\).
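For this small, fixed iteration space, the computation of UDS and of the slice representatives can be reproduced by brute-force enumeration in Python (a toy re-implementation under our own names; TC performs these computations symbolically with ISL):

```python
# Dependence relation R for Example 2, enumerated over the 4x4 space.
R = {((i, j), (i + 1, j + 1)) for i in range(1, 4) for j in range(1, 4)} | \
    {((i, j), (i + 1, j - 1)) for i in range(1, 4) for j in range(2, 5)}

# Ultimate dependence sources are sources that are not targets of any
# dependence, i.e., UDS = domain(R) - range(R) (formula (4)).
domain_R = {s for (s, t) in R}
range_R = {t for (s, t) in R}
UDS = domain_R - range_R

# Removing the range of R_USC (formula (5), values from the text)
# leaves the two slice representatives.
R_USC_range = {(1, 3), (1, 4)}
S_REPR = UDS - R_USC_range
```

The enumeration reproduces the four ultimate dependence sources and the two representatives (1,1) and (1,2) stated above.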

Let us note that there are no independent statement instances for this example, i.e., set \(S\_REPR\_IND\) is the same as set \(S\_REPR\).

Applying Algorithm 1, we obtain the following set representing target tiles:

$$\begin{aligned} \begin{aligned} TILE\_VLD := [ii, jj] \rightarrow \{\, [i, j] \mid&\ i > 2ii \wedge i > 0 \wedge ((jj = 0 \wedge i \le 4 \ \\&\ \wedge 0 < j \le 3 + 2ii - i) \vee (jj = 1 \wedge ii \le 1 \ \\&\ \wedge i \le 2 + 2ii \wedge 4 + 2ii - i \le j \le 4)) \,\}. \end{aligned} \end{aligned}$$
Fig. 4 Illustrations for Example 2: a dependences, UDS, slice representatives, and slices, b original and target tiles, c slice with representative (1,1), d slice with representative (1,2)

Figure 4b shows original tiles of size \(2 \times 2\) and target tiles. The dependence graph on the target tile level is shown in Fig. 3b. As we can see, for this example, all target tiles are combined in a single slice, i.e., the approach described in paper [6] fails to extract any synchronization-free parallelism on the tile level.

Applying Algorithm 1 with slicing of tiles including multiple slice representatives, we first construct set SFS; we skip its mathematical representation because it is too long. Figure 4c, d shows the two synchronization-free slices with the representatives (1,1) and (1,2), respectively. As we can see, the elements of target tiles are divided between the two slices. Each slice includes different elements of all target tiles.

Eventually, for the purpose of the code generation phase, we form set CODE which contains the following elements:

{ [1, 1, 1, 1, 4, 4]; [1, 2, 1, 1, 3, 4]; [1, 1, 0, 1, 2, 4]; [1, 2, 0, 1, 1, 4]; [1, 2, 1, 1, 4, 3]; [1, 1, 1, 1, 3, 3]; [1, 2, 0, 1, 2, 3]; [1, 1, 0, 1, 1, 3]; [1, 1, 1, 1, 4, 2]; [1, 2, 1, 0, 3, 2]; [1, 1, 0, 1, 2, 2]; [1, 2, 0, 0, 1, 2]; [1, 2, 1, 0, 4, 1]; [1, 1, 1, 0, 3, 1]; [1, 2, 0, 0, 2, 1]; [1, 1, 0, 0, 1, 1] }.

The roles of the elements of this set are the following: the values of the first two elements correspond to a slice representative (1,1 or 1,2); the second pair stands for the identifier of the tile including elements of the slice defined with that representative (0,0 or 0,1 or 1,0 or 1,1); the third pair defines the iterations of the tile with the identifier represented with the second pair (from 1,1 to 4,4).

5.3 Discussion

The high-level concept presented in Sect. 5.1 does not depend on how synchronization-free slices on the statement instance level are extracted or on how target tiles valid under lexicographic order are derived; however, the way it is implemented defines the quality of the generated tiled code.

The result of extracting slices should be presented with a relation which maps each statement instance to the corresponding slice representative, while the result of tiling has to be presented with a relation which maps each statement instance to the corresponding tile identifier. In addition to those relations, an expert can provide a way of splitting slice representatives, together with a relation which maps each representative to the identifier of the tile generated by that splitting. Such a relation may allow for generation of better tiled code.

These three relations are then used to form a set, as shown in Sect. 5.1, from which tiled code is generated. They determine the effectiveness of the concept and the quality of the generated code: parallelism degree, program granularity, performance, and scalability all depend on the implementation of this concept. In this paper, to implement the introduced concept, we present and apply Algorithms A and 1, based on the transitive closure of dependence graphs. Applying other algorithms, for example those based on affine transformations, will lead to different tiled code.

6 Related work

There has been a considerable amount of research into tiling, demonstrating how to aggregate a set of loop nest iterations into tiles, with each tile as an atomic macro statement, from pioneering papers [17, 29, 35, 36] to those presenting advanced techniques [13, 16, 19, 20, 21].

Several popular frameworks are used to produce tiled code: the affine transformation framework based on the polyhedral model [9, 11, 12, 22], non-polyhedral approaches [19], and Iteration Space Slicing [27, 28].

The affine transformation framework is one of the most advanced reordering transformations. Let us recall that this approach includes the following three steps: (i) program analysis, aimed at translating high-level code to its polyhedral representation and providing data dependence analysis based on that representation; (ii) program transformation, with the aim of improving program locality and/or parallelization; (iii) code generation. All three steps are present in the algorithms introduced in this paper. However, step (ii) differs: in affine transformations, a (sequence of) program transformation(s) is represented by a set of affine functions, one for each statement, while the presented approach does not find or use any affine function for program transformation. It applies the transitive closure of a program dependence graph to transform invalid original tiles into valid target ones. From this point of view, the program transformation step is rather within the Iteration Space Slicing framework introduced by Pugh and Rosser [27]: "Iteration Space Slicing takes dependence information as input to find all statement instances from a given loop nest which must be executed to produce correct values for the specified array elements".

That is, we may conclude that the introduced algorithms are based on a combination of the Polyhedral and Iteration Space Slicing frameworks. Such a combination allows for improving the effectiveness of the loop nest tiling transformation. In the next section, we show that for the examined benchmarks, the presented approach extracts more synchronization-free parallelism than well-known affine transformations.

Papers [3, 7] demonstrate how to extract coarse- and fine-grained parallelism applying different Iteration Space Slicing algorithms, however, they do not consider any tiling transformation.

Paper [5] deals with applying transitive closure to only perfectly nested loops and does not present any algorithm to extract synchronization-free parallelism.

In paper [25], the authors present a way to extract synchronization-free parallelism using a relation representing inter-tile dependences. As we demonstrated in Sect. 5, such an approach can fail to discover synchronization-free parallelism on the tile level.

Paper [8] demonstrates how tiled code can be generated by applying free scheduling of tiles, but such an approach results in generating code with synchronization.

Paper [6] deals with tiling arbitrarily nested loops; however, extracting synchronization-free parallelism there is based on forming slices including valid tiles without splitting tiles that include multiple slice representatives. That prevents extracting synchronization-free parallelism on the tile level for some classes of loop nests.

Diamond and hexagonal tiling [2, 15] allows for scalable parallelism, but it can be applied only to stencil algorithms, while the approach presented in our paper is of general use.

Summing up, we may conclude that, provided that the exact transitive closure of a dependence graph can be calculated, the approach presented in this paper is able to generate synchronization-free parallelism on the tile level whenever there exists synchronization-free parallelism on the statement instance level. As far as disadvantages of the presented technique are concerned, we observed that for loop nests exposing non-regular dependences, the generated tiles are also non-regular, which results in more complicated code than that generated with affine transformations. Some target tiles can be parametric, i.e., their sizes depend on parametric upper loop index bounds. This does not guarantee that the data size per parametric tile is smaller than the cache capacity. In such a case, a parametric tile represents an iteration sub-space where tiling is excluded. We will address and illustrate these issues in our future work.

7 Implementation and experimental study

The algorithms presented in this paper have been implemented in the TC optimizing compiler, v.0.2.24, which utilizes the Polyhedral Extraction Tool [34] for extracting polyhedral representations of original loop nests and the Integer Set Library [30] for performing dependence analysis, manipulating integer sets and relations, and generating output code.

To evaluate the effectiveness of the presented approach and the performance of parallel tiled code generated by means of it, we have experimented with the PolyBench/C 4.1 [26] benchmark suite. Out of the 30 benchmarks contained in PolyBench, TC finds a total of 8 benchmarks for which there exists more than one synchronization-free slice on both the statement instance and tile levels. These loop nests are: 2mm, bicg, gemm, gesummv, mvt, syr2k, syrk, and trmm. The code generated by TC for the studied kernels can be found in the results directory of the compiler's repository.

The evaluation of parallel code performance was carried out on a multicore architecture (2x Intel Xeon E5-2699 v3 clocked at 2.3 GHz, 18 cores/socket, 36 threads/socket, 32 KB L1 data cache/core, 256 KB L2 cache/core, 45 MB L3 cache/socket, 256 GB RAM clocked at 2133 MHz). The code of both original and transformed loop nests was compiled under the Linux kernel 3.10.0 x86_64 by GCC 4.8.3 with the -O3 optimization option enabled.

Each examined loop nest was tiled with tiles of size 32 in each dimension. Such transformed code was then executed using 1, 2, 4, 8, 16, 32, and 64 threads in subsequent runs.
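For rectangular tiling of this kind, the tile identifier II = T(i) of an iteration amounts to a per-dimension integer division by the tile side (a minimal sketch; `tile_id` is our name, not TC's):

```python
B = 32  # tile side used in the experiments, in each dimension

def tile_id(point, b=B):
    """Identifier of the rectangular tile containing an iteration point."""
    return tuple(c // b for c in point)
```

For instance, with B = 32, iterations (0,0) and (31,31) fall into tile (0,0), while (32,0) already belongs to tile (1,0).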

The problem sizes used for the studied benchmarks are shown in Table 1, which also presents the number of synchronization-free slices extracted from each loop nest, expressed as formulas involving loop index upper bounds and a tile size constant B representing the width of a tile side in each dimension (32 in our study). The vertical bar "|" indicates that in code generated with TC and PLUTO v.0.11.4 [10], a state-of-the-art optimizing compiler based on the Affine Transformations Framework, there is a synchronization barrier between parallel regions, each including synchronization-free code; e.g., M|N means that first a loop nest of M synchronization-free slices is computed, then, after a barrier, another loop nest of N synchronization-free slices is executed.

The row Theoretical presents the cardinality of set \(S\_REPR\_IND\), i.e., the total number of synchronization-free slices extracted with Algorithm 1. The rows TC sliced and TC unsliced present the number of slices extracted when the tiles including multiple slice representatives are sliced (so that each such tile includes a single representative) and unsliced, respectively. The row PLUTO shows the number of slices extracted with PLUTO.

For several kernels, even though Algorithm 1 extracts only synchronization-free parallelism, compilable TC code has a synchronization point (for example, for the kernel mvt). This is explained by the way the TC post-processor generates compilable code. Set CODE, generated with Algorithm 1, may consist of several sub-sets. For each sub-set, the code generator may produce a separate loop nest (although all those loop nests are independent), and for each of them, the TC post-processor generates a separate parallel loop nest. Hence, we obtain a sequence of parallel loop nests with barrier synchronization between each pair of adjacent loop nests. As a result, TC does not exploit all the parallelism extracted with Algorithm 1.

Table 1 Problem sizes used for experiments; the theoretical and real number of slices

TC generates code where the parallelism of tiled loop nests is represented with the OpenMP API [24] as follows: the outermost loop, scanning representatives of synchronization-free slices in each loop nest, is decorated with the omp parallel for worksharing construct; the remaining loops enumerating the other variables responsible for slice representatives are serial. As a consequence, TC code contains fewer slices than the cardinality of set \(S\_REPR\_IND\), which represents the theoretical (maximal) number of slices. This is acceptable because the number of slices in code generated with TC is still greater than the number of CPUs available in the computer used, i.e., the parallelism degree of the generated code is sufficient to utilize all available processing units. As a result, the real number of independent execution flows in TC code (summarized in the two subsequent rows) is less than the theoretical one.

Analysing the content of Table 1, we may conclude that TC code contains fewer synchronization barriers than PLUTO code, except for the mvt benchmark. The reason is that TC and PLUTO generate different codes. In addition to tiling, PLUTO can simultaneously apply other optimizing transformations, for example, splitting the strongly connected component (SCC) graph into particular SCCs, fusion, and permutation. For the mvt code, PLUTO applies the fusion of two loop nests, producing a single loop nest, while TC does not utilize the fusion transformation and generates two parallel regions (each including one loop nest) separated by a synchronization barrier.

Fig. 5 Speed-up for parallel tiled code generated by TC and PLUTO

Figure 5 illustrates the speed-up of parallel tiled code produced by means of the presented algorithms implemented in TC and that of code generated by PLUTO, v.0.11.4, called with the options "--tile --parallel" and "--lbtile". The labels TC sliced and TC unsliced indicate execution times when tiles including slice representatives are sliced and unsliced, respectively. The speed-up has been computed as the ratio of the serial execution time of the corresponding original loop nest to the execution time of the parallel tiled code.
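Speed-up here follows the standard definition, serial execution time divided by parallel execution time (a trivial sketch; `speedup` is our name):

```python
def speedup(t_serial, t_parallel):
    """Speed-up: serial time of the original loop nest over parallel time."""
    return t_serial / t_parallel

# A value greater than 1 means the tiled parallel code runs faster
# than the original serial loop nest.
```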

For most of the studied programs, we can observe that the transformed code demonstrates speed-up greater than 1 even for single-threaded execution, due to program locality enhancement. The number of synchronization-free slices extracted from each kernel exceeds the total number of available cores and, as an expected consequence, speed-up grows as the number of threads increases. If slices can be distributed evenly between threads, the speed-up progression is close to linear. For all benchmarks, code generated by TC demonstrates equal or higher performance than code generated with PLUTO. For the kernels bicg, gemm, and gesummv, unslicing tiles with slice representatives leads to code exposing speed-up similar to that of code generated by PLUTO.

For the kernels 2mm, gesummv, syr2k, code based on slicing tiles with slice representatives exposes a higher degree of parallelism and scales better.

Based on the results obtained, we may conclude that parallel synchronization-free code on the tile level, generated by means of the presented approach, is a scalable solution which allows for achieving significant speed-up of numerical programs characterized by the presence of multiple independent execution flows.

8 Conclusion

In this paper, we considered automatic extraction of synchronization-free parallelism available in arbitrarily nested loops on the tile level, where elements of independent program fragments are tiles or sub-tiles. We demonstrated that well-known techniques do not guarantee generation of synchronization-free code on the tile level even when there exists synchronization-free parallelism on the statement instance level, where elements of independent program fragments are statement instances.

We presented a solution to the following problem: how to generate synchronization-free code on the tile level when there exists synchronization-free parallelism on the statement instance level. This solution includes the following steps: (i) extraction of synchronization-free parallelism on the statement instance level; (ii) generation of target tiles whose lexicographical execution order is valid; (iii) production of synchronization-free slices whose elements are valid tiles or sub-tiles; (iv) code generation. To realize steps (i) and (ii), we applied the transitive closure of dependence graphs. This results in an approach based on a combination of the Polyhedral and Iteration Space Slicing frameworks.

The presented approach was formalized with two algorithms, which were implemented in the publicly available TC optimizing compiler. Experiments allow us to derive the following conclusion: for all examined loop nests, the presented approach is able to generate synchronization-free code on the tile level when there exists synchronization-free parallelism on the statement instance level. In the future, we intend to implement the interchange and fusion transformations in the TC compiler to increase the locality of the generated code.