1 Introduction

As embedded software grows rapidly in size and complexity to satisfy diverse market demands, efficient use of limited storage resources is becoming ever more important for compilers targeting embedded processors. To attain a desired performance goal with such limited storage, these processors are designed on the assumption that the software running on them makes heavy use of their special hardware features. One potentially profitable route to efficient use of memory bandwidth is the DMA engine with burst transfer.

A large number of techniques to reduce memory access delay and increase memory bandwidth have been proposed, since the performance of the memory subsystem usually dominates overall system performance [6]. Most modern DRAMs used in embedded system design support efficient memory access modes; in particular, the burst access mode is extensively supported [5, 10]. In general, the latency of a random access is much higher than that of a burst access. Furthermore, memory operates at lower power per bit in burst mode than in random-access mode. For example, IBM's Cu-11 Embedded DRAM macro [10] and the ARM macro [1] support a random access of 10 ns and a burst-mode access of 5 ns, with an active current of roughly 60 mA/Mb in random-access cycles and 13 mA/Mb in burst mode. Consequently, full exploitation of the burst memory-access mode is essential to alleviate the performance bottleneck caused by memory access delay.

Panda et al. [14] proposed an algorithm for assigning local variables to memory and organizing array variables via loop-transformation techniques. To do this, they modeled a realistic burst-access mode in DRAMs. Their formulation of the variable-assignment problem is analogous to the one in [15], which was originally used to reduce the number of cache misses. Although the formulation looks reasonable, it is not exact in a strict sense. Khare et al. [12] extended the work in [14] to support the burst mode in a dual-memory architecture, focusing mainly on efficient interleaving of memory accesses.

Johnson et al. [11] designed an algorithm that identifies profitable memory-access sequences for combining into multiple-memory-access instructions as a means of compacting code. Grun et al. [7] proposed an approach that allows the compiler to exploit detailed timing information of the memories; they showed that further optimization (using a burst mode) is possible when accurate timing of memory accesses to multiple memories is extracted. They also presented an approach [8] for extracting, analyzing, and clustering the most active memory-access patterns in an application and customizing the memory architecture. Ayukawa et al. [2] proposed an access-sequence control scheme for relieving the page-miss penalty in random-access mode, introducing an embedded DRAM macro with special hardware logic for access-sequence control. McKee et al. [13] described a stream memory controller (SMC) whose access-ordering circuitry attempts to maximize memory-system performance based on device characteristics; their compiler detects streams, while access ordering and issue are determined by hardware. Chame et al. [4] manually optimized an application for a PIM-based (embedded-DRAM) system by applying loop unrolling and memory-access reordering to increase the number of burst-mode accesses. Shin et al. [16] designed and implemented a compilation technique that exploits the burst mode automatically; their technique unrolls loops and reorders memory accesses by hoisting code in the loop body. Hettiaratchi et al. [9] proposed an address-assignment technique for array variables that reduces the energy consumed by memory accesses by maximizing burst-mode accesses; they formulated the problem as multiway graph partitioning and solved it with a Kernighan–Lin-based partitioning algorithm.

All of the above approaches are related to our work in that they aim at finding an optimal memory (offset) assignment for a particular hardware feature such as the burst transfer. However, they differ from ours in that none of them addresses the size of variables: they all solve specific problems restricted to word-sized variables. The work presented here solves a generalized problem that maximally utilizes burst transfers, including for subword-sized variables.

In this paper, we propose a new general approach to the problem of relocating non-array variables in code to DRAM so that the efficient burst mode is maximally utilized. The first contribution of this work is to show that the problem is NP-hard; consequently, finding optimal solutions to relatively large problem instances within a reasonable amount of time looks infeasible, which validates our strategy of designing an efficient heuristic. The second contribution is that, unlike previous approaches, we solve the original problem based on a formal formulation.

The rest of this paper is structured as follows. Section 2 describes our generalized problem formulation. Section 3 presents the overall structure of the proposed approach. Section 4 shows the experimental results, and Section 5 concludes the paper.

2 Problem formulation

We first define a set of constraints that must be satisfied when generating burst accesses. Then, we formally define a variable placement problem (VPP) for maximizing burst accesses. The optimization problem is, given a set of read/write operations, to find a placement of stack variables that maximizes the number of burst accesses (e.g., on ARM chips [1]).

Definition 1

(Burst order constraints)

In a burst access, the locations of the m variables must be contiguous, starting from the address specified by the offset, i.e., Var[offset], Var[offset + Type], …, Var[offset + Type*(m-1)], where Type represents the size of the variables. For example, a four-beat burst accesses Var[offset], Var[offset + Type*1], Var[offset + Type*2], Var[offset + Type*3], i.e., sequentially beginning at the first access. Note that when a burst access is to be formed from reads or writes, the task of satisfying the burst order constraint is closely tied to the task of assigning variables to memory (i.e., variable placement).
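
To make the constraint concrete, the following minimal sketch prints the contiguous address sequence that a four-beat burst must follow; the values of offset, Type, and m are invented for the example and are not taken from the paper.

    #include <cstdio>

    // Minimal illustration of the burst order constraint: the m variables of a
    // burst occupy the contiguous addresses Var[offset + Type*i].
    int main() {
        const unsigned offset = 0x100;  // assumed starting address of the burst
        const unsigned type   = 4;      // assumed variable size (one word)
        const unsigned m      = 4;      // assumed burst length
        for (unsigned i = 0; i < m; ++i)
            std::printf("access %u -> Var at address 0x%X\n", i, offset + type * i);
        return 0;
    }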

Problem 1

(VPP) For a set RWtot of read/write operations in assembly code C, the VPP is to generate a set RWburst of burst accesses from RWtot while satisfying the burst order constraints as well as the data dependency constraints of C. The goal is to group as many read/write operations as possible into burst accesses, thereby reducing the number of remaining normal accesses (denoted |RWnorm|); the objective function to be minimized is the total number of accesses, |RWburst| + |RWnorm|, where |RWburst| is the number of burst accesses generated.

The objective function reflects memory access latency, which depends on hardware features. For the sake of generality, |RWburst| counts each burst transfer as a single access, regardless of how many normal R/W operations it serves, while |RWnorm| counts the remaining normal accesses. For example, if 4 words are transferred in one burst, 4 read operations are involved, but |RWburst| increases by only one: the first access accounts for most of the burst transfer time since it initiates the transfer, while the other 3 accesses of the burst hit the cache and take relatively little time compared with normal accesses, so we do not count them toward |RWburst|.
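
As a small worked example (with an invented operation count and grouping, not taken from the benchmarks), the sketch below evaluates the objective: each burst contributes one to |RWburst| no matter how many operations it serves, and every ungrouped operation contributes one to |RWnorm|.

    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical grouping of 13 R/W operations: three bursts serving
        // 4, 2, and 2 operations, plus 5 operations left as normal accesses.
        std::vector<int> burst_sizes = {4, 2, 2};
        int normal_ops = 5;

        int rw_burst = static_cast<int>(burst_sizes.size());  // each burst counts once
        int rw_norm  = normal_ops;
        std::printf("|RWburst| = %d, |RWnorm| = %d, objective = %d\n",
                    rw_burst, rw_norm, rw_burst + rw_norm);
        return 0;
    }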

We describe the VPP and our proposed solution using an example code fragment from a multimedia application, where the original ARM code is translated into 3-address form for readability. R-regions (read regions) and W-regions (write regions), marked with vertical bars on the right side of Fig. 1, represent the maximum ranges of cycle steps over which the corresponding read and write operations can be executed without violating the data dependencies specified in the code. For example, the read of variable d in basic block B1 has an R-region stretching from cycle 0 to cycle 6 in B1, because there is no write before the read in B1 and the value read into r2 is first used at cycle 6. Similarly, the W-region of the write to variable a in block B1 stretches from cycle 7 to cycle 8, since the written value is defined at cycle 6 and there is no read of a in B1 after the write.
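
As an illustration only (not the compiler's actual analysis), the following sketch computes R-regions for the reads of a single basic block from a simplified operation list; the Op structure, the cycle numbers, and the first-use field are assumptions standing in for a real def-use analysis.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Simplified view of one memory operation in a basic block.
    struct Op {
        int cycle;          // cycle at which the operation appears in the block
        bool is_read;       // true: load of `var`; false: store to `var`
        std::string var;    // memory variable accessed
        int first_use;      // for reads: cycle where the loaded value is first used
    };

    // R-region of a read: from the cycle just after the last earlier write to the
    // same variable in the block (or cycle 0 if none) up to the first use of the
    // loaded value, mirroring the informal definition in the text.
    void print_r_regions(const std::vector<Op>& block) {
        for (size_t i = 0; i < block.size(); ++i) {
            if (!block[i].is_read) continue;
            int lo = 0;
            for (size_t j = 0; j < i; ++j)
                if (!block[j].is_read && block[j].var == block[i].var)
                    lo = block[j].cycle + 1;
            std::printf("read of %s: R-region [%d, %d]\n",
                        block[i].var.c_str(), lo, block[i].first_use);
        }
    }

    int main() {
        // Invented block loosely resembling B1 in Fig. 1: the read of d at cycle 0
        // has its value first used at cycle 6, giving R-region [0, 6].
        std::vector<Op> b1 = { {0, true, "d", 6} };
        print_r_regions(b1);
        return 0;
    }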

The problem is then to generate burst accesses by grouping read/write operations whose R/W-regions overlap, so that the objective function is minimized while the burst order constraints are satisfied. A simple check shows that the variable placement critically affects whether the burst order constraint of each R/W operation can be satisfied. Before describing our approach to the VPP, we show that the problem is intractable.

Problem 2

(decision-VPP) For a VPP instance with (RWtot, C) and a value k, is there a variable placement that can generate RWburst such that |RWburst| + |RWnorm| ≤ k?

Problem 3

(decision-MWPC (maximum-weighted path cover)) For an edge-weighted complete graph G(V,W) and a value k’, is there a path cover P such that the total weight of the edges on P is greater than or equal to k’?

Theorem 1

decision-VPP is NP-complete.

Proof

Consider decision-VPP under the restrictions that (a) RWtot = {O1, O2, O3, O4, …, O2m-1, O2m}, (b) all operations in RWtot are reads, and (c) a burst access can be formed only from a pair O2j-1 and O2j, j = 1, 2, …, m, and from nothing else. We show that this restricted decision-VPP is NP-complete. (Note that the problem is in NP, since we can generate burst accesses and check the value of |RWburst| + |RWnorm| in polynomial time.) Let var(Oj) denote the variable read by Oj and let V be the set of all variables read by the operations in RWtot. Then, such an instance corresponds to a complete graph G(V’,W) where V’ = V and the weight of each edge (v1, v2) is the number of pairs O2j-1, O2j, j = 1, 2, …, m, in RWtot such that {var(O2j-1), var(O2j)} = {v1, v2}. We now show that decision-MWPC is polynomial-time reducible to decision-VPP.

Let G(V,W) be a graph in decision-MWPC and let P be a path cover on G. We construct a problem instance RWtot and a solution RWburst of decision-VPP from the instance G(V,W) and P of decision-MWPC. Initially, RWtot = {}. We assign a unique variable to each node of V in G. For an edge ei of positive weight (say wi), we create two unique read operations and add them to RWtot, and then decrement the edge weight by one; this repeats until the edge weight becomes zero, and the process iterates over the other edges of positive weight in G until no edge of positive weight remains. Note that if the total edge weight of G is L, then |RWtot| = 2L. We place the variables in memory in the order of the corresponding node sequence of P. Then, we create an RWburst solution from RWtot as follows:

For each edge ei in P of G, we extract the corresponding wi pairs of operations from RWtot and group each pair, forming wi burst accesses. Thus, if the total edge weight of P is k, there are k burst accesses (i.e., |RWburst| = k) and 2L - 2k normal accesses (i.e., |RWnorm| = 2L - 2k), so |RWburst| + |RWnorm| = 2L - k. Consequently, the constructed decision-VPP instance has a solution with |RWburst| + |RWnorm| ≤ 2L - k’ if and only if the decision-MWPC instance has a path cover P of total edge weight at least k’; setting the decision-VPP bound k = 2L - k’ completes the reduction.
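
The instance construction used in the proof can be sketched as follows; the example graph, its weights, and the data layout are invented for illustration. Every unit of edge weight becomes one pair of read operations on the edge's two variables, and only the two operations of a pair may be combined into a burst.

    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Edge { int u, v, w; };

    int main() {
        // Hypothetical MWPC instance on three variables 0, 1, 2.
        std::vector<Edge> edges = { {0, 1, 2}, {0, 2, 1}, {1, 2, 3} };

        // Each entry of `pairs` stands for one pair (O_{2j-1}, O_{2j}) reading the
        // two variables of an edge; an edge of weight w contributes w pairs.
        std::vector<std::pair<int, int>> pairs;
        for (const Edge& e : edges)
            for (int i = 0; i < e.w; ++i)
                pairs.push_back(std::make_pair(e.u, e.v));

        // Total edge weight L = 6, so |RWtot| = 2L = 12 read operations.
        std::printf("L = %zu pairs, |RWtot| = %zu operations\n",
                    pairs.size(), 2 * pairs.size());
        return 0;
    }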

3 The proposed approach

Given the intractability established by Theorem 1, we propose an efficient heuristic approach, called VPP Solver. The solver consists of two steps:

(Step 1) From RWtot, we construct a graph, called the variable placement graph Gvpp, which describes the relations between variables for forming burst accesses; (Step 2) we apply a burst-access forming technique to the Gvpp obtained in Step 1 to generate burst accesses that minimize |RWburst| + |RWnorm|.

Step 1.1 (Construction of Gvpp): The nodes of Gvpp represent the variables used in the read/write operations of RWtot. Gvpp is a multigraph; there is an edge between two nodes ni1 and ni2 for each pair of read operations or write operations Oj1 and Oj2 in RWtot that satisfies two conditions: (condition 1) {var(ni1), var(ni2)} = {var(Oj1), var(Oj2)}, where var(x) denotes the variable transferred by x if x is a read or write operation, or the corresponding variable if x is a node of Gvpp; (condition 2) the lifetimes of Oj1 and Oj2 overlap. Figure 2 shows the Gvpp of the code in Fig. 1. The graph consists of nodes, each representing a distinct variable used in the reads/writes, and each edge of Gvpp is labeled with a pair of operations: a pair of reads or writes of the variables corresponding to the end nodes of the edge. For example, “O3/O2” labels the edge (c, b) because O3 and O2 respectively read c and b, satisfying condition 1, and the lifetimes of O3 and O2 overlap, as shown by the R-regions of O3 and O2 in Fig. 1, satisfying condition 2. Figure 3 shows the algorithm build_Gvpp; a simplified sketch of the construction is given below.
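
The sketch below illustrates Step 1.1 under simplifying assumptions: each operation carries a precomputed R/W-region, lifetime overlap is taken as region overlap, and the data structures and tiny example are ours, not the build_Gvpp algorithm of Fig. 3 itself.

    #include <cstdio>
    #include <string>
    #include <vector>

    struct AccessOp {                 // simplified view of a read/write in RWtot
        int id;                       // operation index, e.g., 3 for O3
        bool is_read;
        std::string var;              // variable accessed
        int lo, hi;                   // R- or W-region [lo, hi] in cycles
    };

    struct GvppEdge { std::string v1, v2; int op1, op2; };

    // One multigraph edge per pair of same-kind operations on distinct variables
    // whose regions overlap (conditions 1 and 2 in the text).
    std::vector<GvppEdge> build_gvpp(const std::vector<AccessOp>& ops) {
        std::vector<GvppEdge> edges;
        for (size_t i = 0; i < ops.size(); ++i)
            for (size_t j = i + 1; j < ops.size(); ++j) {
                const AccessOp &a = ops[i], &b = ops[j];
                bool same_kind = (a.is_read == b.is_read);
                bool overlap   = (a.lo <= b.hi) && (b.lo <= a.hi);
                if (same_kind && a.var != b.var && overlap)
                    edges.push_back({a.var, b.var, a.id, b.id});
            }
        return edges;
    }

    int main() {
        // Invented regions loosely following Fig. 1: O2 reads b, O3 reads c.
        std::vector<AccessOp> ops = { {2, true, "b", 1, 6}, {3, true, "c", 2, 6} };
        for (const GvppEdge& e : build_gvpp(ops))
            std::printf("edge (%s, %s) labeled O%d/O%d\n",
                        e.v1.c_str(), e.v2.c_str(), e.op1, e.op2);
        return 0;
    }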

Fig. 1
figure 1

3 basic blocks (B1,B2,B3) of the 3-address code and the R/W-regions for reads/writes of each block

Fig. 2
figure 2

Gvpp derived from the code in Figure 1

Fig. 3
figure 3

Algorithm build_Gvpp

Step 1.2 (Adjustment of Gvpp): The previous step does not consider variable sizes. A node may correspond to a variable of any size, such as a one-byte character, a two-byte short integer, or an eight-byte double.

We observe that many media and network processing applications make extensive use of subword data. For such applications, packing multiple subword variables into a single word yields storage layouts that further reduce the cost of memory accesses.

Thus, the nodes of Gvpp should be adjusted so that each node occupies four bytes, since the basic unit of a burst transfer is a word (four bytes).

To fit each node into four bytes, we define an adjustment operation, called node coalescing, for subword variables. Double-type variables are normally manipulated as two word-sized variables, so no additional operation is needed to split them.

In the simplest case only two nodes are coalesced, for example two two-byte variables; in general, more than two variables may be coalesced into a single location, provided their combined byte widths do not exceed 4 bytes. While a single coalescing operation involves exactly two nodes, performing coalescing iteratively lets more than two variables share the same location.

Definition 2

(Node Coalescing)

Given an access graph G(V,E), nodes u, v ∈ V can be coalesced into a node uv if the following condition holds:

Bitwidth(u) + Bitwidth(v) ≤ 32 bits

The transformed graph G’(V’, E’) obtained after coalescing u and v is defined as follows:

 V’ = (V – {u, v}) ∪ {uv}

 Bitwidth(uv) = Bitwidth(u) + Bitwidth(v)

 E’:

     for each node x ∈ V – {u, v} do
        if u-x ∈ E and v-x ∈ E then
           uv-x ∈ E’, weight(uv-x) = weight(u-x) + weight(v-x)
        elseif u-x ∈ E and v-x ∉ E then
           uv-x ∈ E’, weight(uv-x) = weight(u-x)
        elseif v-x ∈ E and u-x ∉ E then
           uv-x ∈ E’, weight(uv-x) = weight(v-x)
        endif
     endfor
     for each edge x-y ∈ E with {x, y} ∩ {u, v} = φ do
        x-y ∈ E’, weight(x-y) unchanged
     endfor

The node coalescing operation defined above specifies the precondition under which nodes u and v can be coalesced and replaced by the node uv, and how the edges and weights of the transformed graph are obtained. The edge weight weight(x-y) is the number of times variables x and y are accessed one after another.
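
For concreteness, a minimal C++ sketch of node coalescing is given below; the graph representation, the node names, and the example bit widths are our own assumptions rather than the paper's implementation.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>

    // Undirected weighted access graph with per-node bit widths.
    struct Graph {
        std::map<std::string, int> bitwidth;
        std::map<std::pair<std::string, std::string>, int> weight;  // edge key kept sorted

        static std::pair<std::string, std::string> key(std::string a, std::string b) {
            return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
        }

        // Coalesce u and v into a node named u+v if their widths fit one 32-bit word.
        bool coalesce(const std::string& u, const std::string& v) {
            if (bitwidth[u] + bitwidth[v] > 32) return false;
            std::string uv = u + v;
            bitwidth[uv] = bitwidth[u] + bitwidth[v];

            std::map<std::pair<std::string, std::string>, int> new_weight;
            for (const auto& e : weight) {
                const std::string &a = e.first.first, &b = e.first.second;
                if (a == u || a == v || b == u || b == v) {
                    std::string x = (a == u || a == v) ? b : a;  // the other endpoint
                    if (x == u || x == v) continue;              // drop the u-v edge itself
                    new_weight[key(uv, x)] += e.second;          // parallel edges merge, weights add
                } else {
                    new_weight[e.first] += e.second;             // edge untouched by the coalescing
                }
            }
            weight = std::move(new_weight);
            bitwidth.erase(u);
            bitwidth.erase(v);
            return true;
        }
    };

    int main() {
        Graph g;
        g.bitwidth = { {"s1", 16}, {"s2", 16}, {"c", 8} };  // two shorts and a char (assumed)
        g.weight[Graph::key("s1", "c")]  = 2;
        g.weight[Graph::key("s2", "c")]  = 1;
        g.weight[Graph::key("s1", "s2")] = 3;
        if (g.coalesce("s1", "s2"))
            std::printf("weight(s1s2 - c) = %d\n", g.weight[Graph::key("s1s2", "c")]);  // 2 + 1 = 3
        return 0;
    }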

Step 2 (Generation of burst accesses): This step groups read/write operations into burst accesses using the Gvpp graph obtained in Step 1. Since the formation of burst accesses must satisfy the burst order constraints, and these constraints are tightly coupled to where the variables are placed in memory, the key task of this step is to find a placement of the variables that leads to the formation of burst accesses. We formulate the problem of finding such a placement as finding a path cover in Gvpp, and design the proposed algorithm as an iterative method.

Initially, we have an empty path P and the graph Gvpp. At each iteration, we select one of the “candidate” edges in Gvpp, expand P by including that edge, and update Gvpp by merging the edge’s two end nodes into a new node. More precisely, in the ith iteration, for every edge (vi, vj) in Gvpp whose inclusion in P does not create a cycle in the initial Gvpp, the operations corresponding to vi and vj would be merged into burst accesses, and the cost of the edge is computed. The cost is the number of burst R/W operations that would be involved in P; the first R/W operation of a burst is not counted, since it takes the full memory access time to initiate the burst transfer. After the cost calculation, we select the candidate edge (vi, vj) with the largest cost, merge the operations of vi and vj into burst accesses, add (vi, vj) to P, and update Gvpp by merging vi and vj and adjusting the connected edges accordingly. The process repeats until P becomes a path that covers all the nodes of the initial Gvpp. Figure 4 shows the step-by-step generation of burst accesses by the proposed algorithm from the Gvpp in Fig. 2. The table in Fig. 4a summarizes the edge costs in the initial Gvpp. Edge (a, d) is selected because its cost is maximal; consequently, two new burst accesses, O(1,6) and O(11,13), resulting from merging operations O1 with O6 and O11 with O13 respectively, are produced. The left side of Fig. 4b shows the updated Gvpp, where the thick edge indicates the (partial) path cover P. Edge (a, b) is then selected, and two more burst accesses are generated accordingly. Figure 4c shows the updated Gvpp and the cost computations in the third and fourth iterations of Step 2 of VPP Solver, together with the final path cover d-a-b-c, which is exactly the memory layout of the variables. From these results, we can predict a total reduction in memory latency of about 12 cycles if the ARM ISA (instruction set architecture) [1] takes three cycles for a normal load and one cycle per access in burst mode, which is 45 % of the total memory latency of the code in Fig. 1. A simplified sketch of this step follows Fig. 4.

Fig. 4
figure 4

Example for illustrating the detailed procedure of step 2: generation of burst accesses from Gvpp
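
The sketch below captures the spirit of Step 2 as a greedy maximum-weight path cover: repeatedly pick the heaviest candidate edge whose inclusion keeps P a set of simple paths (node degree at most 2, no cycle). It is a simplification; the actual VPP Solver recomputes edge costs after each node merge, whereas this sketch ranks the edges once, and the example costs are invented, only loosely following Fig. 4.

    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct Edge { std::string u, v; int cost; };

    // Tiny union-find over node names, used to detect cycles.
    struct DSU {
        std::map<std::string, std::string> parent;
        std::string find(std::string x) {
            if (parent.find(x) == parent.end()) parent[x] = x;
            while (parent[x] != x) x = parent[x];
            return x;
        }
        bool unite(const std::string& a, const std::string& b) {
            std::string ra = find(a), rb = find(b);
            if (ra == rb) return false;
            parent[ra] = rb;
            return true;
        }
    };

    int main() {
        // Invented edge costs for the variables a, b, c, d of Fig. 2.
        std::vector<Edge> edges = { {"a", "d", 2}, {"a", "b", 2}, {"b", "c", 1}, {"c", "d", 1} };
        std::sort(edges.begin(), edges.end(),
                  [](const Edge& x, const Edge& y) { return x.cost > y.cost; });

        DSU dsu;
        std::map<std::string, int> degree;
        std::vector<Edge> path_cover;
        for (const Edge& e : edges) {
            if (degree[e.u] >= 2 || degree[e.v] >= 2) continue;   // keep P a set of simple paths
            if (dsu.find(e.u) == dsu.find(e.v)) continue;         // adding e would close a cycle
            dsu.unite(e.u, e.v);
            ++degree[e.u]; ++degree[e.v];
            path_cover.push_back(e);                              // operations on e.u/e.v become a burst
        }
        for (const Edge& e : path_cover)
            std::printf("place %s adjacent to %s (cost %d)\n", e.u.c_str(), e.v.c_str(), e.cost);
        return 0;
    }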

Finally, each variable is assigned to memory following the obtained solution. At this point, variables can be placed in sequential, circular, or interleaved order, depending on the memory-subsystem design. Normally, sequential order is good enough; however, some hardware features, such as rotating register files or memory banks supported by a DMA engine, effectively support interleaved and circular orders. The two non-sequential orders are described below, and a small sketch of the resulting address sequences follows the list.

  • Circular placement: Var[offset], Var[(offset + Type*1)%4], Var[(offset + Type*2)%4], Var[(offset + Type*3)%4], wrapping at the end of the aligned block, e.g., 0-1-2-3, 1-2-3-0, 2-3-0-1, 3-0-1-2.

  • Interleaved placement: The interleaved burst mode computes the address using an exclusive-or operation between the burst counter and the address. With a starting address of 5, a 4-word burst returns words in the order 5-4-7-6, and an 8-word burst returns 5-4-7-6-1-0-3-2. This ordering is preferred by Intel chips.
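
The following sketch prints the word order produced by the three placements within an aligned block; the starting index 5 is taken from the interleaved example above, and everything else is illustrative.

    #include <cstdio>

    int main() {
        const int start = 5;

        std::printf("sequential  (4 words): ");
        for (int i = 0; i < 4; ++i) std::printf("%d ", start + i);

        std::printf("\ncircular    (4 words): ");
        for (int i = 0; i < 4; ++i)
            std::printf("%d ", (start & ~3) | ((start + i) & 3));   // wraps inside block 4..7 -> 5 6 7 4

        std::printf("\ninterleaved (4 words): ");
        for (int i = 0; i < 4; ++i)
            std::printf("%d ", (start & ~3) | ((start ^ i) & 3));   // XOR of counter and address -> 5 4 7 6

        std::printf("\ninterleaved (8 words): ");
        for (int i = 0; i < 8; ++i)
            std::printf("%d ", (start & ~7) | ((start ^ i) & 7));   // -> 5 4 7 6 1 0 3 2

        std::printf("\n");
        return 0;
    }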

4 Experimental results

We conducted a set of experiments to evaluate the effectiveness of the proposed variable-location optimization algorithm, VPP Solver. The algorithm was implemented in C++ and evaluated on a set of benchmark programs from DSPStone [17] on an ARM7 processor. VPP Solver is applied to the ‘unoptimized’ (initial) ARM assembly code of each benchmark program to see how much reduction in execution cycles VPP Solver alone can achieve.

  • Assessing the effectiveness of VPP Solver on initial codes:

    Two versions of ARM assembly code are generated for the benchmarks listed in Table 1. The first version uses normal data transfers, except at procedure boundaries where burst transfers are used in a limited way to save a few status registers; the second version is generated by applying VPP_Solver to the first. The first version places variables on the stack in alphabetically sorted order, since traditional commercial compilers (such as ARM ADS and TI’s Code Composer) arrange non-array variables in that order [1]; the second version places variables in the order produced by VPP_Solver.

    Table 1 Description of benchmark programs

    Table 2 summarizes the two code versions. Init_code and Init + VPP_Solver denote the first and second versions, respectively. R/W operations and #burst accesses indicate the number of memory accesses and the number of burst accesses among them in the corresponding code version, and total cycles indicates the total number of execution cycles. The results in Table 2 show that VPP Solver reduces execution time by 8.4 % on average by increasing burst accesses on a normal ARM processor. In addition, our hand analysis found that traditional register-allocation algorithms [3] generate spills without considering their effect on burst transfers. This observation suggests that a register allocator could reduce spill costs if it identified sets of registers that are likely to be spilled and reloaded together in a single burst transfer. As future work, we are currently extending the register allocator in our compiler framework along this direction.

    Table 2 Comparisons of performances before and after the application of VPP Solver to initial ARM assembly codes of DSPStone benchmarks

    Note that the exploitation of burst transfers by VPP_Solver does not lead to a dramatic decrease in total execution time. However, considering that this reduction comes on top of code on which the compiler has already made every effort at optimization (although this is not shown explicitly here), we believe the results are meaningful. Moreover, the proposed technique performs roughly 1 to 9 times more burst accesses, which brings a tangible reduction in energy consumption, as shown in Fig. 5; energy is an equally important performance metric in embedded processors.

    Fig. 5
    figure 5

    Energy consumption of the benchmark programs

  • Checking the reduction in total energy consumption:

    We used the access figures of the IBM Cu-11 Embedded DRAM macro [10] as a reference in the experiments: a random (i.e., normal) access cycle of 12 ns at best and a burst access of 4 ns at best. In addition, under a supply voltage of 1.5 V, the active current is 59.19 mA on average for random accesses and 11.21 mA for burst accesses. The results show that careful layout of variables to maximize burst accesses contributes considerably to reducing total memory-access energy. Note that the access sequence of a program on a specific real processor can be extracted only after code selection and register allocation, the machine-dependent parts of compilation, have been performed.

    Figure 5 shows how much total energy consumption is saved by our VPP_Solver over the conventional method. The energy consumption of each program under the conventional method is taken as the baseline (100 %), and each bar represents the corresponding relative energy consumption. The savings come from variable-location decisions that enable more burst accesses than the alphabetical-ordering policy does, resulting in fewer accesses to main memory. As a result, the proposed technique consumes 9 to 47 % less energy, as shown in Fig. 5.

5 Conclusions

In this paper, we proposed a comprehensive solution to the problem of variable assignment for fully utilizing an efficient DRAM access mode, the burst transfer. Since memory access is one of the major bottlenecks in embedded-system performance, the proposed technique can be effectively used in DRAM-access-intensive embedded applications. Our contributions are a proof that the problem is NP-hard, a formal definition of the problem, and an efficient variable-layout algorithm, VPP_Solver, for generating burst-mode accesses. A set of experiments with benchmark programs showed that the proposed technique uses, on average, 7.2 times more burst accesses than the code generated by the commercial ARM compiler.