1 Introduction

One of the main pitfalls of Very Long Instruction Word (VLIW) and Transport Triggered Architectures (TTAs) [1] is poor code density, which is caused by the long instruction word and the No Operations (NOPs) that have to be inserted into the program code. A VLIW or TTA processor has several execution units in order to achieve high performance on computationally intensive inner loops, often with the aid of unrolling and software pipelining. Each of the execution units has an execution slot in the wide instruction word. There is often a large amount of control-oriented code outside the inner loops which cannot exploit the parallel execution units and is, therefore, represented with instructions where most of the execution slots are NOPs. In a straightforward encoding, this control code consumes a large amount of instruction memory.

In order to save instruction memory and fetch power, control code should be encoded in a more compact form than full-size VLIW instructions. One common approach is to structure instructions as a series of operations, with, e.g., start and stop bits to delimit operations that can be executed in parallel. Dynamic hardware then detects bundles and routes operations to the appropriate execution units. Conte et al. discuss I-fetch and cache design tradeoffs in this type of encoding [2]. However, the hardware to support any combination of NOPs in each instruction is complex. Other processors simplify the decode apparatus by specifying a limited number of instruction templates: each template specifies some execution slots as implicit NOPs that do not consume space, but other NOPs have to be explicitly encoded. In a reasonable implementation, there are only a few different instruction templates [3, 4].

An example of template selection and NOP removal for a 5-issue processor is shown in Fig. 1. In this example, a large number of NOPs can be seen in the four instruction words. Two new instruction formats are assigned to the templates '10' and '11', which only use the execution slots A, B and D, E, respectively. The rest of the execution slots in these two formats are considered NOP slots. If NOPs appear in the NOP slots, they are removed from the instruction. Three instructions can be encoded with these templates, removing a majority of the NOP operations in the example program. Considerable instruction memory savings can be achieved simply by scheduling the instructions for maximum performance, without optimizing the code for the implicit NOP slots, and using the shorter instructions whenever the NOP patterns happen to match. This is, however, suboptimal, as compiler optimizations can increase the usage of the short instructions.

Figure 1

A short program before (left) and after (right) assigning two new instruction formats, which define two execution slots to be used out of the five in the processor. Most of the NOP operations are removed by using the shorter instruction formats in the 2nd, 3rd and 4th instruction. IMM denotes a long immediate value which is transferred to the data path of the processor by encoding it into two execution slots of an instruction word, in an instruction template where these two execution slots cannot be used for ordinary operations.
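To make the template mechanism of Fig. 1 concrete, the following C++ sketch shows how a decoder could expand a templated instruction into the five slot fields. The 2-bit template field, the slot masks and the NOP encoding are illustrative assumptions, not a description of any particular implementation.

#include <array>
#include <cstdint>

constexpr int kSlots = 5;            // execution slots A..E
constexpr uint32_t kNopPattern = 0;  // assumed encoding of an explicit NOP

// For each 2-bit template, the mask of slots that are actually encoded in
// the instruction word (bit 0 = slot A). Slots absent from the mask are
// implicit NOPs and consume no instruction bits.
constexpr std::array<uint8_t, 4> kSlotMask = {
    0b11111,  // '00': full-width instruction, all five slots present
    0b11111,  // '01': full width, with a long immediate occupying two slots
    0b00011,  // '10': only slots A and B are encoded
    0b11000   // '11': only slots D and E are encoded
};

// Expand a fetched instruction into five slot fields, inserting NOPs for
// the slots that the template leaves implicit.
std::array<uint32_t, kSlots> expand(uint8_t templ, const uint32_t* fields) {
  std::array<uint32_t, kSlots> slots{};
  int next = 0;
  for (int s = 0; s < kSlots; ++s) {
    const bool present = (kSlotMask[templ] >> s) & 1;
    slots[s] = present ? fields[next++] : kNopPattern;
  }
  return slots;
}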

In this paper, two solutions to this problem are presented and compared: (a) prioritizing function units that have implicit NOP slots associated with them in fewer instruction templates and (b) executing a post-scheduler NOP optimizer which utilizes the slack of the schedule. In the post-scheduler approach, operations which have slack are moved into different instruction words so that the available instruction templates are utilized closer to their maximum capacity. Prioritizing function units may decrease the performance of the code, as it may conflict with other, more performance-critical methods of function unit prioritization. The post-scheduler optimizer should have minimal effect on performance.

We have already reported some preliminary results in [5] and, in this paper, we propose two additional improvements: performing prioritization only for basic blocks outside inner loops so as to decrease the performance penalty, and speculatively scheduling basic blocks with both heuristics and selecting the one with the better code density. These improvements make FU prioritization a viable optimization in practice, unlike the original algorithm, which uses the same heuristics for all basic blocks in the program. Some of the technical content is also presented in more detail. In addition, an advanced version of the instruction scheduler is used, which explains the differences in some results compared to [5].

This paper is organized as follows. Section 2 discusses previous research related to the proposed methods. Section 3 presents the proposed optimizations: Subsection 3.2 explains the function unit selection heuristics and Subsection 3.3 introduces the proposed post-scheduler optimizer algorithm. Section 4 contains the evaluation of the two proposed methods. Section 5 discusses potential future research related to the topic. Section 6 concludes the paper.

2 Related Work

Lee et al. [6] introduce a post-scheduler optimization algorithm to minimize the instruction fetch and control logic transitions between successive instructions. In this method, horizontal and vertical rescheduling of operations is performed, moving operations both between instructions and between execution slots within the same instruction. Their method, however, does not consider NOP usage and does not try to optimize the code size, but their work has been an inspiration for the post-optimizer pass presented in this paper.

Hahn et al. [7] propose a variable-length encoding for VLIW. This method has “protected” versions of many long-latency operations and control operations. These versions of the operations add pipeline stalls after the operations, so that there is no need to add subsequent instruction words containing only NOPs. Their instruction scheduler fills the delay slots and instruction words after a long-latency operation with usable instructions, and uses the ordinary version of the operation if possible, to maximize performance. In case the scheduler cannot place any useful operations in delay slots or instructions after some long-latency operation, it replaces the operation with the “protected” version of the operation. When optimizing for minimal code size, the compiler always uses the protected versions of instructions, resulting in lower performance but eliminating all NOPs due to delay slots and long-latency operations.

In [8], a method of collapsing the prolog and the epilog of a software-pipelined loop is introduced. This optimization can be combined with the proposed methods, as they attack different parts of the code size problem.

The approach of Jee et al. [9] eliminates many NOP operations by encoding only data dependencies and enabling different execution slots to execute operations from different instruction words. This, however, requires considerable changes to the processor architecture, adds complexity to the processor's control logic, and requires some support for dynamic scheduling in the processor.

Some studies approach the code density problem at a lower level by compressing code with, e.g., Huffman coding. Ros et al. [10] propose compiler optimization passes that improve the compression ratio by reassigning registers. Larin and Conte [11] have the compiler generate an optimal Huffman coding and decompression hardware to go with each program. These approaches might be applied on top of a template-based instruction format.

Haga et al. [12] introduce a method to minimize code size with global scheduling. Their approach, however, does not consider optimizing the code size for variable-length instructions.

Haga et al. [13] discuss compiler optimizations for explicitly parallel instruction computing (EPIC) architectures. EPIC is a variation of VLIW where instructions are divided into regions that can be executed in parallel by means of conceptual stop bits. They are fetched in fixed-size bundles, each of which contains a template field indicating the placement of stop bits within the bundle. Parallel execution groups can span multiple bundles. Haga et al. introduce an algorithm to create schedules which are optimal in both performance and code size for infinitely wide EPIC processors which are never resource constrained. The optimal algorithm becomes slow when large basic blocks are scheduled, so the authors propose non-optimal heuristics to overcome this problem. The instruction templates in EPIC architectures are, however, quite different from the variable-width templates discussed in this work, and their method can only be used for EPIC-type instruction templates.

Most of the related work [7, 9–11, 13] concentrates on instruction encodings which are more complex than simple variable-length instruction templates, and the techniques cannot be applied to simple variable-length templates. Some of the related work [8, 12] proposes code size optimizations generic enough to be used together with the techniques proposed in this paper.

3 Proposed Optimizations

3.1 Baseline

In this work, the cycle-based list scheduler [14] is used as the baseline, and the proposed optimizations are compared against this method. The scheduler used in this paper schedules instructions one basic block at a time, but also includes a postpass delay slot filler which performs some inter-basic-block code motion. The baseline method schedules operations to the best possible cycle, but if multiple execution units can execute the operation in the same cycle, the function unit supporting the fewest operation types is selected. If there are multiple such units, the connectivity of the function unit is considered and the unit with the least connectivity is selected. The rationale for this is that if, e.g., addition can be executed on both function units A and B, but only unit A supports multiplication, add operations should preferentially be scheduled to unit B, so that unit A is free to execute the multiplications. The optimizations could also be used with different instruction schedulers.
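A minimal sketch of this baseline tie-breaking follows; the FunctionUnit record is a hypothetical stand-in for the scheduler's actual resource model.

#include <vector>

struct FunctionUnit {
  int supportedOpCount;  // number of operation types the unit implements
  int connectivity;      // number of bus/socket connections of the unit
};

// Among candidate units that can all execute the operation in the chosen
// cycle, prefer the unit supporting the fewest operation types; break
// remaining ties with the lowest connectivity. Returns an index into
// candidates, or -1 if the candidate list is empty.
int selectUnit(const std::vector<FunctionUnit>& candidates) {
  int best = -1;
  for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
    if (best < 0 ||
        candidates[i].supportedOpCount < candidates[best].supportedOpCount ||
        (candidates[i].supportedOpCount == candidates[best].supportedOpCount &&
         candidates[i].connectivity < candidates[best].connectivity)) {
      best = i;
    }
  }
  return best;
}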

3.2 Function Unit Selection

A prioritize NOP-slots option was added to the function unit selection. This option works by counting the instruction templates in which a function unit can be encoded as an implicit NOP, and deprioritizing the function units with the highest implicit NOP count. If two or more function units have the same NOP slot count, the instruction scheduler reverts to the old performance-optimized mechanism when selecting between those function units.
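The following sketch illustrates the option under assumed data structures: templateNopSlots lists, for each instruction template, the function units whose slots are implicit NOPs in that template, and a stable sort preserves the baseline performance-optimized order between units with equal counts.

#include <algorithm>
#include <vector>

// templateNopSlots[t] lists the indices of the function units whose slots
// are implicit NOPs in instruction template t.
int implicitNopCount(int fu,
                     const std::vector<std::vector<int>>& templateNopSlots) {
  int count = 0;
  for (const auto& nops : templateNopSlots)
    count += static_cast<int>(std::count(nops.begin(), nops.end(), fu));
  return count;
}

// Deprioritize units that act as implicit NOPs in many templates; equal
// counts keep their original, performance-optimized order.
void prioritizeForCodeSize(
    std::vector<int>& candidates,
    const std::vector<std::vector<int>>& templateNopSlots) {
  std::stable_sort(candidates.begin(), candidates.end(), [&](int a, int b) {
    return implicitNopCount(a, templateNopSlots) <
           implicitNopCount(b, templateNopSlots);
  });
}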

We experiment with three modes of operation for the function unit selection optimization. The first mode schedules all code with the size-optimized function unit heuristic. The second mode uses the size-optimized function unit heuristic only for code outside inner loops. The third mode compiles every non-inner-loop basic block with both heuristics and selects the code with the smaller size, as sketched at the end of this subsection, since the original heuristic may sometimes generate denser code. Inner loops are scheduled with the baseline heuristics to minimize performance degradation, just like in the second mode. Even the third mode can result in a small performance degradation, as code outside inner loops can grow in instruction count while actually decreasing in code size.

The idea in the second mode is to minimize the execution time penalty of the optimization by not performing it in the parts of the code that are executed the most. There is sufficient code outside inner loops to still allow considerable size savings.
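Under assumed interfaces, the third mode can be sketched as follows; BasicBlock, Schedule and scheduleBlock are hypothetical placeholders for the compiler's real data structures.

enum class FuHeuristic { Baseline, PrioritizeNopSlots };

struct BasicBlock { bool inInnerLoop = false; /* ... */ };
struct Schedule { int encodedBits = 0; /* ... */ };

// Assumed entry point that schedules one basic block with the given
// function unit selection heuristic.
Schedule scheduleBlock(const BasicBlock& bb, FuHeuristic heuristic);

Schedule scheduleForSize(const BasicBlock& bb) {
  // Inner loops always use the performance-oriented baseline heuristic.
  if (bb.inInnerLoop)
    return scheduleBlock(bb, FuHeuristic::Baseline);
  Schedule base = scheduleBlock(bb, FuHeuristic::Baseline);
  Schedule sized = scheduleBlock(bb, FuHeuristic::PrioritizeNopSlots);
  // Keep whichever version encodes into fewer bits; a version with more
  // instructions can still be the smaller one after NOP compression.
  return sized.encodedBits < base.encodedBits ? sized : base;
}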

3.3 Post-scheduler Optimizer Algorithm

The main motivation of the post-scheduler optimizer is to make better use of NOP slots without decreasing performance; decisions taken during the actual instruction scheduling phase would affect the schedule and might decrease the performance of the program.

Figure 2 shows an example of how rescheduling can improve the usage of the shorter instruction templates. In this example, there are no data dependencies limiting the rescheduling while, in a real situation, the data dependencies would usually not allow all the operations to be rescheduled, and the benefit from the optimization would be smaller than in the example.

Figure 2

A short program without (middle) and with (right) the post-scheduler rescheduling. The left side shows the instruction templates used. The upper row shows the operations in full-width instructions and the bottom row shows the instructions after the NOP-slot compression is applied. Without the optimization (middle column), using both the B and D slots simultaneously prevents the usage of any short template in instruction 2, using both the A and E slots simultaneously prevents the usage of any short template in instruction 3, and using slot C prevents the usage of any short template in instruction 4. In the optimized version, D2 is moved to instruction 3, A3 and B2 are moved to instruction 4 and C4 is moved to instruction 6. This allows instruction template '10' to be used for instructions 2 and 4 and instruction template '11' to be used for instruction 3.

The algorithm is run for every basic block after that basic block has been scheduled, but before the inter-basic-block delay slot filler.

Figure 3 shows how the algorithm first pushes all moves or operations into a queue. After this, the main algorithm is iterated as long as the queue contains elements. Only one instance of each operation can be in the queue at a time.

Figure 3

The NOPOptimizer outer loop routine.

In the main loop, the first element in the queue is popped and processed. If the instruction where the operation belongs is already full, nothing is done for that operation; such instructions are already optimally encoded and contain no wasted bits. Rescheduling their operations might sometimes relax the dependencies of other operations and allow a more optimal placement of those operations, but these situations are rare, and rescheduling operations in full instructions would prevent the algorithm from finishing naturally. Jump and call operations are not moved, since this would affect the length of the basic block and, therefore, the performance of the code. This is illustrated in Fig. 4, lines 1-5.

Figure 4

The tryPushNode helper routine for the NOPOptimizer.

Lines 6-18 in Fig. 4 show how the slack of the schedule is considered: the data dependencies of the operation are checked, and the earliest and latest possible cycles for the operation are calculated based on the data dependencies. If an operation has no data producers limiting how early it can be scheduled, it is not moved earlier, and if it has no data consumers limiting how late it can be scheduled, it is not moved later. This guarantees that the basic block cannot get longer and that later inter-basic-block code motion optimizations do not lose optimization opportunities due to the NOP optimization.

The operation is then unscheduled and rescheduling is attempted to both earlier and later cycles, until either it is scheduled to an instruction which will not grow longer due to the rescheduling, or the data dependence limits are reached. Moving operations in both directions allows the same algorithm to be used for both top-down and bottom-up scheduled code, and also allows inefficient reschedules to be reverted later without special backtracking logic. When an operation is scheduled into a new instruction, its predecessors are requeued if the operation was moved forward, and its successors are requeued if the operation was moved backward. This is shown in Fig. 4, lines 19-39, and in Fig. 5.

Figure 5

The tryToMoveToCycle helper routine for the NOPOptimizer.
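A condensed sketch of this logic follows. Operation, Instruction and the helper calls are hypothetical placeholders for the scheduler's real interfaces, and the pseudocode in Fig. 4 remains the authoritative description.

#include <deque>
#include <initializer_list>

struct Instruction { bool isFull() const; };
struct Operation {
  Instruction* instruction();
  bool isJump() const;
  bool isCall() const;
  bool hasProducers() const;
  bool hasConsumers() const;
  int earliestCycleByDeps() const;  // earliest cycle allowed by producers
  int latestCycleByDeps() const;    // latest cycle allowed by consumers
  int cycle() const;
  void unschedule();
  void scheduleAt(int cycle);
  std::deque<Operation*> predecessors();
  std::deque<Operation*> successors();
};

// Assumed helpers: tryToMoveToCycle succeeds only if the target instruction
// does not grow longer (cf. Fig. 5); requeue adds operations to the queue,
// skipping operations that are already queued.
bool tryToMoveToCycle(Operation* op, int cycle);
void requeue(std::deque<Operation*>& queue, std::deque<Operation*> ops);

bool tryPushNode(Operation* op, std::deque<Operation*>& queue) {
  // Lines 1-5: leave full instructions and control flow operations alone.
  if (op->instruction()->isFull() || op->isJump() || op->isCall())
    return false;

  // Lines 6-18: derive the slack window from the data dependencies. An
  // operation without producers is never moved earlier and one without
  // consumers is never moved later, so the basic block cannot grow.
  const int origin = op->cycle();
  const int earliest = op->hasProducers() ? op->earliestCycleByDeps() : origin;
  const int latest = op->hasConsumers() ? op->latestCycleByDeps() : origin;

  // Lines 19-39: unschedule, then probe cycles in both directions until a
  // move lands in an instruction that does not grow longer.
  op->unschedule();
  for (int d = 1; origin - d >= earliest || origin + d <= latest; ++d) {
    for (int cycle : {origin - d, origin + d}) {
      if (cycle < earliest || cycle > latest)
        continue;
      if (tryToMoveToCycle(op, cycle)) {
        // Requeue neighbours whose slack the move may have changed:
        // predecessors after a move forward (later), successors after a
        // move backward (earlier).
        if (cycle > origin) requeue(queue, op->predecessors());
        if (cycle < origin) requeue(queue, op->successors());
        return true;
      }
    }
  }
  op->scheduleAt(origin);  // no better slot found; restore the original one
  return false;
}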

When new operations are popped from the queue, a counter limits how many times each operation can be rescheduled; this prevents the algorithm from going into an infinite loop, endlessly rescheduling two consecutive operations back and forth. This is shown in Fig. 3, lines 6-10.
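A minimal sketch of the outer loop under the same assumed interfaces; the per-operation limit of ten reschedules is an illustrative value, not taken from the actual implementation.

#include <deque>
#include <unordered_map>

struct Operation;  // as in the previous sketch
bool tryPushNode(Operation* op, std::deque<Operation*>& queue);

// Process the queue until it is empty, capping how many times any single
// operation may be processed and thus rescheduled (cf. Fig. 3, lines 6-10).
void optimizeBasicBlock(std::deque<Operation*>& queue) {
  constexpr int kMaxReschedules = 10;  // assumed cap per operation
  std::unordered_map<Operation*, int> rescheduleCount;
  while (!queue.empty()) {
    Operation* op = queue.front();
    queue.pop_front();
    if (++rescheduleCount[op] > kMaxReschedules)
      continue;  // give up on operations that ping-pong back and forth
    tryPushNode(op, queue);  // may requeue neighbours of a moved operation
  }
}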

4 Evaluation

4.1 Benchmarks

We evaluate the performance of our methods with a subset of the CHStone [15] benchmark suite. This suite is selected since it contains a range of real-world routines, not microbenchmarks, with varying amounts of control code and instruction-level parallelism. The tests adpcm, gsm, mips, jpeg, aes, blowfish and sha are used. The software floating-point tests dfadd, dfmul, dfdiv and dfsin are omitted since they are microbenchmarks with a very small code footprint and are therefore not good benchmarks for code size measurement.

4.2 Processor Architectures

In order to measure the efficiency of the optimizations in practice, two Transport Triggered Architecture (TTA) type VLIW processors were developed using the TTA Codesign Environment (TCE) [16], and the compiler for the TCE toolset was modified to include the proposed optimizations.

A TTA-type VLIW gives the compiler extra freedom to transfer some operands in instructions earlier than the one in which execution starts, and to read results later than they are produced. The implemented version of the post-scheduler optimizer algorithm takes advantage of this by rescheduling individual moves instead of whole operations.

The processor interconnect was developed with a method resembling the one in [17], to give the processor a reasonable interconnect and register file structure.

The first processor architecture, threeway, is a 3-issue processor with a 96-bit instruction word. The overall processor architecture was designed to give relatively good performance on the CHStone tests while keeping the instruction width at 96 bits, so as to have a good balance between performance and instruction size even without variable-length instruction encoding. The processor has six buses, each of which has its own slot in the instruction encoding. The first two buses are connected to a combined Load-Store Unit (LSU) and Arithmetic-Logical Unit (ALU). The third and fourth buses are connected to a combined ALU and multiplier, while the fifth and sixth buses are connected to an ALU and the control unit. This processor has three register file read ports and one register file write port. Figure 6 shows the organization of the processor.

Figure 6

Organization of the threeway processor used in the evaluations. Function units and register files are on top, interconnect buses and sockets on the bottom. The small boxes on the function units and register files are ports; X indicates the trigger port on the function units.

The processor has four different instruction templates: one 32 bits long, one 48 bits long, and two full-length ones, one with a long immediate value and one without. The 32-bit instruction template was selected by first finding all bus combinations that can be encoded in 32 bits and then selecting the one used most often when the adpcm code from the CHStone benchmark was compiled without any compiler optimizations for NOP slot usage. The 48-bit instruction template was selected in the same manner from the combinations that can be encoded in 48 bits.
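A sketch of this selection procedure is given below, assuming each instruction of the reference program is summarized as a bitmask of the buses it uses; the slot widths, the template field width and the representation are illustrative assumptions.

#include <cstdint>
#include <vector>

struct Bus { int slotBits; };  // encoding width of the bus slot

// An instruction is summarized by a bitmask of the buses it actually uses.
using BusMask = uint32_t;

// Enumerate the bus subsets whose encoded slots fit in budgetBits and pick
// the subset matching the most instructions of the reference program (here,
// adpcm compiled without NOP slot optimizations).
BusMask selectTemplate(const std::vector<Bus>& buses,
                       const std::vector<BusMask>& program,
                       int budgetBits, int templateFieldBits) {
  const int n = static_cast<int>(buses.size());
  BusMask best = 0;
  long bestCount = -1;
  for (BusMask subset = 0; subset < (1u << n); ++subset) {
    int bits = templateFieldBits;
    for (int b = 0; b < n; ++b)
      if (subset & (1u << b)) bits += buses[b].slotBits;
    if (bits > budgetBits)
      continue;  // this combination cannot be encoded in, e.g., 32 bits
    long count = 0;
    for (BusMask used : program)
      if ((used & ~subset) == 0)  // all used buses fit; the rest are NOPs
        ++count;
    if (count > bestCount) { bestCount = count; best = subset; }
  }
  return best;  // bus subset chosen for the new short template
}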

The second processor architecture, fourway, is a 4-issue processor with a 128-bit instruction word. The processor has eight buses, each of which has its own slot in the instruction encoding. The organization of the fourway processor is such that most code with a low level of instruction-level parallelism (ILP) can execute using only the first two buses and the first function unit, as these are connected to both a combined ALU and LSU and a separate control unit. The 3rd and 4th buses are connected to a combined ALU and multiplier, and the 5th and 6th buses are connected to another combined ALU and multiplier. The 7th and 8th buses are connected to an ALU. This organization of the function units is relatively close to the default configuration of HP's VLIW EXample (VEX) processor architecture [18], with the difference that in VEX the control unit is combined with the last ALU. The fourway processor, however, has considerably fewer register file ports (three read, two write) due to TTA-specific optimizations such as software bypassing and operand sharing decreasing the need for ports.

The instruction templates in the fourway processor architecture are such that the first and second buses can contain moves in all instruction templates. In the 40-bit template, all the other slots are implicit NOPs, and in the 72-bit template there is also a 32-bit immediate value in addition to the first two buses. The short templates in this processor are wider than the templates in the threeway processor because of the requirement that the first two buses hold moves in all of the instruction templates.

4.3 Evaluation Results

Tables 1 and 2 show the performance, instruction counts and code sizes of the CHStone benchmarks with the different optimization methods on the two processors. The baseline “No optimization” in these results means that the shorter instruction templates are used when the compiler happens to generate instructions which can be encoded with them, but no optimizations are performed to encourage their usage.

Table 1 Instruction template usage on the CHStone benchmark with and without the proposed optimizations on a 3-issue, 4-template processor.
Table 2 Instruction template usage on the CHStone benchmark with and without the proposed optimizations on a 4-issue, 4-template processor.

On higher-ILP workloads such as adpcm, blowfish and aes, prioritizing the function units in all basic blocks had a considerable negative effect on the performance, and also on the number of instructions. In these cases, the increase in instruction count usually consumed more instruction memory than was saved by the better usage of the smaller instructions, and the total program memory size increased by 1.7 - 6.6 %. The worst slowdown occurred with the blowfish benchmark, where the program slowed down by 28.3 %. On more control-oriented low-ILP workloads such as gsm and mips, prioritizing function units in all basic blocks caused a smaller slowdown on both processors, and on the fourway processor decreased the program memory size by 6.9 - 7.1 %. The sha benchmark behaved in a similar fashion to the gsm and mips benchmarks, even though it has more ILP, gaining 2.7 % from the function unit prioritization. On the threeway processor, the results of prioritizing function units in all basic blocks were negative also in the low-ILP cases. On average, always applying the FU prioritization resulted in a 1.2 % increase in code size with an average slowdown of 7.2 %.

Applying the function unit prioritization only to code outside inner loops usually decreased the slowdown from the function unit selection, but in many cases also resulted in a smaller code size decrease. On the threeway processor, however, this did not make the function unit prioritization beneficial, as the code size was still larger than without the optimization; applying the optimization only outside inner loops merely made the code size penalty less severe. On the fourway processor, the best result was achieved in the gsm test, where this optimization yielded the smallest code size, 9.1 % smaller than the unoptimized version, with only a 0.5 % slowdown. In this case, always applying the FU prioritization gave only 7.1 % savings, with a large 7.7 % slowdown. Especially on sha, applying the prioritization only outside inner loops helped: the code had only a 1.5 % slowdown, compared to the 16.1 % slowdown when the optimization was applied to all the basic blocks, and the code size was actually smaller, a 3.8 % decrease versus a 2.7 % decrease. On average, this optimization was nevertheless also harmful, causing a 0.3 % code size increase and an average slowdown of 4.1 % across all the test cases.

Scheduling every non-inner-loop basic block with both FU selection heuristics, and reverting to the baseline function unit selection heuristic for the basic blocks where FU prioritization produced larger code, gave much better results. This mode practically eliminated the cases where longer code with more instructions caused a code size increase. This optimization was beneficial even on the threeway processor, with an average code size saving of 0.4 % while causing on average only a 0.2 % slowdown. On the fourway processor the results were much better, achieving a best case of 10.7 % and an average of 5.6 % saved code size at the cost of an average slowdown of 0.5 %.

The post-optimizer pass had a more stable effect on both performance and code size on both processors. The code size decrease was in the range between 0.0 % and 10.5 %. The performance in all cases was very close to the original, on average being 0.3 % better than the non-optimized version. The average code size decrease was 3.4 % for the threeway processor and 1.5 % for the fourway processor, the average of both being 2.4 %. The reason for the weaker improvement with the fourway processor is that operations on any function unit other than the first one always required a full-length instruction, while on threeway there was also a shorter instruction template that included the final three buses.

The effect of applying both optimizations together was usually similar to the sum of the benefits of the optimizations applied separately, and the best result was usually obtained by combining the post-optimizer pass with FU prioritization in the mode which reverts to the old heuristics in basic blocks where the FU prioritization would cause a larger code size due to a larger number of instructions. The best-case improvement was 12.2 % in the gsm test on the fourway processor, while performance stayed exactly the same. The average code size decrease in this mode was 5.4 % while performance decreased by 0.1 %.

5 Future Work

In the second mode of the function unit selection heuristic optimization, the basic blocks where the optimization is used are selected statically by the compiler, simply by analysing whether the code is in an inner loop or not. Profiling could be used to identify the actually most critical inner loops and to leave only those unoptimized.

The post-scheduler optimizer could also be improved to aggressively perform horizontal rescheduling of operations into those execution slots which have fewer implicit NOP slots associated with them, even when the main scheduling is done with a function unit selection that favors performance over code size. This could increase the code size savings from the post-scheduler optimizer without performance degradation.

6 Conclusions

Two compiler optimizations to better utilize short instruction words in a variable-length instruction coding scheme were presented and analyzed. The introduced post-optimizer pass resulted in an average of 2.4 % and a best case of 10.5 % code size reduction without performance loss. As the post-scheduler optimizer had a positive impact on both code size and performance, it should always be used.

Always prioritizing function units based on the implicit NOP slots gave a best-case code size saving of 7.1 %, while on average the code size increased by 1.2 % and the performance decreased by an average of 7.2 %. The processor architecture and the selection of instruction templates also had a significant effect on whether the function unit prioritization decreased or increased the code size: with the threeway processor the function unit selection increased the code size, but with fourway it decreased it. Since it may reduce performance and sometimes even increase code size, the function unit prioritization should be used cautiously, only after testing that it provides a benefit for the case at hand and only in cases where the performance penalty is not too severe.

Prioritizing function units based on the implicit NOP slots only outside inner loops gave a best-case code size saving of 9.1 %, but the average code size increased by 0.3 % while performance decreased by an average of 4.1 %. In most cases, both the performance drawbacks and the code size savings were smaller than when the optimization was always performed, and the performance saved by skipping the optimization in inner loops was usually worth more than the code size lost. Compared to the unoptimized version, however, performance still dropped in most test cases, so even this mode should not be enabled without testing that it is really beneficial.

Because the results from the function unit prioritization were so often negative, a third mode was implemented. In this mode, every non-inner-loop basic block is first scheduled with both heuristics to test which one gives the smaller code size, and that heuristic is then selected for the final scheduling. This mode gave a best-case code size saving of 10.7 % and an average of 3.0 %, with an average slowdown of 0.4 %.

By combining the two optimization methods, a best case of 12.2 % and an average of 5.4 % code size savings could be achieved, with only a 0.1 % performance loss.

The fact that the results from the function unit prioritization were sometimes negative also highlights the importance of a good low-level instruction scheduler. The quality of the instruction scheduler can sometimes have a greater impact on code size than special optimizations aimed at decreasing it.