1 Introduction

One of the main pitfalls of Very Long Instruction Word (VLIW) and Transport Triggered Architectures (TTAs) [1] is poor code density, which is caused by the long instruction word and the No Operations (NOPs) that have to be inserted into the program code. A VLIW or TTA processor has several execution units in order to achieve high performance on computationally intensive inner loops, often with the aid of unrolling and software pipelining. Each of the execution units has an execution slot in the wide instruction word. There is often a large amount of control-oriented code outside the inner loops which cannot exploit the parallel execution units and is, therefore, represented with instructions where most of the execution slots are NOPs. In a straightforward encoding, this control code consumes a large amount of instruction memory.

In order to save instruction memory and fetch power, control code should be encoded in a more compact form than full-size VLIW instructions. One common approach is to structure instructions as a series of operations, with, e.g., start and stop bits to delimit operations that can be executed in parallel. Dynamic hardware then detects bundles and routes operations to the appropriate execution units. Conte et al. discuss I-fetch and cache design tradeoffs in this type of encoding [2]. However, the hardware to support any combination of NOPs in each instruction is complex. Other processors simplify the decode apparatus by specifying a limited number of instruction templates: each template specifies some execution slots as implicit NOPs that do not consume space, but other NOPs have to be explicitly encoded. In a reasonable implementation, there are only a few different instruction templates [3, 4].

An example of template selection and NOP removal for a 5-issue processor is shown in Fig. 1. In this example, a large number of NOPs can be seen in the four instruction words. Two new instruction formats are assigned to the templates '10' and '11', which only use the execution slots A, B and D, E, respectively. The rest of the execution slots in these two formats are considered NOP slots. If NOPs appear in the NOP slots, they are removed from the instruction. Three instructions can be encoded with these templates, removing a majority of the NOP operations in the example program. Considerable instruction memory savings can be achieved simply by scheduling the instructions for maximum performance, without optimizing the code for the implicit NOP slots, and using the shorter instructions whenever the NOP patterns happen to match. This is, however, suboptimal, as compiler optimizations can increase the usage of the short instructions.

Figure 1

A short program before (left) and after (right) assigning two new instruction formats, which define two execution slots to be used out of the five in the processor. Most of the NOP operations are removed by using the shorter instruction formats in the 2nd, 3rd and 4th instruction. IMM denotes a long immediate value which is transferred to the data path of the processor by encoding it into two execution slots of an instruction word, in an instruction template where these two execution slots cannot be used for ordinary operations.
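To make the template mechanism of Fig. 1 concrete, the following C++ sketch shows how a decoder could expand a templated instruction into the five slot fields. The 2-bit template field, the slot masks and the NOP encoding are illustrative assumptions, not a description of any particular implementation.

#include <array>
#include <cstdint>

constexpr int kSlots = 5;            // execution slots A..E
constexpr uint32_t kNopPattern = 0;  // assumed encoding of an explicit NOP

// For each 2-bit template, the mask of slots that are actually encoded in
// the instruction word (bit 0 = slot A). Slots absent from the mask are
// implicit NOPs and consume no instruction bits.
constexpr std::array<uint8_t, 4> kSlotMask = {
    0b11111,  // '00': full-width instruction, all five slots present
    0b11111,  // '01': full width, with a long immediate occupying two slots
    0b00011,  // '10': only slots A and B are encoded
    0b11000   // '11': only slots D and E are encoded
};

// Expand a fetched instruction into five slot fields, inserting NOPs for
// the slots that the template leaves implicit.
std::array<uint32_t, kSlots> expand(uint8_t templ, const uint32_t* fields) {
  std::array<uint32_t, kSlots> slots{};
  int next = 0;
  for (int s = 0; s < kSlots; ++s) {
    const bool present = (kSlotMask[templ] >> s) & 1;
    slots[s] = present ? fields[next++] : kNopPattern;
  }
  return slots;
}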

In this paper, two solutions to this problem are presented and compared: (a) prioritizing function units that have implicit NOP slots associated with them in fewer instruction templates and (b) executing a post-scheduler NOP optimizer which utilizes the slack of the schedule. In the post-scheduler approach, operations which have slack are moved into different instruction words so that the available instruction templates are utilized closer to their maximum capacity. Prioritizing function units may decrease the performance of the code, as it may conflict with other, more performance-critical methods of function unit prioritization. The post-scheduler optimizer should have minimal effect on performance.

We have already reported some preliminary results in [5] and, in this paper, we propose two additional improvements: performing prioritization only for basic blocks outside inner loops so as to decrease the performance penalty, and speculatively scheduling basic blocks with both heuristics and selecting the one with the better code density. These improvements make FU prioritization a viable optimization in practice, unlike the original algorithm, which uses the same heuristics for all basic blocks in the program. Some of the technical content is also presented in more detail. In addition, an advanced version of the instruction scheduler is used, which explains the differences in some results compared to [5].

This paper is organized as follows. Section 2 discusses previous research related to the proposed methods. Section 3 presents the proposed optimizations: Subsection 3.2 explains the function unit selection heuristics and Subsection 3.3 introduces the proposed post-scheduler optimizer algorithm. Section 4 contains the evaluation of the two proposed methods. Section 5 discusses potential future research related to the topic. Section 6 concludes the paper.

2 Related Work

Lee et al. [6] introduce a post-scheduler optimization algorithm to minimize the instruction fetch and control logic transitions between successive instructions. In this method, horizontal and vertical rescheduling of operations is performed, moving operations both between instructions and between execution slots within the same instruction. Their method, however, does not consider NOP usage and does not try to optimize the code size, but their work has been an inspiration for the post-optimizer pass presented in this paper.

Hahn et al. [7] propose a variable-length encoding for VLIW. This method has “protected” versions of many long-latency operations and control operations. These versions of the operations add pipeline stalls after the operations, so that there is no need to add subsequent instruction words containing only NOPs. Their instruction scheduler fills the delay slots and instruction words after a long-latency operation with usable instructions, and uses the ordinary version of the operation if possible, to maximize performance. In case the scheduler cannot place any useful operations in delay slots or instructions after some long-latency operation, it replaces the operation with the “protected” version of the operation. When optimizing for minimal code size, the compiler always uses the protected versions of instructions, resulting in lower performance but eliminating all NOPs due to delay slots and long-latency operations.

In [8], a method of collapsing the prolog and the epilog of a software-pipelined loop is introduced. This optimization can be combined with the proposed methods, as they attack different parts of the code size problem.

The approach of Jee et al. [9] eliminates many NOP operations by encoding only data dependencies and enabling different execution slots to execute operations from different instruction words. This, however, requires considerable changes to the processor architecture, adds complexity to the processor's control logic, and requires some support for dynamic scheduling in the processor.

Some studies approach the code density problem at a lower level by compressing code with, e.g., Huffman coding. Ros et al. [10] propose compiler optimization passes that improve the compression ratio by reassigning registers. Larin and Conte [11] have the compiler generate an optimal Huffman coding and decompression hardware to go with each program. These approaches might be applied on top of a template-based instruction format.

Haga et al. [12] introduce a method to minimize code size with global scheduling. Their approach, however, does not consider optimizing the code size for variable-length instructions.

Haga et al. [13] discuss compiler optimizations for explicitly parallel instruction computing (EPIC) architectures. EPIC is a variation of VLIW where instructions are divided into regions that can be executed in parallel by means of conceptual stop bits. They are fetched in fixed-size bundles, each of which contains a template field indicating the placement of stop bits within the bundle. Parallel execution groups can span multiple bundles. Haga et al. introduce an algorithm to create schedules which are optimal in both performance and code size for infinitely wide EPIC processors which are never resource constrained. The optimal algorithm becomes slow when large basic blocks are scheduled, so the authors propose non-optimal heuristics to overcome this problem. The instruction templates in EPIC architectures are, however, quite different from the variable-width templates discussed in this work, and their method can only be used for EPIC-type instruction templates.

Most of the related work [7, 9–11, 13] concentrates on instruction encodings which are more complex than simple variable-length instruction templates, and the techniques cannot be applied to simple variable-length templates. Some of the related work [8, 12] proposes code size optimizations generic enough to be used together with the techniques proposed in this paper.

3 Proposed Optimizations

3.1 Baseline

In this work, the cycle-based list scheduler [14] is used as the baseline, and the proposed optimizations are compared against this method. The scheduler used in this paper schedules instructions one basic block at a time, but also includes a postpass delay slot filler which performs some inter-basic-block code motion. The baseline method schedules operations to the best possible cycle, but if multiple execution units can execute the operation in the same cycle, the function unit supporting the fewest operation types is selected. If there are multiple such units, the connectivity of the function unit is considered and the unit with the least connectivity is selected. The rationale for this is that if, e.g., addition can be executed on both function units A and B, but only unit A supports multiplication, add operations should preferentially be scheduled to unit B, so that unit A is free to execute the multiplications. The optimizations could also be used with different instruction schedulers.
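A minimal sketch of this baseline tie-breaking follows; the FunctionUnit record is a hypothetical stand-in for the scheduler's actual resource model.

#include <vector>

struct FunctionUnit {
  int supportedOpCount;  // number of operation types the unit implements
  int connectivity;      // number of bus/socket connections of the unit
};

// Among candidate units that can all execute the operation in the chosen
// cycle, prefer the unit supporting the fewest operation types; break
// remaining ties with the lowest connectivity. Returns an index into
// candidates, or -1 if the candidate list is empty.
int selectUnit(const std::vector<FunctionUnit>& candidates) {
  int best = -1;
  for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
    if (best < 0 ||
        candidates[i].supportedOpCount < candidates[best].supportedOpCount ||
        (candidates[i].supportedOpCount == candidates[best].supportedOpCount &&
         candidates[i].connectivity < candidates[best].connectivity)) {
      best = i;
    }
  }
  return best;
}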

3.2 Function Unit Selection

A prioritize NOP-slots option was added to the function unit selection. This option works by counting the instruction templates in which a function unit can be encoded as an implicit NOP, and deprioritizing the function units with the highest implicit NOP count. If two or more function units have the same NOP slot count, the instruction scheduler reverts to the old performance-optimized mechanism when selecting between those function units.
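The following sketch illustrates the option under assumed data structures: templateNopSlots lists, for each instruction template, the function units whose slots are implicit NOPs in that template, and a stable sort preserves the baseline performance-optimized order between units with equal counts.

#include <algorithm>
#include <vector>

// templateNopSlots[t] lists the indices of the function units whose slots
// are implicit NOPs in instruction template t.
int implicitNopCount(int fu,
                     const std::vector<std::vector<int>>& templateNopSlots) {
  int count = 0;
  for (const auto& nops : templateNopSlots)
    count += static_cast<int>(std::count(nops.begin(), nops.end(), fu));
  return count;
}

// Deprioritize units that act as implicit NOPs in many templates; equal
// counts keep their original, performance-optimized order.
void prioritizeForCodeSize(
    std::vector<int>& candidates,
    const std::vector<std::vector<int>>& templateNopSlots) {
  std::stable_sort(candidates.begin(), candidates.end(), [&](int a, int b) {
    return implicitNopCount(a, templateNopSlots) <
           implicitNopCount(b, templateNopSlots);
  });
}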

We experiment with three modes of operation for the function unit selection optimization. The first mode schedules all code with the size-optimized function unit heuristic. The second mode uses the size-optimized function unit heuristic only for code outside inner loops. The third mode compiles every non-inner-loop basic block with both heuristics and selects the code with the smaller size, as sketched at the end of this subsection, since the original heuristic may sometimes generate denser code. Inner loops are scheduled with the baseline heuristics to minimize performance degradation, just like in the second mode. Even the third mode can result in a small performance degradation, as code outside inner loops can grow in instruction count while actually decreasing in code size.

The idea in the second mode is to minimize the execution time penalty of the optimization by not performing it in the parts of the code that are executed the most. There is sufficient code outside inner loops to still allow considerable size savings.
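Under assumed interfaces, the third mode can be sketched as follows; BasicBlock, Schedule and scheduleBlock are hypothetical placeholders for the compiler's real data structures.

enum class FuHeuristic { Baseline, PrioritizeNopSlots };

struct BasicBlock { bool inInnerLoop = false; /* ... */ };
struct Schedule { int encodedBits = 0; /* ... */ };

// Assumed entry point that schedules one basic block with the given
// function unit selection heuristic.
Schedule scheduleBlock(const BasicBlock& bb, FuHeuristic heuristic);

Schedule scheduleForSize(const BasicBlock& bb) {
  // Inner loops always use the performance-oriented baseline heuristic.
  if (bb.inInnerLoop)
    return scheduleBlock(bb, FuHeuristic::Baseline);
  Schedule base = scheduleBlock(bb, FuHeuristic::Baseline);
  Schedule sized = scheduleBlock(bb, FuHeuristic::PrioritizeNopSlots);
  // Keep whichever version encodes into fewer bits; a version with more
  // instructions can still be the smaller one after NOP compression.
  return sized.encodedBits < base.encodedBits ? sized : base;
}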

3.3 Post-scheduler Optimizer Algorithm

The main motivation of the post-scheduler optimizer is to make better use of NOP slots without decreasing performance; decisions taken during the actual instruction scheduling phase would affect the schedule and might decrease the performance of the program.

Figure 2 shows an example of how rescheduling can improve the usage of the shorter instruction templates. In this example, there are no data dependencies limiting the rescheduling while, in a real situation, the data dependencies would usually not allow all the operations to be rescheduled, and the benefit from the optimization would be smaller than in the example.

Figure 2

A short program without (middle) and with (right) the post-scheduler rescheduling. The left side shows the instruction templates used. The upper row shows the operations in full-width instructions and the bottom row shows the instructions after the NOP-slot compression is applied. Without the optimization (middle column), using both the B and D slots simultaneously prevents the usage of any short template in instruction 2, using both the A and E slots simultaneously prevents the usage of any short template in instruction 3, and using slot C prevents the usage of any short template in instruction 4. In the optimized version, D2 is moved to instruction 3, A3 and B2 are moved to instruction 4 and C4 is moved to instruction 6. This allows instruction template '10' to be used for instructions 2 and 4 and instruction template '11' to be used for instruction 3.

The algorithm is run for every basic block after that basic block has been scheduled, but before the inter-basic-block delay slot filler.

Figure 3 shows how the algorithm first pushes all moves or operations into a queue. After this, the main algorithm is iterated as long as the queue contains elements. Only one instance of each operation can be in the queue at a time.

Figure 3

The NOPOptimizer outer loop routine.

In the main loop, the first element in the queue is popped and processed. If the instruction where the operation belongs is already full, nothing is done for that operation; such instructions are already optimally encoded and contain no wasted bits. Rescheduling their operations might sometimes relax the dependencies of other operations and allow a more optimal placement of those operations, but these situations are rare, and rescheduling operations in full instructions would prevent the algorithm from finishing naturally. Jump and call operations are not moved, since this would affect the length of the basic block and, therefore, the performance of the code. This is illustrated in Fig. 4, lines 1-5.

Figure 4

The tryPushNode helper routine for the NOPOptimizer.

Lines 6-18 in Fig. 4 show how the slack of the schedule is considered: the data dependencies of the operation are checked, and the earliest and latest possible cycles for the operation are calculated based on the data dependencies. If an operation has no data producers limiting how early it can be scheduled, it is not moved earlier, and if it has no data consumers limiting how late it can be scheduled, it is not moved later. This guarantees that the basic block cannot get longer and that later inter-basic-block code motion optimizations do not lose optimization opportunities due to the NOP optimization.

The operation is then unscheduled and rescheduling is attempted to both earlier and later cycles, until either it is scheduled to an instruction which will not grow longer due to the rescheduling, or the data dependence limits are reached. Moving operations in both directions allows the same algorithm to be used for both top-down and bottom-up scheduled code, and also allows inefficient reschedules to be reverted later without special backtracking logic. When an operation is scheduled into a new instruction, its predecessors are requeued if the operation was moved forward, and its successors are requeued if the operation was moved backward. This is shown in Fig. 4, lines 19-39, and in Fig. 5.

Figure 5

The tryToMoveToCycle helper routine for the NOPOptimizer.
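A condensed sketch of this logic follows. Operation, Instruction and the helper calls are hypothetical placeholders for the scheduler's real interfaces, and the pseudocode in Fig. 4 remains the authoritative description.

#include <deque>
#include <initializer_list>

struct Instruction { bool isFull() const; };
struct Operation {
  Instruction* instruction();
  bool isJump() const;
  bool isCall() const;
  bool hasProducers() const;
  bool hasConsumers() const;
  int earliestCycleByDeps() const;  // earliest cycle allowed by producers
  int latestCycleByDeps() const;    // latest cycle allowed by consumers
  int cycle() const;
  void unschedule();
  void scheduleAt(int cycle);
  std::deque<Operation*> predecessors();
  std::deque<Operation*> successors();
};

// Assumed helpers: tryToMoveToCycle succeeds only if the target instruction
// does not grow longer (cf. Fig. 5); requeue adds operations to the queue,
// skipping operations that are already queued.
bool tryToMoveToCycle(Operation* op, int cycle);
void requeue(std::deque<Operation*>& queue, std::deque<Operation*> ops);

bool tryPushNode(Operation* op, std::deque<Operation*>& queue) {
  // Lines 1-5: leave full instructions and control flow operations alone.
  if (op->instruction()->isFull() || op->isJump() || op->isCall())
    return false;

  // Lines 6-18: derive the slack window from the data dependencies. An
  // operation without producers is never moved earlier and one without
  // consumers is never moved later, so the basic block cannot grow.
  const int origin = op->cycle();
  const int earliest = op->hasProducers() ? op->earliestCycleByDeps() : origin;
  const int latest = op->hasConsumers() ? op->latestCycleByDeps() : origin;

  // Lines 19-39: unschedule, then probe cycles in both directions until a
  // move lands in an instruction that does not grow longer.
  op->unschedule();
  for (int d = 1; origin - d >= earliest || origin + d <= latest; ++d) {
    for (int cycle : {origin - d, origin + d}) {
      if (cycle < earliest || cycle > latest)
        continue;
      if (tryToMoveToCycle(op, cycle)) {
        // Requeue neighbours whose slack the move may have changed:
        // predecessors after a move forward (later), successors after a
        // move backward (earlier).
        if (cycle > origin) requeue(queue, op->predecessors());
        if (cycle < origin) requeue(queue, op->successors());
        return true;
      }
    }
  }
  op->scheduleAt(origin);  // no better slot found; restore the original one
  return false;
}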

When new operations are popped from the queue, a counter limits how many times each operation can be rescheduled; this prevents the algorithm from going into an infinite loop, endlessly rescheduling two consecutive operations back and forth. This is shown in Fig. 3, lines 6-10.
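A minimal sketch of the outer loop under the same assumed interfaces; the per-operation limit of ten reschedules is an illustrative value, not taken from the actual implementation.

#include <deque>
#include <unordered_map>

struct Operation;  // as in the previous sketch
bool tryPushNode(Operation* op, std::deque<Operation*>& queue);

// Process the queue until it is empty, capping how many times any single
// operation may be processed and thus rescheduled (cf. Fig. 3, lines 6-10).
void optimizeBasicBlock(std::deque<Operation*>& queue) {
  constexpr int kMaxReschedules = 10;  // assumed cap per operation
  std::unordered_map<Operation*, int> rescheduleCount;
  while (!queue.empty()) {
    Operation* op = queue.front();
    queue.pop_front();
    if (++rescheduleCount[op] > kMaxReschedules)
      continue;  // give up on operations that ping-pong back and forth
    tryPushNode(op, queue);  // may requeue neighbours of a moved operation
  }
}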

4 Evaluation

4.1 Benchmarks

We evaluate the performance of our methods with a subset of the CHStone [15] benchmark suite. This suite is selected since it contains a range of real-world routines, not microbenchmarks, with varying amounts of control code and instruction-level parallelism. The tests adpcm, gsm, mips, jpeg, aes, blowfish and sha are used. The software floating-point tests dfadd, dfmul, dfdiv and dfsin are omitted since they are microbenchmarks with a very small code footprint and are therefore not good benchmarks for code size measurement.

4.2 Processor Architectures

In order to measure the efficiency of the optimizations in practice, two Transport Triggered Architecture (TTA) type VLIW processors were developed using the TTA Codesign Environment (TCE) [16], and the compiler for the TCE toolset was modified to include the proposed optimizations.

A TTA-type VLIW gives the compiler extra freedom to transfer some operands in instructions earlier than the one in which execution starts, and to read results later than they are produced. The implemented version of the post-scheduler optimizer algorithm takes advantage of this by rescheduling individual moves instead of whole operations.

The processor interconnect was developed with a method resembling the one in [17], to give the processor a reasonable interconnect and register file structure.

The first processor architecture, threeway, is a 3-issue processor with a 96-bit instruction word. The overall processor architecture was designed to give relatively good performance on the CHStone tests while keeping the instruction width at 96 bits, so as to have a good balance between performance and instruction size even without variable-length instruction encoding. The processor has six buses, each of which has its own slot in the instruction encoding. The first two buses are connected to a combined Load-Store Unit (LSU) and Arithmetic-Logical Unit (ALU). The third and fourth buses are connected to a combined ALU and multiplier, while the fifth and sixth buses are connected to an ALU and the control unit. This processor has three register file read ports and one register file write port. Figure 6 shows the organization of the processor.

Figure 6

Organization of the threeway processor used in the evaluations. Function units and register files are on top, interconnect buses and sockets on the bottom. The small boxes on the function units and register files are ports; X indicates the trigger port on the function units.

The processor has four different instruction templates: one 32 bits long, one 48 bits long, and two full-length ones, one with a long immediate value and one without. The 32-bit instruction template was selected by first finding all bus combinations that can be encoded in 32 bits and then selecting the one used most often when the adpcm code from the CHStone benchmark was compiled without any compiler optimizations for NOP slot usage. The 48-bit instruction template was selected in the same manner from the combinations that can be encoded in 48 bits.
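A sketch of this selection procedure is given below, assuming each instruction of the reference program is summarized as a bitmask of the buses it uses; the slot widths, the template field width and the representation are illustrative assumptions.

#include <cstdint>
#include <vector>

struct Bus { int slotBits; };  // encoding width of the bus slot

// An instruction is summarized by a bitmask of the buses it actually uses.
using BusMask = uint32_t;

// Enumerate the bus subsets whose encoded slots fit in budgetBits and pick
// the subset matching the most instructions of the reference program (here,
// adpcm compiled without NOP slot optimizations).
BusMask selectTemplate(const std::vector<Bus>& buses,
                       const std::vector<BusMask>& program,
                       int budgetBits, int templateFieldBits) {
  const int n = static_cast<int>(buses.size());
  BusMask best = 0;
  long bestCount = -1;
  for (BusMask subset = 0; subset < (1u << n); ++subset) {
    int bits = templateFieldBits;
    for (int b = 0; b < n; ++b)
      if (subset & (1u << b)) bits += buses[b].slotBits;
    if (bits > budgetBits)
      continue;  // this combination cannot be encoded in, e.g., 32 bits
    long count = 0;
    for (BusMask used : program)
      if ((used & ~subset) == 0)  // all used buses fit; the rest are NOPs
        ++count;
    if (count > bestCount) { bestCount = count; best = subset; }
  }
  return best;  // bus subset chosen for the new short template
}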

The second processor architecture, fourway, is a 4-issue processor with a 128-bit instruction word. The processor has eight buses, each of which has its own slot in the instruction encoding. The organization of the fourway processor is such that most code with a low level of instruction-level parallelism (ILP) can execute using only the first two buses and the first function unit, as these are connected to both a combined ALU and LSU and a separate control unit. The 3rd and 4th buses are connected to a combined ALU and multiplier, and the 5th and 6th buses are connected to another combined ALU and multiplier. The 7th and 8th buses are connected to an ALU. This organization of the function units is relatively close to the default configuration of HP's VLIW EXample (VEX) processor architecture [18], with the difference that in VEX the control unit is combined with the last ALU. The fourway processor, however, has considerably fewer register file ports (three read, two write) due to TTA-specific optimizations such as software bypassing and operand sharing decreasing the need for ports.

The instruction templates in the fourway processor architecture are such that the first and second buses can contain moves in all instruction templates. In the 40-bit template, all the other slots are implicit NOPs, and in the 72-bit template there is also a 32-bit immediate value in addition to the first two buses. The short templates in this processor are wider than the templates in the threeway processor because of the requirement that the first two buses hold moves in all of the instruction templates.

4.3 Evaluation Results

Tables 1 and 2 show the performance, instruction counts and code sizes of the CHStone benchmarks with the different optimization methods on the two processors. The baseline “No optimization” in these results means that the shorter instruction templates are used when the compiler happens to generate instructions which can be encoded with them, but no optimizations are performed to encourage their usage.

Table 1 Instruction template usage on the CHStone benchmark with and without the proposed optimizations on a 3-issue, 4-template processor.
Table 2 Instruction template usage on the CHStone benchmark with and without the proposed optimizations on a 4-issue, 4-template processor.

On higher-ILP workloads such as adpcm, blowfish and aes, prioritizing the function units in all basic blocks had a considerable negative effect on the performance, and also on the number of instructions. In these cases, the increase in instruction count usually consumed more instruction memory than was saved by the better usage of the smaller instructions, and the total program memory size increased by 1.7 - 6.6 %. The worst slowdown occurred with the blowfish benchmark, where the program slowed down by 28.3 %. On more control-oriented low-ILP workloads such as gsm and mips, prioritizing function units in all basic blocks caused a smaller slowdown on both processors, and on the fourway processor decreased the program memory size by 6.9 - 7.1 %. The sha benchmark behaved in a similar fashion to the gsm and mips benchmarks, even though it has more ILP, gaining 2.7 % from the function unit prioritization. On the threeway processor, the results of prioritizing function units in all basic blocks were negative also in the low-ILP cases. On average, always applying the FU prioritization resulted in a 1.2 % increase in code size with an average slowdown of 7.2 %.

Applying the function unit prioritization only to code outside inner loops usually decreased the slowdown from the function unit selection, but in many cases also resulted in a smaller code size decrease. On the threeway processor, however, this did not make the function unit prioritization beneficial, as the code size was still larger than without the optimization; applying the optimization only outside inner loops merely made the code size penalty less severe. On the fourway processor, the best result was achieved in the gsm test, where this optimization yielded the smallest code size, 9.1 % smaller than the unoptimized version, with only a 0.5 % slowdown. In this case, always applying the FU prioritization gave only 7.1 % savings, with a large 7.7 % slowdown. Especially on sha, applying the prioritization only outside inner loops helped: the code had only a 1.5 % slowdown, compared to the 16.1 % slowdown when the optimization was applied to all the basic blocks, and the code size was actually smaller, a 3.8 % decrease versus a 2.7 % decrease. On average, this optimization was nevertheless also harmful, causing a 0.3 % code size increase and an average slowdown of 4.1 % across all the test cases.

Scheduling every non-inner-loop basic block with both FU selection heuristics, and reverting to the baseline function unit selection heuristic for the basic blocks where FU prioritization produced larger code, gave much better results. This mode practically eliminated the cases where longer code with more instructions caused a code size increase. This optimization was beneficial even on the threeway processor, with an average code size saving of 0.4 % while causing on average only a 0.2 % slowdown. On the fourway processor the results were much better, achieving a best case of 10.7 % and an average of 5.6 % saved code size at the cost of an average slowdown of 0.5 %.

The post-optimizer pass had a more stable effect on both performance and code size on both processors. The code size decrease was in the range between 0.0 % and 10.5 %. The performance in all cases was very close to the original, on average being 0.3 % better than the non-optimized version. The average code size decrease was 3.4 % for the threeway processor and 1.5 % for the fourway processor, the average of both being 2.4 %. The reason for the weaker improvement with the fourway processor is that operations on any function unit other than the first one always required a full-length instruction, while on threeway there was also a shorter instruction template that included the final three buses.

The effect of applying both optimizations together was usually similar to the sum of the benefits of the optimizations applied separately, and the best result was usually obtained by combining the post-optimizer pass with FU prioritization in the mode which reverts to the old heuristics in basic blocks where the FU prioritization would cause a larger code size due to a larger number of instructions. The best-case improvement was 12.2 % in the gsm test on the fourway processor, while performance stayed exactly the same. The average code size decrease in this mode was 5.4 % while performance decreased by 0.1 %.

5 Future Work

In the second mode of the function unit selection heuristic optimization, the basic blocks where the optimization is used are selected statically by the compiler, simply by analysing whether the code is in an inner loop or not. Profiling could be used to identify the actually most critical inner loops and to leave only those unoptimized.

The post-scheduler optimizer could also be improved to aggressively perform horizontal rescheduling of operations into those execution slots which have fewer implicit NOP slots associated with them, even when the main scheduling is done with a function unit selection that favors performance over code size. This could increase the code size savings from the post-scheduler optimizer without performance degradation.

6 Conclusions

Two compiler optimizations to better utilize short instruction words in a variable-length instruction coding scheme were presented and analyzed. The introduced post-optimizer pass resulted in an average of 2.4 % and a best case of 10.5 % code size reduction without performance loss. As the post-scheduler optimizer had a positive impact on both code size and performance, it should always be used.

Always prioritizing function units based on the implicit NOP slots gave a best-case code size saving of 7.1 %, while on average the code size increased by 1.2 % and the performance decreased by an average of 7.2 %. The processor architecture and the selection of instruction templates also had a significant effect on whether the function unit prioritization decreased or increased the code size: with the threeway processor the function unit selection increased the code size, but with fourway it decreased it. Since it may reduce performance and sometimes even increase code size, the function unit prioritization should be used cautiously, only after testing that it provides a benefit for the case at hand and only in cases where the performance penalty is not too severe.

Prioritizing function units based on the implicit NOP slots only outside inner loops gave a best-case code size saving of 9.1 %, but the average code size increased by 0.3 % while performance decreased by an average of 4.1 %. In most cases, both the performance drawbacks and the code size savings were smaller than when the optimization was always performed, and the performance saved by skipping the optimization in inner loops was usually worth more than the code size lost. Compared to the unoptimized version, however, performance still dropped in most test cases, so even this mode should not be enabled without testing that it is really beneficial.

Because the results from the function unit prioritization were so often negative, a third mode was implemented. In this mode, every non-inner-loop basic block is first scheduled with both heuristics to test which one gives the smaller code size, and that heuristic is then selected for the final scheduling. This mode gave a best-case code size saving of 10.7 % and an average of 3.0 %, with an average slowdown of 0.4 %.

By combining the two optimization methods, a best case of 12.2 % and an average of 5.4 % code size savings could be achieved, with only a 0.1 % performance loss.

The fact that the results from the function unit prioritization were sometimes negative also highlights the importance of a good low-level instruction scheduler. The quality of the instruction scheduler can sometimes have a greater impact on code size than special optimizations aimed at decreasing it.