Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Safety critical applications use redundant designs and design state comparison techniques to detect potential design misbehavior. An example is a motor control circuit, where a malfunction could generate life threatening conditions. A full stop and restart of an application is sometimes costly and should be avoided with an on-the-fly recovery mechanism.

Another application for using redundant designs are the control systems of a satellite. Single event upsets (SEUs) must be detected before they could endanger costly missions in the orbit. It is beneficial when the power consumption of the redundant systems can also be reduced.

C-Slow Retiming (CSR) provides C copies of a given design by inserting registers and reusing the combinatorial logic in a time sliced fashion. The paper shows how CSR can be used when redundant designs are needed.

2 Background

The ever increasing demands for higher performance and higher throughput of cores have led to various techniques. [1] presents an efficient retiming algorithm and in [2] a retiming algorithm for FPGAs is shown. Retiming for wire pipelined SoC buses is discussed in [3]. Automatic pipelining of designs is outlined in [4]. The pipelining of sequential circuits with wave steering is shown in [5]. Leiserson et al. introduces the concept of C-Slow Retiming (CSR) in [6]. [7] presents a formulation as a general model for retiming and recycling, which also accepts an interpretation of the CSR problem. The effects of CSR is presented on three different benchmarks in [8] and the post-placements CRS for a microprocessor on an FPGA is shown in [9], whereas [10] presents an abstraction algorithm for the verification of generalized C-slow designs. In recent publications, CSR is used to maximize the throughout-area efficiency in [11] and on simultaneous multithreading processors in [12].

3 Contribution and Paper Organization

To the best of the author’s knowledge, power consumption (P) has not been considered in publications about C-Slow Retiming (CSR). Results are given for the P of CSR-ed based designs on an FPGA.

The paper demonstrates how to use CSR for single event upset (SEU) detection and design state on-the-fly recovery. The method is then further developed and optimized to reduce area (FPGA utilization) and the P of the application. Results of two 32-bit processors on a low-cost FPGA are given.

Section 12.3 outlines the CSR technology. In Sect. 12.5 the power consumption of CSR-ed designs is discussed. A method to detect single event upsets is shown in Sect. 12.6 and how the CSR algorithm can be adapted for that. The paper finishes with results and a summary in the Sects. 12.7 and 12.8.

4 C-Slow Retiming

4.1 Theory of CSR

Figure 12.1a shows the basic structure of a sequential circuit with its combinatorial logic (CL), in- and outputs and original registers (OR). In Fig. 12.1b, the CL is sliced into three (C = 3) parts, and each original path has now two (C-1) additional registers. This is the basic idea behind CSR. It results in C functionally independent design copies which use the logic in a time sliced fashion. It shows how different parts of the logic are used during different cycles. It now takes three micro-cycles to achieve the same result as in one original cycle. In Fig. 12.1, inputs and outputs are valid at the same time slice. The implemented register sets are called “C-Slow Retiming Registers”, CRs. They are placed at different C-levels. Figure 12.1b shows one basic rule of CSR. There are only paths between consecutive CRs and also from the last CRs to the original register set and from the original register set to the first CRs.

Fig. 12.1
figure 1

(a) Simplified design. (b) Applying CSR technique

We define the maximum frequency of the given design (Fig. 12.1a) as Fd and the maximum frequency of a CSR-ed design (Fig. 12.1b) as Fcsr, whereas:

$$ \mathrm{F}\mathrm{csr}\sim \mathrm{F}\mathrm{d}*\mathrm{C} $$
(12.1)

The individual cycle of a CSR-ed design is called a micro cycle. By generating C independent copies of the design, all running—theoretically—at Fd, we can also say that the system frequency Fsys is equal to Fcsr:

$$ \mathrm{F}\mathrm{sys}=\mathrm{Fcsr}\sim \mathrm{F}\mathrm{d}*\mathrm{C} $$
(12.2)

In theory, this could lead to an unlimited performance increase. Evidently this cannot be done endlessly and register insertion becomes inefficient for higher C again. The results section at the end of this paper shows examples for that.

4.2 CSR on RTL

CSR clearly changes the behavior of the design and can only be fully utilized when the CSR-ed core is embedded in a new logic environment. With the right wrapper logic, the CSR-ed design then behaves exactly as the original core, but multiple and functional independent versions are available. These modifications have a dramatic impact on the design flow. It is of great advantage to have a solution on higher level such as RTL. The CSR-ed version must be used as a new core in the design and verification process. A technique has been demonstrated, which automatically modifies the design to enable CSR on RTL by the author in [13]. The results given in this paper are generated using this technology.

4.3 Verification of CSR Design Modifications

It is non-trivial to verify the correctness of CSR based designs. Static timing analysis (STA) can be used for that. When each C-level gets its own clock tree, only paths from one C-level to the next one exist. Additional paths exist from the last C-level to the original registers and from the original registers to the first C-level. It can be checked during a standalone design level synthesis and STA run, if additional paths exist, which should not exist. The static analysis verifies the correctness of the register insertion process. The individual clocks can then be connected to a single clock again.

5 Power Consumption of CSR-ed Designs

5.1 Overview

Two (out of many) sources for power consumption (P) in digital designs are clock tree activity and switching activity generated by combinatorial logic. When a design is instantiated N-times, the number of resulting registers is N-times higher, but the clock speed remains the same. When using CSR, the number of resulting registers is less or roughly C-times higher. The difference is, that the clock speed must also be C-times higher to achieve the same performance. This results in a higher P of the clock tree in CSR-ed designs than the one of the alternative approach to instantiate individual designs. This is also shown in the result section on two different processors.

It has been demonstrated that register insertion can reduce the P of a design. For example Lim et al. use flip-flops with shifted-phase clocks in [14]. This looks similar to the CSR approach, although the register placement in the used CSR algorithm is timing driven. The used CSR algorithm places registers at the end of each path and then moves individual registers throughout the combinatorial logic until the best timing is achieved (timing optimization process).

In Fig. 12.2 the “CSR 4 P” line shows the relative P of one thread compared to the P when running the thread on the original core (“1-line”). It starts with 71 % P overhead at the beginning of the timing optimization process. This is due to the facts, that the signals generate toggling activity when passing through the additional registers and that the higher register load (and clock frequency) generate a higher clock tree P. The P overhead drops from 71 to 45 % during the timing optimization process when a better register distribution throughout the logic—mainly on the datapath—is reached. It can be argued, that this P reduction comes from the fact that the number of longer logic paths is reduced and therefore the probability to generate power consuming signal glitches is reduced. It was not successful to combine this timing driven approach with power aware optimization techniques (as in [14]).

Fig. 12.2
figure 2

Relative P of a CSR-ed design (C = 4) during timing optimization process

In Fig. 12.2 the relative P of the clock tree compared to the P of the original thread during the timing optimization process (“CT P”) is shown. The relative P of the clock tree increases due to the rising number of registers when improving the timing of an CSR-ed design. The line “CSR 1 P” shows the P of a single thread when only identical threads are executed. This will be discussed in the next section.

When CSR is used on an ASIC, it can be argued, that the smaller CSR-ed design consumes less Iddq compared to the larger design of the alternative approach to instantiate individual designs.

5.2 Using Both Clock Edges in CSR

For completeness we will show the results of a special CSR approach with inverted clock edges for every other C-level. This approach makes only sense when an even number of design slices exists (C = 2, 4, …). The number of resulting design copies will be half of the design slices C/2. The P numbers are presented in the result section.

5.3 P When Running Identical Threads

In Fig. 12.2 we see how the P changes when applying the CSR algorithm (C = 4) on a given example design and identical threads are executed (“CSR 1 P” line). In this case the P of a single thread is in the range of 87–77 % of the P of the same thread executed on the original design. It can be assumed that the clock tree P increase (due to the higher clock speed) is less than the P reduction that comes with the register insertion into the datapath.

In the sequel of this paper we elaborate on CSR-ed designs with an identical input stimuli for each design copy. We use processors to demonstrate the effectiveness of the proposed method, but the method is not limited to processors only. Nevertheless, we use the word “thread” synonym for the execution of a processor program or execution of an algorithm on a digital design.

6 Detecting a Single Event Upset (SEU) Using CSR

6.1 Detecting an SEU with Standard CSR

One way to detect a single event upset (SEU) is the duplication of a design (redundancy) and to compare key registers and/or outputs. When an SEU occurs, at least one design runs different and further actions can be taken. CSR supports this feature when executing (a group of) identical threads. In Fig. 12.3, all threads execute the same algorithm (or program) and use the logic in a time shared fashion. Therefore only a limited number of threads are affected when an SEU occurs. Multiple identical threads are most likely affected differently because each one of them is in a different design state. When this difference affects the state of key registers, it can be detected by a certain support logic.

Fig. 12.3
figure 3

Comparing signal values at key registers to detect an SEU

We applied this technique on two different processors. We added SEU detection logic to the design and run identical threads on each processor. In both cases we used the program counter and the data-bus access registers to detect different thread behaviors. We were not able to detect an SEU when running the application on an FPGA, but we used error injection techniques in simulation (as discussed by Braza et al. in [15]) to verify the behavior.

Based on empirical data we can assume that design duplication techniques using CSR generate less registers and certainly need less combinatorial logic than the alternative techniques using individual design instantiations. It can be argued, that this reduced register and logic count (compared to multiple individual instantiations of the design) also reduces the possibility to generate an SEU.

6.2 Recovery

When an SEU is detected, safety critical designs can restart or execute predefined software recovery routines. When using CSR, an on-the-fly recovery is possible. In Fig. 12.4 we see the CSR-ed design enhanced by the SEU detection circuit. When C ≥ 3, the SEU detection circuit uses a majority decoder to detect the failing thread by comparing the key register values of C identical threads. This is done every C micro-cycles.

Fig. 12.4
figure 4

On-the-fly recovery

A modified write enable sequence then overwrites the specific Rn register associated with the failing thread. For the on-the-fly recovery mechanism all other OR which are not used for SEU detection (Fig. 12.4) must be enhanced by signal hold (not enable) mechanism to overwrite the failing thread.

The technique has been successfully simulated on RTL using a simple 1-out-of-3 majority decoder and an error injection mechanism. The result was a full design on-the-fly recovery. The area overhead of this approach is reported in the result section. This on-the-fly recovery mechanism is almost impossible to achieve when using the standard SEU detection concept of individual redundant design implementations.

6.3 Reducing Shift Register Count

Figure 12.5a shows a design after applying the CSR method. It can be seen that CSR generates a high number of shift registers by adding registers to the feedback loop of the original registers. Additional shift registers are generated on the paths through the combinatorial logic. These shift registers contribute to the majority of area and P increase.

Fig. 12.5
figure 5

(a) Shift Registers generated by Register Feedback Loops and adjacent C-Slow Retiming Registers (CRs). (b) CSR-ed design with SEU detection circuit and reduced set of C-Slow-Retiming Registers (CRs)

When identical threads are executed, the number of shift registers can be reduced by using a modified CSR algorithm. In this case the original registers are replaced by a slightly enhanced logic, which is shown in Fig. 12.5b. Each original register is now instantiated C times, they are called “Rn” (R1, R2, R3 in the example). A FSM (same for all registers in the design) generates individual capture enable signals, so that the Rns take over the incoming bit stream at different consecutive cycles. Also the outputs of the Rns drive the combinatorial logic at different C-levels, so that shift registers generated by consecutive CRs can be removed by connecting the combinatorial logic with the relevant Rn. This has a positive impact on the overall register count (area) and P of CSR-ed designs running identical threads. Empirical data on two different processors is shown in the result section.

An additional comparator logic (see Fig. 12.5b) continuously compares 2 register values of the Rn registers. In case of a mismatch, the logic indicates that 2 threads run differently. This logic can be used for all original registers or only for certain key registers (like the program counter for instance).

At a certain timeslot (every C microcycles), all threads can be compared at the same time. When C ≥ 3, an on-the-fly recovery feature can be implemented by using a majority decoder and a slightly enhanced FSM logic. The FSM then uses the write enable signal to overwrite the Rns associated with a failing thread with the correct value.

This proposed method generates a system of redundant designs with SEU detection feature, a reduced area and a reduced P compared to a system using standard CSR or compared to a system using the alternative approach of instantiating individual designs. For more details see the result section. This method is particular useful when implemented on area sensitive ASICs used in safety critical and low power applications.

7 Results

The numbers in this results section are based on two CPUs. The RTL code for the ARM3 core (“Amber”, 32-bit RISC processor, 3-stage pipeline, ARM v2a) and the OR1200 (“OR1000”, 32-bit scalar RISC, 5-stage pipeline, MMU, DSP capabilities, TLB, instruction and data cache) can be found at [16]. The designs are implemented on a Xilinx Spartan-6 LX16 (-2ftg256). The clock is generated externally. The algorithm for CSR used in this paper is described in [13].

The P of a design during the CSR timing optimization is shown in Fig. 12.2. Two scenarios are tested. In a standard scenario, different threads are executed. In an alternative scenario, all threads execute the same program synchronously so that no combinatorial logic switches between the individual threads.

The P can be considered as relatively constant (Fig. 12.2) when moving the registers throughout the combinatorial logic. This was not expected. It was assumed, that the pipelined logic reduces the P by reducing the number of net glitches as shown in [14]. It can therefore be assumed, that the placement of additional registers (CRs) to reduce P needs to be carefully chosen. The author was not successful to combine this work with the technique demonstrated in [14].

Table 12.1 shows the results of a CSR-ed ARM3 core. When multiplying the functionality by C = 2 … 5, the number of registers increases up to 330 %. At the same time, the number of occupied slices remains relatively stable. This indicates, that the additional registers nicely fit into the already used slices. In other words, you have five times the functionality with just an area overhead of 43 % when using CSR.

Table 12.1 Results for CSR-ed ARM3 core, part I

The performance increases with each C step. Although it does not reach the performance (200, 300, …, 500 %) of the alternative concept by implementing individual processors (called “A” in the sequel of this section), it has a reasonably timing of 6.234 ns. This is a performance increase of up to 293 % compared to a single core implementation (“rel to 1”), but it only reaches 59 % (“rel to A”) of the performance of A. Better results can be achieved with more advanced technologies like the Virtex family, as can be seen in [13], and most likely in ASIC technologies.

When a single core with 825 occupied slices can run at 18.250 ns, we can calculate the performance per area (PpA) factor to 66.42 kHz/slice (Table 12.1 ). We can see in the PpA column, that this factor increases by up to 205 % for C = 5. In other words, when CSR can be used, more performance can be realized on a given size. Nevertheless, increasing C becomes less efficient for higher C.

The P of the original ARM3 core is 22.1 mW, running at maximum speed (18.250 ns). When instantiating individual ARM3 processors, the P multiplies accordingly (see Table 12.2 , P column). We distinguish between running the same program on all available designs or running different programs.

Table 12.2 Results for the CSR-ed ARM3 core, part II

When running the same program at the maximum possible speed, the P decreases to 40 % compared to A. This is certainly due to the fact, that the maximum possible speed is less than the one of A.

Even when the CSR-ed core could be run at the theoretical possible speed (cycle time = 18.250 ns/C), the P would only be in the range of 68–77 % of the A. The P seams to be relatively constant and independent of C when running the same program. We have already seen in Fig. 12.2, that the P is relatively independent of the CSR timing optimization process when moving registers throughout the combinatorial logic.

These findings show that CSR is great for safety critical applications (see Sect. 12.6). By running the same program on C copies of a CSR-ed design/CPU, you can decrease the area and power consumption at the same time, compared to A. By removing the obsolete registers on the register feedback paths, the increase of the occupied slices is only 13 % for C = 5 (See SEU column).

The P changes relatively to the P of the A from 113 to 85 % when increasing C and running different programs. When running the design at the theoretical possible speed (18.250 ns/C), the P is around 147 % of the P of A. It turned out that this number is relatively constant for different Cs. A CSR-ed design uses less registers than A, but can run (theoretically) C times faster, which results in a higher P of the clock tree than the one of A.

Similar numbers can be found for the CSR-ed implementation of the OR1200 core. The relative number of registers increases to up to 329 % (Table 12.3 ) whereas the number of occupied slices only reaches 137 % for C = 5. The performance increase is less optimal over an increasing number of copies. This is due to the already fast cycle time of the original core and the relatively slow technology (Spartan 6). Better results can be reached on a more advanced technology (Virtex 5), as reported in [13]. The P of the original core is 42.4 mW (Table 12.4). The P when running the same or different programs and with increasing numbers of copies is listed as well. When running the same thread and removing obsolete shift registers, the area increase is only 11 %.

Table 12.3 Results for the CSR-ed OR1200 core, part I
Table 12.4 Results when using SEU detection

Table 12.5 shows the P of the CSR-ed demo processors (C = 4) using inverted clock edges on consecutive C-levels. With C = 4 it is only possible to run two identical threads or two different threads. When the relevant numbers of Table 12.5 are compared with the one of Tables 12.2 and 12.6 , we see that this method generates more P on the ARM3, but less on the OR1200.

Table 12.5 Results when using different clock edges
Table 12.6 Results for the CSR-ed OR1200 core, Part II

Table 12.4 shows the area overhead of the CSR-ed demo processors (C = 3) when an SEU detection or an SEU detection and recovery mechanism is used. CSR offers in these cases the possibility to get an ARM3 (OR1200) implementation with SEU detection logic with just 30 % (37 %) overhead compared to the original implementation. A comparison logic for all registers is used. If only key registers are compared, the area overhead is reasonable lower. The additional area overhead when an on-the-fly recovery mechanism is added is minor (1–2 %). This is due to the fact that the write enable signal nicely fits on the used FPGA technology.

Table 12.7 shows the area overhead of the CSR-ed demo processors (C = 3) when an SEU and recovery mechanism with reduced register count is used. The are overhead could be reduced to only 27 % (ARM3) or 29 % (OR1200) of the single core implementation. Also the P when running three identical threads is reduced to just 46 % when running the possible speed on the ARM3, and just 71 % of the P when running the core at the theoretical maximal speed.

Table 12.7 Results when using Register Reduced CSR Version

8 Summary

C-Slow Retiming is known for running C copies of a design to increase the system performance per area for a given design. This paper elaborates on running identical threads and on using the resulting redundancy for SEU detection and design state recovery. In order to further reduce the area and power consumption various methods are discussed.

In general it can be said, that an individual thread runs always slower on an CSR-ed design compared to its execution on the original design. The multithreaded CSR solution needs less area but consumes roughly 40 % more power than the alternative approach to instantiate individual designs. Whereas when identical threads are executed, the power consumption is in favor of CSR because a thread consumes 20 % less power on a CSR-ed design than on the original core implementation. This fact as well as the possibility to use a design state recovery mechanism makes CSR ideal for safety critical and low power designs.

Although this paper concentrates on running CSR-ed designs on FPGAs, it looks promising to use this method also on ASICs and design implementations, where SEU detection, design state recovery, power consumption and design area play an important role.