Soft Error Rate and Fault Tolerance Techniques for FPGAs

Kastensmidt, Fernanda; Reis, Ricardo

doi:10.1007/978-1-4614-4078-9_10

Abstract

Different fault tolerance techniques can be applied to FPGAs according to their configuration technology, architecture and target operating environment. This chapter presents a set of fault mitigation techniques for SRAM-based, Flash-based and antifuse-based FPGAs and a test methodology to characterize those FPGAs under radiation. Results from neutron-induced faults are presented and compared.

1 Introduction

Integrated circuits operating in radiation environments are sensitive to transient faults caused by the interaction of charged particles with the silicon [1]. At ground level, the main source of radiation is neutrons, which interact with the material and produce secondary particles such as alpha particles and muons; these can ionize the silicon and provoke transient upsets in circuits fabricated in nanometer technologies [2]. The interaction of charged particles with a transistor may provoke transient and permanent effects. Effects caused by a single particle interaction are called Single Event Effects (SEE). They can be transient, such as the Single Event Upset (SEU) and the Single Event Transient (SET), or permanent, such as single event latchup (SEL), single event gate rupture (SEGR) and single event burnout (SEB) [3]. The effect caused by the accumulation of particle interactions is named Total Ionizing Dose (TID); it degrades transistor performance by modifying the threshold voltage and leakage current [4]. In this chapter, we focus on the transient effects caused by a single event, namely SEU and SET.

Field Programmable Gate Array (FPGA) components are very attractive for aerospace applications [5], as well as for many ground-level applications that require a high level of reliability, such as automotive systems, bank servers and processing farms. The large amount of resources available in programmable logic devices can add flexibility to on-board computers in satellites and to automotive electronics, for example. As FPGAs can be configured in the field, design updates can be performed until very late in the development process. In addition, new applications and features can be configured after a satellite is launched, or updated in harsh environments. However, programmable devices have been found to be very sensitive to radiation. It is fundamental to experimentally measure the soft error rate of the available resources, as well as the output error rate of specific applications, to evaluate their applicability in harsh environments.

Characterizing these programmable components is mandatory to justify their use in the presence of transient faults. This chapter presents the test methodology and characterization of FPGAs under radiation, together with fault mitigation techniques for protecting designs that target those FPGAs. Practical ground-test results under neutrons for SRAM-based and Flash-based FPGAs are presented and discussed.

2 FPGAs Under Soft Errors

Field-Programmable Gate Arrays (FPGAs) are configurable integrated circuits based on a regular structure with high logic density, which can be customized by the end user to realize different designs [6]. The FPGA architecture is based on an array of logic blocks and interconnections that are customized through programmable switches. Three programming technologies are currently used to implement these switches: SRAM, where the programmable switch is usually a pass transistor or multiplexer controlled by the state of an SRAM bit (SRAM-based FPGAs); antifuse, where an electrically programmable switch forms a low-resistance path between two metal layers (antifuse-based FPGAs); and EPROM, EEPROM or Flash cells, where the switch is a floating-gate transistor that can be turned off by injecting charge onto the floating gate.

Customizations based on SRAM are volatile. This means that SRAM-based FPGAs can be reprogrammed as many times as necessary in the field, and that they lose their configuration when the memories are disconnected from the power supply. Antifuse customizations are non-volatile, so they hold their content even without power, but they can be programmed only once. Each FPGA has a particular architecture. Programmable logic companies such as Xilinx, Microsemi, Aeroflex (licensed for QuickLogic FPGAs), Atmel and Honeywell (licensed for Atmel FPGAs) offer radiation-tolerant FPGA families. Each company uses different mitigation techniques suited to the characteristics of its architecture.

2.1 Single Event Effects on SRAM-Based FPGAs

An SRAM-based FPGA is composed of an array of configurable logic blocks (CLBs), a complex routing architecture, an array of embedded memories (Block RAM), an array of digital signal processing (DSP) blocks and a set of control and management logic. The CLBs are composed of look-up tables (LUTs), which implement the combinational logic, and flip-flops (DFFs), which implement the sequential elements. The routing architecture can be very complex, composed of millions of pre-defined wires that are configured by multiplexers and switches to build the desired routing.

The configuration of all CLBs, routing, Block RAMs, DSP blocks and I/O blocks is defined by a set of configuration memory bits called the bitstream. Depending on the size of the FPGA device, the bitstream can contain millions of bits. The memory that stores the bitstream inside the FPGA is composed of SRAM cells, so it is reprogrammable and volatile. When an SEE occurs in a configuration memory bit of an SRAM-based FPGA, it can provoke a bit-flip. This bit-flip can change the configuration of a routing connection, or the configuration of a LUT or flip-flop in a CLB. The impact on the designed circuit can be dramatic, since an SEE may change its functionality.

An SEE in the configuration memory bits of an SRAM-based FPGA has a persistent effect, which can only be corrected by loading a new bitstream into the FPGA. In the combinational logic, the effect of an SEE is a persistent fault (zero or one) in one or more configuration bits of a LUT. Figure 10.1 exemplifies an SEU in a LUT configuration bit and in a bit controlling a routing connection. An SEE in the routing architecture can connect or disconnect a wire in the matrix. This is also a persistent effect, and it can modify the mapped circuit, for example by changing the logic or creating a short circuit in the combinational logic implemented by the FPGA. It can take a great number of clock cycles before the persistent error is detected and recovery actions, such as loading a fault-free bitstream, are initiated. During this time, the error can propagate to the rest of the system.

Bit-flips can also occur in the flip-flop of the CLB used to implement the user’s sequential logic. In this case, the bit-flip has a transient effect and the next load of the flip-flop will correct it.
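To make the difference between a persistent configuration upset and a transient flip-flop upset more concrete, the following minimal Python sketch models a 4-input LUT as a 16-bit truth table and flips one of its configuration bits. The helper names (`lut_eval`, `flip_bit`) and the choice of a 4-input AND function are assumptions made for this illustration; they do not describe any specific device.

```python
def lut_eval(config: int, inputs: tuple) -> int:
    """Evaluate a LUT whose truth table is stored in the bits of `config`.

    inputs[0] is the least significant address bit.
    """
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (config >> address) & 1


def flip_bit(config: int, position: int) -> int:
    """Model an SEU: invert one configuration bit."""
    return config ^ (1 << position)


# Hypothetical 4-input AND: only address 0b1111 (15) returns 1.
AND4 = 1 << 15

# SEU in the configuration bit at address 0b0111 (7).
upset = flip_bit(AND4, 7)

golden_out = lut_eval(AND4, (1, 1, 1, 0))   # fault-free LUT returns 0
faulty_out = lut_eval(upset, (1, 1, 1, 0))  # upset LUT returns 1

# Only rewriting the configuration bit (reconfiguration) restores the function.
repaired = flip_bit(upset, 7)
```

In this model the wrong output for the input pattern (1, 1, 1, 0) remains on every evaluation until the configuration word is rewritten, which is the behaviour that the text calls persistent.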

2.2 Single Event Effects on Flash-Based FPGAs

Well-known Flash-based FPGAs come from Microsemi, which offers the ProASIC3 and SmartFusion families [7]. The reconfigurable array is composed of VersaTiles and routing resources that are programmed by turning ON or OFF switches implemented with floating-gate (FG) transistors (NMOS transistors with a stacked gate). The FG switch circuit is a set of two NMOS transistors: (1) a sense transistor, used to program the floating gate and to sense the current during the threshold voltage measurement, and (2) a switch transistor, used to turn ON or OFF a data path in the FPGA (Fig. 10.2). The two transistors share the same control gate and floating gate. The threshold voltage is determined by the charge stored in the FG. Figure 10.3 illustrates VersaTiles used to implement some common logic gates. The VersaTiles are connected through a four-level hierarchy of routing resources: ultra-fast local resources; efficient long-line resources; high-speed, very-long-line resources; and the high-performance VersaNet networks.

Each VersaTile can implement any 3-input logic function, which makes it functionally equivalent to a 3-input look-up table (3-LUT). It is important to highlight, however, that the electrical implementation of the VersaTile is totally different from that of a LUT. Hence, the VersaTile may respond differently to variability effects than a 3-input LUT. The VersaTile can also implement a latch with clear and reset, a D flip-flop with clear or reset, or an enable D flip-flop with clear and reset, by using the logic gate transistors and feedback paths inside the VersaTile block. For each configuration of the VersaTile block, the number of FG switches and transistors in the critical path changes. A particle can hit the drain of a transistor in the OFF state, as presented in Fig. 10.2, provoking a transient pulse (SET) in the configuration switches. It can also hit the sensitive nodes of the transistors in the VersaTile, provoking an SET or a bit-flip depending on the customization of the tile (Fig. 10.3).

2.3 Single Event Effects on Antifuse-Based FPGAs

Antifuse-based FPGAs consist of a regular matrix of combinational cells (C-cells) and sequential cells (R-cells) surrounded by regular routing channels. One well-known antifuse-based FPGA family is from Microsemi. All the customizations of the routing, the C-cells and the R-cells are done through antifuse elements (programmable switches). Results from radiation ground testing have shown that programmable switches based on either ONO (oxide–nitride–oxide) or MIM (metal–insulator–metal) technology are tolerant to ionization and total dose effects [8]. Therefore, the customizable routing is not sensitive to SEU; only the combinational logic and the flip-flops used to implement the user's sequential logic are sensitive to SEE.

Another well-known antifuse-based FPGA family comes from Aeroflex and QuickLogic. Its architecture is composed of a regular matrix of configurable logic cells, used to implement the combinational logic and flip-flops, surrounded by a regular routing matrix. Programmable switches called ViaLink connectors are used for all the customizations.

To summarize the SEU and SET effects in FPGAs, Table 10.1 shows the susceptible parts of each architecture and classifies the effects as transient or as persistent, where persistent means that reconfiguration is needed to correct the fault.

Table 10.1 Summary of SEU and SET effects in FPGAs

3 Fault Tolerance Techniques for FPGAs

Different fault tolerance techniques can be applied to FPGAs according to their configuration technology, architecture and target operating environment [6]. Techniques can be implemented by the user in the hardware description language (HDL) code of the design, before the design is synthesized into the FPGA. Alternatively, techniques can be developed by the vendor, which provides an FPGA that is robust to SEE by layout. Figure 10.4 illustrates these two options. Here we focus on techniques that can be applied by the user at the HDL level.

The main techniques are based on either spatial redundancy or temporal redundancy [9]. Spatial redundancy replicates the original module n times, building n identical redundant modules whose outputs are merged by a voter. Usually n is an odd number. The voter decides the correct output by choosing the value on which the majority of the modules agree. The most common case of n-modular redundancy (nMR) is n = 3, which is called Triple Modular Redundancy (TMR). In this case, a majority voter selects the value on which at least 2 out of the 3 modules agree, so a fault in a single module is masked. In local TMR only the flip-flops are triplicated, whereas in global TMR all the combinational and sequential logic is triplicated. TMR can be implemented at different granularities, either as large-grain TMR or by breaking the design into small blocks and adding extra voters. Each option can protect against SEU, SET or both, as shown in Table 10.2.
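The 2-out-of-3 vote described above can be sketched in a few lines of Python. This is only a behavioural illustration: the `tmr` wrapper, the `fault_mask` parameter used to inject errors, and the 8-bit `increment` module are hypothetical names introduced for this example.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three redundant words."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x, fault_mask=(0, 0, 0)):
    """Run three copies of `module`; XOR a fault mask into each copy's output."""
    outputs = [module(x) ^ m for m in fault_mask]
    return majority(*outputs)


def increment(x: int) -> int:
    """Hypothetical 8-bit module under protection."""
    return (x + 1) & 0xFF


# One faulty copy: the voter masks the error.
masked = tmr(increment, 41, fault_mask=(0, 0b100, 0))

# Two copies with the same error: the voter follows the wrong majority.
unmasked = tmr(increment, 41, fault_mask=(0b100, 0b100, 0))
```

The second call shows the limit of TMR: once two of the three copies are corrupted in the same bit, the majority is wrong, which is why accumulated upsets matter in the sections that follow.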

Table 10.2 List of mitigation techniques that can be applied by the User in Designs targeting FPGAs

When dealing with the routing, different techniques can be chosen to increase or decrease the fan-out, the delay and the set of connections, each of which may have a different impact on the SEE sensitivity [10, 11]. Embedded processors can also use mitigation techniques based on software redundancy, or on processor redundancy such as lock-step and recomputation.

Time redundancy is based on capturing a value two or three times at different instants in order to vote out a transient fault. The captured values are shifted in time by a delay [9]. The idea is to capture at least 2 out of 3 upset-free values so that the fault can be masked.
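A small behavioural model can illustrate why the delay must be chosen with the transient width in mind. The sketch below samples a 1-bit signal at three instants separated by a delay and votes on the samples; the function names and the picosecond values are assumptions chosen for illustration, not measured data.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def sample(correct: int, t_ps: int, set_start_ps: int, set_width_ps: int) -> int:
    """1-bit value seen at time t_ps; an active SET pulse inverts it."""
    hit = set_start_ps <= t_ps < set_start_ps + set_width_ps
    return correct ^ 1 if hit else correct


def temporal_vote(correct, t0_ps, delay_ps, set_start_ps, set_width_ps):
    """Capture the signal at t0, t0 + delay and t0 + 2*delay, then vote."""
    samples = [sample(correct, t0_ps + k * delay_ps, set_start_ps, set_width_ps)
               for k in range(3)]
    return majority(*samples)


# A 500 ps SET starting at 9800 ps; first capture at 10000 ps.
filtered = temporal_vote(1, t0_ps=10_000, delay_ps=1_000,
                         set_start_ps=9_800, set_width_ps=500)

# Same SET, but a 200 ps delay: two of the three captures fall inside the pulse.
not_filtered = temporal_vote(1, t0_ps=10_000, delay_ps=200,
                             set_start_ps=9_800, set_width_ps=500)
```

In this model the transient is masked only when the delay between captures exceeds the pulse width, so that at most one capture can land inside the pulse.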

3.1 SRAM-Based FPGAs

In SRAM-based FPGAs, radiation-induced faults have a persistent effect, so spatial redundancy is needed to mask the upset, combined with reconfiguration to correct the fault in the configuration memory bits (bitstream). Figure 10.5 shows the flow. The design can be protected by global TMR (called XTMR by Xilinx), either using the XTMR tool or implemented by hand, and full or partial reconfiguration, called scrubbing, must be applied from time to time to correct the upsets. Fault injection can be used to evaluate the efficiency of the fault tolerance technique and to verify its masking capability under single upsets.

Global TMR, or XTMR, consists of triplicating all the combinational and sequential logic, inputs, outputs and clock trees, as illustrated in Fig. 10.6. Note that the majority voter can be placed after flip-flops in throughput logic or after flip-flops in a feedback path. In the second case, the majority voters must be able to correct the bit-flip in the flip-flops to avoid the accumulation of errors. The voter used at the outputs is a minority voter: it blocks the output of a redundant copy if that copy differs from the other two.
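The behaviour of the output minority voter can be sketched as follows. This is a simplified 1-bit model: representing a blocked output as `None` (for instance a pin that stops driving) and merging the remaining pins onto one net in `resolve` are assumptions of this sketch, and all function names are hypothetical.

```python
def minority(own: int, other1: int, other2: int) -> bool:
    """True when `own` disagrees with both other redundant copies (1-bit values)."""
    return own != other1 and own != other2


def tmr_output_pins(tr0: int, tr1: int, tr2: int):
    """Value driven by each of the three output pins; None models a blocked pin."""
    copies = (tr0, tr1, tr2)
    pins = []
    for i, own in enumerate(copies):
        others = copies[:i] + copies[i + 1:]
        pins.append(None if minority(own, *others) else own)
    return pins


def resolve(pins):
    """Value on the merged net: the common value of the pins still driving."""
    driven = {p for p in pins if p is not None}
    if len(driven) != 1:
        raise ValueError("driving pins disagree or no pin is driving")
    return driven.pop()


pins = tmr_output_pins(1, 0, 1)   # the second copy is upset and gets blocked
net = resolve(pins)               # the two agreeing copies still drive 1
```

With 1-bit values at most one copy can be in the minority, so two pins always keep driving the same value in this model.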

Scrubbing, or reconfiguration, can be full or partial and can be performed by the internal block called ICAP or through the SelectMAP interface. Scrubbing is mandatory to correct the upsets (bit-flips) in the configuration memory bits. Figure 10.7 shows the scrubbing time for different FPGA families. Full scrubbing can take a significant time in large FPGAs, more than 50 ms. This means that some applications will run for many clock cycles between two scrubbing corrections. Consequently, spatial redundancy techniques are very important to ensure fault masking at the output between scrubbing passes.

Scrubbing can also be costly in terms of power. On average, the scrubbing rate should be at least three times higher than the upset rate of the FPGA. To reduce the scrubbing rate, it is necessary to ensure that the spatial redundancy technique, such as XTMR, can tolerate more than a single fault.
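The two figures quoted above (a full scrub of about 50 ms and a scrubbing rate at least three times the upset rate) can be turned into a quick sizing check. The sketch below is only a back-of-the-envelope aid; the function names, the 100 MHz clock and the 0.05 upsets/s rate are illustrative assumptions.

```python
def cycles_between_scrubs(scrub_time_s: float, clock_hz: float) -> int:
    """Clock cycles the application runs during one full scrub pass."""
    return round(scrub_time_s * clock_hz)


def max_scrub_period(upset_rate_per_s: float, margin: float = 3.0) -> float:
    """Longest scrub period (s) keeping the scrub rate `margin` times the upset rate."""
    if upset_rate_per_s <= 0:
        raise ValueError("upset rate must be positive")
    return 1.0 / (margin * upset_rate_per_s)


def scrub_is_feasible(upset_rate_per_s: float, full_scrub_time_s: float,
                      margin: float = 3.0) -> bool:
    """A back-to-back scrubber cannot cycle faster than one full scrub time."""
    return full_scrub_time_s <= max_scrub_period(upset_rate_per_s, margin)


# Illustrative values: 50 ms full scrub, 100 MHz design clock,
# and 0.05 upsets/s (a rate typical of accelerated testing, not of the field).
cycles = cycles_between_scrubs(50e-3, 100e6)   # 5 million cycles
period = max_scrub_period(0.05)                # about 6.67 s
feasible = scrub_is_feasible(0.05, 50e-3)      # True
```

If the upset rate were high enough that the required period dropped below the full scrub time, the check would fail, which is the situation where a redundancy scheme able to tolerate more than one fault becomes necessary.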

The fault tolerance can be improved by exploring different TMR granularities through voter insertion, as shown in Fig. 10.8 [12].

Another option is to use Diversity Triple Modular Redundancy (DTMR) to increase the tolerance to multiple and accumulated upsets in the configuration memory bits. A DTMR system is built by connecting three diversified copies to a majority voter [13]. Figure 10.9 illustrates an example of a DTMR system. Each copy can be implemented with a different architecture, a different processor or a different logic granularity, among other options.

Another option to increase the masking capability under multiple upsets is to use an nMR-based technique. The nMR scheme is composed of n functionally identical modules, which receive the same m-bit input and deliver a p-bit output to the Self-Adapted voter (SAv), as shown in Fig. 10.10 [14]. The SAv receives n × p bits from all modules and generates the fault-free p-bit output, n error status flags (ESF) and a non-masked fault signal (NMF). In this scheme, the system tolerates the accumulation of defective modules as long as at least two modules remain fault-free. The SAv is a majority voter whose voting population is the set of fault-free modules.
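One possible behavioural reading of such a voter is sketched below. It is not the implementation from [14]: the class name, the sticky error status flags and the exact condition used to raise the non-masked fault signal are assumptions made to illustrate the idea of a voting population that shrinks as modules fail.

```python
from collections import Counter


class SelfAdaptedVoter:
    """Majority voter whose voting population excludes modules flagged as faulty."""

    def __init__(self, n: int):
        self.esf = [False] * n  # one error status flag per module (sticky here)

    def vote(self, outputs):
        """Return (voted_output, nmf) and update the error status flags."""
        active = [i for i, bad in enumerate(self.esf) if not bad]
        counts = Counter(outputs[i] for i in active)
        value, votes = counts.most_common(1)[0] if counts else (None, 0)
        # Non-masked fault: fewer than two agreeing modules, or no strict majority.
        nmf = votes < 2 or 2 * votes <= len(active)
        if nmf:
            return None, True
        for i in active:
            if outputs[i] != value:
                self.esf[i] = True  # drop this module from future votes
        return value, False


voter = SelfAdaptedVoter(5)
first = voter.vote([7, 7, 7, 7, 7])    # all modules agree
second = voter.vote([9, 7, 7, 7, 7])   # module 0 fails and is flagged
third = voter.vote([9, 8, 7, 7, 7])    # module 1 fails and is flagged
```

In this sketch the output stays correct while faulty modules accumulate one at a time, and the NMF flag is raised once fewer than two agreeing fault-free modules remain.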

3.2 Flash-Based and Antifuse-Based FPGAs

The configuration of flash-based and antifuse-based FPGAs is not sensitive to radiation. Faults such as SET and SEU can therefore only occur in the combinational and sequential logic, and they have a transient effect. As a result, well-known techniques used in Application Specific Integrated Circuits (ASICs) can be applied to those FPGAs.

When SEUs can be observed in the user’s flip-flops of the FPGA, a technique based on local TMR, where only the flip-flops are triplicated and voted, can work well (Fig. 10.11).

When SETs are an issue, as well as SEUs in the configurable flash memory cells, a technique based on global TMR, where all the combinational logic and flip-flops are triplicated and voted, can work well, as illustrated in Fig. 10.12. Alternatively, the designer can choose a temporal filtering technique, where the SET is filtered by a chosen delay added to the clock trees or to the logic path (Fig. 10.13). In this case, the idea is that each flip-flop captures the data at a different moment, so that at least two flip-flops out of three hold the fault-free value.

4 Radiation Test Methodologies to Predict and Measure SER in FPGAs

The test of FPGAs under radiation depends on a test plan developed for each type of FPGA and design architecture. Here we detail the radiation test for SRAM-based FPGAs. There are two types of tests: the static test and the dynamic test. The static test consists of configuring the FPGA with a golden bitstream containing the test design and then continuously reading back the FPGA configuration memory with the Xilinx iMPACT tool through the JTAG interface. In the experiment control computer, the readback bitstream is compared against the golden bitstream. If differences are found, they are stored in the computer and the FPGA is reconfigured with the golden bitstream. A fault is defined as any bit-flip in the configuration memory detected by the readback procedure. In this way, it is possible to calculate the upset rate in the configuration memory bits for a specific particle flux, expressed in particles/cm²/s.
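The comparison step of the static test can be sketched as follows. This is a simplified model that works on byte strings already held in memory; it does not call iMPACT or any JTAG tool, it ignores any masking of bits that may legitimately change during operation, and the function names are hypothetical.

```python
def diff_bitstreams(golden: bytes, readback: bytes):
    """Return the (byte_offset, bit_index) positions where the two images differ."""
    if len(golden) != len(readback):
        raise ValueError("bitstreams must have the same length")
    flips = []
    for offset, (g, r) in enumerate(zip(golden, readback)):
        diff = g ^ r
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((offset, bit))
    return flips


def static_test(golden: bytes, readbacks):
    """Compare successive readbacks with the golden image and log the bit-flips."""
    log = []
    for step, readback in enumerate(readbacks):
        flips = diff_bitstreams(golden, readback)
        if flips:
            log.append((step, flips))  # differences stored on the control computer
            # At this point the real setup would reload the golden bitstream.
    total_upsets = sum(len(flips) for _, flips in log)
    return total_upsets, log


golden = bytes([0x00, 0xFF, 0xA5])
readbacks = [golden, bytes([0x01, 0xFF, 0xA5]), bytes([0x00, 0x7F, 0xA4])]
total, log = static_test(golden, readbacks)
```

The total number of logged bit-flips is the event count that enters the cross-section calculation in Eq. (10.1).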

The cross-section (σ) is the sensitive area of a circuit, where one particle may cause an upset event. It is calculated using Eq. (10.1), where fluence = flux × time, expressed in particles/cm². The cross-section is expressed in cm².

$$ \sigma = \frac{\#\,\text{events}}{\text{fluence}} \tag{10.1} $$

There are two types of cross-section: static and dynamic. The static cross-section, also known as the device cross-section (σ_device), is defined as the ratio between the number of upsets in the configuration memory bits of the SRAM-based FPGA (events) and the fluence of hitting particles. Usually a normalized cross-section, or bit cross-section (σ_bit), is used, in which the device cross-section is divided by the total number of configuration memory bits. The static cross-section quantifies the sensitivity of the FPGA technology to a specific radiation source.

The dynamic cross-section, on the other hand, is defined as the ratio between the number of errors observed at the output of the design configured into the SRAM-based FPGA (events) and the fluence of hitting particles. The dynamic cross-section quantifies the sensitivity of the implemented design to a specific radiation source. The rate of errors is defined as the soft error rate (SER). In this case, the expected error rate is much lower than the upset rate measured in the static test. According to the Xilinx Reliability Report [15], on average 20 upsets in the configuration memory bits are needed to provoke one error at the design output. This ratio may of course vary with the logic density, mapping, routing and architecture chosen for the design. The number can triple when XTMR is used, and it can increase even further, by 6 times or more, if DTMR or nMR is used.
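The relations between device, bit and dynamic cross-sections can be put into numbers with a short sketch. The estimate of the dynamic cross-section as the static one divided by the average number of upsets per output error is a rough approximation assumed here for illustration, and the configuration memory size is a hypothetical value; the static cross-section is borrowed from the worked example later in this section.

```python
def bit_cross_section(sigma_device_cm2: float, n_config_bits: int) -> float:
    """Normalized (per-bit) static cross-section in cm^2/bit."""
    return sigma_device_cm2 / n_config_bits


def estimated_dynamic_cross_section(sigma_device_cm2: float,
                                    upsets_per_error: float) -> float:
    """Rough dynamic cross-section if `upsets_per_error` upsets yield one output error."""
    return sigma_device_cm2 / upsets_per_error


SIGMA_DEVICE = 1.29e-6   # cm^2, static cross-section from the worked example
N_BITS = 10_000_000      # hypothetical configuration memory size

sigma_bit = bit_cross_section(SIGMA_DEVICE, N_BITS)                     # 1.29e-13 cm^2/bit
sigma_unprotected = estimated_dynamic_cross_section(SIGMA_DEVICE, 20)   # 6.45e-8 cm^2
sigma_xtmr = estimated_dynamic_cross_section(SIGMA_DEVICE, 3 * 20)      # 2.15e-8 cm^2
```

In practice the dynamic cross-section is measured per design under the beam; an estimate of this kind only indicates the order of magnitude to expect.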

Notice that the SER is proportional to both the device sensitivity (cross-section) and the flux, as shown in Eq. (10.2).

$$ \text{SER} = \text{flux} \times \sigma_{\text{SEU}} \tag{10.2} $$

When a particle (such as a neutron, proton or heavy ion) hits a device, part of its energy is deposited in the device. The linear energy transfer (LET) expresses the energy loss per unit length (dE/dx) of a particle and is a function of the mass and energy of the particle as well as the density of the target material. LET is commonly expressed in MeV·cm²/g. Radiation experiments with charged particles commonly report the relation between cross-section and LET, which also depends on the incidence angle.

Figure 10.14 illustrates an experimental setup under neutrons at the ISIS facility in the United Kingdom, where the static and dynamic cross-sections of SRAM-based FPGAs can be measured. Then, knowing the sea-level neutron flux, the SER of the device and of the application circuit running at ground level can be inferred.

For example, the device was exposed for 22 min to a neutron flux of 4.11 × 10⁴ neutrons/cm²/s and 70 events were counted. The fluence is 4.11 × 10⁴ × 1,320 s ≈ 5.43 × 10⁷ neutrons/cm², so by Eq. (10.1) the cross-section is 70 / (5.43 × 10⁷) ≈ 1.29 × 10⁻⁶ cm². By Eq. (10.2), the SER under the beam is 5.3 × 10⁻² errors per second. Knowing that the neutron flux at sea level is 13 neutrons/cm²/h, the expected error rate at sea level can be inferred as 1.29 × 10⁻⁶ cm² × 13 neutrons/cm²/h ≈ 1.68 × 10⁻⁵ errors per hour.
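The arithmetic of this example can be reproduced with a few lines of Python. The sketch assumes that the beam flux is given in neutrons/cm²/s, which is the unit consistent with the reported cross-section and SER; the function and constant names are chosen for this illustration only.

```python
def fluence(flux: float, exposure_s: float) -> float:
    """Fluence (particles/cm^2) = flux (particles/cm^2/s) x time (s)."""
    return flux * exposure_s


def cross_section(events: int, fluence_per_cm2: float) -> float:
    """Eq. (10.1): cross-section in cm^2."""
    return events / fluence_per_cm2


def soft_error_rate(flux: float, sigma_cm2: float) -> float:
    """Eq. (10.2): errors per unit of time used in `flux`."""
    return flux * sigma_cm2


BEAM_FLUX = 4.11e4          # neutrons/cm^2/s (unit assumed, see lead-in)
EXPOSURE_S = 22 * 60        # 22 minutes in seconds
EVENTS = 70
SEA_LEVEL_FLUX_PER_H = 13   # neutrons/cm^2/h

sigma = cross_section(EVENTS, fluence(BEAM_FLUX, EXPOSURE_S))   # about 1.29e-6 cm^2
ser_beam = soft_error_rate(BEAM_FLUX, sigma)                    # about 5.3e-2 errors/s
ser_sea_level = soft_error_rate(SEA_LEVEL_FLUX_PER_H, sigma)    # about 1.7e-5 errors/h
```

Note that the SER under the beam reduces to the event count divided by the exposure time (70 events in 1,320 s), while the sea-level figure rescales the same cross-section to the natural flux.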

Conclusions

Each type of FPGA presents a different susceptibility to soft errors, and distinct fault tolerance techniques may be chosen according to its upset rate and architecture. From experiments under neutron-induced faults, it is possible to calculate the static and dynamic cross-sections of SRAM-based FPGAs and to analyze their tolerance to multiple and accumulated upsets. The static cross-section can be measured for each SRAM-based FPGA device, whereas the dynamic cross-section must be measured for each fault-tolerant design implemented in the FPGA. For a device with a given static cross-section, there will be many dynamic cross-sections, depending on the fault tolerance technique chosen.

References

J. Barth, “Applying Computer Simulation Tools to Radiation Effects Problems”, in: IEEE Nuclear Space Radiation Effects Conference Short Course, NSREC, 1997.
O. Flament, J. Baggio, C. D'hose, G. Gasiot, J.L. Leray, “14 MeV neutron-induced SEU in SRAM devices,” IEEE Transactions on Nuclear Science, vol. 51, no. 5, pp. 2908–2911, Oct. 2004.
M. Berg, “Fault tolerance implementation within SRAM based FPGA designs based upon the increased level of single event upset susceptibility,” On-Line Testing Symposium, IOLTS 2006.
P. E. Dodd and L. W. Massengill, “Basic mechanisms and modeling of single-event upset in digital microelectronics,” IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583–602, Jun. 2003.
T. R. Oldham, F. B. McLean, “Total Ionizing Dose Effects in MOS Oxides and Devices,” IEEE Transactions on Nuclear Science, vol. 50, no. 3, 2003. pp. 483-498.
F. L. Kastensmidt, R. Reis, L. Carro, Fault-Tolerance Techniques for SRAM-Based FPGAs (Frontiers in Electronic Testing), Springer, 2006.
Actel. ProASIC3, IGLOO and SmartFusion Flash Family FPGAs Datasheet. [Online]. Available: http://www.actel.com/documents/PA3_HB.pdf and http://www.actel.com/documents/IGLOO_HB.pdf
Wang, J.J., RTAXS Single Event Effects Test Rep., Aug. 2004 [available on-line at http://www.actel.com/documents/RTAXS_SEE_Report.pdf]
Anghel, L., Alexandrescu, D., Nicolaidis, M., Evaluation of a soft error tolerance technique based on time and/or space redundancy, in the Proceedings of Symposium on Integrated Circuits and Systems Design, SBCCI, 13, 2000. p. 237-242.
L. Sterpone, M. Sonza Reorda, M. Violante, RoRA: Reliability-oriented Place and Route for SRAM-based FPGAs, PRIME05: IEEE Ph.D. Research In Micro-Electronics & Electronics, 2005, pp. 147-150
L. Sterpone, D. Boyang, D. Merodio Codinachs, V. Ferlet-Cavrois, Accurate Mitigation of Single Event Effects on Flash-based FPGAs: A new Design Flow. RADECS 2013.
F. Kastensmidt, L. Sterpone, M. Sonza Reorda, L. Carro. On the Optimal Design of Triple Modular Redundancy Logic for SRAM-Based FPGAs, DATE2005: IEEE Design, Automation and Test in Europe, 2005, pp. 1290-1295
L. Tambara, F. Kastensmidt, J. Azambuja, E. Chielle, F. Almeida, G. Nazar, L. Carro, P. Rech, C. Frost. Evaluating the Effectiveness of a Diversity TMR Scheme under Neutrons, RADECS 2013.
J. Tarrillo, P. Rech, F. Kastensmidt, C. Valderrama, C. Frost, Neutron Cross-section of N-Modular Redundancy Technique in SRAM-based FPGAs. RADECS 2013.
Xilinx, Inc. (2013) “Device Reliability Report Third Quarter 2013” [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug116.pdf

Author information

Authors and Affiliations

Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
Fernanda Kastensmidt & Ricardo Reis

Corresponding author

Correspondence to Ricardo Reis.

Editor information

Editors and Affiliations

Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Rio Grande do Sul, Brazil
Ricardo Reis
School of ECEE, Arizona State University, Tempe, USA
Yu Cao
Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Rio Grande do Sul, Brazil
Gilson Wirth

About this chapter

Cite this chapter

Kastensmidt, F., Reis, R. (2015). Soft Error Rate and Fault Tolerance Techniques for FPGAs. In: Reis, R., Cao, Y., Wirth, G. (eds) Circuit Design for Reliability. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4078-9_10

DOI: https://doi.org/10.1007/978-1-4614-4078-9_10
Published: 16 October 2014
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-4077-2
Online ISBN: 978-1-4614-4078-9
eBook Packages: Engineering, Engineering (R0)
