
1 Introduction

In the previous chapters, many of the challenges in the design of reliable nanoscale devices have been described. Many of these challenges, such as manufacturing faults and transient faults, have existed for many generations of CMOS, and a large body of knowledge around design for test, redundancy, and hardening techniques has developed. Today, advances in CMOS are less the result of scaling and increasingly the result of innovation in process, materials, and new types of transistors. Because the dimensions of transistors are approaching the atomic scale, variability is increasing, and the total number of transistors per die often runs into the billions, new reliability challenges are emerging.

In this chapter, we address new approaches for dealing with these reliability threats, with a focus on the process level. In the first section, we explore new approaches for managing increased process variation. The following section discusses how transistor degradation due to gate oxide breakdown, BTI, HCI, and self-heating effects can be mitigated at the process level. Next is a section that discusses the trends in radiation sensitivity in the most recent CMOS nodes, including how this impacts the design of radiation-hardened cells. The total power drawn by large integrated circuits can be very significant; thus IR drop is a real problem. The final section of the chapter discusses this challenge and how voltage droop can be managed.

The focus of this chapter is how the reliability challenges in advanced CMOS devices can be managed, primarily at the materials, process, and technology level. Subsequent chapters will investigate higher level approaches, at the micro-architectural and architectural level.

2 Mitigation of Process Variation

Process variation affects the speed and power consumption of circuits. Some circuits may fail to meet the intended speed, if process variation is not taken into account during design. This results in a low yield. Conventionally, guard bands are built into the design to account for process variation. With the ongoing shrinking of CMOS technology, the effects of process variation become more and more pronounced. Because of this, a guard-banded design leads to an increased penalty in terms of area and power consumption.

A commonly applied technique to remedy the effects of process variation is speed binning. With speed binning, chips are tested extensively after production in order to find their maximum clock speed and classify them accordingly. The faster chips are then sold at higher prices, while the slower ones are sold at lower prices; this is what makes the extensive testing pay off. These days, however, processors have shifted from single core to multi-core, so an increase in clock speed has become less interesting and more cores are preferred instead. Because of this, speed binning is becoming less profitable.
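The classification step amounts to thresholding the measured maximum clock frequency; a minimal sketch, with hypothetical bin edges and names:

```python
# Toy illustration of speed binning: chips are classified into frequency
# bins based on their measured maximum clock speed. The bin edges and
# names below are hypothetical.
BIN_EDGES_GHZ = [3.6, 3.2, 2.8]                   # descending thresholds
BIN_NAMES = ["premium", "standard", "value", "reject"]

def bin_chip(f_max_ghz):
    """Return the bin name for a chip with measured maximum frequency."""
    for edge, name in zip(BIN_EDGES_GHZ, BIN_NAMES):
        if f_max_ghz >= edge:
            return name
    return BIN_NAMES[-1]                          # too slow for any speed bin

measured = [3.7, 3.3, 2.9, 2.5]                   # f_max of four tested chips
print([bin_chip(f) for f in measured])            # ['premium', 'standard', 'value', 'reject']
```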

The drawbacks with guard-banded design in terms of power and area, and the decrease in effectiveness of speed binning have led to the development of techniques to mitigate process variation. Thanks to these techniques, high yield is guaranteed without adding excessive guard bands.

2.1 Classification

Figure 1 shows a high-level classification of process variation mitigation schemes. As can be seen, the schemes can be divided into static and dynamic ones. Static schemes are used during the design or the manufacturing of the chip; they can even be tuned once before deploying the chip in the application. However, these schemes cannot be tuned at runtime during the lifetime of the chip, which is the case for dynamic schemes. Dynamic schemes monitor the circuit's behavior at runtime and take action when necessary, in order to prevent errors in the circuit.

Fig. 1 Process variation mitigation classification

2.2 Static Schemes

Figure 2 shows the classification of the most common static process variation mitigation schemes. As can be seen, schemes can be applied either during the process phase or during the design phase; they are discussed next.

Fig. 2 Static process variation mitigation schemes classification

2.3 Process Schemes

The process schemes include all techniques applied during fabrication to minimize process variation; these may be related to the used materials or to techniques to increase the resolution of structures printed on silicon, called Resolution Enhancement Techniques (RETs).

Material: The materials used for the production of semiconductor devices are constantly evaluated and improved, especially for emerging devices. A good example of a newly introduced material that helped mitigate process variation is the high-κ dielectric introduced at the 45 nm technology node [1]. High-κ dielectrics replace the conventional silicon dioxide (SiO2) that was used as the gate oxide material. The gate oxide thickness used to decrease steadily as transistors shrank, until leakage currents became a concern and the scaling slowed down. With the introduction of high-κ dielectrics for the gate oxide, gate leakage is reduced, making further gate oxide scaling possible. This scaling has a positive effect on random variations due to Random Dopant Fluctuation (RDF), because the matching of transistors improves when the gate oxide thickness decreases [2].

Resolution Enhancement Techniques: The current lithographic process for making chips utilizes ultraviolet light with a wavelength of 193 nm. As the dimensions of current nanometre-scale technologies are only a fraction of this wavelength, it becomes difficult to print the required patterns. This is due to diffraction effects, which defocus the patterns printed on silicon. These diffraction effects have led to the introduction of several Resolution Enhancement Techniques (RETs), which counteract the diffraction effects and, thus, increase the resolution of the lithography step. Thanks to the increased resolution, process variation is reduced as well.

A common RET is to use phase-shift masks [3]. Phase-shift masks alter the phase of the light passing through certain areas of the mask, which in turn changes the way the light is diffracted and, therefore, the defocusing effect is reduced.

Another RET is Optical-Proximity Correction (OPC) [4,5,6]. OPC pre-distorts the mask data in order to compensate for image errors due to diffraction effects. The pre-distortion is done by moving edges or adding extra polygons to the patterns on the mask. This results in a better printability.

Finally, double patterning is also a technique to increase the resolution of printed patterns [7]. With double patterning, a dense pattern with a small pitch is split over two masks, each of which contains structures at a relaxed (larger) pitch. The dense pattern is then printed with two exposure steps, one for each mask. The combination of the two exposures results in a pitch on silicon that is hard, if not impossible, to achieve with a single patterning process.
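The pitch-splitting idea can be illustrated with a short sketch; the 40 nm target pitch is an arbitrary example value:

```python
# Sketch of pitch splitting in double patterning: a dense line array at
# pitch p is decomposed into two masks, each holding every other line at
# the relaxed pitch 2p. The 40 nm target pitch is an arbitrary example.
target_pitch_nm = 40
lines = [i * target_pitch_nm for i in range(8)]   # positions of the dense lines

mask_a = lines[0::2]    # even lines -> first exposure
mask_b = lines[1::2]    # odd lines  -> second exposure

def pitch(positions):
    return positions[1] - positions[0]

# Each mask only needs the relaxed 80 nm pitch; combined they print 40 nm.
print(pitch(lines), pitch(mask_a), pitch(mask_b))   # 40 80 80
```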

2.4 Design Schemes

During the design of a chip, various methods, such as regular layout styles, gate sizing, and Statistical Static Timing Analysis (SSTA), can be applied to mitigate process variation. These techniques are discussed below.

Regular Layout: Regular layout styles aim at simplifying the patterns that need to be printed on silicon. This is achieved by adding more regularity and symmetry to the design. Regular layout techniques reduce variability that occurs due to lithography distortion.

Regularity is added to the layout, for instance, by only allowing a fixed device orientation and routing direction per layer. Ultra-regular and semi-regular layouts, of which example circuits are shown in Fig. 3, were proposed in [8]. The ultra-regular layouts use a single-device orientation, constant poly pitch, and the direction of the routing is fixed. With the semi-regular layout, the width and spacing for the geometries are held as constant as possible with minor deviations allowed.

Fig. 3 Ultra-regular and semi-regular layouts

A higher regularity than the ultra-regular layouts can be achieved by using only a single or limited set of highly optimized basic building blocks and repeating these blocks, as is the case for the Via-Configurable Transistor Array (VCTA) cell proposed in [9]. The VCTA cell maximizes regularity for both devices and interconnects. It consists of n NMOS gates and n PMOS gates; to maximize regularity, all transistors have the same width and channel length. On top of the VCTA cell, a fixed and regular interconnect grid of parallel metal lines is placed. The functionality of the cell can be configured by connecting the transistors in the cell in a certain way, which is done by making connections between the metal lines and transistors using vias. With this, the via placement and inter-cell interconnections are the only sources of irregularity in the layout of the design.

One of the advantages of regular layouts is the yield improvement due to the reduction of process variability. Another advantage is the acceleration of the time-to-market due to the lower number of basic cells and layout patterns that need to be optimized. A disadvantage of regular layouts is an increase in the area with the associated delay and power consumption. Furthermore, some regular layouts have a fixed transistor width, which may make it difficult to meet delay specifications on all paths, and will also increase power consumption.

Statistical Static Timing Analysis (SSTA): Traditionally, designers use corner analysis to ensure that the design will meet its timing specification under all cases of process variation. In corner analysis, all electrical parameters are assumed to lie within certain bounds; the design is valid if the circuit meets the performance constraints at all corners. Corner analysis is often performed with Static Timing Analysis (STA), which analyzes a circuit's timing by adding the worst-case propagation times of the gates along a path. This worst-case assumption is only realistic if the process variations are of a systematic nature. In nanometre-scale technologies, however, random variations are more dominant, which means it is unlikely that all gates in a path will show worst-case propagation times. Therefore, STA leads to an overly pessimistic design.

As an alternative to STA, statistical static timing analysis (SSTA) has been proposed [10]. In SSTA, the worst-case propagation times of gates are replaced by probability density functions (PDFs). These PDFs are then propagated through the logic network, to determine the final PDF of the propagation delay of the circuit. With the final distribution, direct insight can be obtained on the yield of the design. Therefore, high yield can be achieved without adding excessive margins.
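The contrast between STA's worst-case summation and SSTA's distribution propagation can be illustrated with a small Monte Carlo sketch; all delay numbers are invented for illustration:

```python
# Monte Carlo sketch of the idea behind SSTA: instead of summing worst-case
# gate delays (as in STA), each gate delay is treated as a random variable
# and the path-delay distribution is estimated. All numbers are invented.
import random
random.seed(0)

N_GATES = 20
MEAN_PS = 10.0      # nominal gate delay
SIGMA_PS = 1.0      # random (uncorrelated) per-gate variation
CLOCK_PS = 230.0    # timing target

# STA: every gate at its +3-sigma worst case -> 260 ps, target missed.
sta_worst = N_GATES * (MEAN_PS + 3 * SIGMA_PS)

# SSTA-style view: sample the path delay; independent variations average out.
samples = [sum(random.gauss(MEAN_PS, SIGMA_PS) for _ in range(N_GATES))
           for _ in range(100_000)]
yield_est = sum(d <= CLOCK_PS for d in samples) / len(samples)

print(f"STA worst case: {sta_worst:.0f} ps")
print(f"estimated yield at {CLOCK_PS:.0f} ps: {yield_est:.4f}")
```

The worst-case sum fails the 230 ps target, yet the statistical view shows essentially full yield, which is exactly the pessimism SSTA removes.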

Obviously, SSTA leads to lower die area and reduced power consumption. However, it increases the design time, due to the higher complexity of SSTA as compared to STA.

Gate Sizing: With gate sizing, an attempt is made to find the optimal drive strength of the gates in the circuit in order to obtain a trade-off between delay, power consumption, and area. The drive strength is set by changing the size of the transistors in the gate, which enables optimization of the design under certain constraints. For instance, the design can be optimized for area and power consumption at a minimum target speed. Different optimization algorithms have been published in the literature [8,9,10].

Recent gate sizing techniques have started to take into account process variation as well [11, 12]. These techniques are referred to as variation aware gate sizing. They model process, voltage, and temperature variations using statistical methods. With these techniques power and area can be optimized at higher yield.

Gate sizing offers advantages such as reduced die area and lower power consumption. However, it complicates the design process.
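The underlying trade-off can be sketched with a first-order delay model in which delay falls with drive strength while power and area grow with it; the model and all numbers below are illustrative, not process-calibrated:

```python
# Toy gate sizing sweep: drive strength (relative size s) trades delay for
# power and area. The first-order model (delay = intrinsic + load/s,
# power ~ s) and all numbers are illustrative, not process-calibrated.
def gate_delay(s, c_load=8.0, d_intrinsic=1.0):
    return d_intrinsic + c_load / s   # arbitrary delay units

def gate_power(s):
    return s                          # power and area grow with size

TARGET_DELAY = 3.0

best = None
for x in range(10, 101):              # sweep sizes 1.0 .. 10.0 in 0.1 steps
    s = x / 10
    if gate_delay(s) <= TARGET_DELAY:
        best = s                      # smallest size meeting the target
        break

print(best, gate_delay(best), gate_power(best))   # 4.0 3.0 4.0
```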

2.5 Dynamic Schemes

Dynamic mitigation schemes monitor the circuit’s behavior online (in field) and when necessary actions are taken to prevent timing errors. Figure 4 shows the classification of the most common dynamic mitigation schemes. Note that dynamic schemes can be applied at the hardware or at the software level; both approaches are discussed below.

Fig. 4 Dynamic process variation mitigation schemes classification

2.6 Hardware-Based Schemes

As can be seen in Fig. 4, hardware techniques to mitigate process variation include voltage scaling, body biasing, adaptive clocking, and error detection and correction.

Dynamic Voltage Scaling (DVS): this is a technique where the supply voltage is scaled down in order to limit the dynamic power consumption. Most modern processors dynamically change the clock frequency based on throughput requirements in order to save energy. This tuning of the clock frequency can happen in conjunction with voltage scaling, as a lower frequency requires a lower voltage in order to still meet timing. Thanks to this scaling of the supply voltage, even more power is saved compared to only lowering the clock speed.

Conventional DVS techniques require enough supply voltage margin to cover process variations, which results in wasted energy. Therefore, variation aware DVS is proposed. With this technique on-chip monitors are added to the circuit to provide feedback on the process variation in the circuit. Based on this feedback, the supply voltage can be adjusted to the near minimum level needed to run without errors. Early papers are based on critical path replication and monitoring this replica. For instance, in [13] the critical path of the system is replicated with a ring oscillator. Based on the measured frequency of the ring oscillator, it is determined if the supply voltage can be lowered or should be increased. Due to the growing on-die process variation in nanometre-scale technologies, using a single reference structure is no longer feasible, because in this case extra margin is necessary. Furthermore, it is becoming more and more difficult to select a unique critical path across all conditions. A technique to emulate the actual critical path under different process and parasitic conditions was described in [14]. Thanks to the close tracking of the actual critical path, the supply voltage can be scaled down further. An in situ delay monitoring technique was proposed in [15]. For this, pre-error flip-flops are used; they are capable of detecting late data transitions. The power supply is then scaled based on the rate of errors.
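A variation-aware DVS controller can be sketched as a feedback loop that steps the supply voltage down as long as an on-chip monitor reports that timing is still met; the monitor below is a stand-in for a critical-path replica or in situ sensor, and all voltage values are illustrative:

```python
# Closed-loop sketch of variation-aware DVS: an on-chip monitor (here a
# stand-in for a critical-path replica or in situ sensor) reports whether
# timing is met at a given supply voltage; the controller steps V_DD down
# to the minimum passing level. All voltage values are illustrative.
V_MIN_MV, V_MAX_MV, STEP_MV = 600, 1000, 10

def monitor_ok(vdd_mv, required_mv=730):
    """Pretend monitor: timing is met iff V_DD is at least the (a priori
    unknown, variation-dependent) required level."""
    return vdd_mv >= required_mv

vdd_mv = V_MAX_MV
while vdd_mv - STEP_MV >= V_MIN_MV and monitor_ok(vdd_mv - STEP_MV):
    vdd_mv -= STEP_MV                 # lower V_DD while the monitor still passes

print(f"settled V_DD = {vdd_mv} mV")  # settles at the minimum passing 730 mV
```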

An advantage of DVS is a reduction in power consumption as the supply voltage is closer to its minimum value. A disadvantage of DVS is that it is mainly suitable for global variations, because it is difficult to find a unique critical path. To better account for local variations, more reference circuits for performance monitoring are needed on the die. To account even better for local variations, the circuit should be split up into multiple sub-circuits with separate supply voltages, so each sub-circuit is supplied with its minimum operation voltage. This results in even higher area overhead.

Body Biasing: this is a technique that allows the threshold voltage of a transistor to be changed. Body biasing utilizes the transistor's body effect, which refers to the dependence of the threshold voltage on the voltage difference between the source and body of the transistor. Normally, an NMOS transistor's body is connected to ground and a PMOS transistor's body to V DD. By applying different voltages at the body terminals, it is possible to control the threshold voltage of the transistors. In order to do this, the body terminals of the transistors need to be connected to separate power networks instead of V DD and ground. Through these power networks the body biasing voltages are then controlled.

When the threshold voltage is lowered with body biasing, it is called forward biasing. In this case the transistors will switch faster, making the circuit faster. This happens at the penalty of increased power consumption due to higher leakage. It is also possible to increase the threshold voltage; this is referred to as backward biasing. This makes the circuit less leaky, which leads to a lower power consumption at the cost of a slower circuit.

Body biasing can be used to mitigate process variation. On slow circuits forward biasing is performed in order to make them faster. On fast circuits, which suffer from higher leakage, backward biasing is performed. The required body biasing voltages can be applied with the use of on-chip sources, such as power regulators. Just like with DVS, on-chip monitors are added to the die that measure a test structure to determine the process variation. In [16, 17] a ring oscillator is used to measure the process variation in a circuit. Based on the ring-oscillator measurements the power regulators generate appropriate biasing voltages to mitigate the effects of process variation. Note that if only one test structure is measured, only global, systematic variations can be mitigated and still a margin is necessary to account for within-die variations. Accounting for within-die variations requires special attention; e.g., in [17], the authors proposed to divide the circuit into multiple sub-circuits with separate body bias networks. By monitoring ring oscillators close to the sub-circuits, each sub-circuit can then be supplied with unique biasing voltage. This way the within-die variation is compensated to a certain extent, which improves the frequency and the leakage of the circuit even more.
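The body effect behind this technique is commonly written in first order as V_th = V_th0 + γ(√(2φ_F + V_SB) − √(2φ_F)); the sketch below evaluates this expression with illustrative parameter values, not values from a real process:

```python
# First-order body-effect sketch: the threshold voltage shifts with the
# source-to-body voltage V_SB as
#   V_th = V_th0 + gamma * (sqrt(2*phi_F + V_SB) - sqrt(2*phi_F)).
# All parameter values are illustrative, not taken from a real process.
from math import sqrt

V_TH0 = 0.45    # zero-bias threshold voltage (V)
GAMMA = 0.40    # body-effect coefficient (V^0.5)
PHI_F2 = 0.80   # 2*phi_F surface potential term (V)

def vth(v_sb):
    return V_TH0 + GAMMA * (sqrt(PHI_F2 + v_sb) - sqrt(PHI_F2))

print(round(vth(0.0), 3))    # 0.45 (nominal)
print(round(vth(-0.3), 3))   # forward bias: lower V_th -> faster but leakier
print(round(vth(+0.3), 3))   # backward bias: higher V_th -> less leaky, slower
```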

An advantage of body biasing is a reduction in power consumption, as the leakage of chips from the fast corner is reduced. Just like with DVS, a disadvantage of body biasing is that it is mainly suitable for compensating global variations. To better account for local variations, more test structures are needed on the die and the circuit needs to be split up into sub-circuits, each of which has its own body bias network. This results in a higher area overhead.

Clock Stretching: under process variation, some circuits may fail to meet timing. Often, critical paths that exceed the maximum delay are responsible for this; critical paths have the least amount of timing margin and, therefore, are the first to fail. As a solution, clock stretching has been proposed. The idea is to stretch the clock when a critical path is activated. This gives the path more time to finish propagation and, therefore, timing errors are avoided. The concept of clock stretching is illustrated in Fig. 5. As can be seen, in cycle 2 the computation time, which indicates the highest propagation time of the activated paths in the circuit, exceeds the normal clock period. Therefore, the clock is stretched to two cycles in order to avoid timing errors.

Fig. 5 Illustration of clock stretching

One of the challenges with clock stretching is predicting when a critical path is activated. One way to realize this is to use a pre-decoder, as proposed in [18]; the pre-decoder takes as input the input vector to the logic. Based on this input vector, the pre-decoder predicts critical path activation in the circuit. When a critical path is activated, a signal is asserted to stretch the clock. An example of an adder with a pre-decoder to enable clock stretching is shown in Fig. 6. As can be seen, the pre-decoder monitors some of the adder's inputs and triggers clock stretching when needed.

Fig. 6 Adder circuit with pre-decoder for clock stretching
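For a ripple-carry adder, the critical path is a long carry-propagate chain, so a pre-decoder can flag input vectors that allow a long propagate run. The behavioral sketch below is our illustration of this idea, not the circuit of [18]:

```python
# Behavioral sketch of a pre-decoder for clock stretching on an n-bit
# ripple-carry adder. The critical path is a long carry chain: a carry
# propagates through bit i when a_i XOR b_i = 1. This toy pre-decoder
# flags input vectors that allow a long propagate run.
N_BITS = 8
STRETCH_THRESHOLD = 6    # stretch if a propagate run of >= 6 bits is possible

def longest_propagate_run(a, b):
    run = best = 0
    for i in range(N_BITS):
        if ((a >> i) ^ (b >> i)) & 1:   # propagate condition at bit i
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def needs_stretch(a, b):
    return longest_propagate_run(a, b) >= STRETCH_THRESHOLD

print(needs_stretch(0b01010101, 0b10101010))   # True: all 8 bits propagate
print(needs_stretch(0b00001111, 0b00001111))   # False: no propagate bits
```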

One of the challenges with using a pre-decoder to predict critical path activation is the area overhead and the additional wiring, especially for large circuits, which can make the pre-decoder relatively big. An alternative to the pre-decoder is the CRISTA design [19]; CRISTA isolates the critical paths and makes their activation predictable. This is achieved by partitioning and synthesizing the circuit into several separate logic blocks, each with a primary input indicating whether the block is active or idle. The design is synthesized in such a way that only some of these blocks contain critical paths. With the use of the active/idle signals, it is then easy to predict whether critical paths are sensitized. Figure 7 shows an example path delay distribution for a design with CRISTA, with the targeted delay of one cycle indicated. It can be seen that a set of paths exceeds this delay; CRISTA makes the activation of these paths predictable, and when one of them is activated, clock stretching is performed so there is enough time for the circuit to finish propagation. It can also be seen that the other set of paths has a lot of slack, which provides resilience against process variation.

Fig. 7 Path delay distribution required for CRISTA

A disadvantage of clock stretching is the speed degradation of the circuit, which occurs because the clock period is sometimes longer. Another disadvantage is that the area overhead needed to enable prediction of critical path activation can become high; therefore, not all circuits are suitable for such a scheme.

In Situ Error Detection and Correction: with in situ error detection, timing errors are detected by checking for late transitions at the data inputs of flip-flops. Typically, flip-flops are augmented with a latch or a second, delayed clock input in order to check for late transitions. Usually, these techniques are applied in pipelined circuits. One of the earliest works on in situ error detection is Razor [20], for which an example of a pipeline stage is shown in Fig. 8a. As can be seen, each flip-flop is augmented with a shadow latch, which is controlled by a delayed clock. A timing diagram illustrating how Razor works is shown in Fig. 8b. In the first clock cycle, logic stage L1 meets the normal timing. In the second cycle, however, logic stage L1 exceeds the intended delay. Therefore, the data (instr 2) is not captured by the main flip-flop at clock cycle 3. The shadow latch does capture this data, since it operates with a delayed clock. Because the data stored in the main flip-flop and the shadow latch differ, the error signal is raised and the preceding pipeline stages are stalled. After this, the valid data is restored in the fourth cycle. Therefore, the error is corrected with a penalty of one clock cycle delay.

Fig. 8 Pipeline augmented with Razor

Razor corrects timing errors in the circuit at the penalty of one clock cycle of delay. There are also techniques that mask the timing error, e.g., by delaying the arrival of the correct data to the next pipeline stage. The authors of [21] proposed the TIMBER flip-flop: a flip-flop with a delayed clock input that resamples the data input to detect timing errors. In case of a timing error, the output of the flip-flop is updated with the correct value, which is then propagated to the next stage of the pipeline. In this case, time is borrowed from the succeeding pipeline stage.
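The detection principle of such shadow-latch schemes can be captured in a few lines of behavioral Python; the timing numbers are invented for illustration:

```python
# Behavioral sketch of Razor-style in situ error detection: the main
# flip-flop samples at the clock edge; a shadow latch samples the same
# data a fixed delay later; a mismatch flags a timing error. The timing
# numbers are invented for illustration.
CLOCK_PERIOD_PS = 100
SHADOW_DELAY_PS = 30     # shadow latch samples this much after the edge

def razor_cycle(data_ready_ps, old_value, new_value):
    """Return (main_sample, shadow_sample, error) for one clock cycle."""
    main = new_value if data_ready_ps <= CLOCK_PERIOD_PS else old_value
    shadow = new_value if data_ready_ps <= CLOCK_PERIOD_PS + SHADOW_DELAY_PS else old_value
    return main, shadow, main != shadow

print(razor_cycle(80, 0, 1))    # (1, 1, False): data in time, no error
print(razor_cycle(120, 0, 1))   # (0, 1, True): late data caught by the shadow latch
```

Note that data arriving after even the shadow sampling point goes undetected, which is why such schemes bound the maximum allowed lateness.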

In situ error detection can be used to mitigate process variation. Timing errors that occur due to critical paths affected by process variation can be detected and corrected. Hence, fault-free operation of the circuit can be achieved without adding a lot of margins to the design.

An advantage of in situ error detection is its capability to compensate local variation in addition to global variation. This is because all flip-flops, or at least the flip-flops terminating critical paths, are augmented with in situ error detection; timing errors on critical paths that occur due to local variation are therefore detected. A disadvantage of in situ error detection is a possible decrease in throughput due to the correction. Another disadvantage is a high area overhead, since most flip-flops need to be augmented with error detection and control logic is needed to handle the errors.

2.7 Software-Based Schemes

In addition to mitigating process variation at the hardware level, it is also possible to mitigate process variation at the software level. As technology scales further, reliability becomes a more challenging design factor. This is due to, for example, increased aging effects and increased vulnerability to soft errors. Software methods are being developed to detect errors in order to be able to guarantee dependable computing. A technique that can be employed is redundant execution [22], where critical portions of the software are run redundantly on multiple cores. The outputs are then compared to see if any errors are introduced. Another method is Re-execution and Recovery [23], which provides resilience by re-executing portions of the application that have been detected as being corrupted. These software techniques can also be applied to mitigate process variation, besides mitigating aging and soft errors.
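Redundant execution with output comparison can be sketched as follows; the checksum() workload, the thread-based stand-ins for cores, and the majority vote are illustrative choices, not the exact scheme of [22]:

```python
# Sketch of software redundant execution: a critical function is run on
# several workers (threads standing in for cores) and the results are
# compared by majority vote. checksum() is a stand-in workload.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def checksum(data):
    return sum(data) % 251

def redundant_execute(fn, arg, n_copies=3):
    with ThreadPoolExecutor(max_workers=n_copies) as pool:
        results = list(pool.map(lambda _: fn(arg), range(n_copies)))
    votes = Counter(results)
    value, count = votes.most_common(1)[0]
    if count < n_copies:
        print("mismatch detected:", votes)   # would trigger recovery/re-execution
    return value

print(redundant_execute(checksum, [1, 2, 3, 4]))   # 10
```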

3 Mitigation of Transistor Aging

As device dimensions are downscaled in the relentless effort to keep pace with Moore's law, maintaining gate control and suppressing short-channel effects require the introduction of new FET architectures. The semiconductor industry has already moved to FinFET or Fully Depleted Silicon-On-Insulator (FDSOI) devices. These are expected to be superseded by nanowire devices with the gate fully wrapped around the channel. At the same time, high-mobility substrate materials, such as Ge, SiGe, and III–V compounds, are being investigated to accelerate device operation.

As smaller devices, more complex device architectures, and new materials are being introduced, the reliability margins continue to shrink. In many cases, the reliability margin assuming continuous operation at elevated temperature (Fig. 9a) may no longer be sufficient [24]. Below, the main degradation processes affecting FET devices are first discussed (Sects. 3.1–3.4), along with their overall trends with technology scaling. The root cause of gate oxide degradation (the generation and charging of interface and bulk gate oxide traps) is common to all the main degradation mechanisms. The technological means of reducing both interface and bulk traps are therefore discussed in Sects. 3.5 and 3.6.

Fig. 9 Approaches to device degradation projections. a "Conventional" projection: mean degradation is estimated from the "worst-case" constant (i.e., DC) stress at maximum V DD (workload w 1) applied for the entire duration of application lifetime. b More realistic workloads w n (with different voltages, frequencies, duty cycles, temperatures, etc.) result in better end-of-lifetime mean degradation prediction. c In reality, a distribution of device-to-device degradation needs to be considered

Since devices in realistic digital circuits typically operate with a series of high and low signals, while the supply voltage V DD changes as, e.g., the "turbo" and "sleep" modes are enabled, assuming more realistic workloads will result in a more realistic prediction of the mean degradation (Fig. 9b), thus regaining some of the projected reliability margin. In addition, a correct understanding of the effect of a degraded device on the surrounding circuit allows aging-related issues to be better mitigated already during the design phase. Examples of this are given throughout the text.

On the other hand, only a handful of defects will be present in the gate oxide of each deeply scaled device. This will cause an increase of the so-called time-dependent, or dynamic, device-to-device variability. The same workload will result in a device-to-device distribution of degradations (Fig. 9c). The time-dependent variability is discussed in Sect. 4.

3.1 Stress-Induced Leakage Current and Gate Oxide Breakdown

Generation of conducting defects in the bulk of the gate dielectric during device operation leads to an increase in gate current (leakage). This phenomenon is therefore termed Stress-Induced Leakage Current (SILC). SILC can partially offset the gate leakage reduction gained by the introduction of high-κ gate dielectrics [25]. At sufficiently high density, the newly generated defects will form a percolation path between the gate and the body of the FET device, resulting in so-called Soft Breakdown (SBD). The current through a formed SBD path is typically a strongly superlinear function of gate bias and of the order of ~μA at 1 V.

The breakdown path can further progressively wear out and when a sufficient local current is reached, a runaway defect generation at the breakdown spot will lead to a so-called Hard Breakdown (HBD). HBD current–voltage characteristic is near-ohmic, with typical values of 1–10 kΩ.

All of the above processes, often called Time-Dependent Dielectric Breakdown (TDDB), are accelerated by gate voltage, current, and temperature [26]. The continuing voltage and power reduction is therefore generally beneficial, as it increases the time to soft breakdown t SBD and thus postpones breakdown. Oxide downscaling also affects PFET t SBD more than NFET t SBD, because the gate current in PFETs is due to direct tunneling, while in NFETs it is due to Fowler–Nordheim tunneling, a leakage mechanism less sensitive to thickness variations [27]. The employment of gate metals with more midgap work functions (see also Sect. 3.2) is also beneficial in this sense [27].

In gate stacks with high-κ dielectrics, “Alternating Current” (AC) TDDB is frequency dependent, with low frequencies apparently decreasing t SBD [28, 29]. This appears to be related to bulk high-κ traps, in particular their charging and discharging during the AC stress [30].

The post-SBD progressive wear-out is controlled by the voltage across the breakdown path and the current running through it. The SBD wear-out progress will be therefore slowed down if the stress bias is supplied from a non-ideal “soft” voltage source capable of providing only limited current, such as the preceding transistor stage [31].

If SBD does occur in a FET, the FET drain current characteristic will typically be little affected (Fig. 10), because of the limited current of the SBD spot [32]. The FET width or the number of fins can also be upsized during design to compensate for the breakdown current. Sufficiently wide devices can then compensate even for HBD [33].

Fig. 10 a An SRAM cell with wide FETs can compensate even a hard breakdown. SRAM transfer characteristics after hard FET N R drain-side BD are well reproduced by a simulation assuming R GD = 3.2 kΩ. Two stable points can still be discerned in the butterfly plot. b Narrow-FET SRAM characteristics after soft source-side BD in FET N R are well reproduced by simulation assuming a nonlinear, weakly conducting path. The cell's characteristics are not strongly affected after SBD [34]

Gate oxide defect generation proceeds in parallel at different locations of all stressed FET gates; the formation of multiple SBDs at different parts of the circuit [35], or even a combination of SBDs and HBDs [36], is therefore possible. With proper device sizing, multiple SBD breakdowns will only affect power consumption. The statistics of time-to-nth breakdown have been developed [35, 37], allowing some reliability margin to be reclaimed.
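The idea of time-to-nth-breakdown statistics can be sketched with a Monte Carlo over Weibull-distributed breakdown times, taking the n-th order statistic; the Weibull parameters and population size below are invented for illustration and do not reproduce the models of [35, 37]:

```python
# Monte Carlo sketch of time-to-nth-breakdown statistics: individual
# breakdown times are drawn from a Weibull distribution, and the n-th
# order statistic gives the time until n breakdowns have occurred in a
# group of stressed devices. Parameters and population size are invented.
import random
random.seed(1)

BETA, ETA = 1.5, 10.0    # Weibull shape and scale (arbitrary time units)
N_DEVICES, NTH = 50, 3   # time until 3 of 50 devices have broken down

def time_to_nth_bd():
    times = sorted(random.weibullvariate(ETA, BETA) for _ in range(N_DEVICES))
    return times[NTH - 1]

samples = [time_to_nth_bd() for _ in range(20_000)]
mean_t_nth = sum(samples) / len(samples)

# The n-th breakdown occurs later than the first, which is the margin a
# design tolerating a few SBDs can reclaim.
print(f"mean time to breakdown #{NTH}: {mean_t_nth:.2f}")
```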

3.2 Bias Temperature Instability

Bias Temperature Instability (BTI) is caused by charging of pre-existing and generated defects in the bulk and at the interfaces of the gate dielectric [38]. It is accelerated by gate oxide electric field and temperature. The issue in pFET devices, so-called Negative BTI (NBTI), was exacerbated by the introduction of nitrogen into SiO2 gate dielectric [39, 40]. The complementary mechanism in nFET devices, Positive BTI (PBTI), became a significant issue with the introduction of high-κ gate dielectrics. As the semiconductor industry moves to FinFET and FDSOI devices, channel doping can be reduced due to better channel control, resulting in the reduction of depletion charge. As a consequence, the gate work function can be adjusted toward Si midgap and the gate oxide field can be reduced at given V DD and threshold voltage V th with respect to planar devices [27]. Further reduction in depletion charge will come from reducing the fin width below the depletion width [41]. For future technology nodes, the electric field in the oxide is expected to increase, as the oxide thickness is reduced faster than V DD to help maintain channel control. One exception is the so-called junctionless FETs [42], which operate in partial depletion or in accumulation [43]. Such devices have high flat-band voltage, resulting in low field and hence low degradation during operation [44].

AC BTI results in significantly lower degradation than the equivalent fully on “DC” BTI stress. This is because of the so-called relaxation of BTI, due to discharging of defects. At very low frequencies (<100 Hz), relaxation also naturally explains the frequency dependence of AC BTI [45,46,47]. At intermediate and high frequencies (~GHz), there is presently disagreement in the literature [48,49,50]. Part of the confusion seems to arise from experimental issues at high frequencies. When high-frequency signal integrity issues are correctly accounted for, NBTI decreases at high frequencies due to the multistate nature of the involved traps [51], while PBTI is frequency independent [52].

3.3 Hot Carrier Degradation

When a FET is biased in inversion and a bias is also applied at the drain, the channel carriers arriving at the drain will not be in equilibrium with the semiconductor lattice. The “hot” carriers at the high-energy tail of the energy distribution will then be responsible for (localized) generation of interface states (through hydrogen depassivation) and charging of the bulk states in the dielectric, either directly or through the carriers of opposite polarity generated simultaneously through impact ionization. This set of processes is termed Hot Carrier Degradation (HCD). Note that BTI degradation due to “cold” carriers can still take place at the source side. Finally, heat will be generated as energy from the hot carriers is transferred into the semiconductor lattice, resulting in the so-called Self-Heating Effect (SHE) and accelerating some of the above degradation mechanisms. The symptoms include degradation of the drain current, transconductance, and subthreshold slope, as well as threshold voltage shift.

Generally, the lateral electric field in the channel, particularly at the drain, will have a strong impact on the energy distribution function and hence on the above degradation processes. Therefore, even though the supply voltages V DD and hence the maximum drain voltages are gradually decreasing, this degradation mechanism becomes more pronounced as the gate length is reduced [53]. The gate oxide electric field also increases as the gate oxide is scaled down. Hot carrier degradation is presently flagged as the most critical reliability concern in the upcoming technology nodes.

Junction optimization to lower the electric field at the drain is therefore generally mandatory to alleviate the impact of hot carrier degradation [54, 55]. The decreased oxide electric field in junctionless FETs can also decrease HCD effects [56].

The fin width in FinFET devices is a critical parameter. Both reduction and acceleration of HCD with the fin width have been reported [27, 57,58,59,60]. The disparate results are likely due to the complex dependence of the involved mechanisms. As the fin width changes, so does the threshold voltage and the electric field profile in the fin [59], junction profiles, and the amount of heat retained in the fin due to SHE [27]. This will result in the energy distribution function varying strongly with the fin width [59]. Furthermore, the fraction of hot carriers impinging on the gate oxide will change with changing fin width as well [27].

HCD is a cumulative process, and AC HCD does not appear to be frequency dependent [52].

3.4 Self-heating Effect

When the FET device is operating at V D = V DD, considerable power I D * V DD is dissipated in the device. In planar devices, the excess heat is primarily dispersed into the silicon substrate (bulk Si thermal conductivity ~148 W K−1 m−1). The remnant heat raises the device body temperature above that of the chip. This is called the Self-Heating Effect (SHE). Although not strictly a degradation mechanism of its own, SHE can accelerate other degradation processes in the FET.

As device geometry changes from planar to multi-gate, the relative thermal contact of the device with the silicon substrate decreases. Heat has to escape into the gate through the gate oxide (bulk SiO2 thermal conductivity ~1.40 W K−1 m−1) and the source and drain contacts. This phenomenon is further amplified if Silicon-On-Insulator (SOI) technology is used (Fig. 11) [57].

Fig. 11
figure 11

Across-technology plot based on measurements (dots) and simulations (dotted lines) for bulk planar, SOI planar and bulk FinFETs showing the local temperature rise in the FET as a function of the power density [57]

New high-mobility materials presently under consideration may have lower thermal conductivity than Si (Ge bulk thermal conductivity 58 W K−1 m−1, GaAs bulk thermal conductivity 58 W K−1 m−1). Thermal conductivity also decreases at elevated temperatures and with dopant concentration (the latter is fortunately reduced in modern devices). In deeply scaled devices, the impact of material interfaces is amplified as they will scatter the heat-carrying phonons, resulting in severely reduced thermal conductance values (fractions of the bulk values) [61].
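To make the scale of SHE concrete, the local temperature rise can be estimated with a simple lumped thermal-resistance model, ΔT = P · R_th. The following Python sketch is illustrative only: the path lengths, contact area, and dissipated power are assumed values, not data from this chapter (only the bulk thermal conductivities of Si and SiO2 are taken from the text above).

```python
# Lumped thermal-resistance sketch of the Self-Heating Effect (SHE).
# Geometry and power are ASSUMED illustrative values; only the bulk
# conductivities of Si and SiO2 are taken from the chapter text.

def thermal_resistance(k, length, area):
    """1-D conduction path: R_th = L / (k * A), in K/W."""
    return length / (k * area)

def parallel(*r):
    """Combine parallel heat-escape paths."""
    return 1.0 / sum(1.0 / x for x in r)

k_si, k_sio2 = 148.0, 1.4        # W/(K*m), bulk values quoted in the text
fin_area = 10e-9 * 50e-9         # assumed 10 nm x 50 nm contact area (m^2)
r_substrate = thermal_resistance(k_si, 100e-9, fin_area)  # path into bulk Si
r_gate_ox = thermal_resistance(k_sio2, 2e-9, fin_area)    # path through oxide

power = 20e-6                    # assumed 20 uW dissipated in one fin
delta_t = power * parallel(r_substrate, r_gate_ox)
print(f"deltaT ~ {delta_t:.1f} K above the chip temperature")
```

Even with these rough numbers, the sketch shows why the thin, poorly conducting oxide path matters once the silicon path shrinks: a few tens of microwatts per fin already yields a double-digit kelvin rise.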

Temperature generally accelerates single-carrier interface state depassivation and bulk charging; however, it also reduces the mean free path of the hot carriers, thus lowering their average energy. At the same time, the tail of the energy distribution can expand with temperature, accelerating one type of interface bond depassivation mechanism [62]. The BTI degradation taking place at the source will also be accelerated. Separating these concomitant degradation mechanisms for a proper lifetime projection is therefore a considerable challenge.

SHE can generally be alleviated by improving the heat escape paths. In FinFET devices, SHE can be reduced by sufficient spacing of the fins [63, 64]. Reducing the buried oxide thickness in SOI devices is also highly beneficial [65]. Finally, assuming the actual workload already during design will result in better estimates of the dissipated heat, the actual temperature, and the defect generation and charging rates, and hence in better lifetime estimation.

3.5 Root Cause 1: Interface Traps

Dangling silicon bonds at the Si/SiO2 interface act as energy states within the Si band gap. In standard CMOS process flow, these bonds are passivated with hydrogen during chip fabrication. When the chip and the devices are biased during use in the field, especially during negative gate bias, electrically active defect states are again generated at the Si/SiO2 interface by stripping (depassivating) the bonds of hydrogen by interaction with channel holes [66]. Interface state generation is also a crucial component of HCD, especially in conventional devices with SiO2 gate oxide [55]. The bond dissociation mechanism during HCD is relatively complex, and can be triggered by a single, sufficiently energetic carrier, or through multiple vibrational excitations (MVE) of the bond by multiple, lower energy carriers [67].

In standard planar devices the Si surface has (100) orientation. In FinFET devices the Si fin sidewalls have (110) orientation, with a higher density of Si dangling bonds; their depassivation can therefore contribute more to NBTI [68]. The contribution of side-wall interface states is reduced when fins are rotated 45° around the vertical axis [58, 69], as the side-wall orientation of these devices changes from (110) back to the lower density (100).

Since the Scanning Tunneling Microscopy experiments on passivated Si surfaces [70, 71], it has been established that passivation of the Si/SiO2 interface by deuterium results in stronger bonds, less susceptible to desorption by hot electrons [72]. In general, passivating these bonds with other elements of higher atomic mass, such as fluorine, has been reported to reduce interface defect state generation [73, 74]. The higher atomic mass is presumed to change the vibrational frequencies associated with the dangling bond, enabling better coupling with phonon modes in the Si substrate and thus faster “cooling” of the bond vibrations [75].

Deuterium passivation has been shown to be beneficial for reducing interface state generation due to HCD [76] and NBTI [77], although in the latter case interface state generation may not be the main component in high-κ based dielectrics (see next section). Fluorine passivation has been reported beneficial for HCD [78] and NBTI [79, 80], although the effect seems to depend strongly on the F amount and processing conditions. (Low-voltage) SILC is also suppressed by F implantation, resulting in a lower gate current, although it does not influence the defect generation efficiency [26].

3.6 Root Cause 2: Oxide Bulk Traps

Charge trapping into pre-existing defects appears to be the main contributor to both NBTI [81] and PBTI. Ubiquitous hydrogen has been reported as the main source of hole traps in SiO2 [82, 83]. It is thought to be responsible both for multistate switching traps and as a precursor for permanent hole trapping [81].

The contribution of bulk defects increases as advanced materials, such as high-κ gate dielectrics (responsible for the rise of PBTI due to electron trapping) and Ge and III-V substrates, are introduced [84]. Significant progress in understanding the reduction of both PBTI and NBTI has been achieved with the “energy-alignment” model (Fig. 12) [85, 86]. In HfO2 high-κ gate dielectrics, PBTI can be reduced by incorporating rare-earth elements or even nitrogen, which redistribute charge around oxygen vacancies and shift the electron trap energies toward the HfO2 conduction band, thus misaligning them with the channel electrons (Fig. 12a) [85, 87,88,89]. In contrast, no equivalent “defect level shifting” mechanism has been known for NBTI (recent work claims adjustment of hole traps by dopants [40]). However, the introduction of Ge, a high-carrier-mobility semiconductor, shifts the inversion channel hole energy level upward (Fig. 12b). This again misaligns the channel holes with the defects in the dielectric, resulting in a sizable reduction in SiGe pFET NBTI degradation. Recently, shifting the trap levels in the high-κ layer has also been achieved by engineering a dipole at the interface with the SiO2 interfacial layer [90]. Figure 13 illustrates that misaligning defect levels (Scenario 2) is significantly more efficient at low (operating) gate overdrives (V ov = V dd − V th) than reducing defect density “en bloc” (Scenario 1), which also takes place as the gate oxide thickness is reduced.

Fig. 12
figure 12

A schematic illustrating the reduction of charge trapping by decoupling defect and channel energy levels a in nFETs (PBTI), by introducing “doping” elements into the high-κ dielectric layer, and b in pFETs (NBTI), by introducing low-bandgap Ge into the substrate

Fig. 13
figure 13

a Charge trapping is suppressed by reducing the dielectric defect density (i), or by carrier/defect energy decoupling (ii), with respect to the reference case (ref). b Calculated ΔV th assuming a 10× defect density reduction by process improvement (i), or the same defect density of states with the mean shifted by 0.5 eV (ii). The latter case clearly reduces BTI significantly more at low operating V ov [84]

“Passivating” the oxygen vacancies by optimized nitrogen incorporation has also been shown to reduce SILC and TDDB [91] as well as HCD [92]. Reduction of bulk high-κ defects by a higher PDA temperature, Zr incorporation, and a smoother high-κ/metal gate interface also reduces SILC [26]. Furthermore, discharging high-κ traps during stressing, e.g., with bipolar AC stress, appears to lead to SILC reduction [28, 30].

3.7 Mitigation of RTN and Time-Dependent Variability

In Sect. 3 we have discussed the origins of several aging mechanisms and possible remedies to lower their mean impact on the device. As device dimensions are aggressively reduced, all aging mechanisms become distributed. This time-dependent variability is discussed in this section.

The gate oxide thickness was the first dimension of deeply scaled FETs to reach the nm length scale. The formation of the percolation path during TDDB is a stochastic process, and the time-to-first SBD is described by the Weibull distribution with mean \( \left\langle {t_{SBD}} \right\rangle \) and shape parameter β, also known as the Weibull “slope”. The variance of the distribution is reciprocal with β (a smaller β results in a larger distribution variance) [93].

One of the signatures of the conducting path formation process is that the variance of time-to-SBD distribution strongly increases as the physical oxide thickness is scaled down [93]. This is because fewer defects are needed to bridge physically thinner oxide. The introduction of high-κ dielectrics, with its increased gate oxide physical thickness, does not automatically yield reduced variance—the Weibull shape factor β is low for laminate dielectrics with, e.g., HfO2 and ZrO2. This could mean that either the SBD formation is controlled by the very thin SiO2 interfacial layer or by extrinsic defects in the high-κ layer [94]. The latter case underlines the requirement of mastering fabrication of high-κ layers with low defect density and free from other imperfections, such as sharp fin edges (Fig. 14).
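The link between the Weibull slope β and the spread of breakdown times can be illustrated with a short Monte Carlo sketch. The characteristic time and the β values below are arbitrary illustrative choices, not measured data.

```python
# Monte Carlo sketch: spread of time-to-first-SBD vs. Weibull slope beta.
# eta (characteristic time) and the beta values are ASSUMED for illustration.
import math
import random
import statistics

def sample_t_bd(eta, beta, n, seed=0):
    """Inverse-CDF Weibull sampling: t = eta * (-ln(1 - u))**(1/beta)."""
    rng = random.Random(seed)
    return [eta * (-math.log(1.0 - rng.random())) ** (1.0 / beta)
            for _ in range(n)]

eta = 10.0                      # 63.2% failure point, arbitrary time units
spreads = {}
for beta in (3.0, 1.5, 0.8):    # thinner oxide -> smaller Weibull slope
    t = sample_t_bd(eta, beta, 20000)
    spreads[beta] = statistics.stdev(t)
    print(f"beta={beta}: mean={statistics.mean(t):6.2f}, "
          f"stdev={spreads[beta]:6.2f}")
```

The printed standard deviations grow steeply as β drops, mirroring the experimental observation that thinner oxides (fewer defects needed to bridge them) show much wider time-to-breakdown distributions.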

Fig. 14
figure 14

Electric field distribution at the fin top corner a without and b with corner rounding. c Corner rounding improves (increases) the Weibull slope β [95]

In deeply scaled devices with typical gate areas around 1–2 × 10³ nm², only 1–10 defects will be present in the gate oxide of each fin. Even at constant bias on the FET terminals, charging and discharging of individual defects will take place and result in discrete intermittent changes of the FET drain current. Such “steady-state” stochastic variations are called Random Telegraph Signal (RTS) or Random Telegraph Noise (RTN) [96]. Under certain conditions, RTN can be observed in the gate current as well [97].

If the gate is biased toward V dd, the defects will become preferentially charged, contributing to BTI. The collective contribution of the charged defects to the total threshold voltage shift ΔV th can be acceptably described by the so-called “Defect-Centric” or “Exponential-Poisson” (EP) distribution [98, 99] (Fig. 15a). The variance of the distribution is

Fig. 15
figure 15

a Tails due to RTN and due to RTN and BTI can be discerned in device-to-device distributions of ΔV th of pFET devices [100]. b The standard deviation of device-to-device ΔV th increases for smaller gate areas, as per Eqs. 1 and 2 [101]

$$ \sigma_{{\Delta V_{\text{th}} }}^{2} = 2\eta \left\langle {\Delta V_{\text{th}} } \right\rangle , $$
(1)

where η is the average threshold voltage shift per single trapped electron or hole and \( \left\langle {\Delta V_{\text{th}} } \right\rangle \) is the mean threshold voltage shift. The means of reducing the latter have been discussed in the previous section.

The technologically important parameter η scales with oxide thickness t ox, doping N A, and gate area A G as

$$ \eta \sim \frac{t_{\text{ox}} \sqrt{N_{\text{A}}}}{L W}. $$
(2)

As can be seen from Eqs. 1 and 2, reducing η reduces both RTN and BTI variability. From the form of Eq. 2 it is also apparent that flash memory type devices, with their minimum device sizes and large t ox, suffer the largest impact from individual charged defects. As in the case of as-fabricated variability, the “time-dependent” variability in logic circuit-critical devices can be reduced by increasing their gate area or fin count (Fig. 15b). Fortuitously, in logic devices η is also reduced as t ox scales down with device size to maintain channel control, and as doping is reduced in the low-doped channels of FinFET and FDSOI devices. However, other sources of variability, such as interface states, may take over as the main sources of channel variability, resulting in an η increase [102]. Since η represents the electrostatic impact of the charged traps, traps spatially deeper in the gate oxide contribute less [103]. Since only spatially deeper gate traps are accessible in FETs with a SiGe substrate, this material shows superior NBTI robustness [104].
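The defect-centric picture behind Eq. 1 can be checked numerically: if each device traps a Poisson-distributed number of charges, and each charge shifts V th by an exponentially distributed step with mean η, the device-to-device variance should come out as 2η⟨ΔV th⟩. The parameter values in this Python sketch are illustrative assumptions.

```python
# Monte Carlo check of the "Exponential-Poisson" (defect-centric)
# distribution of Eq. 1. eta and the mean defect count are ASSUMED values.
import math
import random
import statistics

def poisson(rng, lam):
    """Poisson sample via Knuth's method (adequate for small means)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def ep_sample(eta_mv, n_mean, n_devices, seed=1):
    """Each device traps a Poisson number of charges; each charge shifts
    Vth by an exponentially distributed step with mean eta_mv."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0 / eta_mv)
                for _ in range(poisson(rng, n_mean)))
            for _ in range(n_devices)]

eta = 2.0                        # mV per trapped charge (assumed)
dvth = ep_sample(eta, n_mean=4.0, n_devices=50000)
mean, var = statistics.mean(dvth), statistics.pvariance(dvth)
print(f"<dVth>={mean:.2f} mV, var={var:.2f}, 2*eta*<dVth>={2 * eta * mean:.2f}")
```

The sampled variance matches 2η⟨ΔV th⟩ to within sampling noise, which is the content of Eq. 1: the spread is set jointly by the per-charge impact η and the mean shift.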

In deeply scaled devices, HCD will also induce device-to-device variability [105], described by the EP distribution (cf. RTN and BTI variability) [106]. Additional variability may arise after HCD due to enhanced generation of interface states. Due to the contribution of different defect types, the total distribution will be multimodal (Fig. 16) [100, 106]. The high-σ tail of the full distribution is controlled by defects at the substrate (high η, cf. Fig. 16 inset).

Fig. 16
figure 16

Bimodal defect-centric distribution ΔV th corresponding to HCD stress [106]

4 Mitigation of Radiation Effects

4.1 Introduction and Trends in Radiation Effects

Radiation effects continue to be a concern both in terrestrial and aerospace applications. The term “radiation effects” refers to a broad set of effects that occur when ionizing particles interact with silicon devices. The effects are highly dependent on the types and energies of the particles and thus on the radiative environment (e.g., terrestrial vs. space). In most cases, the device is not permanently damaged and thus these effects are often referred to as “soft errors”. Some radiation effects such as gate rupture (SEGR) in power MOSFETs are destructive and thus not soft. However, in this section, the terms “radiation effects” and “soft errors” will be used interchangeably.

In terrestrial applications, the main relevant sources of radiation are fast neutrons, alpha particles produced by the decay of traces of unstable isotopes in the packaging materials and, for processes that contain the B-10 isotope of boron, thermal neutrons. In many applications, the latest process technologies (10 and 14 nm FinFETs) are being quickly adopted for cost, power, and density reasons. This shift is driven by the FinFET’s reduced leakage current, fewer short-channel effects, and increased drain saturation current, but, as will be seen in the following sections, this new technology is also significantly less sensitive to soft errors. Indeed, many recent process technologies are immune to alpha particles and have a neutron sensitivity that is an order of magnitude lower than that of their planar counterparts. In this way, advances in process technology represent perhaps the most significant process-level mitigation of radiation effects in terrestrial applications.

In space applications, due to longer qualification cycles, older bulk technologies are still used extensively. Here, the foremost requirement is to avoid single-event latchup (SEL). In addition to single-event upsets (SEUs), space applications are also concerned with the total ionizing dose (TID) accumulated over the course of the mission. TID results in a permanent shift in transistor parameters. Although the latest FinFET and FDSOI technologies are generally not yet qualified for space applications, their benefits in terms of reduced SEE sensitivity make them attractive; however, more studies are required to assess whether they are sufficiently robust against TID effects.

Currently, the move away from bulk planar transistors is perhaps the most effective mitigation against soft errors. In [107] the authors presented a concise overview of the SER benefits of different technologies including FDSOI and provide test data for nodes up to 65 nm. In Fig. 17, reproduced from [107], the relative SER benefit of multiple technologies is compared. In the following two sections, our goal is to present the latest results in SER analysis and measurement of soft errors in the FinFET and FDSOI technologies, respectively.

Fig. 17
figure 17

SER overview of multiple technologies up to 65 nm (reproduced from [107])

Another trend in terrestrial applications is that circuits in advanced technologies are increasingly used in safety-critical applications such as automotive and industrial automation. It is currently estimated that over half of the end points in the Internet of Things (IoT) will be safety critical; thus a careful understanding of radiation effects is required in order to assess their impact on reliability and safety goals. Despite the SER benefits achieved at the process technology level, there is still a need for circuit-level techniques. It is common practice to protect memories using error correcting codes (ECC), so the real challenge remains the protection of flip-flops and, to a lesser extent, combinational logic. In a subsequent section we present recent results in the design and test of hardened flip-flops for both traditional bulk technologies and FinFET and FDSOI technologies.

4.2 Impact of FinFETs on SER

The key characteristic of FinFET devices is that the conducting channel between the source and drain is a thin vertical “fin”, around which the gate wraps, as shown in Fig. 18. The fact that the gate structure wraps around the channel reduces the leakage current and short-channel effects.

Fig. 18
figure 18

Overview of FinFET device

Several studies have shown that the critical charge for FinFET devices is either similar [108] or slightly lower [109] than similar bulk devices. It has also been shown that the doping profiles of bulk and FinFET devices are relatively similar [109]. The differences in SER sensitivity are explained by the differences in charge collection because of the thin drain region and narrow connection to the substrate. The initial charge collection, which is dominated by drift, is not so different between planar and FinFET devices. However, in the FinFET, there is very little charge collection due to diffusion from the substrate [108].

One of the first studies of FinFET devices was by Intel [110]. Note that Intel refers to their FinFET devices as tri-gate devices. In this study, they report that the neutron SER of 22-nm tri-gate 6T SRAM cells is 3.5× lower than a planar 32 nm cell. The improvement in neutron SER of 22-nm tri-gate flip-flops was less, in the range of 1.5× to 4×. However, the tri-gate devices are shown to be 10× to 300× less sensitive to alpha SER. This study showed that MCU rates and the extent of MCUs are not significantly lower than in bulk devices.

In [111], Intel reports new test results for their 14-nm tri-gate devices which have taller and narrower fins and thus reduced charge collection. In this study, the neutron SER of the 14 nm devices is shown to be about one-eighth that of the 22 nm devices while the alpha SER was reduced by about 4×. In the accelerated testing, the extent of the MCUs in the 14 nm technology was similar to the 22 nm technology. Interestingly, during real-time testing, the 14 nm devices showed several MCU events with very large extent (5 and even 14 bits), which was above the expectations from the accelerated testing and modeling.

In [112] Samsung reports the SER sensitivity of SRAM cells implemented in their 14 nm FinFET process and they report a 5–10× reduction in sensitivity for fast neutrons and alpha particles, as compared to 28 nm planar devices. Interestingly, they report a much smaller change in the sensitivity to thermal neutrons. In this study, single-fin and two-fin devices are studied and the latter are slightly more sensitive which was also confirmed by TFIT simulation [113].

In [114] the authors present a heavy-ion study of flip-flops implemented in 28 nm bulk planar, 20 nm bulk planar, and 16 nm bulk FinFET processes operating at 900 mV. In general, the 20 nm devices have a cross section about 50% lower than the 28 nm bulk devices. For lower LETs, the FinFET devices showed a cross section well over an order of magnitude lower than that of the planar devices. Above a LET of 20 MeV cm²/mg, there was very little difference in sensitivity between the different devices. The drain region of the FinFET is much smaller, and lower LET particles must strike directly in this region to cause an effect. At higher LETs, however, there is still significant charge collection in the substrate, hence the smaller difference in sensitivity. In space applications, low LET particles are dominant; however, the fact that there is less difference in sensitivity at high LET may reduce the SEE benefit of FinFETs in space applications.

The authors of [114] also performed TCAD simulations, building 3-D models using data from the PDK as well as predictive technology libraries. In these simulations, the ion track was modeled as a cylinder with the charge carriers following a Gaussian distribution. One of the key findings of this work was that the radius of the ion track plays a very important role in determining the sensitivity of FinFET devices. As the radius was swept from 5 to 50 nm, the impact on the SER of bulk devices was small; for the FinFET devices, however, the ion track radius played an important role. The simulation results highlight the difficulty in accurately simulating the effect of low LET ion strikes in FinFET devices.

In [115] the authors perform an in-depth study of SBU and MCUs for planar and 16 nm FinFET SRAMs. The test results show that between 20 nm planar and 16 nm FinFETs, there is an order of magnitude reduction in SBUs caused by alpha, thermal and fast neutrons. Furthermore, there is also an order of magnitude drop in the absolute rate of MCUs. In this work, it is also shown by TCAD simulations that MCUs in FinFETs are primarily due to charge sharing and that the increased doping levels that are used in FinFETs tend to reduce charge collection and lower the rate of MCUs.

The above works have primarily studied single-event effects on FinFET devices. In [116], the authors present a detailed study of the TID effect on FinFETs, in particular its dependence on the number of fins, although the study was done on an older 90 nm technology. TID generally creates positive trapped charge in the oxides and at the silicon/oxide boundaries. In this study, the authors find that the impact of TID on leakage current is greatest for single-fin devices. The single-fin devices show the largest increase in leakage current and the largest shift of V t, compared to two- and 40-fin devices.

To summarize, it is clear that FinFET devices show a very significant reduction in SER compared to planar devices. The contribution of alpha SER is much lower than for planar devices. It is also interesting to note that Intel reports significant improvements in other reliability metrics as well, such as TDDB, BTI, HCI, and SILC [117].

4.3 Impact of FDSOI on SER

An excellent overview of the SER benefits of FDSOI technology is presented in [107]. SOI has long been known to provide strong protection against radiation effects; however, it has generally been significantly more expensive than bulk technologies and used only in specialized applications. Recently, STMicroelectronics’ 28 nm FDSOI technology, which is described in detail in [118], has brought this technology more into the mainstream. Because of the thin buried oxide (BOX), these devices have an ultra-thin body. The field between the source and drain is confined between the gate oxide and the BOX, making the transistor behavior closer to ideal. In terms of radiation effects, the sensitive volume is isolated from the substrate, making the sensitive area extremely small.

In [119] it is reported that the alpha particle SER sensitivity of ST’s 28 nm FDSOI technology is approximately 1 FIT/Mbit, which is about two orders of magnitude lower than similar 28 nm bulk technologies, although at lower voltages (0.8 V), the alpha SER does increase (4×…8×) [120]. It is reported [120] that this technology has a raw neutron SER of approximately 10 FIT/Mbit, which is about 20× lower than comparable bulk technologies. The technology also has a low sensitivity to thermal neutrons (2 FIT/Mbit) [121]. A further benefit is that the technology is immune to SEL [120], even at high temperature, which is to be expected, as the parasitic thyristor structure does not exist. Taken together, these characteristics make this technology attractive for applications which require a low sensitivity to radiation effects. Investigations are underway to potentially qualify the technology for space applications; however, this requires a better understanding of the TID effects and also an investigation to better understand the SEE benefits in harsh radiative environments.
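As a back-of-the-envelope illustration of what these per-Mbit rates mean at the chip level, the FIT figures quoted above can be summed over an on-chip memory capacity. The memory size in this sketch is a hypothetical example, not a figure from the cited studies.

```python
# Chip-level raw SER budget from the per-Mbit FIT rates quoted in the
# text for 28 nm FDSOI. The on-chip memory size is an ASSUMED example.

FIT_PER_MBIT = {             # 1 FIT = 1 failure per 1e9 device-hours
    "alpha": 1.0,            # [119]
    "fast_neutron": 10.0,    # [120]
    "thermal_neutron": 2.0,  # [121]
}

def chip_ser_fit(mbit):
    """Total raw SER (FIT) of an unprotected memory of `mbit` Mbit."""
    return mbit * sum(FIT_PER_MBIT.values())

mbit = 128                               # assumed on-chip SRAM capacity
total_fit = chip_ser_fit(mbit)
mtbf_years = 1e9 / total_fit / (24 * 365)
print(f"{total_fit:.0f} FIT -> mean time between upsets ~ {mtbf_years:.0f} years")
```

Even before ECC, a hypothetical 128 Mbit of this SRAM would see a raw upset only every few decades on average, which is why such technologies are attractive for high-reliability terrestrial applications.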

4.4 SOI FinFETs

IBM is developing processes to build advanced FinFET transistors in an SOI process. In [122], detailed simulation results of the sensitivity of these devices are presented, and as might be expected, they show extremely low radiation sensitivity. In this paper, it is predicted that the PDSOI FinFET SRAM cells will be two orders of magnitude less sensitive than planar PDSOI cells, which already have a very low sensitivity. The critical charge of these cells is expected to be approximately 4 fC, nearly an order of magnitude higher than that of the 22 nm planar devices.

Although SOI FinFET devices have extremely low SEE sensitivity, preliminary studies [123] show that they are sensitive to TID. The study in [123] analyzed the effect of TID on 14 nm SOI FinFETs, 14 nm bulk FinFETs, and 22 nm UTBB FETs with two different box thicknesses. Interestingly, the impact of TID was quite different across these devices. It was found that for the SOI FinFETs, a V t shift of 14 mV was observed after 100 krad and these transistors were most sensitive to TID in the off state. For the bulk FinFETs, there was very little shift in V t; however, the off-state current increased dramatically. The UTBB FETs showed a significant V t shift with dose, with a sensitivity greater than the bulk FinFETs.

At this point, it is clear that both FDSOI and FinFET devices bring huge benefits in terms of SEE sensitivity. The TID analysis of these technologies in small geometries is still underway, but it does appear that they are quite sensitive which may be an obstacle for their adoption for space applications. For terrestrial applications, however, they provide a massive benefit due to their extremely low rate of soft errors.

4.5 Hardened Cells

For many terrestrial applications, such as networking or general-purpose computing, the large soft-error benefit provided by advanced process technologies is such that it may not be necessary to use hardened flip-flops to meet reliability targets. On RAMs, the use of ECC remains good practice, as ECC has a relatively low cost and can correct errors from any source, whether radiation effects, RTN, aging, or other faults. Furthermore, in today’s SoCs, RAMs represent the majority of the die area, and thus this simple technique can provide a high overall level of protection.

For high-reliability applications, such as automotive, even when advanced process technologies with low soft-error sensitivity are adopted, there is still a need for hardened flip-flops to protect the most functionally critical state in the logic. This is partly due to the fact that the number of flip-flops per chip increases with scaling and it is typical to have SoCs with tens of millions of flip-flops. This is also the result of new safety standards, such as ISO26262, which require a systematic analysis of the effects of faults.

The most widely used techniques for hardening flip-flops include DICE [124], LEAP [125], increased nodal capacitance, Quatro [126, 127], reinforcing charge collection (RCC) [128], device stacking [129, 130], guard gates [131], variants on DICE [132], or TMR designs.

The classic DICE flip-flop is illustrated in Fig. 19 and, as is well known, provides immunity against upsets to a single node. In older technologies, the DICE design could provide a reduction of up to 1000× in SER sensitivity. However, recent studies have shown [133] that even at 28 nm the benefit of DICE is limited. In advanced technologies, a single particle can deposit charge on multiple nodes and, because of this charge sharing, the layout must be carefully optimized using techniques such as LEAP [125] in order to achieve the benefit of DICE. With careful layout, it is still possible to design hardened flip-flops that achieve a two-order-of-magnitude reduction in soft-error sensitivity; however, the benefits are smaller for high-LET particles.

Fig. 19
figure 19

Schematic of DICE flip-flop

Particles that strike the device at normal incidence are much less likely to deposit charge on multiple nodes, whereas particles that strike at an angle often upset multiple nodes. When evaluating the sensitivity of hardened flip-flops, especially for space applications, it is important to analyze the effect of angular strikes on the design. In Fig. 20, the simulated effect of a heavy-ion strike is shown at normal incidence and at a tilt of 60°. The colors represent the sensitive cross section, and as can be clearly seen, the design is significantly more vulnerable to angular strikes.
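Part of the angular sensitivity can be understood with the common effective-LET approximation: an ion at tilt angle θ traverses a factor 1/cos(θ) more of a thin sensitive volume and deposits proportionally more charge. This path-length model is a standard first-order tool (it does not capture the multi-node charge collection that dominates in hardened cells, which is why simulation as in Fig. 20 remains necessary):

```python
import math

def effective_let(let, tilt_deg):
    """Effective LET under the thin-sensitive-volume approximation:
    the ion path length, and hence the deposited charge, grows as
    1/cos(theta) with tilt angle theta."""
    return let / math.cos(math.radians(tilt_deg))

print(effective_let(10.0, 0))   # normal incidence: unchanged
print(effective_let(10.0, 60))  # 60 degree tilt roughly doubles deposited charge
```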

Fig. 20
figure 20

Simulated heavy-ion strike on DICE FF using TFIT [113]

A recent test chip in a 32 nm FDSOI technology was implemented by the authors of [129]. The chip consisted of six different flip-flop designs. Two of the designs (NAND and transmission gate—TG) were unhardened. Two other designs were based on the DICE technique, one of which implemented guard gates [131]. Finally, an alternate implementation of the unhardened flip-flops was implemented using stacked transistors. The layouts of the six designs are reproduced from [129] in Fig. 21. The large area overhead for hardened flip-flops is clearly visible (Fig. 22).

Fig. 21
figure 21

Layout of six flip-flops in 32 nm FDSOI (reproduced from [129])

Fig. 22
figure 22

Heavy ion test results of six 32 nm FDSOI flip-flops (reproduced from [129])

The test results showed that the DICE designs were not completely immune to alpha particles, although their sensitivity was reduced by over two orders of magnitude. When tested under heavy ions, the DICE designs showed increasing sensitivity with angular strikes, as was observed in simulation results shown earlier (for a different technology). Overall, in this study, the stacked transistor design performed better than the DICE designs, especially for particles arriving at a high tilt.

Of course, all hardened flip-flops incur penalties in area, power, and timing. In [134], the authors present a broad study of 30 different industrial flip-flops, including 11 hardened designs, implemented in a 28 nm bulk process. They report an average area overhead of 3.8×, an average power overhead of 2.5×, and an average timing overhead (CLK → Q) of 1.2×. Given the high cost of hardened flip-flops, it is important to carefully select the most functionally critical flip-flops in the design, which requires dedicated analysis techniques [135].
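The tradeoff behind selective hardening can be sketched with a back-of-the-envelope cost model. The sketch below uses the average overheads reported in the study above (area 3.8×, power 2.5×); the hardening fraction, the SER share carried by the hardened flops, and the 100× SER reduction are illustrative assumptions, not data from the study:

```python
# Rough cost model for selective flip-flop hardening.
# area_ratio and power_ratio follow the 28 nm study cited above;
# the other numbers are illustrative assumptions.

def selective_hardening(frac_hardened, ser_share_hardened,
                        area_ratio=3.8, power_ratio=2.5,
                        ser_reduction=100.0):
    """Relative flip-flop area, power, and residual SER when a fraction
    of the flops (carrying a given share of the unprotected SER) is
    replaced with hardened cells."""
    area = 1 + frac_hardened * (area_ratio - 1)
    power = 1 + frac_hardened * (power_ratio - 1)
    residual_ser = (1 - ser_share_hardened) + ser_share_hardened / ser_reduction
    return area, power, residual_ser

# Harden the 10% most critical flops, assumed to carry 60% of the failures:
area, power, ser = selective_hardening(0.10, 0.60)
print(f"area x{area:.2f}, power x{power:.2f}, SER x{ser:.3f}")
# -> area x1.28, power x1.15, SER x0.406
```

Even under these optimistic assumptions, hardening a well-chosen 10% of the flops cuts the flip-flop SER by more than half at a modest area cost, which is why criticality analysis [135] matters.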

At this point, the reader will appreciate that a large number of techniques are available for designing flip-flops with reduced SER. It is beyond the scope of this book to provide a comprehensive review of all of them; however, the reader can find more information in the referenced works. The “best” cell design for a given application depends on many factors, including the acceptable area and power penalties, the radiative environment, and the required level of protection. Simulation and testing are essential when designing and validating radiation-hardened flip-flops.

In an unprotected logic design (excluding RAMs), the largest overall contribution to soft errors comes from flip-flops. Using hardened cells, the contribution of flip-flops to the overall error rate can be managed, and careful selection of which flip-flops to harden keeps the area penalties reasonable. Although the focus is often on the storage cell itself, as shown in [136], the design of the clock tree also plays an important role in reducing the rate of upsets in flip-flops. Finally, once the flip-flops have been protected, the relative contribution of combinational logic gates increases and designers must pay attention to SETs.

5 Mitigation of Voltage Droop

Traditionally, voltage droop has been an important reliability concern in the power delivery subsystem of chips and has been mitigated by off-chip schemes on the board itself. However, with technology scaling to nanoscale dimensions, the increase in transistor density per die, and rising chip frequencies, off-chip techniques are no longer sufficient on their own, and advanced mitigation techniques have also emerged inside the chip. This section covers the main techniques to either avoid or mitigate voltage droop in modern electronic chips.

5.1 Classification

Mitigation schemes for voltage droop in modern integrated circuits can be classified into two categories:

  • Off-Chip techniques: These methods aim at improving the supply voltage network impedance and reducing the voltage variation in the board power delivery subsystem. They are generally used to avoid low- and medium-frequency voltage droops.

  • On-Chip techniques: These techniques are applied inside the chip to reduce supply voltage droops within the die and mitigate their effects. They have gained significant importance due to the growing number of variation sources and the complexity of modern chips. Note that on-chip methods are generally applied against high-frequency voltage droops.

Figure 23 outlines the main off-chip (on the board) and on-chip (inside the die itself) voltage droop mitigation techniques. Next, these schemes are discussed, with a focus on the on-chip voltage droop compensation approaches, as they are more efficient in terms of power and performance in modern chips.

Fig. 23
figure 23

Overview of existing voltage droop mitigation techniques

5.2 Off-Chip Techniques

The most important factor, in terms of voltage droop, for a chip on a board is the voltage at its pads. If no current flowed through the power delivery network interconnects, the voltage at the chip pads would be constant. Figure 24 shows an example of a microprocessor power delivery subsystem and the specific components involved in delivering power to the processor [137].

Fig. 24
figure 24

Microprocessor power delivery subsystem

Any improvement to the power delivery components on the board can reduce voltage droops. For instance, enhanced voltage regulator modules can better mitigate low-frequency voltage droops [137]. The off-chip techniques that can be used to reduce voltage droops are discussed next.

Decoupling Capacitors: Adding decoupling capacitors (d-caps in Fig. 24) can reduce the power supply impedance and make the load less sensitive to the inductance at the power pads. Off-chip decoupling capacitors are very effective at avoiding on-board voltage droops at mid-range frequencies. Moreover, the effectiveness of decoupling capacitors at high frequency is greatly increased when the inductance in the power delivery path is minimized [138].
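Two quantities sketch how board-level decaps are sized: the target impedance of the power delivery network (the droop budget divided by the transient current), and the resonance frequency that a decap forms with the series inductance of the delivery path, above which it stops being effective. The component values below are illustrative assumptions, not design data:

```python
import math

# First-order board-level decoupling numbers; all values illustrative.

def target_impedance(v_dd, ripple_frac, i_transient):
    """Maximum PDN impedance that keeps the droop within the ripple
    budget for a given current transient: Z = Vdd * ripple / I."""
    return v_dd * ripple_frac / i_transient

def resonance_hz(l_series, c_decap):
    """Series-resonance frequency of the decap with the delivery-path
    inductance: f = 1 / (2*pi*sqrt(L*C)). The decap is only effective
    up to roughly this frequency."""
    return 1.0 / (2 * math.pi * math.sqrt(l_series * c_decap))

z = target_impedance(v_dd=1.0, ripple_frac=0.05, i_transient=10.0)
f = resonance_hz(l_series=1e-9, c_decap=100e-6)  # 1 nH path, 100 uF decap
print(f"Z_target = {z * 1000:.1f} mOhm, f_res ~ {f / 1e3:.0f} kHz")
```

The result (a few milliohms of target impedance, effectiveness only up to the hundreds-of-kHz range for this L-C pair) illustrates why board decaps handle mid-range frequencies while high-frequency droops must be addressed on-chip.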

Voltage Guardbanding: This approach comes in two forms. In the first, also called static voltage margining, the voltage regulator module sets a board voltage higher than the nominal one. In the second, the regulator raises the board voltage during periods in which the processor has low activity, so that a sudden voltage droop is less severe. Note that voltage regulators typically have slow response times and cannot compensate high-frequency V dd droops [139]. On-board voltage guardbanding imposes additional power loss, particularly at low chip loads.

Better IC Packaging: As the quality of the chip packaging improves, the parasitic effects in the chip interconnects are reduced, which in turn reduces voltage droops. Moreover, placing the vias close together minimizes the inductance [139]. These improvements significantly reduce the likelihood of voltage droops.

5.3 On-Chip Techniques

Concern over on-chip power supply droop has grown as CMOS feature sizes shrink and operating frequencies increase. This has led to novel on-chip mitigation techniques, which can be classified into layout-, circuit-, and architecture-based solutions, described in the following sections.

Layout-Based Solutions: Without careful layout planning, a design may suffer from power supply noise and supply voltage droops [140]. By considering the power supply plan at an early design stage, the locations and shapes of circuit blocks can be adjusted flexibly to minimize droop. For instance, accounting for the distance between the power pads and nodes with potentially high switching activity, and keeping these pads close together, helps avoid voltage droops. Likewise, routing the power lines as close as possible to the chip blocks, by using multiple supply voltage and ground pins in the chip floor plan, helps reduce dynamic variations inside the chip. Figure 25 shows an example of an advanced on-chip power supply distribution using the IBM floor-planning standard (C4); it depicts how the power and signal pads can be distributed to reduce potential voltage droops.

Fig. 25
figure 25

Chip floor-planning showing power distribution patterns from IBM

5.4 Circuit-Based Solutions

As technology scaling has reached the nanoscale era, circuit-based techniques have gained significant importance; they are the techniques that can substantially reduce the impact of voltage droop at high frequencies. Circuit-based techniques fall into two groups:

  • Static (pre-silicon) techniques: Static circuit-based solutions are generally designed for the chip’s worst-case operating conditions; they can therefore be pessimistic and inefficient in terms of performance and energy consumption. Moreover, they require accurate modeling of the power delivery network, which can be quite complex in modern chips.

  • Dynamic (post-silicon) techniques: Dynamic circuit-based voltage droop mitigation techniques observe the chip’s runtime operating conditions and apply an appropriate mitigation action (reducing the frequency or increasing the supply voltage). They adapt to on-chip supply voltage variations and compensate their effects to ensure robust operation.

    In the following, the static techniques, namely on-chip decoupling capacitors and frequency or voltage guardbanding, are described first. Thereafter, the dynamic approach of adaptive clocking is discussed.

On-Chip Decoupling Capacitors: Given modern scaling trends, on-chip decoupling capacitors must be added inside the die to suppress droops and reduce supply noise [141]. They work by providing charge to circuits upon a sudden current demand [142]. Figure 26 shows an on-die distributed grid model of the parasitics inside the chip, including the decoupling capacitors [137, 138]; C spc represents the decoupling capacitance placed between the functional units and C blk represents the intrinsic parasitic capacitance of the functional units.

Fig. 26
figure 26

On-die distributed grid model containing additional in-die decoupling capacitors

Although on-chip decoupling capacitors can absorb abrupt changes in the power demand of chip blocks, they come at an increasing cost in area and leakage as chip sizes shrink. Moreover, these capacitors have imperfections that can introduce additional voltage resonances [143].
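The first-order sizing of an on-die decap follows directly from its charge-supplying role: during a current step too fast for the regulator and board decaps, the on-die capacitance alone supplies the charge, so the droop is dV = I·dt/C. The numbers in this sketch are illustrative assumptions:

```python
# First-order on-die decoupling capacitance sizing; all numbers
# below are illustrative, not from a real design.

def droop_v(i_step, duration, c_decap):
    """Voltage droop when the decap alone supplies a current step:
    dV = I * dt / C."""
    return i_step * duration / c_decap

def decap_needed(i_step, duration, dv_max):
    """Capacitance required to keep the droop within dv_max."""
    return i_step * duration / dv_max

c = decap_needed(i_step=10.0, duration=1e-9, dv_max=0.05)
print(f"{c * 1e9:.0f} nF of on-die decap for a 10 A, 1 ns step at 50 mV droop")
# -> 200 nF ...
```

Hundreds of nanofarads of on-die capacitance is a substantial silicon area, which is exactly the area and leakage cost noted above and part of the motivation for the dynamic techniques that follow.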

Frequency ( F CLK )/Supply Voltage ( V dd ) Guardbanding: To ensure reliable operation of microprocessors in the presence of voltage droops, designs are conventionally built with guardbands on the operating clock frequency (F CLK) or supply voltage (V dd) [144]. This inflexible approach limits the high-performance mode of the microprocessor, since the operating frequency is set for the worst case of supply voltage variation [145]. Furthermore, the inability to reduce V dd under favorable operating conditions decreases the energy efficiency of the chip. In modern microprocessors these F CLK and V dd margins can be even larger than in previous designs, making dynamic approaches, which can significantly reduce the guardbands, a necessity.
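The energy cost of a static V dd guardband follows from the quadratic voltage dependence of dynamic power (P ∝ C·V²·f): raising the supply by a margin m increases dynamic power by (1+m)²−1 at fixed frequency. The 10% margin below is an illustrative assumption:

```python
# Cost of a static Vdd guardband under the standard CV^2f dynamic
# power model; the margin value is illustrative.

def dynamic_power_overhead(guardband_frac):
    """Relative increase in dynamic power from raising Vdd by
    guardband_frac above nominal, at fixed clock frequency."""
    return (1 + guardband_frac) ** 2 - 1

print(f"{dynamic_power_overhead(0.10):.1%}")
# a 10% Vdd margin costs about 21% extra dynamic power, all the time
```

This always-on penalty, paid even when no droop occurs, is why adaptive techniques that pay the cost only during a droop are so attractive.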

Adaptive Clocking: This dynamic technique is the most important circuit-based approach for mitigating voltage droops in modern microprocessors and has been used in various industrial products, such as AMD, Intel, and ARM microprocessors [146,147,148]. It adjusts the clock period in response to voltage variations, so that the clock runs at a lower frequency until the supply voltage returns to its nominal value [149]. Adaptive clocking techniques fall into two major classes:

  • Traditional on-die monitor-based schemes: This technique relies on sensors to detect the droop and then adapts the frequency accordingly to mitigate the droop.

  • Modern adaptive clock distribution-based schemes: This approach utilizes an in situ monitoring approach to reduce the delay between the droop detection and the frequency adaptation.

Both techniques are discussed in more detail in the following.

Traditional on-die monitors: The conventional dynamic approach uses on-die sensors to measure specific parameters such as voltage, current, or temperature [140, 141]. These monitors are interfaced with adaptive control circuits that react to variations by adjusting operating parameters such as F CLK or V dd. For instance, the chip frequency is adapted during a droop such that no processing error occurs. Figure 27 shows an example framework for this approach, in which droops inside the circuit blocks are detected by a detector, which then triggers the adaptation circuits to mitigate the impact of the voltage droop.

Fig. 27
figure 27

Feedback loop in sensor-based V dd droop mitigation technique

Conventional on-die sensors and adaptive circuits need time to respond to parameter variations, and in the presence of high-frequency V dd droops they cannot react quickly enough. Therefore, some F CLK or V dd guardbanding is still necessary to guarantee reliable chip operation, imposing performance and energy overheads.

Modern adaptive clock distribution: The second adaptive approach to mitigating (high-frequency) supply voltage droops is based on an all-digital, dynamically adaptive clock distribution [137]. This technique extends the clock-data delay compensation in critical paths during a V dd droop by exploiting a tunable-length delay placed before the global clock distribution. The adaptive clock distribution design contains three major circuit blocks: (1) an on-die Dynamic Variation Monitor (DVM), (2) a Tunable-Length Delay (TLD), and (3) a clock gating circuit. Figure 28 shows a block diagram of an Intel test chip fabricated using this technique, including the corresponding monitoring and adaptation blocks [137].

Fig. 28
figure 28

Block diagram of the dynamically adaptive clock distribution technique

The impact of dynamic parameter variations on critical-path timing margin is measured by the DVMs. Once a voltage droop is detected by a DVM, the TLD, located between the clock generator and the global clock distribution, proactively gates the clock for the duration of the V dd droop. The TLD extends the delay and changes the delay sensitivity of the clock distribution to V dd, thereby mitigating the impact of the droop. An alternative to clock gating is to halve F CLK with a clock divider circuit.
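The principle behind adaptive clocking can be shown with a minimal simulation: when the monitor sees the supply dip below its trip point, the clock period is stretched (here doubled, as with the clock divider alternative) until V dd recovers. The delay model, in which the critical-path delay grows inversely with V dd, and all of the numbers are illustrative assumptions:

```python
# Minimal adaptive-clocking simulation; the delay model and all
# numbers are illustrative.

NOMINAL_V = 1.0
NOMINAL_PERIOD = 1.0          # normalized clock period
PATH_DELAY_AT_NOMINAL = 0.9   # critical path uses 90% of the period
DROOP_THRESHOLD = 0.97        # monitor trip point

def path_delay(vdd):
    """Critical-path delay, assumed inversely proportional to Vdd."""
    return PATH_DELAY_AT_NOMINAL * NOMINAL_V / vdd

def run(vdd_trace, adaptive):
    """Count timing violations over a supply-voltage trace, with or
    without droop-triggered clock-period stretching."""
    violations = 0
    for vdd in vdd_trace:
        period = NOMINAL_PERIOD
        if adaptive and vdd < DROOP_THRESHOLD:
            period *= 2   # divide F_CLK in half during the droop
        if path_delay(vdd) > period:
            violations += 1
    return violations

trace = [1.0, 1.0, 0.88, 0.85, 0.9, 1.0]  # a transient droop event
print(run(trace, adaptive=False))  # -> 2 (timing failures during the droop)
print(run(trace, adaptive=True))   # -> 0
```

The adaptive run trades a few slow cycles for zero failures, whereas the fixed-frequency design would need a permanent guardband to survive the same trace.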

In comparison with the other existing techniques, an adaptive clock distribution offers significant advantages in performance and energy efficiency by reducing the guardbands needed for potential V dd droops. Its main disadvantage, however, is the need for post-silicon calibration [145].

5.5 Architectural-Based Solutions

Architectural methods to mitigate voltage droop in processors are generally known as resilient error detection and recovery approaches. They rely on two main concepts. The first reduces the activity of the processor to avoid droops, by throttling instruction issue. The second allows droops to occur inside the chip and relies on a built-in mechanism in the processor to recover its state and correct the resulting errors [142].

As an example, [143] uses a resilient microarchitecture that detects timing violations induced by dynamic variations. It then prevents the error from corrupting the architectural state and corrects it through instruction replay. The error correction may span multiple cycles to keep timing errors from corrupting the architectural state of the processor.
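The detect-and-replay concept can be sketched with a toy model: results are committed to architectural state only after an error check, and a detected timing error discards the speculative result and re-executes the same instruction. The tiny accumulator machine, its instruction set, and the error model below are invented for illustration and are not the microarchitecture of [143]:

```python
# Toy detect-and-replay sketch; machine and error model are invented
# for illustration only.

def execute(program, error_cycles):
    """Run a tiny accumulator machine. Cycles listed in error_cycles
    suffer a detected timing error: the speculative result is
    discarded and the instruction is replayed."""
    acc = 0        # committed architectural state
    pc = 0
    cycle = 0
    while pc < len(program):
        op, val = program[pc]
        result = acc + val if op == "add" else val  # speculative result
        if cycle in error_cycles:
            cycle += 1   # error detected: discard result...
            continue     # ...and replay (pc unchanged)
        acc = result     # check passed: commit to architectural state
        pc += 1
        cycle += 1
    return acc

prog = [("load", 5), ("add", 3), ("add", 2)]
print(execute(prog, error_cycles=set()))  # -> 10
# A droop-induced error on cycle 1 is replayed; the final state is unchanged:
print(execute(prog, error_cycles={1}))    # -> 10
```

The error costs an extra cycle but never reaches committed state, which is the essence of trading guardband for occasional replay latency.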

The key advantage of resilient error detection and recovery approaches is their ability to reduce guardbands for both fast- and slow-changing variations. Their main disadvantages are the added design complexity and the need for post-silicon calibration.

5.6 Summary

This section has covered the main techniques to either avoid or mitigate supply voltage droops in modern microprocessors. Off-chip techniques have traditionally been used to reduce supply voltage noise and deliver a clean voltage to the chip pads. However, with the increase in chip design complexity, frequency, and the number of transistors per die, the use of on-chip mitigation techniques has become inevitable. Among the on-chip approaches, adaptive clocking is the most significant and efficient method to mitigate the effects of voltage droops inside the chip and has been adopted in many modern microprocessors.

6 Conclusion

In this chapter, we have provided an overview of how some of the major challenges to IC reliability can be mitigated. In advanced processes, variability is becoming a key challenge and the chapter opened with a discussion of techniques to manage the impact of static and dynamic variability.

The problem of variability is compounded by aging effects and the evolution of transistor parameters over the lifetime of the device. The second section of the chapter discussed the challenges of transistor aging in depth, including how effects such as BTI, HCI, RTN, and self-heating can be managed at the process level.

Advanced technologies such as FinFETs and FDSOI have a reduced sensitivity to radiation effects; however, these effects remain a real concern. They were discussed here, including how technology scaling is impacting the design of radiation-hardened cells. Finally, the chapter wrapped up with a discussion of how the high power required by large SoCs can induce significant static and dynamic voltage drops, causing errors when the voltage at the transistors falls too low. Advanced techniques to manage both on- and off-chip voltage drop were discussed.

Taken together, it is clear that new process technologies are posing significant reliability challenges. This chapter has focussed on mitigation techniques at the process level and subsequent chapters will discuss mitigation techniques at higher levels in the design flow.