
Embedded systems are making their way into more and more devices, from hand-held gadgets to household appliances and from mobile devices to cars. This growth is expected to continue, with the market projected to experience a three-fold rise in demand from 2013 to 2018 [20]. The growth has been enabled by continued technological advances in device miniaturization, feature richness, design cost control and performance improvement, originally described by Moore's Law [32].

Since a vast majority of today's embedded systems are battery powered, a prime design objective is to minimize power consumption. Minimizing power consumption extends the battery operating lifetime of a system with a given energy budget. Although technology scaling has enabled the fabrication of devices that are faster and more power-efficient than their predecessors, owing to smaller geometries and device capacitances [3], increased computational demand and packing density have diminished these gains at the system and application level. In fact, the overall power consumption of a system-on-chip (SoC) is now increasing beyond the maximum power density budget available at chip level [12]. This has necessitated efficient, low-power design techniques, which have been studied extensively by researchers in industry and academia [4, 15, 36].

A major challenge for modern SoC design is the increasing number of hardware faults, such as those caused by imperfect lithographic patterning during manufacturing and by electromagnetic interference induced during the operational lifetime [5, 12]. These faults manifest themselves as logic upsets at circuit level and can affect signal transfers and stored values, leading to incorrect execution in embedded systems. According to the ITRS, 1 out of every 100 chips will experience a fault per day during its operational lifetime, while the manufacturing defect rate will reach approximately 1,000 defects/m² in the next few years [20]. Under these circumstances, reliable design and testing of current and future generations of SoCs is of critical importance, in particular for high-availability and safety-critical systems [24]. However, designing energy-efficient reliable systems is highly challenging due to the conflicting design trade-offs between power consumption and reliability objectives [36]. This is because reliable design and testing techniques generally introduce redundant hardware or software resources that increase the overall power consumption of the system [15].

The rest of this chapter is organized as follows. Section 1.1 gives a brief introduction to energy-efficient design. Section 1.2 outlines the necessary background on faults and reliability, highlighting the challenges of energy-efficient reliable design.

1.1 Energy-Efficient Design

The total power dissipated in a CMOS circuit is formed of two major components: dynamic (\(P_{\mathit{dyn}}\)) and static (\(P_{\mathit{stat}}\)) power dissipation, i.e.

$$\displaystyle{ P_{\mathit{total}} = P_{\mathit{dyn}} + P_{\mathit{stat}}. }$$
(1.1)

Dynamic power is mainly caused by circuit activity. The main contribution to dynamic power dissipation in (1.1) comes from the capacitive switching current that charges and discharges the circuit's capacitive loads during logic transitions, given by

$$\displaystyle{ P_{\mathit{dyn}} =\alpha \ C_{L}\ V _{\mathit{dd}}^{2}\ f, }$$
(1.2)

where \(C_{L}\) is the average load capacitance per cycle (generally constant for a given circuit), \(V_{\mathit{dd}}\) is the supply voltage, f is the circuit operating frequency and α is the switching activity factor (i.e. the ratio of switching activity over a given time). Another contributor to dynamic power is the short-circuit current that flows due to a direct current (DC) path between the supply and ground while an output transition is taking place. Compared to capacitive switching power, short-circuit power is small and is often ignored in the total dynamic power dissipation. Static power, on the other hand, is incurred even without any circuit activity. Its main contribution comes from sub-threshold gate leakage currents, which arise from the inversion charges present at gate voltages below the normal circuit threshold voltage. A simple model of static (leakage) power (\(P_{\mathit{leak}}\)) is given by

$$\displaystyle{ P_{\mathit{leak}} = V _{\mathit{dd}}NkI_{\mathit{leak}}, }$$
(1.3)

where N is the number of transistors, \(I_{\mathit{leak}}\) is the leakage current, which depends on technology parameters such as the threshold voltage, and k is a circuit constant that depends on the number of transistors operating at a given time. As can be seen from (1.3), leakage power, like dynamic power, depends on the supply voltage (\(V_{\mathit{dd}}\)) and on the number of transistors (N), which keeps increasing with technology scaling. Figure 1.1 shows the normalized power dissipation trends with continued technology scaling over a span of 30 years (from 1990 to 2020) [34]. As can be seen, at earlier technology nodes dynamic power dissipation dominated the total power consumption of CMOS circuits. From (1.2) it is evident that the most effective means of lowering power dissipation (both dynamic and static) at these nodes is to reduce the supply voltage (\(V_{\mathit{dd}}\)). However, lowering \(V_{\mathit{dd}}\) increases the circuit propagation delay, which in turn requires the operating clock frequency to be lowered to tolerate the increased delay [39]. Dynamic voltage and frequency scaling (DVFS) is an effective power minimization technique that reduces dynamic power by lowering \(V_{\mathit{dd}}\) and f at runtime [11, 13]. The main working principle of DVFS is to lower \(V_{\mathit{dd}}\) and f during slack times (i.e. idle time arising from the early completion of the previous computational task or the late start of the next one) [42]. Over the last decade, power minimization using DVFS-enabled SoCs has been extensively investigated, considering its effects on system performance [4, 15].

Fig. 1.1 Trends of power dissipation with technology scaling
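
To make the power model concrete, the short C sketch below evaluates (1.1)-(1.3) at a nominal operating point and at a lower DVFS operating point used during slack. All parameter values (activity factor, load capacitance, transistor count, leakage current and the voltage/frequency pairs) are illustrative assumptions, not figures from this chapter.

```c
/*
 * Minimal sketch of the power model in Eqs. (1.1)-(1.3), illustrating why
 * lowering Vdd and f under DVFS reduces power. All values are illustrative.
 */
#include <stdio.h>

/* Dynamic power, Eq. (1.2): P_dyn = alpha * C_L * Vdd^2 * f */
static double p_dyn(double alpha, double c_load, double vdd, double freq)
{
    return alpha * c_load * vdd * vdd * freq;
}

/* Leakage power, Eq. (1.3): P_leak = Vdd * N * k * I_leak */
static double p_leak(double vdd, double n_transistors, double k, double i_leak)
{
    return vdd * n_transistors * k * i_leak;
}

int main(void)
{
    const double alpha  = 0.2;      /* switching activity factor            */
    const double c_load = 1e-9;     /* average load capacitance (F)         */
    const double n = 1e8, k = 0.1;  /* transistor count, circuit constant   */
    const double i_leak = 1e-10;    /* per-transistor leakage current (A)   */

    /* Nominal operating point vs. a DVFS-scaled point used during slack. */
    const double v_hi = 1.2, f_hi = 1.0e9;   /* 1.2 V, 1 GHz   */
    const double v_lo = 0.9, f_lo = 0.6e9;   /* 0.9 V, 600 MHz */

    double p_hi = p_dyn(alpha, c_load, v_hi, f_hi) + p_leak(v_hi, n, k, i_leak);
    double p_lo = p_dyn(alpha, c_load, v_lo, f_lo) + p_leak(v_lo, n, k, i_leak);

    printf("P_total at (1.2 V, 1.0 GHz): %.3f W\n", p_hi);
    printf("P_total at (0.9 V, 0.6 GHz): %.3f W\n", p_lo);
    return 0;
}
```

Because \(V_{\mathit{dd}}\) enters (1.2) quadratically, the scaled operating point cuts dynamic power by roughly two thirds in this example, at the cost of the longer execution time implied by the lower frequency.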

However, with continued technology scaling, static power dissipation is emerging as a major concern for system designers (as shown in Fig. 1.1). Dynamic power management (DPM) is an effective technique for reducing static power dissipation. The main strategy employed in DPM is to control when and for how long system components are powered. For example, the power supply of components within an MPSoC can be shut down when they are idle and switched back on when they become operational (known as power gating), or the clock can be gated off when certain parts of a circuit are not active. However, shutting down the supply power or clock of these components introduces a delay in returning the circuit to its fully operational mode. Hence, DPM techniques need to take this delay carefully into account to achieve power minimization without compromising system performance [7]. Today's MPSoCs often include both DVFS and DPM techniques to minimize dynamic and leakage power consumption.
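
The delay and energy cost of waking a power-gated component leads to a simple break-even rule that a DPM policy can apply at runtime: gate only if the expected idle interval is long enough to recoup the transition energy, and only if the wakeup latency fits into the available slack. The C sketch below illustrates this rule with assumed, illustrative component parameters; it is not a model of any particular DPM implementation.

```c
/*
 * Minimal sketch of a DPM break-even check: power-gate an idle component only
 * if the energy saved over the idle interval exceeds the energy spent entering
 * and leaving the sleep state, and the wakeup latency still meets the slack.
 * All parameter values are illustrative assumptions.
 */
#include <stdbool.h>
#include <stdio.h>

struct component {
    double p_idle;       /* power when idle but still powered (W)       */
    double p_sleep;      /* residual power when power-gated (W)         */
    double e_transition; /* energy to enter + leave the sleep state (J) */
    double t_wakeup;     /* latency to return to operational mode (s)   */
};

/* Decide whether to power-gate for an expected idle interval t_idle (s),
 * given the slack (s) available before the next task must start.         */
static bool should_power_gate(const struct component *c,
                              double t_idle, double slack)
{
    /* Break-even time: idle interval beyond which gating saves energy. */
    double t_breakeven = c->e_transition / (c->p_idle - c->p_sleep);
    return t_idle > t_breakeven && c->t_wakeup <= slack;
}

int main(void)
{
    struct component cpu = { .p_idle = 0.50, .p_sleep = 0.01,
                             .e_transition = 0.02, .t_wakeup = 0.001 };

    printf("gate for 10 ms idle:  %d\n", should_power_gate(&cpu, 0.010, 0.002));
    printf("gate for 100 ms idle: %d\n", should_power_gate(&cpu, 0.100, 0.002));
    return 0;
}
```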

1.2 Faults and Reliable Design

An emerging challenge in today's electronic system design is the reliability of the system when it is subjected to different errors or faults [23, 30]. In fact, due to technology scaling and aggressive voltage scaling, the number of such faults occurring in a circuit is increasing exponentially. These faults manifest themselves as temporary logic upsets, such as single-event upsets (SEUs), and can affect signal transfers and stored values, leading to incorrect or undesired outcomes in circuits and systems. Several academic and industrial studies, such as [18, 31, 35, 43, 48], have investigated the presence and increase of these faults, highlighting the impact of operating environments.

Faults in electronic systems can be classified into two types: permanent and transient. Permanent faults are related to irreversible physical defects in the circuit. Major causes of permanent faults include improper lithographic processing or systematic aberrations during manufacturing or post-manufacturing burn-in, as well as wearout caused by electromigration and negative bias temperature instability (NBTI) during the operational lifetime. Since permanent faults cause persistent failures in the circuit, faulty chips are discarded or repaired after post-manufacturing tests. Transient faults, also known as soft errors, can appear during the operation of a circuit. Unlike permanent faults, transient faults do not represent a physical defect in the circuit. Major causes of transient faults include cross-talk, power supply noise and neutron or alpha particle radiation during the operational lifetime. Radiation-induced faults are generally considered the major source of transient faults; they occur during the operational lifetime when a single ionizing radiation event produces a burst of electron-hole pairs in a transistor large enough to cause the circuit to change state.

To evaluate the rate and effect of fault occurrence, different parameters have been reported to date. Major parameters are briefly defined below:

Fault Density :

is the measure of the number of faults found in a device per unit of data. For memories, this is generally expressed as the number of faults per megabyte or gigabyte of data. This parameter is used for permanent faults or defects only [9].

Failures in time (FIT) :

is the rate at which faults occur per unit of time in an electronic component. It is generally denoted as λ and expressed as the number of failures per billion (\(10^{9}\)) operating hours.

Mean time-to-failure (MTTF) :

is an estimate of the mean time expected until the first fault occurs in a component of an electronic system. MTTF is a statistical value, meant to be the mean over a long period of time and a large number of units. It is usually expressed in millions of hours. For constant failure rate systems, MTTF is the inverse of the FIT rate (i.e. \(\mathit{MTTF} = \frac{1} {\lambda }\)) [44].

Mean time-to-repair (MTTR) :

is the time elapsed between the occurrence of a fault and the completion of its repair, i.e. until the system returns to its operational mode.

Mean time-between-failures (MTBF) :

is the time elapsed before a component in an electronic system experiences another fault. Unlike MTTF, the time captured by MTBF includes the time required to recover from a fault, i.e. \(\mathit{MTBF} = \mathit{MTTF} + \mathit{MTTR}\).

Soft error rate (SER) :

is the rate at which soft errors occur per unit of time and per unit of data. It is generally used to describe the severity of an operating environment and is expressed as the number of soft errors per bit per cycle or per chip per cycle.

Reliability :

is the ability of a system to perform a required function under given conditions for a given time interval. Hence, for a constant failure rate (FIT) λ and time interval t, reliability can be expressed as \(R = {e}^{-\lambda t}\).

Availability :

is the ability of a system to be in a state to perform a required function at a given instant of time, or at any instant within a given time interval, assuming that any required external resources are provided. In other words, it can be expressed as the ratio of the up time of a system to its total operating time (up time plus down time). Since faults can cause failures and down time, availability is degraded by faults. A short numerical sketch relating these metrics is given after this list.
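
The C sketch below ties the definitions above together for a constant failure rate, computing MTTF, MTBF, reliability over one year and steady-state availability (taken here as MTTF/MTBF, consistent with the up-time ratio above) from an assumed FIT rate and repair time. The numbers are illustrative only; link with -lm on most toolchains.

```c
/*
 * Minimal numerical sketch relating the reliability metrics defined above
 * for a constant failure rate: MTTF = 1/lambda, MTBF = MTTF + MTTR,
 * R(t) = exp(-lambda * t), availability = MTTF / MTBF.
 * The FIT rate and repair time below are assumed example values.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Assumed failure rate: 100 FIT = 100 failures per 1e9 operating hours. */
    double lambda = 100.0 / 1e9;   /* failures per hour             */
    double mttr   = 4.0;           /* assumed repair time in hours  */

    double mttf  = 1.0 / lambda;                 /* hours                     */
    double mtbf  = mttf + mttr;                  /* hours                     */
    double r_1yr = exp(-lambda * 24.0 * 365.0);  /* reliability over one year */
    double avail = mttf / mtbf;                  /* steady-state availability */

    printf("MTTF         = %.3e hours\n", mttf);
    printf("MTBF         = %.3e hours\n", mtbf);
    printf("R(1 year)    = %.6f\n", r_1yr);
    printf("Availability = %.9f\n", avail);
    return 0;
}
```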

Appropriate fault modeling is crucial for system designers as it describes how a physical fault in the underlying device affects circuit-level behavior. Depending on the nature and impact of the faults, fault models can be classified into the following major types:

Stuck-at Fault Model :

The most common fault model is the single stuck-at model. In this model a defect behaves as if the given circuit line were permanently connected to ground (stuck-at-0) or to the power supply (stuck-at-1), and only a single fault is assumed to be present in the circuit at any time. The stuck-at model has several advantages, including simplicity, logical behaviour, tractability and measurability [2]. The single stuck-at fault model remains the most commonly used, having survived multiple technology generations and numerous process shrinks. A minimal fault-simulation sketch based on this model is given after this list.

Bridging Fault :

An extension of the stuck-at model is the bridging fault model, which models a short between two lines. This model is appealing because shorts are generally considered the most common defect in CMOS circuits. Inductive fault analysis has shown that the most commonly occurring type of fault resulting from fabrication defects, modeled as dust particles of various sizes on photo-masks, is the bridging fault [17]. Much of the earlier work on bridging faults assumed that either a wired-AND or a wired-OR resulted when two nodes were bridged [1, 29]. Randomly placed bridging faults are complicated and cannot be captured by a single fault model. As technology advances to smaller geometries, more metal layers, reduced design margins and higher frequencies, the effect of these defects will grow in complexity, increasing the variety of bridging behaviour that must be detected.

Stuck-Open Fault :

While shorts remain the most common type of defect in most CMOS processes, open faults are also a cause for concern. As the number of wiring levels in circuits increases, the number of vias proliferates. The effects of missing, partial and resistive vias on circuit operation are not yet well understood [2]. In some cases, the circuit still functions correctly but at a lower speed. The best known open-fault model is the stuck-open fault model [2]. In this model, the gate of a given transistor is fixed at the open or 'off' value. As a result, the transistor cannot pull the cell output to its intended voltage. The length of time the output remains at its previous value depends on the time required to discharge the load capacitance. A stuck-open fault requires a two-pattern test [22]: the first pattern sets up the fault by pulling the circuit output to a known state, and the second pattern activates the fault.

Delay Fault :

Failures that cause logic circuits to malfunction at the desired clock rate, or otherwise to miss timing specifications, are modeled by delay faults. A change in the timing behaviour of the circuit causes incorrect operation only at certain frequencies. The two broad classes of delay faults are gate delay faults and path delay faults. The gate delay fault models defects local to individual circuit elements, whereas the path delay fault models distributed failures caused by statistical process variations and imprecise modeling [37, 41]. Because delay refers to differences in behaviour over time, delay fault models focus on transitions in logic values rather than on static logic values. A test for a delay fault consists of an initialization pattern, which sets the stage for a transition, and a propagation pattern, which creates the desired transition and propagates it to observable points.
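
To illustrate how the single stuck-at model is used, the C sketch below simulates a tiny combinational circuit, y = (a AND b) OR c, in its fault-free form and with stuck-at-0/stuck-at-1 faults injected on the internal AND node, reporting which input patterns detect each fault. The circuit and fault site are arbitrary examples chosen for illustration, not taken from the cited works.

```c
/*
 * Minimal sketch of single stuck-at fault simulation on a tiny circuit:
 * y = (a AND b) OR c. An input pattern detects a fault if the faulty
 * circuit's output differs from the fault-free output for that pattern.
 */
#include <stdio.h>

enum fault { FAULT_FREE, NODE_AB_SA0, NODE_AB_SA1 }; /* faults on node a&b */

static int eval(int a, int b, int c, enum fault f)
{
    int ab = a & b;                 /* internal node                       */
    if (f == NODE_AB_SA0) ab = 0;   /* stuck-at-0: node tied to ground     */
    if (f == NODE_AB_SA1) ab = 1;   /* stuck-at-1: node tied to the supply */
    return ab | c;
}

int main(void)
{
    /* Exhaustively check which of the 8 input patterns detect each fault. */
    for (int pattern = 0; pattern < 8; pattern++) {
        int a = (pattern >> 2) & 1, b = (pattern >> 1) & 1, c = pattern & 1;
        int good = eval(a, b, c, FAULT_FREE);
        int sa0  = eval(a, b, c, NODE_AB_SA0);
        int sa1  = eval(a, b, c, NODE_AB_SA1);
        printf("a=%d b=%d c=%d  good=%d  detects sa0:%s  sa1:%s\n",
               a, b, c, good,
               good != sa0 ? "yes" : "no ",
               good != sa1 ? "yes" : "no ");
    }
    return 0;
}
```

Running the sketch shows, for instance, that the pattern a=1, b=1, c=0 is the only one that detects the stuck-at-0 fault on the AND node, which is exactly the kind of information a test-pattern generator extracts from the fault model.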

Figure 1.2 shows the trends of device failure rates over their lifetimes, highlighting the impact of technology scaling [34]. As can be seen, due to manufacturing silicon or metal defects, many devices fail early and are discarded during burn-in before shipment. The good circuits delivered to customers also experience failures, due mainly to transient faults. Note that failure rates during the useful operating lifetime are more or less constant, as failures during this period are dominated by environmentally induced faults. However, as devices wear out due to effects such as electromigration, the number of failures starts to increase. With technology scaling, the normalized device failure rates are increasing, and the useful operating lifetime is shortening as scaling substantially accelerates device wearout. These issues pose serious challenges to system designers seeking to ensure reliability in the presence of different faults.

Fig. 1.2 Trends of device failure rates with technology scaling

To mitigate the effect of faults, different techniques have been employed over the years [33, 42]. These techniques are briefly outlined in the following.

Hardware Redundancy :

such as [28], is an effective technique which employs extra hardware and votes among multiple outputs to produce a single output, thereby mitigating the effect of transient or permanent faults. Due to the use of extra hardware resources, this technique incurs area and power overheads while achieving the desired reliability. The trade-off between reliability and system overheads in hardware redundancy techniques has been extensively studied [28]. An example implementation is Maxwell's SCS750 supercomputer used in space applications, which employs triple modular redundancy (TMR) for its IBM PowerPC processors [27]. Due to such redundancy, the SCS750 achieves fault-tolerance at the cost of more than 180% overall hardware overhead. A minimal majority-voter sketch illustrating this approach is given after this list.

Time Redundancy :

techniques require minimal hardware resources to detect transient faults. Upon detection of a fault, task replication or re-execution is carried out during idle/slack times. Due to the software-dependent nature of time redundancy techniques, the hardware overhead is generally much lower than for hardware redundancy techniques, but this depends greatly on the availability of idle or slack time during computation. For example, in [6] it is demonstrated that fault-tolerance can be improved without any impact on execution time by utilizing idle processors to duplicate some of the computations of the active processors. However, several studies have reported that these techniques incur overheads in terms of computation and communication performance [25]. Application check-pointing is another effective time redundancy technique, in which fault-tolerance is achieved by selectively repeating (or rolling back) an execution interval of the application. However, this fault-tolerance is achieved at the cost of high application design complexity. Examples of effective application check-pointing techniques highlighting this increased cost are the adaptive, non-uniform online check-pointing proposed in [47] and the offline check-pointing shown in [19].

Information Redundancy :

is another effective technique for fault detection and correction, particularly for memory devices. With this technique, error detection and correction (EDAC) codes are integrated with the original logic data [45]. These extra information codes are generated from the original data to identify the presence of one or more transient or permanent faults and, where possible, correct them. Intel's dual-core Xeon 7100 system, for example, includes several EDAC features for fault detection and correction [14]. Combinations of time and information redundancy techniques, such as [16], can also be highly effective in achieving the desired fault-tolerance at low cost.

System-level Techniques :

are highly effective as they can combine various redundancy techniques across hardware and software to achieve effective fault detection and tolerance [8]. A number of system-level techniques have been proposed in the past few years, combining different fault-tolerance mechanisms to extract the maximum benefit in terms of fault-tolerance and low power consumption. For example, pre-emptive online scheduling of failed tasks [46] has been demonstrated as an effective system-level fault-tolerance technique; however, its effectiveness depends on the predictability of slack times. Fault-tolerance-based optimization of cost-constrained distributed real-time systems has been proposed in [21], where fault-tolerance is achieved through system-level mapping and the assignment of fault-tolerance policies to processing cores. Another approach to fault-tolerant design, using process re-execution and scheduling on the available hardware resources of a multiprocessor SoC (MPSoC) application, has been proposed in [38]. Highlighting the impact of faults at the application level, other researchers [10, 26] have shown low-cost fault-tolerance techniques. The basic principle of these works is that faults manifested at the architectural level do not always lead to errors at the application level; by exploiting the relaxed requirements of application-level correctness, low-cost and energy-efficient fault-tolerance techniques have been proposed.
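
As a simple illustration of the hardware redundancy entry above, the C sketch below models the bitwise 2-out-of-3 majority voting used in TMR: three replicas compute the same word, and the voter masks a fault in any single replica. This is an illustrative model only, not the voter used in the SCS750 or any other cited system.

```c
/*
 * Minimal sketch of bitwise majority voting as used in triple modular
 * redundancy (TMR). A transient upset in any single replica is masked
 * because each output bit follows at least two of the three replicas.
 */
#include <stdint.h>
#include <stdio.h>

/* Bitwise 2-out-of-3 majority vote over three replica outputs. */
static uint32_t tmr_vote(uint32_t r0, uint32_t r1, uint32_t r2)
{
    return (r0 & r1) | (r1 & r2) | (r0 & r2);
}

int main(void)
{
    uint32_t correct = 0xCAFEBABE;

    /* Replica 1 suffers a transient bit flip (e.g. an SEU) in bit 3. */
    uint32_t r0 = correct;
    uint32_t r1 = correct ^ (1u << 3);
    uint32_t r2 = correct;

    uint32_t voted = tmr_vote(r0, r1, r2);
    printf("voted = 0x%08X (%s)\n", (unsigned)voted,
           voted == correct ? "fault masked" : "fault escaped");
    return 0;
}
```

The voter itself is simple combinational logic, which is why TMR's cost lies almost entirely in the tripled hardware and power rather than in the voting step.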

Indeed, energy-efficient fault-tolerant system design is highly challenging [40]. The following chapters present the state of the art as well as emerging issues in energy-efficient fault-tolerant system design. Case studies are also provided, where necessary, to demonstrate effective means of addressing these design challenges.