
This chapter addresses the challenges and the opportunities to perform computation with nearly-minimum energy consumption through the adoption of logic circuits operating at near-threshold voltages. Simple models are provided to gain an insight into the fundamental design tradeoffs. A wide set of design techniques is presented to preserve the nearly-minimum energy feature in spite of the fundamental challenges in terms of performance, leakage and variations. Emphasis is given on debunking the incorrect assumptions that stem from traditional low-power common wisdom at above-threshold voltages.

In this analysis, the main emphasis is given on the energy consumption, as performance requirements in IoT nodes are easily achievable with near-threshold circuits in most cases, as discussed in Chap. 1 and in the following. Sustained higher levels of performance can always be achieved through architectural techniques (see Chap. 3), whereas occasional performance boosts can be obtained through circuit techniques (see below).

4.1 Preliminary Considerations on Near-Threshold Operation

4.1.1 Transistor Current vs. Supply Voltage and Transregional Model

Voltage scaling is well known to be a very effective knob to reduce the energy per computation at the cost of degraded performance (Burd et al. 2015). The performance degradation at supply voltages V DD lower than the nominal voltage is determined by the reduction in the transistor on-current I on , which in turn depends on the operating region (i.e., the voltage range). The transregional EKV model can be conveniently used to express such dependence in all regions (Enz and Vittoz 2006):

$$ {I}_{on}={I}_0\cdot I C={I}_0\cdot {\left[ \ln \left({e}^v+1\right)\right]}^2 $$
(4.1)

where IC is the inversion coefficient (i.e., normalized current), I 0 is the specific current \( 2\cdot n\cdot \mu \cdot {C}_{OX}\frac{W}{L}{\left( kT/ q\right)}^2 \), and v is the normalized gate overdrive \( v=\left({V}_{DD}-{V}_{TH}\right)/\left[2\cdot n\cdot \left( kT/ q\right)\right] \). In the above equations, n is the transistor subthreshold factor, μ is the carrier mobility, C OX is the MOS capacitance per unit area, W/L is the aspect ratio, V TH is the transistor threshold voltage, and kT/q is the thermal voltage.

In the EKV model in (4.1), a transistor operates in weak inversion when \( I C<0.1 \) (i.e., for \( v<-1 \)), which from (4.1) corresponds to voltages below \( {V}_{TH}-50\ \mathrm{mV} \) for typical n (~1.3–1.5) and operating temperatures (Sansen 2006). On the other hand, a transistor operates in strong inversion for \( I C>10 \) (i.e., for \( v>3.1 \)), and hence for voltages above \( {V}_{TH}+200\ \mathrm{mV} \) (Sansen 2006). Near-threshold operation occurs for intermediate voltages, as summarized in Fig. 4.1.
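The region boundaries above can be checked numerically. The following Python sketch evaluates the inversion coefficient in (4.1) and classifies the operating region accordingly. The function names, the 0.45-V threshold voltage, n = 1.4 and T = 300 K are illustrative assumptions, not values of a specific technology.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
Q_E = 1.602176634e-19    # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return K_B * temp_k / Q_E


def inversion_coefficient(vdd, vth, n=1.4, temp_k=300.0):
    """Inversion coefficient IC = [ln(e^v + 1)]^2 from (4.1)."""
    v = (vdd - vth) / (2.0 * n * thermal_voltage(temp_k))
    return math.log1p(math.exp(v)) ** 2


def operating_region(ic):
    """Classify the region with the IC < 0.1 and IC > 10 boundaries."""
    if ic < 0.1:
        return "sub-threshold"
    if ic > 10.0:
        return "above-threshold"
    return "near-threshold"


VTH = 0.45  # hypothetical threshold voltage, V
for vdd in (0.3, 0.5, 1.0):
    ic = inversion_coefficient(vdd, VTH)
    print(f"VDD = {vdd:.2f} V  IC = {ic:8.3f}  {operating_region(ic)}")
```

With these assumed values, the three supply voltages fall in the weak-inversion, intermediate and strong-inversion ranges, respectively.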

Fig. 4.1 Qualitative trend of I on transistor current (log scale) versus the gate overdrive \( {V}_{DD}-{V}_{TH} \)

The above traditional EKV model is very useful for quick estimates, but it oversimplifies the I–V characteristics at voltages above V TH . Indeed, Eq. (4.1) leads to \( {I}_{on}\approx {I}_0\cdot {v}^2 \) in strong inversion, and its quadratic trend is far from the linear trend that is observed in actual nanometer CMOS technologies (footnote 1).

Introducing voltage-dependent coefficients in (4.1) solves the issue, but leads to impractically complicated expressions for pencil-and-paper evaluations. To retain its simplicity while employing constant coefficients, (4.1) is here modified according to

$$ {I}_{on}={I}_0\cdot \ln \left({e}^{\frac{V_{DD}-{V}_{TH}}{n\cdot \left( kT/ q\right)}}+1\right) $$
(4.2)

which is plotted in Fig. 4.2 along with the actual I–V characteristics for 28-nm NMOS and PMOS transistors. The model is within 10% (20%) of circuit simulations on average (in the worst case), hence it is well suited for quick estimates and design purposes.
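The different strong-inversion behavior of (4.1) and (4.2) can be made concrete with a short Python sketch. It doubles the gate overdrive well above threshold and reports how much the normalized current I on /I 0 grows in each model. The threshold voltage of 0.4 V and n = 1.4 are hypothetical values chosen only for illustration.

```python
import math

KT_Q = 0.02585  # thermal voltage kT/q at about 300 K, V


def ion_ekv(vdd, vth, n=1.4):
    """I_on / I_0 from the original EKV expression (4.1)."""
    v = (vdd - vth) / (2.0 * n * KT_Q)
    return math.log1p(math.exp(v)) ** 2


def ion_modified(vdd, vth, n=1.4):
    """I_on / I_0 from the modified expression (4.2)."""
    x = (vdd - vth) / (n * KT_Q)
    return math.log1p(math.exp(x))


VTH = 0.4  # hypothetical threshold voltage, V
# Double the gate overdrive from 0.3 V to 0.6 V, well above threshold.
ratio_ekv = ion_ekv(1.0, VTH) / ion_ekv(0.7, VTH)
ratio_modified = ion_modified(1.0, VTH) / ion_modified(0.7, VTH)
print(f"(4.1): I_on grows by {ratio_ekv:.2f}X")
print(f"(4.2): I_on grows by {ratio_modified:.2f}X")
```

Under these assumptions, (4.1) yields a nearly fourfold increase (quadratic trend), whereas (4.2) yields a nearly twofold increase (linear trend).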

Fig. 4.2 Plot of I on transistor current (log scale) in (4.2) versus the magnitude of the gate-source voltage V GS (in CMOS logic gates, \( {V}_{GS}={V}_{DD} \))

4.1.2 Transistor Current and Gate Delay in Different Regions

By the definition summarized in Fig. 4.1, sub-threshold voltages correspond to transistor operation in weak inversion, above-threshold voltages are associated with strong inversion, and near-threshold voltages correspond to intermediate voltages between \( {V}_{TH}-50\ \mathrm{mV} \) and \( {V}_{TH}+200\ \mathrm{mV} \). For typical standard threshold voltages (footnote 2), near-threshold voltages are in the range of 400–600 mV, approximately.

At above-threshold voltages such that \( {e}^{\frac{V_{DD}-{V}_{TH}}{n\cdot \left( kT/ q\right)}}\gg 1 \), Eq. (4.2) is approximately a linear function of \( {V}_{DD}-{V}_{TH} \) as expected

$$ {I}_{above-threshold}\approx \frac{I_0}{n\cdot \left( kT/ q\right)}\cdot \left({V}_{DD}-{V}_{TH}\right) $$
(4.3)

whereas at sub-threshold voltages it can be approximated as (footnote 3)

$$ {I}_{sub-threshold}\approx {I}_0\cdot {e}^{\frac{V_{DD}-{V}_{TH}}{n\cdot \left( kT/ q\right)}} $$
(4.4)

which exponentially decreases when lowering the voltage. At near-threshold voltages, Eq. (4.2) can be approximated as

$$ {I}_{near-threshold}\approx \frac{I_0}{2}\cdot \left[1.5+{\left(\frac{V_{DD}-{V}_{TH}}{n\cdot \left( kT/ q\right)}\right)}^{1.35}\right] $$
(4.5)

which is within 15% of the exact I–V characteristics in Fig. 4.2. From (4.5), the near-threshold I–V characteristic follows a power law, and is steeper than in the above-threshold region.
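As a quick numerical check, the sketch below compares the three regional approximations (4.3)–(4.5) against the model in (4.2), taken here as the reference. Currents are normalized to I 0 and expressed as a function of the normalized overdrive x = (V DD − V TH )/(n·kT/q). Since (4.5) contains a fractional power, this sketch evaluates it only for x ≥ 0, which is an assumption of the example rather than a statement of the text. All function names are illustrative.

```python
import math


def ion_exact(x):
    """I_on / I_0 from (4.2); x = (V_DD - V_TH) / (n kT/q)."""
    return math.log1p(math.exp(x))


def ion_above(x):
    """Above-threshold approximation (4.3): linear in the overdrive."""
    return x


def ion_sub(x):
    """Sub-threshold approximation (4.4): exponential in the overdrive."""
    return math.exp(x)


def ion_near(x):
    """Near-threshold approximation (4.5), evaluated here only for x >= 0."""
    if x < 0:
        raise ValueError("fractional power needs a non-negative overdrive")
    return 0.5 * (1.5 + x ** 1.35)


def rel_error(approx, x):
    """Relative error of an approximation with respect to (4.2)."""
    return approx(x) / ion_exact(x) - 1.0


for x in (0.0, 1.0, 3.0, 5.5):
    print(f"x = {x:4.1f}  near-threshold error = {rel_error(ion_near, x):+.1%}")
print(f"x = -4.0  sub-threshold error   = {rel_error(ion_sub, -4.0):+.1%}")
print(f"x =  8.0  above-threshold error = {rel_error(ion_above, 8.0):+.1%}")
```

For n·kT/q of about 36 mV, the range 0 ≤ x ≤ 5.5 corresponds to overdrives between 0 and approximately 200 mV.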

Let us now consider a CMOS logic gate driving a capacitive load C, which includes the capacitive parasitics of the gate itself. As usual (Weste and Harris 2011), its propagation delay τ PD can be expressed as \( \left( C/{I}_{on}\right)\cdot \left({V}_{DD}/2\right) \):

$$ {\tau}_{PD}=\frac{C}{I_{on}}\cdot \frac{V_{DD}}{2}\approx \frac{C}{2\cdot {I}_0}\cdot \frac{V_{DD}}{ \ln \left({e}^{\frac{V_{DD}-{V}_{TH}}{n\cdot \left( kT/ q\right)}}+1\right)}. $$
(4.6)

As shown in Fig. 4.3, from (4.3) and (4.6), voltage downscaling leads to an approximately linear delay (i.e., performance) degradation, when operating above threshold. As discussed in Sect. 4.3, the energy is typically dominated by the dynamic contribution, hence a quadratic energy saving is observed above threshold. On the other hand, an exponential increase in the gate delay is observed in the sub-threshold region. Also, due to the heavier leakage contribution at low voltages (Sect. 4.3), the energy reaches a minimum energy point (MEP), and it tends to increase again when further lowering V DD . Hence, near-threshold voltages are an ideal compromise between energy and performance in energy-centric VLSI designs. Indeed, the near-threshold gate delay is still reasonably small, and energy is close to its minimum value across all voltages. This motivates this chapter, and the adoption of near-threshold circuits for VLSI processing in the IoT domain.
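The delay trend across the three regions can be reproduced from (4.6) with a few lines of Python. The sketch below reports the delay at several supply voltages relative to the delay at nominal voltage. The threshold voltage (0.45 V, taken as constant), the nominal voltage (1 V) and n = 1.4 are hypothetical, so the printed slowdown factors are only indicative.

```python
import math

KT_Q = 0.02585  # thermal voltage kT/q at about 300 K, V


def gate_delay_norm(vdd, vth, n=1.4):
    """Gate delay from (4.6), normalized to C / (2 I_0)."""
    x = (vdd - vth) / (n * KT_Q)
    return vdd / math.log1p(math.exp(x))


VTH = 0.45      # hypothetical threshold voltage, V (taken as constant)
VDD_NOM = 1.0   # hypothetical nominal supply voltage, V

for vdd in (1.0, 0.8, 0.6, 0.5, 0.4, 0.3):
    slowdown = gate_delay_norm(vdd, VTH) / gate_delay_norm(VDD_NOM, VTH)
    print(f"VDD = {vdd:.1f} V  delay = {slowdown:7.1f}X the nominal delay")
```

The slowdown is gradual above threshold and becomes exponential once V DD drops below V TH , in line with the qualitative discussion above.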

4.2 Near-Threshold Transistor and Circuit Properties

In this section, properties of transistors at near-threshold voltages are discussed to provide general circuit design guidelines. Preliminary considerations on voltage scaling and on the threshold voltage dependence on sizing are provided in Sects. 4.2.1 and 4.2.2, respectively. The impact of transistor stacking and PMOS/NMOS imbalance is discussed in Sect. 4.2.3 to guide the topology selection during circuit design. As a second and equally fundamental aspect of circuit design, transistor strength adjustment is discussed in Sect. 4.2.4.

4.2.1 Impact of Aggressive Voltage Scaling on Transistor Current and Delay

The considerations on the delay degradation under voltage scaling in the previous section were based on the assumption that the gate load C is independent of V DD . Observe that the load C comprises wire parasitics and transistor gate capacitances. The above assumption certainly holds in wire-dominated loads (as wire parasitics are voltage-independent), whereas it is somewhat pessimistic in gate-dominated loads. Indeed, as shown by Fig. 4.4, the transistor gate capacitance tends to moderately decrease at voltages close to or below V TH , and hence makes the delay degradation more graceful than discussed above, although to a minor extent.

According to the above observation, the above qualitative considerations on the delay at near- and sub-threshold voltages fully apply to any practical design. As an example, Fig. 4.5 shows the trend of the fan-out-of-4 delay FO4 (i.e., the delay of an inverter gate driving four equal inverters). This metric is widely used at process level to characterize the speed of the technology, at circuit level to abstract the circuit design from the process details, and at architectural level since the clock cycle normalized to FO4 is typically a constant that is defined by the architecture (Harris). In short, FO4 characterizes the system performance versus voltage for a given architecture. From Fig. 4.5, operation in the middle of the near-threshold region degrades the performance by approximately a factor of 10, compared to operation at nominal voltage. This is generally true regardless of the adopted technology (Dreslinski et al. 2010).

A very distinctive property of near-threshold operation is the stronger delay sensitivity to a given absolute change in the gate overdrive (i.e., in either V DD or V TH ), compared to above-threshold designs. This is partially explained by the steeper I–V characteristics (4.5) compared to (4.3) (the exponent of the overdrive is respectively 1.35 and 1). However, the main reason is the very large relative change in the normalized overdrive \( v=\left({V}_{DD}-{V}_{TH}\right)/\left[2\cdot n\cdot \left( kT/ q\right)\right] \) for a given change in V DD , as V DD is much closer to V TH compared to above-threshold voltages. The relative I on improvement due to a 100-mV supply voltage increase (i.e., boosting) for a 28-nm technology is shown in Table 4.1. As expected, the impact of voltage boosting at near-threshold voltages is substantially larger than above threshold, with improvements in I on in the range of 2–4X. This unique feature provides significant speed adjustment capability with a very limited amount of boosting, and should be thoroughly exploited in near-threshold designs.

Table 4.1 Ion Improvement due to supply voltage boosting by 100 mV

The above considerations equally apply to the threshold voltage, as I on is a direct function of the gate overdrive \( {V}_{DD}-{V}_{TH} \). For example, the I on and speed sensitivity to a 100-mV V DD shift in Table 4.1 hold for the same change in V TH (although with negative sign). In other words, increasing V TH by 100 mV (i.e., the typical difference between a low and regular V TH ) at near-threshold supply voltages leads to a 2-4X reduction in speed. This is shown in Fig. 4.6, which plots the inverter delay ratio under regular-V TH (RVT) and low-V TH (LVT). At nominal voltage, the different V TH has a moderate impact on the performance, whereas such difference is much more pronounced at near-threshold voltages.

In summary, the high sensitivity of performance to V DD and V TH makes them very powerful knobs at near-threshold voltages, although the (same) sensitivity to their variations poses a challenge at the same time, as will be discussed in the following sections.
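The sensitivity to a 100-mV change in the gate overdrive can be estimated with the model in (4.2), as in the following sketch. Since (4.2) depends only on \( {V}_{DD}-{V}_{TH} \), the same function covers both a V DD boost and an equal V TH reduction. The threshold voltage and n are hypothetical, and the printed factors are model estimates that are not meant to reproduce Table 4.1.

```python
import math

KT_Q = 0.02585  # thermal voltage kT/q at about 300 K, V


def ion_norm(vdd, vth, n=1.4):
    """I_on / I_0 from (4.2)."""
    return math.log1p(math.exp((vdd - vth) / (n * KT_Q)))


def boost_gain(vdd, vth, delta=0.1):
    """I_on ratio after raising V_DD by delta (or lowering V_TH by delta)."""
    return ion_norm(vdd + delta, vth) / ion_norm(vdd, vth)


VTH = 0.45  # hypothetical threshold voltage, V
for vdd in (0.5, 0.6, 0.9):
    gain = boost_gain(vdd, VTH)
    print(f"VDD = {vdd:.1f} V  100-mV overdrive increase -> I_on x {gain:.2f}")
```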

4.2.2 Impact of DIBL and Sizing on Threshold Voltage

In the previous subsection, V TH was implicitly considered constant. In view of the large sensitivity of I on to V TH , the dependence of V TH on transistor voltages and sizing needs to be explicitly considered at near-threshold voltages.

Regarding the dependence on the transistor voltages, V TH tends to be quite sensitive to the drain-source voltage due to the Drain Induced Barrier Lowering (DIBL) effect (Tsividis 1999). Due to the DIBL effect, V TH increases in an approximately linear fashion when the magnitude of the drain-source voltage is reduced. Due to the body effect, V TH decreases (increases) under Forward (Reverse) Body Biasing, FBB (RBB), i.e., for positive (negative) body bias voltages V BB (footnote 4) (Tsividis 1999). The approximately linear dependence in both effects is captured by the following equation

$$ {V}_{TH}={V}_{TH0}-{\lambda}_{DIBL}{V}_{DS}-{\lambda}_{BB}{V}_{BB} $$
(4.7)

where V TH0 is the threshold voltage extrapolated for very low V DD and \( {V}_{BB}=0\ V \), λ DIBL is the DIBL coefficient and λ BB is the body effect coefficient. The DIBL coefficient is in the order of 0.1 V/V or larger for technologies suited for IoT, and hence denotes a pronounced dependence of the threshold voltage on the drain-source voltage. As an example, Fig. 4.7 shows the change in V TH versus the drain-source voltage (i.e., V DD ) in 28-nm transistors. When V DD is reduced down to near-threshold voltages, V TH typically increases by around 100 mV compared to operation at nominal voltage. This needs to be explicitly taken into account when choosing the type of threshold voltage at design time.
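The linear model in (4.7) is easily evaluated, as in the sketch below, which computes the V TH shift when the drain-source voltage (i.e., V DD ) is scaled from 1 V down to 0.5 V. The value of V TH0 , the two voltages and the two DIBL coefficients are hypothetical examples.

```python
def threshold_voltage(vth0, vds, vbb=0.0, lam_dibl=0.1, lam_bb=0.0):
    """V_TH from (4.7): linear DIBL and body-bias terms."""
    return vth0 - lam_dibl * vds - lam_bb * vbb


VTH0 = 0.55  # hypothetical threshold voltage at zero V_DS and V_BB, V
for lam in (0.1, 0.2):
    vth_nom = threshold_voltage(VTH0, vds=1.0, lam_dibl=lam)
    vth_nt = threshold_voltage(VTH0, vds=0.5, lam_dibl=lam)
    shift_mv = (vth_nt - vth_nom) * 1e3
    print(f"lambda_DIBL = {lam:.1f} V/V  V_TH shift = {shift_mv:.0f} mV")
```

In this model, the V TH increase equals the DIBL coefficient times the reduction in V DD , so a shift of about 100 mV over a 0.5-V reduction requires a coefficient of about 0.2 V/V.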

On the other hand, the threshold voltage dependence on the body voltage is well known to be rather weak in advanced technologies, although it is appreciable in 90-nm generations or older. Considering the strong sensitivity of performance and leakage to V TH , body biasing is still a viable option in near-threshold circuits in bulk technologies at 90 nm or older, or in more recent generations in FDSOI CMOS technology.

The transistor threshold voltage also depends on the size, especially when the latter is close to the minimum allowed by the process. The dependence on the channel width W (length L) is qualitatively depicted in Fig. 4.8a (Fig. 4.8b). From Fig. 4.8a, the reduction of W leads to a decrease (increase) in V TH due to the Reverse (traditional) Narrow-Channel Effect RNCE (NCE) (Tsividis 1999). The dominance of one of the two effects mainly depends on the transistor isolation technology (e.g., Shallow Trench Isolation vs. LOCOS), the device structure (bulk, FinFET, FDSOI) and several other parameters. On the other hand, from Fig. 4.8b, the reduction of L leads to an increase (decrease) in V TH due to the Reverse (traditional) Short-Channel Effect RSCE (SCE) (Tsividis 1999). The dominance of one of the two effects mainly depends on whether the transistor body is lightly doped or includes halos to counteract short-channel effects (Tsividis 1999).

Fig. 4.8 Qualitative trend of threshold voltage vs. (a) transistor channel width, (b) transistor channel length (due to Short Channel Effect)

From the above considerations, transistor sizing can affect the performance in ways that are more complicated than the usual linear dependence of I on on W/L, due to the additional (strong) dependence of V TH on size at near-threshold voltages. As an example, Fig. 4.9 shows that the I on current trend versus W deviates from the traditional linear dependence \( {I}_{on}\propto W \). For the specific technology considered, the current increases faster than \( {I}_{on}\propto W \), denoting that the NCE dominates (i.e., wider channels lead to a larger current than expected due to the simultaneous reduction in V TH ). This effect is clearly more pronounced at low voltages due to the stronger dependence of I on on V TH , whereas it is negligible at nominal voltage.

Fig. 4.9 I on normalized to current in minimum-sized transistor vs. channel width (normalized to minimum width W min )

Other technologies might have opposite behavior due to dominant RNCE (i.e., I on increases slower than W, due to the progressive increase in V TH as W increases). On the other hand, Fig. 4.10 shows that I on decreases faster than 1/L at near-threshold voltages, due to the dominance of SCE. Again, this dependence is 1.5–3X stronger than at nominal voltage due to the stronger dependence of I on on V TH at low voltages. Other technologies might have different behavior, due to the dominance of RSCE.

Fig. 4.10 I on normalized to current in minimum-sized transistor vs. channel length (normalized to minimum length L min )

4.2.3 PMOS/NMOS Strength Ratio, Stacking and Wire Delay

Another important effect observed at near-threshold voltages is the deviation of the ratio of the PMOS and NMOS strength (i.e., I on ) at iso-size, compared to nominal voltage. This is due to the different dependence of PMOS and NMOS I on across different voltages. Indeed, from (4.3)–(4.6) the transistor strength has a mild dependence on V TH and is mostly defined by the carrier mobility at nominal voltage. Hence, differences in V TH between PMOS and NMOS do not significantly impact the strength. On the other hand, the strength has a strong dependence on V TH at near-threshold (and lower) voltages, hence even moderate differences in V TH between PMOS and NMOS substantially alter their strength ratio. The latter can be smaller or larger than the value at nominal voltage, depending on the V TH differences between PMOS and NMOS (including DIBL), and hence on the specific technology. Figure 4.11 shows the trend of the PMOS/NMOS strength ratio in a specific 28-nm technology, which at near-threshold voltages can be reduced by up to 2.5X compared to nominal voltage. At lower voltages, the impact is even larger, due to the exponential dependence of I on on V TH in (4.4). This deviation of the PMOS/NMOS strength ratio clearly threatens the noise margin of CMOS logic gates, thus degrading robustness and exposing logic gates to malfunctions due to variations. It also exacerbates the imbalance between the rise and fall delay, thus degrading performance.
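The drift of the PMOS/NMOS strength ratio can be illustrated with (4.2), as in the sketch below. It assumes a PMOS threshold voltage magnitude 50 mV larger than the NMOS one, the same n for both devices, a fixed ratio of the specific currents, and no DIBL. All of these are hypothetical simplifications, and the output is not meant to reproduce Fig. 4.11.

```python
import math

KT_Q = 0.02585  # thermal voltage kT/q at about 300 K, V


def ion_norm(vdd, vth, n=1.4):
    """I_on / I_0 from (4.2)."""
    return math.log1p(math.exp((vdd - vth) / (n * KT_Q)))


def pn_strength_ratio(vdd, vth_p, vth_n, i0_ratio=0.5):
    """PMOS/NMOS I_on ratio at equal size; i0_ratio = I_0,PMOS / I_0,NMOS."""
    return i0_ratio * ion_norm(vdd, vth_p) / ion_norm(vdd, vth_n)


VTH_N, VTH_P = 0.40, 0.45   # hypothetical threshold voltage magnitudes, V
VDD_NOM = 1.0               # hypothetical nominal supply voltage, V

ratio_nom = pn_strength_ratio(VDD_NOM, VTH_P, VTH_N)
for vdd in (1.0, 0.6, 0.5, 0.4, 0.3):
    rel = pn_strength_ratio(vdd, VTH_P, VTH_N) / ratio_nom
    print(f"VDD = {vdd:.1f} V  PMOS/NMOS ratio = {rel:.2f}X its nominal value")
```

Swapping the two threshold voltages makes the ratio grow instead of shrink at low voltages, consistent with the remark that the direction of the deviation depends on the technology.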

Fig. 4.11 Ratio between the strength of PMOS and NMOS versus supply voltage V DD (LVT transistors)

Analogously, the strength of stacked transistors (i.e., connected in series) relative to a single transistor can heavily deviate from its value at nominal voltage. This can be shown by the I on stacking factor X on , defined as the factor by which the I on current is reduced due to the transistor stacking, compared to a single transistor (assuming all transistors have the same size as the single one). The trend in Fig. 4.12 shows that the stacking factor tends to peak around near-threshold voltages, and the phenomenon is more evident under a larger number of stacked transistors. At lower (sub-threshold) voltages, the stacking factor goes back to a smaller, threshold-voltage-independent value (Alioto 2012).

Fig. 4.12 I on stacking factor X on versus supply voltage V DD for different numbers of stacked transistors

The stacking factor peaking at near-threshold voltages can be observed in any CMOS technology, as the presence of stacked transistors reduces the drain-source voltage of each stacked transistor, and hence leads to a further increase in V TH (and decrease in the strength) due to DIBL, compared to a single transistor. This explains why the near-threshold current delivered by four stacked transistors is up to 7X lower than that of a single transistor at iso-size, whereas this factor is about half as large at nominal voltage. For the same reason, the degradation for two and three stacked transistors is much less pronounced, and is acceptable from a performance point of view. Hence, as a general circuit design guideline, the maximum number of stacked transistors in near-threshold designs needs to be lower (e.g., three) than at nominal voltage (typically up to four).

Finally, another fundamental difference encountered in near-threshold designs is the deviation of the ratio between the gate and wire delay, compared to nominal voltage. Indeed, at lower voltages, the gate delay increases as in (4.6), whereas the wire delay remains constant. As an example, Fig. 4.13 considers a wire whose delay matches the delay of a single gate designed for high performance (i.e., its delay is about one FO4 (Sutherland et al. 1999)) at nominal V DD . This corresponds to a global wire with a length in the order of a millimeter. From this figure, the wire delay at near-threshold voltages represents only a small fraction of the gate delay (in the order of 10X smaller). This means that above-threshold designs and architectures that aim at mitigating the impact of wire delay are overdesigned, and sub-optimal in terms of performance and energy, at near-threshold voltages. Hence, near-threshold circuits and architectures need to be very different from traditional above-threshold solutions, due to the drastically smaller impact of wire delay. This brings back circuit and architectural solutions that were abandoned in the late 1990s, due to the impact that wires then had on the system performance.

Fig. 4.13 Ratio of wire and gate delay normalized to value at nominal voltage versus supply voltage V DD (28 nm, LVT transistors)

In summary, the large performance sensitivity to V DD and V TH represents a very interesting opportunity in near-threshold circuits, but also poses various challenges. Among those, it is not possible to maintain a fixed delay ratio between cells with different amounts of stacking and different threshold voltages when scaling the voltage (even without variations). This, in addition to the substantial performance degradation due to stacked transistors, suggests that the maximum fan-in of near-threshold CMOS standard cells should be three. Within the same cell, it is not possible to maintain a stable PMOS/NMOS strength ratio across different voltages. For the same reasons, ratioed and dynamic logic styles are unfeasible at near-threshold voltages (not to mention the larger impact of leakage and variations, as discussed in Sects. 4.3 and 4.6). Similarly, topologies that are inherently based on current contention and positive feedback need to be avoided (e.g., cross-coupled non-clocked inverters in flip-flops). Unfortunately, such topologies cannot be avoided in SRAM bitcells and register files for density reasons, and other sophisticated techniques need to be deployed (see Chap. 5).

4.2.4 Knobs to Adjust Transistor Strength

From the previous subsection, the transistor strength can be adjusted with the following knobs:

  • transistor size

  • body biasing

  • V TH selection

  • V DD tuning and fine-grain boosting.

From the previous subsection, transistor sizing is relatively effective, and can be more or less effective than at nominal voltage, depending on the dominance of RNCE over NCE, and SCE over RSCE. Body biasing can significantly alter the transistor strength only in old technologies (e.g., 90 nm), or in recent FDSOI technologies, with a typical 30% range of adjustment at near-threshold voltages.

In view of the strong dependence of I on on the gate overdrive discussed in Sect. 4.1, the transistor strength can be substantially modified through the proper selection of the threshold voltage, and the fine-grain boosting of V DD to selectively increase I on where required. Regarding the V TH selection, a 2–4X I on (and delay) change was previously shown to be feasible when changing V TH from one type (e.g., RVT) to the next available one (e.g., LVT). However, V TH selection at near-threshold voltages poses various additional challenges, compared to operation at nominal voltage. Indeed, the sensitivity of I on to V TH translates into a strong sensitivity to its process variations. Also, the delay ratio of an RVT and LVT logic gate (see Fig. 4.14 for an inverter gate) strongly depends on the supply voltage. In other words, mixing standard cells with different V TH poses the problem of having different delay scaling in different portions of the system. In turn, this makes timing closure more difficult and might reduce the energy benefit of dynamic voltage scaling, as the critical path(s) depend on the voltage.

Fig. 4.14 Ratio of I on of LVT and RVT transistors normalized to value at nominal voltage versus supply voltage V DD

Let us now consider fine-grain voltage boosting, which consists in selectively over-driving appropriate transistors with a voltage above V DD . As shown in the illustrative example in Fig. 4.15, this might be the case of a single large transistor M1 (e.g., sleep transistor, large buffer) that drives a sub-circuit containing several smaller transistors. Let us assume that the gate of M1 is overdriven at \( {V}_{DD}+\Delta {V}_{DD} \) as opposed to all other transistors and logic gates, which are powered at V DD . Due to the strong (super-linear) I on increase in M1 due to the gate voltage boosting by ΔV DD , the transistor can be significantly undersized while maintaining the same strength as the transistor that is driven by V DD . In view of the strong dependence of I on on the gate voltage in M1 at near-threshold voltages, a small amount of boosting ΔV DD substantially reduces the area occupied by M1. This is shown in the example in Fig. 4.15 in 28 nm, where the area of M1 can be reduced by up to an order of magnitude while maintaining the same strength, through an amount of boosting in the order of a few hundred mV. Similarly, such selective boosting super-linearly reduces the leakage current of M1. At the same time, the gate capacitance C g,M1 of M1 is reduced super-linearly as well, whereas the supply voltage is increased by the very limited amount ΔV DD . This means that the dynamic energy \( {C}_{g, M1}\cdot {\left({V}_{DD}+\Delta {V}_{DD}\right)}^2 \) to switch M1 ON is reduced overall. In the example in Fig. 4.15, a 2X energy reduction can be achieved through selective boosting of M1, when adopting an optimal ΔV DD of 300 mV (this voltage depends on the specific technology). Very similar energy saving is observed for ΔV DD in the order of 100–200 mV. On the other hand, a larger amount of boosting slightly increases the energy consumption to turn on M1, since the transistor starts operating above threshold (i.e., I on becomes less sensitive to ΔV DD ), and the energy cost \( {C}_{g, M1}\cdot {\left({V}_{DD}+\Delta {V}_{DD}\right)}^2 \) of boosting increases substantially due to the quadratic dependence.
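The tradeoff between the size of M1 and the amount of boosting can be explored to first order with the model in (4.2). The sketch below assumes that the current per unit width follows (4.2) with a constant V TH , that the gate capacitance of M1 is simply proportional to its width, and that the cost of level shifters and of the boosted supply is ignored. These are assumptions of this example and not of the text, as are the values V DD = 0.5 V and V TH = 0.45 V, so the output should not be expected to match Fig. 4.15.

```python
import math

KT_Q = 0.02585  # thermal voltage kT/q at about 300 K, V


def ion_per_width(vgs, vth, n=1.4):
    """I_on per unit width, normalized to I_0, from (4.2)."""
    return math.log1p(math.exp((vgs - vth) / (n * KT_Q)))


def boosted_m1(vdd, delta, vth):
    """Relative width and gate switching energy of M1 at equal I_on.

    Both values are normalized to the unboosted transistor (delta = 0).
    """
    width = ion_per_width(vdd, vth) / ion_per_width(vdd + delta, vth)
    energy = width * ((vdd + delta) / vdd) ** 2   # C_g,M1 * (V_DD + delta)^2
    return width, energy


VDD, VTH = 0.5, 0.45  # hypothetical supply and threshold voltage, V
for delta_mv in range(0, 900, 100):
    width, energy = boosted_m1(VDD, delta_mv / 1000.0, VTH)
    print(f"boost = {delta_mv:3d} mV  width = {width:.3f}  energy = {energy:.3f}")
```

Under these assumptions, the width of M1 shrinks monotonically with the boosting, whereas the switching energy first decreases and then slowly increases again, so an optimal amount of boosting exists.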

Fig. 4.15 (a) In-principle circuit with large transistor whose gate voltage is boosted by ΔV DD , (b) area and energy improvement vs. ΔV DD

From the above considerations, near-threshold circuits can be made more energy- and area-efficient by selectively boosting portions of the circuit that contain large (and hence energy- and area-hungry) transistors. As opposed to traditional multi-V DD approaches that are applied at the module level, in this case the supply is boosted with fine granularity (i.e., down to the single transistor). Such fine-grain voltage boosting also offers the opportunity to equalize imbalanced logic across pipestages. As a fundamental challenge, fine-grain boosting entails significant area overhead, due to the additional level shifters to drive boosted-voltage domains, and to the slight additional cost of distributing multiple voltages at the physical design level (Flynn et al. 2007). In other words, near-threshold circuits should certainly take advantage of fine-grain voltage boosting, but innovative techniques are needed to minimize the unavoidable overhead. Some recently proposed ideas to address this challenge will be presented in Sect. 4.1.

4.3 Energy Trends

In this section, the impact of voltage scaling on the energy is reviewed, providing models and design guidelines for minimum-energy operation.

4.3.1 Transistor Leakage Current at Near-Threshold Voltages

The MOS transistor leakage contributions are summarized in Fig. 4.16. The dominant contribution is due to the sub-threshold leakage in (4.4), which flows between drain and source and is due to the diffusion of minority carriers between the two terminals (Tsividis 1999). The gate leakage flows from gate to source/drain or vice versa, depending on the applied voltages, and tends to be exponentially smaller than the sub-threshold contribution when lowering V DD (Narendra and Chandrakasan 2006). Similar considerations hold for the substrate leakage, which is mostly due to the Band-to-Band Tunneling (BTBT), and the inverse saturation current of the source-bulk and drain-bulk pn junctions (Narendra and Chandrakasan 2006). Hence, the transistor leakage current at near-threshold voltages is well approximated by (4.4), where the gate-source voltage (assigned to V DD in (4.4)) has to be set to zero. By substituting the dependence of V TH in (4.7), the near-threshold leakage current of an NMOS transistor is immediately obtained as

Fig. 4.16 Transistor leakage contributions: (a) sub-threshold, (b) gate, (c) substrate

$$ {I}_{lkg}={I}_0\cdot {e}^{-\frac{V_{TH0}-{\lambda}_{BB}{V}_{BB}}{n\cdot kT/ q}}\cdot {e}^{\frac{\lambda_{DIBL}{V}_{DD}}{n\cdot kT/ q}} $$
(4.8)

where the first exponential term is set by the threshold voltage (including body biasing, when applied), and the second expresses the DIBL effect and hence the leakage dependence on V DD . The latter dependence is exponential, and operation at near-threshold voltages typically reduces the leakage current by up to about an order of magnitude for typical DIBL coefficients in the order of 0.1 V/V, compared to nominal voltage. The consistent exponential trend across voltages in (4.8) is shown in Fig. 4.17 for a 28-nm technology, along with a leakage current reduction at near-threshold voltages by 4–8.5X.
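The leakage reduction predicted by (4.8) under voltage scaling can be estimated as in the following sketch, which takes the ratio between the leakage current at 1 V and at 0.5 V. The two voltages, V TH0 , n and the two DIBL coefficients are hypothetical values chosen only for illustration.

```python
import math

KT_Q = 0.02585  # thermal voltage kT/q at about 300 K, V


def leakage_norm(vdd, vth0=0.5, vbb=0.0, lam_dibl=0.1, lam_bb=0.0, n=1.4):
    """I_lkg / I_0 from (4.8)."""
    n_vt = n * KT_Q
    vth_term = math.exp(-(vth0 - lam_bb * vbb) / n_vt)
    dibl_term = math.exp(lam_dibl * vdd / n_vt)
    return vth_term * dibl_term


for lam in (0.10, 0.15):
    reduction = leakage_norm(1.0, lam_dibl=lam) / leakage_norm(0.5, lam_dibl=lam)
    print(f"lambda_DIBL = {lam:.2f} V/V  leakage reduction = {reduction:.1f}X")
```

Note that the ratio depends only on the DIBL term, since the threshold-voltage term in (4.8) is independent of V DD .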

Fig. 4.17 Leakage current I off of LVT transistor (normalized to value at nominal voltage) versus supply voltage V DD

4.3.2 Energy Consumption of Digital Systems at Near-Threshold Voltages

The total energy per operation (footnote 5) in a near-threshold VLSI digital system is essentially equal to the sum of the dynamic and the leakage energy. Indeed, the short-circuit energy contribution (Weste and Harris 2011) is negligible at near-threshold voltages, as opposed to operation at nominal voltage. This is because the short-circuit current flowing through the transistors for input voltages around V DD /2 is a small sub-threshold current, which also rapidly vanishes when the input voltage deviates from V DD /2 to settle to its stable value (Alioto 2012), due to the exponential I–V characteristics in the sub-threshold region.

The dynamic energy per operation is given by

$$ {E}_{dyn}={\alpha}_{SW}\cdot C\cdot {V}_{DD}^2\cdot C P O $$
(4.9)

where \( {\alpha}_{SW}\cdot C\cdot {V}_{DD}^2 \) is the dynamic energy per cycle, C is the total capacitance within the circuit, and α SW is the activity factor (Weste and Harris 2011) (i.e., the fraction of C that is switched in a cycle, on average). In (4.9), it was considered that an operation in general takes an average number of cycles CPO (Cycles per Operation), which depends on the specific (micro)architecture and, to a minor extent, on the dataset (e.g., in microprocessors).

The leakage energy per operation can be expressed as the product of the average leakage power \( {V}_{DD}\cdot {I}_{off} \) (where I off is the average leakage current), the clock cycle T CK and CPO (Alioto 2012). T CK can be expressed as \( F O4\cdot L{D}_{eff} \), where \( L{D}_{eff}={T}_{CK}/ F O4 \) is the number of FO4 delays (i.e., cascaded inverters with fan-out of 4) that can fit the cycle time. Hence, LD eff represents the effective logic depth per pipestage, which is a constant defined by the (micro)architecture (footnote 6). The leakage energy per operation can thus be written as \( {E}_{lkg}={V}_{DD}\cdot {I}_{off}\cdot {T}_{CK}\cdot C P O \), or equivalently

$$ {E}_{lkg}={V}_{DD}\cdot {I}_{off}\cdot F O4\cdot L{D}_{eff}\cdot C P O $$
(4.10)

In (4.10), the only parameters that depend on V DD are V DD itself, I off and FO4. When downscaling the voltage, the first term decreases linearly and I off decreases exponentially due to DIBL, although not very rapidly since V DD is multiplied by \( {\lambda}_{DIBL}\ll 1 \) in (4.8). On the other hand, FO4 rapidly increases as in (4.6) when V DD is reduced down to near-threshold voltages and below. The overall effect of the three factors leads to an increase in E lkg at near- and sub-threshold voltages when V DD is reduced, as opposed to the dynamic energy. The leakage energy tends to increase very rapidly when decreasing V DD down to the transistor threshold voltage, due to the resulting rapid increase in the gate delay in (4.6). This is shown in Fig. 4.18, where the leakage energy of the reference digital circuit in Sect. 4.4 is plotted versus V DD under RVT and LVT transistor flavors. As expected, the rapid increase in leakage energy under the RVT flavor starts at higher voltages compared to LVT, due to the higher threshold voltage. This figure also shows that E lkg tends to shoot up at larger voltages under microarchitectures with larger logic depth (e.g., LD eff  = 50 instead of 25). This is because such microarchitectures suffer from larger leakage energy from (4.10), and hence the rapid increase can be observed at larger voltages. On a side note, Fig. 4.18 also shows that E lkg has an opposite behavior at above-threshold voltages (i.e., it decreases when decreasing V DD ), due to the dominance of the exponential effect of DIBL over the linear FO4 increase.

Fig. 4.18 Leakage energy vs. supply voltage V DD for logic depths LD eff equal to 25 FO4 and 50 FO4 (28 nm, LVT and RVT transistors)

From (4.9)–(4.10), the total energy per operation E TOT of a given VLSI system or sub-system is

$$ {E}_{TOT}={E}_{dyn}+{E}_{lkg}={E}_{cycle}\cdot C P O $$
(4.11)

where the energy per cycle \( {E}_{cycle}={E}_{TOT}/ C P O \) is defined as

$$ {E}_{cycle}={\alpha}_{SW}\cdot C\cdot {V}_{D D}^2+{V}_{D D}\cdot {I}_{off}\cdot F O4\cdot L{D}_{eff} $$
(4.12)

The qualitative trend of (4.11)–(4.12) versus V DD in Fig. 4.19 shows that voltage down-scaling reduces the dynamic energy, but increases the leakage energy. Hence, a minimum-energy point (MEP) is observed at a voltage V DD,opt that optimally balances the dynamic and leakage energy, thus leading to the minimum energy E min (Footnote 7). The MEP voltage V DD,opt typically lies in the sub-threshold or near-threshold region (Hanson et al. 2006a; Hanson et al. 2006b), as discussed in the next section. Due to the flatness of the energy curve around the MEP, near-threshold operation permits true- or nearly-minimum energy operation, which is the fundamental design target of this chapter.

Fig. 4.19 Qualitative trend of dynamic, leakage and total energy per cycle E cycle (or equivalently total energy per operation E TOT ) vs. supply voltage V DD

Figure 4.20a–d shows the energy trend and the presence of the MEP in various integrated prototypes, including an FFT core from MIT (Wang and Chandrakasan 2005), an 8-bit microprocessor from Umich (Hanson et al. 2008), an IA-32 processor from Intel (Jain et al. 2012), and an AES core from NUS (Zhao et al. 2015). From these figures, the energy curve is relatively flat around the MEP, hence achieving minimum- or nearly-minimum energy per operation does not require stringent precision in the generation of the supply voltage. In practical designs, a change in V DD around the optimal voltage V DD,opt by a few tens of mV (e.g., 30–50 mV) keeps the energy very close to E min (e.g., within a few percentage points). The MEP voltage V DD,opt in the above examples covers the typical range encountered in real designs (300–450 mV). The detailed analysis of the dependence of the MEP position on process and design parameters is presented in the next subsection.

Fig. 4.20 Energy vs. V DD and minimum-energy point in (a) FFT core (Wang and Chandrakasan 2005, from MIT), (b) 8-bit microprocessor (Hanson et al. 2008, from Umich), (c) IA-32 processor (Jain et al. 2012, from Intel), (d) AES core (Zhao et al. 2015, from NUS)

Let us observe that the leakage energy increase at low voltages limits the energy reductions enabled by aggressive voltage scaling, compared to the quadratic reduction that would be achievable if the total energy were dominated by E dyn . Indeed, in the latter case the minimum achievable energy would be given by (4.9) with V DD equal to the minimum operating voltage V min that ensures correct operation, as in Fig. 4.19. The related energy saving compared to nominal voltage is reported in Table 4.2, which represents an upper bound on the achievable energy savings and is useful for quick estimates. Observe that the potential energy savings in circuits with wire-dominated load are lower than in the case of gate-dominated load. Indeed, in the former case the dynamic energy reduction is purely quadratic, whereas the latter also benefits from the simultaneous load reduction due to the reduction in the transistor gate capacitance at low V DD (see Sect. 4.2.1 and Fig. 4.4). From this table, operation at near-threshold voltages potentially reduces the energy by up to an order of magnitude, compared to the nominal voltage. At the same time, the presence of leakage narrows down the range of voltages over which an energy reduction is actually obtained.

Table 4.2 Dynamic energy reduction vs. VDD

As an even more crucial observation, the leakage energy is a substantially larger fraction of the overall energy budget at near-threshold voltages, compared to nominal voltage. Indeed, E lkg (E dyn ) at near-threshold voltages is larger (smaller) than at nominal V DD . Table 4.3 shows the detailed energy breakdown measured in the microprocessor in (Jain et al. 2012), which includes a level-1 cache. Above threshold, the leakage energy is 14% and is well in line with the expectations at nominal voltage. In the near-threshold region, the leakage energy rises to a much larger 42% as expected, and in the sub-threshold region it completely dominates the overall energy.

Table 4.3 Measured energy breakdown in Jain et al. (2012)

For all the above reasons, mitigating the leakage energy is a crucial goal of near-threshold designs, and is far more important than in traditional low-power above-threshold designs. In addition, Sect. 4.4 will show that traditional low-power techniques to mitigate leakage are rather ineffective when V DD is pushed down to near threshold.

4.3.3 Trans-Regional Energy Model

From the previous subsection, the MEP is set by the optimal balance between dynamic and leakage energy. In other words, the MEP voltage V DD,opt and the resulting minimum energy E min both depend on the ratio between leakage and total energy. This means that the MEP position in the energy-voltage plane changes according to this ratio, as discussed below.

When the leakage energy significantly increases for some reason while the dynamic energy remains constant, the total energy clearly increases and V DD,opt increases as well (i.e., the MEP moves to the right and upwards, as summarized in Fig. 4.21a). Indeed, in this case the leakage energy becomes a larger fraction of E TOT , hence V DD needs to be increased to reduce E lkg (as explained by Fig. 4.19). On the other hand, when the dynamic energy increases at iso-leakage energy, E dyn becomes a larger fraction of E TOT , hence it becomes more important to reduce E dyn and V DD,opt decreases (i.e., the MEP moves to the left and upwards, as summarized in Fig. 4.21b). From the above considerations, the MEP tends to move to the right when the temperature is increased and/or the circuit activity is reduced, due to a different input data profile or power mode (i.e., different modules are activated). Since the input dataset and the temperature are time varying and unpredictable at design time, a feedback scheme is needed that tracks the actual MEP through energy sensing (or estimation) and adjusts the supply voltage accordingly (Ramadass and Chandrakasan 2008).

Fig. 4.21 Qualitative description of how the minimum-energy point (MEP) changes position when (a) E lkg increases (E dyn kept constant), (b) E dyn increases (E lkg kept constant)

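To gain a more quantitative understanding of the dependence of the MEP position on process, design and environmental parameters, let us consider an analytical model of the energy. In detail, Eq. (4.12) can be written in the following more useful form (Footnote 8):

$$ {E}_{cycle}={\alpha}_{SW}\cdot {C}_{TOT}\cdot {V}_{DD}^2\cdot \left[1+ ILDR\cdot {e}^{-\frac{V_{DD}\left(1+{\alpha}_{X_{off}}\right)}{n\cdot \frac{kT}{q}}}\cdot {f}_{ILDR}\left({V}_{DD}\right)\right] $$
(4.13)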

where (4.7) and (4.26) were used, and the intrinsic leakage-dynamic energy ratio ILDR (i.e., the contribution of E lkg /E dyn that is independent of V DD ) was defined as

$$ ILDR=2.5\cdot L{D}_{e ff}\cdot \frac{\overline{strength}}{{\left.{X}_{stack, off}\right|}_{V_{D D, nom}}\cdot {e}^{-{\alpha}_{X_{off}} \cdot \frac{V_{D D, nom}}{n\cdot kT/ q}}}\cdot \frac{1}{\alpha_{SW}\cdot \frac{\overline{C_{cell}}}{C_{in, min}}} $$
(4.14a)

and f ILDR (V DD ) was defined as

$$ {f}_{ILDR}\left({V}_{DD}\right)=\frac{e^{\frac{V_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}}{n\cdot \frac{kT}{q}}}}{ \ln \left({e}^{\frac{V_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}}{n\cdot \left( kT/ q\right)}}+1\right)} $$
(4.14b)

Also, in (4.13) it was observed that

  • the total leakage current I off,TOT of the design under consideration is equal to the gate count (gatecount) multiplied by the average leakage per standard cell \( \overline{I_{off}} \)

  • \( \overline{I_{off}} \) can be expressed as the leakage current of a minimum-sized inverter \( {I}_{0, min}\cdot {e}^{-{V}_{TH}/ n\cdot \frac{kT}{q}} \), multiplied by the average cell strength \( \overline{strength} \) (where 1X refers to the minimum-sized inverter) and divided by the average off-stacking factor X stack,off (i.e., the factor by which the leakage current is reduced due to transistor stacking)

  • the total capacitance C TOT is equal to the gate count multiplied by the average switched capacitance per standard cell \( \overline{C_{cell}} \)

  • FO4 can be thought of as the delay of a minimum-sized inverter driving four equal inverters, and hence its total load capacitance is 4C in,min (C in,min is the input capacitance of a minimum-sized inverter) plus its parasitic capacitance, which is approximately equal to C in,min (Sutherland et al. 1999); I on,min is the current delivered by such minimum-sized inverter (see (4.4)), and I 0,min is the I 0 parameter in (4.1) and (4.2) pertaining to the same inverter.

In (4.13), the average off-stacking factor X stack,off is evaluated at the nominal voltage V DD,nom of the adopted technology, and its downscaling at low voltages is accounted for by the technology-dependent parameter \( {\alpha}_{X_{off}}\ll 1 \) (see (4.26)). Eq. (4.13) is strongly affected by ILDR. From (4.13), the latter parameter represents the voltage-independent (i.e., intrinsic) contribution of the ratio between leakage and dynamic energy, and is defined by

  • the architecture through LD eff (i.e., faster designs with low LD eff exhibit lower ILDR)

  • the function implemented by the design under consideration, which in turn sets the standard cell usage statistics (i.e., the average off-stacking factor X stack,off ) and the average fan-out \( \overline{C_{cell}}/{C}_{in, min} \) (i.e., the average equivalent number of minimum-sized inverters that load the cells); such dependence tends to be fairly weak, due to the averaging effect across cells in large designs

  • the performance target in the automated cell sizing phase, as \( \overline{strength} \) and the average fan-out \( \overline{C_{cell}}/{C}_{in, min} \) both tend to be larger for tighter timing constraints (again, faster designs have lower ILDR); accordingly, such dependence tends to be fairly weak as well

  • the input dataset, which in turn sets the circuit activity (i.e., α SW ).

In summary, the parameter ILDR is mainly set by the architecture and the input statistics, and low values of ILDR are associated with more active and faster designs. In practical cases, ILDR ranges from a few hundred for rather fast and active designs with heavy dynamic energy, to several tens of thousands in very slow and inactive circuits. Higher values are observed only when an additional constant power contribution comes from external blocks.

Observe that ILDR is defined at nominal voltage and all parameters not explicitly related to V DD,nom are essentially independent of the voltage (Footnote 9), and hence can be evaluated from the synthesis/place&route report at such voltage without requiring the full characterization of the library at different voltages. A few numerical examples in 28 nm are reported in Table 4.4, assuming \( {\lambda}_{DIBL}=0.1 \) (i.e., \( {\alpha}_{X_{off}}=0.098 \) from (4.26)) and \( {\left.{X}_{stack, off}\right|}_{V_{DD, nom}} \) equal to 20 (i.e., average of two stacked transistors in this technology) at nominal voltage. From the technology scaling viewpoint, ILDR tends to slightly decrease at finer technologies, due to stronger DIBL and hence larger \( {\left.{X}_{stack, off}\right|}_{V_{DD, nom}} \). As a simpler approach, ILDR can also be estimated as the value that makes the ratio E lkg /E dyn (i.e., \( ILDR\cdot {e}^{-{V}_{DD}\left(1+{\alpha}_{X_{off}}\right)/ n\frac{kT}{q}}\cdot {f}_{ILDR}\left({V}_{DD}\right) \) in (4.13)) equal to the value that is obtained from power analysis at RTL level.
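The leakage-to-dynamic ratio quoted above, together with (4.14b), is straightforward to evaluate numerically. The following Python sketch is illustrative only: λ DIBL, α Xoff and V TH0 = 0.35 V are the 28-nm values used in this section, whereas the subthreshold factor n = 1.5 and the thermal voltage of 25.9 mV are assumptions, and the function names are hypothetical. The last helper implements the "simpler approach" of back-computing ILDR from a ratio obtained by power analysis at a known voltage.

```python
import math

V_T = 0.0259        # thermal voltage kT/q near 300 K, in volts (assumed)
N_SUB = 1.5         # subthreshold factor n (assumed, not given in the text)
LAMBDA_DIBL = 0.1   # DIBL coefficient used for Table 4.4
ALPHA_XOFF = 0.098  # stacking-factor voltage coefficient used for Table 4.4
V_TH0 = 0.35        # threshold voltage used for Fig. 4.22a, in volts


def f_ildr(vdd):
    """Voltage-dependent factor f_ILDR(V_DD) as written in (4.14b)."""
    x = (vdd * (1.0 + LAMBDA_DIBL) - V_TH0) / (N_SUB * V_T)
    return math.exp(x) / math.log1p(math.exp(x))


def leakage_to_dynamic_ratio(vdd, ildr):
    """E_lkg / E_dyn = ILDR * exp(-V_DD (1 + alpha_Xoff) / (n kT/q)) * f_ILDR(V_DD)."""
    decay = math.exp(-vdd * (1.0 + ALPHA_XOFF) / (N_SUB * V_T))
    return ildr * decay * f_ildr(vdd)


def ildr_from_ratio(vdd, measured_ratio):
    """ILDR that reproduces a given E_lkg / E_dyn at the supply voltage vdd."""
    return measured_ratio / leakage_to_dynamic_ratio(vdd, 1.0)


if __name__ == "__main__":
    for vdd in (0.2, 0.3, 0.4, 0.6, 1.2):
        print(f"V_DD = {vdd:.1f} V  f_ILDR = {f_ildr(vdd):.3e}  "
              f"E_lkg/E_dyn (ILDR = 5000) = {leakage_to_dynamic_ratio(vdd, 5000):.3e}")
```

Well below the threshold voltage `f_ildr` stays close to 1, so the ratio is set by ILDR and the exponential term alone.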

Table 4.4 Numerical Examples for ILDR in 28 nm (\( {V}_{DD, nom}=1.2\ V \))

4.3.4 Considerations on the MEP Voltage

The MEP mostly lies in the deep sub-threshold region (Hanson et al. 2006a), and sometimes in the near-threshold region (Hanson et al. 2006b). In the former case, \( {f}_{ILDR}\left({V}_{DD}\right)\approx 1 \) in (4.13) since \( {V}_{DD}<{V}_{TH0} \) in (4.14b), hence the energy is independent of the transistor threshold voltage. This is because the latter affects both the leakage and the on-current in the same way (i.e., both are proportional to \( \exp \left(-{V}_{TH}/\left( n\cdot \frac{kT}{q}\right)\right) \) in sub-threshold). In this case, V TH is chosen exclusively based on the performance requirement (i.e., targeted FO4), according to (4.4).

Let us now analyze the optimum voltage V DD,opt that minimizes the energy in (4.13) assuming the MEP to be in sub-threshold. Although a closed-form solution cannot be found, a good approximation for a single-V TH design is logarithmic (similar to (Bo et al. 2004; Hanson et al. 2006a, b))

$$ {V}_{DD, opt}\approx \frac{n}{1+{\alpha}_{X_{off}}}\frac{kT}{q}\left[1.25 \ln (ILDR)-0.5\right] $$
(4.15)

which has a maximum error of 4% for practical values of ILDR, as plotted in Fig. 4.22a under the above 28-nm parameters and \( {V}_{TH0}=0.35\ V \). From this figure, the MEP voltage logarithmically increases with the constant slope in (4.15), which is independent of the adopted threshold voltage, as shown in Fig. 4.22b.
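The quality of the logarithmic approximation can be checked numerically. The sketch below assumes the sub-threshold simplification \( f\left({V}_{DD}\right)\approx 1 \), so that the energy is proportional to \( {u}^2\left(1+ ILDR\cdot {e}^{-u}\right) \) with the normalized voltage \( u={V}_{DD}\left(1+{\alpha}_{X_{off}}\right)/\left( n\cdot kT/ q\right) \). Setting the derivative to zero gives \( ILDR\cdot \left( u-2\right)\cdot {e}^{-u}=2 \), which is solved here by bisection. The function names are hypothetical.

```python
import math


def exact_normalized_mep(ildr):
    """Normalized MEP voltage u_opt for f(V_DD) = 1, from ILDR*(u-2)*exp(-u) = 2.

    The stationarity condition has two roots; the one above u = 3 is the minimum.
    """
    if ildr * math.exp(-3.0) <= 2.0:
        raise ValueError("ILDR too small: no interior minimum in this model")
    lo, hi = 3.0, 60.0  # left-hand side exceeds 2 at lo and is below 2 at hi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ildr * (mid - 2.0) * math.exp(-mid) > 2.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def approx_normalized_mep(ildr):
    """Bracketed term of (4.15): 1.25 * ln(ILDR) - 0.5."""
    return 1.25 * math.log(ildr) - 0.5


if __name__ == "__main__":
    for ildr in (150, 1000, 10000):
        u_exact = exact_normalized_mep(ildr)
        u_approx = approx_normalized_mep(ildr)
        err = abs(u_approx - u_exact) / u_exact
        print(f"ILDR = {ildr:6d}  exact u = {u_exact:.3f}  "
              f"approx u = {u_approx:.3f}  relative error = {err:.1%}")
```

Multiplying the normalized result by \( n\cdot \left( kT/ q\right)/\left(1+{\alpha}_{X_{off}}\right) \) returns the MEP voltage in volts.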

Fig. 4.22 MEP voltage vs ILDR for 28-nm technology with (a) \( {V}_{TH0}=0.35\ V \) and detailed analytical model, (b) \( {V}_{TH0}=0.35\ V,\ 0.45\ V,\ 0.55\ V,\ 0.65\ V \) (exact solution via numerical minimization of (4.13), approximate expression as in (4.15)–(4.17))

As expected from the above considerations and Fig. 4.21, the MEP moves to the right for slow and less active circuits, and to the left for circuits with dominating dynamic energy and low logic depth. Observe that operation at \( {V}_{DD}<{V}_{min} \) severely degrades the die yield, hence minimum-energy designs need to adopt a supply voltage equal to the maximum between \( {V}_{DD, opt} \) in (4.15) and V min . From Fig. 4.22a, b, this means that fast (i.e., with low LD eff ) designs with ILDR lower than a few thousand cannot really achieve true-minimum energy, due to voltage scaling limitations imposed by robustness issues.

The above analysis assumed that the MEP lies in the deep sub-threshold region, which is correct as long as the voltage V DD,opt that minimizes (4.13) is lower than \( {V}_{TH}-50\ mV \) (see Fig. 4.1), i.e., when \( ILDR<{e}^{\frac{V_{TH}-50\ mV}{1.5 n\cdot \frac{kT}{q}}+1.6} \). The latter boundary value for ILDR in 28 nm is typically on the order of 1,000–2,000 for \( {V}_{TH}=350\kern0.75em \mathrm{mV} \) (including DIBL), 4,000–5,000 for \( {V}_{TH}=400\kern0.75em \mathrm{mV} \), and 10,000 for \( {V}_{TH}=450\kern0.75em \mathrm{mV} \). Slightly larger values are typically found in older technologies, due to the lower subthreshold factor n.

Interestingly, Fig. 4.22a, b show that (4.15) can be extended to the near-threshold region (i.e., larger ILDR), as it still predicts the MEP voltage with good accuracy. Hence, the MEP voltage is again independent of the threshold voltage, even in the near-threshold region. For even larger values of ILDR such that \( {V}_{DD}>{V}_{TH0}+200\kern0.75em \mathrm{mV} \) (see Fig. 4.1), the MEP moves to the above-threshold region and eventually saturates to a value \( {\left.{V}_{DD, opt}\right|}_{ILDR\to \infty } \). Indeed, since the logarithm in the denominator of (4.14b) tends to its argument's exponent, f ILDR (V DD ) becomes approximately equal to \( {e}^{\frac{V_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}}{n\cdot \frac{kT}{q}}}\cdot {\left(\frac{V_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}}{n\cdot \left( kT/ q\right)}\right)}^{-1} \), and the resulting V DD,opt is found by minimizing (4.13) for \( ILDR\to \infty \):

$$ {\left.{V}_{DD, opt}\right|}_{ILDR\to \infty }=\frac{2{V}_{TH0}}{1+{\lambda}_{DIBL}} $$
(4.16)

which was found to be always within 12% of the exact solution that minimizes (4.13) in 28 nm (and typically within 5%). V DD,opt saturates because E lkg increases when increasing V DD in the above-threshold region, as opposed to sub- and near-threshold (see Fig. 4.18 and related discussion). In other words, it does not make sense to increase V DD beyond (4.16) from an energy viewpoint, as this would surely increase both dynamic and leakage energy, and hence the total energy. Indeed, (4.16) represents the voltage at which E lkg is minimum (i.e., such that \( {V}_{DD}\cdot {I}_{off}\cdot F O4 \) is minimum, from (4.12)).
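As a hedged sketch of why (4.16) follows, assume the above-threshold approximations \( \ln \left({e}^x+1\right)\approx x \) and \( {\alpha}_{X_{off}}\approx {\lambda}_{DIBL} \), and take the leakage energy as the dynamic energy multiplied by the leakage-to-dynamic ratio \( ILDR\cdot {e}^{-{V}_{DD}\left(1+{\alpha}_{X_{off}}\right)/\left( n\cdot kT/ q\right)}\cdot {f}_{ILDR}\left({V}_{DD}\right) \) quoted in Sect. 4.3.3. Under these assumptions the exponential terms in V DD cancel, leaving

$$ {E}_{lkg}\propto \frac{V_{DD}^2}{V_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}} $$

Setting the derivative with respect to V DD to zero then gives

$$ 2{V}_{DD}\left[{V}_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}\right]-\left(1+{\lambda}_{DIBL}\right){V}_{DD}^2=0\kern1em \Rightarrow \kern1em {V}_{DD}\left(1+{\lambda}_{DIBL}\right)=2{V}_{TH0} $$

which is (4.16). As an illustration with the 28-nm values used in this section, \( {V}_{TH0}=0.35\ V \) and \( {\lambda}_{DIBL}=0.1 \) give a saturation voltage of about 0.64 V.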

In summary, the above considerations suggest that V DD,opt can be simply modeled by extending (4.15) to the near-threshold and part of the above-threshold region, and limiting it to its asymptotic maximum value in (4.16):

$$ {V}_{DD, opt}\approx \min \left(\frac{n}{1+{\alpha}_{X_{off}}}\frac{kT}{q}\left[1.25 \ln (ILDR)-0.5\right], \kern0.5em \frac{2{V}_{TH0}}{1+{\lambda}_{DIBL}}\right) $$
(4.17)

which has a typical (maximum) error of 4% (7%) across the very wide range of ILDR in Fig. 4.22a, b. Eq. (4.17) is a useful tool to estimate the MEP position by knowing the type of design (i.e., ILDR), and a few other technology-dependent parameters. From (4.17), the transistor threshold choice affects only the value of ILDR and the voltage at which the MEP saturates. In particular, larger V TH0 moves saturation towards exponentially larger ILDR and proportionally larger \( {\left.{V}_{DD, opt}\right|}_{ILDR\to \infty } \).
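Equation (4.17) translates directly into a small estimator. In the Python sketch below, the default values of α Xoff , λ DIBL and V TH0 are the 28-nm values used in this section, while n = 1.5 and kT/q = 25.9 mV are assumptions made only for illustration; the function name is hypothetical.

```python
import math


def mep_voltage(ildr, n=1.5, v_t=0.0259, alpha_xoff=0.098,
                v_th0=0.35, lambda_dibl=0.1):
    """Estimate V_DD,opt in volts from (4.17).

    n and v_t (kT/q) are assumed values; the other defaults are the
    28-nm values quoted in this section.
    """
    log_branch = n * v_t / (1.0 + alpha_xoff) * (1.25 * math.log(ildr) - 0.5)
    saturation = 2.0 * v_th0 / (1.0 + lambda_dibl)  # asymptote in (4.16)
    return min(log_branch, saturation)


if __name__ == "__main__":
    for ildr in (300, 1_000, 10_000, 100_000, 10**9):
        print(f"ILDR = {ildr:>10d}  ->  V_DD,opt ~ {mep_voltage(ildr) * 1e3:.0f} mV")
```

For moderate ILDR the logarithmic branch is returned, whereas for extremely large ILDR the estimate is clamped to the saturation value of (4.16).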

4.3.5 Considerations on the MEP Energy

From (4.13), the resulting energy at the MEP in deep sub-threshold region (i.e., under (4.15)) can be written as

$$ {E}_{min}={\alpha}_{SW}\cdot {C}_{TOT}\cdot {V}_{DD, opt}^2\left[1+{\left.\frac{E_{lkg}}{E_{dyn}}\right|}_{MEP}\right] $$
(4.18)

where, considering that \( f\left({V}_{DD}\right)\approx 1 \) and \( {\alpha}_{X_{off}}\ll 1 \) in (4.13), the energy-optimum ratio E lkg /E dyn at the MEP is given by

$$ {\left.\frac{E_{lkg}}{E_{dyn}}\right|}_{MEP, sub- threshold}\approx ILDR\cdot {e}^{-\frac{V_{DD, opt}}{n\cdot \frac{kT}{q}}}\approx 0.2+\frac{17}{ILDR^{0.75}} $$
(4.19)

In (4.19), an empirical approximate expression has been introduced to facilitate its estimation at design time, and its error is within 12% for low values of ILDR (down to 150), as plotted in Fig. 4.23. Observe that E lkg /E dyn at the MEP is independent of the chosen (single) threshold voltage, as expected from the considerations in Sect. 4.3.4. In other words, sub-threshold designs that differ only in the threshold voltage choice have the same leakage percentage contribution, in addition to the same V DD,opt (see Sect. 4.3.4). Accordingly, the threshold voltage is chosen based on the performance target rather than energy. Also, this means that the MEP voltage can be estimated at design time even before choosing the transistor flavor (and hence before actual implementation).
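The empirical fit in (4.19) can be compared with a simplified exact result. The Python sketch below assumes \( f\left({V}_{DD}\right)\approx 1 \), so that the energy is proportional to \( {u}^2\left(1+ ILDR\cdot {e}^{-u}\right) \) in the normalized voltage u; the stationarity condition then gives \( ILDR=2{e}^u/\left( u-2\right) \) and a leakage-to-dynamic ratio of \( 2/\left( u-2\right) \) at the MEP. Parametrizing by u avoids any root finding. Since this model is a simplification, the deviations it reports need not coincide with those of Fig. 4.23; the function names are hypothetical.

```python
import math


def ildr_for_normalized_mep(u_opt):
    """ILDR whose MEP sits at the normalized voltage u_opt (model with f = 1)."""
    return 2.0 * math.exp(u_opt) / (u_opt - 2.0)


def exact_ratio_at_mep(u_opt):
    """E_lkg / E_dyn at the MEP: ILDR * exp(-u_opt) = 2 / (u_opt - 2)."""
    return 2.0 / (u_opt - 2.0)


def empirical_ratio_at_mep(ildr):
    """Empirical fit in (4.19): 0.2 + 17 / ILDR^0.75."""
    return 0.2 + 17.0 / ildr ** 0.75


if __name__ == "__main__":
    for u_opt in (5.6, 7.0, 8.0, 9.5, 11.0):
        ildr = ildr_for_normalized_mep(u_opt)
        exact = exact_ratio_at_mep(u_opt)
        fit = empirical_ratio_at_mep(ildr)
        print(f"u_opt = {u_opt:4.1f}  ILDR = {ildr:9.0f}  exact = {exact:.3f}  "
              f"fit = {fit:.3f}  deviation = {abs(fit - exact) / exact:.1%}")
```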

Fig. 4.23 Leakage-dynamic energy ratio at the MEP vs. ILDR, obtained through numerical minimization of (4.13a)

Compared to the overall energy budget, E lkg at the MEP needs to be 40–50% for very fast/active designs (\( ILDR\le 150 \)), around 15–30% for more typical designs (\( ILDR>150 \) but still in sub-threshold). Previous work on joint supply/threshold voltage and sizing optimization showed that energy optimality is achieved when E lkg is about one third of the overall energy (Markovic et al. 2004; Nose et al. 2000; Patil). Accordingly, these results hold in sub-threshold region only for relatively slow and inactive designs, from Fig. 4.23.

As discussed in Sect. 4.3.4, very fast and active designs have low V DD,opt , which often falls below V min . In these cases, minimum-energy and reliable operation is achieved at \( {V}_{DD}={V}_{min}>{V}_{DD, opt} \). If the extra performance compared to the MEP is not utilized since the design is essentially energy constrained, operation at \( {V}_{min}>{V}_{DD, opt} \) leads to an increase in E dyn by a factor \( {\left({V}_{min}/{V}_{DD, opt}\right)}^2 \) from (4.9), compared to the MEP. At the same time, a smaller increase in E lkg by a factor \( {V}_{min}/{V}_{DD, opt} \) is observed (since the same clock cycle is assumed in (4.10), and the DIBL effect is neglected). In other words, designs with \( {V}_{min}>{V}_{DD, opt} \) typically have lower E lkg /E dyn , compared to operation at the MEP in Fig. 4.23.
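Combining the two scaling factors above, the leakage-to-dynamic ratio at V min can be related to the one at the MEP as follows (same assumptions as in the text, i.e., unchanged clock cycle and negligible DIBL):

$$ {\left.\frac{E_{lkg}}{E_{dyn}}\right|}_{V_{min}}\approx {\left.\frac{E_{lkg}}{E_{dyn}}\right|}_{MEP}\cdot \frac{V_{min}/{V}_{DD, opt}}{{\left({V}_{min}/{V}_{DD, opt}\right)}^2}={\left.\frac{E_{lkg}}{E_{dyn}}\right|}_{MEP}\cdot \frac{V_{DD, opt}}{V_{min}} $$

As a purely illustrative example with assumed numbers (not taken from a specific design), \( {V}_{DD, opt}=0.30\ V \) and \( {V}_{min}=0.36\ V \) give a 1.44X increase in E dyn , a 1.2X increase in E lkg , and a leakage-to-dynamic ratio reduced to about 83% of its value at the MEP.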

For larger ILDR, the MEP moves to the near-threshold region and the energy-optimum ratio E lkg /E dyn at the MEP increases again when increasing V DD . This is because the increase in V DD,opt at near-threshold voltages determines a much smaller reduction in E lkg compared to sub-threshold, as the gate delay decreases much more slowly than exponentially from (4.6). Analytically, this is accounted for by the increase in f ILDR (V DD ) in (4.13), which determines a proportional increase in E lkg /E dyn . Accordingly, E lkg becomes again a substantial fraction of the energy budget when the MEP is pushed to near-threshold voltages or higher (i.e., for large ILDR), as shown in Fig. 4.23. This explains why E lkg /E dyn heavily depends on V TH at near-threshold voltages, as in Fig. 4.23. For extremely slow and inactive circuits, the MEP moves to the above-threshold region, and is definitely dominated by E lkg .

From the above considerations, the energy E min at the MEP monotonically increases when increasing ILDR. At sub-threshold voltages, this is due to the increase in the energy-optimal voltage V DD,opt in (4.17), which is certainly more rapid than the reduction in \( \left(1+{\left.{E}_{lkg}/{E}_{dyn}\right|}_{MEP}\right) \) in (4.19). Since both dependencies were found to be unaffected by the threshold voltage in sub-threshold, E min for MEP in sub-threshold is independent of V TH as well. At near-threshold voltages, E min keeps increasing since V DD,opt in (4.17) continues to increase with the same trend as sub-threshold (see Fig. 4.22a), and \( \left(1+{\left.{E}_{lkg}/{E}_{dyn}\right|}_{MEP}\right) \) increases as well. For above-threshold MEP, V DD,opt in (4.17) saturates to an almost constant value, and \( \left(1+{\left.{E}_{lkg}/{E}_{dyn}\right|}_{MEP}\right) \) keeps increasing.

In other words, the energy at the MEP monotonically degrades as ILDR increases, i.e., for leaky or scarcely active designs. This is essentially due to the increase in V DD,opt (i.e., dynamic energy at the MEP), and the increase in the leakage-dynamic energy ratio at voltages above V TH . More quantitatively, Fig. 4.24 shows that E min increases in an approximately linear fashion when increasing ILDR. In particular, in the sub-threshold region the ratio E min /α SW C TOT is well approximated by \( 3.5\cdot {10}^{-4}\cdot ILDR \) regardless of the threshold voltage, hence

Fig. 4.24 Energy at MEP E min normalized to α SW C TOT vs. ILDR for various threshold voltages

$$ {E}_{min}\approx 3.5\cdot {10}^{-4}\cdot {\alpha}_{SW}\cdot {C}_{TOT}\cdot ILDR $$
(4.20)

which is within 20% of the exact E min for ILDR up to a few tens of thousands. For less typical designs with larger ILDR, the trend becomes slightly steeper by a factor ranging from 2.5X to 4X compared to (4.20), for a threshold voltage in the 350–650 mV range. In other words, when the MEP is at near-threshold voltages, E min actually increases when V TH increases, although moderately.

Summarizing these conclusions and those in Sect. 4.3.4, the MEP voltage is unaffected by the (single) threshold voltage when operating at sub- and near-threshold voltages. On the other hand, the energy portion associated with leakage drastically increases when operating at near-threshold voltages, compared to sub-threshold ones. At above-threshold voltages, the MEP voltage becomes a function of V TH , and energy is dominated by leakage. Regardless of the voltage range in which the MEP lies, the minimum achievable energy monotonically and proportionally increases with ILDR.

4.3.6 Sensitivity of Nearly-Minimum Energy to V DD Inaccuracies

From a design standpoint, it is necessary to predict the required accuracy for V DD to achieve nearly-minimum energy per operation, which in turn constrains the design of the voltage regulation circuitry and the power management sub-system. As can be seen from Fig. 4.20a–d, the energy-voltage curve is steeper at the left of the MEP, due to the exponential increase in the leakage energy at low voltages. In other words, an uncertainty \( \pm \Delta {V}_{DD} \) in the supply voltage around the MEP degrades the energy more substantially when it pushes V DD below V DD,opt rather than above (even more so, if performance is considered). For the same reason, the energy degradation at the left of the MEP compared to the right becomes more evident for larger ΔV DD .

In nearly-minimum energy designs, the maximum tolerable percentage energy degradation (% energy degradation) compared to the MEP due to the uncertainty in V DD needs to be translated into a specification of the maximum tolerable uncertainty ΔV DD . In the sub-threshold region, the resulting tolerable uncertainty ΔV DD is independent of V TH , since the energy is independent of V TH as well (see Sect. 4.3.4). Since \( f\left({V}_{DD}\right)\approx 1 \), Eq. (4.13) can be expressed in a technology-independent manner (Footnote 10) by defining the normalized voltage \( {V}_{DD, norm}={V}_{DD}\left(1+{\alpha}_{X_{off}}\right)/ n\frac{kT}{q} \). The maximum deviation ΔV DD,norm in V DD,norm from the value that minimizes the energy can be easily found numerically. The numerical solution ΔV DD,norm turns out to be largely independent of ILDR, and is hence only a function of the % energy degradation. ΔV DD,norm is well approximated (within 10%) by \( 0.62+0.034\cdot \left(\% energy\ degradation\right) \), as shown in Fig. 4.25. Accordingly, the maximum tolerable voltage deviation that meets a targeted percentage energy degradation in the sub-threshold region is

Fig. 4.25 Maximum supply voltage deviation from MEP that maintains the energy degradation within the target (on x-axis) in sub-threshold region (model in (4.21a), (4.21b))

$$ \Delta {V}_{DD, subthreshold}\approx \frac{n\frac{kT}{q}}{\left(1+{\alpha}_{X_{off}}\right)}\cdot g\left(\% energy\ degradation\right) $$
(4.21a)
$$ \mathrm{g}(x)=0.62+0.034\cdot x $$
(4.21b)

Interestingly, from (4.21a), (4.21b), ΔV DD in sub-threshold does not depend on the position of the MEP, and it only depends on technology through subthreshold slope and DIBL coefficient, and on the targeted maximum energy degradation. As an example, Fig. 4.25 plots the maximum tolerable ΔV DD versus the energy degradation in 28 nm, and shows that V DD needs to be set with a precision of about a thermal voltage (25–35 mV) to keep the energy degradation compared to the MEP modest (5–10%). Larger V DD uncertainty (e.g., 1.5–2X the thermal voltage) leads instead to an unacceptably large energy degradation, and should hence be avoided in practical cases.
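The model in (4.21a), (4.21b) and the numerical procedure behind it can be sketched in Python as follows. The sketch assumes the sub-threshold simplification \( f\left({V}_{DD}\right)\approx 1 \), i.e., an energy proportional to \( {u}^2\left(1+ ILDR\cdot {e}^{-u}\right) \) in the normalized voltage u, and evaluates only the deviation below the MEP, which is the more restrictive side. The values n = 1.5 and kT/q = 25.9 mV are assumptions, and all function names are hypothetical.

```python
import math


def g(pct_energy_degradation):
    """Linear fit (4.21b) for the normalized voltage deviation."""
    return 0.62 + 0.034 * pct_energy_degradation


def delta_vdd_subthreshold(pct_energy_degradation, n=1.5, v_t=0.0259,
                           alpha_xoff=0.098):
    """Tolerable V_DD deviation in volts from (4.21a); n and v_t are assumed."""
    return n * v_t / (1.0 + alpha_xoff) * g(pct_energy_degradation)


def normalized_energy(u, ildr):
    """Energy of (4.13) with f = 1, normalized to constant factors."""
    return u * u * (1.0 + ildr * math.exp(-u))


def numerical_left_deviation(u_opt, pct_energy_degradation):
    """Normalized deviation below the MEP that raises the energy by the target."""
    ildr = 2.0 * math.exp(u_opt) / (u_opt - 2.0)  # ILDR whose MEP is at u_opt
    target = normalized_energy(u_opt, ildr) * (1.0 + pct_energy_degradation / 100.0)
    lo, hi = 3.0, u_opt  # energy decreases monotonically from lo to hi
    if normalized_energy(lo, ildr) <= target:
        raise ValueError("target not reached within the search interval")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if normalized_energy(mid, ildr) > target:
            lo = mid
        else:
            hi = mid
    return u_opt - 0.5 * (lo + hi)


if __name__ == "__main__":
    for pct in (5, 10):
        print(f"{pct:2d}% target: model dV_DD ~ {delta_vdd_subthreshold(pct) * 1e3:.1f} mV")
    for u_opt in (5.6, 8.0, 10.68):
        print(f"u_opt = {u_opt:5.2f}: numerical deviation = "
              f"{numerical_left_deviation(u_opt, 10):.2f}, fit = {g(10):.2f}")
```

With the assumed n and kT/q, the model returns deviations of roughly one thermal voltage for targets of 5–10%, in line with the range quoted above.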

When the MEP moves to the near-threshold region, a larger voltage deviation can be tolerated for a targeted maximum energy degradation compared to the MEP. This is because FO4 and hence E lkg (see Eq. (4.10)) become less sensitive to V DD compared to sub-threshold, as shown in Fig. 4.26. As expected, the tolerable voltage deviation at near-threshold depends on V TH0, as opposed to sub-threshold. This is because the energy in (4.13) depends on V TH0 at near threshold (see Sect. 4.3.4), since V TH0 defines the voltage range (and hence ILDR) in which transistors enter this region. From Fig. 4.26, 4 to 6 thermal voltages can be tolerated with minimal energy penalty at near-threshold voltages.

Fig. 4.26 Maximum supply voltage deviation from MEP vs. ILDR for different V TH0 in 28 nm (target energy degradation w.r.t. MEP = 10%)

For larger MEP voltages in the above-threshold region, an even larger voltage deviation around the MEP is tolerable for a given allowed energy degradation. This is because, from (4.6), the sensitivity of FO4 to V DD is lowest in this region. As a consequence of the saturation of the MEP voltage discussed in Sect. 4.3.4, the maximum voltage deviation saturates as well at above-threshold voltages, as shown in Fig. 4.26. As expected from (4.16), larger V TH0 pushes the saturation to higher voltages (and hence ILDR). Analytically, the maximum tolerable voltage deviation around the MEP is evaluated by equating (4.13) to the energy at the MEP (i.e., at the voltage in (4.16)) increased by a factor \( \left(1+\% energy\ degradation/100\right) \), and solving for V DD . By approximating \( \ln \left({e}^{\frac{V_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}}{n\cdot \left(kT/q\right)}}+1\right)\approx \frac{V_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}}{n\cdot \left(kT/q\right)} \) and \( {\alpha}_{X_{off}}\approx {\lambda}_{DIBL} \) in (4.13), the maximum voltage deviation at above-threshold voltages is

$$ \Delta {V}_{DD, above- threshold}={\left.{V}_{DD, opt}\right|}_{ILDR\to \infty}\cdot h\left(\frac{\% energy\ degradation}{100}\right) $$
(4.22a)
$$ h(x)=- x+\sqrt{x\cdot \left(1+ x\right)} $$
(4.22b)

the first of which is plotted in Fig. 4.27 for a 28-nm technology, along with the technology-independent curve in (4.22b). Equations (4.22a) and (4.22b) were found to be within 10–20% of the exact solution for typical threshold voltages. For large V TH0 (e.g., 0.65 V), the error increases to 30–40% since ΔV DD in (4.22a) becomes so large that it intrudes into the near-threshold region, and the above calculations hence become inaccurate. From (4.22a), (4.22b) the maximum voltage deviation around the MEP above threshold depends on the technology through a proportional dependence on \( {V}_{TH0}/\left(1+{\lambda}_{DIBL}\right) \). In other words, the maximum voltage deviation around the MEP above threshold is a fixed and technology-independent fraction of the MEP voltage, as set by h(% energy degradation/100). As opposed to the sub-threshold region, the maximum deviation around an MEP lying above threshold depends on V TH0, and larger thresholds further relax the precision requirement on the voltage optimization and delivery. This is because a larger V TH0 enlarges the voltage range in which the MEP effectively increases for larger ILDR (i.e., from about V TH0 up to (4.16)).
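The function h(x) in (4.22b) can be cross-checked with a short Python sketch. It assumes the leakage-dominated limit in which, under the approximations stated before (4.22a), the energy is proportional to \( {V}_{DD}^2/\left[{V}_{DD}\left(1+{\lambda}_{DIBL}\right)-{V}_{TH0}\right] \). Writing \( {V}_{DD}={\left.{V}_{DD, opt}\right|}_{ILDR\to \infty}\cdot \left(1+ u\right) \), the energy normalized to its minimum becomes \( {\left(1+ u\right)}^2/\left(1+2 u\right) \), and the deviation below the MEP that yields a relative degradation x is \( u=- h(x) \). The default V TH0 and λ DIBL are the 28-nm values used in this section; the function names are hypothetical.

```python
import math


def h(x):
    """Fractional deviation (4.22b) for a relative energy degradation x."""
    return -x + math.sqrt(x * (1.0 + x))


def delta_vdd_above_threshold(pct_energy_degradation, v_th0=0.35, lambda_dibl=0.1):
    """Tolerable V_DD deviation in volts from (4.22a)."""
    v_sat = 2.0 * v_th0 / (1.0 + lambda_dibl)  # saturated MEP voltage (4.16)
    return v_sat * h(pct_energy_degradation / 100.0)


def normalized_energy(u):
    """Energy at V_DD = V_sat * (1 + u) relative to the minimum (assumed model)."""
    return (1.0 + u) ** 2 / (1.0 + 2.0 * u)


if __name__ == "__main__":
    for pct in (5, 10, 20):
        x = pct / 100.0
        print(f"{pct:2d}% target: h = {h(x):.3f}, "
              f"dV_DD ~ {delta_vdd_above_threshold(pct) * 1e3:.0f} mV, "
              f"energy at V_sat*(1 - h) = {normalized_energy(-h(x)):.3f} x minimum")
```

The last column confirms that moving below the saturated MEP by the fraction h(x) raises the energy of this simplified model by exactly the targeted amount.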

Fig. 4.27 Maximum supply voltage deviation from MEP that maintains the energy degradation within the target (on x-axis) in above-threshold region (model in (4.22a), (4.22b))

In summary, nearly-minimum energy operation requires the voltage to be controlled within approximately one thermal voltage when the MEP is in the sub-threshold region, independently of the position of the MEP. This requirement is substantially relaxed at near threshold, and the tolerable deviation keeps increasing above threshold until it saturates to a value that is proportional to the threshold voltage and sub-linearly related to the tolerable energy degradation. In more detail, the tolerable deviation increases up to 4 thermal voltages for relatively low V TH0, whereas it increases to more than 6 thermal voltages under large V TH0.

Overall, this suggests that the performance can be increased with modest energy penalty by raising the voltage compared to the MEP, when the latter is at near- or above-threshold voltages. These considerations are summarized in the example in 28 nm in Fig. 4.28, which plots ΔV DD versus ILDR for various energy degradation targets and the related analytical models.

Fig. 4.28 Summary of maximum V DD deviation from MEP that maintains the energy degradation within the target vs. ILDR and related models

4.3.7 Example: ARM Core Operating at Minimum Energy

As a further numerical example, let us apply the above energy and MEP models to the design of the ARM Cortex M0 core in Myers et al. (2015). Table 4.5 reports all technology-, design- and workload-dependent parameters for this design, as obtained from the process design kit and information provided in Myers et al. (2015). The two programs “checksum” and “AES” are used to cover a wide range of activities, from low (checksum) to high (AES), and hence to observe the MEP shift due to different activity factors (the latter has 60% higher activity than the former (Myers et al. 2015)).

Table 4.5 Technology-, design- and workload-dependent parameters for ARM Cortex M0 Core in Myers et al. (2015)

The resulting energy curve versus V DD from experimental results in Myers et al. (2015) and the model in (4.13) is plotted in Fig. 4.29. Very good agreement can be observed across the wide voltage range from 0.25 to 1.2 V, with an average error of 1.7%, and an error well below 10% down to 0.3 V. As expected from Sect. 4.3.3 and Fig. 4.21, the MEP for the AES program is pushed to the left of the MEP for the checksum program, due to the higher activity of the former. More quantitatively, the ILDR factor in (4.13) and (4.14a), (4.14b) computed from the parameters in Table 4.5 is 18,000 for the checksum program, and 11,200 for the AES program. The resulting voltage and energy at the MEP are summarized in Table 4.6, for both the experimental results in Myers et al. (2015) and the above models.

Fig. 4.29

Experimental energy curve vs. V DD in ARM Cortex M0 core (Myers et al. 2015) and energy predicted by the model in (4.13)

Table 4.6 Minimum Energy Point, Leakage/Dynamic Energy Ratio and Maximum Tolerable VDD Uncertainty from (Myers et al. 2015) and Above Models

From Table 4.6, the estimated MEP voltage of the core in Myers et al. (2015) from Eq. (4.17) is 378 mV for the checksum program, which is close to the measured MEP voltage of 390 mV. The resulting minimum energy estimate of 11.6 pJ from (4.18) is also close to the measured energy of 11.7 pJ (Myers et al. 2015). At the MEP, the leakage-to-dynamic energy ratio is estimated to be 0.21 from Eq. (4.19), which is close to the value of 0.22 in Myers et al. (2015). Good agreement of the models is also confirmed for the AES program, from the same table. Finally, the maximum tolerable V DD uncertainty for a 10% energy increase compared to the MEP is estimated to be 45 mV from (4.21a), (4.21b), which agrees well with the value of approximately 48 mV in Myers et al. (2015).

4.4 Exploration of MEP Dependence on Logic Depth, VTH, Activity and Ineffectiveness of Leakage Reduction Techniques

In this section, the impact of logic depth, threshold voltage and activity is explored quantitatively and extensively by considering the reference circuit in Fig. 4.30, applying the insights gained in Sect. 4.3. The simplicity and regularity of the circuit in Fig. 4.30 allow an intuitive understanding of the underlying tradeoffs. This circuit contains 32 slices of inverter gates, each with a fan-out of 4, and with a total logic depth LD TOT (and hence delay by construction) of 200FO4, as representative of a relatively complex microprocessor. The slices are interrupted through the insertion of registers, whose number is adjusted to achieve a targeted logic depth LD eff . Registers are made up of transmission-gate flip-flops, which are customarily encountered in standard cell libraries (Alioto et al. 2015).

Fig. 4.30

Reference circuit for evaluation of the impact of logic depth, threshold voltage and activity

4.4.1 Impact of Logic Depth

The heavy impact of E lkg at near- and sub-threshold voltages can be mitigated by adopting microarchitectures with lower logic depth (i.e., deeper pipelining), from (4.10). However, deeper pipelining should be applied judiciously to avoid a significant increase in the clocking overhead, which might offset some of the benefit brought by reduction in E lkg . In the following, we will assume that the additional clocking cost of meeting the timing constraints with lower logic depth is modest (which is typically true in microarchitectures with \( L{D}_{eff}\ge 25 F O4/ cycle \)).

The reference circuit in Fig. 4.30 has an overall energy per cycle equal to

$$ {E}_{cycle}={\alpha}_{SW}\cdot {C}_{TOT}\cdot {V}_{D D}^2+\frac{L{D}_{TOT}}{L{D}_{eff}}\cdot {E}_{REG}+{V}_{D D}\cdot {I}_{off}\cdot F O4\cdot L{D}_{eff} $$
(4.23)

where it was assumed that pipestages are perfectly balanced (i.e., the number of pipestages is \( \frac{L{ D}_{TOT}}{L{ D}_{eff}} \)). In (4.23), E REG is the energy consumed by a single register, and \( \frac{L{ D}_{TOT}}{L{ D}_{eff}} \) represents the number of registers in the above circuit. From (4.23), an energy-optimal logic depth exists at a given V DD , and its expression is readily found to be

$$ L{D}_{eff}=L{D}_{TOT}\sqrt{\frac{E_{REG}}{V_{DD}\cdot {I}_{off}\cdot L{D}_{TOT}\cdot FO4}}=L{D}_{TOT}\sqrt{\frac{E_{REG}}{E_{lkg}}} $$
(4.24)

where (4.10) was used to express E lkg . From (4.24), the optimal logic depth is determined by the balance between the clocking and the leakage energy, as a larger number of registers leads to an increase in the former and a decrease in the latter. Such tradeoff is not really observed in traditional low-power (above-threshold) designs, as the leakage energy is usually kept a small fraction of the overall budget through several techniques (Narendra and Chandrakasan 2006), and the amount of pipelining is mainly defined by the performance target, or the dynamic energy-performance tradeoff under dynamic voltage scaling. Instead, nearly-minimum energy designs require a careful management of the clocking-leakage energy tradeoff, due to their strong inter-dependence (Alioto 2012). In addition, this means that energy-centric (micro)architectures need to be tailored around the targeted operation voltage, and traditional architectures conceived for nominal voltage operation tend to be energy inefficient at low voltage. In other words, ultra-low power architectures need to be deeply rethought to truly enable nearly-minimum energy operation, as discussed in Chap. 3.
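As a rough numerical illustration of (4.23) and (4.24), the following Python sketch evaluates the energy per cycle of the reference circuit and its energy-optimal logic depth. The function names and all numerical values (energies in arbitrary normalized units) are hypothetical and chosen only for illustration; they are not taken from the process design kit or from the figures in this chapter.

```python
import math

def energy_per_cycle(ld_eff, e_dyn, e_reg, e_lkg_per_fo4, ld_tot):
    """Energy per cycle from (4.23).

    e_dyn         : alpha_SW * C_TOT * VDD^2 (dynamic energy of the logic)
    e_reg         : E_REG, energy of a single register
    e_lkg_per_fo4 : VDD * I_off * FO4 (leakage energy per FO4 of cycle time)
    ld_tot        : total logic depth LD_TOT, in FO4
    ld_eff        : logic depth per pipestage LD_eff, in FO4
    """
    return e_dyn + (ld_tot / ld_eff) * e_reg + e_lkg_per_fo4 * ld_eff

def optimal_logic_depth(e_reg, e_lkg_per_fo4, ld_tot):
    """Energy-optimal logic depth from (4.24)."""
    e_lkg = e_lkg_per_fo4 * ld_tot  # leakage energy at logic depth LD_TOT
    return ld_tot * math.sqrt(e_reg / e_lkg)

# Hypothetical normalized values (assumptions, not data from the chapter).
LD_TOT = 200.0        # FO4, as in the reference circuit of Fig. 4.30
E_DYN = 10.0          # arbitrary units
E_REG = 1.0           # arbitrary units
E_LKG_PER_FO4 = 0.32  # arbitrary units

ld_opt = optimal_logic_depth(E_REG, E_LKG_PER_FO4, LD_TOT)
e_opt = energy_per_cycle(ld_opt, E_DYN, E_REG, E_LKG_PER_FO4, LD_TOT)
print(f"optimal logic depth: {ld_opt:.1f} FO4, energy per cycle: {e_opt:.2f}")
```

With these assumed values, the clocking term and the leakage term are equal at the optimum, which is the balance between clocking and leakage energy that (4.24) expresses.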

Quantitatively, Eq. (4.24) suggests that the energy-optimal pipeline depth LD TOT /LD eff is given by the square root of the leakage-clocking energy ratio. Considering the large contribution of E lkg at near-threshold voltages, the theoretical energy-optimal pipedepth tends to be quite small. In (4.24), the energy cost of all registers E REG is assumed to be the same, since it refers to the simple reference circuit in Fig. 4.30. In more general architectures, the number of flip-flops per register, and hence the energy cost of a register, increases super-linearly under higher pipedepths (Chinnery and Keutzer 2007). Indeed, the overall number of flip-flops in a digital module increases according to a power law \( {\left( L{D}_{TOT}/ L{D}_{eff}\right)}^{LGF} \), where \( L G F>1 \) is the Latch Growth Factor, which is mainly defined by the specific function implemented (Srinivasan et al. 2002). Hence, the energy-optimal logic depth in general architectures tends to be moderately larger than predicted by (4.24).

On the low side, the energy-optimal logic depth is practically limited by the rapidly increasing clocking energy cost at small logic depths (i.e., deep pipelines). Typically, low logic depths in the order of 20–25 FO4 or smaller have a disproportionately large energy cost in the clock network at ultra-low voltages, and require non-straightforward clock network design approaches. The necessity of “fast” circuit and architectural designs (Footnote 11) with deep pipelining at ultra-low voltages was first shown in Jeon et al. (2013), where an aggressive 17FO4 logic depth was adopted in a 1024-point complex FFT processor. The adoption of such a deep pipeline led to 17.7 nJ/transform at \( {V}_{DD, opt}=270\ mV \) in 65-nm CMOS, which was a 3.6X lower energy than the previous state of the art. However, this required some non-trivial clocking techniques to avoid timing violations under the unavoidably large variations (see Sect. 4.7), such as 2-phase latch clocking, custom latches with embedded logic, aggressive hold fix buffer insertion, and a shallow clock network (3 levels, for the reasons clarified in Sect. 4.10).

The resulting energy curve versus V DD for different logic depths in the reference circuit in Fig. 4.30 is reported in Fig. 4.31 in 28-nm CMOS. As expected, increasing the logic depth from the practical lower bound of 25FO4 to larger logic depths of 50 and 100FO4 leads to a significant 20% and a considerable 60% energy increase at the MEP, respectively. This is due to a 2X and 4X leakage energy increase, respectively, caused by the larger logic depth from (4.10). From (4.14a), (4.14b), such increase in E lkg leads to a 2X (4X) increase in ILDR, which from (4.15) translates into an increase in the MEP voltage of approximately 35 mV (65 mV). Since Fig. 4.31 discretizes voltages in 50-mV steps, the shifts appear there as 50 and 100 mV.

Fig. 4.31

Energy normalized to value at nominal voltage vs. V DD for logic depth of 25FO4, 50FO4 and 100FO4

Finally, it should be observed that true minimum-energy operation actually requires a complex optimization that involves logic depth, voltage and transistor sizing. Unfortunately, no thorough methodology and no CAD support is currently available for this purpose, hence such joint optimization is still an open research question. A qualitative treatment of this problem will be presented in Sect. 4.11, to gain an insight into this fundamental design problem.

4.4.2 Impact of Threshold Voltage and Activity

The impact of the threshold voltage on the reference circuit in Fig. 4.30 is shown in Fig. 4.32. As expected from Sect. 4.3.4, the MEP voltage and energy are the same for different threshold voltages under a single-V TH design, being in the sub-threshold region. On the other hand, mixing the two threshold voltages leads to a substantially larger MEP energy (by 1.7X in this specific case). This suggests that multi-V TH design is not really advantageous at near- and sub-threshold voltages, and it should hence be avoided. Thorough analysis and justification of this observation will be provided in the next section.

Fig. 4.32

Energy normalized to value at nominal voltage vs. V DD for single- and multi-VTH design (50% HVT, 50% LVT cells, logic depth of 25FO4, 10% activity)

The effect of activity is depicted in Fig. 4.33, which once again confirms that the MEP moves to the left when the dynamic energy is increased, as was observed in Fig. 4.21. More quantitatively, the increase in the activity factor from 3% to 10% (20%) leads to a 3.3X (6.6X) decrease in ILDR, which from (4.15) translates into a decrease in the MEP voltage of 51 mV (81 mV). The latter is not precisely visualized in Fig. 4.33, as voltages are discretized in 50-mV steps.

Fig. 4.33

Energy normalized to value at nominal voltage vs. V DD for different activity values (logic depth of 25FO4)

To provide a broader view on the impact of the above parameters on the MEP position, Fig. 4.34a–c plot the statistical distribution of the MEP voltage for several different activities and logic depths, respectively for a very low, relatively low and relatively high V TH . From this figure, the MEP lies in the sub-threshold region for most of the designs, and it is pushed into the near-threshold region only for very low threshold voltages (see Fig. 4.34a). Figure 4.34d–f show the contribution of the leakage energy as a fraction of the overall energy for the same threshold voltages. From this figure, E lkg accounts for 40% of the total energy or more in most of the designs, and tends to be larger under lower threshold voltages. According to Fig. 4.23, this is because the MEP is pushed to near-threshold voltages at low V TH , and the resulting E lkg can be as high as 70% of the total energy in some designs (see Fig. 4.34d).

Fig. 4.34

Statistics on the MEP voltage V DD,opt for the reference circuit in Fig. 4.30 across different values of activity and logic depth for three threshold voltages (a)–(c). Ratio between leakage energy and total energy for same threshold voltages (d)–(f)

4.5 Ineffectiveness of Traditional Leakage Reduction Techniques

This section shows that traditional leakage reduction techniques (e.g., stacking, multi-V TH ) are far less effective at near-threshold voltages, thus posing a challenge on leakage management at such voltages.

Transistor stacking has been extensively exploited to reduce leakage in above-threshold circuits (Narendra and Chandrakasan 2006), as the off-stacking factor (Footnote 12) is typically much higher than the on-stacking factor. In other words, the series connection of multiple transistors reduces the leakage current much more heavily than the on-current (i.e., performance). As shown in Fig. 4.35, this is true in the above-threshold region, where the off-stacking factor for 2 to 4 transistors is larger than the on-stacking factor by an order of magnitude. At lower voltages, the on-stacking factor tends to moderately increase, for the reasons discussed in Sect. 4.2.3. At the same time, the off-stacking factor decreases exponentially when reducing V DD (Narendra and Chandrakasan 2006). Indeed, the off-stacking factor for two stacked transistors can be expressed as \( {e}^{\alpha_{X_{off}} \cdot {V}_{DD}/\left( nkT/ q\right)} \) (Narendra and Chandrakasan 2006), where

Fig. 4.35

On- and off-stacking factor X on and X off vs. V DD for 2, 3 and 4 stacked transistors in 28 nm

$$ {\alpha}_{X_{off}} = {\lambda}_{DIBL}\cdot \frac{1+{\lambda}_{DIBL}}{1+2{\lambda}_{DIBL}}\ll 1 $$
(4.25)

and approximately the same dependence is observed for a larger number of stacked transistors, as shown in Fig. 4.35. In other words, the off-stacking factor X stack,off is proportional to \( {e}^{\alpha_{X_{off}} \cdot \frac{V_{DD}}{n\cdot kT/ q\ }} \) (Narendra and Chandrakasan 2006), hence it can be expressed as

$$ {X}_{stack, off}={\left.{X}_{stack, off}\right|}_{V_{DD, nom}}\cdot {e}^{\alpha_{X_{off}} \cdot \frac{V_{DD}-{V}_{DD, nom}}{n\cdot kT/ q}} $$
(4.26)

which tends to be very accurate across all voltages (within 2% in 28 nm, according to Fig. 4.35).
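To make the exponential voltage dependence in (4.25) and (4.26) concrete, the following Python sketch scales an off-stacking factor known at nominal voltage down to a lower V DD. The DIBL coefficient, the subthreshold factor n, the thermal voltage and the nominal off-stacking factor used below are hypothetical placeholders, not values extracted from Fig. 4.35.

```python
import math

def alpha_x_off(lambda_dibl):
    """Exponent coefficient from (4.25)."""
    return lambda_dibl * (1.0 + lambda_dibl) / (1.0 + 2.0 * lambda_dibl)

def x_stack_off(vdd, x_nom, vdd_nom, lambda_dibl, n=1.4, v_thermal=0.02585):
    """Off-stacking factor at vdd from (4.26), given its value x_nom at vdd_nom.

    n and v_thermal (kT/q at about 300 K, in volts) are assumed values.
    """
    exponent = alpha_x_off(lambda_dibl) * (vdd - vdd_nom) / (n * v_thermal)
    return x_nom * math.exp(exponent)

# Hypothetical example: lambda_DIBL = 0.1, X_stack,off = 10 at a 1.0 V nominal supply.
for vdd in (1.0, 0.6, 0.4):
    x = x_stack_off(vdd, x_nom=10.0, vdd_nom=1.0, lambda_dibl=0.1)
    print(f"VDD = {vdd:.1f} V -> off-stacking factor = {x:.2f}")
```

Under these assumptions the off-stacking factor shrinks by several times between nominal and near-threshold voltages, which is the trend that makes stacking less attractive at low V DD.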

From Fig. 4.35, the off-stacking factor at near-threshold voltages is no longer much larger than the on-stacking factor, hence no significant leakage reduction is actually allowed for a given performance penalty. In other words, transistor stacking is rather ineffective in counteracting leakage at near-threshold voltages, as opposed to common low-power wisdom (i.e., above threshold).

As another traditional leakage reduction technique, let us consider the adoption of multiple threshold voltages, as depicted in Fig. 4.36 for the simple case of a design with two thresholds (i.e., low and high V TH ). In multi-V TH designs, cells in critical paths are LVT to meet the performance requirement, whereas cells in non-critical paths are replaced by the HVT counterparts. At above-threshold voltages, such replacement does not really degrade performance thanks to its weak dependence on V TH , while it certainly reduces the leakage current thanks to its strong dependence on V TH from (4.8). In other words, the multi-V TH approach offers a favorable tradeoff between performance and leakage in traditional low-power designs operating above threshold. On the contrary, performance becomes very sensitive to V TH at near-threshold voltages as discussed in Sect. 4.2.4, and the HVT cells are slowed down much more substantially than LVT when V DD is dynamically down-scaled (see Figs. 4.5 and 4.6). As a consequence, non-critical HVT paths at a given voltage (e.g., 0.6 V in Fig. 4.36) actually become critical (Footnote 13) when down-scaling V DD (e.g., 0.4 V in Fig. 4.36). In other words, the clock cycle of a multi-V TH design at lower voltages is significantly larger than that of a single-LVT design, thus leading to a larger leakage energy than the latter, from (4.10). At the same time, the leakage current of a multi-V TH design is significantly larger than that of a single-HVT design, as the LVT cells in the design have a considerably larger leakage (typically more than an order of magnitude increase when moving from a threshold value to the immediately lower one (International Technology Roadmap for Semiconductors 2013)).

Fig. 4.36

Multi-VTH approach and critical path shift from LVT to HVT paths when scaling down V DD

From the above considerations, multi-V TH designs suffer from substantially larger leakage energy compared to single-V TH designs for two concurrent reasons, under dynamic voltage scaling. Hence, multi-V TH approaches actually deteriorate the energy efficiency of VLSI circuits, and should always be avoided in favor of single-V TH designs. The choice of the single V TH has been discussed in Sect. 4.2.4. As an example, this is shown in Fig. 4.32, where the multi-V TH design of the reference circuit in Fig. 4.30 is found to be 1.7X less energy efficient than the single-V TH designs. In terms of energy-performance tradeoff at the MEP, Fig. 4.37 confirms that the multi-V TH design is essentially as slow as the single-HVT design, in spite of its significantly larger energy consumption.

Fig. 4.37

Energy vs. clock frequency (both normalized to value at nominal voltage) for single- and multi-VTH design, and logic depth of 25FO4

Similar considerations hold for other traditional leakage reduction techniques, such as power gating (Flynn et al. 2007). At above-threshold voltages, power gating is well known to provide substantial leakage reductions due to two different mechanisms. First, the sleep transistor size can be much lower than the overall effective transistor width of the power gated circuit, as only a fraction of the cells are active at a given time. Since the relative strength of the sleep and the power gated transistors is maintained at low voltages, this reduction mechanism is essentially maintained at near-threshold voltages. Second, the sleep transistor (see Fig. 4.38a) is able to provide its large on-current during active mode (\( \overline{sleep}=0 \)), whereas it delivers only its off-current during sleep mode (\( \overline{sleep}=1 \)). Such reduction is clearly more pronounced for a larger I on /I off ratio, which is traditionally obtained by using HVT devices for sleep transistors at above-threshold voltages. At near-threshold voltages, the transistor I on /I off ratio is severely degraded (by 1–2 orders of magnitude) as shown in Fig. 4.38b. Hence, the leakage reduction enabled by power gating at near-threshold voltages is worsened by at least one order of magnitude, compared to above threshold. Such degradation in the effectiveness of power gating at near-threshold voltages can be partially recovered by boosting the gate voltage of the sleep transistor (Myers et al. 2015). Indeed, boosting its gate voltage only during active mode significantly increases I on , while maintaining the same I off . At near-threshold voltages, the sleep transistor I on /I off ratio (and hence the effectiveness of power gating) can be further improved by using thick-oxide (i.e., I/O) NMOS transistors whose gate is powered at the large I/O voltage (e.g., 1.8 V instead of 1 V). In this case, such I on /I off improvement is achieved at the expense of a larger energy and slower transient to turn on the sleep transistor, and hence to switch from sleep to active mode.

Fig. 4.38

(a) Power gating scheme, (b) I on /I off ratio of sleep transistor in 28 nm

4.6 Challenges: Performance

As discussed in Sect. 4.2, operation at near-threshold voltages entails a ~10X penalty in terms of FO4 and hence performance, compared to the same architecture operating at nominal voltage. For sub-100 nm technologies, FO4 at near-threshold voltages is typically in the order of a few hundred picoseconds. For reasonable architectures with a logic depth of up to several tens of FO4, this translates into a cycle time in the order of nanoseconds. Hence, throughputs in the order of hundreds of MOPS (Millions of Operations per Second) are easily achievable by near-threshold microprocessor cores. Such a level of performance achievable near the threshold is actually acceptable for (or can exceed) the typical requirements of IoT systems, at least in the most frequent operation modes and in most practical applications. Higher performance might be needed occasionally in some applications, or customarily for compute-intensive ones, such as computer vision or real-time pattern recognition.
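The order-of-magnitude reasoning above can be reproduced with a few lines of Python. The FO4 delay and logic depth below are illustrative values within the ranges quoted in the text, and the assumption of one operation per cycle is ours, not a property of any specific core.

```python
def cycle_time_s(fo4_s, logic_depth_fo4):
    """Cycle time as FO4 delay times the logic depth per cycle (in FO4)."""
    return fo4_s * logic_depth_fo4

def throughput_mops(fo4_s, logic_depth_fo4, ops_per_cycle=1.0):
    """Throughput in MOPS, assuming ops_per_cycle operations every cycle."""
    return ops_per_cycle / cycle_time_s(fo4_s, logic_depth_fo4) / 1e6

# Illustrative near-threshold values (assumptions): FO4 = 200 ps, 25 FO4/cycle.
fo4 = 200e-12
depth = 25
print(f"cycle time: {cycle_time_s(fo4, depth) * 1e9:.1f} ns")
print(f"throughput: {throughput_mops(fo4, depth):.0f} MOPS")
```

For these assumed values the cycle time is a few nanoseconds and the throughput is in the hundreds of MOPS, in line with the estimate given in the text.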

Sustained throughputs that are higher than hundreds of MOPS can always be obtained through appropriate architectures at near-threshold voltages (e.g., multi-core) and specialized hardware, as discussed in Chap. 3. Occasional performance boosts can be achieved through wide dynamic voltage scaling, i.e., by raising V DD from the MEP to the nominal voltage (Chandrakasan et al. 2010). Such temporary voltage up-scaling can increase the performance by one (two) order(s) of magnitude, when the MEP is in the near-threshold (sub-threshold) region. This performance increase is achieved at the expense of an increase in the energy per operation, as summarized in Fig. 4.39 for several integrated prototypes (Abouzeid et al. 2012; Gammie et al. 2011; Hsu et al. 2012; Jain et al. 2012; Kaul et al. 2012; Myers et al. 2015; Sheikh et al. 2012; Wilson et al. 2014; Jacquet et al. 2013). From this figure, this energy increase is more pronounced when the MEP is in the sub-threshold region, due to the larger voltage difference between the MEP and the nominal voltage.

Fig. 4.39

Energy improvement at MEP vs. performance degradation at MEP, as compared to operation at nominal voltage

As another property of circuits operating near the MEP, the gate delay dominates over the wire delay, as shown in Fig. 4.40. Indeed, they might be comparable at nominal voltage in realistic VLSI architectures, due to the significant resistive, capacitive, and sometimes inductive parasitics of metal wires. However, operation at the MEP voltage determines a substantial increase in the gate delay, while keeping the wire delay constant. Hence, the wire delay is no longer a challenge in circuits with nearly-minimum energy, which certainly simplifies the design, the circuit modeling and the timing closure.

Fig. 4.40

(a) RC wire delay, (b) relative scaling of gate and wire delay vs. V DD

The reduced wire-to-gate delay ratio around the MEP also has important consequences on the choice of the architecture, and the way the latter is mapped into the physical level. Indeed, VLSI architectures for nearly-minimum energy need to be different from traditional low-power architectures (i.e., for above-threshold operation). More specifically, signals can be propagated through a wider silicon area compared to nominal voltage operation. In detail, for unrepeated wires a ~3X longer distance (Footnote 14) can be covered by the same wire at near-threshold voltages, as compared to the same circuit operating at nominal voltage, when maintaining the wire delay a fixed fraction of the clock cycle. For similar reasons, repeated wires require 3X fewer repeaters per wire unit length since the optimal distance between repeaters is proportional to the square root of FO4 (Weste and Harris 2011), thus improving the routability. At the same time, the size of each repeater at MEP needs to be increased by 3X compared to nominal voltage, as its performance-optimal size is proportional to the square root of FO4 (Weste and Harris 2011). Since the number of repeaters is reduced by the same factor by which their area is increased, the energy and area cost of intra-chip global communication in designs around the MEP remains approximately the same. In summary, VLSI architectures for nearly-minimum energy can afford more global communications and larger modules (e.g., shared caches), compared to traditional low-power architectures. Such profound difference in the communication-computation energy/performance tradeoff requires the adoption of innovative architectures, as discussed in Chap. 3.
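The ~3X figures quoted above follow from square-root scaling laws, as the short Python sketch below illustrates. It assumes that the delay of an unrepeated distributed RC wire grows quadratically with its length (a standard first-order model, stated here as our assumption) and uses the ~10X FO4 penalty at near threshold from Sect. 4.2; the function names are hypothetical.

```python
import math

def unrepeated_reach_gain(fo4_ratio):
    """Factor by which an unrepeated wire can be lengthened when the cycle
    time (in FO4) is stretched by fo4_ratio and the wire delay is kept a fixed
    fraction of the cycle. Assumes wire delay proportional to length squared."""
    return math.sqrt(fo4_ratio)

def repeater_count_reduction(fo4_ratio):
    """Factor by which repeaters per unit length decrease, given that the
    optimal repeater spacing is proportional to the square root of FO4."""
    return math.sqrt(fo4_ratio)

fo4_ratio = 10.0  # FO4 at near threshold relative to nominal voltage (Sect. 4.2)
print(f"unrepeated wire reach gain: {unrepeated_reach_gain(fo4_ratio):.2f}X")
print(f"fewer repeaters per unit length: {repeater_count_reduction(fo4_ratio):.2f}X")
```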

4.7 Challenges: Variations

In this section, the impact of variations is analyzed in the context of circuits operating around the MEP. In general, process, voltage and temperature variations as well as aging impose an additional timing margin that stretches the clock cycle as shown in Fig. 4.41. This conservative approach preserves correct functionality and performance specifications even in the worst-case die and environmental conditions.

Fig. 4.41

Nominal cycle time and additional margin accounting for variations

The above cycle time margin resulting from variations translates into an increase in the energy per operation, as faster chips are forced to operate as slowly as the worst-case die and at the same voltage (which is higher than needed). Figure 4.42 shows the voltage increase required by the circuit in Fig. 4.30 to maintain a given clock cycle under a given clock cycle margin, as well as the resulting energy increase, assuming a logic depth of 25FO4, 10% activity, and LVT transistors. From Fig. 4.42, the voltage increase imposed by variations is fairly linear with the cycle time margin, and tends to be larger at higher nominal operating voltages. This is because FO4 (i.e., the cycle time from Sect. 4.3.2) is less sensitive to voltage increases at higher operating voltages, and hence requires a larger increase to achieve a given percentage performance improvement. From Fig. 4.42, a typical 1.5X–2X cycle time increase requires a voltage increase by 100–200 mV to sustain the same performance as the nominal corner, which leads to an energy increase by a factor of 1.5X–2X. In other words, variations at near threshold entail a very large energy cost, which can negate the advantages of operating at the MEP. Accordingly, variations need to be accounted for in the first place in the design of near-threshold circuits rather than as an afterthought, as discussed in more detail in the following. Similar considerations hold for the sensitivity to soft errors, which is somewhat increased at near-threshold voltages, compared to nominal voltage. On the other hand, operation at near-threshold voltages suppresses most aging and reliability issues, such as Bias Temperature Instability, Hot Carrier Injection and Time Dependent Dielectric Breakdown. Indeed, such phenomena are all exponentially dependent on the supply voltage, and its reduction to near-threshold voltage substantially mitigates them.

Fig. 4.42

Required V DD to sustain a given performance specification vs. cycle time margin (i.e., factor by which the cycle time needs to be increased due to variations), and resulting energy increase due to variations

4.7.1 Process Variations

Random (within-die) process variations are well known to be responsible for a major fraction of the cycle time margin, having a much heavier effect than fully correlated (die-to-die) variations (Orshansky et al. 2008). Indeed, threshold voltage variations at low voltages are dominated by random dopant fluctuations (Alioto et al. 2010; Orshansky et al. 2008), and their effect requires much more sophisticated feedback schemes that are immune to transistor mismatch (e.g., with timing error detection or prediction (Bowman et al. 2009; Bowman et al. 2011; Das et al. 2009; Khayatzadeh et al. 2016; Zhang et al. 2016)), rather than corner-based adaptive voltage scaling and body biasing techniques (Gregg and Chen 2007; Meijer and Pineda de Gyvez 2012; Martin et al. 2002; Olivieri et al. 2005; Tschanz et al. 2002). Accordingly, our analysis in the following will be focused on random variations.

At low voltages, process variations determine much larger path delay variations than at above-threshold voltages, due to two phenomena:

  1. the variability of the gate delay, defined as the ratio between the standard deviation and the mean value, increases significantly

  2. the probability distribution function (PDF) of such delay is no longer Gaussian for short paths, and has a longer tail on the right side.

Regarding the first phenomenon, the variability of the critical path delay is mainly due to the intrinsically larger variability of the transistor I on current (see Eq. (4.6)). This is mostly due to the larger impact of the threshold voltage, and hence of its variations, at lower voltages (see Sect. 4.2.1). More quantitatively, the delay variability is approximately equal to the variability in I on from (4.6). If the nominal threshold voltage V TH is subject to a variation ΔV TH that is Gaussian distributed with zero mean and standard deviation \( {\sigma}_{V_{TH}} \), from (4.3) to (4.4) the variability of I on for above- and sub-threshold voltages is readily found to be

$$ {\left.\frac{\sigma_{\tau_{PD}}}{\mu_{\tau_{PD}}}\right|}_{above- threshold}\approx {\left.\frac{\sigma_{I_{on}}}{\mu_{I_{on}}}\right|}_{above- threshold}=\frac{\sigma_{V_{TH}}}{V_{DD}-{V}_{TH}} $$
(4.27a)
$$ {\left.\frac{\sigma_{\tau_{PD}}}{\mu_{\tau_{PD}}}\right|}_{sub- threshold}\approx {\left.\frac{\sigma_{I_{on}}}{\mu_{I_{on}}}\right|}_{sub- threshold}=\sqrt{e^{{\left(\frac{\sigma_{V_{TH}}}{n\cdot {v}_t}\right)}^2}-1} $$
(4.27b)

For example, for the typical values \( {\sigma}_{V_{TH}}=35\kern0.5em \mathrm{mV} \) and \( {V}_{TH}=0.4\kern0.75em \mathrm{V} \) in 28 nm CMOS, the gate delay variability turns out to be 7% at 0.9 V, and 124% in the sub-threshold region (e.g., 0.3 V). In other words, the delay variability in sub-threshold is an order of magnitude larger than above threshold. At near-threshold voltages, the variability is somewhat intermediate.
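The two figures quoted above can be checked directly from (4.27a) and (4.27b) with the Python sketch below. The subthreshold factor n = 1.4 and the thermal voltage of 25.85 mV are our assumptions, since the text does not state the values used for this example; with them, the sub-threshold result lands close to the quoted 124%.

```python
import math

def delay_variability_above(sigma_vth, vdd, vth):
    """Gate delay variability (sigma/mean) above threshold, from (4.27a)."""
    return sigma_vth / (vdd - vth)

def delay_variability_sub(sigma_vth, n=1.4, v_thermal=0.02585):
    """Gate delay variability (sigma/mean) in sub-threshold, from (4.27b).

    n and v_thermal (in volts) are assumed values, not given in the text.
    """
    return math.sqrt(math.exp((sigma_vth / (n * v_thermal)) ** 2) - 1.0)

SIGMA_VTH = 0.035  # V, from the text
VTH = 0.4          # V, from the text

above = delay_variability_above(SIGMA_VTH, vdd=0.9, vth=VTH)
sub = delay_variability_sub(SIGMA_VTH)
print(f"above threshold (0.9 V): {above:.0%}")
print(f"sub-threshold:           {sub:.0%}")
```

Note that (4.27b) does not depend on V DD, so within this model the sub-threshold variability is set only by the threshold voltage spread relative to n times the thermal voltage.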

Figure 4.43 plots the variability of the FO4 delay normalized to the value at nominal voltage in 28 nm CMOS. From this figure, the gate delay variability increases when V DD is reduced, and becomes 2X–6X larger than at nominal V DD at near threshold, and about an order of magnitude larger in the deep sub-threshold region. A typical delay variability of around 6–8% at nominal voltage in 28 nm translates into a sizable delay variability of several tens of percentage points at near threshold. To achieve a parametric yield of approximately 99%, three standard deviations are needed, hence the margin for a single gate can easily be 100%, which entails an unfeasibly large margin in Fig. 4.41 (i.e., an unacceptably high energy cost from Fig. 4.42, which easily offsets the energy benefit of operating at near-threshold voltages). When the MEP is in the sub-threshold region, such margin becomes even higher.

Fig. 4.43

Variability of FO4 normalized to value at 1.2 V vs. V DD (28 nm CMOS)

Regarding the second phenomenon that was observed above, the statistical delay distribution of a single gate is no longer Gaussian when operating at near- and sub-threshold voltages (Alioto 2012; Alioto et al. (in press); Gammie et al. 2011). In the sub-threshold region, I on and hence the gate delay are lognormally distributed due to the exponential dependence of I on on the threshold voltage in (4.4), the latter being Gaussian distributed. As shown in Fig. 4.44, the lognormal distribution has a much longer tail compared to the Gaussian distribution, at the same standard deviation. This leads to a considerable increase in the number of standard deviations needed as design margin to meet a given yield target, as shown in Table 4.7. For example, from this table the worst-case gate delay margin across 99.9% of the cases is three standard deviations for Gaussian (i.e., above threshold), and twenty standard deviations for lognormal (i.e., sub-threshold). In the near-threshold region, the distribution is somewhat intermediate between above- and sub-threshold, and hence it is neither perfectly Gaussian nor lognormal. This is shown in Fig. 4.45a–c, which show the quantile-quantile (Q–Q) plot (Walpole et al. 2006) of the statistical FO4 delay sample in 28 nm CMOS versus the theoretical quantiles of a Gaussian distribution with the same mean and standard deviation. The deviation from a straight line (i.e., perfect Gaussian behavior) of the Q–Q plot becomes noticeable at near threshold (see Fig. 4.45b), and is substantial at sub-threshold voltages (see Fig. 4.45c). Figure 4.45d confirms the FO4 lognormal distribution in sub-threshold.

Fig. 4.44

Probability density function of Gaussian and lognormal distribution at same standard deviation (\( \sigma =1 \))

Table 4.7 Number of V TH standard deviations beyond the mean needed to achieve a given yield target for Gaussian and lognormal distributions

Fig. 4.45

Q–Q plots of the FO4 delay distribution (y axis) in 28 nm CMOS at (a) 1.2 V, (b) 0.6 V and (c) 0.3 V with the normal distribution on the x axis, and (d) at 0.3 V with the lognormal distribution on the x axis (100,000 Monte Carlo runs)

The above considerations on the non-Gaussian delay distribution at low voltages hold for single logic gates, and can be extended to short paths, i.e., paths that can be problematic in terms of hold time violations rather than setup time violations. Accordingly, fixing hold violations on short paths at sub- and near-threshold voltages requires a much wider design margin than above threshold. In other words, the timing margin against hold time violations at low voltages tends to be very large compared to nominal voltage, and hence requires a much larger number of hold-fix buffers.

On the other hand, long logic paths have a Gaussian delay distribution even at sub-threshold voltages. This is because of the Central Limit theorem, which guarantees that the sum of non-Gaussian random variables rapidly tends to a Gaussian distribution as the number of variables being summed increases (Walpole et al. 2006) (e.g., the number of logic gates whose delays are added to derive the critical path delay). This is quantitatively shown in Table 4.7 for 4, 8, 16 and 32 equal cascaded gates, which are individually assumed to have a lognormal delay distribution, as relevant to the sub-threshold region. Indeed, this table shows that the margin in terms of standard deviations is essentially the same as for an ideal Gaussian distribution even for a relatively short path of 4 cascaded gates, and is even closer for a larger number of gates. This means that the clock cycle distribution is Gaussian at any voltage, and hence the margin in terms of standard deviations is the same as at nominal voltage. In other words, the timing margin against setup time violations at low voltages scales like FO4, as opposed to the margin against hold violations.
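The averaging effect of the Central Limit theorem can be explored with a small Monte Carlo experiment. The following is a minimal sketch rather than the procedure used to generate Table 4.7: the function name `margin_in_sigmas`, the 99% yield target, the lognormal spread `sigma_ln = 0.5` and the sample count are all assumptions made here for illustration. How quickly the margin approaches the Gaussian value depends on the assumed spread.

```python
import math
import random
import statistics


def margin_in_sigmas(depth, yield_target=0.99, sigma_ln=0.5, n_samples=20000, seed=0):
    """Monte Carlo estimate of (yield quantile - mean) / std for a path of `depth` gates.

    Each gate delay is drawn independently from a lognormal distribution whose
    underlying Gaussian has zero mean and standard deviation sigma_ln
    (an assumed, illustrative value).
    """
    rng = random.Random(seed)
    path_delays = sorted(
        sum(rng.lognormvariate(0.0, sigma_ln) for _ in range(depth))
        for _ in range(n_samples)
    )
    mean = statistics.fmean(path_delays)
    std = statistics.pstdev(path_delays)
    # Index of the empirical quantile that covers yield_target of the samples.
    index = min(n_samples - 1, math.ceil(yield_target * n_samples) - 1)
    return (path_delays[index] - mean) / std


# Reference: margin of an ideal Gaussian distribution at the same yield target.
gaussian_margin = statistics.NormalDist().inv_cdf(0.99)
print("ideal Gaussian:", round(gaussian_margin, 2))
for depth in (1, 4, 16):
    print(depth, "gates:", round(margin_in_sigmas(depth), 2))
```

Under these assumptions, the margin expressed in standard deviations shrinks towards the Gaussian reference as more gate delays are summed, which is the trend that the table describes.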

4.7.2 Voltage and Temperature Variations

Voltage variations have a heavy impact on the clock cycle margin, due to the strong sensitivity of I on , and hence of the gate delay, to V DD (see Sect. 4.2.1). Figure 4.46 plots the cycle time margin associated with typical 5% and 10% voltage drops for the circuit in Fig. 4.30 in 28 nm CMOS. As all gate delays scale approximately like FO4 when V DD changes, this example is representative of any logic path. From this figure, a large cycle time margin of 20–50% is imposed by supply variations, if they are not kept under strict control. As discussed in Sect. 4.10, supply variations in systems designed for nearly-minimum energy operation are dominated by fluctuations in the output voltage of the regulator providing the supply. Accordingly, supplies for minimum-energy operation need to be designed with quite stringent specifications on voltage stability across temperatures, as well as on line and load regulation.
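The trend can be reproduced qualitatively with the transregional model (4.1). The sketch below assumes that the gate delay is proportional to V DD / I on , and uses hypothetical device parameters (V TH = 0.4 V, n = 1.5). It ignores velocity saturation and other short-channel effects, so it is not expected to match the simulated values in Fig. 4.46. All function names are introduced here for illustration.

```python
import math

THERMAL_VOLTAGE = 0.0259  # kT/q at about 300 K, in volts


def inversion_coefficient(vdd, vth=0.4, n=1.5):
    """Normalized on-current IC = I_on / I_0 from the transregional model (4.1)."""
    v = (vdd - vth) / (2.0 * n * THERMAL_VOLTAGE)
    return math.log1p(math.exp(v)) ** 2


def relative_delay(vdd, vth=0.4, n=1.5):
    """Gate delay up to a constant factor, assuming delay ~ C * VDD / I_on."""
    return vdd / inversion_coefficient(vdd, vth, n)


def cycle_time_margin(vdd, droop, vth=0.4, n=1.5):
    """Fractional cycle time increase caused by a fractional supply droop."""
    slow = relative_delay(vdd * (1.0 - droop), vth, n)
    return slow / relative_delay(vdd, vth, n) - 1.0


# Margin (in %) for 5% and 10% droops at an above- and a near-threshold supply.
for vdd in (1.0, 0.5):
    margins = [round(100 * cycle_time_margin(vdd, d)) for d in (0.05, 0.10)]
    print(vdd, "V:", margins)
```

With these assumed parameters, the same relative droop costs considerably more cycle time at 0.5 V than at 1.0 V, in line with the stronger sensitivity of I on to V DD near threshold.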

Fig. 4.46

Plot of the cycle time margin vs. V DD (assuming cycle time scaling to be proportional to FO4)

Temperature variations in circuits designed for minimum energy have an effect that is quite different from above-threshold circuits. Indeed, higher temperatures lead to a substantial increase in the energy per operation at low voltages, due to the large contribution of the leakage energy (see Sect. 4.3.2). This effect is more pronounced in architectures with larger leakage energy, e.g., with larger logic depth. Figure 4.47 plots the energy versus V DD of the circuit in Fig. 4.30 at different temperatures (27 °C and 70 °C), and for logic depths widely ranging from 25FO4 to 100FO4. From this figure, the minimum energy is heavily influenced by the operating temperature, as it increases by a factor of 1.3X for well-designed architectures with reasonable logic depth, and by 1.7X for less energy-efficient and leakier architectures.

Fig. 4.47

Impact of temperature on minimum energy vs. V DD for different logic depths (10% activity, very low V TH )

As a further difference compared to traditional above-threshold low-power designs, the performance of circuits operating around the MEP actually benefits from increased temperature. Indeed, the I on transistor current is much more sensitive to the threshold voltage than to the carrier mobility, hence it increases at higher temperatures. Figure 4.48 shows the FO4 delay versus V DD for various threshold voltages. From this figure, a temperature rise from 27 °C to 70 °C leads to a 1.4X–2X FO4 reduction at V DD equal to the threshold voltage V TH0 in (4.7). This effect is less pronounced at higher threshold voltages, as the corresponding higher supply voltage emphasizes the carrier velocity saturation and the mobility degradation due to high-field operation, which in turn weaken the dependence of I on on V TH0. At above-threshold voltages around 700–800 mV, the effect of temperature is insignificant. At larger voltages, the temperature has the traditional inverse effect on the performance, and the effect is much weaker than near threshold (e.g., a 2% change for a temperature change from 27 °C to 70 °C). Hence, unless the operating temperature range set by the application is narrow (e.g., indoor applications), active compensation of temperature variations is essential in any integrated system aiming at minimum-energy operation.

Fig. 4.48

Impact of temperature on performance vs. V DD for different threshold voltages (FO4 is normalized to the value at nominal voltage for lowest V TH )

4.8 The Leakage-Variability Tradeoff

Operation around the MEP introduces a tradeoff that is not encountered in traditional low-power above-threshold designs, namely the variability-leakage tradeoff. This is an unavoidable tradeoff that constrains the design at all levels of abstraction and is tightly linked to the averaging effect of additive variations, as discussed below.

At the gate level, a logic path with logic depth LD has a delay that is the sum of LD gate delays, as depicted in Table 4.8. The resulting path delay variability is inversely proportional to \( \sqrt{LD} \) thanks to the averaging effect of the random variations across cascaded gates (Alioto et al. 2010; Merrett et al. 2010) (i.e., more cascaded gates reduce the overall delay variability thanks to better averaging across a larger number of cells). Hence, reducing the delay variability would require the adoption of microarchitectures with larger logic depths. On the other hand, larger logic depths increase the clock cycle and hence the leakage energy per cycle from (4.10). In other words, the mitigation of delay variations comes at the cost of a higher leakage energy, and vice versa. This tradeoff is very specific to operation at near- and sub-threshold voltages, due to the much more important contribution of the leakage energy, as opposed to above-threshold designs.
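The \( \sqrt{LD} \) dependence can be made explicit under the simplifying assumption that the LD gate delays are independent and identically distributed, with mean \( {\mu}_{gate} \) and standard deviation \( {\sigma}_{gate} \) (symbols introduced here for illustration). Both the means and the variances add up along the path, so that

$$ {\mu}_{path}= LD\cdot {\mu}_{gate},\kern1em {\sigma}_{path}=\sqrt{LD}\cdot {\sigma}_{gate}\kern1em \Rightarrow \kern1em \frac{\sigma_{path}}{\mu_{path}}=\frac{1}{\sqrt{LD}}\cdot \frac{\sigma_{gate}}{\mu_{gate}} $$

For example, quadrupling the logic depth halves the relative path delay variability, while the clock cycle, and with it the leakage energy per cycle, grows roughly in proportion to LD.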

Table 4.8 Dependence of the delay variability in logic paths, cells and transistors

At the cell circuit level, a similar tradeoff is encountered when the number of stacked transistors in a standard cell (i.e., the cell fan-in) is considered (Alioto et al. 2010; Merrett et al. 2010). Indeed, the variability of the I on current delivered by the cell to the load, and hence the cell delay variability, is inversely proportional to \( \sqrt{N_{stacked}} \) as in Table 4.8. Thus, the mitigation of variations through more stacked transistors comes at the cost of larger delay (see considerations on stacking in Sect. 4.2.3) and hence larger leakage energy from (4.10).

At the transistor level, from Table 4.8 wider transistors exhibit smaller I on variability thanks to Pelgrom's law (Pelgrom et al. 1989). Hence cells with larger strength have smaller delay variability, as the latter is inversely proportional to \( \sqrt{strength} \). Again, the delay variability mitigation comes at the cost of larger leakage energy, as the leakage current drawn by a cell is proportional to its strength.
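In its commonly used form, Pelgrom's law states that the standard deviation of the threshold voltage scales with the inverse square root of the gate area. Here \( {A}_{VT} \) denotes the technology-dependent matching coefficient, whose value is not given in this section:

$$ {\sigma}_{V_{TH}}=\frac{A_{VT}}{\sqrt{W\cdot L}} $$

If the cell strength is increased by widening the transistors at fixed channel length, the variability scales as \( 1/\sqrt{strength} \) while the leakage current scales as the strength. For example, a cell with 4X strength would have about half the variability of the unit-strength cell, at about four times the leakage.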

In summary, the variability-leakage tradeoff is an inescapable challenge in the design of circuits and systems operating around the MEP, as opposed to traditional low-power above-threshold designs. This tradeoff involves all levels of abstraction, and needs to be constantly taken care of during the design process. It can be broken by introducing innovative design techniques that do not purely rely on timing margining, as discussed later in this chapter.

4.9 Near-Threshold Cell Libraries

Designing cell libraries for operation around the MEP certainly helps manage the peculiar tradeoffs observed at near-threshold voltages in a more efficient manner, at the cost of additional design and characterization effort.

Since performance is not the main objective, near-threshold cell libraries can be designed with short standard cells (e.g., 7 metal tracks), as transistors typically do not need to be wide, as depicted in Fig. 4.49. In near-threshold designs, taller cells (e.g., 10–12 metal 1 tracks) can achieve higher performance but lead to a significant area efficiency degradation, and longer interconnects, thus degrading energy efficiency.

Fig. 4.49

Near-threshold cells are shorter than typical above-threshold cells

The composition of near-threshold libraries does not need to be as wide as that of libraries for above-threshold (i.e., higher performance) operation. Indeed, cell versions with very large strength can be suppressed, as they are typically not used due to the more relaxed performance constraints. Similarly, cells with large fan-in need to be eliminated, since they suffer from disproportionately larger delay, as discussed in Sect. 4.2.3. Typically, libraries with around 100 cells are adequate for near-threshold designs. From observations of prior designs, the energy reduction obtained through a custom near-threshold library can be on the order of 20% (Gemmeke et al. 2013; Gammie et al. 2011), compared to a pruned conventional library for above-threshold voltages (see below).

The circuit design of cells is affected by near-threshold operation in terms of sizing as well. Indeed, minimum transistor size needs to be skipped in technologies that are significantly affected by Narrow Channel Effects (see Sect. 4.2.2), to avoid the related increase in the transistor threshold voltage.

Near-threshold libraries might need to be enriched with cells that are normally not available in above-threshold libraries. For example, cells with thick-oxide transistors might be needed for always-on blocks (see Chap. 1) that need to have very low leakage, or that are connected directly to 3.6-V Li-ion batteries (see Chap. 15). Being particularly critical in terms of the minimum voltage V min assuring correct operation, flip-flops usually need to be thoroughly redesigned to achieve adequate functional yield at low voltages. This is usually achieved through circuit techniques that eliminate the potential current contention between transistors (Jain et al. 2012; Kim et al. 2014a). V min is further reduced by replacing conventional dynamic circuits (e.g., periphery in register files) with their static CMOS counterparts. As summarized in Fig. 4.50, V min is determined by several contributions arising at the process and circuit level, and is certainly dominated by variations (Alioto 2012).

Fig. 4.50

Breakdown of minimum supply voltage V min of logic gates ensuring correct operation (Alioto 2012)

As an alternative option, existing cell libraries designed for above-threshold regions can be reused at lower voltages, after proper pruning to eliminate the cells that suffer from robustness issues or particularly pronounced delay increase (Alioto 2010; Wang et al. 2006). In a given library designed for above-threshold voltages, the number of usable cells at lower voltages decreases when reducing V DD , as fewer cells can operate reliably at lower voltages. Typically, as summarized in Fig. 4.51, the suppression of cells with a high fan-in (e.g., 4) leads to approximately 100-mV V min reduction (Gemmeke et al. 2013).

Fig. 4.51

Percentage of library cells operating correctly vs. V DD (Gemmeke et al. 2013)

4.10 Clock and Supply Networks for Near-Threshold Operation

The design of clock networks for near-threshold designs is very different from that of above-threshold networks, because the balance between clock repeater and wire delay is very different, and the clock skew is determined by different dominant mechanisms (Alioto 2014; Lin et al. 2017; Seok et al. 2011; Tolbert et al. 2011). At above-threshold voltages, several levels of clock repeaters are needed to frequently interrupt wires, so as to limit the related RC time constant and hence the clock slope through the wires (Xanthopoulos 2009) (see Fig. 4.52). Indeed, excessive clock slope induces large random delay variations in the clock repeaters at intermediate nodes of the clock network, and degrades the nominal timing parameters of the flip-flops (as well as their variations) at the sinks of the same network. In other words, the significant wire RC delay and its impact on clock skew through the clock slope justify the adoption of deep clock networks at above-threshold voltages.

Fig. 4.52

General clock network structure and related timing parameters (Alioto 2014)

At sub- and near-threshold voltages, the gate delay becomes much larger than the wire delay (see Sects. 4.2.3 and 4.6), hence the clock slope through wires is no longer an issue, and the random skew is dominated by the intrinsic variations in the clock repeaters. According to the Central Limit theorem (Walpole et al. 2006), the random skew standard deviation is proportional to the square root of the number of cascaded repeaters, i.e., of the depth of the clock network. Accordingly, shallow networks need to be used at sub- and near-threshold voltages, so that the dominant skew contribution due to the number of clock repeaters is reduced.

From the above considerations, designing the clock network at a given voltage leads to a skew degradation at the other end of the voltage range. As an example, Fig. 4.53 plots the skew of a clock network in 28 nm that has been designed at 1.2 V and used at lower voltages. This figure shows that the skew at low voltages becomes several FO4 and even exceeds 10FO4 in sub-threshold. This means that the skew in a clock network used in a wide range of voltages becomes a large fraction of typical cycle time targets of energy-efficient designs (see Sect. 4.4.1). In other words, using a clock network in a wide voltage range leads to a significant degradation in performance (or in energy efficiency, if V DD is increased to recover the lost performance). Similarly, the clock skew easily exceeds the available hold margin (Alioto et al. 2015), thus leading to timing failures at low voltages. In other words, the clock skew degradation at low voltages typically defines V min . Similar trends are observed when designing the clock network at low voltages and running at above-threshold voltages.

Fig. 4.53

Clock skew of a sample clock network in 28 nm designed at 1.2 V (normalized to FO4) (Alioto 2014)

From the above considerations, the design of the clock network of integrated systems operating in a wide voltage range entails a fundamental tradeoff between the performance at above-threshold voltages, and the ability to scale down to low voltages. Various approaches have been proposed to make this tradeoff more favorable, and mitigate the skew-energy penalty imposed by the adoption of deep or shallow clock networks. For example, moderately deep networks with long-channel LVT buffers have been proposed in Myers et al. (2015). Design methodologies have been introduced in Seok et al. (2011), Tolbert et al. (2011), Zhao et al. (2012) to optimally design clock networks, although for a single low voltage. Techniques for adaptive point-to-point interconnects with regenerative drivers have also been proposed (Kim et al. 2014b; Wang et al. 2015), although they cannot be used for clock networks and are not supported by commercial EDA tools. Voltage-adaptive delay insertion across different clock domains was introduced in (Jain et al. 2012; Tokunaga et al. 2014) to mitigate the inter-domain skew (e.g., between processor and memory), although no adaptation to voltage has been performed within each clock domain. Clock network adaptation to a wide range of voltages within each clock domain has been demonstrated in Lin et al. (2017), where the clock network topology is reconfigured to minimize the skew at each specific voltage.

Regarding the supply network, voltage drops are less of a concern at near-threshold voltages and below, as the I on transistor current is at least an order of magnitude lower than at nominal voltage. Accordingly, the current density drawn by the digital circuit is reduced by the same amount, and hence issues related to voltage drops across the supply network are largely mitigated. This partially alleviates the problem of the stronger impact of V DD fluctuations at near-threshold voltages, due to the larger sensitivity of performance (see Sect. 4.7.2). This translates into a relaxed requirement on the supply rail width in the cell library, which can help slightly reduce the cell height. For analogous reasons, the lower clock frequency of near-threshold circuits makes the effect of the wire parasitic inductance negligible. Finally, the peak current absorbed by near-threshold circuits is also reduced by an order of magnitude, compared to above-threshold operation. Hence, the size of the decoupling capacitors needed to keep V DD fluctuations within a targeted band can be reduced by the same amount, thus saving area and improving the utilization factor of the module under design.

4.11 Perspectives and Trends

In summary, near-threshold circuits pose challenges and opportunities that significantly differ from those of conventional above-threshold low-power circuits. Counteracting leakage in spite of the inefficacy of conventional leakage reduction techniques (see Sect. 4.5) requires a radically different approach that maximizes the opportunities to reduce leakage when transistors are not being used. This can be accomplished by introducing fine-grain power domains that can be power gated (e.g., with gate boosting to improve its effectiveness, as in Sect. 4.5) or voltage scaled to mitigate the leakage contribution of unused transistors. Power domains are typically coarse and of the size of at least an entire microprocessor, whereas such fine-grain power domains have the size of sub-blocks or execution units (e.g., ALU), or even finer (e.g., individual operators in the ALU). Although this approach certainly enhances the chances to turn off transistors, its direct application leads to significant area/energy/performance overhead. The latter is due to the need for additional power domain control circuitry, as well as isolation/clamping cells for power gating and level shifters (see Chap. 9) at the boundary of each domain.

Fine-grain voltage domains are also a highly promising approach in near-threshold circuits. Indeed, the ability to distribute different voltages with fine granularity maximizes the opportunities to correct variations in paths that turn out to be critical due to random variations, while reducing the energy in all other domains. The effectiveness of fine-grain voltage domains is further enhanced by the strong sensitivity of performance to V DD (see Sect. 4.2), which ensures that voltage boosting is kept small (e.g., 100–200 mV) in all practical cases. For example, selective boosting can be used to reduce the general V min of the circuit, while raising the voltage of the small portion of the circuit that needs to operate at higher voltages (Tokunaga et al. 2014). As another example, (Muramatsu et al. 2011) leverages such small voltage difference across voltage domains by suppressing level shifters altogether, so that the voltages can be freely assigned to very small domains to compensate for variations where they arise, while avoiding the otherwise large overhead of level shifters. The Panoptic approach (Putic et al. 2009) introduces both spatial and temporal fine granularity by using multiple sleep transistors that also dynamically connect sub-blocks to three different supply voltages. The sleep transistors serve the purpose of reducing the leakage of unused sub-blocks, and of assigning them the minimum possible voltage for the task at hand when used.

Variations can also be exploited rather than added as design margin, when an adequately large number of replicas of a given block are available on the same chip. For example, Raghunathan et al. (2013) introduces the concept of “cherry picking” among many on-chip cores, which consists in the post-silicon selection of the most energy-efficient cores while keeping the others off. This maximizes the energy efficiency by leveraging the inevitable random variations, rather than tolerating them, at the cost of area due to the partial utilization of cores. Observe that full utilization would not be allowed anyway in practical cases, due to the “dark silicon” issue (see Chap. 1) that is determined by the chip power constraint.

In general, variations can be mitigated at different times, from design time to testing, chip boot time and run-time, as summarized in Fig. 4.54. At design time, all variation contributions need to be incorporated into the design (e.g., cycle time) margin, as they are not known upfront. The margin is lowered at testing time, as process variations are known and can hence be suppressed, whereas voltage, temperature and aging-induced variations still need to be included (as they will be defined later during in-field operation). At boot time, aging can be compensated as well. The margin is made very small and virtually removed when variations are compensated at run-time, i.e., when all process, temperature and (slow) voltage variations are known. Obviously, the cost of such detection and compensation of variations increases when moving from design time to run-time.
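The bookkeeping described above can be sketched as follows. The mapping of stages to uncompensated variation sources follows the text, whereas the per-source margin values, the linear summation of contributions and all names are hypothetical and only serve as an illustration. Treating the run-time residual as exactly zero is a simplification, since fast voltage fluctuations are not captured.

```python
# Variation sources that are still unknown, and hence covered by margin,
# at each stage (following the description in the text).
UNCOMPENSATED = {
    "design": ("process", "voltage", "temperature", "aging"),
    "testing": ("voltage", "temperature", "aging"),
    "boot": ("voltage", "temperature"),
    "run-time": (),
}


def residual_margin(stage, contribution_pct):
    """Sum the margin contributions (in % of cycle time) not yet compensated at a stage."""
    return sum(contribution_pct[source] for source in UNCOMPENSATED[stage])


# Hypothetical per-source margins, in percent of the cycle time.
example_pct = {"process": 40, "voltage": 30, "temperature": 20, "aging": 10}
for stage in UNCOMPENSATED:
    print(stage, residual_margin(stage, example_pct))
```

With these placeholder numbers, the residual margin shrinks monotonically from design time to run-time, mirroring the qualitative trend summarized in Fig. 4.54.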

Fig. 4.54

Summary of techniques to counteract variations at different times, and resulting cycle margin and overhead

Due to the very large design margin required at near-threshold voltages (see Sect. 4.7), adequate yield and energy efficiency certainly require the adoption of run-time compensation of variations. This is typically performed through timing error detection and correction (EDAC) methods, which have been investigated since the early 2000s (Ernst et al. 2003). EDAC methods sense the timing margin at run time by detecting timing failures, so that the system can be tuned to operate at nearly-zero margin (Ernst et al. 2003). This allows running at the highest possible frequency at a given voltage, or at the minimum possible voltage at a given frequency. Hence, error detection and correction improves the energy efficiency of circuits operating at any voltage, typically by 1.3–1.45X (see references below).
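The idea of tuning the supply to nearly-zero margin at a given frequency can be illustrated with a simple search loop. This is a deliberately simplified sketch, not the control scheme of any of the cited designs: it treats the error detector as a one-bit margin sensor and ignores error correction and replay. The function names, the 10-mV step and the 470-mV failure point of the example die are all hypothetical.

```python
def tune_supply_mv(error_detected, v_start_mv, v_floor_mv, step_mv=10):
    """Lower VDD at fixed frequency until the next step would trigger a timing error.

    error_detected(v_mv) stands in for the timing error detector output at
    supply v_mv. Returns the lowest error-free supply found, in millivolts.
    """
    v_mv = v_start_mv
    while v_mv - step_mv >= v_floor_mv and not error_detected(v_mv - step_mv):
        v_mv -= step_mv
    return v_mv


# Hypothetical die: timing errors appear below 470 mV at the target frequency.
def example_detector(v_mv):
    return v_mv < 470


print(tune_supply_mv(example_detector, v_start_mv=600, v_floor_mv=300))
```

In this toy example the loop settles at the lowest supply that the specific die tolerates, instead of the higher voltage that a worst-case design margin would impose on every die.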

Error detection can be performed through canary circuits and Tunable Replica Circuits (TRCs) mimicking critical path variations, and hence predicting the occurrence of timing violations with a high (but not 100%) level of confidence, at rather low overhead (Bowman et al. 2011). However, tracking the critical path across a wide range of voltages is difficult, and hence such methods are more appropriate for operation over a narrow range. Also, since TRCs try to replicate the critical path, they cannot completely eliminate the design margin. In-situ error detection is performed by inserting timing sensors to detect true timing failures, which typically entails significant area overhead. Several in-situ error detection methods have been proposed, such as Razor (Ernst et al. 2003), Razor II (Das et al. 2009), EDS (Bowman et al. 2009; Bowman et al. 2011), ERSA (Leem et al. 2010), and Bubble Razor (Fojtik et al. 2013). However, their overhead is on the order of a few (if not several) tens of percentage points, and hence an order of magnitude larger than for TRCs, which has prevented their adoption in commercial chips. Recently, very low-overhead (i.e., a few percentage points) in-situ approaches have been demonstrated, such as Razor-Lite (Kwon et al. 2014) and iRazor (Zhang et al. 2016) for processors, and RazorSRAM for on-chip memories (Khayatzadeh et al. 2016). Being very lightweight, these approaches promise a much wider adoption of in-situ error detection in mass-produced chips.

Another very promising direction to further reduce the energy per computation is offered by its tradeoff with quality. As discussed in Chap. 1, quality can be defined in different ways depending on the application and the sub-system under design. In a processing sub-system, quality is related to accuracy in terms of precision in case of arithmetic tasks, misclassification rate in classification tasks, or effective number of bits in an Analog-to-Digital Converter (ADC). The concept is far more general than approximations (e.g., approximate computing), in that it applies to a broad range of types of tasks and applications, and the tradeoff between quality and energy is dynamic and based on quality sensing (see example in Sect. 1.6.2).

Based on the concepts described in Sect. 1.6.2, energy-quality scalability has been introduced in many different sub-systems and levels of abstraction. For example, the first energy-quality scalable SRAM memory has been introduced in Frustaci et al. (2015), where occasional faults (e.g., bitcells with inadequate write- or read-ability) occur in the array. The scalability comes from a bit-level management of the tradeoff between the bit error rate and the energy, by adjusting assist techniques (see Chap. 5) differently for different bit positions and in a dynamically scalable manner. This goes beyond traditional memories, where assist is uniformly applied to all bit positions to fully suppress errors, which entails a substantial energy cost. Similarly, selective Error Correction Codes have been introduced in the SRAM, to favor the robustness of the bits carrying the highest information content (e.g., MSBs in video processing applications), while saving on the other bit positions. Overall, this approach improves the general quality by spending some energy in selected bit positions, thus enabling much more aggressive voltage scaling on all positions and hence achieving a quadratic benefit. Energy reductions of 2X have been demonstrated compared to traditional voltage scaling, at iso-quality (Frustaci et al. 2016). The same general concept has been applied to several other sub-systems, such as ADCs with dynamically scalable resolution (Freyman et al. 2014; Yip et al. 2011). In this case, when the application can tolerate a reduction in the ADC resolution, a more than 2X energy reduction is gained when the resolution is reduced by one bit, leading to an exponential energy saving.
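The exponential saving follows from compounding the per-bit gain. Writing \( {E}_{ADC}(N) \) for the conversion energy at N-bit resolution (a symbol introduced here only for illustration), and assuming that the more-than-2X reduction holds at every one-bit step:

$$ {E}_{ADC}\left(N-1\right)<\frac{E_{ADC}(N)}{2}\kern1em \Rightarrow \kern1em {E}_{ADC}\left(N-k\right)<\frac{E_{ADC}(N)}{2^k} $$

For instance, giving up k = 3 bits of resolution would reduce the energy by more than 8X under this assumption.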

Finally, the presence of a minimum-energy point (MEP) actually poses a fundamental challenge in terms of energy scalability when the system is operating at the right of the MEP. Indeed, the MEP tends to be a flat minimum as discussed in Sect. 4.3.2, which in turn translates into insignificant energy savings when the voltage is scaled down from values at the right of the MEP towards the MEP itself. As an example, from Fig. 4.55 there is almost no energy saving when scaling from 0.6 V (i.e., at the right of the MEP) down to 0.5 V (i.e., the MEP), due to the flatness of the MEP. In other words, the voltage scalability of the design (i.e., its ability to operate at very low voltages) does not translate into an actual energy scalability (i.e., the ability to reduce energy when scaling down the voltage). To preserve energy scalability, the energy curve in Fig. 4.55 needs to be steep rather than flat, which is achieved only if the operating voltage is far enough on the right side of the MEP. For example, a quadratic benefit is observed in this figure at voltages from 0.8 V to 1 V. Conversely, to achieve good energy scalability at a given low voltage (e.g., 0.5 V), the MEP needs to be pushed to the left of this targeted voltage (e.g., 0.3 V). In other words, innovation is needed to move the MEP where needed, depending on the operating voltage. At above-threshold voltages, the MEP can lie at a fairly high voltage without being a problem, since the dominance of the dynamic energy still assures a quadratic benefit when downscaling V DD . When a near-threshold voltage is targeted and further voltage scaling needs to be applied, the MEP needs to be dynamically moved to the left to make the energy curve steeper, and again achieve a nearly-quadratic energy benefit. We believe that this is one of the fundamental challenges that need to be addressed to further improve the energy efficiency of low-voltage integrated systems for IoT.
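The loss of energy scalability near the MEP can be visualized with a toy energy model built on (4.1). The sketch below is an assumption-laden illustration and does not reproduce Fig. 4.55: the dynamic energy is taken as proportional to activity · V DD ², and the leakage energy per cycle as proportional to V DD ² · I off / I on (leakage power times a cycle time proportional to V DD / I on ). The threshold voltage, subthreshold factor, activity and the lumped weight `leak_weight` are hypothetical. The local exponent d(ln E)/d(ln V DD ) is used as a scalability metric: about 2 means quadratic savings, about 0 means a flat curve.

```python
import math

THERMAL_VOLTAGE = 0.0259  # kT/q at about 300 K, in volts


def inversion_coefficient(vgs, vth=0.4, n=1.5):
    """Normalized current from the transregional model (4.1) at gate voltage vgs."""
    v = (vgs - vth) / (2.0 * n * THERMAL_VOLTAGE)
    return math.log1p(math.exp(v)) ** 2


def energy_per_op(vdd, activity=0.1, leak_weight=50.0):
    """Normalized energy: dynamic ~ activity * VDD^2, leakage ~ VDD^2 * I_off / I_on."""
    i_on = inversion_coefficient(vdd)
    i_off = inversion_coefficient(0.0)  # gate at 0 V, drain-induced effects ignored
    return vdd ** 2 * (activity + leak_weight * i_off / i_on)


def energy_exponent(vdd, h=1e-3):
    """Local slope d(ln E)/d(ln VDD), estimated with a central difference."""
    num = math.log(energy_per_op(vdd + h)) - math.log(energy_per_op(vdd - h))
    den = math.log(vdd + h) - math.log(vdd - h)
    return num / den


def find_mep(v_lo=0.15, v_hi=1.0, step=0.001):
    """Grid search for the supply voltage that minimizes the energy per operation."""
    n_points = int(round((v_hi - v_lo) / step)) + 1
    grid = [v_lo + i * step for i in range(n_points)]
    return min(grid, key=energy_per_op)


mep = find_mep()
for vdd in (1.0, mep + 0.1, mep + 0.03, mep):
    print(round(vdd, 3), "V: exponent", round(energy_exponent(vdd), 2))
```

In this model the exponent is close to 2 well above the MEP and collapses towards zero as V DD approaches it, so that moving the MEP to the left (e.g., by reducing the leakage weight) is what restores a steep energy curve at a given low voltage.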

Fig. 4.55

Energy curve vs. V DD in a 32-bit multiplier in 28-nm technology