
7.1 Introduction

The continuing development of IC technology during the last couple of decades has led to a considerable increase in the number of devices per unit chip area. The resulting feasible IC complexity currently allows the integration of a complete system on a chip (SoC), which may comprise hundreds of millions to a few billion transistors.

Consequently, the design of such chips no longer simply consists of the assembly of a large number of logic gates. This poses a problem at a high level of design: how to manage the design complexity. Besides this, the growing influence of parasitic and scaling effects (see Chaps. 2, 9, and 11), which may reduce chip performance dramatically, requires considerable additional design resources for the implementation of adequate countermeasures.

Such ICs combine signal processing capacity with microprocessor or microcontroller cores and memories. The dedicated signal processing parts take care of the computing power (workhorse), while the microprocessor or controller serves to control the process and possibly performs some low performance computation as well. The memories may store program code and data samples. Finally, since the world is analog, most ICs contain on-chip analog interface and pre- and post-processing circuits as well as an increasing number of wireless interfaces. The development of such heterogeneous systems on one or more ICs, for instance, may require tens to even hundreds of man-years, depending on their complexity. Microprocessors for standard PCs and servers, usually referred to as mainstream MPUs (Intel and AMD processors), may even require several thousand man-years of development time.

A significant amount of the total IC turnover is generated in the ‘low-end market’. This market consists of low-complexity ICs and was originally controlled by the large IC vendors. During the 1980s and 1990s, however, a change took place and the low-end market is now dominated by Application-Specific Integrated Circuits (ASICs). These are ICs which are realised for a single end-user and dedicated to a particular application. ASICs therefore implement customer-specified functions and there are various possibilities for the associated customisation. This can be an integral part of an IC’s design or production process or it can be accomplished by programming special devices.

ASICs do not include ICs whose functionality is solely determined by IC vendors. Examples of these ‘Application-Specific Standard Products’ (ASSPs) include digital-to-analogue (D/A) converters in DVD players. These ASSPs are so-called vendor-driven ICs, which the vendor wants to sell in the largest possible volumes to any customer. ASICs, in contrast, are customer-driven ICs, tailored to the specific requirements of one single customer. Actually, User-Specific Integrated Circuits (USICs) would be a more appropriate name for ASICs. This term would clearly be preferable because it emphasises the fact that the IC function is determined by the customer’s specification and not simply by the application area.

The turn-around time of an ASIC is the period which elapses between the moment a customer supplies an IC’s logic netlist description and the moment the vendor supplies the first samples. The turn-around time associated with an ASIC depends on the chosen implementation type. A short turn-around time facilitates rapid prototyping and is important to company marketing strategies. In addition, ASICs are essential for the development of many real-time systems, where designs can only be verified when they are implemented in hardware. There are many different market segments, each with its own ASIC products:

  • Automotive: networking, infotainment, GPS, tire pressure monitor, body electronics

  • Mobile communications: mobile/smart phones (GSM, UMTS), tablets, modems, wireless local loop (WLL), GPS

  • Medical: patient monitoring, diagnostics, ultrasound

  • Display: LCD TV, flat panel, projection TV

  • Digital consumer: CD/DVD, MP3, audio, TV, media box, set-top box, encoders/decoders

  • Connectivity: WLAN, Bluetooth, USB, NFC, FireWire

  • Identification: smart cards, electronic car keys, e-passports and RF-ID tags, such as animal tags and product tags

  • Industrial: robotics, motor/servo control

  • Military: image, radar and sonar processing, navigation

Suitable computer-aided design (CAD) tools are therefore essential for the realisation of this rapidly expanding group of modern ICs. Growing design complexity combined with shorter product market windows requires the development of an efficient and effective design infrastructure, based on an (application-)domain-specific SoC design platform. In this respect, a platform is an integrated design environment, consisting of standard-cell libraries, IPs and application-mapping tools, which is aimed at providing a short and reliable route from high-level specification to correct silicon. The convergence of the consumer, computing and communications domains accelerates the introduction of new features on a single chip, requiring a broader range of standards and functions for an increasing market diversity. This makes a design more heterogeneous, with a large variety of domain-specific, general-purpose IP and memory cores. Next to this, there is a tremendous growth in the complexity of embedded software, which may take more than 50% of the total SoC development costs, particularly in multi-processor designs.

This puts very high demands on the flexibility and reusability of a platform across a wide range of application derivatives, requiring a large diversity of fast-compiling IPs in combination with efficient verification, debug and analysis tools. Such a platform needs to be scalable and must also enable the addition of new IP cores without the need for changing the rest of the system.

The design process is discussed on the basis of an ASIC design flow. The various implementation possibilities for digital VLSI and ASICs are discussed and factors that affect a customer’s implementation choice are examined. These implementations include: standard-cell, gate-array, field-programmable gate-array (FPGA) and programmable logic devices (PLD). Market trends and technological advances in the major ASIC sectors are also explained.

7.2 Digital ICs

Digital ICs can be subdivided into different categories, as shown in Fig. 7.1. ASICs can be classified according to the processing or programming techniques used for their realisation. A clear definition of the types and characteristics of available digital ICs and ASICs is a prerequisite for the subsequent discussion of the trends in the various ASIC products. Figure 7.1 presents quite a broad overview of digital ICs but excludes details such as the use of direct slice writing (DSW) or masks for IC production. Several terms used in this figure and throughout this chapter are explained on the next pages.

Fig. 7.1 An overview of digital ICs

Definitions:

ASSP:

Application-Specific Standard Products are ICs that are suitable for only one application but their availability is not restricted to a single customer. Examples include video ICs for teletext decoding and ICs for D/A conversion in DVD players.

Core:

Pre-designed standard building block, used industry- (or company-)wide: RAM, ROM, microprocessor (e.g., ARM, MIPS and Sparc), graphics processor unit (GPU), interfaces (Bluetooth, USB and NFC), etc.

Custom:

A custom IC is an IC in which all masks are unique for a customer’s application. The term full-custom IC often refers to an IC in which many sub-circuits are newly handcrafted designs. In this book, full-custom ICs fall under the category of custom ICs. Cell-based custom-IC designs are based on standard cells, macro cells, mega cells and possibly compiled cells. Macro and mega cells, or cores, are large library cells such as multipliers, RAMs, ROMs and even complete microprocessor and signal processor cores. Compiled cells are generated automatically by software. These cells are used for dedicated applications and are generated as a function of user-supplied parameters.

The customisation of PLD-based ASICs takes place after IC manufacture. Customisation of custom and semi-custom ASICs, however, is an integral part of IC manufacture. The turn-around time of ASICs from database ready to first silicon varies enormously and depends on circuit complexity and the customisation technique. This time can range from a few hours for a PLD to between 6 and 12 weeks for a custom design.

FPGA:

A Field-Programmable Gate Array is an IC that has the ability to change its functionality after manufacture. It contains programmable logic and programmable routing channels. It belongs to the group of ICs usually referred to as Programmable Logic Devices (PLDs).

HDL:

Hardware description language. This language is used for the formal description of the structure and behaviour of electronic circuits. It enables the circuit designer to describe (model) a circuit before it is physically implemented. Verilog and VHDL have become the two most popular HDLs for coding the design of integrated circuits. Synthesis tools are able to read this HDL code, extract the logic operations and translate these into a netlist of logic gates.

IP:

Intellectual Property. With the complexity of ICs reaching a billion or more transistors, the traditional way of designing can no longer be continued. Therefore, the concept of the Virtual Component was introduced in 1996 by the Virtual Socket Interface Alliance (VSI Alliance: www.vsi.org), an international forum that tried to standardise reusable cores, concepts, interfaces, test concepts, support, etc. Licensing and royalty issues of IP were also addressed. Because the alliance proved inefficient at creating standards for the development of IP cores, the VSIA was dissolved in 2008. However, this standardisation is a prerequisite to fully exploit the potential of design reuse. The cores (or IP) can be represented in three forms.

A soft core is delivered in the form of synthesisable HDL, and has the advantage of being more flexible and the disadvantage of not being as predictable in terms of performance (timing, area, power). Soft cores typically have increased intellectual property protection risks because RTL source code is required by the integrator.

Firm cores have been optimised in structure and in topology for performance and area through floor planning and placement, possibly using a generic technology library. The level of detail ranges from region placement of RTL sub-blocks, to relatively placed data paths, to parameterised generators, to a fully placed netlist. Often, a combination of these approaches is used to meet the design goals. Protection risk is equivalent to that of soft cores if RTL is included, and is less if it is not included.

Finally, hard cores have been optimised for power, size or performance and mapped to a specific technology. Examples include netlists fully placed, routed and optimised for a specific technology library, a custom physical layout or the combination of the two. Hard cores are process- or vendor-specific and generally expressed in the GDSII format. They have the advantage of being much more predictable, but are consequently less flexible and portable because of process dependencies. The ability to legally protect hard cores is much better because of copyright protections and there is no requirement for RTL.

Figure 7.2 is a graphical representation of a design flow view and summarises the high level differences between soft, firm and hard cores.

Fig. 7.2 Graphical representation of soft, firm and hard cores (Source: VSIA)

Due to the convergence of the digital communications, consumer and computer domains, there is an increasing number of real-time signals to be processed: voice, professional audio, video, telephony, data streams, Internet of Things (IoT), etc. This processing is usually performed by high-performance analog and digital signal processors.

Today’s integrated circuits are complex heterogeneous systems: they consist of many different types of processing, storage, control and interface elements. Many of these elements are available as a kind of (standard) IP. Examples of IP are:

  • Microprocessors (CPU): use software to control the rest of the system

    • Intel Itanium, Oracle SPARC, IBM Power7, Sun UltraSPARC, ARM, MIPS, 80C51, …

  • Digital signal processors (DSP): manipulate audio, video and data streams

    • OMAP, TMS320 and DaVinci (TI), DSP56000 series (Freescale), DSP16000 series (Agere), TriMedia and EPICS (NXP), Oak, Teaklite

    • Most DSPs are for wireless products

  • (F)PGA-based accelerators: decoders, encoders, error correction, encryption, graphics or other intensive tasks

  • Memories

    • Embedded memories and caches (Synopsys, Artisan)

    • Memory controllers (Denali): controlling off-chip memories

  • Interfaces: external connections

    • USB, FireWire, Ethernet, UART, Bluetooth, NFC, keyboard, display or monitor

  • Analog

    • A/D, D/A, PLL (e.g., for use in clock generation), oscillator, operational amplifier, differential amplifier, bandgap reference, SerDes, PHYs

PLD:

The first Programmable Logic Devices were customised by fuses or anti-fuses. Modern PLDs are programmed by on-chip memory cells. Most PLDs can be customised by end-users themselves in the field of application, i.e., they are field-programmable devices (FPGAs). The customisation techniques used are classified as reversible and irreversible. PLDs include erasable and electrically erasable types, which are known as EPLDs and EEPLDs, respectively. The former are programmed using EPROM techniques while the EEPROM programming technique is used for the latter devices. These programming techniques are explained in Sects. 6.5.3.3 and 6.5.4, respectively. Complex PLDs (CPLDs) are often based on the combination of PAL™ and PLA architectures.

Reuse:

Future design efficiency will increasingly depend on the availability of a variety of pre-designed building blocks (IP cores; see IP definition). This reuse not only requires easy portability of these cores between different ICs, but also between different companies and between different process nodes. Standardisation is one important issue here (see IP definition). Another important issue concerning reuse is the quality of the (IP) cores. Similar to the Known-Good Die (KGD) principle when using different ICs in an MCM, we face a Known-Good Core (KGC) principle when using different cores in one design. The design robustness of such cores must be so high that their correct operation is always independent of the design in which they are embedded.

RTL:

Register transfer level. See Sect. 7.3.4.

Semi-Custom:

These are ICs in which one or more but not all masks are unique for a customer’s application. Many semi-custom ICs are based on ‘off-the-shelf’ ICs which have been processed up to the final contact and metal layers. Customisation of these ICs therefore only requires processing of these final contacts and metal layers. This results in short turn-around times. A gate array is an example in this semi-custom category.

Standard product:

Standard products, also called standard commodities, include microprocessors, memories and standard-logic ICs, e.g., NAND, NOR, QUAD TWO-INPUT NAND. These ICs are produced in large volumes and are available from different vendors. Their availability is unrestricted and they can be used in a wide variety of applications. They are often listed in a product catalogue.

Usable gates:

The number of gates in a PLD or (mask programmable) gate array that can actually be interconnected in an average design. This number is always less than the total number of available gates.

Utilisation factor:

The ratio between that part of a logic block area which is actually occupied by functional logic cells and the total block area (gate array and cell-based designs).
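As a small numeric illustration, with purely hypothetical area figures:

```python
# Hypothetical example: a logic block of 1.0 mm^2 of which 0.82 mm^2 is
# occupied by functional logic cells (the remainder is routing and unused area).
functional_cell_area_mm2 = 0.82
total_block_area_mm2 = 1.0
utilisation_factor = functional_cell_area_mm2 / total_block_area_mm2
print(f"utilisation factor = {utilisation_factor:.0%}")  # prints 82%
```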

7.3 Abstraction Levels for VLSI

7.3.1 Introduction

Most of today’s complex VLSI designs and ASICs are synchronous designs, in which one or more clock signals control the data flow to, on and from the chip. On the chip, the data is synchronised through flip-flops, which are controlled by a clock ϕ (Fig. 7.3). Flip-flops temporarily store the data and pass it on on clock command. At any time, the positions and values of all data samples are known (through simulation).

Fig. 7.3 Representation of a logic path in a synchronous design

The logic gates in between the flip-flops perform the functionality of the logic block of which they are part. So, in a synchronous chip, the signal propagates through the logic path from one flip-flop to the next. The logic path with the longest propagation delay (usually one with many complex gates and/or large wire delays) is called the worst-case delay path. This path determines the maximum allowed clock frequency. Next to many different functional logic blocks, most systems also contain memory, interface and peripheral blocks.
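The relation between the worst-case delay path and the maximum allowed clock frequency can be sketched in a few lines; the flip-flop and path delays below are purely hypothetical example values, not data from a real cell library:

```python
# Sketch: the slowest register-to-register path sets the clock period.
def max_clock_frequency_mhz(path_delays_ns, t_clk_to_q_ns, t_setup_ns):
    """Return the highest clock frequency (MHz) allowed by the worst-case delay path."""
    worst_case_ns = max(path_delays_ns)              # the worst-case delay path
    period_ns = t_clk_to_q_ns + worst_case_ns + t_setup_ns
    return 1000.0 / period_ns

logic_paths_ns = [3.2, 5.5, 4.1]   # combinational delays between flip-flops (ns)
f_max = max_clock_frequency_mhz(logic_paths_ns, t_clk_to_q_ns=0.3, t_setup_ns=0.2)
print(f"f_max = {f_max:.1f} MHz")  # the 5.5 ns path limits the clock
```

Speeding up any path other than the slowest one leaves f_max unchanged, which is why timing optimisation concentrates on the worst-case delay path.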

The implementation of a complete system on one or more ICs starts with an abstract system-level specification. This specification is then analysed and transformed into a set of algorithms or operations. Next, an optimum architecture that efficiently performs these operations must be chosen. A model that represents the different abstraction levels is the Gajski-Kuhn Chart, named after the two developers who introduced it in 1983 (Fig. 7.4).

Fig. 7.4 Gajski-Kuhn VLSI design abstraction-level chart

It distinguishes three domains of VLSI design representation: a behavioural, a structural and a geometrical domain. At the design start, a behavioural description is provided. Due to its high abstraction level, it does not contain any information on the design structure or on whether it is synchronous or asynchronous, and no timing constraints are considered. Let’s take the example of an elevator function: the elevator (Z) goes up when its door (c) is closed and when somebody (a) in the elevator or somebody (b) on another floor has pushed a button. Its function in the behavioural domain could then be described as: El (Z) goes up when door (c) is closed AND (button push (a) in elevator OR button push (b) on other floor).

Its structural and physical representations are shown in Fig. 7.5. A structural description describes the system as a collection of components and their interconnections, while the physical description relates to the basic devices and interconnections.

Fig. 7.5 Structural (a) and physical (b) representation of the elevator function example

Each of the domains in Fig. 7.4 is divided into five levels of abstraction, represented by concentric rings. Starting at the system level on the outer ring, the design details are refined as we move towards the centre of the diagram, ending at the layout level.

At the system level , the basic specification of an electronic system is determined. Usually the system at this level is represented by one or more block diagrams.

The algorithmic level specifies how the data in the system is manipulated (processed and stored) so that the system does what it has to do.

At register-transfer level , the behaviour is described in more detail as communication between registers. Figure 7.6 shows an example representation of a function at algorithmic level and a micro architecture model of the same at RTL level.

Fig. 7.6 Representation of a function at algorithmic and RTL level

It is clear that the micro architecture is much closer to the real on-chip implementation.

We will use a signal processor as an example function to describe and explain the various abstraction levels in the following (sub)sections. The chosen processor must implement an adaptive FIR filter. As a consequence, this processor must repeatedly fetch numbers from a memory, multiply or add them and then write the result back into the memory. Such a chip may contain several ROM and/or RAM memory units, a multiplier, an adder or accumulator, data and control buses and some other functional modules.
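The repeated fetch-multiply-accumulate behaviour of such a filter can be modelled in a few lines; for simplicity the sketch below is non-adaptive, and the coefficient and sample values are hypothetical:

```python
# Behavioural sketch of an FIR filter: y[n] = sum_k h[k] * x[n-k].
# Each output sample requires fetching numbers from memory, multiplying,
# accumulating and writing the result back -- the processor's inner loop.
def fir_filter(samples, coeffs):
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:                 # samples before the start are taken as zero
                acc += h * samples[n - k]  # fetch, multiply, accumulate
        out.append(acc)                    # write the result back
    return out

print(fir_filter([1.0, 0.0, 0.0, 0.0], [0.5, 0.3, 0.2]))  # impulse response = the coefficients
```

An adaptive version would additionally update the coefficients after every output sample; the memory-access pattern stays the same.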

The design of an IC comprises the transformation of a specification into a layout. The layout must contain all pattern shapes in every mask layer needed to fabricate the chip. Clearly, the design path starts at the top (or system) level and ends at the bottom (or silicon) level. This ‘top-down’ process is illustrated in Fig. 7.7.

Fig. 7.7 Abstraction levels in the design and implementation/verification paths of VLSI circuits

The various design phases are accompanied by several different abstraction levels, which limit the complexity of the relevant design description. The top-down design path allows one to make decisions across abstraction levels and gives high-level feedback on specifications. The ‘bottom-up’ path demonstrates the feasibility of the implementation of (critical) blocks. This process begins at the layout level of a single part and finishes with the verification of the entire IC layout. The abstraction levels used in the design path are described on the following pages. Table 7.1 shows the design complexity at these levels of abstraction.

Table 7.1 Design complexity at different levels of abstraction

7.3.2 System Level

A system is defined by the specification of its required behaviour. Such a system could be a multiprocessor system and/or a heterogeneous system, consisting of different types of processing elements: microprocessor, DSP, analog, control, peripheral and memory cores. Today, advanced heterogeneous architectures also include the integration of graphics processing units (GPUs) to increase graphics processing speed by one or two orders of magnitude compared to running it on a CPU. Figure 7.8 shows a heterogeneous system, containing a signal processor, a microprocessor (IP core), embedded software, some glue logic (some additional overall control logic), local buses, a global bus, and the clock network. The transformation of a system into one or more ICs is subject to many constraints on timing, power and area, for example.

Fig. 7.8 Systems on a chip; an example of a heterogeneous system

While a heterogeneous system consists of several different types of processing and storage elements, there is today also an increased focus on architectures with multi-processor cores and even architectures built from only a limited number of different cores. In the ultimate case, an architecture can be built from multiple identical cores (tiles) to create a homogeneous system. Figure 7.9 (top) shows a layout of a massively parallel processor for video scene analysis implemented as a homogeneous design [1], as opposed to the heterogeneous chip (bottom).

Fig. 7.9 Example of a homogeneous design, consisting of multiple identical cores (tiles), and a heterogeneous chip consisting of various different cores (Source: NXP Semiconductors)

System decisions taken at the highest level have the most impact on the area and performance parameters. Decisions regarding functions that are to be implemented in hardware or software are made at the system level. Filter sections, for example, are frequently programmed in software. A system-level study should also determine the number of chips required for the integration of the chosen hardware. It is generally desirable to sub-divide each chip into several sub-blocks. For this purpose, data paths and control paths are often distinguished. The former is for data storage and data manipulation, while the latter controls the information flow in the data path, and to and from the outside world. Each block in the data path may possess its own microcontrol unit. This usually consists of a decoder which recognises a certain control signal and converts it into a set of instructions.

The block diagram shown in Fig. 7.10 represents a description of the signal processor of Fig. 7.8 at the system abstraction level. The double bus structure in this example allows parallel data processing. This is typically used when very high throughputs are required. For example, data can be loaded into the Arithmetic Logic Unit (ALU) simultaneously from the ROM and the RAM. In this type of architecture, the data path and control path are completely separated. The control path is formed by the program ROM, which may include a program counter, control bus and the individual microcontrol units located in each data path element.

Fig. 7.10 Block diagram of a signal processor

Other system implementations may not show such a clear separation of data and control paths.

7.3.3 Functional Level

A description at this level of abstraction comprises the behaviour of the different processing elements and other cores of the system. In case of the signal processor of Fig. 7.10, we distinguish: an ALU, a digital mixer, a RAM, a ROM and the I/O element.

RAMs, ROMs and I/O elements are usually not very complex in their behaviour. Because of this simplicity, they are mostly described at the next, lower level of abstraction: the RTL level.

Let us take the digital mixer as an example. Because of its simple architecture, it too will be described at the lower RTL level.

There are some tools, e.g., Matlab, Simulink, SystemC, that allow a description of complex blocks at functional level. They allow high-level system evaluation and verification in different use-cases across different hierarchy levels, and exploration of alternative solutions for certain functions.

The chosen mixer, at this hierarchy level (RTL level), consists of different arithmetic units (adder, multiplier, subtractor), which are functions as well; the RTL and functional levels therefore show some overlap (see also Fig. 7.17).

7.3.4 RTL Level

RTL is an abbreviation for Register-Transfer Language. This notation originates from the fact that most systems can be considered as collections of registers that store binary data, which is operated upon by logic circuits between these registers. The operations can be described in an RTL and may include complex arithmetic manipulations. The RTL description is not necessarily related to the final realisation.

To describe a function at this level is a difficult task. A short sentence in the specification, e.g., ‘performs MPEG-4 encoding’, will take many lines of RTL code, and its verification is extremely difficult. Logic simulation and/or even emulation may help during the verification process, but cannot guarantee full functionality, since it is simply impossible to fully cover all possible cases and situations. Let us return to our digital mixer example. The behaviour of this mixer can be described as:

$$\displaystyle{Z = k \cdot A + (1 - k) \cdot B}$$

When k = 0, Z will be equal to B and when k = 1, Z will be equal to A. The description does not yet give any information about the number of bits in which A, B and k will be realised. This is one thing that must be chosen at this level. The other choice to be made here is what kind of multiplier must perform the required multiplications. There are several alternatives for multiplier implementation, of which some are discussed as examples.
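To make this word-length decision concrete, the behavioural sketch below models the mixer with k quantised to an 8-bit fixed-point fraction. The choice of 8 bits is purely hypothetical, picked only to illustrate the kind of decision made at this level:

```python
# Mixer Z = k*A + (1-k)*B with k represented as an 8-bit fixed-point fraction.
# K_BITS = 8 is a hypothetical word-length choice, not one made in the text.
K_BITS = 8
SCALE = 1 << K_BITS          # k = k_int / SCALE, so 0 <= k_int <= SCALE

def mix(a, b, k_int):
    return (k_int * a + (SCALE - k_int) * b) // SCALE

print(mix(100, 20, 0))           # k = 0   -> Z = B = 20
print(mix(100, 20, SCALE))       # k = 1   -> Z = A = 100
print(mix(100, 20, SCALE // 2))  # k = 0.5 -> 60
```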

  • Serial-parallel multiplier: The Ra input is bit-serial and the Rb input is bit-parallel; see Fig. 7.11.

    Fig. 7.11 Example of a bit-serial iterative multiplier

  • During the execution of a multiplication, the partial product is present on the multiplier’s parallel output bits (Rc). These are initially zero.

  • If a_i = 1, for instance, then the Rb bits must be added to the existing partial product and then shifted one position to the left. This is a ‘shift-and-add’ operation. When a_i = 0, the Rb bits only have to be shifted one place to the left in a ‘shift’ operation and a zero LSB added to it.

  • Parallel multiplier: The bits of both inputs Ra and Rb are supplied and processed simultaneously. This ‘bit-parallel’ operation requires a different hardware realisation of the multiplier. Options include the array or parallel multiplier, schematically presented in Fig. 7.12.

    Fig. 7.12 A parallel multiplier
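The serial-parallel ‘shift-and-add’ scheme described above can be modelled behaviourally as follows. As a simplification, this sketch consumes Ra LSB-first and shifts a copy of Rb instead of shifting the partial product; the computed product is the same:

```python
# Behavioural model of 'shift-and-add' multiplication: one bit of Ra per step.
def serial_parallel_multiply(ra, rb, n_bits):
    partial = 0
    for i in range(n_bits):
        a_i = (ra >> i) & 1        # next serial bit of Ra
        if a_i:
            partial += rb << i     # 'shift-and-add': add the shifted Rb
        # if a_i == 0, only the (implicit) shift takes place
    return partial

print(serial_parallel_multiply(13, 11, 4))  # 143
```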

The array multiplier necessitates the choice of a structure for the addition of the partial products. The possibilities include the following:

  • Wallace tree: Here, bits with equal weights are added together in a tree-like structure, see Fig. 7.13. An advantage of the architecture is that the two input signals for each single adder always arrive at the same time, since they have propagated through identical delay paths. This will reduce the number of glitches at the outputs of the individual adder circuits, which may occur when there is too much discrepancy between the arrival times of the input signals.

    Fig. 7.13 Wallace tree addition

  • Carry-save array: Figure 7.14 illustrates the structure of this array, which consists of AND gates that produce all the individual x_i ⋅ y_j product bits and an array of full adders which produce the total addition of all product bits.

    Fig. 7.14 Array multiplier (parallel multiplier) with carry-save array

As an example, at this level, we choose the array multiplier (parallel multiplier) with carry-save array. Its behaviour differs from that of the serial multiplier, and thus leads to a different RTL description.
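A behavioural stand-in for this structure can be sketched as follows: AND gates produce every partial-product bit x_i ⋅ y_j, and full adders reduce each weight column, with the carries saved for the next column. This is only a software model of the hardware structure, not a timing-accurate description:

```python
def full_adder(x, y, z):
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    return s, c

def array_multiply(x, y, n_bits):
    # AND gates: collect every partial-product bit x_i * y_j in its
    # weight column 2**(i+j).
    columns = [[] for _ in range(2 * n_bits)]
    for i in range(n_bits):
        for j in range(n_bits):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    # Full adders: reduce each column, saving carries for the next column.
    result = 0
    carry_to_next = []
    for w, col in enumerate(columns):
        bits = col + carry_to_next
        carry_to_next = []
        while len(bits) >= 3:            # full adder: 3 bits -> sum + carry
            s, c = full_adder(bits.pop(), bits.pop(), bits.pop())
            bits.append(s)
            carry_to_next.append(c)
        if len(bits) == 2:               # half-adder step
            a, b = bits.pop(), bits.pop()
            bits.append(a ^ b)
            carry_to_next.append(a & b)
        result |= (bits[0] if bits else 0) << w
    return result

print(array_multiply(13, 11, 4))  # 143
```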

An example of RTL-VHDL description for the mixer is given in Fig. 7.20.

Fig. 7.15 Basic logic-gate implementation of a full adder

Fig. 7.16 Static CMOS realisation of the chosen full adder cell

Fig. 7.17 Decision tree for a complex system on a chip

Fig. 7.18 General representation of a design flow

Fig. 7.19 Non-memory SoC and IC content in 2013 (Source: IC Manage) [6]

Fig. 7.20 RTL-VHDL description of mixer

7.3.5 Logic-Gate Level

As stated in Sect. 7.4, the RTL description is often specified through hardware description languages (HDLs), such as VHDL and Verilog. It is then mapped onto a library of cells (logic gates). This is done by a logic synthesis tool, which transforms the VHDL code into a netlist (see example in Fig. 7.27). A netlist contains a list of the library cells used and of how they are connected to each other. Examples of such library cells (logic gates) are: AND, NAND, flip-flop, full adder, etc. As an example of the decisions that need to be taken at this logic level, we choose the full adder, from which we will build the array multiplier of Fig. 7.14. A full adder performs the binary addition of three input bits (x, y and z) and produces sum (S) and carry (C) outputs. Boolean functions that describe the operation of a full adder include the following:

  1. (a)

    Generation of S and C directly from x, y and z:

    $$\displaystyle\begin{array}{rcl} C& =& xy + xz + yz {}\\ S& =& x\overline{y}\,\overline{z} + \overline{x}\,\overline{y}z + \overline{x}y\overline{z} + xyz {}\\ \end{array}$$
  2. (b)

    Generation of S from C:

    $$\displaystyle\begin{array}{rcl} C& =& xy + xz + yz {}\\ S& =& \overline{C}(x + y + z) + xyz {}\\ \end{array}$$
  3. (c)

    Generation of S and C with exclusive OR gates (EXORs).

The choice among these implementations depends on what is required in terms of speed, area and power. Implementation (b) will contain fewer transistors than (a), but will be slower because the carry must first be generated before the sum can evaluate. The implementation in (c) is just shown as another alternative. Suppose our signal processor is used in a consumer video application where area is the dominant criterion; then, at this hierarchy level, implementation (b) is the obvious choice for realising our full adder. A logic-gate implementation is shown in Fig. 7.15.
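That forms (a) and (b) indeed compute the same sum and carry can be checked by exhaustively enumerating the three input bits:

```python
# Exhaustive check that full-adder forms (a) and (b) are logically equivalent
# and match the arithmetic sum x + y + z (2-bit result: carry, sum).
from itertools import product

def form_a(x, y, z):
    c = (x & y) | (x & z) | (y & z)
    s = ((x & ~y & ~z) | (~x & ~y & z) | (~x & y & ~z) | (x & y & z)) & 1
    return s, c

def form_b(x, y, z):
    c = (x & y) | (x & z) | (y & z)
    s = ((~c & (x | y | z)) | (x & y & z)) & 1   # S generated from C
    return s, c

for x, y, z in product((0, 1), repeat=3):
    expected = ((x + y + z) & 1, (x + y + z) >> 1)   # (sum, carry)
    assert form_a(x, y, z) == form_b(x, y, z) == expected
print("forms (a) and (b) agree for all 8 input combinations")
```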

7.3.6 Transistor Level

At this level, the chosen full adder must be mapped onto a number of transistors. In some design environments, the logic-gate level is not explicitly present and the higher level code is directly synthesised and mapped onto a ‘sea of transistors’. These are discussed in Sect. 7.6.6. The transistor level description depends on the chosen technology and the chosen logic style, such as dynamic or static CMOS. For the realisation of our full adder, we choose a static CMOS implementation, as shown in Fig. 7.16.

As this full adder consists of a relatively low number of transistors (30), it is efficient, both in terms of area and power dissipation, compared to the one realised with AND, OR and INVERT gates in Fig. 7.15. Note that both the sum S and carry C circuits are symmetrical with respect to their nMOS and pMOS transistor schematics, because the full adder is one of the few symmetrical logic functions, next to the half adder and the multiplexer.

Thus, the transistor level implementation of the logic gate is determined by either speed, area or power demands, as is actually every IC implementation. In this example we choose the implementation of Fig. 7.16 for our full-adder.

7.3.7 Layout Level

The chosen transistor implementation must be translated into a layout level description at the lowest abstraction level of a design. Most of the time, these layouts are made by specialists, who develop a complete library of different cells in a certain technology. Today, a library may consist of some 1500 different cells to support high-performance, low-power and low-leakage applications. There may be different cell versions of the same logic function, but with a different drive strength, a different threshold voltage and/or a different gate oxide thickness. However, special requirements on high speed or low power may create the need for custom design, to optimise (part of) the chip for that requirement. In Chap. 4, the layout process is explained in detail.

7.3.8 Conclusions

As shown in the signal processor example before, in the top-down design path, decisions have to be made at each level about different possible implementations. In this way, a decision tree arises. Figure 7.17 shows an example of a decision tree for the previously discussed signal processor system.

The decision tree starts at the highest level, i.e., the system level. Every time we move one level down in the tree, we focus on a smaller part of the design, which allows us to add sufficient detail to make the right decision at that level and then move to the next one. However, the decisions at each level can be strongly dependent on the possibilities available at a lower or at the lowest level. System designers who wish to achieve area-efficient implementations therefore require a reasonable knowledge of the consequences of their decisions at the implementation level. For instance, the decision to implement a double data bus structure (Fig. 7.10) requires twice as many interconnections as a single bus implementation. As a result, the implementation of a double bus will take twice the area, but it also doubles the noise contribution since it doubles the level of the simultaneously switching current.

Decision trees and abstraction levels basically reduce the complexity of design tasks to acceptable levels. However, the abstraction levels are also accompanied by verification problems. More levels can clearly increase verification difficulties. Requirements at a certain level of abstraction depend on details at a lower level. Details such as propagation delays, for example, can influence higher level timing behaviour.

For example, the final layout implementation of a full adder clearly influences its electrical behaviour. Delay times are also determined by factors such as parasitic wiring capacitances.

The bottom-up implementation and verification process begins at the layout level . Cell layouts are assembled to form modules, and these are combined to form the larger units that are indicated in the floor plan of the IC. The floor plan is a product of the top-down and bottom-up design process and is an accurate diagram which shows the relative sizes and positions of the included logic, analog, memory and interface cores. Cores that are identified as critical during the design path are usually implemented first. These are cores which are expected to present problems for power dissipation, area or operating frequency. Verification of their layouts reveals whether they are adequate or whether an alternative must be sought. This may have far-reaching consequences for the chosen architecture.

The inter-dependence of various abstraction levels and implementations clearly prevents a purely top-down design followed by purely bottom-up implementation and verification. In practice, the design process generally consists of various iterations between the top-down and bottom-up paths.

Abstraction level descriptions which contain sufficient information about lower-level implementations can limit the need for iterations in the design path and prevent wasted design effort. The maximum operating frequency, for example, of a module is determined by the longest delay path between two flip-flops. This worst-case delay path can be determined from suitable abstraction level descriptions and used to rapidly determine architecture feasibility. As an example, the multiplier in the previously discussed signal processor is assumed to contain the worst-case delay path.

The dimensions of logic cells in a layout library, for example, could be used to generate floor plan information such as interconnection lengths. These lengths, combined with specified delays for the library cells (e.g., full adder, multiplexer, etc.) allow accurate prediction of performance. The worst-case delay path can eventually be extracted from the final multiplier layout and simulated to verify that performance specifications are met.

The aim of modern IC-design environments is to minimise the number of iterations required in the design, implementation and verification paths. This should ensure the efficient integration of systems on silicon. Beyond the 30 nm node, designers face a continuously increasing design complexity caused by additional lithography, process and variability issues on top of the area, timing, power, leakage and noise issues that have existed since the 100 nm node. System integration and verification of multi-billion transistor designs with multiple clock and power domains (Chaps. 8 and 9) require smooth integration of reusable existing externally and in-house developed IP with newly designed IP. For many process nodes already, design verification has been the most costly part of the design cycle. For the complex ICs described above it may take even more than 60% of the total design cost, particularly when they also include a variety of analog IP.

7.4 Digital VLSI Design

7.4.1 Introduction

The need for CAD tools in the design and verification paths grows with increasing chip complexity. The different abstraction levels, as discussed in the previous subsection, were created to be able to manage the design complexity at each level.

7.4.2 The Design Trajectory and Flow

The continuous growth in the number of transistors on a chip is a drive for a greater integration of synthesis and system level design. The increasing complexity of the system level behaviour, combined with an increasing dominance of physical effects of devices (e.g., variability), supply lines (e.g., voltage drop and supply noise), and interconnections (e.g., propagation delay and cross-talk), is a drive for a greater integration of synthesis and physical design.

Figure 7.8 shows a heterogeneous system on a chip (SOC) . First, the entire design must be described in a complete specification. For several existing ICs, such a specification consists of several hundreds of textual pages. This design specification must be translated into a high-level behavioural description, which must be executable and/or emulatable.

In many cases, software simulation is too slow and inaccurate to completely verify current complex ICs. Also, the interaction with other system components is not modelled. Logic emulation is a way to let designers look before they really act. Emulation allows the creation of a hardware model of a chip. Here, proprietary emulation software is used, which is able to map a design on reprogrammable logic, and which mimics the functional behaviour of the chip. Emulation is usually done in an early stage of the design process and allows more effective hardware/software co-design . The validation/verification problem has also led to the introduction of hybrid simulator tools [2], which claim to speed up simulation by 10–100 times for a single-chip or multi-chip system. Once the high-level behavioural description is verified by simulation or emulation, all subsequent levels of design description must be verified against this top-level description. The introduction of standard verification methods such as OVM (Open Verification Methodology) and UVM (Universal Verification Methodology) is another attempt to deal with the verification complexity. These standards are supported by the major CAD vendors. Figure 7.18 shows a general representation of a design flow.

Synthesis tools automatically translate a description at a higher hierarchy level into a lower level one. These tools are available at several levels of abstraction. As systems continuously demand more performance improvement than the limited intrinsic gains of scaling to the next technology node can provide, the focus is currently shifting towards improved and more efficient algorithms. These algorithms require a higher level of design entry: MATLAB, C/C++, SystemC, or similar platforms, rather than RTL. High-level synthesis transforms a behavioural description into a sequence of possibly parallel operations which must be performed on an IC. The derivation of the ordering of operations in time is called scheduling .
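As a minimal illustration of scheduling, the sketch below assigns each operation of a small dataflow graph to the earliest possible control step (so-called ASAP, as-soon-as-possible, scheduling). The graph, which happens to model the optimised mixer expression discussed later in this chapter, and all names are invented for illustration:

```python
# Minimal ASAP scheduling sketch: each operation is assigned the earliest
# control step in which all of its operand-producing operations are done.
def asap_schedule(deps):
    """deps maps each operation to the list of operations it consumes."""
    step = {}
    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return step[op]
    for op in deps:
        visit(op)
    return step

# k*(A-B) + B as a dataflow graph: subtract, then multiply, then add.
ops = {"sub": [], "mul": ["sub"], "add": ["mul"]}
print(asap_schedule(ops))  # {'sub': 1, 'mul': 2, 'add': 3}
```

A real high-level synthesis tool must additionally respect resource constraints (how many adders/multipliers are allocated), which is what couples scheduling to the allocation step described next.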

The allocation (or mapping) process selects the required data-path components. These high-level components include complete signal processor and microprocessor cores, as well as co-processors, ALUs, RAMs and I/O blocks, etc. With some exceptions, high-level synthesis (HLS) tools focus on specific application domains, such as DSP and data-path designs, which are driven by regular streams of data samples. The design workflow requires knowledge of both software, to write C applications, and hardware, to parallelise tasks and resolve timing and memory management issues [3, 4]. For telecom and audio processor ICs, there are tools which are different from those created and used for the development of video signal processors. Behavioural synthesis tools, also called high-level synthesis tools, generate RTL hardware descriptions in VHDL or Verilog from the system specification. The RTL code of a logic block describes its functionality in detail; in fact, it describes the behaviour of every bit in that block at every clock cycle. Although research on high-level synthesis started in the late 1980s, industrial adoption has taken off slowly because of the long learning curve to be mastered and the difficulty of formally proving the equivalence between the high-level description and the synthesised RTL models. So far, it has been successfully applied in video and signal processing environments but has found only limited use in other areas [5].

Current and future systems on silicon (Fig. 7.8) are, and will be, designed by using a wide variety of pre-designed building blocks. This design reuse requires that these Intellectual Property (IP) parts, such as microcontrollers, micro- and graphics processors, memories and interfaces, can be easily ported from one chip design to another. Such a reuse must be supported by tools. Design reuse will be fuelled by the sharing of cores among companies. In many cases, a Reduced Instruction Set Computer (RISC) microprocessor core (ARM, MIPS, Sparc) is used. If we include the application (program) in an on-chip ROM or other type of memory, this is called embedded software .

A survey with 372 responses from design and verification engineers (Fig. 7.19) shows that, on average, 68% of their ASIC design content is reused IP, of which roughly two thirds is internally (in-house) developed IP [6]. The development of new design content is often done using an IP-based design approach, in which the design is partitioned into IP modules.

An overall ASIC design style thus requires several engineering teams working in parallel on managing and executing various design tasks: new IP creation, integration of new and reused IP, chip assembly (floor planning) and verification.

Synthesis tools must play a key role in integrating such pre-designed building blocks with synthesised glue logic onto one single chip. The most-used type of synthesis is from the RTL level to a netlist of standard cells. Each system on a chip can be considered to consist of many registers which store binary data. Data is operated upon by logic circuits between these registers. The operations can be described in a Register-Transfer Language (RTL). Before the VHDL code (or Verilog ) is synthesised at this level, the code must be verified by simulation.

At higher functional levels, software (VHDL) simulators are often sufficiently fast. However, in many cases, RTL-level simulation is a bottle-neck in the design flow. Besides the increase in IC complexity, longer frame times (as in MPEG video and DAB) must also be simulated. Such simulations may run for several days, resulting in too long iteration times and allowing only limited functional validation of an RTL design.

A hardware accelerator , with accompanying software, is a VHDL simulator platform in which the hardware is often realised with reconfigurable logic, e.g., with field-programmable gate arrays (FPGAs), or with a large multiprocessor system, which is connected to the network or a host system. Gate-level descriptions as well as memory modules can be downloaded into a hardware accelerator. However, most non-gate-level parts (RTL and test bench) are kept in software. The accelerator hardware speeds up the execution of certain processes (i.e., gates and memory) and the corresponding events. In fact, the accelerator is an integral part of the simulator and uses the same type of interface. Generally, the raw performance of a hardware accelerator is less than that of emulation .

When the RTL description is simulated and proven to be correct, RTL synthesis is used to transform the code (mostly VHDL or Verilog) into an optimised netlist. Actually, the described function or operation at RTL level is mapped onto a library of (standard) cells. Synthesis at this level is more mature than high-level synthesis and is widely used. The synthesis of the functional blocks and the composition of the complete IC is the work of the physical or back-end designer. Next to the logic synthesis, back-end design tasks include the place and route of the logic cells in the generated netlist, and the floor planning , which assigns the individual logic blocks, memories and I/O pins to regions in the chip. It also includes tasks that maintain signal integrity (crosstalk, supply noise, voltage drop, etc.), variability (parameter spread, transistor matching, etc.), reliability (electromigration, antenna rules, etc.) and design for manufacturability (DfM) (via doubling, metal widening or spreading, dummy metals, etc.). This back-end design is no longer a straightforward process, but it requires many iterations to cover all of the above design objectives simultaneously. This shows that the back-end design has become a very complex task, which needs to be supported by appropriate tools, smoothly integrated in the design flow.

Finally the design verification is also a growing part of both the front-end and back-end design trajectory. CAD tools are also used for the validation in the IC-design verification path. Simulation is the most commonly used design-verification method. Behavioural simulation is usually done on an IP block basis at a high abstraction level (algorithm/architecture). It runs quickly because it only includes the details of the behaviour and not of the implementation. Logic simulation is performed at RTL or netlist level and relates to the digital (or Boolean) behaviour in terms of logic 1’s and 0’s. Circuit simulation is the transistor level simulation of the behaviour of a schematic or extracted layout. It usually includes all device and circuit parasitics and results in a very accurate and detailed analog behaviour of the circuit. Due to the rapid increase in the IC’s complexity, it is impossible to completely simulate a system on a chip and verify that it will operate correctly under all conditions. Moreover, it is very difficult to envision and simulate all potential event candidates that may lead to problems. Achieving 100% verification coverage would require huge time-consuming simulations with an unlimited number of input stimuli combinations.

Luckily, there are other verification methods that complement the simulation. Formal verification is a mathematical method to verify whether an implementation is a correct model for the specification. It is based on reasoning and not on simulation. This verification may include the comparison of design descriptions at different levels of abstraction. Examples of this so-called equivalence checking are the comparison between behavioural description and RTL description, which checks whether the synthesis output is still equivalent to the source description, and the comparison between the RTL description and the synthesised netlist to prove equal functional behaviour. It does not prove that the design will work.
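The core idea of equivalence checking can be illustrated in miniature: compare two descriptions of the same combinational function on all input combinations. Real tools rely on BDDs or SAT solving rather than enumeration; the sketch below is only a brute-force illustration with invented names, comparing an arithmetic "RTL" view of a full-adder carry against a synthesised gate-level view:

```python
from itertools import product

def equivalent(f, g, n_inputs):
    """Exhaustively compare two combinational descriptions on all input
    combinations -- the brute-force core of equivalence checking.
    (Production tools use BDDs or SAT instead of enumeration.)"""
    return all(f(*v) == g(*v) for v in product((0, 1), repeat=n_inputs))

# "RTL" view of a 1-bit carry vs. a gate-netlist view of the same function.
rtl_carry  = lambda x, y, z: 1 if x + y + z >= 2 else 0
gate_carry = lambda x, y, z: (x & y) | (x & z) | (y & z)
print(equivalent(rtl_carry, gate_carry, 3))  # True
```

As the text notes, such a check only proves that two descriptions implement the same function; it does not prove that the design, or its specification, is correct.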

Timing verification is done at a lower hierarchy level. During a (deterministic) static timing analysis (STA) each logic gate is represented by its worst-case propagation delay. Then, the worst-case path delay is simply the sum of the worst-case delays of the individual gates in that path. Due to the increasing process-induced parameter spread in devices and interconnect structures, these worst-case numbers are often so high that this type of static timing analysis leads to design overkill, to less performance than in the previous technology node, or to incorrect critical paths. This has led to the introduction of statistical static timing analysis (SSTA) tools, which use probability distributions of random process variations and try to find the probability density function of the signal arrival times at each internal node and primary output. This type of analysis is considered necessary, particularly for complex high-performance ICs [7–9]. However, probability density functions are difficult to compute and the method needs to be simplified to make it a standard component of the verification process.
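Deterministic STA over a small netlist reduces to a longest-path computation over a directed acyclic graph, with each gate contributing its worst-case delay. The gate names and delay values (in picoseconds) in the sketch below are purely illustrative:

```python
# Deterministic STA sketch: each gate is represented by its worst-case
# propagation delay, and the worst-case arrival time at a node is that
# delay plus the latest arrival among its driving gates.
def worst_arrival(fanin, delay):
    """fanin maps gate -> list of driving gates; delay maps gate -> ps."""
    arrival = {}
    def visit(g):
        if g not in arrival:
            arrival[g] = delay[g] + max((visit(d) for d in fanin[g]), default=0)
        return arrival[g]
    return {g: visit(g) for g in fanin}

fanin = {"and1": [], "or1": [], "xor1": ["and1", "or1"], "ff_d": ["xor1"]}
delay = {"and1": 200, "or1": 300, "xor1": 500, "ff_d": 100}
times = worst_arrival(fanin, delay)
print(times["ff_d"])  # 900 -> this path limits the minimum clock period
```

SSTA replaces the scalar `delay` values by probability distributions and propagates those instead, which is why it is so much harder to compute.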

As a result of the growing number of transistors on one chip and with the inclusion of analogue circuits or even sensors on the same chip, verification and analysis have become serious bottle-necks in achieving a reasonable design turn-around time. Extensive verification is required at each level in the design flow and, as discussed before, there is a strong need for cross-verification between the different levels. Verification often consumes 40–60% of the total design time. With increasing clock speed and performance, packaging can be a limiting factor in the overall system performance. Direct attachment of chip-on-board and flip-chip techniques continue to expand to support system performance improvements. Verification tools are therefore needed across the chip boundaries and must also include the total interconnect paths between chips.

Finally, a set of data and script files, called the process design kit (PDK) , is used to enable the use of various EDA (electronic design automation) tools to support the full-custom design flow of the IC, from schematic entry to verified layout. In fact it acts as an interface between design and foundry. It is developed by the foundry. A PDK mainly consists of symbols, device models (transistors, capacitors, inductors and resistors), technology files (including process parameter spread), parameterised cells (Pcells), verification decks, a design rule manual containing layout, electrical and reliability design rules, etc.

Since PDKs are foundry-specific and include a complex set of files, a detailed description of their contents, their languages and formats is beyond the scope of this book, but can be found on the internet.

7.4.3 Example of Synthesis from VHDL Description to Layout

This subsection discusses the design steps of the digital mixer (see Sect. 7.3.4), starting at the RTL description level (in VHDL) and ending in a standard cell layout. Figure 7.20 shows the RTL-VHDL description of this mixer.

Figure 7.21a shows a high abstraction level symbol of this mixer, while a behavioural level representation is shown in Fig. 7.21b.

Fig. 7.21

(a) Abstraction level symbol and (b) behavioural level representation of the mixer

After synthesis, without constraints, our mixer looks as shown in Fig. 7.22.

Fig. 7.22

Mixer schematic after synthesis with no constraints

Figure 7.23 shows the multiplier and adder symbolic views after synthesis.

Fig. 7.23

Multiplier and adder symbolic views

Figure 7.24 shows the schematics of the adder, after synthesis with no constraints.

Fig. 7.24

Adder schematics after synthesis with no constraints

Figure 7.25 shows the schematics of the adder, after synthesis with a timing constraint for the worst-case delay path.

Fig. 7.25

Adder schematics after timing-constraint synthesis

The additional hardware in Fig. 7.25 compared to that of Fig. 7.24 is used to speed up the carry ripple by means of carry look-ahead techniques. Figure 7.26 shows the relation between the delay and the area. The figure clearly shows that reducing the delay by timing-constrained synthesis can only be achieved at the cost of considerable additional hardware (area).
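The speed/area trade-off can be made concrete with a small sketch: a ripple-carry adder has a logic depth proportional to the word length, whereas look-ahead logic computes each carry from generate (g = a·b) and propagate (p = a⊕b) terms. The Python below only models the carry values (the loop is sequential, but the hardware evaluates the look-ahead terms in parallel); the operand values are illustrative, least-significant bit first:

```python
# Sketch of the ripple vs. look-ahead trade-off. The depth model and the
# example operands are illustrative, not taken from the synthesised adder.
def ripple_depth(n_bits, fa_delay=1):
    return n_bits * fa_delay            # the carry passes through every stage

def lookahead_carries(a, b, c0=0):
    """All carries from generate g_i = a_i&b_i and propagate p_i = a_i^b_i."""
    carries, c = [c0], c0
    for ai, bi in zip(a, b):
        c = (ai & bi) | ((ai ^ bi) & c)  # c_{i+1} = g_i | p_i & c_i
        carries.append(c)
    return carries

print(ripple_depth(16))  # 16 unit delays for a 16-bit ripple-carry adder
# 11 + 5 = 16, so the 4-bit addition overflows into the carry-out:
print(lookahead_carries([1, 1, 0, 1], [1, 0, 1, 0]))  # [0, 1, 1, 1, 1]
```

The extra g/p gating is exactly the kind of "additional hardware" visible in Fig. 7.25, bought in exchange for a shorter worst-case carry path.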

Fig. 7.26

Relation between maximum delay and the amount of hardware (area)

Figure 7.27 shows a part of the netlist of library cells onto which the potentiometer function has been mapped. A netlist may contain instances, modules, pins and nets. An instance is the materialisation of a library cell or a module. A module, itself, is built from several instances and their connections. Pins, also called ports or terminals, represent the connection points to an instance or module and, finally, a net represents a connection between pins. The figure shows the different library cells and the nodes to which their inputs and outputs are connected.
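The netlist vocabulary above (instances of library cells, pins, nets) can be captured in a few lines of Python; the cell and net names below are invented, not taken from Fig. 7.27:

```python
# Tiny data model of a netlist: instances materialise library cells, and
# pins are connected to each other through named nets.
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    cell: str                                 # library cell it materialises
    pins: dict = field(default_factory=dict)  # pin name -> net name

netlist = [
    Instance("U1", "NAND2", {"A": "n1", "B": "n2", "Y": "n3"}),
    Instance("U2", "INV",   {"A": "n3", "Y": "n4"}),
]

# All pins attached to net n3 (one driver and one load in this example):
on_n3 = [(i.name, p) for i in netlist for p, n in i.pins.items() if n == "n3"]
print(on_n3)  # [('U1', 'Y'), ('U2', 'A')]
```

Real netlist formats (e.g., structural Verilog) carry the same information, plus modules that group instances hierarchically.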

Fig. 7.27

Part of the mixer netlist after synthesis with 14 ns timing constraints

The next step is to create the layout of this block. The place and route (P and R) tool places the netlist cells in rows and also creates the interconnections between the pins of the cells (or modules). Due to the growing complexity of IP cores, in combination with the need to accommodate higher chip performance, the physical design of these cores becomes a real challenge. Achieving timing closure on such complex blocks with very tight area, timing and power constraints is a difficult task. Cell placement is a critical part of the back-end design flow, as it has a severe impact on core area, wire length, timing and power requirements. P and R tools today allow area-driven, wire-length-driven, timing-driven as well as power-driven placement [10], and thus allow placement optimisation for various application domains. A timing-driven placement, for example, can assign higher weights to critical nets to reduce their wire length, as well as select faster cells, e.g., with higher drive capability and/or reduced threshold voltage, to reduce the critical path delay (see also Sect. 4.7).

After the use of place and route tools, a standard cell design of the mixer is created; see Fig. 7.28 for the result. This netlist and layout are the result of the chosen description of the mixer’s functionality according to:

$$\displaystyle{Z = k \cdot A + (1 - k) \cdot B}$$

This implementation requires two adders and two multipliers. However, an obvious optimisation of the same function may lead to a more efficient implementation. The following description

$$\displaystyle{Z = k \cdot (A - B) + B}$$

requires only two adders and one multiplier. This example shows that the decision taken at one hierarchy level can have severe consequences for the efficiency of the final silicon realisation in terms of area, speed and power consumption.
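The algebraic equivalence of the two descriptions is easy to confirm numerically; the sketch below (function names are ours) also documents the operator count of each form:

```python
# Two functionally identical mixer descriptions with different hardware cost.
def mix_direct(k, a, b):
    return k * a + (1 - k) * b   # 2 multipliers, 2 adders/subtractors

def mix_optimised(k, a, b):
    return k * (a - b) + b       # 1 multiplier, 2 adders/subtractors

for k in (0.0, 0.25, 0.5, 1.0):
    for a, b in ((3, 7), (-2, 5), (10, 10)):
        assert abs(mix_direct(k, a, b) - mix_optimised(k, a, b)) < 1e-12

print(mix_direct(0.25, 8, 4))  # 5.0
```

In hardware the saving is substantial, since a multiplier is far larger than an adder; a fixed-point implementation would additionally have to check that the intermediate (A − B) does not change the required word length.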

Fig. 7.28

Standard cell implementation of mixer

Although the synthesis process uses tools which automatically generate a next level of description, this process is controlled by the designer. An excellent design is the result of the combination of an excellent tool and a designer with excellent skills in both control of the tools and knowledge of IC design.

7.4.4 Floorplanning

When all required external, in-house and newly synthesised IP cores are available, these have to be integrated to create a compact chip, as shown in Fig. 3.10. Floor planning is an important part of the chip design cycle. The major modules are manually placed. Next, the blocks that have very intensive direct communication with each other must be positioned in each other’s close vicinity in order to limit power consumption and/or propagation delay across their signal interconnect wires.

Floor planning is also supported by the P and R tools in that they can change the aspect ratio of the synthesised standard-cell cores. The shape of such a chiplet is fully adjusted to the area requirements as defined by the floor plan. Other tools support further placement of the cores, based on their aspect ratios and pin positions. Some of these tools can also create and implement multi-voltage domains (see Chap. 8) to support on-chip power management [11]. DSPs, graphics processors, microprocessors and DDR memory interfaces are critical floor plan elements, as they are often timing-critical and take a relatively large share of the total power consumption. Therefore, in certain applications, these blocks must be distributed over the chip to prevent local overheating. Other important floor planning issues are:

  • chip level signal wiring and wire estimation

  • insertion of feed-throughs

  • distribution of power nets

  • clock distribution
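Chip-level wire estimation is often done with simple metrics; a common one is the half-perimeter wire length (HPWL) of the bounding box of each net's pins. A minimal sketch, with invented pin coordinates:

```python
# Half-perimeter wire length (HPWL): a first-order wire-length estimate
# used during floor planning to compare candidate placements.
def hpwl(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

# Pin positions (in um) of one chip-level net touching three cores:
net_pins = [(0, 0), (400, 100), (150, 300)]
print(hpwl(net_pins))  # 700
```

Summing this metric over all chip-level nets gives a quick figure of merit for a floor plan before any detailed routing is done.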

Figure 7.29 shows an example of a floor plan. For educational purposes, this example chip only contains a limited number of cores.

Fig. 7.29

Example floor plan of a chip

With the continuous growth of chip complexity, even state-of-the-art SoCs for mobile and consumer applications may contain more than 100 different cores, which may be distributed over different voltage and/or different clock domains and surrounded by 1000–2000 I/O and power supply pads. This is one of the reasons why interfaces have moved from parallel to serial architectures.

It is clear that floor planning has become one of the most crucial, critical and time-consuming tasks in a SoC design. The goal is not just to create the smallest chip area given the area and pin constraints of the individual cores; it is also extremely important to position them such that all chip-level timing and power requirements are achieved.

7.5 The Use of ASICs

The growth in the ASIC business is primarily the result of the increasing number of application areas and of the general increase in the use of ICs. ASICs often provide the only solution to problems attributed to speed and/or space requirements. Another incentive for the use of ASICs is the degree of concealment which they afford. This concealment poses extra difficulties to competitors interested in design duplication.

ASICs make it reasonably easy to add new functionality to an existing system without an extensive system redesign. In addition, the increased integration of system parts associated with the use of ASICs has the following advantages:

  • Reduced physical size of the system

  • Reduced system maintenance costs

  • Reduced manufacturing costs

  • Improved system reliability

  • Increased system functionality

  • Reduced power consumption.

The advantages afforded by ASICs can have a positive influence on the functionality/price ratio of products and have led to the replacement of standard ICs in many application areas. However, there are also disadvantages associated with the use of ASICs. These include the following:

  • The costs of realising an ASIC are quite substantial and less predictable than those associated with standard ICs .

  • Unlike standard products, ASICs are not readily available from a diverse number of suppliers. Inaccurate specifications or errors in the design process may cause delays in ASIC turn-around time and result in additional non-recurring engineering (NRE) costs. These are costs incurred prior to production. Typical NRE costs include the cost of:

    • Training and use of design facilities

    • Support during simulation

    • Placement and routing tools

    • Mask manufacturing (where applicable)

    • Test development

    • The delivery of samples.

    Furthermore, standard products are always well characterised and meet guaranteed quality levels. Moreover, small adjustments to a system comprising standard products can be implemented quickly and cheaply.

The advantages and disadvantages associated with the use of ASICs depend on the application area and on the required ASIC type and quantities. Improved design methods and production techniques combined with better relationships between ASIC customers and manufacturers have a considerable influence on the transition from the use of standard products to ASICs.

An ASIC solution in the above discussions does not necessarily imply a single chip or system-on-a-chip (SoC) solution; it might also refer to a system-in-a-package (SiP) solution. For a discussion on SoC versus SiP system solutions, the reader is kindly requested to read the appropriate subsection in Chap. 10.

7.6 Silicon Realisation of VLSI and ASICs

7.6.1 Introduction

In addition to the need for computer programs for the synthesis and verification of complex ICs, CAD tools are also required for the automatic or semi-automatic generation of layouts. The development of Intel’s Pentium and Xeon processors, for example, took several thousand man-years. The same holds for the IBM PowerPC. Figure 7.30 shows a photograph of an Intel Xeon processor. This Haswell-E/EP i7 Core processor in the Xeon family combines eight processor cores with a 2.56 MB L1 cache, a 1.28 MB L2 cache and a 20 MB L3 cache memory, resulting in a 356 mm2 chip containing 2.6 billion transistors. It uses a core voltage of 0.735 V and consumes 140 W when running at its maximum clock frequency of 3 GHz.

Fig. 7.30

The Intel Haswell-E/EP eight-core processor of the Xeon family (Courtesy of Intel)

In fact, the increased use of CAD tools in recent years has very often merely facilitated the integration of increasingly complex systems without contributing to a reduction in design time. This situation is only acceptable for very complex high-performance ICs such as a new generation of microprocessors. Less complex ICs, such as ASICs, require fast and effective design and layout tools. Clearly, the need for a fast design and layout process increases as the lifetimes of new ICs become shorter. The lifetime of a new generation of ICs for many mobile gadgets, for instance, is close to 1 year. This means that the design process may take only a couple of months.

Each layout design must be preceded by a thorough floor plan study. This must ensure that the envisaged layout will not prove too large for a single chip implementation in the final design phase. As discussed before, a floor plan study can take considerable time and only leads to a definite floor plan after an iterative trial-and-error process. Layouts of some parts of the chip may be required during the floor plan study.

Although we distinguish between the different ASIC categories of custom ICs, semi-custom ICs and PLDs in this book, the differences are rapidly diminishing as a result of the pace at which improvements in IC technologies are realised. PLDs are moving towards gate arrays, gate arrays are moving towards cell-based designs and cell-based designs may use sea-of-gates structures such as embedded arrays to implement the glue logic as well as for the mapping of cores onto such arrays. Each category uses the best features of the others.

The choice of implementation is determined by the required development time, production volume and performance. Table 7.2 summarises the performance of various layout implementation forms . This table is only valid in general terms.

Table 7.2 Comparison of performance of different layout implementation forms

The different layout implementation forms are discussed separately in the next subsections.

7.6.2 Handcrafted Layout Implementation

A handcrafted layout is characterised by a manual definition of the logic and wiring. This definition must account for all relevant layout design rules for the envisaged technology. The design rules of modern technologies are far more numerous and complex than those used in the simple initial nMOS process. However, various CAD tools have emerged which ease the task of creating a handcrafted layout. These include interactive computer graphic editors (or polygon pushers) , compactors and design-rule-check (DRC) programs.

An example of a handcrafted layout is illustrated in Fig. 7.31. Such an implementation yields considerable local optimisation. However, the required intensive design effort is only justified in MSI circuits and limited parts of VLSI circuits. The use of handcrafted layout is generally restricted to the design of basic logic and analog cells. These may subsequently be used in standard-cell libraries, module generators and bit-slice layouts, etc. Very high-speed designs may still require handcrafted design techniques, whether or not supported by simple CAD tools, but this is limited to those circuits that cannot achieve their performance targets with synthesis tools.

Fig. 7.31 Typical contours of a handcrafted layout

7.6.3 Bit-Slice Layout Implementation

A bit-slice layout is an assembly of parallel single-bit data paths. The implementation of a bit-slice layout of a signal processor, for example, requires the design of a circuit layout for just one bit. This bit slice is subsequently duplicated as many times as required by the word length of the processor. Each bit slice may comprise one or more vertically arranged cells. The interconnection wires in a bit slice run over the cells with control lines perpendicular to data lines. CAD tools facilitate the efficient assembly of bit-slice layout architectures. The bit-slice design style is characterised by an array-like structure which yields a reasonable packing density. Figure 7.32 illustrates an example of a bit-slice layout architecture. A bit-slice section is also indicated in the chip photograph in Fig. 7.50.
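The replication principle can be illustrated in software. The sketch below is a hypothetical example (not from the text): a single 1-bit full-adder "slice" is duplicated as many times as the word length requires, just as identical layout cells are abutted per bit in a bit-slice layout.

```python
# Hypothetical sketch: a 1-bit "slice" replicated to form an n-bit
# ripple-carry datapath, mirroring how a bit-slice layout duplicates
# one cell per bit of the processor's word length.

def adder_slice(a_bit, b_bit, carry_in):
    """One bit slice: sum and carry-out of a full adder."""
    s = a_bit ^ b_bit ^ carry_in
    c = (a_bit & b_bit) | (carry_in & (a_bit ^ b_bit))
    return s, c

def bit_slice_adder(a, b, width):
    """Replicate the slice 'width' times, LSB first (like abutting cells)."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = adder_slice((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry

# An 8-bit word is simply eight identical slices
assert bit_slice_adder(100, 55, 8) == (155, 0)
```

The control lines of a real bit-slice layout would run perpendicular to the data flow, driving the same logic in every slice simultaneously.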

Fig. 7.32 Basic bit-slice layout

Fig. 7.33 Logic functions realised with a ROM

Fig. 7.34 Basic PLA structure

Fig. 7.35 Basic standard-cell layout

Fig. 7.36 Floor plan for (a) conventional and (b) channel-less gate arrays

Fig. 7.37 An example of a gate array master-cell structure and floor plan

Fig. 7.38 Sea-of-transistors array with gate isolation

Fig. 7.39 General representation of an FPGA architecture

Fig. 7.40 Example of a 4-input configurable block

Fig. 7.41 Example of a 4-input LUT

Fig. 7.42 Example of a configurable switch matrix

Fig. 7.43 Example of a basic PAL architecture implementing three different logic functions of three inputs

Fig. 7.44 Use of floating gate transistors to realise AND-array connections in CPLD (Source: IEEE Design and Test of Computers)

Fig. 7.45 Altera’s MAX V CPLD architecture (Courtesy of Altera)

Fig. 7.46 Architecture of an embedded array ASIC (Source: ICE)

Fig. 7.47 Example architecture of Nextreme structured ASIC (Courtesy of eASIC)

Fig. 7.48 Meet-in-the-middle strategy

Fig. 7.49 Cost comparison of the different layout implementation forms

Fig. 7.50 A conventional microprocessor chip which combines different layout implementation forms (Source: NXP Semiconductors)

The AMD Am2901 is an example of a bit-slice architecture. Today this layout style has become less popular, because it requires a lot of manual design effort, while the standard-cell approach, discussed in Sect. 7.6.5, offers a fully synthesisable alternative.

7.6.4 ROM, PAL and PLA Layout Implementations

In addition to serving as a memory, a ROM can also be used to implement logic functions . An example is shown in Fig. 7.33.

Only one vertical line in this ROM will be ‘high’ for each combination of address inputs \(x_{n}\ldots x_{0}\). This vertical line drives the gates of m + 1 transistors in the OR-matrix. The outputs \(F_{j}\) that are connected to the drains of these transistors will be ‘low’. If, for example, the address inputs are given by \(x_{0}x_{1} = 10\), then the second column line will be ‘high’. A ‘low’ will then be present on outputs \(F_{1}\) and \(F_{2}\). The information stored in the ROM in Fig. 7.33 is thus determined by the presence or absence of connections between MOS transistor drains and the output lines. In this way, the structure of a ROM can easily be used to realise logic functions. Table 7.3 shows a possible truth table, which could be implemented with the ROM in Fig. 7.33.

Clearly, the set of logic functions that can be realised in a ROM is merely limited by the number of output and address bits. The regular array structure of a ROM leads to a larger transistor density per unit of chip area than for random logic. A large number of logic functions could, however, require an excessively large ROM while the use of a ROM could prove inefficient for a small number of logic functions. In general, a ROM implementation is usually only cheaper than random logic when large volumes are involved.
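The principle can be mimicked in software: a ROM used as logic is simply a table indexed by the address inputs, and the stored word directly contains the function values. The contents below are illustrative only (they are not Table 7.3).

```python
# Illustrative sketch: a ROM implementing two logic functions (F1, F0)
# of two inputs. Each address selects one stored word; the word *is* the
# set of function values, so no gate network is needed. Note that adding
# one more input would double the number of words.

ROM = {
    0b00: 0b01,   # x1=0, x0=0  ->  F1=0, F0=1
    0b01: 0b10,   # x1=0, x0=1  ->  F1=1, F0=0
    0b10: 0b11,   # x1=1, x0=0  ->  F1=1, F0=1
    0b11: 0b00,   # x1=1, x0=1  ->  F1=0, F0=0
}

def rom_logic(x1, x0):
    word = ROM[(x1 << 1) | x0]
    return (word >> 1) & 1, word & 1   # (F1, F0)

assert rom_logic(1, 0) == (1, 1)
```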

Unfortunately, there are no easy systematic design procedures for the implementation of logic functions in ROM. Other disadvantages are as follows:

  • Lower operating frequency for the circuit

  • The information in a ROM can only be stored during manufacturing

  • Increasing the number of input signals by one causes the width of the ROM to double

  • A high transistor density does not necessarily imply an efficient use of the transistors.

Table 7.3 Example of a truth table implemented with the ROM in Fig. 7.33

It is clear from Fig. 7.33 that the vertical column lines in a ROM represent the product terms formed by the address inputs \(x_{i}\). These product terms comprise all of the logic AND combinations of the address inputs and their inverses. Only the OR-matrix of a ROM can be programmed.

Figure 7.34 illustrates the basic structure of a programmable logic array (PLA). Its structure is similar to that of a ROM and consists of an AND matrix and an OR-matrix . In a PLA, however, both matrices can be programmed and only the required product terms in the logic functions are implemented. It is therefore more efficient in terms of area than a ROM. Area requirements are usually further reduced by minimising the number of product terms before generating the PLA layout pattern.

The logic functions implemented in the PLA in Fig. 7.34 are determined as follows: \(a_{0}\) is ‘high’ when x is ‘low’ and z is ‘high’, i.e., \(a_{0} = \overline{x}z\). Similarly, \(a_{1} = x\overline{y}\,\overline{z}\) and \(a_{2} = \overline{x}yz\).

The outputs are therefore expressed as follows:

\(F_{0} = \overline{a_{1}} = \overline{x\overline{y}\,\overline{z}}\)

\(F_{1} = \overline{a_{0} + a_{2}} = \overline{\overline{x}z + \overline{x}yz}\)

\(F_{2} = \overline{a_{0} + a_{1}} = \overline{\overline{x}z + x\overline{y}\,\overline{z}}\)

A PLA can be used to implement any combinatorial network comprising AND gates and OR gates. In general, the complexity of a PLA is characterised by (A + C) × B, where A is the number of inputs, B is the total number of product terms, i.e., the number of inputs for each OR gate, and C is the number of outputs, i.e., the number of available logic functions.
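The PLA of Fig. 7.34 can be expressed directly in a software sketch (modelling the logic only, not the layout): the AND plane forms only the three required product terms, and the OR plane combines them into the inverted outputs. The complexity measure (A + C) × B for this example is (3 + 3) × 3 = 18.

```python
# Sketch of the PLA of Fig. 7.34. The AND plane produces only the
# product terms actually needed; the OR plane combines (and here
# inverts) them, as in the NOR-style outputs of the figure.

def pla(x, y, z):
    # AND plane: three product terms
    a0 = (not x) and z                  # a0 = x'·z
    a1 = x and (not y) and (not z)      # a1 = x·y'·z'
    a2 = (not x) and y and z            # a2 = x'·y·z
    # OR plane with inverted outputs
    F0 = not a1                         # F0 = (a1)'
    F1 = not (a0 or a2)                 # F1 = (a0 + a2)'
    F2 = not (a0 or a1)                 # F2 = (a0 + a1)'
    return F0, F1, F2

# Complexity (A + C) x B: 3 inputs, 3 outputs, 3 product terms
assert (3 + 3) * 3 == 18
assert pla(0, 0, 1) == (True, False, False)   # only a0 is active
```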

Sequential networks can also be implemented with PLAs. This, of course, requires the addition of memory elements. A PLA can be a stand-alone chip or an integral part of another chip such as a microprocessor or a signal processor. PLAs are frequently used to realise the logic to decode microcode instructions for functional blocks such as memories, multipliers, registers and ALUs. Several available CAD tools enable a fast mapping of logic functions onto PLAs. As a result of the improvements in cell-based designs, ROM and PLA implementations are becoming less and less popular in VLSI designs. Another realisation form is the Programmable Array Logic (PAL) . In this concept, only the AND plane is programmable and the OR plane is fixed. Figure 7.43 shows an example of a PAL architecture.

Table 7.4 summarises the programmability of planes (AND, OR) in the ROM, PAL and PLA devices. Programmable techniques include fuses (early and smaller devices), floating gate transistors ((E)EPROM) and flash devices. In some cases, a ROM (PLA) block is still used in a custom design; the programming is done by a mask. These are then called mask-programmable ROMs (PLAs) . Most of the former ROM, PAL, PLA applications are currently implemented by the more flexible field-programmable gate arrays (FPGA) and complex PLDs (CPLDs), which are discussed in Sect. 7.6.7.

Table 7.4 Programmability of AND and OR planes in ROM, PAL or PLA devices

7.6.5 Cell-Based Layout Implementation

Figure 7.35 shows a basic layout diagram of a chip realised with standard cells .

In this design style, an RTL description of the circuit is synthesised and mapped onto a number of standard cells which are available in a library, see Sect. 4.7. The resulting netlist normally contains no hierarchy. The standard-cell library usually consists of a large number of different types of logic gates, which are all of equal height (Fig. 4.47).

Today’s libraries may contain between 500 and 2000 cells, due to a large variety of drive strengths and different threshold voltages (HV\(_{T}\), SV\(_{T}\) and LV\(_{T}\), respectively referring to high, standard and low \(V_{T}\)). This enables the synthesis tools to better adapt a design to such performance requirements as high speed, low power or low leakage, for example.

The standard-cell design flow is supported by mature synthesis and place-and-route (P&R) tools (Sect. 7.4). Routing is done at a fixed grid across the logic gates. The supply lines are specially structured to create a supply network with minimum resistance; this network is usually an integral part of the standard-cell design approach. The clock network is usually generated by a clock-tree synthesis tool, which creates balanced clock trees to reduce intrinsic clock skew and also deals with timing constraints. However, many clock-synthesis tools balance different clock paths by compensating interconnect RC delay in one path with buffer delays in another, leading to a different sensitivity to PVT variations per path. High-speed processors use relatively large clock grids, leading to less clock skew and less sensitivity to PVT variations, but at increased power levels. In addition, they require a detailed analysis of all parasitic resistive, capacitive and inductive effects, including the modelling and simulation of the current return paths. Modern standard-cell design environments facilitate the inclusion of larger user-defined cells in the library. These blocks, macros or cores may include multipliers, RAMs, signal processor cores, microprocessor cores, etc.

During the late 1980s, extra attention was paid to advanced circuit test methods. These include scan test and self-test techniques, see Sect. 10.2.1. The scan technique uses a sequential chain of intrinsically available flip-flops to allow access to a large number of locations on an IC or on a printed circuit board. The self-test technique requires the addition of dedicated logic to an existing design. This logic generates the stimuli required to test the design and checks the responses. The result is a logic circuit or a memory which is effectively capable of testing itself. Details of IC testing are discussed in Chap. 10.

7.6.6 (Mask Programmable) Gate Array Layout Implementation

A gate array is also called a mask-programmable gate array (MPGA) . A conventional gate array contained thousands of logic gates, located at fixed positions. The layout could, for example, contain 10,000 3-input NAND gates. The implementation of a desired function on a gate array is called customisation and comprises the interconnection of the logic gates. The interconnections were located in dedicated routing channels , which were situated between rows of logic gates. In these conventional channelled gate arrays, the routing was often implemented in two metal layers.

This type of gate array is depicted in Fig. 7.36a. The channels are essential for interconnecting the cells when production processes with one or even two metal layers are involved.

In a conventional gate array, the ratio between the available cell and routing channel areas was fixed. Obviously, the actual ratio between the areas used was dependent on the type of circuit. In practice, the available area is rarely used optimally. This drawback is especially important for larger circuits. Furthermore, larger circuits require more complex interconnections, which increases the density in the routing channels. The channel-less gate array architecture was therefore introduced. Other names encountered in literature for this architecture include: high-density gate array (HDGA), channel-free gate array, sea-of-gates, sea-of-transistors and gate forest.

Figure 7.36b shows the floor plan for a channel-less gate array. It consists of an array of transistors or cells. It does not contain any specially reserved routing channels. In the 1990s, more advanced gate arrays comprised an array of master cells, which consist of between four and ten transistors. In some cases, the master cells are designed to accommodate optimum implementations of static RAMs, ROMs or other special circuits. A given memory or logic function is implemented by creating suitable contact and interconnection patterns in three or more metal layers. The master cells in a gate array can be separated by field oxide isolation, which is created by using the STI technique described in Chap. 3. An example of such a gate array master-cell structure is shown in Fig. 7.37, which also shows an example of a gate array floor plan.

Figure 7.38 shows a section of a sea-of-transistors array, which comprises a row of pMOS and nMOS transistors. The complete array is created by copying the section several times in the horizontal and vertical directions. These gate arrays are also often called continuous arrays or uncommitted arrays . The rows are not separated by routing channels and the floor plan is therefore the same as shown in Fig. 7.36b. These gate array architectures facilitate the implementation of large VLSI circuits on a single gate array using a large number of metal layers. The logic and memory functions are again realised through the interconnection and contact hole patterns.

The various logic gates and memory cells in a sea-of-transistors architecture are separated by using the gate-isolation technique illustrated in Fig. 7.38.

The layout in the figure is a D-type flip-flop , based on the logic diagram shown. The gate-isolation technique uses pMOS and nMOS isolation transistors, which are permanently switched off by connecting them to supply and ground, respectively. This technique obviously requires both an nMOS and a pMOS isolation transistor between neighbouring logic gates [12].

The NRE costs of these devices depended on circuit complexity and were in the order of 100 k$–1 M$. Small transistors placed in parallel with larger transistors facilitate the integration of logic cells with RAMs, ROMs and PLAs in some of these HDGA architectures [13].

The design methods used for gate arrays are becoming increasingly similar to those used for cell-based design. This trend facilitates the integration of scan-test techniques in gate array design. As a result of the increasing number of available cells, the software for gate array programming resembles that of cell-based designs. Complete reusable cores (IP) are also becoming available for gate array implementation.

Off-the-shelf families of gate arrays are available and are prefabricated up to and including the transistors, with their source and drain implants. Customisation therefore only requires the processing of several contact and metal masks. This facilitates a short turn-around time in processing and renders gate arrays suitable for fast prototyping.

Gate array publications include advanced low-power schemes and technologies (SOI). For high-speed gate arrays, gate delays (3-input NOR with a fan-out of two) below 50 ps have been reported. The complexity of advanced gate arrays has exceeded several tens of millions of gates. The popularity of these (mask-programmable) gate arrays reached a maximum during the nineties. The last decade showed a dramatic reduction in new gate array design starts, mainly due to the rapid cost reduction and gate complexity increase of the field-programmable gate arrays. These FPGAs have now completely taken over the MPGA market.

The subject is still included in this book for two reasons. First, an MPGA shows that a digital circuit simply consists of a bunch of identical transistors, whose functionality is only determined by the way they are interconnected. Its architecture is very similar to today’s litho-friendly library cells, which have reached the regularity of the mask-programmable gate array architecture in that they also show fixed-pitch transistor gates in technology nodes at and beyond 60 nm. Second, MPGA approaches are still being used in structured ASICs. An example is the Fit Fast Structured Array (FFSA™) of Toshiba, which can be configured by customising only a few metal layers, reducing turn-around time to as little as 5 weeks from RTL hand-off to sample delivery (see the product list at the Toshiba website). FPGAs are the subject of the next paragraph.

7.6.7 Programmable Logic Devices (PLDs)

A PLD is a Programmable Logic Device , which can be programmed by fuses, anti-fuses or memory-based circuits. Another name currently also used for a certain category of these devices is Field Programmable Device (FPD) . The first user-programmable device that could implement logic was the programmable read-only memory (PROM), in which address lines serve as logic inputs and data lines as output (see also Sects. 6.5.3.2 and 7.6.4). PLD technology has moved from purely conventional bipolar technology, with a simple fuse-blowing mechanism, to complex architectures using antifuse, (E)EPROM, flash or SRAM programmability.

As a result of the continuous drive for increased density and performance, simple PLDs are losing their market share in favour of the high-density flexible PLD architectures. In this way, PLDs are moving closer and closer towards a gate array or cell-based design and are a real option for implementing systems on silicon . Another piece of evidence for this trend is the fact that several vendors are offering libraries of embedded cores and megacells. In the following, several architectures are presented to show the trend in PLDs.

7.6.7.1 Field Programmable Gate Arrays (FPGAs)

FPGAs combine the initial PLD architecture with the flexibility of an In-System Programmability (ISP) feature. Many vendors currently offer very high-density FPGA architectures to facilitate system-level integration (SLI). Current FPGAs are mostly SRAM-based and combine memory and Look-Up Tables (LUTs) to implement the logic blocks. Vendors offering LUT-based FPGAs include Xilinx (Spartan for a low-power footprint, extreme cost sensitivity and high volume, Artix for cost-sensitive high-volume markets, Kintex as mid-range family, Zynq for high-end embedded systems) and Altera (Stratix for high-end applications, Arria as mid-range family, Cyclone for low-power cost-sensitive markets and MAX 10 with non-volatile capability for cost-sensitive markets).

Initially, FPGAs were used to integrate the glue logic in a system. However, the rapid increase in their complexity and flexibility makes them potential candidates for the integration of high-performance, high-density (sub)systems, previously implemented in gate arrays [14]. The potential of an FPGA will be discussed on the basis of a generic FPGA architecture (Fig. 7.39).

Today, these architectures consist of a large array of hundreds of thousands of programmable (re)configurable logic blocks and configurable switch matrix blocks. A logic block generally offers both combinatorial and sequential logic. Figure 7.40 shows an example of a configurable block.

In many FPGA architectures the configurable block includes one or more look-up tables (LUTs), one or more flip-flops and multiplexers. Some also contain carry chains to support adder functions. The combinatorial logic is realised by the LUTs, which each may contain 3–8 inputs. Figure 7.41 shows an example of a 4-input LUT.

It is basically a small memory consisting of sixteen memory cells and a couple of multiplexers. By changing the values in these memory cells (when the application is loaded into the FPGA), any logic function (F) of the four inputs (a, b, c, and d) can be created. The data stored in the memory cells of the example represents the following logic function:

$$\displaystyle{F = \overline{a} \cdot \overline{b} \cdot \overline{c} \cdot \overline{d} + a \cdot b \cdot c \cdot d}$$

The LUT, however, can also serve as a distributed memory in the form of synchronous or asynchronous, single or dual-port SRAM or ROM, depending on the needs of the application.
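The behaviour of this 4-input LUT can be sketched as a small memory addressed by the inputs (the bit ordering a, b, c, d → MSB…LSB is an assumption for illustration). The sixteen configuration bits implement F = a'·b'·c'·d' + a·b·c·d by storing a '1' only at addresses 0000 and 1111.

```python
# Sketch of the 4-input LUT of Fig. 7.41 as sixteen configuration bits.
# The stored contents implement F = a'·b'·c'·d' + a·b·c·d, i.e. F is '1'
# only when all inputs are '0' or all inputs are '1'.

LUT = [0] * 16
LUT[0b0000] = 1
LUT[0b1111] = 1

def lut(a, b, c, d):
    # The multiplexer tree of the LUT simply selects one memory cell.
    return LUT[(a << 3) | (b << 2) | (c << 1) | d]

assert lut(0, 0, 0, 0) == 1 and lut(1, 1, 1, 1) == 1
assert lut(1, 0, 1, 0) == 0
```

Reloading the sixteen cells reconfigures the block to any other of the 2¹⁶ possible functions of four inputs, which is how an FPGA "programs" its combinatorial logic.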

Many FPGAs contain short wire segments for local interconnections as well as long wire segments for ‘long distance’ interconnections. The logic blocks are connected to these wire segments by the configurable switch matrix blocks. Figure 7.42 shows an example of such a block.

The individual switches in such a block are controlled by so-called configuration memory cells, whose data is also stored when the application is loaded into the FPGA. Most FPGAs use SRAMs to store the configuration bits, although there are also a few that store them in a non-volatile EEPROM or flash memory. All FPGAs that use SRAM for configuration storage need a shadow non-volatile back-up memory on the board to be able to quickly download the application into the on-chip configuration memory. Downloading from a software program would lead to relatively large configuration times whenever the application is started again after a power-down.
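A switch matrix can be modelled as one configuration bit per possible connection between crossing wire segments; these bits are loaded together with the rest of the configuration memory. The class and method names below are purely illustrative.

```python
# Illustrative model of a configurable switch matrix: each crossing of a
# horizontal track h and a vertical track v has one configuration bit
# (loaded from SRAM when the application is downloaded). A '1' closes the
# switch and ties the two wire segments together.

class SwitchMatrix:
    def __init__(self, h_tracks, v_tracks):
        self.config = [[0] * v_tracks for _ in range(h_tracks)]

    def program(self, h, v, closed=1):
        self.config[h][v] = closed     # one SRAM configuration bit

    def connected(self, h, v):
        return bool(self.config[h][v])

sm = SwitchMatrix(4, 4)
sm.program(1, 2)   # route horizontal track 1 onto vertical track 2
assert sm.connected(1, 2) and not sm.connected(0, 0)
```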

Next to the configurable logic and switch matrix blocks, many FPGA architectures include dedicated IP cores, digital signal processors (DSPs), microprocessors such as ARM and PowerPC, single and/or dual port SRAMS, flash memories, and multipliers.

Finally, most I/O blocks support a variety of standard and high-speed interfaces. Examples of single-ended interfaces are: LVTTL, LVCMOS, PCI, PCI-X, I2C, UART, GPIO, USB, GTL and GTLP, HSTL and SSTL. Examples of differential I/O standards are: LVDS, Extended LVDS (2.5 V only), BLVDS (Bus LVDS) and ULVDS, HyperTransport™, Differential HSTL and SSTL.

Of course, several dedicated memory interfaces, such as DDR, DDR2, DDR3 and SDRAM, as well as memory controllers, are also supported.

Among the state-of-the-art FPGAs are the Xilinx Virtex UltraScale family and the Altera Stratix 10 family. To get a flavour of the potential of these FPGAs, some characteristic parameters of both families are presented. The Virtex UltraScale family includes about 5.5 million logic cells, 2880 DSP slices, 88.6 Mb of block RAM and a maximum of 1456 I/O pins. The Altera Stratix 10 FPGA contains 5.5 million logic elements, an integrated quad-core 64-bit ARM Cortex-A53, floating-point DSP capability with 1980 DSP blocks, 166 Mb of embedded memory and a maximum of 1680 I/O pins. Both FPGAs are fabricated in 14–16 nm FinFET technologies.

The design flow used to develop an FPGA application has similarities with the previously discussed standard-cell design flow. An RTL-level VHDL or Verilog description is simulated to validate the system requirements. Next, a synthesis tool maps the design to a netlist, which is then translated into a gate-level description. At this level the design is simulated again to verify its consistency with the original RTL-level simulation. Finally, this gate-level description is mapped onto the FPGA logic and sequential resources, while timing data is added. A final simulation, including these timing details, must then confirm whether the system requirements are met.

Further details of state-of-the-art FPGAs can easily be found on the internet and are beyond the scope of this book.

This section is meant to present the basic architecture of an FPGA and a flavour of the potential of current state-of-the-art FPGAs. As explained before, the reconfigurability of most FPGAs (logic as well as interconnect) is controlled by on-chip configuration SRAM bits, which requires an additional non-volatile configuration back-up memory on the board.

7.6.7.2 Complex Programmable Logic Devices (CPLDs)

The structure of a PLD has evolved from the original PAL™ devices, which implement sum-of-products (minterm) expressions, where the AND-array is programmable and the OR-array is fixed (see Sect. 7.6.4). Compared to PLAs, PALs lack flexibility, but show shorter propagation delays and require less complex software. Figure 7.43 shows an example of the basic PAL architecture, which implements three logic functions.
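The difference with a PLA can be made explicit in a sketch (with a hypothetical term assignment, not that of Fig. 7.43): in a PAL only the AND plane is programmable, while the OR plane is fixed, so each output always ORs its own fixed group of product terms.

```python
# Hypothetical PAL sketch. Programmable part: each product term lists
# (input_index, required_value) pairs. Fixed part: output j always ORs
# product terms 2j and 2j+1 -- the OR plane cannot be reprogrammed.

product_terms = [
    [(0, 1), (1, 0)],   # p0 = x0 · x1'
    [(2, 1)],           # p1 = x2
    [(0, 0), (2, 0)],   # p2 = x0' · x2'
    [(1, 1), (2, 1)],   # p3 = x1 · x2
]

def pal(x):
    def term(t):
        # A product term is true when every listed input has its value.
        return all(x[i] == val for i, val in t)
    # Fixed OR plane: output j = p(2j) + p(2j+1)
    return [term(product_terms[2 * j]) or term(product_terms[2 * j + 1])
            for j in range(2)]

assert pal([1, 0, 0]) == [True, False]   # p0 true -> F0; no term of F1 true
```

Reprogramming is limited to editing the `product_terms` lists; a PLA would additionally let each output pick any subset of the terms.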

The connections in the AND-array of a CPLD are commonly realised by programming non-volatile memory cells implemented as floating-gate transistors (Fig. 7.44). This means that a CPLD can be (re)programmed using in-system programmability and that it securely retains its program, even when it is powered off.

There is no technical reason why the previously discussed FPGAs use SRAM or anti-fuse programming techniques instead of non-volatile ones, other than that the fabrication process is cheaper.

The original simple PLDs only implemented some tens of logic functions. A large design had to be mapped onto a couple of PLDs, which became a barrier for PLD usage. As a result, ASIC vendors started developing PLDs with much larger arrays and the complex PLD or CPLD was born. CPLDs are offered by a large number of vendors, including Altera (MAX II and MAX V families), Xilinx (CoolRunner™-II and XC9500XL™ series), Lattice Semiconductor (MachXO3 family), Atmel (ATF15xx CPLD family), etc. Most CPLD architectures look very similar and are based on the previously discussed PAL and/or PLA architectures. Since the logic depth of these arrays is relatively short, even wide-input PLD functions offer short pin-to-pin propagation delays. Many of them also include registers, but their total complexity in terms of equivalent logic gates and flip-flops is usually relatively low compared to FPGAs.

An example of a CPLD architecture is shown in Fig. 7.45.

As stated before, the total complexity of most CPLDs in terms of equivalent logic gates and flip-flops is relatively low compared to FPGAs. They are therefore often fabricated in relatively conventional process nodes and used in small systems to implement complex finite-state machines, fast and wide decoders or high-performance control logic. Because their functionality is stored in a non-volatile way, most CPLDs are also suited for use in applications where they can be completely switched off during idle times, without losing their functionality as an SRAM-based FPGA would. High-end (high-complexity) CPLD applications show some overlap with low-end FPGAs. Because of their large number of flip-flops and their dynamic reconfigurability, FPGAs are much more flexible in use than CPLDs.

7.6.7.3 Programmability of FPGAs and CPLDs

The most important switch-programming techniques currently applied in FPGAs are SRAM, anti-fuse and non-volatile memory cells. Figure 7.42 shows an example of a configurable switch matrix to configure the routing of signals through available interconnect patterns. SRAM cells or flip-flops are also used in a look-up table to configure logic functions (Fig. 7.41).

In the majority of current commercially available CPLDs, the switches are implemented as floating-gate devices, like those in (E)EPROM and flash technologies (Fig. 7.44) [15]. However, CPLDs with SRAM programmability have also appeared on the market. Here, the switches are used to program the AND- and OR-arrays of the PAL, see Fig. 7.43. In 90% of the CPLDs, the connections are made through programmable multiplexers or full cross-point switches. If an input is not used in a product term (minterm) in an AND plane of a CPLD, the corresponding EPROM gate transistor is programmed to be in the off-state. Similar architectures can be built with EEPROM transistors.

7.6.8 Embedded Arrays, Structured ASICs and Platform ASICs

The previously discussed cell-based designs (Sect. 7.6.5) may include standard cells, macro cells, embedded memory blocks, IP cores, etc. A different approach is the inclusion of embedded arrays. In most cell-based designs that include an embedded array, all masks are customised, just as in purely cell-based designs. Embedded arrays combine a gate array-like structure with large cells such as microprocessor cores, memories and I/O functions. Cores can either be mapped onto the sea-of-gates array (see Sect. 7.6.6) or be implemented as separate blocks. Figure 7.46 shows the architecture of an embedded array ASIC.

The idea behind such an ASIC is to reduce the total turn-around time from spec definition to first silicon. During the first 20% of the spec development time, almost 80% of the system is defined. So, at that time, the engineers already know which memory type (SRAM, DRAM, flash, etc.) and how much memory is needed, what type of IP cores (CPU, DSP, ARM, analog IP, etc.) are needed and also what type of I/Os the application requires. A rough estimate of the required number of logic gates can also be made at that time. These gates are then implemented as a kind of mask-programmable (sea-of-gates) array. The chip is then sent to the fab and processed up to the final back-end masks (metal layers and vias), while the design team completes the remaining 80% of the spec to come to the final spec definition. After completing the spec, only the final metal and via masks need to be defined and processed, thereby reducing the turn-around time and, more specifically, the time-to-market. Even last-minute design (spec) changes are allowed.

Due to the very short lifetimes of today’s products in many consumer and communication markets, it has become very important to be able to put prototype products quickly on the market, perform a fast customer product review and transfer the design, if necessary, into a high-volume standard-cell design. Toshiba uses this embedded array concept in their UniversalArray ASIC architecture, where the customer can define his own ASIC, with a selection of various available IPs and I/Os, and with the logic implemented on a (sea-of-gates) gate array, available in their Fit Fast Structured Array (FFSA) series [16]. In normal standard-cell blocks, the empty areas are filled with filler cells, which do not contain any transistors, but are only used to extend the supply lines, n-wells and p-wells and to allow routing in most metal layers.
Due to the sea-of-gates approach in the universal array architecture, the ‘empty areas’ here also contain unused transistors and offer additional flexibility for creating small design changes. The first product needs to undergo all mask and processing steps, but redesigns, or derivatives with small changes in the logic content, can be quickly realised by changing only the final metal and via masks and performing only the back-end processing. The customer has to do the design himself, using the vendor’s technology and design kit. The NRE costs for the first run may be in the order of a few hundred thousand US$ for a 120 nm CMOS design with a few million gates and a few Mb of embedded SRAM, up to a million US$ or more for a 60 nm design. This includes the mask costs and the delivery of about 100 samples. A new run, with only minor metal mask changes, may cost several hundred thousand US$.

7.6.8.1 Structured ASICs and Platform ASICs

(Mask-programmable) gate arrays have declined in popularity over the last decade. This has widened the gap between cell-based ASIC designs and FPGAs. A structured ASIC or platform ASIC is a combination of the cell-based and FPGA design concepts, which targets prototyping applications and relatively low-volume markets (10–100 k). It offers a large selection of IP cores, which can be customised through a limited number of masks. Basically, personalisation can be done by customising all metal and via masks, by customising only a subset of the metal and via masks, or by customising only one via mask. NRE costs are relatively low (from 50 k$ to several 100 k$), but the price per individual chip can be four to six times that of the cell-based design version. In the following, a structured-array ASIC example is presented to show some capabilities of this category of ASIC products.

eASIC’s Nextreme structured array ASIC Family

This structured (array) ASIC is an example of customisation through only one top-level via mask. The Nextreme family (see: http://www.easic.com/products) consists of three members, each with different sub-members, offering from 350 k to 13 million gates and 56 Mb of embedded dedicated block memory. The most advanced eASIC is processed in 28 nm CMOS [17]. Customisation is done only through the VIA-6 mask, allowing very short production turn-around times. Figure 7.47 shows an example of the eASIC architecture.

It combines various processor and memory cores with peripherals and interfaces. eASIC claims a 2–6 week design time, followed by 4 weeks of manufacturing. It allows rapid software changes using Diamond processors.

Configurable PLLs and DLLs are embedded for clock generation and clock-phase shifting purposes. In addition to a variety of interfaces and I/O standards, SERDES (serialiser-deserialiser), differential and DDR interfaces are supported through a library of input, output and bi-directional I/Os, which can be configured into a large variety of options and drive strengths.

For prototyping and other low-volume applications, a direct-write e-beam machine is used to perform this VIA-6 customisation, avoiding costly mask production. For high volumes, the custom VIA-6 mask is generated from the same design database.

Structured ASICs attack the low end of the ASIC market. Although there has already been a shake-out among structured-ASIC vendors, there are more vendors than the ones referred to in this section. The selection made here gives a good flavour of the potential of the available products in this ASIC category.

7.6.9 Hierarchical Design Approach

The hierarchical layout design style is characterised by a modular structure (as shown in the heterogeneous chip in Fig. 7.9). The different modules are identified during the design path. With a complex system on chip, for example, the various required functional modules emerge from the specification. These modules may include a microprocessor core, ROM, RAM, peripherals and interfaces, etc.

A top-down design strategy generally leads to a satisfactory implementation of a hierarchical layout. The hierarchical division allows various designers or design teams to simultaneously produce layouts of the identified modules. Reasonable gate or bit densities are combined with a reasonable speed. The afforded performance renders the hierarchical layout design style suitable for most VLSI and ASIC designs. The design time for hierarchical layouts can be drastically reduced with good CAD tools. Available libraries contain parameterised module generators.

These (mostly) software descriptions are synthesised to produce netlists, which can be used to create layouts of the required modules. Assembly of the resulting instances and bond pads leads to the creation of a complete chip layout. Even the assembly and interconnection are automated by placement and routing programs (using P&R and floor-planning tools).
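The flow from a parameterised module generator to a netlist can be illustrated with a small sketch. Everything here is hypothetical: the cell name `FA1`, the pin names and the dictionary-based netlist format are invented for illustration only; real module generators emit HDL or vendor-specific netlist formats, which the P&R tools then consume.

```python
# Hypothetical sketch of a parameterised module generator: given a bit
# width, it emits a gate-level netlist (here a ripple-carry adder built
# from an assumed full-adder library cell "FA1").

def ripple_adder_netlist(width):
    """Return a list of cell instances for an n-bit ripple-carry adder."""
    netlist = []
    carry = "cin"                               # external carry-in net
    for i in range(width):
        netlist.append({
            "cell": "FA1",                      # full-adder library cell
            "name": f"fa_{i}",
            "pins": {"a": f"a[{i}]", "b": f"b[{i}]",
                     "ci": carry, "s": f"s[{i}]", "co": f"co_{i}"},
        })
        carry = f"co_{i}"                       # ripple the carry net
    return netlist

adder8 = ripple_adder_netlist(8)
print(len(adder8))                 # 8 full-adder instances
print(adder8[0]["pins"]["ci"])     # cin: first stage takes the external carry
```

Calling the generator with a different width immediately yields a different instance of the module, which is exactly what makes such libraries attractive for hierarchical design.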

The hierarchical design style can, of course, include modules which are created using different layout design styles, e.g., standard-cell or handcrafted module layouts. For a conventional two-metal-layer design, the hierarchical style was disadvantaged by the relatively large routing areas that could be necessary. However, with the present availability of six to more than ten metal layers, interconnections and buses can be routed across the logic blocks. In some cases, however, the chip area may not be optimal as a result of the Manhattan skyline effect, which results from the different block shapes.

Figure 7.48 shows the meet-in-the-middle strategy used in the hierarchical design approach. This strategy was already introduced by Hugo de Man in the early 1980s [18]. Here, the high-level system description is used to synthesise a design description comprising macro blocks at the implementation level. This implementation level lies roughly in the middle of the top-down design path. The choice of implementation form is still open at this level and possibilities may include a cell-based, gate array or FPGA design. It must be possible to generate these macros from existing design descriptions. Sometimes, module generators are also used to generate a core. The (re)use of IP cores allows a fast ‘plug-in’ of different functional blocks, which are standardised to a certain extent. Clearly, the results of design and layout syntheses meet at the implementation level.

7.6.10 The Choice of a Layout Implementation Form

The unique characteristics of each form of layout implementation determine its applicability. The choice of implementation form is determined by chip performance requirements, initial design costs, required volumes and time-to-market requirements. Figure 7.49 shows a cost comparison of the different forms of layout implementation.
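The trade-off between initial design (NRE) costs and price per chip can be made concrete with a simple break-even calculation. The figures below are assumptions, loosely inspired by the cost ranges quoted earlier in this chapter (a structured ASIC with low NRE but a higher unit price, versus a cell-based design with high NRE but a cheaper unit price):

```python
# Illustrative break-even calculation between two implementation forms.
# All dollar figures are assumed, for illustration only.

def total_cost(nre, unit_price, volume):
    return nre + unit_price * volume

def break_even_volume(nre_a, unit_a, nre_b, unit_b):
    """Volume above which option B (higher NRE, cheaper units) wins."""
    return (nre_b - nre_a) / (unit_a - unit_b)

# Option A: structured ASIC; option B: cell-based design (assumed figures)
nre_structured, unit_structured = 100_000, 25.0    # $
nre_cell_based, unit_cell_based = 1_000_000, 5.0   # $

v = break_even_volume(nre_structured, unit_structured,
                      nre_cell_based, unit_cell_based)
print(round(v))   # 45000: below this volume the structured ASIC is cheaper
```

With these assumed numbers, both options cost the same at 45,000 units; below that volume the low-NRE structured ASIC wins, above it the cell-based design does, which is the essence of Fig. 7.49.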

A single chip may combine different implementation forms. The previously discussed embedded array ASICs and structured ASICs are examples of this. Figure 7.50 shows a photograph of a conventional microprocessor in which handcrafted, bit-slice and memory layout styles are combined. Particularly ICs that require fast and complex data paths, which usually include a memory, one or more address counters and ALUs, may combine data-path layout with standard-cell, memory and full-custom design.

An implementation technique that was popular in the 1980s and early 1990s and is still used in some cases today, is the symbolic layout and compaction technique. A symbolic layout is a technology-independent design, which can be used for every layout implementation form. In a symbolic layout, transistors and contacts are represented by symbols whose exact dimensions are unspecified while wires are represented by lines whose widths are also unspecified. The abstract symbolic layout is transformed to an actual layout by a compaction program, which accounts for all of the design rules of the envisaged manufacturing process.
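A rough impression of what a compaction program does can be given with a toy one-dimensional example: each symbol is assigned the smallest x-coordinate that satisfies a set of minimum-spacing constraints derived from the design rules. The symbol names and spacing values below are invented; real compactors work in two dimensions with full design-rule sets.

```python
# Toy one-dimensional compaction: push every symbol as far left as the
# minimum-spacing constraints allow (a longest-path computation over a
# constraint graph). Symbol names and spacings are invented.

def compact_1d(symbols, constraints):
    """symbols: names in left-to-right order.
    constraints: (left, right, min_spacing) tuples, given in
    topological order, so a single pass suffices.
    Returns the smallest legal x-coordinate for every symbol."""
    x = {s: 0 for s in symbols}
    for left, right, spacing in constraints:
        x[right] = max(x[right], x[left] + spacing)
    return x

# Two transistors and a contact, with assumed design-rule spacings
pos = compact_1d(
    ["t1", "contact", "t2"],
    [("t1", "contact", 3), ("contact", "t2", 2), ("t1", "t2", 6)],
)
print(pos)   # {'t1': 0, 'contact': 3, 't2': 6}
```

Note that the direct transistor-to-transistor rule (spacing 6) dominates the chained contact rules (3 + 2 = 5), so the compactor takes the largest of the two, exactly as a design-rule-correct layout requires.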

The symbolic-layout technique allows a short design time and relieves designers of the need to know specific layout and technology details. The technique is, however, disadvantaged by the associated relatively low gate density and low switching speed. These compare unfavourably with handcrafted layout results. Furthermore, the abstract nature of a symbolic layout only loosely reflects technological aspects. This may result in fatal design errors. Currently, symbolic layout and compaction are only very rarely used.

Finally, the dimensions of all circuit components and wiring in an IC layout are scaled versions of the actual on-chip dimensions. This geometric layout representation is generally described in a geometric layout description language (GLDL). Such languages are common to many CAD tools and usually serve as the data-interchange format between IC design and manufacturing environments. A GLDL has the following typical features:

  • It facilitates the declaration of important layout description parameters, e.g., masks, resolution, dimensions

  • It facilitates the definition of geometrical forms, e.g., rectangles and polygons

  • It facilitates the definition of macros, e.g., patterns or symbols

  • It enables transformations, e.g., mirroring and rotation

  • It contains statements for the creation of matrices.

Currently, GDSII is the de facto standard for physical chip design exchange in the semiconductor industry.
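The listed GLDL features can be mimicked in a few lines of Python to make them concrete. This is a toy model, not GDSII or any real GLDL; all function names, mask names and coordinate conventions are invented for illustration.

```python
# Toy model of GLDL features: geometric primitives, a macro, a
# transformation (mirroring) and a matrix (array) statement.

def rect(mask, x, y, w, h):
    """A rectangle on a given mask layer; coordinates in grid units."""
    return {"mask": mask, "x": x, "y": y, "w": w, "h": h}

def mirror_x(shape):
    """Mirror a rectangle about the y-axis: [x, x+w] maps to [-(x+w), -x]."""
    s = dict(shape)
    s["x"] = -(s["x"] + s["w"])
    return s

def matrix(shapes, nx, pitch_x):
    """Replicate a macro nx times with a fixed horizontal pitch."""
    out = []
    for i in range(nx):
        for sh in shapes:
            s = dict(sh)
            s["x"] += i * pitch_x
            out.append(s)
    return out

# A macro: a contact with its metal-1 landing pad (assumed dimensions)
contact_macro = [rect("CONTACT", 0, 0, 2, 2), rect("METAL1", -1, -1, 4, 4)]
row = matrix(contact_macro, nx=4, pitch_x=6)      # 4 contacts, 6-unit pitch
print(len(row))                                   # 8 rectangles in total
print(mirror_x(rect("POLY", 1, 0, 3, 10))["x"])   # -4
```

A real GLDL adds the declaration of masks, resolution and units up front, but the macro, transformation and array mechanisms work essentially as sketched here.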

7.7 Conclusions

This chapter introduces various VLSI design and layout realisations and their characteristic properties. A top-down design approach, combined with a bottom-up implementation and verification through a hierarchical layout style appears suitable for most VLSI circuits. In practice, the design process consists of a number of iterations between the top-down and bottom-up paths, the aim being to minimise the number of iterations.

The use of IP cores that are available from in-house resources and from different vendors is fuelling the reuse of existing functionality, such as microprocessor and signal processing cores, memories, analog circuits and interfaces, etc. This reuse increases the problems with timing and communication between cores from different origins. Chapter 9 discusses these problems in detail.

During the last decade, the design complexity of an ASIC has dramatically increased and has caused the design costs to increase by almost an order of magnitude (see Chap. 11). This has put permanent pressure on the efficiency of the design process. Semiconductor companies have built application-domain-specific platforms, which are key to a higher design productivity and improved product quality. Since IC production fabs are becoming extremely expensive, more companies will share the same production facility and production process and become fab-lite (outsourcing 40–50% of the manufacturing operations) or even fabless. Semiconductor (design) houses can then only differentiate themselves by designing better products faster and cheaper.

Various ASIC design and implementation styles have been presented. Standard-cell designs, mask-programmable and field-programmable gate arrays, and structured ASICs all differ in the way they are designed, in the way they are fabricated and in the way they are used in an application. The choice of ASIC style largely depends on the required turn-around time and product volume.

A good IC design must be accompanied by a good test and debug strategy. Testability and debug are discussed in Sect. 10.2 and require considerable attention during the design phase. The use of an extra 5% of chip area to support testability and debug might, for instance, lead to a 50% reduction in test costs.

7.8 Exercises

  1. Why are abstraction levels used for complex IC designs?

  2. What is meant by floor planning?

  3. Explain what is meant by logic synthesis.

  4. What does the term ‘Manhattan skyline’ describe in relation to a VLSI layout?

  5. Assume that a standard-cell and a gate array library are designed in a CMOS technology. The libraries consist of logic cells with identical logic functions. Describe the main differences between the two libraries in terms of:

     (a) Cell design
     (b) Chip area
     (c) Production time and cost
     (d) Applications

  6. Random logic functions can, for instance, be implemented using a ROM or a standard-cell realisation. Explain when each of these possibilities is preferred.

  7. Draw a schematic diagram of a PLA which implements the following logic functions:

     $$F_{0} = \overline{\overline{x}\,\overline{y} + xyz}\qquad F_{1} = \overline{x\overline{y} + \overline{x}y + x\overline{z}}\qquad F_{2} = \overline{xyz + \overline{x}\,\overline{y}\,\overline{z}}$$

  8. Explain what is meant by mixed-level simulation.

  9. Explain in your own words what is meant by IP. What is the cause of its existence? How can it affect design efficiency and what are the potential problems involved with it?

  10. Explain the differences between an FPGA and a CPLD.

  11. Explain the ‘meet-in-the-middle’ strategy.

  12. Explain why a cell-based design implementation is much smaller than a design implemented with an FPGA.