13.1 Introduction and Background

Advances in process technology and Moore’s law have enabled programmable devices to grow by more than four orders of magnitude in capacity. The performance of these devices has increased by a factor of 100, while cost and energy per operation have decreased by more than a factor of 1000 (see Fig. 13.1). Field Programmable Gate Arrays (FPGAs) were introduced in the mid-1980s and later proved to be the dominant form of programmable device [1]. Today’s FPGA market exceeds $5 billion and is still growing, three and a half decades after the technology's introduction. These advances have been fueled by process technology scaling, but the FPGA success story is also one of architecture and software choices made in the industry.

Fig. 13.1

Xilinx FPGA evolution since the mid-1980s (from [1], ©IEEE 2015). Capacity is logic cell count. Speed is same-function performance in the programmable fabric. Price and power are per logic cell and scaled by 10,000

The first wave of success for FPGAs came from replacing custom logic designs and Application-Specific Integrated Circuits (ASICs). In the 1980s, ASIC companies brought built-to-order custom integrated circuits to the electronics market as a powerful new product. ASIC vendors competed fiercely, and the winning attributes were low cost, high capacity, and speed. At the time, FPGAs compared poorly on those measures, but they thrived by virtue of programmability. In those early days transistors were precious, and FPGAs were dismissed as a waste of them: the additional transistors were spent on field programmability, allowing users to implement their designs on off-the-shelf manufactured devices. Time has proven that such programmability, and availability at the moment of market need, is highly valuable, and it drove the growth of FPGAs through the first wave.

After the first wave of replacing custom logic, FPGAs became common components of digital systems. Moore’s Law helped FPGA capacity grow beyond a collection of LUTs, flip-flops, I/O, and programmable routing; devices came to include multipliers, RAM blocks, multiple microprocessors, and high-speed transceivers. This enabled FPGAs to penetrate a huge market in the data communications industry. The FPGA business grew not from general ASIC replacement, but from adoption by the communications infrastructure. Companies such as Cisco Systems used FPGAs to build custom data paths for steering huge volumes of internet and packetized voice traffic through their switches and routers [2]. New network routing architectures and algorithms could quickly be implemented in FPGAs and updated in the field. Sales to the communications segment grew rapidly to well over half the total FPGA business during this second wave of growth. Meanwhile, the increasing cost and complexity of silicon manufacturing eliminated "casual" ASIC users. ASICs responded by adding programmability in the form of application-specific standard product (ASSP) and system-on-chip (SoC) devices. An SoC combines a collection of fixed-function blocks with a microprocessor subsystem; the function blocks are typically chosen for a specific application domain, such as image processing or networking. The SoC gave structure to the hardware solution, and programming the microprocessors was easier than designing hardware. Leveraging the same advantages as FPGAs, programmable ASSP devices served a broader market and amortized their development costs across it. Companies building ASSP SoCs became fabless semiconductor vendors in their own right, able to meet the sales targets required by high development costs.

The FPGA industry is currently in the early stages of a third wave: serving the computing market. Programmable devices have kept the key advantages of the first two waves, customized logic and communication, while preparing for compute and acceleration opportunities in data centers. Confirming this trend, Intel acquired the second-largest FPGA company in 2015 for approximately $16.7 billion. The combination with FPGA technology is expected to enable new classes of products that meet customer needs in the data center and Internet of Things (IoT) market segments. In the words of Brian Krzanich, the CEO of Intel in 2015, "With this acquisition, we will harness the power of Moore’s Law to make the next generation of solutions not just better, but able to do more. Whether to enable new growth in the network, large cloud data centers or IoT segments, our customers expect better performance at lower costs."

A fundamental early insight in the programmable logic business was that Moore’s Law would eventually propel FPGA capability to cover ASIC requirements. Today, transistors are abundant, and their number is no longer a cost driver in the "FPGA versus ASIC" decision. Many ASIC customers use older process technology, which lowers their NRE cost but reduces the per-chip cost advantage. Instead, performance, time-to-market, power consumption, I/O features, and other capabilities are the key factors. Solving transistor-level design problems such as testing, signal integrity, crosstalk, I/O design, and clock distribution on behalf of the user, along with eliminating up-front mask charges, helped FPGAs grow to a prominent footprint in the semiconductor industry. Advances in process technology have enabled FPGAs to grow in capacity and to implement large heterogeneous systems in a single device. The emerging devices are highly adaptable, making them the candidate of choice for a wide range of emerging domains from compute to networking. In the next section we take a deeper look at these devices, which will be introduced to the market over the next few years, with a special emphasis on the compute domain.

13.2 Highly Integrated Emerging Programmable Devices

Xilinx is introducing its latest devices in 7 nm process technology: a new heterogeneous compute family called the Adaptive Compute Acceleration Platform (ACAP). In addition to next-generation Programmable Logic (PL), this monolithic platform includes vector and scalar processing elements tightly coupled through a high-bandwidth network-on-chip (NoC), which provides memory-mapped access to all three processing element types. This tightly coupled hybrid architecture is called Versal™ and is conceptually depicted in Fig. 13.2. It allows more dramatic customization and performance increases than any previous programmable device, offering an architectural solution for the computing and communication needs of modern applications. The scalar Arm processors and platform management controller occupy the lower-left region of the chip. The adjacency of the Processor Subsystem (PS) to Gigabit Transceivers (GTs), memory controllers, and the NoC enables those blocks to be used together without any of the fabric being programmed. GTs can occupy the left and right edges of the fabric regions. High-speed I/Os also run along the bottom edge of the die, including hardened memory controllers to interface with off-chip memory such as DDR and HBM. Across the top of this example Versal floorplan is an array of AI Engines designed to accelerate math-intensive functions for applications including machine learning and wireless. Finally, a hardened network-on-chip augments the traditional fabric interconnect and enables a new class of high-speed, system-level communication between the various heterogeneous features, including the PS, DDR, AI Engines, and FPGA fabric (shown in blue). In this section, we provide more detail on each heterogeneous block, building an overall picture of the FPGAs coming in the next few years [3].

Fig. 13.2

Xilinx ACAP devices will include a number of heterogeneous blocks to enable the new wave of customized compute

13.2.1 Programmable Fabric

Traditionally, the core architecture of an FPGA consists of an array of Configurable Logic Blocks (CLBs) and an interconnect with programmable switches, as simplified in Fig. 13.3a. This fabric, the core differentiator of FPGAs from other semiconductor devices, has benefited the most from Moore’s law. In this subsection, we introduce the latest programmable fabric, highlighting how far it has evolved from early FPGAs. Figure 13.3b depicts a representative device floorplan for the upcoming Xilinx Versal architecture. The fabric portion of this simplified floorplan (in blue) is conceptually similar to a traditional FPGA; it includes resources such as LUTs, flip-flops, and a rich interconnect to connect them. Every CLB contains 32 look-up tables (LUTs) and 64 flip-flops. Each LUT can be configured as either one 6-input LUT with one output, or as two 5-input LUTs with separate outputs but common inputs. Each LUT output can optionally be registered in a flip-flop [3].
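To make the fracturable LUT concept concrete, the following minimal Python sketch (illustrative only, not vendor code) models a 6-input LUT whose 64 configuration bits can serve either one 6-input function or two 5-input functions sharing inputs; the exact output mapping of the real LUT primitive differs in detail.

def lut6(init_bits, inputs):
    """Evaluate a 6-input LUT: 64 configuration bits form the truth table."""
    assert len(init_bits) == 64 and len(inputs) == 6
    index = sum(bit << i for i, bit in enumerate(inputs))
    return init_bits[index]

def dual_lut5(init_bits, inputs):
    """Reuse the same 64 bits as two 5-input LUTs with shared inputs."""
    assert len(init_bits) == 64 and len(inputs) == 5
    index = sum(bit << i for i, bit in enumerate(inputs))
    return init_bits[index], init_bits[32 + index]   # two independent outputs

and6 = [0] * 63 + [1]                      # truth table of a 6-input AND
print(lut6(and6, [1, 1, 1, 1, 1, 1]))      # -> 1
print(dual_lut5(and6, [1, 1, 1, 1, 1]))    # -> (0, 1): two separate functions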

Fig. 13.3

a Conceptual 4 × 4 array of PL with three wiring tracks and switches at the intersection circles ([1], ©IEEE 2015). b Xilinx Versal representative device floorplan ([4], ©ACM 2019)

CLBs in early FPGAs contained a single LUT and register with three or four inputs. The CLB in the Versal architecture contains more than 60 times the amount of logic and registers of those early FPGAs. By enlarging the CLB to include more logic elements, a significant fraction of local nets is subsumed internally, reducing demand for global tracks and wiring. A dedicated local interconnect structure resides within each CLB to support more versatile intra-CLB connectivity, as shown in Fig. 13.4a. This is a clear architectural response to technology scaling dynamics. Wire distances shrank with scaling, but wire cross-sectional area shrank quadratically, resulting in a net increase in resistance at each generation. Despite shorter physical distances and faster transistors, total delay would have increased at more advanced process nodes. Designers were therefore forced to use thicker, lower-resistance metal to reduce wire delays. As technology scaled, metal resources became more expensive, and architectural changes such as coarser CLBs became a necessity for efficiency.
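A first-order, back-of-the-envelope model (illustrative assumptions only, not foundry data) shows why local wire delay stops tracking transistor delay: if a wire's length, width, and thickness all shrink by the same factor, its RC delay stays roughly constant while gate delay improves.

def relative_wire_rc(shrink):
    """Relative RC delay of a local wire whose length, width and thickness
    all scale by `shrink` (< 1), versus the previous node."""
    r_per_len = 1.0 / (shrink * shrink)   # R per length ~ 1 / (width * thickness)
    c_per_len = 1.0                       # roughly unchanged to first order
    length = shrink
    return (r_per_len * length) * (c_per_len * length)   # ~ constant

for s in (0.9, 0.8, 0.7):
    print(f"shrink {s}: wire RC x{relative_wire_rc(s):.2f}, "
          f"gate delay roughly x{s:.2f}")
# Wire RC stays ~1.0 while gate delay improves, so wires come to dominate
# timing unless thicker metal or coarser blocks reduce the wiring demand.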

Fig. 13.4

a Versal CLB, b Internal routing structure in the coarse-grained CLB ([4], ©ACM 2019)

Empirical experiments show that a significant fraction of nets have very localized sources and destinations. Since local routes are shorter and can be squeezed at tighter pitches onto fewer, lower-level metal layers, the implementation cost of local routes is substantially less than that of global routes. On average, 18% of all pin-to-pin connections in Versal are intra-CLB connections, in contrast to 7% within the smaller CLB of the previous 16 nm UltraScale™ architecture; Fig. 13.4b denotes this as "Total Internal Connections." In practice, roughly 83% of those candidate connections are actually routed inside the CLB in Versal, due to limitations such as tool behavior. This is noted as "Internally Satisfied Connections" in the figure, compared to only 28% of the UltraScale theoretical connections. As the figure shows, only about 2% of all nets in UltraScale are successfully routed within a CLB, compared to about 15% in Versal, increasing internal net routing by a factor of almost 8X while only modestly increasing the cost of the CLB.
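Multiplying the two ratios reproduces the quoted end-to-end numbers; a quick back-of-the-envelope check in Python (figures as quoted above):

ultrascale = {"intra_clb_candidates": 0.07, "satisfied": 0.28}
versal     = {"intra_clb_candidates": 0.18, "satisfied": 0.83}

for name, d in (("UltraScale", ultrascale), ("Versal", versal)):
    routed_inside = d["intra_clb_candidates"] * d["satisfied"]
    print(f"{name}: ~{routed_inside:.0%} of all connections routed inside a CLB")
# UltraScale: ~2%, Versal: ~15%, i.e. roughly an 8x increase in internally
# routed nets, consistent with the numbers reported in the text.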

In addition to the LUTs and flip-flops, the CLB contains dedicated circuitry such as arithmetic carry logic and multiplexers for building wider logic functions. CLB internals such as the wide-function muxes, the carry chain, and the internal connectivity are designed to increase total device capacity by reducing area per utilized logic function. Within each CLB, 16 LUTs can be configured as 64-bit RAMs, 32-bit shift registers (SRL32), or pairs of SRL16s. For every group of 64 flip-flops, there are four clock signals, four set/reset signals, and 16 clock enables. Dedicated local interconnect paths connect LUTs together without having to exit and re-enter a CLB; among other things, this enables a flexible carry logic structure that allows a carry chain to start at any bit in the chain [3].
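As a behavioral illustration of one of these storage modes (a sketch only, not the hardware implementation), the following Python class models a LUT used as a 32-bit shift register with a selectable tap:

class SRL32:
    def __init__(self):
        self.bits = [0] * 32

    def clock(self, data_in, tap):
        """Shift in one bit per clock and read the bit at the selected tap
        (0..31), so one LUT's storage acts as a programmable delay line."""
        self.bits = [data_in] + self.bits[:-1]
        return self.bits[tap]

srl = SRL32()
outputs = [srl.clock(bit, tap=3) for bit in [1, 0, 1, 1, 0, 0, 0, 0]]
print(outputs)   # -> [0, 0, 0, 1, 0, 1, 1, 0]: the input stream delayed by 3 cycles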

Another interesting new fabric feature, called the Imux Register Interface (IRI), aims to make high-performance designs easier to implement. IRIs are flexible registers on the input side of all blocks that can optionally be bypassed. Such architectural features enable time borrowing or additional pipelining to improve design performance. Adding pipeline registers to the interconnect, with registers on every interconnect resource, was introduced in Intel Stratix 10, with the claim that overall performance is not affected significantly when the registers are unused [5]. The authors in [3], however, state that Imux Registers are a more cost-effective solution for increasing design speed while requiring less design adaptation than the "registers-everywhere" approach of [5]. Both are architectural decisions that favor emerging designs that are highly pipelined.
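A small timing example (the delay numbers below are made up purely for illustration, not device data) shows the effect of an optional input-side register on achievable clock frequency:

logic_delay_ns = 1.8
route_delay_ns = 1.4

f_unpipelined = 1e3 / (logic_delay_ns + route_delay_ns)     # one long stage
f_pipelined   = 1e3 / max(logic_delay_ns, route_delay_ns)   # split into two stages

print(f"without IRI register: ~{f_unpipelined:.0f} MHz")
print(f"with IRI register:    ~{f_pipelined:.0f} MHz (one extra cycle of latency)")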

Today’s fabric also contains hardened DSP and memory blocks, all arranged in a columnar topology. These generic hard blocks enable the fabric to be customized for a large number of applications. For the higher-end devices this amounts to more than 20 MB of customizable on-chip memory and nearly 2000 DSP engines. The largest Versal PL will contain close to 900K LUTs; more details on the CLB and the number of memory and DSP blocks can be found in [6].

13.2.2 Hardened Domain-Specific Features

Features in the PL are generally versatile enough to be used across all domains, but the trend that started in the last decade to support the communications market, hardening domain-specific functions, will continue for other domains. The ACAP hardens all of the necessary platform management functions and separates them from the FPGA core logic. The processor and platform management controller occupy the lower-left region of the chip, as depicted in Fig. 13.3b. The example floorplan shows hardened scalar processor systems (PS), memory controllers, and GTs. The Versal architecture provides a framework for swapping in new domain-specific blocks that are market-driven and not always necessary; for example, some devices may have A-to-D converters in place of GTs. A variety of smaller domain-specific hard IP blocks, such as forward error correction, MAC blocks, Interlaken, or PCIe, can occupy slots within the fabric array. In this respect, the Versal architecture enables a platform that continues the trend towards families of domain-specific devices.

Xilinx FPGAs have used columnar I/Os over the last decade. Columnar I/Os have several advantages, including tight integration with the fabric and area efficiency. However, I/O cells do not tend to shrink with Moore’s Law, and the cost of long metal wires has increased; as a result, the interconnect delay and clock skew incurred when crossing large I/O columns grow. The resulting timing and spatial discontinuities in the fabric add complexity for the software tools that map designs across those boundaries. Moreover, package trace breakouts from the die interior can be challenging and performance-limiting as more I/Os are required. Moving to perimeter I/Os therefore enables higher-performance I/Os with less fabric disruption. These high-speed I/Os at the bottom of Fig. 13.3b, together with their adjacent hardened memory controllers, serve domains that require significant external memory bandwidth.

The Platform Management Controller (PMC) is another hard block; it brings the platform to life and keeps it safe and secure. It boots and configures all of the blocks in the ACAP in milliseconds, and all security, safety, and reliability features are managed through it. It cryptographically protects images for both hardware and software, while safely providing enhanced diagnostics, system monitoring, and anti-tamper. All debug and system monitoring happen through the PMC and high-speed chip-wide debug paths. A number of FPGA application domains rely on partial reconfiguration of the blocks. The Versal architecture offers an 8X configuration speed boost over previous generations by widening the internal configuration bus by 4X and using a faster configuration clock. Leveraging the same configuration path speedups and rearranging CLB flip-flop data into a minimal number of frames enables up to a 300X readback improvement; faster readback in turn allows faster and more efficient design debug.

13.2.3 Hardened Data Movement on Chip

Future cloud and high-performance computing (HPC) will be data-centric, and leadership in data movement is critical to success. The consensus is that data center workloads will become more data-intensive and will need to manage three orders of magnitude more data arriving from 5G. New interconnect technologies for moving data on and off the chip will be essential to deliver lower latency while maintaining energy efficiency. FPGAs have long been successful in providing users with a bit-level configurable interconnect. But FPGA capacity has grown rapidly, and emerging applications comprise a large number of compute modules; the communication among these modules and with external memory causes routing congestion in the fabric interconnect. This problem is more pronounced with process scaling, since the technology is not improving wire resistance. System performance at high frequencies requires efficient global data movement across the chip to and from external memory. It therefore makes sense to organize data movement into wide, standardized, bussed interfaces. A general technique for reducing interconnect burden is resource sharing, and a Network-on-Chip (NoC) is a systematic method for sharing wires: the higher the speed of data movement, the higher the sharing level that valuable wire resources can sustain. ASICs and SoCs addressed the similar problem of moving many high-bandwidth data streams by adding hardened NoCs. In a packet-switched NoC, the same physical resource routes communication between multiple ports, increasing area efficiency.

For FPGAs, researchers have similarly proposed various techniques to improve on the efficiency of bit-level interconnect, ranging from requiring users to reason at the word level rather than the bit level [7, 8] to implementing NoCs as hardened interconnect resources on the FPGA [9]. In the Versal architecture, a hardened NoC forms a layer of interconnect augmenting the traditional FPGA interconnect. Adding hard blocks for storage or compute is not new for FPGAs, but hardening data movement is a first in the industry. The traditional soft FPGA interconnect continues to provide bit-level flexibility, but the NoC can absorb a significant portion of the interconnect demand, separating the implementation of system-level communication from the compute portion. Consider the concrete case of a compute IP requiring access to some memory controller. To close timing at the high frequencies required to support high bandwidths, the compute block would have to be placed close to the memory controller, or the physical implementation tools would have to be smart enough to insert pipelining on demand. With the NoC, by contrast, the compute can be implemented anywhere on the FPGA: it only needs to hook up to the nearest NoC port for communication to occur at a guaranteed bandwidth. This eases timing closure for a large variety of designs.

A mesh is a common topology for NoCs, but it is neither necessary nor useful in the FPGA case. Figure 13.3 shows how the NoC integrates with the rest of the device. There are multiple vertical NoC (VNoC) columns in the fabric, and each master or slave client simply connects to the nearest one. The figure also shows two horizontal NoC (HNoC) rows at the top and bottom of the floorplan. Adding more horizontal rows would not significantly improve access to the NoC but would significantly disrupt fabric connectivity. Columnar integration with the fabric is natural in the context of FPGAs, because VNoCs can be added in the same way as any other columnar block within an FPGA. HNoCs are sized with more physical channels than the VNoCs; this provides enough horizontal bandwidth for fabric clients attached to a particular VNoC to access memory controllers at all horizontal locations in the device, a key feature enabling a uniform view of memory across the entire device for all clients [10]. The Versal NoC is a packet-switched network that implements a deterministic routing flow with wormhole switches. It supports multiple virtual channels (VCs) to help avoid deadlock and head-of-line blocking, and multiple Quality-of-Service (QoS) classes, the details of which are described in [10]. The Versal NoC is not a replacement for fabric interconnect; it provides a persistent interconnect that implements switching and routing functions that would previously have consumed fabric resources.
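The Python sketch below illustrates the idea of deterministic routing on such a column-and-row topology; it is a conceptual model only and does not reflect the actual Versal NoC routing tables or algorithm.

def route(src, dst, top_row=7, bottom_row=0):
    """Deterministic route from src to dst, each given as (column, row) of the
    nearest NoC port: go vertically to the closer horizontal row, across it,
    then vertically to the destination."""
    (sc, sr), (dc, dr) = src, dst
    path = [(sc, sr)]
    if sc == dc:                                   # same VNoC column
        path += [(sc, r) for r in _steps(sr, dr)]
        return path
    via_top = abs(sr - top_row) + abs(dr - top_row)
    via_bottom = abs(sr - bottom_row) + abs(dr - bottom_row)
    row = top_row if via_top <= via_bottom else bottom_row
    path += [(sc, r) for r in _steps(sr, row)]     # up/down the source column
    path += [(c, row) for c in _steps(sc, dc)]     # across the HNoC row
    path += [(dc, r) for r in _steps(row, dr)]     # up/down the destination column
    return path

def _steps(a, b):
    """Intermediate coordinates from a (exclusive) to b (inclusive)."""
    if a == b:
        return []
    step = 1 if b > a else -1
    return range(a + step, b + step, step)

print(route(src=(0, 3), dst=(2, 5)))
# -> [(0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 7), (2, 7), (2, 6), (2, 5)]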

One key driver of the NoC requirements is managing access to external memory through the DDR channels effectively. The NoC bandwidth and resources scale with both the device memory bandwidth and the fabric size: the number of fabric ports on each VNoC scales with the height of the device, and the number of VNoC columns scales with device memory bandwidth. This allows the NoC to support the entire memory bandwidth while providing enough fabric access points to consume it. Each horizontal and vertical line represents a full-duplex link, 128 bits wide and operating at 1 GHz, giving each physical link a throughput of roughly 16 GB/s in each direction. Each VNoC contains two physical lanes, which sums to 64 GB/s of bidirectional bandwidth; HNoCs have either two or four physical links depending on device size, providing up to 128 GB/s of horizontal bandwidth. The NoC provides unified, physically addressed access to all hard and soft components on the device, and its programmable routing tables are initially programmed at boot time.
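The bandwidth figures follow directly from the link parameters; a short calculation using the numbers quoted above:

link_width_bits = 128
clock_ghz = 1.0
per_link_per_dir = link_width_bits * clock_ghz / 8        # GB/s per direction

vnoc_gb_per_s = 2 * per_link_per_dir * 2                  # 2 lanes, both directions
hnoc_gb_per_s = 4 * per_link_per_dir * 2                  # up to 4 links

print(f"per physical link, per direction: {per_link_per_dir:.0f} GB/s")
print(f"VNoC (2 lanes, bidirectional):    {vnoc_gb_per_s:.0f} GB/s")
print(f"HNoC (4 links, bidirectional):    {hnoc_gb_per_s:.0f} GB/s")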

SoCs and ASICs have used NoCs for many years, but the requirements for a programmable device are different. In a programmable device, the NoC topology, bandwidth, and QoS requirements depend on a mixture of fixed and programmable functions whose behavior varies substantially with the application being mapped. This demands a high degree of programmability from the NoC: the Versal NoC architecture must permit all possible point-to-point communication, with every egress port reachable from every ingress port. In a traditional NoC-based system, one could instantiate multiple NoCs, each optimized for different needs. Within a programmable NoC platform, the compilers have to manage all flows within the constraints of the hardened NoC architecture. This requires some over-provisioning of NoC resources along with a high degree of programmability; for example, the Versal NoC provisions more VCs (8) and QoS classes (3) than typical applications require. The entire topology of the NoC also needs to be built from repeatable blocks, permitting easy integration and design of a family of devices with different communication and compute needs from the same building blocks.

13.2.4 AI Engines

We mentioned in Sect. 13.1 that programmable devices are riding a prominent wave of compute-intensive applications such as 5G cellular and machine learning. 5G requires five to ten times higher compute density than prior generations, and the emergence of machine learning in many products further increases compute-density requirements dramatically. Xilinx products started addressing computationally intensive applications by adding hardened multipliers with the Virtex®-II series of FPGAs in 2001. Today there are over 12,000 DSP slices in current devices, an increase of three orders of magnitude in compute resources over the last two decades. The ACAP devices include a new type of programmable compute engine, called the AI Engine, shown at the top of Fig. 13.3b. AI Engines are an array of VLIW SIMD processors that deliver up to 8X the silicon compute density at 50% of the power consumption of traditional programmable logic solutions [11]. They have been optimized for signal processing, meeting both the throughput and compute requirements needed to deliver the high bandwidth and accelerated speed required for wireless connectivity. AI Engine arrays represent a leap in computational capability and can also be viewed as a commercial realization of Coarse-Grained Reconfigurable Arrays (CGRAs). Chapter 14 provides a broader academic perspective on CGRAs and their advantages, with a more in-depth look at some of the architectural and compilation aspects. In the remainder of this subsection we focus on the AI Engine architecture.

Figure 13.5 shows a 9 × 9 array of AI Engine tiles with a detailed accounting of the resources in each tile. Each engine core includes 16 KB of instruction memory, 32 KB of data RAM, a 32-bit RISC scalar processor, and a 512-bit SIMD vector processor with both fixed-point and floating-point datapaths. AI Engines are interconnected using a combination of dedicated AXI-Stream routing and direct connections to neighboring tiles. For data movement, dedicated DMA engines and hardware locks connect directly to the AXI-Stream fabric to handle connectivity, data movement, and synchronization. The vector processors comprise both integer and floating-point units, supporting 8-bit, 16-bit, and 32-bit operands as well as single-precision floating point. Two key architectural features ensure deterministic timing: (1) dedicated instruction and data memories, and (2) dedicated connectivity paired with DMA engines for scheduled data movement. The simplest form of inter-tile data movement is via shared memory between immediate neighboring tiles, giving each tile up to 128 KB of addressable shared memory. When tiles are farther apart, an AI Engine tile uses AXI-Stream dataflow instead. AXI-Stream connectivity is predefined and programmed by the AI Engine compiler tools based on the dataflow graph, and these streaming interfaces can also connect directly to the PL and the NoC.
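Two quick calculations based on the tile parameters quoted above (illustrative arithmetic only): the number of SIMD lanes per operand width for a 512-bit vector unit, and the shared memory reachable through neighboring tiles.

VECTOR_BITS = 512
for width in (8, 16, 32):
    print(f"{width}-bit operands: {VECTOR_BITS // width} SIMD lanes per vector operation")

TILE_RAM_KB = 32
print(f"shared data memory reachable locally: up to {4 * TILE_RAM_KB} KB "
      "(the quoted 128 KB corresponds to four 32 KB tile memories)")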

Fig. 13.5

AI Engines ([11], ©Xilinx 2019)

The architecture is modular and scalable; some devices will contain up to 400 of these tiles. One of the highest-value propositions of this CGRA is its connectivity with the adjacent fabric. Figure 13.6 illustrates the connectivity between the AI Engine array and the programmable logic. AXI-Stream connectivity exists across the AI Engine array interface and extends into the programmable logic and, separately, into the network-on-chip (NoC). Leveraging NoC connectivity, AI Engines communicate with external memory. The processor subsystem (scalar processors) on the device also manages configuration, debug, and tracing of the AI Engines through the HNoC. AI Engines are programmed using a C/C++ paradigm familiar to many programmers, as explained in the following sections, and are integrated with Xilinx’s Adaptable and Scalar Engines (PL and PS) to provide a highly flexible and capable overall solution. The key difference between an AI Engine array and a traditional multicore compute engine is the dedicated, non-blocking, deterministic interconnect. Xilinx has published results indicating 10X higher compute for ML inference, 5X higher 5G wireless bandwidth, and 40% lower power compared to earlier 16 nm FPGA devices [11].

Fig. 13.6

AI Engines and connectivity with fabric

13.3 Disaggregation Trend for Cost and Market Agility

Silicon transistors and wires are no longer providing much area or speed benefit due to the slowdown of Moore’s Law, and power per unit chip area is increasing (reflecting the end of Dennard scaling). Advances in process technology have enabled FPGAs to grow in capacity and implement large heterogeneous systems in a monolithic device, and the emerging ACAP devices described in the previous section are highly adaptable, making them the candidate of choice for a wide range of emerging domains from compute to networking. However, performance and power improvements are no longer readily available from process technology, and it is no longer trivial to build cost-efficient devices. A prominent response to rising fabrication cost is disaggregation of architecture components, since only parts of the system require expensive leading-edge process nodes. Disaggregating a monolithic system means implementing the required connectivity at the wafer or package level, and disruptive technologies such as wafer-level connectivity and advanced packaging are expected to evolve over the next decade. ACAP's unique position is that it already includes many heterogeneous blocks, which provides an opportunity for cost reduction through selective disaggregation per domain of interest. The significant drivers for this trend are claimed to be:

  • Improving yield by splitting the silicon into smaller dies.

  • It’s the only way to get enough memory or a heterogeneous technology such as photonics into the system.

  • Only some parts of the system require expensive leading-edge nodes.

  • It’s a way to use the same silicon to address different configurations/markets.

13.3.1 FPGA Products with Multiple Dice

The first driver mentioned above, improving the yield of large devices, led Xilinx to develop a new approach for building high-capacity FPGAs for the emulation market in the early 2010s [12]. The solution provides high-bandwidth connectivity between multiple dice through high-density package connectivity; combining several large dice in a single device is the only way to exceed the capacity and bandwidth offered by the largest monolithic devices. Figure 13.7 shows a side view of four large FPGA dice on a passive interposer that provides tens of thousands of die-to-die connections in the same package, responding to the cost pressures of monolithic integration. The key enabling technology was the combination of Through-Silicon Vias (TSVs) and micro-bump technology. The passive silicon interposer was a low-risk, high-yield 65 nm process providing four layers of metallization for the tens of thousands of traces that connect the logic regions of multiple FPGA dice. C4 solder bumps connect the interposer stack-up to a package substrate using flip-chip assembly techniques. This technology provided multi-terabit-per-second die-to-die bandwidth through more than 10,000 connections, enough for the most complex multi-die FPGA product in 28 nm process technology (XC7V2000T).

Fig. 13.7

Virtex®-7 2000T FPGA enabled by advanced packaging technology

Later, the same technology was used to integrate different types of die. The Virtex-7 H870T FPGA, announced in 2012, ties together three homogeneous FPGA dice and a separate 28G transceiver chiplet via the silicon interposer. This was the world's first heterogeneous FPGA architecture: an FPGA consisting of heterogeneous dice placed side by side and operating as one integrated device. While this product did not achieve the market success of the 2000T device, for reasons beyond the scope of this chapter, it was an important technology turning point for the many heterogeneously integrated devices that followed and are still to come.

FPGA High Bandwidth Memory (HBM) devices, introduced in 2017, integrate 16 nm UltraScale+ FPGA fabric with an HBM controller and memory stacks from Xilinx supply partners [13]. The HBM is integrated using the same interposer-based stacking technology described above, as depicted in Fig. 13.8. Such heterogeneous integration provides more than 20X the external memory bandwidth achievable with the same device over a PCB. Low power and high-bandwidth memory access are essential requirements for emerging compute and data center domains. The AXI interface in the HBM memory controller is hardened to accommodate the aggregate bandwidth between the local programmable routing and the HBM module; this structure significantly increases the user's AXI interface bandwidth, allowing operation at up to 3.7 Tb/s [14]. Recently, Xilinx introduced the Virtex® UltraScale+™ VU19P, the largest FPGA ever built, with the highest logic density and I/O count on a single device, addressing a new emulation market. The device boasts more than 4M LUTs in the same package, which would not have been possible without disaggregating the whole system into multiple dice.

Fig. 13.8

Virtex UltraScale+ FPGA with High Bandwidth Memory (HBM) ([13], ©Hot Chips, 2017)

The Versal ACAP with its highly integrated monolithic features was introduced in the previous section. The new fabric, however, has a unique feature that can be leveraged to add connectivity using silicon interposer technology. The current generation of interposer technology, as used for the HBM devices or the VU19P, only uses interposer wires at the edge of the die and in the vertical direction, with micro bumps distributed in channels along the edge by displacing CLBs. The Versal ACAP fabric architecture instead embeds a number of micro bumps in each CLB, allowing them to be distributed evenly across the die. This enables the architecture to utilize more interposer wiring, and in both directions. In this new routing architecture, interposer wires serve two purposes: (1) inter-chiplet connectivity, and (2) additional regular intra-chiplet long-range routing. These interposer wires are 30% faster for the same distance and ideal for long-reach die-to-die connectivity. The result is the possibility of ultra-large ACAP devices, with multiple active silicon dice stacked on a passive interposer and ample routing wires on the interposer, that may be introduced to the market in the next few years. This architecture will reduce delays and routing congestion at the die boundaries and will consequently ease the software burden of partitioning a design across multiple dice. The key enabler for this form of chiplet connectivity is the 4X CLB granularity explained earlier. Further details and quantitative benefits can be found in [4].

Intel’s recent 10 nm Intel® Agilex™ FPGAs are also built using a disaggregated chiplet architecture, which integrates heterogeneous technology elements in a System-in-Package (SiP). Leveraging a packaging technology called Embedded Multi-Die Interconnect Bridge (EMIB), Intel uses the chiplet approach to combine a traditional FPGA die with purpose-built semiconductor dice, creating devices that are optimized for target applications. EMIB silicon bridges are positioned as an alternative to 2.5D packages using silicon interposers: they often provide similar connectivity density while requiring less silicon, since silicon is used only in the areas where two dice connect (Fig. 13.9). Since the main cost of such advanced packaging lies in assembly, it is not clear that either method is superior, and both approaches are expected to coexist for the time being. Intel is also using this technology to add advanced analog functions, such as 112 Gbps PAM-4 transceivers, to the programmable device, as shown in Fig. 13.10 [15]. Xilinx provides similar GT functionality, with the key difference that monolithic integration is used to add the analog high-speed functionality, in contrast with Intel's disaggregation strategy. This exemplifies how the old trend of monolithic die integration will continue to be weighed against disaggregated package integration over the next decade; the merit of each solution will depend on a number of factors, including in-house expertise and agility to market, as discussed further in the next subsection.

Fig. 13.9

EMIB or interposer 2.5D advanced package connectivity

Fig. 13.10

Intel Agilex FPGA with EMIB package connectivity ([15], ©Intel, 2019)

13.3.2 Upcoming Heterogenous Integration Trends and Programmable Devices

Integration in the package is not new; OEMs have used Multi-Chip Modules (MCMs) to integrate several chips in a module for years. Two new dynamics, however, raise the importance of SiP devices: rising fabrication costs and advances in packaging technology. In this subsection we identify some of these trends in the context of programmable logic. Heterogeneous integration refers to the integration of separately manufactured components into a higher-level System-in-Package (SiP) assembly that, in the aggregate, provides enhanced functionality and improved operating characteristics. There are many examples of heterogeneous integration through SiP today, as described in the previous subsection. Heterogeneous integration is initiating a new era of technological and scientific advances to continue and complement the progression of Moore’s Law scaling into the distant future, and packaging, from system packaging to device packaging, will form the vanguard of this advance.

There is a wide range of heterogeneous integration technologies for both serial and parallel connectivity between chiplets. We anticipate standards evolving around both Ultra-Short Reach (USR) serial and parallel (e.g., HBM-like) die-to-die interfaces. In addition to PHY-layer standards, higher-level data protocols, such as AMBA AXI, are essential to any application. The key enabling metrics are energy efficiency and aggregate throughput for data movement between chiplets. Figure 13.11 shows the energy efficiency of existing and emerging solutions, approximated as oval regions. There is a two-orders-of-magnitude power gap between today's PCB-level solutions (such as Xilinx GTY or HMC) and monolithic implementations of data movement, shown as gray ovals. The blue ovals show emerging package-level connectivity based on recently published work; some of these SiP solutions close the power gap by more than an order of magnitude. On the other hand, a monolithic implementation of global data movement between heterogeneous blocks requires additional overhead such as shims or synchronization modules, increasing its energy. This provides an opportunity to approach monolithic energy efficiency, especially when domain optimizations are leveraged at the software level. Another take-away from Fig. 13.11 is that MCM packaging technology, with its coarse wire pitch, requires operating on the right side of the chart in terms of per-wire throughput. In contrast, the fine micro-bump pitch of interposer technology (such as that of HBM or EMIB) allows a larger number of interface wires running at lower frequency, as shown on the left side of the figure.

Fig. 13.11

SiP connectivity landscape for energy efficiency

Aggregating multiple existing serial interfaces to amortize PLL power reduces energy consumption. Removing the Clock Data Recovery (CDR) blocks often used in serial interfaces, in favor of source-synchronous data transmission, offers another degree of power reduction; this is reasonable because the distance between chiplets within a package is short. Assuming a bump pitch of 130–150 μm for the substrate, we can fit around 44–59 bumps in 1 mm² of silicon area. Reserving about 10 bumps for power and source-synchronous clocking, we can deliver 1 Tb/s of bandwidth if each remaining wire carries data at a rate above 20 Gb/s. This translates to differential Gigabit Transceivers (GTs) running in the range of 40–56 Gb/s, and GT blocks offering these rates are emerging. Published literature introduces a USR IP in 28 nm demonstrating less than 1 pJ/bit on a test chip [16]. This USR interfacing approach adds a CNRZ-5 coding layer on top of a bundle of aggregated GTs; the authors claim the coding may contribute up to 2X of the power reduction. The recent JESD247 standard is based on this IP [17]. A group of other low-power GT efforts centers on extensions of the OIF standard, which is mostly focused on optical and photonic connectivity; published test chips claim a little over 2 pJ/bit for such interfacing without any special coding [18].
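The bump and bandwidth arithmetic above can be reproduced with a short calculation (a square bump grid is assumed; illustrative only):

for pitch_um in (130, 150):
    bumps_per_mm2 = (1000 / pitch_um) ** 2           # square-grid assumption
    data_bumps = bumps_per_mm2 - 10                  # reserve ~10 for power/clock
    gbps_per_wire = 1000 / data_bumps                # Gb/s per wire for 1 Tb/s
    print(f"{pitch_um} um pitch: ~{bumps_per_mm2:.0f} bumps/mm^2, "
          f"~{gbps_per_wire:.0f} Gb/s per wire, "
          f"~{2 * gbps_per_wire:.0f} Gb/s per differential pair")
# Roughly matches the 40-56 Gb/s differential-GT range quoted above.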

Both parallel and serial interfacing are viable technology trends that will enable inter-die connectivity in the coming years. The key enabling factor will be an ecosystem or marketplace for chiplets, which will follow standardization of these interfaces; initiatives such as the DARPA CHIPS program and the Open Compute Project (OCP) ODSA effort are pushing in this direction. FPGAs can play an important role in this new paradigm. Moore's law has significantly reduced the traditional overhead of programmability in FPGAs, which is attributed to LUTs and interconnect; in modern FPGAs, which include many heterogeneous blocks such as processors, memory, and high-speed I/O (as explained in the previous section), the new programming overhead is instead the unused blocks on a highly integrated device. A programmable fabric chiplet alongside domain-specific chiplets in an SiP enables a wide range of applications, extending the fast time-to-market and customization benefits of FPGAs to this new paradigm. Future package connectivity was classified by power targets in a recent keynote at Hot Interconnects 2019: for 2.5D technologies, the speaker envisioned 1 pJ/bit for organic substrates (achievable today) and 0.3 pJ/bit for interposer or EMIB connectivity, and he estimated 0.15 pJ/bit for the SiP connectivity he referred to as 3D [19]. This approaches the power of long-distance wires within a monolithically integrated chip. The Heterogeneous Integration Roadmap [20] anticipates 3D interconnect with micro-bump pitch below 10 μm becoming available in 10–15 years; FPGAs will benefit significantly from such technology, since the regular, repeatable patterns in the fabric can leverage such dense connectivity.
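These pJ/bit targets translate directly into interface power; for example, at 1 Tb/s of die-to-die traffic:

TRAFFIC_BITS_PER_S = 1e12                          # 1 Tb/s
for label, pj_per_bit in [("organic substrate (2.5D)", 1.0),
                          ("interposer / EMIB (2.5D)", 0.3),
                          ("3D stacking", 0.15)]:
    watts = pj_per_bit * 1e-12 * TRAFFIC_BITS_PER_S   # energy/bit * bits/s
    print(f"{label}: {pj_per_bit} pJ/bit -> {watts:.2f} W at 1 Tb/s")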

13.4 Software Implication and Trends

A discussion of programmable devices would not be complete without an understanding of the software design flow. The traditional design process for FPGAs transforms a design from the preferred design entry format into a configuration bitstream that can be downloaded into the device. This process consists of a sequence of major steps (a scripted sketch of the flow follows the list):

  1. Synthesizing the design into the fundamental architecture blocks such as LUTs and flip-flops;

  2. Placing and routing those blocks under the given timing and area constraints; and

  3. Generating a configuration bitstream to program the device.
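As an illustration, the three steps can be scripted end to end; the sketch below assumes the Vivado batch Tcl interface, and the file names (top.v, constraints.xdc, top.bit) and target part are placeholders rather than a real project.

import textwrap

# Placeholder flow script: the part number and file names must be filled in
# for a real device and design before this Tcl can run.
tcl = textwrap.dedent("""\
    read_verilog top.v
    read_xdc constraints.xdc
    synth_design -top top -part <target_part>   ;# step 1: map RTL to LUTs/FFs
    place_design                                ;# step 2: placement ...
    route_design                                ;#         ... and routing
    write_bitstream -force top.bit              ;# step 3: configuration bitstream
""")

with open("run.tcl", "w") as f:
    f.write(tcl)

print("launch with: vivado -mode batch -source run.tcl")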

The goal of this section is not an in-depth discussion of these tools. Instead, we highlight a few meta-level trends of recent years and look toward the future. FPGA devices started capturing market share by replacing ASICs, as explained in Sect. 13.1, so FPGA CAD tools began as EDA-like tools with one significant difference: reduced cost. FPGA companies built their own CAD tools for configuring their devices and offered them to customers at a heavily subsidized price, in contrast with ASIC tools, which came from third parties and often at much higher cost.

FPGA capacity and complexity grew rapidly, and in response the design entry abstraction was raised to maintain productivity. This trend, shown in Fig. 13.12, occurred through a combination of organic growth and acquisition of third-party tool providers. The figure shows how design entry abstraction was raised from schematic entry in the 1990s to RTL entry in the last decade. Today, it is possible to use high-level programming languages such as C and Python as a method of design entry for FPGAs. The most recent Xilinx announcement is a unified software platform, called Vitis [21], that enables the development of embedded software and accelerated applications on heterogeneous Xilinx platforms including FPGAs, SoCs, and Versal ACAPs. Vitis enables integration with high-level frameworks, supports development in C, C++, or Python, and is available free of charge.

Fig. 13.12

Raising the abstraction of design entry for programmable devices

The prominent trend for programmable devices is to cater to software programmers. The programming model of these devices uniquely positions them between ASIC hardware platforms (spatial design) and CPUs (temporal programming). Software programming models have closely tracked the evolution of processor architecture, moving from single-core machines with central memory towards multi-core, domain-specific accelerators. As a result, FPGA tools need to move towards what software developers expect while improving productivity. One clear architectural step in this direction is the new CGRA-like components described in Sect. 13.2.4: the AI Engines can be programmed using C/C++, like other software programming platforms. AI Engine simulation can be functional or cycle-accurate using an x86-based simulation environment, and for system-level simulation a SystemC virtual platform is available that supports both the AI Engines and the traditional Arm-based processors (scalar engines) integrated on the chip.

In addition to adding software-friendly components, the software for the fabric is expected to move in the same direction. Today's barrier to entry into new markets such as HPC and compute is not the hardware limitations of programmable devices, but software productivity and the efficiency of new methods of design entry. Software developers expect a better user experience, such as faster compile times for the backend (see Fig. 13.12) and more flexibility to build their own flows. The tools for programmable devices are likely to address these issues through two approaches: domain-specific overlays and open source. Post-bitstream programmable, domain-specific soft overlays and pre-implemented shells enable fast compilation to the fabric. FPGA designers familiar with hardware will design these overlays, enabling domain programmers to leverage customized memory and interconnect architectures without being FPGA design experts. Overlays offer user programmability within a given domain; scaling this concept to more domains, however, requires new tools and an active ecosystem of new domain experts.

An efficient way to enable an ecosystem of FPGA domain compilers is to tap into the open source dynamics of the software community. The open source movement relies on free software to stimulate innovation and progress; software development has become significantly more complex than hardware design, and open source has served as its analogue of Moore's law (of technology scaling). FPGA tools are also expected to move towards open source in the next decade. For example, RapidWright, an open source platform, was recently introduced to provide a gateway to Xilinx's back-end implementation tools [22]. The goal is to raise the implementation abstraction in the same way the design entry abstractions were raised, while maintaining the full potential of advanced FPGA silicon. Such a framework can help build domain-specific backend tools in two ways: (1) creating highly optimized overlays and shells, and (2) enabling domain-specific compilers. The best opportunity for domain design tasks lies with domain application architects, and the path to automation requires a domain-specific front-end compiler. This compiler may be an LLVM dataflow graph parser that automatically identifies domain operators with high replication, as depicted by the application examples in domains 1 and 2 of Fig. 13.13. Open dataflow and HLS compilers for these domains may be built by the community or become available as more of the free Vitis framework is opened. We anticipate great interest in maximizing existing FPGA silicon performance in the age of domain-specific compute, and RapidWright or similar open source frameworks are likely to enable a significant part of that journey in the next decade.
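As a conceptual sketch of that front-end analysis (hypothetical code, not part of RapidWright or Vitis), the following Python fragment counts operator replication in a toy dataflow graph and flags the heavily replicated operators as overlay candidates:

from collections import Counter

# A toy dataflow graph: (node_id, operator) pairs for some domain application.
dataflow_nodes = [
    (0, "conv2d"), (1, "conv2d"), (2, "relu"), (3, "conv2d"),
    (4, "relu"), (5, "matmul"), (6, "conv2d"), (7, "softmax"),
]

op_counts = Counter(op for _, op in dataflow_nodes)
threshold = 3   # "high replication" cutoff, chosen arbitrarily for the example

overlay_candidates = [op for op, n in op_counts.items() if n >= threshold]
print(op_counts)            # Counter({'conv2d': 4, 'relu': 2, ...})
print(overlay_candidates)   # ['conv2d'] -> a candidate for an overlay block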

Fig. 13.13

Leveraging the community and open source towards domain-specific compilers

13.5 Concluding Remarks

Field Programmable Gate Arrays (FPGAs) are integrated circuits designed to be configured by the customer after manufacturing. Initially they contained an array of programmable logic blocks and a hierarchy of reconfigurable interconnect that allows the blocks to communicate. Reconfiguration and hardware customization were the key differentiating attributes that let FPGAs expand their market by replacing custom-logic ASICs. Riding Moore's law, FPGAs grew by more than four orders of magnitude in capacity and captured a significant portion of the communications domain in addition to ASIC replacement. Today, the FPGA market exceeds $5 billion and is still growing by penetrating domains such as machine learning and datacenter networking.

We highlighted some of the key features that enable programmable devices to enter the compute domain as accelerators, and summarized how Xilinx is addressing current semiconductor technology, cost, and scalability challenges with the new 7 nm ACAP compute platform. The Versal architecture tightly integrates programmable fabric, CPUs, and software-programmable acceleration engines into a single device, enabling higher levels of software abstraction and more rapid development of hardware accelerators that solve next-generation problems. Such high-level integration is a direct result of process technology advancements, and the complex products introduced to the market in the next few years will be a testament to the success of this trend.

The slowdown of Moore's law and fabrication cost pressures are also setting a disaggregation trend for the semiconductor industry. Programmable devices with multiple dice in the package have already been introduced to the market for yield improvement and heterogeneous integration. We believe this approach will continue, and we provided guidelines and insight into how programmable devices will add value to the SiP devices of the next decade. Finally, we summarized the automation software trends and the steps required to prepare programmable spatial compute devices for software programmers. FPGAs are likely to be among the most pivotal components in the age of domain-specific compute, and open source, efficient tools will be developed for them to deliver a software-friendly customer experience despite the highly complex functionality of the latest devices.