
1 Introduction

Field Programmable Gate Array (FPGA) technologies have long been recognised for their ability to enable very high-performance realisations of computationally demanding, highly parallel operations beyond the capability of other embedded processing technologies. Recent generations of FPGA have seen a rapid increase in this computational capacity and the emergence of System-on-Chip FPGAs (SoC-FPGAs), incorporating heterogeneous multicore processors alongside FPGA programmable fabric. A key motivation for these hybrid architectures is the ability of the FPGA to host performance-critical operations, offloaded from the processors, as application-specific accelerators offering any combination of high performance, low cost and high energy efficiency.

The resources with which accelerators may be built are enormous: every second, the designer has access to trillions of multiply-accumulate operations via on-chip DSP units [3, 30] and memory locations in Block RAM (BRAM) [3, 31], alongside the computationally powerful and highly flexible Look-Up Table (LUT) FPGA programmable logic [17]. For instance, the Virtex®-7 family of Xilinx FPGAs offers up to \(7 \times 10^{12}\) multiply-accumulate (MAC) operations per second and \(40 \times 10^{12}\) bits/s memory access rates.

To combine these resources into accelerators of highest performance or lowest cost, though, requires manual design of custom circuit architectures at Register Transfer Level (RTL) in a hardware design language. This is a low level of design abstraction which imposes a heavy design burden, significantly more complicated than describing behaviour in a software programming language. Hence, for many years designers have sought ways to realise accelerators more rapidly without suffering critical performance or cost bottlenecks. Software-programmable 'soft' processors are one way to do so, but at present adopting such an approach demands substantial compromise on performance and cost. Soft processors allow their architecture to be tuned before synthesis to improve the performance and cost of the final result. Soft general-purpose processors such as MicroBlaze [32] and Nios-II [2] are performance-limited, and a series of approaches attempt to resolve this issue. One approach uses soft vector coprocessors [9, 24, 33, 34] employing either assembly-level [34] or mixed C-macro and inline assembly programming. These enable performance increases of orders of magnitude beyond Nios-II and MIPS [34], but performance and cost still lag custom circuits. An alternative approach is to redesign the architecture of the central processor for performance/cost benefit, an approach adopted in the iDEA [8] processor. Multicore architectures incorporating up to 16 [12, 22, 25] or even 100 processors [12] have also been proposed.

However, in all of these approaches the cost of enabling software programmability is a reduction in the performance or efficiency of the resulting accelerators relative to custom circuit solutions. The performance of these architectures remains only marginally beyond that of conventional software-programmable devices, and there is no evidence that they are competitive with custom circuits. If FPGA soft processors are to be a viable alternative to custom accelerators, performance and cost must improve radically.

2 The FPGA-Based Processing Element (FPE)

A unique, lean soft processor, the FPGA Processing Element (FPE), is proposed to resolve this deficiency. The architecture of the FPE is shown in Fig. 1. It contains only the minimum set of resources required for programmability: instructions pointed to by the Program Counter (PC) are loaded from Program Memory (PM) and decoded by the Instruction Decoder (ID). Data operands are read either from the Register File (RF) or, in the case of immediate data, from Immediate Memory (IMM), and are processed by the ALU (implemented using a Xilinx DSP48e). In addition, a Data Memory (DM) is used for bulk data storage and a Communication Adapter (COMM) performs on/off-FPE communications.

Fig. 1 The FPGA processing element

The FPE is soft and hence configurable, allowing its architecture to be customised pre-synthesis in terms of the aspects listed in Table 1(a). Beyond these, custom coprocessors can be integrated alongside the ALU to accelerate specific custom instructions. Of course, the FPE is also programmable, with the instruction set described in Table 1(b).

Table 1 FPE parameters and instructions

When implemented on a Xilinx Virtex 5 VLX110T FPGA, a 16-bit Real FPE costs 90 LUTs and 1 DSP48e and enables \(483 \times 10^{6}\) multiply-add operations per second. This represents around 18% of the resources of a conventional MicroBlaze processor, whilst increasing performance by a factor of 2.8.

The FPE’s low cost allows it to be combined in very large numbers on a single FPGA, to realise operations via multicore architectures, with communication between FPEs via point-to-point queues. Hence the FPE may be viewed as a fundamental building block for realising computationally demanding operations on FPGA.

To do so efficiently, the FPE should be able to exploit all the different types of parallelism in a program or application. Task parallelism is exploited in the multicore architectures proposed, but using these to realise data-parallel operations is inefficient, due to the duplication of control logic, data and memory resources: each FPE contains the same instructions in its PM, accesses RF in the same order and executes the same program. Considerable overhead is incurred when control resource is duplicated for every FPE. To avoid this, the FPE is further extended into a configurable SIMD processor component, as illustrated in Fig. 2.

Fig. 2 SIMD processor architecture

The width of the SIMD is configurable via a new parameter, SIMDways, which dictates the number of datapath lanes. All of the FPE instructions (except BEQ, BGT and BLT) can be used as SIMD instructions.

3 Case Study: Sphere Decoding for MIMO Communications

To illustrate the use of FPE-based multicores for FPGA accelerators, a case study, Sphere Decoding (SD) for Multiple-Input, Multiple-Output (MIMO) communications systems, is used. MIMO systems employ multiple transmit and multiple receive channels [26] to enable unprecedented data rates, prompting their adoption in standards such as 802.11n [14]. An M-element array of transmit antennas emits a vector \(\mathbf {s} \in \mathbb {C}^{M}\) of QAM-modulated symbols. The vector of symbols \(\mathbf {y}\in \mathbb {C}^{N}\) received at an N-element array of antennas is related to s by:

$$\displaystyle \begin{aligned} \mathbf{y} = \mathbf{Hs} + \mathbf{v}, \end{aligned} $$
(1)

where \(\mathbf {H} \in \mathbb {C}^{N \times M}\) represents the MIMO channel, typically treated as a parallel set of flat-fading subchannels via Orthogonal Frequency Division Multiplexing (OFDM) (108 subchannels in the case of 802.11n), and \(\mathbf {v} \in \mathbb {C}^N\) is additive noise. Sphere Decoding (SD) is used to derive an estimate \(\hat {\mathbf {s}}\) of s. It offers performance near that of the ideal Maximum Likelihood (ML) detector, with significantly reduced complexity [20, 23]. The Fixed-Complexity SD (FSD) employs a particularly low-complexity, two-stage deterministic process which makes it ideal for efficient realisation via an FPGA accelerator [5]. The two-stage detection process is illustrated in Fig. 3a.
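For concreteness, the system model of (1) can be sketched in a few lines of NumPy. The 4 × 4 16-QAM configuration, Rayleigh channel model and noise level below are illustrative assumptions, not parameters fixed by the text.

  import numpy as np

  # Minimal sketch of the system model in Eq. (1) for a hypothetical
  # 4 x 4 16-QAM configuration (M = N = 4).
  M, N = 4, 4
  qam16 = np.array([a + 1j * b for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)])

  rng = np.random.default_rng(0)
  s = rng.choice(qam16, size=M)                                # transmitted symbols
  H = (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))) / np.sqrt(2)
  v = 0.1 * (rng.normal(size=N) + 1j * rng.normal(size=N))     # additive noise
  y = H @ s + v                                                # Eq. (1)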

Fig. 3 FSD algorithm components. (a) FSD tree structure. (b) General form of H

Algorithm 1 SQRD for FSD

Pre-Processing (PP) orders the symbols of y according to the perceived distortion experienced by each. This is achieved by reordering the columns of H, giving a permuted channel matrix whose general form is illustrated in Fig. 3b. Practically, the ordering is derived via an iterative Sorted QR Decomposition (SQRD), described in Algorithm 1 [11].

SQRD-based PP ordering for FSD transforms the input channel matrix H into the product of a unitary matrix Q and an upper-triangular R via QR decomposition, whilst deriving order, the order of detection of the received symbols during MCS. It operates in two phases, as described in Algorithm 1. In Phase 1, Q, R, order, norm and nfs are initialised as shown in lines 2–5 of Algorithm 1, where \(q_i\) denotes the ith column of Q. Phase 2 comprises M iterations, in each of which the index k of the lowest remaining entry in norm is identified (lines 9 and 10) before the corresponding column of R and the elements of order and norm are permuted with the ith (line 11) and the remaining columns are orthogonalised (lines 12–18). The resulting Q, R and order are used for Metric Calculation and Sorting (MCS), as defined in (3) and (4).
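The structure of Algorithm 1 can be expressed as a brief behavioural sketch, given below. Floating-point NumPy is used purely for clarity (the FPE realisation is fixed-point), and the line-level details of Algorithm 1 are paraphrased rather than reproduced; the minimum-norm pivot follows the standard SQRD formulation [11].

  import numpy as np

  def sqrd(H):
      """Sorted QR decomposition: behavioural sketch of Algorithm 1.

      Returns Q (unitary columns), R (upper-triangular) and the detection
      order, such that H[:, order] = Q @ R.
      """
      N, M = H.shape
      Q = H.astype(complex)                         # working copy of H
      R = np.zeros((M, M), dtype=complex)
      order = np.arange(M)
      norms = np.sum(np.abs(Q) ** 2, axis=0)        # Phase 1: column norms

      for i in range(M):                            # Phase 2: M iterations
          k = i + np.argmin(norms[i:])              # weakest remaining column
          Q[:, [i, k]] = Q[:, [k, i]]               # permute with the ith
          R[:, [i, k]] = R[:, [k, i]]
          order[[i, k]] = order[[k, i]]
          norms[[i, k]] = norms[[k, i]]
          R[i, i] = np.sqrt(norms[i])               # square root required
          Q[:, i] /= R[i, i]                        # division required
          for j in range(i + 1, M):                 # orthogonalise remainder
              R[i, j] = Q[:, i].conj() @ Q[:, j]
              Q[:, j] -= R[i, j] * Q[:, i]
              norms[j] -= np.abs(R[i, j]) ** 2
      return Q, R, order

Note that each iteration requires one square root and a reciprocal-style division, the two operations which motivate the coprocessors of Sect. 4.1.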

Metric Calculation and Sorting uses an M-level decode tree to perform a Euclidean-distance-based statistical estimation of s. Groups of M symbols undergo detection via the tree-search structure illustrated in Fig. 3a.

The number of nodes at each tree level is given by \(\mathbf{n}_S = (n_1, n_2, \ldots, n_M)^T\). The first nfs levels process the symbols from the worst-distorted paths by Full Search (FS) enumeration of all elements of the search space. This results in P child nodes at level i + 1 per node at level i, where P is the number of QAM constellation points. For full diversity, nfs is given by

$$\displaystyle \begin{aligned} nfs = \lceil \sqrt{M}-1\rceil. \end{aligned} $$
(2)

The remaining nss = M − nfs levels undergo Single Search (SS), in which only a single candidate detected symbol is maintained between levels; for M = 4, (2) gives nfs = 1 and hence nss = 3. At each MCS tree level, (3) and (4) are performed.

$$\displaystyle \begin{aligned} \tilde{s}_i = \hat{s}_{ZF,i} - \sum_{j = i + 1}^{M} \frac{r_{ij}}{r_{ii}} \left( \hat{s}_{ZF,j} - \hat{s}_j \right) \end{aligned} $$
(3)
$$\displaystyle \begin{aligned} d_i = \sum_{j=i}^{M} r_{ij}^2 \left\| \hat{s}_{ZF,j} - \hat{s}_j \right\|{}^2, \qquad D_i = d_i + D_{i+1} \end{aligned} $$
(4)

In (3) and (4), \(r_{ij}\) refers to an entry in R, derived by QR decomposition of H during PP, \(\hat{s}_{ZF}\) is the centre of the FSD sphere and \(\tilde{s}_j\) is the jth detected symbol, which is sliced to \(\hat{s}_j\) in subsequent iterations of the detection process [13]. Since \(D_{i+1}\) can be considered the Accumulated Partial Euclidean Distance (APED) at level i + 1 of the MCS tree and \(d_i\) the PED at level i, the APED is obtained by recursively applying (4) from level i = M down to i = 1. The resulting candidate symbols are sorted by their Euclidean distance metrics, and the final result is produced post-sorting.
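The per-level computation of (3) and (4) can be sketched as follows. The slicing helper is a hypothetical stand-in for the FSD slicing step (in an FS level, \(\hat{s}_i\) would instead be enumerated over all P constellation points), \(r_{ij}^2\) is interpreted as \(|r_{ij}|^2\) for complex R, and floating-point arithmetic replaces the FPE's 16-bit fixed point; indices are 0-based.

  import numpy as np

  QAM_AXIS = np.array([-3.0, -1.0, 1.0, 3.0])      # illustrative 16-QAM axis

  def slice_symbol(x):
      """Hypothetical slicing helper: nearest constellation point."""
      near = lambda v: QAM_AXIS[np.argmin(np.abs(QAM_AXIS - v))]
      return near(x.real) + 1j * near(x.imag)

  def mcs_level(R, s_zf, s_hat, i, D_next):
      """One MCS tree level: Eqs. (3) and (4).

      s_hat holds candidate symbols already fixed at levels i+1..M-1;
      D_next is the APED D_{i+1} accumulated so far.
      """
      M = R.shape[0]
      s_tilde = s_zf[i] - sum(R[i, j] / R[i, i] * (s_zf[j] - s_hat[j])
                              for j in range(i + 1, M))          # Eq. (3)
      s_hat[i] = slice_symbol(s_tilde)
      d_i = sum(abs(R[i, j]) ** 2 * abs(s_zf[j] - s_hat[j]) ** 2
                for j in range(i, M))                            # Eq. (4)
      return s_hat, d_i + D_next                                 # D_i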

This behaviour is repeated independently for each OFDM subcarrier, of which there are 108 in 4 × 4 16-QAM 802.11n MIMO, and for real-time operation it must complete within 4 μs to sustain a rate of 480 Mbps. These are challenging requirements which have seen detection using custom circuit accelerators become a well-studied real-time implementation problem [4, 7, 15, 16, 21, 27]. It is notable that none of these uses software-programmable accelerator components. This section considers the use of the FPE to realise such a solution.

4 FPE-Based Pre-processing Using SQRD

The SQRD preprocessing technique has low complexity relative to other, ideal preprocessing approaches; as a result of its reliance on QRD it is also numerically stable and lends itself well to fixed-point implementation, making it suitable for realisation on FPGA. However, two major issues must be resolved to enable FPE-based SQRD PP for 4 × 4 802.11n. First, its computational complexity remains high, as outlined in Table 2; given the capabilities of a single FPE, a large-scale multi-FPE architecture appears necessary. Second, its reliance on square root and division operations presents a challenge, since these operations are not native to the DSP48e components used as the FPE datapath and exhibit low performance when realised thereon [19].

Table 2 4 × 4 SQRD operational complexity

To avoid this performance bottleneck, datapath coprocessors are considered to enable real-time division and square-root operations.

4.1 FPE Coprocessors for Arithmetic Acceleration

Non-restoring 16-bit division [19] requires 312 cycles when implemented using only the DSP48e in a 16R FPE. This equates to approximately \(1.2 \times 10^{6}\) divisions per second (div/s). Hence, around 100 FPEs would be required to realise the \(120 \times 10^{6}\) divisions per second (MDiv/s) demanded by 4 × 4 SQRD for 802.11n. The high resource cost this would entail can be alleviated by adding radix-2 or radix-4 non-restoring division coprocessors [19] alongside the DSP48e in the FPE ALU (Fig. 4).
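The sizing arithmetic can be checked in a few lines of Python. The clock rate used is a hypothetical figure implied by the quoted 312 cycles per division and roughly 1.2 million div/s; it is not a value stated in the text.

  import math

  cycles_per_div = 312
  clk_hz = 375e6                                    # assumed FPE clock
  divs_per_fpe = clk_hz / cycles_per_div            # ~1.2e6 div/s per FPE
  required = 120e6                                  # 120 MDiv/s for 4 x 4 SQRD
  print(math.ceil(required / divs_per_fpe))         # -> 100 FPEs, no coprocessors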

Fig. 4 FPE division coprocessor

The performance, cost and efficiency (in terms of throughput per LUT, or TP/LUT) of the FPE on Virtex 5 FPGA are described in Table 3, for division realised in software using the DSP48e only (FPE-P) and with radix-2 or radix-4 coprocessors added alongside the DSP48e (FPE-R2 and FPE-R4 respectively). The FPE-R2 and FPE-R4 solutions increase throughput by factors of 8.9 and 13.3 respectively, and hence increase hardware efficiency by factors of 9.4 and 10.7 as compared to FPE-P. Since 4 × 4 802.11n MIMO requires 120 MDiv/s for SQRD-based preprocessing, the implied cost and performance metrics of each option are also summarised in Table 3. According to these estimates, FPE-R2 represents the lowest-cost real-time solution, enabling a 93.4% reduction in resource cost relative to FPE-P, and is adopted in the FPE-based SQRD implementation.

Table 3 SQRD division implementations

To realise the \(120 \times 10^{6}\) square root operations required per second (MSQRT/s), performance and cost estimates for software execution on the FPE using the pencil-and-paper method [19] (FPE-P) and for a CORDIC coprocessor [28] (FPE-C) are compared in Table 4(a). The coprocessor-based FPE-C solution simultaneously increases throughput and efficiency, by factors of 23 and 10 respectively, as compared to FPE-P; the resources required to realise real-time square root for SQRD-based detection of 4 × 4 802.11n MIMO are estimated in Table 4(b). As this shows, FPE-C enables real-time performance using only 11% of the resource required by FPE-P, and is adopted for FPE-based square root operations.
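The pencil-and-paper method referred to above is the classical binary digit-by-digit square root; a behavioural sketch is given below. The bit-width and the assert values are illustrative, and the mapping to FPE instructions is not reproduced.

  def sqrt_pencil_paper(n, bits=16):
      """Binary digit-by-digit ('pencil-and-paper') integer square root.

      Each iteration retires one result bit, so a 16-bit input costs
      8 iterations of shift/compare/subtract work on the datapath.
      """
      root, rem = 0, 0
      for i in reversed(range(0, bits, 2)):
          rem = (rem << 2) | ((n >> i) & 0b11)   # bring down next bit pair
          trial = (root << 2) | 1                # trial subtrahend: 4*root + 1
          root <<= 1
          if trial <= rem:                       # trial digit is 1
              rem -= trial
              root |= 1
      return root

  assert sqrt_pencil_paper(144) == 12
  assert sqrt_pencil_paper(1000) == 31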

Table 4 FPE square root options

4.2 SQRD Using FPGA

Integrating these components into a coherent processing architecture to perform SQRD, and replicating that behaviour to provide PP for the 108 subcarriers of 802.11n MIMO, is a large-scale accelerator design challenge. Figure 5 describes the SQRD algorithm as an iterative four-task (T1, T2.1–T2.3) process. The first task, T1, conducts channel norm ordering and computes the diagonal elements of R (lines 11–13 in Algorithm 1). This is followed by T2.1–T2.3, which are independent and permute and update Q, R and norm respectively (lines 14–18 in Algorithm 1).

Fig. 5 4 × 4 SQRD

This process is realised using the 4-FPE Multiple Instruction, Multiple Data (MIMD) architecture shown in Fig. 6. All FPEs employ 16-bit datapaths and are otherwise configured as described in Table 5(a). FPE1–FPE3 permute and iteratively update Q, R and norm (T2.1–T2.3 in Fig. 5), whilst FPE4 calculates the diagonal elements of R (T1). The SQRD process executes in three phases. Initially, H and the calculation of norm are distributed amongst the FPEs, with the separate parts of norm gathered by FPE4 to undergo ordering, division and square root. The results are then distributed to the outer FPEs for permutation and update of Q, R and norm. Inter-FPE communication occurs via point-to-point FIFO links, chosen for their relatively low cost on FPGA and their implicit ability to synchronise the multi-FPE architecture in a data-driven manner whilst avoiding data access conflicts.

Fig. 6 4 × 4 SQRD mapping

Table 5 4-FPE-based SQRD

The performance and cost of the 4-FPE grouping are given in Table 5(b). According to these metrics, the throughput of each 4-FPE group is sufficient to support SQRD-based PP of three 802.11n subcarriers. To process all 108 subcarriers, the architecture is replicated 36 times, with the mapping of subcarriers to groups as described in Fig. 6.

On a Xilinx Virtex 5 VSX240T FPGA, the cost and performance of this architecture are described in Table 5(b): 32.5 MSQRD/s are achieved, in excess of the 30 MSQRD/s required for 4 × 4 802.11n MIMO.

5 FSD Tree-Search for 802.11n

Computing MCS for FSD in 4 × 4 16-QAM 802.11n is even more computationally demanding than SQRD-based preprocessing. The operational complexity is described in Table 6(a). When a single 4 × 4 16-QAM FSD MCS is implemented on a 16R FPE, the performance and cost are reported as 16R-MCS in Table 6(b).

Table 6 802.11n MCS complexity

To scale this performance to support all 108 subcarriers for 4 × 4 16-QAM 802.11n MIMO, a large-scale architecture is required. Two important observations of the application’s behaviour help guide the choice of multiprocessing architecture:

  1. The FSD MCS tree exhibits strong SIMD-like behaviour: each branch (Fig. 3a) performs an identical sequence of operations on data-parallel samples.

  2. Implementing MCS for all 108 OFDM subcarriers on a single, very wide SIMD processor would incur high signal fan-out to broadcast instructions from a central PM to a very large number of ALUs, limiting the achievable clock rate and restricting performance [10]. Hence, a collection of smaller SIMDs is used.

As described in Table 6(b), the cost of 16R-MCS is significantly higher than that of the basic 16-bit FPE described in Sect. 2 (approximately 2530 LUTs, up from 90). This large increase is due to the large PM required to house the 4591 instructions. A significant contributor to this instruction count is the comparison operations required for slicing (Eq. (3)) and sorting the PED metrics; these require branch instructions, with associated NOP operations due to the deep FPE pipeline and the lack of forwarding logic [10]. These represent wasted cycles which dramatically increase cost and reduce throughput: branch and NOP instructions account for 50.7% of the total. Optimising the FPE to reduce the impact of these branch instructions could therefore have a significant impact on MCS cost/performance.

5.1 FPE Coprocessors for Data Dependent Operations

Employing ALU coprocessors can significantly reduce these penalties. A switch coprocessor (a logical depiction of its behaviour is shown in Fig. 7a) compares the input operand to each of four constants determined pre-synthesis and selects the closest, increasing the efficiency of slicing. Similarly, a MIN coprocessor (Fig. 7b) can be used to accelerate sorting.

Fig. 7 (a) Switch coprocessor. (b) Min coprocessor

Each of these coprocessors occupies around 20 LUTs, but their ability to eliminate wasted instructions significantly reduces the PM size. This enables significant reductions in overall cost and increases in performance, as described in column 3 of Table 6(b): including these components results in a 68% reduction in resource cost and a factor of 2.3 increase in throughput. The resulting component is capable of realising FSD MCS for a single 802.11n subcarrier in real time, providing a good foundation unit for implementing MCS for all 108 subcarriers.
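The behaviour of the two coprocessors can be modelled as follows; the four constants are illustrative 16-QAM axis values, since the actual pre-synthesis constants are application-defined.

  SWITCH_CONSTANTS = (-3, -1, 1, 3)      # fixed pre-synthesis (illustrative)

  def switch_copro(x):
      """Select the pre-synthesis constant closest to the input operand,
      replacing the branch-heavy compare/slice instruction sequence."""
      return min(SWITCH_CONSTANTS, key=lambda c: abs(x - c))

  def min_copro(a, b):
      """Branch-free minimum of two operands, accelerating the PED sort."""
      return a if a <= b else b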

5.2 SIMD Implementation of 802.11n FSD MCS

To scale the FPE to realise all 108 subcarriers, a range of architectures may be used. The data-parallel operation of the subcarriers suggests a single, very wide SIMD, which would provide the most efficient realisation from the perspective of PM and control logic cost. However, as the width of an FPE SIMD unit increases beyond 16 lanes, instruction broadcast from the single central PM constrains the clock frequency and limits the achievable speedup. Hence, 16-way SIMDs are employed and FSD MCS for all 108 802.11n subcarriers is implemented on a dual-layer network of such processors, as illustrated in Fig. 8.

Fig. 8 802.11n OFDM MCS-SIMD mapping

Level 1 consists of eight SIMDs. The 802.11n subcarriers are clustered into eight groups \(G_i = \{\, j : (j-1) \bmod 8 = i \,\}\), for \(i = 0, \ldots, 7\) and \(j = 1, \ldots, 108\), where \(G_i\) is the set of subcarriers processed by SIMD i. The 16 branches of the MCS tree for each subcarrier are processed in parallel across the 16 ways of the Level 1 SIMD onto which it has been mapped. Sorting for the subcarriers implemented in each Level 1 SIMD is performed by adjacent pairs of ways in the Level 2 SIMD; given the 8 Level 1 SIMDs, the Level 2 SIMD is hence composed of 16 ways.
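Written out in code, the clustering rule distributes the subcarriers as follows; this is a check of the stated rule, not part of the implementation.

  # Subcarrier j (1..108) is handled by SIMD i = (j - 1) mod 8, so SIMDs 0-3
  # receive 14 subcarriers each and SIMDs 4-7 receive 13.
  groups = {i: [j for j in range(1, 109) if (j - 1) % 8 == i] for i in range(8)}
  assert [len(groups[i]) for i in range(8)] == [14, 14, 14, 14, 13, 13, 13, 13]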

Each FPE is configured to exploit 16-bit real-valued arithmetic [6]. All processors employ PMDepth = 128, RFDepth = 32 and DMDepth = 0, and communication between the two levels exploits 8-element FIFO queues. The Level 1 SIMDs incorporate SWITCH coprocessors to accelerate the slicing operation, whilst the Level 2 SIMDs support the MIN ALU extension to accelerate the sort operation.

The program flow for each Level 1 SIMD is illustrated in Fig. 9a. Each FPE performs a single branch of the MCS tree, with the empty parts of the program flow (representing NOP instructions) used to properly synchronise the movement of data into and out of memory.

Fig. 9 FPE branch interleaving. (a) Original FSD threads. (b) Interleaved threads

The NOP cycles represent 29% of the total instruction count but, since they are ALU idle cycles, they should preferably be eliminated. To do so, the NOP cycles in one branch can be occupied by useful, independent instructions from another; that is, the branches may be interleaved as illustrated in Fig. 9. When two branches are interleaved in this way, the proportion of wasted cycles is reduced to 4%.
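A toy model of the interleaving is sketched below: it fills the NOP slots of one instruction stream with instructions from an independent branch, ignoring the intra-branch dependency checks a real schedule must respect. The instruction names are illustrative.

  def interleave(stream_a, stream_b, nop="NOP"):
      """Fill NOP slots of stream_a with instructions from an independent
      branch, stream_b; any leftover instructions are appended."""
      out, pending = [], list(stream_b)
      for instr in stream_a:
          if instr == nop and pending:
              out.append(pending.pop(0))    # reclaim an idle ALU cycle
          else:
              out.append(instr)
      out.extend(pending)
      return out

  a = ["MUL", "NOP", "ADD", "NOP"]
  b = ["SUB", "CMP"]
  assert interleave(a, b) == ["MUL", "SUB", "ADD", "CMP"]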

On a Xilinx Virtex 5 VSX240T FPGA, this multi-SIMD architecture enables FSD-MCS for 802.11n as reported in Table 7. As this shows, it comfortably exceeds the real-time performance criteria of 802.11n.

Table 7 4 × 4 16-QAM FSD using FPE

Together with the results of the SQRD preprocessing accelerator, these MCS metrics show that the FPE can support accelerators for applications with demanding real-time requirements. By using massively parallel networks of simple processors (more than 140 in this case), FPGA can support real-time behaviour and enable solutions with resource cost comparable to custom circuits. When the PP and MCS are combined to create a full FSD detector (FPE-FSD in Table 7), the resulting architecture is the only software-defined FPGA structure to enable real-time performance for 4 × 4 16-QAM 802.11n.

6 Stream Processing for FPGA Accelerators

The FPE is a load-store architecture, supporting only register-register and immediate instructions: all non-constant operands and results access the ALU via the Register File (RF). Consider the effect of this approach for a 256-point FFT (FFT256) realised using two FPE configurations: an 8-way FPE SIMD (FPE8) and a MIMD multi-FPE composed of 8 SISD FPEs (8-FPE1). The FFT mappings and itemised ALU, communication (IPC), memory (MEM) and NOP instructions for each are shown in Fig. 10.

Fig. 10 FFT256: FPE-based 256-point FFT. (a) 8-FPE1. (b) FPE8

Figure 10 shows that the efficiency of each of these programs is low: only 52.5% and 31.8% of the cycles in 8-FPE1 and FPE8 respectively are used for ALU instructions. The resulting effect on accelerator performance and cost is clear from Table 8, which compares 8-FPE1 with the Xilinx Core Generator FFT [29] component. The FPE is not competitive with the custom circuit Xilinx FFT, which exhibits twice the performance at a fraction of the LUT cost.

Table 8 256-Point FFT performance/cost comparison

These results follow from the restriction to register-register instructions. Each FFT256 stage consumes 512 complex words. Since RF is the most resource-costly element of the FPE, buffering this volume of data requires BRAM Data Memory (DM); for these operands to be processed and results stored, a large number of loads and stores is required between BRAM and RF, increasing PM cost. Given the simplicity of the FFT butterfly operation, the overhead imposed is significant. This is compounded by the FPE's requirement to be standalone: since it must handle its own communication, further cycles are consumed transferring incoming and outgoing data between DM and COMM, reducing program efficiency still further. Finally, each of these transfers induces a latency between source and destination; as Fig. 11 illustrates, each FPE DM-RF (black) and COMM-RF (red) transfer takes eight cycles, imposing the need for NOPs.

Fig. 11 Load-store paths in the FPE

These factors combine to severely limit the efficiency of the FPE for applications such as FFT. Mitigating the effect of these overheads requires two features:

  • Direct instruction access to any combination of RF, DM and COMM as either instruction source or destination.

  • In cases where local buffering is not required, data streaming through the PE should be enabled, reducing load/store and communication cycle overhead.

6.1 Streaming Processing Elements

To support these features, a streaming FPE (sFPE) is proposed. The sFPE is still standalone, software-programmable and lean, but supports a processing approach, streaming, which diverges from the load-store FPE approach: the focus is on ensuring that data can stream into and out of operation sources and destinations, and through the ALU, without the need for load and store cycles. This streaming takes two forms:

  • Internal: between RF, DM, COMM and IMM without load-store cycles.

  • External: from input FIFOs to output FIFOs via the ALU alone.

The architecture of a SISD sFPE1 is illustrated in Fig. 12. There are three main architectural features of note.

Fig. 12 SISD sFPE architecture

  • An entire pipeline stage is dedicated to instruction decode (ID)

  • A FlexData data manager has been added which allows zero-latency access to any data source or sink.

  • Off-FPE communication has been decoupled into read (COMMGET) and write (COMMPUT) components

In the sFPE, ID and FlexData are each assigned an entire pipeline stage. The ID determines the source or destination of every instruction operand or result, with all potential sources and destinations of data incorporated in FlexData so that each can be addressed with equal latency; this flat memory architecture is unique to the sFPE. This approach removes the load/store overhead of accessing, for example, data memory or off-FPE communication: all data operands and results may be sourced from or produced to any of IMM, RF, DM or COMM with identical pipeline control and without explicit load and store cycles or instructions for DM or COMM.

To allow unbuffered streaming from input FIFOs to output FIFOs via the ALU, simultaneous read/write to external FIFOs is required, with direct access to the ALU in both directions. Decoupling the off-FPE communication into COMMGET and COMMPUT allows each to be accessed with zero latency from a single instruction; note that both reside in the same pipeline stage and hence conform to the regular dataflow pipeline maintained across the remainder of FlexData. In addition, since COMMGET, COMMPUT, DM, RF and IMM all access distinct memory resources (with separate memory banks employed within the sFPE and a FIFO per off-sFPE communication channel), there is no memory bandwidth bottleneck resulting from decoupling these accesses: all could be accessed simultaneously if needed.

6.2 Instruction Coding

To support the increased level of specialisation of the operands in each instruction, however, operand addressing must become more complicated. Generally, sFPE ALU instructions take the form:

  INSTR dest, opA, opB, opC

where INSTR is the instruction class, dest identifies the result destination and opA, opB and opC identify the source operands. The possible encodings of dest, opA, opB and opC are described in Table 9.

Table 9 ALU operand/destination instruction coding

This encoding allows any of RF, DM, COMMGET and COMMPUT to be addressed directly via the absolute addresses quoted in the sFPE instruction. Constant operands are hard-coded into the instruction, with IMM locations allocated by the assembler.

This architecture and data access strategy lead to sFPE programs which are substantially more efficient than their FPE counterparts. The numbers of instructions needed for FFT256 in the 8-sFPE and sFPE8 variants are described in Fig. 13.

Fig. 13 FFT256: sFPE implementations. (a) 8-sFPE1. (b) sFPE8

In MIMD 8-sFPE form, the total number of instructions required is 257, a decrease of around 91%. In addition, the efficiency of this realisation is now 99.6%, with only a single non-ALU instruction required for control. Similarly, sFPE8 requires 95.9% fewer instructions and operates at an efficiency of 98.4%. Given these metrics, it is reasonable to anticipate throughput increases for 8-sFPE and sFPE8 by factors of 20 and 30 respectively.

7 Streaming Block Processing

In many operations, however, addressing modes other than the simple direct approach used in the FPE are vital. Itemised instruction breakdowns for the multiplication of two 32 × 32 matrices and for Full-Search ME (FS-ME) with a 16 × 16 macroblock on a 32 × 32 search window are quoted in Fig. 14.

Fig. 14 Itemised sFPE matrix multiplication and ME operations. (a) Matrix multiplication. (b) Motion estimation

A number of points are notable. Firstly, the programs are very efficient, validating the techniques described in the previous section. However, they are extremely large: 35,375 instructions for matrix multiplication (MM) and 284,428 for FS-ME. Storing this number of instructions requires a very large PM and hence substantial FPGA resource; for FS-ME, 241 BRAMs would be required for the PM alone. These demands are a direct result of the FPE's restriction to direct addressing: in a direct addressing scheme, every operation requires an instruction, and for MM and ME this translates to a very large number of instructions.

However, both of these operations and their operand accesses are very regular, and can be captured in programs with many fewer instructions than those quoted above. Both repeat the same operation many times on small subsets of the input data at regularly-spaced memory locations. Consider, for example, block-MM of two matrices \(A \in \mathbb {R}^{m \times n}\) and \(B \in \mathbb {R}^{n \times p}\) with m = n = p = 8, performed via four 4 × 4 submatrices. Assuming that A and B are stored in contiguous memory locations in row-major order and that C is derived in row-major order, the operand memory accesses are as illustrated in Fig. 15.

Fig. 15 sFPE block matrix multiply operand addressing

To compute an element of a submatrix of C, the inner product of a four-element vector of contiguous locations in A (a row of the submatrix) and a four-element vector of elements spaced by 8 locations in B (a column of the submatrix) is formed. Afterwards, either or both of the row of A and the column of B are incremented to derive the next element of C, before operation proceeds to the next submatrix. The resulting memory accesses are highly predictable: a regular, repeated increment along the rows of A and columns of B, with periodic re-alignment to a new row of A and/or column of B, repeated multiple times before realigning for subsequent submatrices.
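The address sequences implied by Fig. 15 can be sketched as follows. The strides of 1 and 8 follow from the row-major storage assumed above; the function name and structure are illustrative.

  def inner_product_addresses(offset_a, offset_b, n=4, stride_a=1, stride_b=8):
      """Operand addresses for one element of C in the 8 x 8 block-MM:
      a row of an A submatrix (stride 1) against a column of a B
      submatrix (stride 8), both relative to per-submatrix offsets."""
      return [(offset_a + k * stride_a, offset_b + k * stride_b)
              for k in range(n)]

  # First element of the top-left submatrix of C:
  print(inner_product_addresses(0, 0))   # [(0, 0), (1, 8), (2, 16), (3, 24)]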

These patterns can be used to enable highly compact programs if two features are available: repeat-style behaviour, and the ability for a single instruction, when invoked multiple times by a repeat, to address blocks of memory at regularly-spaced locations.

7.1 Loop Execution Without Overheads

To enable low-overhead loop operation, the sFPE is augmented with the ability to perform repeat-type behaviour: the PC is managed such that, when a repeat instruction is encountered, the associated block of body statements is executed a number of times. This task is fulfilled by a PC Manager (PCM), the behaviour of which is described in Fig. 16.

Fig. 16 sFPE PCM behaviour

The PCM controls the PC update given its previous value, the instruction referenced in PM and three pieces of information: the start line S and end line E of the body statements to be repeated, and the number of repetitions N. These are encoded in an RPT instruction added to the sFPE instruction set, of the form:

  RPT N S E

The behaviour of RPT is shown in Listing 1, which dictates five repetitions of lines 2–4. Any number of repeat instructions can be nested, allowing efficient execution of loop nests with static, compile-time-known loop bounds.

Listing 1 RPT Instruction Coding

 RPT  5 2 4

   INSTR1...

   INSTR2...

   INSTR3...  

The PCM arbitrates the PC to ensure that the body statements are repeated the correct number of times and supports the construction of nested repeat operations, enacting the flowchart in Fig. 16. For an n-level nest it maintains (n + 1)-element lists of metrics, with the additional element supporting infinite repetition of the top-level program, which is considered to be an implicit infinite repeat instruction. For layer i of the loop nest, the start line, end line and number of repetitions are stored in element i + 1 of the lists s, e and n respectively; in all cases \(s_0 = 0\), \(e_0 = \infty\) and \(n_0 = \infty\), representing the start line, end line and number of repetitions of the top-level program. Every time a repeat instruction is encountered, the current index i into s, e and n is incremented and the values of the new element are initialised using the S, E and N of the decoded instruction. Regular PC updating then proceeds until either another repeat instruction is detected or \(e_i\) is encountered. In the latter case, the number of remaining iterations \(n_i\) is decremented or, if \(n_i = 0\), all iterations of the current repeat statement have been completed and control of the loop nest reverts to the previous level.
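A software model of this PC management is sketched below; it omits the implicit infinite top-level repeat and uses 0-based line numbers, but reproduces the push/decrement/pop behaviour of the flowchart.

  def run_with_rpt(program):
      """Model of nested RPT handling by the PCM (Fig. 16). program holds
      ("RPT", N, S, E) tuples or plain instruction strings; S and E are
      0-based line numbers of the repeated body."""
      trace, pc, stack = [], 0, []          # stack: [start, end, remaining]
      while pc < len(program):
          instr = program[pc]
          if isinstance(instr, tuple) and instr[0] == "RPT":
              _, n, s, e = instr
              stack.append([s, e, n - 1])   # first pass is about to execute
              pc = s
              continue
          trace.append(instr)
          if stack and pc == stack[-1][1]:  # reached end line of current body
              if stack[-1][2] > 0:
                  stack[-1][2] -= 1
                  pc = stack[-1][0]         # loop back to start line
                  continue
              stack.pop()                   # all repetitions complete
          pc += 1
      return trace

  # Listing 1, rewritten with 0-based line numbers:
  prog = [("RPT", 5, 1, 3), "INSTR1", "INSTR2", "INSTR3"]
  assert run_with_rpt(prog) == ["INSTR1", "INSTR2", "INSTR3"] * 5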

The PCM component requires 36 LUTs and hence imposes a relatively high resource cost as compared to the FPE. This can be controlled by compile-time customisation via the parameters listed in Table 10.

Table 10 PC configuration parameters

The pcm_en parameter is a Boolean which dictates whether the PCM is included. When it is, the maximum depth of loop nest is configurable via a second parameter (Table 10) which can, hypothetically, take any integer value. As such, the PCM may be included or excluded, imposing no cost when it is not required; further, when it is included, its cost can be tuned to the application at hand by adjusting the maximum depth of loop nest.

7.2 Block Data Memory Access

Enabling block memory access requires three important capabilities:

  • Auto-increment with any constant stride

  • Manual increment with any stride

  • Custom offset

The need for each of these is evident in MM: auto-increment traverses rows and columns with a fixed memory stride and, since there are many such operations, eliminating the need for an individual instruction for each reduces the overall instruction count considerably. Manual increment is required for movement between rows/columns, whilst the custom offset identifies the starting point for the increments, such as the first element of a submatrix.

A Block Memory Manager (BMM) is incorporated in the sFPE FlexData, as illustrated in Fig. 17a, to enable these capabilities. The BMM arbitrates access to DM via Read Pointers (RPs) and Write Pointers (WPs); the architecture of FlexData and of a pointer is illustrated in Fig. 17b.

Fig. 17 sFPE block memory management elements. (a) sFPE FlexData. (b) Pointer architecture

Each pointer controls access to a subset (block) of the sFPE DM and addresses individual elements of that block via a combination of two subaddress elements: a base and an offset. The offset selects the root element of the block, whilst the base iterates over elements relative to the offset.

Pointers operate in one of three modes: the base auto-increments, the base is incremented by explicit instruction, or the offset is updated by explicit instruction. All three modes are supported under the control of the set, inc and data interfaces. In block-MM, for example, the offset selects the root data element of the submatrices of A, B and C, with the base added to address elements relative to the offset. The base is updated via two mechanisms under the control of inc: the first auto-increments by a value (s_stride in Fig. 17b) set as a constant at synthesis time, whilst manual incrementing is achieved via c_stride, which is defined at run-time. Finally, when an update of the offset is required, data is accepted on assertion of set. To allow the absolute minimum cost for any operation, the sFPE FlexData, BMM and pointer components are configurable via the parameters in Table 11.
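A behavioural model of a single pointer is sketched below. The assumption that asserting set also resets the base is illustrative, as the text does not specify this.

  class Pointer:
      """Model of a BMM pointer: DM address = offset + base, with a
      synthesis-time auto-increment stride (s_stride) and a run-time
      manual stride (c_stride)."""

      def __init__(self, s_stride):
          self.s_stride = s_stride          # fixed at synthesis
          self.base = 0
          self.offset = 0

      def address(self, auto=True):
          addr = self.offset + self.base
          if auto:
              self.base += self.s_stride    # mode 1: auto-increment
          return addr

      def inc(self, c_stride):              # mode 2: explicit base increment
          self.base += c_stride

      def set_offset(self, value):          # mode 3: new offset on 'set'
          self.offset = value
          self.base = 0                     # assumption: base resets with offset

  # Walking a column of B in the Fig. 15 example (stride 8 per access):
  rp = Pointer(s_stride=8)
  print([rp.address() for _ in range(4)])   # [0, 8, 16, 24]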

Table 11 BMM configuration parameters

It is notable that the addressing mode is now a configuration parameter of the sFPE, with direct and block modes supported. In direct mode the BMM is absent, whilst in block mode it is included; in that case, cost can be minimised via control of the numbers of read and write pointers, n_rptrs and n_wptrs. Finally, the auto-increment stride s_stride for each pointer is fixed at the point of synthesis.

To support custom increment of the base and offset of each pointer, BMM instructions take the form INSTR n val, where n specifies the pointer. The permitted values of INSTR are given in Table 12.

Table 12 BMM instructions

ALU operands accessing DM have an encoding of the form &<ofs><idx><!>, elaborated in Table 13.

Table 13 ALU block operand instruction coding

7.3 Off-sFPE Communications

The COMMGET and COMMPUT components, illustrated in Fig. 18, are both configurable according to the parameters in Table 14.

Fig. 18 sFPE COMM adapters. (a) COMMGET. (b) COMMPUT. (c) COMM pointer

Table 14 COMM configuration parameters

Each of COMMGET and COMMPUT can operate in direct and block addressing modes. In direct mode, individual FIFO channels can be accessed via addresses encoded within the instruction. Instructions for either COMM unit are encoded as:

  ^<p><ofs/idx><!>

where p differentiates peek (read-without-destroying) and get (read-and-destroy) operations, ofs denotes the offset, idx the pointer reference and ! auto-increment.

7.4 Stream Frame Processing Efficiency

The effect of these streaming and block addressing features can be profound. The numbers of instructions required by the direct (sFPE) and block-based (sFPE-B) modes are quoted in Table 15. Very large reductions in program size result from the addition of block memory management: sFPE-B requires fewer than 1% of the instructions required by sFPE. Hence, the stream processing and advanced program and memory control features of the sFPE have a clear beneficial effect on program efficiency and scale. Section 8 compares sFPE-based accelerators for a number of typical signal and image processing operations against real-time performance criteria and against custom circuit and soft processor alternatives.

Table 15 sFPE-based MM and ME: itemized PM

8 Experiments

Accelerators were created using the sFPE for five typical operations:

  • 512-point Fast Fourier Transform (FFT)

  • 1024 × 1024 Matrix Multiplication

  • Sobel Edge Detection (SED) on 1280 × 768 image frames.

  • FS-ME: 16 × 16 macroblock, 32 × 32 search window on CIF 352 × 288 images.

  • Variable Block Size ME (VBS-ME): 16 × 16 macroblock, 32 × 32 search window on 720 × 480 images.

The sFPE configurations used to realise each of these operations are described in Table 16. All accelerators target a Xilinx Kintex®-7 XC7K70TFBG484 using Xilinx ISE 14.2.

Table 16 sFPE-based accelerator configurations

These configurations expose the flexibility of the sFPE. One notable feature is the complete absence of RF in many components, such as MM, FS-ME and FFT; this very substantial resource saving is enabled by the sFPE's ability to stream data to and from the COMM components and DM. This flexibility also enables a number of performance and cost advantages, as quoted in Fig. 19. Specifically, the FS-ME accelerator exhibits real-time throughput for H.264, whilst VBS-ME can support real-time processing of 480p video in H.264 Level 2.2. To the best of the authors' knowledge, this is the first time an FPGA-based software-programmable component has demonstrated this capability.

Fig. 19 sFPE accelerators. (a) T. (b) clk (MHz). (c) LUTs. (d) DSP48e. (e) BRAM

To compare the performance and cost of sFPE-based accelerators with custom circuits, sFPE FFTs for IEEE 802.11ac have been developed and compared to both the Xilinx FFT and those generated by Spiral [18]. The IEEE 802.11ac standard [1] mandates 8-channel FFT operation on 20, 40, 80 and 160 MHz frequency bands, with FFT size and throughput requirements as outlined in Table 17.

Table 17 802.11ac FFT characteristics

These multi-sFPE accelerator configurations are summarised in Table 18; where more than one sFPE is used, the configurations of each are presented in vector format. The performance and cost of the resulting architectures are described in Fig. 20.

Fig. 20 FPGA-based FFT: performance and cost. (a) LUT cost (× 10³). (b) DSP48e cost. (c) BRAM cost. (d) % device occupied. (e) T (× 10⁹ samples/s)

Table 18 sFPE FFT configurations

Figure 20 shows that, supported by clock rates of 528 MHz (FFT64, FFT128), 506 MHz (FFT256) and 512 MHz (FFT512), the sFPE FFT accelerators satisfy the real-time throughput requirements of 802.11ac listed in Table 17. In addition, performance and cost are highly competitive with the Xilinx and Spiral custom circuits. The LUT, DSP48e and BRAM costs are lower than those of the Xilinx FFT in 9 out of 12 cases, with savings of up to 69%, 53% and 56% respectively. Relative to the Spiral FFT, the performance and cost of the sFPE accelerators are similarly encouraging, enabling increased throughput in all but one case and reduced LUT and BRAM costs in 7 out of 8 cases, with savings reaching 62.8% and 55% respectively. The Spiral FFTs have consistently lower DSP48e cost; however, the total proportion of the device occupied by each accelerator, reported in Fig. 20d, remains in favour of the sFPE in all but one instance.

The performance and cost of sFPE-based MM and FS-ME are compared with those of other soft processors in Figs. 21 and 22.

Fig. 21 Softcore matrix multiplication: performance and cost comparison. (a) T (MM/s). (b) LUTs. (c) DSP48e. (d) BRAM

Fig. 22 Softcore FS-ME: performance and cost comparison. (a) T (FPS). (b) LUTs (× 10³). (c) DSP48e. (d) BRAM

When applied to MM, the performance and cost advantages relative to the 32-way VEGAS (VEGAS32) [9] and 4-way VENICE (VENICE4) [24] are clear. Relative to VEGAS32, throughput is increased by a factor of 2 despite requiring only 25% of the number of datapath lanes. As compared to VENICE4, throughput is increased by a factor of 4.7 whilst LUT and BRAM costs are reduced by 76% and 5% respectively.

sFPE-based ME is compared with VIPERS16, VEGAS4, VENICE4 and the FPE in Fig. 22. sFPE32 is the only realisation capable of supporting the 30 FPS throughput requirement of standards such as H.264, with absolute throughput increased by factors of 22.3, 9.8 and 6.8 relative to VIPERS16, VEGAS4 and VENICE4 respectively.

These results demonstrate the benefit of the sFPE relative to other soft processors: coupled performance/cost increases of up to three orders of magnitude. Of course, the softcores to which the sFPE is compared here are general-purpose components and hence offer substantially greater run-time processing capability than the sFPE, which is highly tuned to the operation for which it was created. In that respect, the sFPE is more a component for constructing fixed-function accelerators than a general-purpose softcore. However, despite employing multi-lane processing approaches similar to those of VIPERS, VEGAS and VENICE, the sFPE's focus on extreme efficiency, multicore processing, stream processing and novel block memory management enables very substantial performance and cost benefits.

9 Summary

Soft processors for FPGA suffer from substantial cost and performance penalties relative to custom circuits hand-crafted at register transfer level. Performance and resource overheads associated with the need for a host general purpose processor, load-store processing, loop handling, addressing mode restrictions and inefficient architectures combine to amplify cost and limit performance.

This paper describes the first approach which challenges this convention. The sFPE presented realises accelerators using multicore networks of fine-grained, high-performance, standalone processors. It enables performance and cost unprecedented amongst soft processors by adopting a streaming operation model to ensure high efficiency, combined with advanced loop handling and addressing constructs for very compact, high-performance operation on large data sets. These enable efficiency routinely in excess of 90%, with performance and cost which are comparable to custom circuit accelerators and well in advance of existing soft processors.

Specifically, real-time accelerators for 802.11ac FFT and for H.264 FS-ME and VBS-ME are described; the former exhibits performance and cost which are highly competitive with custom circuits. In addition, it is shown that sFPE-based MM and ME accelerators offer improvements in performance/cost of up to three orders of magnitude relative to existing soft processors. To the best of the authors' knowledge, these capabilities are unique, not only for FPGA, but for any semiconductor technology.

This work lays a promising foundation for the construction of complete FPGA accelerators, and may also be used to further ease the design process. For example, where off-chip memory access is required, the programmable nature of the sFPE means that it may also serve as a memory controller, executing custom memory access schedules and highly efficient block access. However, resolving this and other accelerator peripheral functions is left as future work.