
1 Introduction

Field Programmable Gate Array (FPGA) technologies have long been recognised for their ability to enable very high-performance realisations of computationally demanding, highly parallel operations beyond the capability of other embedded processing technologies. Recent generations of FPGA have seen a rapid increase in this computational capacity and the emergence of System-on-Chip FPGAs (SoC-FPGAs), incorporating heterogeneous multicore processors alongside FPGA programmable fabric. A key motivation for these hybrid architectures is the ability of the FPGA to host performance-critical operations, offloaded from the processors, as application-specific accelerators offering any combination of high performance, low cost and high energy efficiency.

The resources with which accelerators may be built are enormous: every second, the designer has access to trillions of multiply-accumulate operations via on-chip DSP units [3, 30] and memory locations in Block RAM (BRAM) [3, 31], alongside the computationally powerful and highly flexible Look-Up Table (LUT) FPGA programmable logic [17]. For instance, the Virtex®-7 family of Xilinx FPGAs offers up to \(7 \times 10^{12}\) multiply-accumulate (MAC) operations per second and \(40 \times 10^{12}\) bits/s memory access rates.

To combine these resources into accelerators of highest performance or lowest cost, though, requires manual design of custom circuit architectures at Register Transfer Level (RTL) in a hardware design language. This is a low level of design abstraction which imposes a heavy design burden, significantly more complicated than describing behaviour in a software programming language. Hence, for many years designers have sought ways to realise accelerators more rapidly without suffering critical performance or cost bottlenecks. Software-programmable 'soft' processors are one way to do so, but at present adopting such an approach demands substantial compromise on performance and cost. Soft processors allow their architecture to be tuned before synthesis to improve the performance and cost of the final result. Soft general-purpose processors such as MicroBlaze [32] and Nios-II [2] are performance-limited, and a series of approaches attempt to resolve this issue. One approach uses soft vector coprocessors [9, 24, 33, 34] employing either assembly-level [34] or mixed C-macro and inline assembly programming. These enable performance increases of orders of magnitude beyond Nios-II and MIPS [34], but performance and cost still lag custom circuits. An alternative approach is to redesign the architecture of the central processor for performance/cost benefit, an approach adopted in the iDEA [8] processor. Multicore architectures incorporating up to 16 [12, 22, 25] or even 100 processors [12] have also been proposed.

However, in all of these approaches the cost of enabling software programmability is a reduction in the performance or efficiency of the resulting accelerators relative to custom circuit solutions. The performance of these architectures remains only marginally beyond that of conventional software-programmable devices, and there is no evidence that they are competitive with custom circuits. If FPGA soft processors are to be a viable alternative to custom accelerators, performance and cost must improve radically.

2 The FPGA-Based Processing Element (FPE)

A unique, lean soft processor, the FPGA Processing Element (FPE), is proposed to resolve this deficiency. The architecture of the FPE is shown in Fig. 1. It contains only the minimum set of resources required for programmability: instructions pointed to by the Program Counter (PC) are loaded from Program Memory (PM) and decoded by the Instruction Decoder (ID). Data operands are read either from the Register File (RF) or, in the case of immediate data, from Immediate Memory (IMM), and are processed by the ALU (implemented using a Xilinx DSP48e). In addition, a Data Memory (DM) is used for bulk data storage and a Communication Adapter (COMM) performs on/off-FPE communications.

Fig. 1 The FPGA processing element

The FPE is soft and hence configurable, allowing its architecture to be customised pre-synthesis in terms of the aspects listed in Table 1(a). Beyond these, custom coprocessors can be integrated alongside the ALU to accelerate specific custom instructions. Of course, the FPE is also programmable, with the instruction set described in Table 1(b).

Table 1 FPE parameters and instructions

When implemented on a Xilinx Virtex 5 VLX110T FPGA, a 16-bit Real FPE costs 90 LUTs and 1 DSP48e and enables \(483 \times 10^{6}\) multiply-add operations per second. This represents around 18% of the resources of a conventional MicroBlaze processor, whilst increasing performance by a factor of 2.8.

The FPE’s low cost allows it to be combined in very large numbers on a single FPGA, to realise operations via multicore architectures, with communication between FPEs via point-to-point queues. Hence the FPE may be viewed as a fundamental building block for realising computationally demanding operations on FPGA.

To do so efficiently, the FPE should be able to exploit all the different types of parallelism in a program or application. Task parallelism is exploited in the multicore architectures proposed, but using these to realise data-parallel operations is inefficient, due to the duplication of control logic, data and memory resources: each FPE contains the same instructions in its PM, accesses RF in the same order and executes the same program. Considerable overhead is incurred when control resource is duplicated for every FPE. To avoid this, the FPE is further extended into a configurable SIMD processor component, as illustrated in Fig. 2.

Fig. 2 SIMD processor architecture

The width of the SIMD is configurable via a new parameter, SIMDways, which dictates the number of datapath lanes. All of the FPE instructions (except BEQ, BGT and BLT) can be used as SIMD instructions.

3 Case Study: Sphere Decoding for MIMO Communications

To illustrate the use of FPE-based multicores for FPGA accelerators, a case study, Sphere Decoding (SD) for Multiple-Input, Multiple-Output (MIMO) communications systems, is used. MIMO systems employ multiple transmit and multiple receive channels [26] to enable unprecedented data rates, prompting their adoption in standards such as 802.11n [14]. An M-element array of transmit antennas emits a vector \(\mathbf {s} \in \mathbb {C}^{M}\) of QAM-modulated symbols. The vector of symbols \(\mathbf {y}\in \mathbb {C}^{N}\) received at an N-element array of antennas is related to s by:

$$\displaystyle \begin{aligned} \mathbf{y} = \mathbf{Hs} + \mathbf{v}, \end{aligned} $$
(1)

where \(\mathbf {H} \in \mathbb {C}^{N \times M}\) represents the MIMO channel, typically treated as a parallel set of flat-fading subchannels via Orthogonal Frequency Division Multiplexing (OFDM) (108 subchannels in the case of 802.11n), and \(\mathbf {v} \in \mathbb {C}^N\) is additive noise. Sphere Decoding (SD) is used to derive an estimate \(\hat {\mathbf {s}}\) of s. It offers performance near that of the ideal Maximum Likelihood (ML) detector, with significantly reduced complexity [20, 23]. The Fixed-Complexity SD (FSD) employs a particularly low-complexity, two-stage deterministic process which makes it ideal for efficient realisation via an FPGA accelerator [5]. The two-stage detection process is illustrated in Fig. 3a.
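For concreteness, the system model of (1) can be sketched in a few lines of NumPy. The 4 × 4 16-QAM configuration, Rayleigh channel model and noise level below are illustrative assumptions, not parameters fixed by the text.

  import numpy as np

  # Minimal sketch of the system model in Eq. (1) for a hypothetical
  # 4 x 4 16-QAM configuration (M = N = 4).
  M, N = 4, 4
  qam16 = np.array([a + 1j * b for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)])

  rng = np.random.default_rng(0)
  s = rng.choice(qam16, size=M)                                # transmitted symbols
  H = (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))) / np.sqrt(2)
  v = 0.1 * (rng.normal(size=N) + 1j * rng.normal(size=N))     # additive noise
  y = H @ s + v                                                # Eq. (1)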

Fig. 3 FSD algorithm components. (a) FSD tree structure. (b) General form of H

Algorithm 1 SQRD for FSD

Pre-Processing (PP) orders the symbols of y according to the perceived distortion experienced by each. This is achieved by reordering the columns of H, giving a permuted channel matrix whose general form is illustrated in Fig. 3b. Practically, the ordering is derived via an iterative Sorted QR Decomposition (SQRD), described in Algorithm 1 [11].

SQRD-based PP ordering for FSD transforms the input channel matrix H into the product of a unitary matrix Q and an upper-triangular R via QR decomposition, whilst deriving order, the order of detection of the received symbols during MCS. It operates in two phases, as described in Algorithm 1. In Phase 1, Q, R, order, norm and nfs are initialised as shown in lines 2–5 of Algorithm 1, where \(q_i\) denotes the ith column of Q. Phase 2 comprises M iterations, in each of which the index k of the lowest remaining entry in norm is identified (lines 9 and 10) before the corresponding column of R and the elements of order and norm are permuted with the ith (line 11) and the remaining columns are orthogonalised (lines 12–18). The resulting Q, R and order are used for Metric Calculation and Sorting (MCS), as defined in (3) and (4).
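The structure of Algorithm 1 can be expressed as a brief behavioural sketch, given below. Floating-point NumPy is used purely for clarity (the FPE realisation is fixed-point), and the line-level details of Algorithm 1 are paraphrased rather than reproduced; the minimum-norm pivot follows the standard SQRD formulation [11].

  import numpy as np

  def sqrd(H):
      """Sorted QR decomposition: behavioural sketch of Algorithm 1.

      Returns Q (unitary columns), R (upper-triangular) and the detection
      order, such that H[:, order] = Q @ R.
      """
      N, M = H.shape
      Q = H.astype(complex)                         # working copy of H
      R = np.zeros((M, M), dtype=complex)
      order = np.arange(M)
      norms = np.sum(np.abs(Q) ** 2, axis=0)        # Phase 1: column norms

      for i in range(M):                            # Phase 2: M iterations
          k = i + np.argmin(norms[i:])              # weakest remaining column
          Q[:, [i, k]] = Q[:, [k, i]]               # permute with the ith
          R[:, [i, k]] = R[:, [k, i]]
          order[[i, k]] = order[[k, i]]
          norms[[i, k]] = norms[[k, i]]
          R[i, i] = np.sqrt(norms[i])               # square root required
          Q[:, i] /= R[i, i]                        # division required
          for j in range(i + 1, M):                 # orthogonalise remainder
              R[i, j] = Q[:, i].conj() @ Q[:, j]
              Q[:, j] -= R[i, j] * Q[:, i]
              norms[j] -= np.abs(R[i, j]) ** 2
      return Q, R, order

Note that each iteration requires one square root and a reciprocal-style division, the two operations which motivate the coprocessors of Sect. 4.1.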

Metric Calculation and Sorting uses an M-level decode tree to perform a Euclidean-distance-based statistical estimation of s. Groups of M symbols undergo detection via the tree-search structure illustrated in Fig. 3a.

The number of nodes at each tree level is given by \(\mathbf{n}_S = (n_1, n_2, \ldots, n_M)^T\). The first nfs levels process the symbols from the worst-distorted paths by Full Search (FS) enumeration of all elements of the search space. This results in P child nodes at level i + 1 per node at level i, where P is the number of QAM constellation points. For full diversity, nfs is given by

$$\displaystyle \begin{aligned} nfs = \lceil \sqrt{M}-1\rceil. \end{aligned} $$
(2)

The remaining nss = M − nfs levels undergo Single Search (SS), in which only a single candidate detected symbol is maintained between levels; for M = 4, (2) gives nfs = 1 and hence nss = 3. At each MCS tree level, (3) and (4) are performed.

$$\displaystyle \begin{aligned} \tilde{s}_i = \hat{s}_{ZF,i} - \sum_{j = i + 1}^{M} \frac{r_{ij}}{r_{ii}} \left( \hat{s}_{ZF,j} - \hat{s}_j \right) \end{aligned} $$
(3)
$$\displaystyle \begin{aligned} d_i = \sum_{j=i}^{M} r_{ij}^2 \left\| \hat{s}_{ZF,j} - \hat{s}_j \right\|{}^2, \qquad D_i = d_i + D_{i+1} \end{aligned} $$
(4)

In (3) and (4), \(r_{ij}\) refers to an entry in R, derived by QR decomposition of H during PP, \(\hat{s}_{ZF}\) is the centre of the FSD sphere and \(\tilde{s}_j\) is the jth detected symbol, which is sliced to \(\hat{s}_j\) in subsequent iterations of the detection process [13]. Since \(D_{i+1}\) can be considered the Accumulated Partial Euclidean Distance (APED) at level i + 1 of the MCS tree and \(d_i\) the PED at level i, the APED is obtained by recursively applying (4) from level i = M down to i = 1. The resulting candidate symbols are sorted by their Euclidean distance metrics, and the final result is produced post-sorting.
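The per-level computation of (3) and (4) can be sketched as follows. The slicing helper is a hypothetical stand-in for the FSD slicing step (in an FS level, \(\hat{s}_i\) would instead be enumerated over all P constellation points), \(r_{ij}^2\) is interpreted as \(|r_{ij}|^2\) for complex R, and floating-point arithmetic replaces the FPE's 16-bit fixed point; indices are 0-based.

  import numpy as np

  QAM_AXIS = np.array([-3.0, -1.0, 1.0, 3.0])      # illustrative 16-QAM axis

  def slice_symbol(x):
      """Hypothetical slicing helper: nearest constellation point."""
      near = lambda v: QAM_AXIS[np.argmin(np.abs(QAM_AXIS - v))]
      return near(x.real) + 1j * near(x.imag)

  def mcs_level(R, s_zf, s_hat, i, D_next):
      """One MCS tree level: Eqs. (3) and (4).

      s_hat holds candidate symbols already fixed at levels i+1..M-1;
      D_next is the APED D_{i+1} accumulated so far.
      """
      M = R.shape[0]
      s_tilde = s_zf[i] - sum(R[i, j] / R[i, i] * (s_zf[j] - s_hat[j])
                              for j in range(i + 1, M))          # Eq. (3)
      s_hat[i] = slice_symbol(s_tilde)
      d_i = sum(abs(R[i, j]) ** 2 * abs(s_zf[j] - s_hat[j]) ** 2
                for j in range(i, M))                            # Eq. (4)
      return s_hat, d_i + D_next                                 # D_i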

This behaviour is repeated independently for each OFDM subcarrier, of which there are 108 in 4 × 4 16-QAM 802.11n MIMO, and for real-time operation it must complete within 4 μs to sustain a rate of 480 Mbps. These are challenging requirements which have seen detection using custom circuit accelerators become a well-studied real-time implementation problem [4, 7, 15, 16, 21, 27]. It is notable that none of these uses software-programmable accelerator components. This section considers the use of the FPE to realise such a solution.

4 FPE-Based Pre-processing Using SQRD

The SQRD preprocessing technique has low complexity relative to other, ideal preprocessing approaches; as a result of its reliance on QRD it is also numerically stable and lends itself well to fixed-point implementation, making it suitable for realisation on FPGA. However, two major issues must be resolved to enable FPE-based SQRD PP for 4 × 4 802.11n. First, its computational complexity remains high, as outlined in Table 2; given the capabilities of a single FPE, a large-scale multi-FPE architecture appears necessary. Second, its reliance on square root and division operations presents a challenge, since these operations are not native to the DSP48e components used as the FPE datapath and exhibit low performance when realised thereon [19].

Table 2 4 × 4 SQRD operational complexity

To avoid this performance bottleneck, datapath coprocessors are considered to enable real-time division and square-root operations.

4.1 FPE Coprocessors for Arithmetic Acceleration

Non-restoring 16-bit division [19] requires 312 cycles when implemented using only the DSP48e in a 16R FPE. This equates to approximately \(1.2 \times 10^{6}\) divisions per second (div/s). Hence, around 100 FPEs would be required to realise the \(120 \times 10^{6}\) divisions per second (MDiv/s) demanded by 4 × 4 SQRD for 802.11n. The high resource cost this would entail can be alleviated by adding radix-2 or radix-4 non-restoring division coprocessors [19] alongside the DSP48e in the FPE ALU (Fig. 4).
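The sizing arithmetic can be checked in a few lines of Python. The clock rate used is a hypothetical figure implied by the quoted 312 cycles per division and roughly 1.2 million div/s; it is not a value stated in the text.

  import math

  cycles_per_div = 312
  clk_hz = 375e6                                    # assumed FPE clock
  divs_per_fpe = clk_hz / cycles_per_div            # ~1.2e6 div/s per FPE
  required = 120e6                                  # 120 MDiv/s for 4 x 4 SQRD
  print(math.ceil(required / divs_per_fpe))         # -> 100 FPEs, no coprocessors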

Fig. 4 FPE division coprocessor

The performance, cost and efficiency (in terms of throughput per LUT, or TP/LUT) of the FPE on Virtex 5 FPGA are described in Table 3, for division realised in software using the DSP48e only (FPE-P) and with radix-2 or radix-4 coprocessors added alongside the DSP48e (FPE-R2 and FPE-R4 respectively). The FPE-R2 and FPE-R4 solutions increase throughput by factors of 8.9 and 13.3 respectively, and hence increase hardware efficiency by factors of 9.4 and 10.7 as compared to FPE-P. Since 4 × 4 802.11n MIMO requires 120 MDiv/s for SQRD-based preprocessing, the implied cost and performance metrics of each option are also summarised in Table 3. According to these estimates, FPE-R2 represents the lowest-cost real-time solution, enabling a 93.4% reduction in resource cost relative to FPE-P, and is adopted in the FPE-based SQRD implementation.

Table 3 SQRD division implementations

To realise the \(120 \times 10^{6}\) square root operations required per second (MSQRT/s), performance and cost estimates for software execution on the FPE using the pencil-and-paper method [19] (FPE-P) and for a CORDIC coprocessor [28] (FPE-C) are compared in Table 4(a). The coprocessor-based FPE-C solution simultaneously increases throughput and efficiency, by factors of 23 and 10 respectively, as compared to FPE-P; the resources required to realise real-time square root for SQRD-based detection of 4 × 4 802.11n MIMO are estimated in Table 4(b). As this shows, FPE-C enables real-time performance using only 11% of the resource required by FPE-P, and is adopted for FPE-based square root operations.
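The pencil-and-paper method referred to above is the classical binary digit-by-digit square root; a behavioural sketch is given below. The bit-width and the assert values are illustrative, and the mapping to FPE instructions is not reproduced.

  def sqrt_pencil_paper(n, bits=16):
      """Binary digit-by-digit ('pencil-and-paper') integer square root.

      Each iteration retires one result bit, so a 16-bit input costs
      8 iterations of shift/compare/subtract work on the datapath.
      """
      root, rem = 0, 0
      for i in reversed(range(0, bits, 2)):
          rem = (rem << 2) | ((n >> i) & 0b11)   # bring down next bit pair
          trial = (root << 2) | 1                # trial subtrahend: 4*root + 1
          root <<= 1
          if trial <= rem:                       # trial digit is 1
              rem -= trial
              root |= 1
      return root

  assert sqrt_pencil_paper(144) == 12
  assert sqrt_pencil_paper(1000) == 31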

Table 4 FPE square root options

4.2 SQRD Using FPGA

Integrating these components into a coherent processing architecture to perform SQRD, and replicating that behaviour to provide PP for the 108 subcarriers of 802.11n MIMO, is a large-scale accelerator design challenge. Figure 5 describes the SQRD algorithm as an iterative four-task (T1, T2.1–T2.3) process. The first task, T1, conducts channel norm ordering and computes the diagonal elements of R (lines 11–13 in Algorithm 1). This is followed by T2.1–T2.3, which are independent and permute and update Q, R and norm respectively (lines 14–18 in Algorithm 1).

Fig. 5 4 × 4 SQRD

This process is realised using the 4-FPE Multiple Instruction, Multiple Data (MIMD) architecture shown in Fig. 6. All FPEs employ 16-bit datapaths and are otherwise configured as described in Table 5(a). FPE1–FPE3 permute and iteratively update Q, R and norm (T2.1–T2.3 in Fig. 5), whilst FPE4 calculates the diagonal elements of R (T1). The SQRD process executes in three phases. Initially, H and the calculation of norm are distributed amongst the FPEs, with the separate parts of norm gathered by FPE4 to undergo ordering, division and square root. The results are then distributed to the outer FPEs for permutation and update of Q, R and norm. Inter-FPE communication occurs via point-to-point FIFO links, chosen for their relatively low cost on FPGA and their implicit ability to synchronise the multi-FPE architecture in a data-driven manner whilst avoiding data access conflicts.

Fig. 6 4 × 4 SQRD mapping

Table 5 4-FPE-based SQRD

The performance and cost of the 4-FPE grouping are given in Table 5(b). According to these metrics, the throughput of each 4-FPE group is sufficient to support SQRD-based PP of three 802.11n subcarriers. To process all 108 subcarriers, the architecture is replicated 36 times, with the mapping of subcarriers to groups as described in Fig. 6.

On a Xilinx Virtex 5 VSX240T FPGA, the cost and performance of this architecture are described in Table 5(b): 32.5 MSQRD/s are achieved, in excess of the 30 MSQRD/s required for 4 × 4 802.11n MIMO.

5 FSD Tree-Search for 802.11n

Computing MCS for FSD in 4 × 4 16-QAM 802.11n is even more computationally demanding than SQRD-based preprocessing. The operational complexity is described in Table 6(a). When a single 4 × 4 16-QAM FSD MCS is implemented on a 16R FPE, the performance and cost are reported as 16R-MCS in Table 6(b).

Table 6 802.11n MCS complexity

To scale this performance to support all 108 subcarriers for 4 × 4 16-QAM 802.11n MIMO, a large-scale architecture is required. Two important observations of the application’s behaviour help guide the choice of multiprocessing architecture:

  1. The FSD MCS tree exhibits strong SIMD-like behaviour: each branch (Fig. 3a) performs an identical sequence of operations on data-parallel samples.

  2. Implementing MCS for all 108 OFDM subcarriers on a single, very wide SIMD processor would incur high signal fan-out to broadcast instructions from a central PM to a very large number of ALUs, limiting the achievable clock rate and restricting performance [10]. Hence, a collection of smaller SIMDs is used.

As described in Table 6(b), the cost of 16R-MCS is significantly higher than that of the basic 16-bit FPE described in Sect. 2 (approximately 2530 LUTs, up from 90). This large increase is due to the large PM required to house the 4591 instructions. A significant contributor to this instruction count is the comparison operations required for slicing (Eq. (3)) and sorting the PED metrics; these require branch instructions, with associated NOP operations due to the deep FPE pipeline and the lack of forwarding logic [10]. These represent wasted cycles which dramatically increase cost and reduce throughput: branch and NOP instructions account for 50.7% of the total. Optimising the FPE to reduce the impact of these branch instructions could therefore have a significant impact on MCS cost/performance.

5.1 FPE Coprocessors for Data Dependent Operations

Employing ALU coprocessors can significantly reduce these penalties. A switch coprocessor (a logical depiction of its behaviour is shown in Fig. 7a) compares the input operand to each of four constants determined pre-synthesis and selects the closest, increasing the efficiency of slicing. Similarly, a MIN coprocessor (Fig. 7b) can be used to accelerate sorting.

Fig. 7 (a) Switch coprocessor. (b) Min coprocessor

Each of these coprocessors occupies around 20 LUTs, but their ability to eliminate wasted instructions significantly reduces the PM size. This enables significant reductions in overall cost and increases in performance, as described in column 3 of Table 6(b): including these components results in a 68% reduction in resource cost and a factor of 2.3 increase in throughput. The resulting component is capable of realising FSD MCS for a single 802.11n subcarrier in real time, providing a good foundation unit for implementing MCS for all 108 subcarriers.
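The behaviour of the two coprocessors can be modelled as follows; the four constants are illustrative 16-QAM axis values, since the actual pre-synthesis constants are application-defined.

  SWITCH_CONSTANTS = (-3, -1, 1, 3)      # fixed pre-synthesis (illustrative)

  def switch_copro(x):
      """Select the pre-synthesis constant closest to the input operand,
      replacing the branch-heavy compare/slice instruction sequence."""
      return min(SWITCH_CONSTANTS, key=lambda c: abs(x - c))

  def min_copro(a, b):
      """Branch-free minimum of two operands, accelerating the PED sort."""
      return a if a <= b else b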

5.2 SIMD Implementation of 802.11n FSD MCS

To scale the FPE to realise all 108 subcarriers, a range of architectures may be used. The data-parallel operation of the subcarriers suggests a single, very wide SIMD, which would provide the most efficient realisation from the perspective of PM and control logic cost. However, as the width of an FPE SIMD unit increases beyond 16 lanes, instruction broadcast from the single central PM constrains the clock frequency and limits the achievable speedup. Hence, 16-way SIMDs are employed and FSD MCS for all 108 802.11n subcarriers is implemented on a dual-layer network of such processors, as illustrated in Fig. 8.

Fig. 8 802.11n OFDM MCS-SIMD mapping

Level 1 consists of eight SIMDs. The 802.11n subcarriers are clustered into eight groups \(G_i = \{\, j : (j-1) \bmod 8 = i \,\}\), for \(i = 0, \ldots, 7\) and \(j = 1, \ldots, 108\), where \(G_i\) is the set of subcarriers processed by SIMD i. The 16 branches of the MCS tree for each subcarrier are processed in parallel across the 16 ways of the Level 1 SIMD onto which it has been mapped. Sorting for the subcarriers implemented in each Level 1 SIMD is performed by adjacent pairs of ways in the Level 2 SIMD; given the 8 Level 1 SIMDs, the Level 2 SIMD is hence composed of 16 ways.
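Written out in code, the clustering rule distributes the subcarriers as follows; this is a check of the stated rule, not part of the implementation.

  # Subcarrier j (1..108) is handled by SIMD i = (j - 1) mod 8, so SIMDs 0-3
  # receive 14 subcarriers each and SIMDs 4-7 receive 13.
  groups = {i: [j for j in range(1, 109) if (j - 1) % 8 == i] for i in range(8)}
  assert [len(groups[i]) for i in range(8)] == [14, 14, 14, 14, 13, 13, 13, 13]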

Each FPE is configured to exploit 16-bit real-valued arithmetic [6]. All processors employ PMDepth = 128, RFDepth = 32 and DMDepth = 0, and communication between the two levels exploits 8-element FIFO queues. The Level 1 SIMDs incorporate SWITCH coprocessors to accelerate the slicing operation, whilst the Level 2 SIMDs support the MIN ALU extension to accelerate the sort operation.

The program flow for each Level 1 SIMD is illustrated in Fig. 9a. Each FPE performs a single branch of the MCS tree, with the empty parts of the program flow (representing NOP instructions) used to properly synchronise the movement of data into and out of memory.

Fig. 9 FPE branch interleaving. (a) Original FSD threads. (b) Interleaved threads

The NOP cycles represent 29% of the total instruction count but, since they are ALU idle cycles, they should preferably be eliminated. To do so, the NOP cycles in one branch can be occupied by useful, independent instructions from another; that is, the branches may be interleaved as illustrated in Fig. 9. When two branches are interleaved in this way, the proportion of wasted cycles is reduced to 4%.
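A toy model of the interleaving is sketched below: it fills the NOP slots of one instruction stream with instructions from an independent branch, ignoring the intra-branch dependency checks a real schedule must respect. The instruction names are illustrative.

  def interleave(stream_a, stream_b, nop="NOP"):
      """Fill NOP slots of stream_a with instructions from an independent
      branch, stream_b; any leftover instructions are appended."""
      out, pending = [], list(stream_b)
      for instr in stream_a:
          if instr == nop and pending:
              out.append(pending.pop(0))    # reclaim an idle ALU cycle
          else:
              out.append(instr)
      out.extend(pending)
      return out

  a = ["MUL", "NOP", "ADD", "NOP"]
  b = ["SUB", "CMP"]
  assert interleave(a, b) == ["MUL", "SUB", "ADD", "CMP"]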

On a Xilinx Virtex 5 VSX240T FPGA, this multi-SIMD architecture enables FSD-MCS for 802.11n as reported in Table 7. As this shows, it comfortably exceeds the real-time performance criteria of 802.11n.

Table 7 4 × 4 16-QAM FSD using FPE

Together with the results of the SQRD preprocessing accelerator, these MCS metrics show that the FPE can support accelerators for applications with demanding real-time requirements. By using massively parallel networks of simple processors (more than 140 in this case), FPGA can support real-time behaviour and enable solutions with resource cost comparable to custom circuits. When the PP and MCS are combined to create a full FSD detector (FPE-FSD in Table 7), the resulting architecture is the only software-defined FPGA structure to enable real-time performance for 4 × 4 16-QAM 802.11n.

6 Stream Processing for FPGA Accelerators

The FPE is a load-store architecture, supporting only register-register and immediate instructions: all non-constant operands and results access the ALU via the Register File (RF). Consider the effect of this approach for a 256-point FFT (FFT256) realised using two FPE configurations: an 8-way FPE SIMD (FPE8) and a MIMD multi-FPE composed of 8 SISD FPEs (8-FPE1). The FFT mappings and itemised ALU, communication (IPC), memory (MEM) and NOP instructions for each are shown in Fig. 10.

Fig. 10 FFT256: FPE-based 256-point FFT. (a) 8-FPE1. (b) FPE8

Figure 10 shows that the efficiency of each of these programs is low: only 52.5% and 31.8% of the cycles in 8-FPE1 and FPE8 respectively are used for ALU instructions. The resulting effect on accelerator performance and cost is clear from Table 8, which compares 8-FPE1 with the Xilinx Core Generator FFT [29] component. The FPE is not competitive with the custom circuit Xilinx FFT, which exhibits twice the performance at a fraction of the LUT cost.

Table 8 256-Point FFT performance/cost comparison

These results follow from the restriction to register-register instructions. Each FFT256 stage consumes 512 complex words. Since RF is the most resource-costly element of the FPE, buffering this volume of data requires BRAM Data Memory (DM); for these operands to be processed and results stored, a large number of loads and stores is required between BRAM and RF, increasing PM cost. Given the simplicity of the FFT butterfly operation, the overhead imposed is significant. This is compounded by the FPE's requirement to be standalone: since it must handle its own communication, further cycles are consumed transferring incoming and outgoing data between DM and COMM, reducing program efficiency still further. Finally, each of these transfers induces a latency between source and destination; as Fig. 11 illustrates, each FPE DM-RF (black) and COMM-RF (red) transfer takes eight cycles, imposing the need for NOPs.

Fig. 11 Load-store paths in the FPE

These factors combine to severely limit the efficiency of the FPE for applications such as FFT. Mitigating the effect of these overheads requires two features:

  • Direct instruction access to any combination of RF, DM and COMM as either instruction source or destination.

  • In cases where local buffering is not required, data streaming through the PE should be enabled, reducing load/store and communication cycle overhead.

6.1 Streaming Processing Elements

To support these features, a streaming FPE (sFPE) is proposed. The sFPE is still standalone, software-programmable and lean, but supports a processing approach, streaming, which diverges from the load-store FPE approach: the focus is on ensuring that data can stream into and out of operation sources and destinations, and through the ALU, without the need for load and store cycles. This streaming takes two forms:

  • Internal: between RF, DM, COMM and IMM without load-store cycles.

  • External: from input FIFOs to output FIFOs via the ALU alone.

The architecture of a SISD sFPE1 is illustrated in Fig. 12. There are three main architectural features of note.

Fig. 12 SISD sFPE architecture

  • An entire pipeline stage is dedicated to instruction decode (ID)

  • A FlexData data manager has been added which allows zero-latency access to any data source or sink.

  • Off-FPE communication has been decoupled into read (COMMGET) and write (COMMPUT) components

In the sFPE, ID and FlexData are each assigned an entire pipeline stage. The ID determines the source or destination of every instruction operand or result, with all potential sources and destinations of data incorporated in FlexData so that each can be addressed with equal latency; this flat memory architecture is unique to the sFPE. This approach removes the load/store overhead of accessing, for example, data memory or off-FPE communication: all data operands and results may be sourced from or produced to any of IMM, RF, DM or COMM with identical pipeline control and without explicit load and store cycles or instructions for DM or COMM.

To allow unbuffered streaming from input FIFOs to output FIFOs via the ALU, simultaneous read/write to external FIFOs is required, with direct access to the ALU in both directions. Decoupling the off-FPE communication into COMMGET and COMMPUT allows each to be accessed with zero latency from a single instruction; note that both reside in the same pipeline stage and hence conform to the regular dataflow pipeline maintained across the remainder of FlexData. In addition, since COMMGET, COMMPUT, DM, RF and IMM all access distinct memory resources (with separate memory banks employed within the sFPE and a FIFO per off-sFPE communication channel), there is no memory bandwidth bottleneck resulting from decoupling these accesses: all could be accessed simultaneously if needed.

6.2 Instruction Coding

To support the increased level of specialisation of the operands in each instruction, however, operand addressing must become more complicated. Generally, sFPE ALU instructions take the form:

  INSTR dest, opA, opB, opC

where INSTR is the instruction class, dest identifies the result destination and opA, opB and opC identify the source operands. The possible encodings of dest, opA, opB and opC are described in Table 9.

Table 9 ALU operand/destination instruction coding

This encoding allows any of RF, DM, COMMGET and COMMPUT to be addressed directly via the absolute addresses quoted in the sFPE instruction. Constant operands are hard-coded into the instruction, with IMM locations allocated by the assembler.

This architecture and data access strategy lead to sFPE programs which are substantially more efficient than their FPE counterparts. The numbers of instructions needed for FFT256 in the 8-sFPE and sFPE8 variants are described in Fig. 13.

Fig. 13 FFT256: sFPE implementations. (a) 8-sFPE1. (b) sFPE8

In MIMD 8-sFPE form, the total number of instructions required is 257, a decrease of around 91%. In addition, the efficiency of this realisation is now 99.6%, with only a single non-ALU instruction required for control. Similarly, sFPE8 requires 95.9% fewer instructions and operates at an efficiency of 98.4%. Given these metrics, it is reasonable to anticipate throughput increases for 8-sFPE and sFPE8 by factors of 20 and 30 respectively.

7 Streaming Block Processing

In many operations, however, addressing modes other than the simple direct approach used in the FPE are vital. Itemised instruction breakdowns for the multiplication of two 32 × 32 matrices and for Full-Search ME (FS-ME) with a 16 × 16 macroblock on a 32 × 32 search window are quoted in Fig. 14.

Fig. 14 Itemised sFPE matrix multiplication and ME operations. (a) Matrix multiplication. (b) Motion estimation

A number of points are notable. Firstly, the programs are very efficient, validating the techniques described in the previous section. However, they are extremely large: 35,375 instructions for matrix multiplication (MM) and 284,428 for FS-ME. Storing this number of instructions requires a very large PM and hence substantial FPGA resource; for FS-ME, 241 BRAMs would be required for the PM alone. These demands are a direct result of the FPE's restriction to direct addressing: in a direct addressing scheme, every operation requires an instruction, and for MM and ME this translates to a very large number of instructions.

However, both of these operations and their operand accesses are very regular, and can be captured in programs with many fewer instructions than those quoted above. Both repeat the same operation many times on small subsets of the input data at regularly-spaced memory locations. Consider, for example, block-MM of two matrices \(A \in \mathbb {R}^{m \times n}\) and \(B \in \mathbb {R}^{n \times p}\) with m = n = p = 8, performed via four 4 × 4 submatrices. Assuming that A and B are stored in contiguous memory locations in row-major order and that C is derived in row-major order, the operand memory accesses are as illustrated in Fig. 15.

Fig. 15 sFPE block matrix multiply operand addressing

To compute an element of a submatrix of C, the inner product of a four-element vector of contiguous locations in A (a row of the submatrix) and a four-element vector of elements spaced by 8 locations in B (a column of the submatrix) is formed. Afterwards, either or both of the row of A and the column of B are incremented to derive the next element of C, before operation proceeds to the next submatrix. The resulting memory accesses are highly predictable: a regular, repeated increment along the rows of A and columns of B, with periodic re-alignment to a new row of A and/or column of B, repeated multiple times before realigning for subsequent submatrices.
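The address sequences implied by Fig. 15 can be sketched as follows. The strides of 1 and 8 follow from the row-major storage assumed above; the function name and structure are illustrative.

  def inner_product_addresses(offset_a, offset_b, n=4, stride_a=1, stride_b=8):
      """Operand addresses for one element of C in the 8 x 8 block-MM:
      a row of an A submatrix (stride 1) against a column of a B
      submatrix (stride 8), both relative to per-submatrix offsets."""
      return [(offset_a + k * stride_a, offset_b + k * stride_b)
              for k in range(n)]

  # First element of the top-left submatrix of C:
  print(inner_product_addresses(0, 0))   # [(0, 0), (1, 8), (2, 16), (3, 24)]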

These patterns can be used to enable highly compact programs if two features are available: repeat-style behaviour, and the ability for a single instruction, when invoked multiple times by a repeat, to address blocks of memory at regularly-spaced locations.

7.1 Loop Execution Without Overheads

To enable low-overhead loop operation, the sFPE is augmented with the ability to perform repeat-type behaviour: the PC is managed such that, when a repeat instruction is encountered, the associated block of body statements is executed a number of times. This task is fulfilled by a PC Manager (PCM), the behaviour of which is described in Fig. 16.

Fig. 16 sFPE PCM behaviour

The PCM controls the PC update given its previous value, the instruction referenced in PM and three pieces of information: the start line S and end line E of the body statements to be repeated, and the number of repetitions N. These are encoded in an RPT instruction added to the sFPE instruction set, of the form:

  RPT N S E

The behaviour of RPT is shown in Listing 1, which dictates five repetitions of lines 2–4. Any number of repeat instructions can be nested, allowing efficient execution of loop nests with static, compile-time-known loop bounds.

Listing 1 RPT Instruction Coding

 RPT  5 2 4

   INSTR1...

   INSTR2...

   INSTR3...  

The PCM arbitrates the PC to ensure that the body statements are repeated the correct number of times and supports the construction of nested repeat operations, enacting the flowchart in Fig. 16. For an n-level nest it maintains (n + 1)-element lists of metrics, with the additional element supporting infinite repetition of the top-level program, which is considered to be an implicit infinite repeat instruction. For layer i of the loop nest, the start line, end line and number of repetitions are stored in element i + 1 of the lists s, e and n respectively; in all cases \(s_0 = 0\), \(e_0 = \infty\) and \(n_0 = \infty\), representing the start line, end line and number of repetitions of the top-level program. Every time a repeat instruction is encountered, the current index i into s, e and n is incremented and the values of the new element are initialised using the S, E and N of the decoded instruction. Regular PC updating then proceeds until either another repeat instruction is detected or \(e_i\) is encountered. In the latter case, the number of remaining iterations \(n_i\) is decremented or, if \(n_i = 0\), all iterations of the current repeat statement have been completed and control of the loop nest reverts to the previous level.
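A software model of this PC management is sketched below; it omits the implicit infinite top-level repeat and uses 0-based line numbers, but reproduces the push/decrement/pop behaviour of the flowchart.

  def run_with_rpt(program):
      """Model of nested RPT handling by the PCM (Fig. 16). program holds
      ("RPT", N, S, E) tuples or plain instruction strings; S and E are
      0-based line numbers of the repeated body."""
      trace, pc, stack = [], 0, []          # stack: [start, end, remaining]
      while pc < len(program):
          instr = program[pc]
          if isinstance(instr, tuple) and instr[0] == "RPT":
              _, n, s, e = instr
              stack.append([s, e, n - 1])   # first pass is about to execute
              pc = s
              continue
          trace.append(instr)
          if stack and pc == stack[-1][1]:  # reached end line of current body
              if stack[-1][2] > 0:
                  stack[-1][2] -= 1
                  pc = stack[-1][0]         # loop back to start line
                  continue
              stack.pop()                   # all repetitions complete
          pc += 1
      return trace

  # Listing 1, rewritten with 0-based line numbers:
  prog = [("RPT", 5, 1, 3), "INSTR1", "INSTR2", "INSTR3"]
  assert run_with_rpt(prog) == ["INSTR1", "INSTR2", "INSTR3"] * 5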

The PCM component requires 36 LUTs and hence imposes a relatively high resource cost as compared to the FPE. This can be controlled by compile-time customisation via the parameters listed in Table 10.

Table 10 PC configuration parameters

The pcm_en parameter is a Boolean which dictates whether the PCM is included. When it is, the maximum depth of loop nest is configurable via a second parameter (Table 10) which can, hypothetically, take any integer value. As such, the PCM may be included or excluded, imposing no cost when it is not required; further, when it is included, its cost can be tuned to the application at hand by adjusting the maximum depth of loop nest.

7.2 Block Data Memory Access

Enabling block memory access requires three important capabilities:

  • Auto-increment with any constant stride

  • Manual increment with any stride

  • Custom offset

The need for each of these is evident in MM: auto-increment traverses rows and columns with a fixed memory stride and, since there are many such operations, eliminating the need for an individual instruction for each reduces the overall instruction count considerably. Manual increment is required for movement between rows/columns, whilst the custom offset identifies the starting point for the increments, such as the first element of a submatrix.

A Block Memory Manager (BMM) is incorporated in the sFPE FlexData, as illustrated in Fig. 17a, to enable these capabilities. The BMM arbitrates access to DM via Read Pointers (RPs) and Write Pointers (WPs); the architecture of FlexData and of a pointer is illustrated in Fig. 17b.

Fig. 17 sFPE block memory management elements. (a) sFPE FlexData. (b) Pointer architecture

Each pointer controls access to a subset (block) of the sFPE DM and addresses individual elements of that block via a combination of two subaddress elements: a base and an offset. The offset selects the root element of the block, whilst the base iterates over elements relative to the offset.

Pointers operate in one of three modes: the base auto-increments, the base is incremented by explicit instruction, or the offset is updated by explicit instruction. All three modes are supported under the control of the set, inc and data interfaces. In block-MM, for example, the offset selects the root data element of the submatrices of A, B and C, with the base added to address elements relative to the offset. The base is updated via two mechanisms under the control of inc: the first auto-increments by a value (s_stride in Fig. 17b) set as a constant at synthesis time, whilst manual incrementing is achieved via c_stride, which is defined at run-time. Finally, when an update of the offset is required, data is accepted on assertion of set. To allow the absolute minimum cost for any operation, the sFPE FlexData, BMM and pointer components are configurable via the parameters in Table 11.
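A behavioural model of a single pointer is sketched below. The assumption that asserting set also resets the base is illustrative, as the text does not specify this.

  class Pointer:
      """Model of a BMM pointer: DM address = offset + base, with a
      synthesis-time auto-increment stride (s_stride) and a run-time
      manual stride (c_stride)."""

      def __init__(self, s_stride):
          self.s_stride = s_stride          # fixed at synthesis
          self.base = 0
          self.offset = 0

      def address(self, auto=True):
          addr = self.offset + self.base
          if auto:
              self.base += self.s_stride    # mode 1: auto-increment
          return addr

      def inc(self, c_stride):              # mode 2: explicit base increment
          self.base += c_stride

      def set_offset(self, value):          # mode 3: new offset on 'set'
          self.offset = value
          self.base = 0                     # assumption: base resets with offset

  # Walking a column of B in the Fig. 15 example (stride 8 per access):
  rp = Pointer(s_stride=8)
  print([rp.address() for _ in range(4)])   # [0, 8, 16, 24]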

Table 11 BMM configuration parameters

It is notable that the addressing mode is now a configuration parameter of the sFPE, with direct and block modes supported. In direct mode the BMM is absent, whilst in block mode it is included; in that case, cost can be minimised via control of the numbers of read and write pointers, n_rptrs and n_wptrs. Finally, the auto-increment stride s_stride for each pointer is fixed at the point of synthesis.

To support custom increment of the base and offset of each pointer, BMM instructions take the form INSTR n val, where n specifies the pointer. The permitted values of INSTR are given in Table 12.

Table 12 BMM instructions

ALU operands accessing DM have an encoding of the form &<ofs><idx><!>, elaborated in Table 13.

Table 13 ALU block operand instruction coding

7.3 Off-sFPE Communications

The COMMGET and COMMPUT components, illustrated in Fig. 18, are both configurable according to the parameters in Table 14.

Fig. 18 sFPE COMM adapters. (a) COMMGET. (b) COMMPUT. (c) COMM pointer

Table 14 COMM configuration parameters

Each of COMMGET and COMMPUT can operate in direct and block addressing modes. In direct mode, individual FIFO channels can be accessed via addresses encoded within the instruction. Instructions for either COMM unit are encoded as:

  ^<p><ofs/idx><!>

where p differentiates peek (read-without-destroying) and get (read-and-destroy) operations, ofs denotes the offset, idx the pointer reference and ! auto-increment.

7.4 Stream Frame Processing Efficiency

The effect of these streaming and block addressing features can be profound. The numbers of instructions required by the direct (sFPE) and block-based (sFPE-B) modes are quoted in Table 15. Very large reductions in program size result from the addition of block memory management: sFPE-B requires fewer than 1% of the instructions required by sFPE. Hence, the stream processing and advanced program and memory control features of the sFPE have a clear beneficial effect on program efficiency and scale. Section 8 compares sFPE-based accelerators for a number of typical signal and image processing operations against real-time performance criteria and against custom circuit and soft processor alternatives.

Table 15 sFPE-based MM and ME: itemized PM

8 Experiments

Accelerators were created using the sFPE for five typical operations:

  • 512-point Fast Fourier Transform (FFT)

  • 1024 × 1024 Matrix Multiplication

  • Sobel Edge Detection (SED) on 1280 × 768 image frames.

  • FS-ME: 16 × 16 macroblock, 32 × 32 search window on CIF 352 × 288 images.

  • Variable Block Size ME (VBS-ME): 16 × 16 macroblock, 32 × 32 search window on 720 × 480 images.

The sFPE configurations used to realise each of these operations are described in Table 16. All accelerators target a Xilinx Kintex®-7 XC7K70TFBG484 using Xilinx ISE 14.2.

Table 16 sFPE-based accelerator configurations

These configurations expose the flexibility of the sFPE. One notable feature is the complete absence of RF in many components, such as MM, FS-ME and FFT; this very substantial resource saving is enabled by the sFPE's ability to stream data to and from the COMM components and DM. This flexibility also enables a number of performance and cost advantages, as quoted in Fig. 19. Specifically, the FS-ME accelerator exhibits real-time throughput for H.264, whilst VBS-ME can support real-time processing of 480p video in H.264 Level 2.2. To the best of the authors' knowledge, this is the first time an FPGA-based software-programmable component has demonstrated this capability.

Fig. 19 sFPE accelerators. (a) T. (b) clk (MHz). (c) LUTs. (d) DSP48e. (e) BRAM

To compare the performance and cost of sFPE-based accelerators with custom circuits, sFPE FFTs for IEEE 802.11ac have been developed and compared to both the Xilinx FFT and those generated by Spiral [18]. The IEEE 802.11ac standard [1] mandates 8-channel FFT operation on 20, 40, 80 and 160 MHz frequency bands, with FFT size and throughput requirements as outlined in Table 17.

Table 17 802.11ac FFT characteristics

These multi-sFPE accelerator configurations are summarised in Table 18; where more than one sFPE is used, the configurations of each are presented in vector format. The performance and cost of the resulting architectures are described in Fig. 20.

Fig. 20 FPGA-based FFT: performance and cost. (a) LUT cost (× 10³). (b) DSP48e cost. (c) BRAM cost. (d) % device occupied. (e) T (× 10⁹ samples/s)

Table 18 sFPE FFT configurations

Figure 20 shows that, supported by clock rates of 528 MHz (FFT64, FFT128), 506 MHz (FFT256) and 512 MHz (FFT512), the sFPE FFT accelerators satisfy the real-time throughput requirements of 802.11ac listed in Table 17. In addition, performance and cost are highly competitive with the Xilinx and Spiral custom circuits. The LUT, DSP48e and BRAM costs are lower than those of the Xilinx FFT in 9 out of 12 cases, with savings of up to 69%, 53% and 56% respectively. Relative to the Spiral FFT, the performance and cost of the sFPE accelerators are similarly encouraging, enabling increased throughput in all but one case and reduced LUT and BRAM costs in 7 out of 8 cases, with savings reaching 62.8% and 55% respectively. The Spiral FFTs have consistently lower DSP48e cost; however, the total proportion of the device occupied by each accelerator, reported in Fig. 20d, remains in favour of the sFPE in all but one instance.

The performance and cost of sFPE-based MM and FS-ME are compared with those of other soft processors in Figs. 21 and 22.

Fig. 21 Softcore matrix multiplication: performance and cost comparison. (a) T (MM/s). (b) LUTs. (c) DSP48e. (d) BRAM

Fig. 22 Softcore FS-ME: performance and cost comparison. (a) T (FPS). (b) LUTs (× 10³). (c) DSP48e. (d) BRAM

When applied to MM, the performance and cost advantages relative to the 32-way VEGAS (VEGAS32) [9] and 4-way VENICE (VENICE4) [24] are clear. Relative to VEGAS32, throughput is increased by a factor of 2 despite requiring only 25% of the number of datapath lanes. As compared to VENICE4, throughput is increased by a factor of 4.7 whilst LUT and BRAM costs are reduced by 76% and 5% respectively.

sFPE-based ME is compared with VIPERS16, VEGAS4, VENICE4 and the FPE in Fig. 22. sFPE32 is the only realisation capable of supporting the 30 FPS throughput requirement of standards such as H.264, with absolute throughput increased by factors of 22.3, 9.8 and 6.8 relative to VIPERS16, VEGAS4 and VENICE4 respectively.

These results demonstrate the benefit of the sFPE relative to other soft processors: coupled performance/cost increases of up to three orders of magnitude. Of course, the softcores to which the sFPE is compared here are general-purpose components and hence offer substantially greater run-time processing capability than the sFPE, which is highly tuned to the operation for which it was created. In that respect, the sFPE is more a component for constructing fixed-function accelerators than a general-purpose softcore. However, despite employing multi-lane processing approaches similar to those of VIPERS, VEGAS and VENICE, the sFPE's focus on extreme efficiency, multicore processing, stream processing and novel block memory management enables very substantial performance and cost benefits.

9 Summary

Soft processors for FPGA suffer from substantial cost and performance penalties relative to custom circuits hand-crafted at register transfer level. Performance and resource overheads associated with the need for a host general purpose processor, load-store processing, loop handling, addressing mode restrictions and inefficient architectures combine to amplify cost and limit performance.

This paper describes the first approach which challenges this convention. The sFPE presented realises accelerators using multicore networks of fine-grained, high-performance, standalone processors. It enables performance and cost unprecedented amongst soft processors by adopting a streaming operation model to ensure high efficiency, combined with advanced loop handling and addressing constructs for very compact, high-performance operation on large data sets. These enable efficiency routinely in excess of 90%, with performance and cost which are comparable to custom circuit accelerators and well in advance of existing soft processors.

Specifically, real-time accelerators for 802.11ac FFT and for H.264 FS-ME and VBS-ME are described; the former exhibits performance and cost which are highly competitive with custom circuits. In addition, it is shown that sFPE-based MM and ME accelerators offer improvements in performance/cost of up to three orders of magnitude relative to existing soft processors. To the best of the authors' knowledge, these capabilities are unique, not only for FPGA, but for any semiconductor technology.

This work lays a promising foundation for the construction of complete FPGA accelerators, and may also be used to further ease the design process. For example, where off-chip memory access is required, the programmable nature of the sFPE means that it may also serve as a memory controller, executing custom memory access schedules and highly efficient block access. However, resolving this and other accelerator peripheral functions is left as future work.