1 Introduction

The next generation of wireless communication devices will have to combine superior levels of performance and flexibility with standard interoperability in order to support an ever increasing range of users and services. These service requirements will range from huge data rates in enhanced mobile broadband to ultra-reliable low-latency communications or even low-bandwidth communications [1]. This scenario boosts the need for highly flexible physical layer (PHY) implementations in radio devices. Particularly, for baseband processing engines, the design of flexible and reconfigurable hardware infrastructures for multi-mode operation is a relevant challenge and contribution to wireless communications [2].

In the current cellular communication generation (4G), Orthogonal Frequency-Division Multiplexing (OFDM) is the dominant waveform adopted, mainly due to easy frequency selectivity, flexibility, efficient hardware implementation by Fast Fourier Transform (FFT) and Inverse FFT (IFFT) modules, and good multiple-input and multiple-output (MIMO) compatibility [3]. Although these features also place OFDM as a strong waveform candidate for 5G, its high out-of-band (OOB) leakage and the negative impact of time/frequency guard bands on spectral efficiency may prevent OFDM from achieving some 5G demands [4]. In this context, Filter-bank Multicarrier with Offset-Quadrature Amplitude Modulation (FBMC-OQAM) has been proposed as a potential enabling waveform for 5G communications [5,6,7] and its main advantages are high spectral efficiency, low OOB emissions, less stringent synchronization requirements and very good subcarrier filtering in Cognitive Radio (CR) environments. The introduction of new waveforms does not imply the exclusion of OFDM. According to [8], LTE (4G) systems and a new radio access technology (NR-RAT) should coexist and interwork tightly in a 5G communication environment. This motivates the design of reconfigurable baseband processing modules supporting OFDM and a potential 5G waveform such as FBMC.

Field-programmable gate arrays (FPGAs) are good implementation platforms for these reconfigurable modules because they present a balanced blend of flexibility, throughput and power consumption. Flexibility can be achieved by multiplexing dedicated hardwired components per mode of operation or by an implementation dimensioned for the most resource-demanding scenario combined with the setting of parameter values - static multi-mode design. This allows for quick operation mode switching but is resource-inefficient, as not all dedicated components are needed simultaneously and the most demanding scenario does not always occur. Alternatively, flexibility can be improved by means of Dynamic Partial Reconfiguration (DPR) techniques. The use of DPR enables run-time circuit specialization [9], making the system more resource-efficient and bringing other advantages, such as hardware virtualization, feature wealth and system upgradeability [10, 11].

This work presents a flexible and dynamically reconfigurable FPGA-based dual-waveform modulator for baseband processing in wireless devices. The implemented design supports the six OFDM modes of operation defined for LTE downlink transmission as well as one FBMC-OQAM mode of operation. It exploits DPR techniques to switch between these modes of operation on-the-fly. This strategy makes the design easy to upgrade for supporting other standards and waveforms. Besides the reconfigurable dual-waveform design, the evaluation of DPR is addressed. To do so, our DPR-based design is compared with a functionally equivalent static multi-mode design regarding throughput performance, resource usage, functional density and power consumption. The enhanced flexibility of DPR comes at the price of reconfiguration latency and energy overhead which were both measured and evaluated. The results show that the amount of reserved and used resources, as well as the power associated with baseband processing in the DPR-based design are lower than in the static multi-mode design. The worst-case DPR latency and energy overhead are about 1 ms and 1 mJ, respectively. These figures are acceptable and viable in the context of wireless communication applications.

The contributions of this paper are: (1) a DPR-based dual-waveform baseband processing engine suitable for the coexistence of LTE and potential FBMC-based 5G standards; (2) the extensive evaluation and assessment of trade-offs between DPR-based and static multi-mode designs; and (3) a qualitative and quantitative evaluation of the impact of DPR in this type of application in terms of DPR latency, reserved/used resources and DPR energy overhead based on real-time power consumption measurements.

Following a review of work related to reconfigurable baseband infrastructures for wireless devices, the requirements and sequence of operations for baseband OFDM and FBMC-OQAM modulation are described. Based on that, the architecture of both static multi-mode and DPR-based dual-waveform modulators is explained and the differences between them highlighted. Next, the two modulator designs are evaluated and compared in terms of performance, resource utilization and power consumption. Regarding the DPR-based OFDM modulator, reconfiguration latency and energy overhead results are further discussed. Finally, conclusions about the work are presented.

2 Related Work

The study and design of reconfigurable baseband processing engines is often related to the Software-Defined Radio (SDR) and CR research fields.

Given the increasing service heterogeneity and data rates required by future wireless communications [1], it is unlikely that approaches based on general purpose processors (GPPs) or application specific integrated circuits (ASICs) can satisfy performance and flexibility requirements for baseband processing, respectively. In turn, FPGAs became the preferred hardware platform to implement reconfigurable baseband processors for SDRs [12, 13] and CRs [14, 15].

The multitude of OFDM-based wireless standards in 3G and 4G raised the interest in flexible and multi-standard OFDM baseband processing engines. Typically, these flexible baseband processors are implemented as a pipeline of functional elements dimensioned for the most demanding scenario, each of them responsible for a certain baseband operation - static multi-mode approach [14,15,16,17,18]. Through mux/demux structures and parameter value selection, the operation of these functional elements can be quickly adapted according to the standard specifications. In a static multi-mode approach, hardware modules for all baseband processing operations are always present in the system, even when some of them are not needed. Moreover, the functional elements are resource-inefficient because they are designed for the worst-case scenario. This resource inefficiency may prevent the use of a smaller FPGA device with lower power consumption. Furthermore, under the static multi-mode approach, adding new functionalities and support for more modes of operation requires redesigning the baseband processor almost from scratch, with negative implications for system upgradeability.

FBMC has recently been considered as a promising multicarrier technique [7]. However, few hardware implementations have been described so far. In [19], the authors study the hardware complexity of a polyphase network (PPN)-FBMC baseband transmitter and propose a pipeline architecture to reduce it. Berg et al. [20] implement an FPGA-based FBMC transceiver aimed at television white space communications. In turn, our previous work [21] explores an FS-FBMC approach to implement an FPGA-based FBMC-OQAM baseband modulator and shows that the higher computational complexity of FS-FBMC systems does not result in a higher resource utilization.

Regarding multiple waveform systems, Nadal et al. [22] present a baseband processing engine implemented on the Zynq’s programmable logic that includes three independent cores for OFDM, Universal Filtered Multicarrier modulation (UFMC) and FBMC transmission. Each transmitter core can also be reconfigured through parameter setting and the waveform for transmission is selected through mux/demux structures. So, this design employs a static multi-mode approach, which has the previously mentioned drawbacks.

Flexible radio platforms like SDR can benefit from DPR advantages such as lower hardware constraints, hardware function optimization and higher flexibility [11, 23]. He et al. [24] proposed an FPGA-based (Xilinx Virtex-5) architecture for SDR comprising baseband processing and digital up-conversion. The design provides support for several operation modes of widely used protocols like 3GPP LTE, 3GPP WCDMA, IEEE 802.16e and IEEE 802.11n. Adopting DPR techniques, baseband parameters such as FFT size, modulation scheme and CP length are changed at run-time. Compared with a static multi-mode design, the DPR-based design achieved a reduction in resource utilization. However, the comparison is not fair, as the static multi-mode design considers parallel and independent processing chains for each standard, ignoring potential optimizations coming from the reuse of common functional elements across different standards. Moreover, no reconfiguration times or power consumption measurements are reported.

Vipin and Fahmy [25] propose CoPR, an automated framework for PR-based adaptive systems on a Xilinx Zynq device, and present an illustrative case study where a reconfigurable multi-standard baseband OFDM transmitter is designed. The proposed architecture considers two reconfigurable partitions (RPs): one to implement the digital modulation scheme and another to accommodate the OFDM processing datapath. The Partial Reconfiguration is done through the Zynq PCAP interface, which leads to lower reconfiguration speeds than those achieved using the ICAP interface. This work only reports reconfiguration time results and does not provide figures for power consumption or RP reserved resources.

Also exploiting a hybrid Zynq FPGA platform, Shreejith et al. [26] demonstrate again the applicability of DPR in CR baseband processing. They present a dynamic CR design considering two conceptual planes: a data plane (in charge of baseband processing operations) and a control plane (responsible for cognitive tasks). The baseband processing datapath is implemented in a single RP, supports DVB-C and DVB-S transmission and is reconfigured via a custom ICAP-based reconfiguration manager [27]. This work does not evaluate the impact and system trade-offs related to DPR.

Pham et al. [28] present a reconfigurable multi-standard OFDM transceiver on a Xilinx Virtex-6 FPGA device, where the transmitter side is implemented in a single RP. Regarding the transmitter reconfiguration time, the authors do not present a measurement, but only the respective partial bitstream size and a theoretical peak value for the reconfiguration throughput. This paper does not present figures for the RP reserved resources and power consumption. Instead, more emphasis is put on the receiver processing chain architecture, impact of DPR on the system halting time and FIFO buffering requirements.

Rihani et al. [29] present an ARM-FPGA-based platform implemented on a Xilinx Zynq device, where several processes running on the ARM processor retrieve communication environment information to decide how to reconfigure an OFDM baseband processing modulator. An OFDM transmitter for WiFi and WiMAX is implemented on Zynq’s programmable logic and features four RPs used for scrambling, interleaving, FEC encoding and IFFT. DPR is achieved via the PCAP interface. Results for the transmitter resource utilization and DPR latency are presented. DPR power overhead is measured through the Power Measurement Bus (PMBus). However, the method is not accurate because the sampling period is about 1 ms, which is of the same order of magnitude as the reconfiguration times observed.

Table 1 summarizes the results on DPR impact in the reconfigurable baseband modulators from [24,25,26,28,29]. In terms of application, these works are devoted to 3G/4G standards and waveforms. From a system-level perspective, they focus primarily on the enhanced flexibility DPR can offer, paying less attention to the overall impact of this technique on the design of the hardware infrastructure for wireless communications. Most of these works present only results for resource utilization and DPR latency, often ignoring power consumption. In this context, our work improves upon previous works by: 1) presenting two variants (static multi-mode and DPR-based) of baseband processing modulators suitable for the coexistence between a 4G waveform (OFDM) and a potential 5G alternative waveform (FBMC); 2) extensively evaluating and comparing these two variants in terms of performance, resource utilization, functional density and power consumption; 3) providing a wider evaluation of the DPR impact in baseband processing engines for wireless communications, not only in terms of DPR latency but also regarding reserved resources and energy consumption overhead.

Table 1 Related works on reconfigurable baseband modulators exploring DPR techniques and respective results for DPR impact on the system

3 Baseband Processing for OFDM and FBMC-OQAM Modulation

This section briefly describes the operations involved in OFDM and FBMC-OQAM baseband modulation. Then, some considerations about reactivity requirements in wireless communications are presented.

3.1 OFDM Modulation

OFDM is a Frequency Division Multiplexing scheme widely used as a method for digital multi-carrier modulation in which all subcarriers are orthogonal to each other [30]. An overview of the datapath implemented for OFDM modulation is depicted in Figure 1.

Figure 1 Datapath structure for OFDM baseband modulation

First, each input data subcarrier is mapped into a pair of real and imaginary values (or in-phase and quadrature components, respectively) according to the constellation of values defined by the modulation scheme - Digital Modulation. In flexible radio devices, the dynamic reconfiguration of the digital modulation scheme is a valuable feature, as it enables modulation order adaptation based on the RF channel conditions, such as the Bit-Error Rate (BER) - constellation scheme adaptation. This work supports QPSK, 16-QAM and 64-QAM because they are the most used constellation schemes in LTE. The hardware implementation of these modules is very simple and requires a few LUTs and FFs to pre-store and select constellation values depending on the input value.

After digital modulation, the OFDM modulator maps the input data subcarriers to the central bins of the IFFT input array and inserts zero-valued frequency guard bands - Subcarrier Mapping. The guard band size varies with different standards and modes of operation, depending on the IFFT size (NIFFT) and the number of data (non-zero) subcarriers (D). The main constituent elements are a double buffer implemented with a simple dual-port block RAM and a control unit. In the double buffer, one buffer is used for input-writing and another for output-reading. The read/write access to these RAMs is managed by a control unit that receives the data and maps each sample to the correct IFFT input bin.
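The subcarrier mapping step can be illustrated with a short software model. This is only a sketch under simplifying assumptions: the function name `map_subcarriers` is hypothetical, the guard bins are split evenly around the data, and details that a real LTE mapper must handle (the unused DC bin and the bin ordering expected by the IFFT) are left out.

```python
def map_subcarriers(data, n_ifft):
    """Place `data` in the central bins of a length-n_ifft array.

    All remaining bins are zero-valued and act as frequency guard bands.
    Assumption: no DC-bin handling and no FFT-shift, unlike a full LTE mapper.
    """
    d = len(data)
    if d > n_ifft:
        raise ValueError("more data subcarriers than IFFT bins")
    guard_low = (n_ifft - d) // 2       # zero-valued bins below the data
    bins = [0j] * n_ifft
    bins[guard_low:guard_low + d] = data
    return bins


# Example: D = 6 data subcarriers in an 8-point IFFT input array
# leave one guard bin on each side.
symbol = map_subcarriers([1 + 1j] * 6, 8)
```

In hardware, the same index computation would drive the write address of the double buffer rather than a list slice.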

Next comes the most computationally intensive and resource-demanding module in an OFDM baseband modulator: the IFFT (Inverse Fast Fourier Transform). Depending on the standard and mode of operation in use, NIFFT (the actual OFDM symbol size) may vary. An IFFT core can be obtained from an FFT core by swapping the real and imaginary parts of both inputs and outputs. In this work, the FFT core follows a Mixed-Radix-2²/2/3 Single-Delay Feedback pipelined approach like the one presented in [31]. The FFT core is composed of several processing stages which, in turn, comprise shift registers, ROM memories, complex multipliers and arithmetic blocks called butterflies. The internal structures of Radix-2², Radix-2 and Radix-3 butterflies can be found in [32] and [33]. Apart from processing elements, the FFT core also includes blocks for input data reordering and bit-reversed reordering of intermediate results, which are performed with RAM-based double buffers. Typically, bigger FFT sizes require a larger amount of hardware resources.
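The real/imaginary swapping trick can be checked numerically. The sketch below uses a naive DFT as a stand-in for the pipelined FFT core; the function names are hypothetical, and the 1/N scaling applied at the end is an assumption about where normalization is placed (a hardware core may omit it or fold it into another stage).

```python
import cmath


def dft(x):
    """Naive O(N^2) forward DFT, standing in for the FFT core."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]


def swap(z):
    """Exchange the real and imaginary parts of a complex sample."""
    return complex(z.imag, z.real)


def idft_via_dft(x):
    """Inverse DFT using only the forward transform plus input/output swaps."""
    n = len(x)
    y = dft([swap(v) for v in x])
    return [swap(v) / n for v in y]     # 1/N scaling is an assumed convention


# A length-6 input (6 = 2 * 3) exercises a non-power-of-two size.
freq = [1 + 2j, -1 + 0.5j, 0.25 - 1j, 3j, -2 - 2j, 0.5 + 0j]
time = idft_via_dft(freq)
roundtrip = dft(time)                   # should reproduce `freq`
```

The identity holds because swapping the parts of z equals j·conj(z), and conjugating both the input and the output of a forward DFT yields the inverse transform up to the 1/N factor.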

The IFFT processing produces time-domain OFDM symbols. In order to mitigate Inter-Symbol Interference (ISI), part of the OFDM symbol tail is copied and prepended to the beginning of the OFDM symbol - Cyclic Prefix (CP) insertion. Usually, different standards and modes of operation consider different CP lengths - LCP. Here, the main hardware elements are a dual-port Block RAM with capacity for NIFFT elements and a control unit to command write and read operations on the Block RAM. Once a complete OFDM symbol is received and written in the Block RAM, its last LCP samples are output, followed by the complete OFDM symbol. The parameters affecting this module are NIFFT and LCP.
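A minimal software model of CP insertion follows; the function name `add_cyclic_prefix` is hypothetical and the list-based implementation only mirrors the read order described above, not the Block RAM addressing.

```python
def add_cyclic_prefix(symbol, l_cp):
    """Return the last l_cp samples of `symbol` followed by the whole symbol."""
    if not 0 <= l_cp <= len(symbol):
        raise ValueError("CP length must be between 0 and the symbol length")
    return symbol[len(symbol) - l_cp:] + symbol


# Example: an 8-sample symbol with a 2-sample cyclic prefix.
with_cp = add_cyclic_prefix([0, 1, 2, 3, 4, 5, 6, 7], 2)
```

The output length is NIFFT + LCP samples per symbol, which is why the module needs the complete symbol in memory before it can start reading.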

In LTE, a popular technique for out-of-band interference mitigation called Weighted Overlap and Add (WOLA) is performed after the CP insertion. This technique can be divided into two stages: first, OFDM symbols are multiplied by a window (usually a raised-cosine window) and then the symbol’s tail is overlapped and added with the next symbol’s head. The window length (W), NIFFT and LCP are the baseband parameters that define this operation and vary with the mode of operation. The window coefficients used in the multiplications are pre-stored in ROM elements, whereas the overlap-and-add operation is implemented with a Finite State Machine (FSM) and arrays of registers to store the OFDM symbol’s head and tail. The FSM is responsible for commanding data storage and forwarding, as well as the additions involved in WOLA.

The OFDM modulators implemented in this work were validated for LTE downlink transmission. This standard defines six modes of operation which depend on the channel bandwidth in use. To simplify naming, each mode of operation is designated by its channel bandwidth: for BX, X is the channel bandwidth in MHz. Depending on the active mode of operation, several OFDM parameters assume different values, which are summarized in Table 2. For validation, behavioral simulation results were compared with the output of the lteOFDMModulator() [34] function, part of the LTE Systems Toolbox from MathWorks.
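For orientation, the relation between the mode of operation, the IFFT size and the required sampling rate can be sketched as follows. The values are the widely published LTE downlink figures (15 kHz subcarrier spacing), not a transcription of Table 2, and the LCP and W parameters are deliberately left out; the dictionary and function names are hypothetical.

```python
SUBCARRIER_SPACING_HZ = 15_000

# mode name -> (IFFT size, number of data subcarriers), standard LTE values
LTE_MODES = {
    "B1.4": (128, 72),
    "B3": (256, 180),
    "B5": (512, 300),
    "B10": (1024, 600),
    "B15": (1536, 900),
    "B20": (2048, 1200),
}


def sample_rate_hz(mode):
    """Baseband sampling rate implied by the IFFT size of a mode."""
    n_ifft, _ = LTE_MODES[mode]
    return n_ifft * SUBCARRIER_SPACING_HZ


def guard_bins(mode):
    """Number of IFFT bins not carrying data in a mode."""
    n_ifft, d = LTE_MODES[mode]
    return n_ifft - d


rates = {mode: sample_rate_hz(mode) for mode in LTE_MODES}
```

The B15 entry (1536 = 3 × 512) is the one size that is not a power of two, which is consistent with the FFT core including a Radix-3 stage.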

Table 2 OFDM baseband requirements for LTE downlink transmission. LCP: cyclic prefix length; W: window length

3.2 FBMC-OQAM Modulation

FBMC is a modulation technique based on pulse shaping, where filter banks are used to carry out filtering at a subcarrier level. Pulse shaping techniques can mitigate OOB leakage, but they are usually non-orthogonal [4]. For this reason, FBMC is often combined with OQAM to keep real-part orthogonality. Moreover, using OQAM with FBMC does not require a cyclic prefix and maximizes the spectral efficiency [35].

Depending on the filter bank implementation, there are two approaches to perform FBMC modulation: 1) polyphase network FBMC (PPN-FBMC), where filtering occurs on the time domain by a polyphase network placed after the IFFT operation; 2) frequency spreading (FS-FBMC), where filtering occurs on the frequency domain by upsampling and prototype filtering before the IFFT operation. In this work, FS-FBMC was adopted as it simplifies the FBMC concept, makes it closer to OFDM and features some intrinsic flexibility in its parametrization [36]. The datapath for FS-FBMC-OQAM modulation is shown in Figure 2.

Figure 2 Datapath structure for frequency spreading FBMC-OQAM baseband modulation

The first step in OQAM is actually QAM mapping. Then, the in-phase and quadrature components of a QAM symbol are decoupled and one is delayed by half a symbol in relation to the other. This can be efficiently implemented using an FSM to alternately store or output the real and imaginary parts of a QAM symbol. Because OQAM is used, an FBMC system should run at twice the frequency of an equivalent OFDM system to achieve the same data rate. As in the OFDM case, the constellation scheme adaptation for QAM is a convenient feature for flexible communications. The constellation schemes considered here are also QPSK (4-QAM), 16-QAM and 64-QAM.
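The decoupling of real and imaginary parts can be modelled in a few lines. This is a sketch under an assumed convention, not the exact scheme of the implemented FSM: even-indexed subcarriers send the real part first and odd-indexed ones send the imaginary part first, and the usual OQAM phase rotation term is omitted. The function name `oqam_stagger` is hypothetical.

```python
def oqam_stagger(qam_symbol):
    """Split one multicarrier QAM symbol into two real-valued half-symbols.

    qam_symbol: list of complex QAM values, one per subcarrier.
    Assumed convention: even subcarriers output the real part first,
    odd subcarriers output the imaginary part first.
    """
    first, second = [], []
    for m, c in enumerate(qam_symbol):
        if m % 2 == 0:
            first.append(c.real)
            second.append(c.imag)
        else:
            first.append(c.imag)
            second.append(c.real)
    return first, second


# One complex symbol per subcarrier in, two real half-symbols out.
half_a, half_b = oqam_stagger([1 + 2j, 3 + 4j])
```

Each QAM symbol produces two half-symbols, which illustrates why the downstream datapath has to process twice as many multicarrier symbols as an equivalent OFDM system.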

After OQAM modulation, the FBMC datapath operations are mainly characterized by the overlapping factor (K) and the number of subcarriers per symbol (Nc). The OQAM symbols are then placed in the central bins of an Nc-elements array. The remaining subcarriers are zero-valued and represent frequency guard bands - guard band insertion. This module mainly consists of an FSM to control data storage and reordering in Block RAMs (BRAMs).

In the FS-FBMC approach, each subcarrier is spread over 2K − 1 carriers, which is the number of non-zero samples of the prototype filter. Prior to filtering, each multicarrier symbol is upsampled by K. The upsampler outputs K − 1 zero values between two incoming I-Q samples, using registers to store the input data and a counter to control the number of zero values at the output. The prototype filter consists of a transpose-form, fully pipelined FIR filter that takes advantage of the symmetry of the filter coefficients to reduce the number of multipliers needed. The combination of upsampling and prototype filtering is the actual frequency spreading operation and, as a result, the IFFT size is extended to K·Nc [35]. The IFFT algorithm and architecture used in the FBMC modulator are the same as those used for OFDM (Section 3.1).
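The frequency spreading operation (upsampling by K followed by FIR filtering) can be sketched in software as below. The tap values are the commonly cited PHYDYAS prototype coefficients for K = 4 and are used here as an assumption; the design takes its coefficients from [36]. Sign and phase conventions of a complete FS-FBMC transmitter are omitted, and the function names are hypothetical.

```python
import math

K = 4
# Assumed prototype filter (PHYDYAS, K = 4): H0, H1, H2, H3
H = [1.0, 0.971960, math.sqrt(2) / 2, 0.235147]
# Symmetric impulse response with 2K - 1 = 7 taps: H3 H2 H1 H0 H1 H2 H3
TAPS = H[:0:-1] + H


def upsample(samples, k):
    """Insert k - 1 zeros after every input sample."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0.0] * (k - 1))
    return out


def fir(samples, taps):
    """Full linear convolution of `samples` with `taps`."""
    out = [0.0] * (len(samples) + len(taps) - 1)
    for i, s in enumerate(samples):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out


# A single active subcarrier ends up spread over 2K - 1 frequency bins.
spread = fir(upsample([0.0, 1.0, 0.0], K), TAPS)
```

Because the taps are symmetric, a hardware filter only needs K distinct coefficient multiplications per output instead of 2K − 1, which is the saving mentioned above.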

Finally, it is necessary to overlap-and-add consecutive IFFT output stream blocks delayed by Nc/2 samples [37]. The overlap-and-add operation and the generation of transmitted symbols in FS-FBMC is described by the Matlab model from [38]. It considers a temporary array of size 2 × K × Nc, where the first half stores the current FBMC symbol to be transmitted and the second half stores the accumulation of IFFT output stream blocks. For each newly generated IFFT output block, the whole array is shifted Nc/2 positions and then the IFFT output block is summed to the second half of the array (Figure 3a). A direct mapping of this approach to a hardware implementation would be memory inefficient because it would require the use of replicated memory structures to perform two read operations per clock cycle on the temporary array [39].
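As a plain reference for what the overlap-and-add stage computes, the sketch below sums IFFT output blocks that are each delayed by Nc/2 samples relative to the previous one. It is neither the shifting-array model from [38] nor a memory-efficient hardware architecture, only the arithmetic they must reproduce; the function name and the small Nc used in the example are illustrative assumptions.

```python
def overlap_add(blocks, hop):
    """Sum equal-length blocks, each delayed by `hop` samples from the previous."""
    if not blocks:
        return []
    size = len(blocks[0])
    out = [0j] * (hop * (len(blocks) - 1) + size)
    for b, block in enumerate(blocks):
        for n, v in enumerate(block):
            out[b * hop + n] += v
    return out


# Small illustrative sizes: K = 4, Nc = 8, so each IFFT block has K * Nc = 32
# samples and consecutive blocks are offset by Nc / 2 = 4 samples.
K, NC = 4, 8
blocks = [[1 + 0j] * (K * NC) for _ in range(10)]
signal = overlap_add(blocks, NC // 2)
```

With all-ones blocks, every steady-state output sample equals 2K, which shows that 2K consecutive IFFT blocks contribute to each transmitted sample.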

Figure 3 Overlap-and-add operation and architecture in the FS-FBMC modulator

Instead, we adapted the strategy from [36] to the case when OQAM is used in FBMC systems. The architecture of the overlap-and-add module is shown in Figure 3b and it only requires the storage of (2 × K − 1) × Nc samples. When operating in the overlap-and-add state (OAA), the module receives a stream of data from the IFFT module and the FSM generates the sel signal with the periodic pattern shown in Figure 3c. The flag data_valid indicates whether the input data value is valid or not and the flush flag is activated during the FLUSH state to initialize the shift registers with zero values. As depicted in Figure 3c, in steady-state operation, the module can consume and produce continuous data streams.

For the FBMC mode of operation supported in this work, we considered K = 4 and Nc = 512. Consequently, the IFFT size for this mode of operation is 2048, as in OFDM B20. The number of zero-valued subcarriers (guard bands) per multicarrier symbol is 25% of Nc and the prototype filter is implemented as an FIR filter whose coefficients are reported in [36]. The choices for K and Nc values came from typical values used in FBMC-related research, such as [19] and [40]. For verification purposes, the output values produced by our FS-FBMC modulator were compared with the results produced by the model from [38].

4 Reconfigurable Dual-Waveform Baseband Modulator Implementation

From the OFDM and FBMC modulation datapath descriptions and requirements presented in Section 3, two functionally equivalent designs for the reconfigurable dual-waveform modulator were implemented: a static multi-mode design and a DPR-based design. The modulator reconfiguration can occur in two independent planes, corresponding to the constellation scheme adaptation and the waveform adaptation, respectively. The former consists of changing the digital modulation scheme used to modulate an input data sample; the latter refers to the modification of the waveform (OFDM or FBMC) or mode of operation (e.g., OFDM B1.4, B5 or B15) for baseband processing.

Due to the need for continuous data stream processing at high speeds, the hardware cores for dual-waveform modulation have data input and output ports following the Advanced eXtensible Interface 4 (AXI4) Stream protocol. The communication between internal modules within the modulator cores is also done through AXI4-Stream interfaces. The implemented datapaths are 32 bits wide and arithmetic operations are performed in fixed-point precision, considering 16 bits in Q5.11 format for both real and imaginary parts.
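The fixed-point representation can be made concrete with a small model. Several choices below are assumptions, since they are not specified above: the 5 integer bits are taken to include the sign (range [−16, 16)), values are rounded to nearest and saturated, and the real part is placed in the lower half of the 32-bit word. All function names are hypothetical.

```python
FRAC_BITS = 11                  # Q5.11: 16 bits total, 11 fractional


def to_q5_11(x):
    """Quantize a float to a 16-bit two's-complement Q5.11 word."""
    raw = int(round(x * (1 << FRAC_BITS)))
    raw = max(-32768, min(32767, raw))      # assumed: saturate on overflow
    return raw & 0xFFFF


def from_q5_11(word):
    """Convert a 16-bit Q5.11 word back to a float."""
    if word & 0x8000:
        word -= 0x10000
    return word / (1 << FRAC_BITS)


def pack_iq(re, im):
    """Pack one complex sample into a 32-bit word (assumed: real part low)."""
    return (to_q5_11(im) << 16) | to_q5_11(re)


def unpack_iq(word):
    """Inverse of pack_iq: return (real, imaginary) as floats."""
    return from_q5_11(word & 0xFFFF), from_q5_11(word >> 16)


word = pack_iq(1.0, -1.0)
```

The resolution of this format is 2⁻¹¹ ≈ 0.00049, and one packed word corresponds to one 32-bit transfer on the stream interface.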

In a complete communication system, the designed baseband processors are likely to be integrated in architectures like the all-digital transmitters proposed in [41]. For evaluation purposes, the dual-waveform modulator core was embedded in a top-level architecture whose basic operation is as follows: 1) read pre-stored input data from DDR memory; 2) perform baseband processing; 3) write output results back to DDR memory. A VC707 board (FPGA device: xc7vx485t) was the hardware platform used for complete system implementation.

A MicroBlaze soft processor is in charge of controlling the mode of operation of the dual-waveform modulator, either by setting values for control signals and baseband parameters through general purpose input/output (GPIO) interfaces (static multi-mode design) or by triggering DPR procedures (DPR-based design). It also sets memory transfers between DDR and the dual-waveform modulator. In order to reduce the burden on the MicroBlaze related with the data transfers, a 32-bit Direct Memory Access (DMA) controller is used to transfer baseband processing data between the dual-waveform modulator and the DDR memory. This frees the MicroBlaze to execute other system tasks and also improves the data throughputs between DDR and the FPGA logic. The interface between the FPGA core and the external DDR memory uses the Xilinx MIG (Memory Interface Generator) IP (Intellectual Property) core. Next, the two dual-waveform modulator cores are described.

4.1 Static multi-mode Design

The architecture of the static multi-mode design for the reconfigurable dual-waveform modulator core is shown in Figure 4. Constellation scheme adaptation is achieved through a multiplexer structure controlled by the sel_const input port. This port drives a 2-bit signal fed by a GPIO interface that is used to select and enable one of the three supported mappers: QPSK (sel_const="00"), 16-QAM (sel_const="01") or 64-QAM (sel_const="10") as the constellation scheme for digital modulation. Although the three mappers are physically present on the system, only one is used at a time.

Figure 4 Static multi-mode dual-waveform modulator core

After the digital modulation, there are two parallel datapaths comprising the baseband operations for OFDM and FBMC. The only common operation across these two datapaths is the IFFT which is shared between them. The waveform adaptation combines mux/demux structures with parameter value setting. Both techniques are controlled by a 4-bit signal from the sel_wf input port (Table 3). The most significant bit of sel_wf is the selection signal for the mux/demux structures that activate the OFDM or FBMC datapath. Similarly to the sel_const signal, the value of sel_wf is received through a GPIO interface controlled by the MicroBlaze.

Table 3 Mode of operation encoding with sel_wf

The OFDM modules for subcarrier mapping, CP insertion and WOLA mainly perform data storage, reordering and forwarding operations that require RAM-based structures, ROM memories and arrays of registers. These modules were dimensioned for the worst-case scenario and their operation parameters are defined by the three least significant bits of sel_wf. For FBMC, one mode of operation is supported and the datapath modules, excluding the IFFT, were designed for the specific set of parameters mentioned in Section 3.2.

To support all required IFFT sizes, the IFFT also combines mux/demux structures with parameter value selection through sel_wf[2:0]. Common hardware structures used for different IFFT sizes were identified and reused. Still, modules for data reordering and Radix-2 computation were designed for the worst-case scenario. Figure 5 illustrates the static multi-mode IFFT core and points out which modules are used for each IFFT size.

Figure 5 Static multi-mode IFFT core structure

Despite the concerns about resource efficiency during the design of the static multi-mode core, only one of the two datapaths is used at a time and several modules had to be designed for the most demanding scenario.

4.2 DPR-based Design

Design customization at run-time according to system demands can improve system parameters like area, power and speed [42]. To overcome the resource inefficiency in the static multi-mode OFDM modulator, the logic datapath in the DPR-based dual-waveform modulator is optimized for each mode of operation on-the-fly.

To reconfigure a reconfigurable partition (RP), it is necessary to fetch a partial bitstream from memory and forward it to the FPGA configuration port. Larger partial bitstream sizes incur longer reconfiguration latencies and, consequently, higher DPR energy consumption overhead. For a given RP, the partial bitstream size is proportional to the RP area. Thus, decisions on the number and size of RPs will impact the size of the partial bitstreams used for DPR.
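The dependence of reconfiguration latency on bitstream size can be estimated with a one-line model. The sketch assumes an ideal configuration port that accepts one 32-bit word per cycle at 100 MHz and ignores memory and DMA overheads, so it gives a lower bound rather than a measured value; the bitstream size in the example is purely hypothetical.

```python
def reconfig_latency_s(bitstream_bytes, port_bytes=4, f_clk_hz=100e6):
    """Lower bound on reconfiguration time: one port word per clock cycle.

    Assumptions: 32-bit configuration port, 100 MHz clock, no stalls.
    """
    return bitstream_bytes / (port_bytes * f_clk_hz)


# Hypothetical partial bitstream of 400 000 bytes.
size = 400_000
latency = reconfig_latency_s(size)
latency_double = reconfig_latency_s(2 * size)
```

Under these assumptions the latency scales linearly with the bitstream size, which is why a smaller RP area translates directly into a shorter reconfiguration.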

As constellation scheme adaptation is independent from waveform adaptation, an immediate system partition strategy for the DPR-based dual-waveform modulator core would consider two RPs: one for constellation scheme adaptation and another for waveform processing. However, the digital constellation mapping operation has very low complexity, mainly selecting one of several pre-stored constant values according to the input value. So, for constellation scheme adaptation, the benefits of applying DPR are very limited. For this reason, it was decided to reuse the architecture for digital constellation mapping from the static multi-mode design and implement waveform adaptation in a single RP (Figure 6).

Figure 6 Top-level architecture for the DPR-based design

As a result, there are seven partial bitstreams for the waveform adaptation RP (six OFDM modes of operation plus one FBMC mode of operation). During DPR, signals driven by the RP into static logic can assume unpredictable values which, in turn, can corrupt normal system functioning. In a similar way, signals driven by static logic into the RP undergoing reconfiguration may corrupt the newly loaded reconfigurable module (RM). To avoid these potential problems, registers are used to decouple signals generated from and driven into the RP. These decoupling registers are placed on the static side of the RP-static interface.

For the DPR-based dual-waveform design, the top-level architecture is slightly different from the description at the beginning of Section 4, as support for DPR is required. Just after powering on the FPGA board, the board flash memory is loaded with a file system including all partial bitstreams and input data files. Then, when the initial FPGA configuration is completed, the MicroBlaze starts by copying all partial bitstreams from the flash memory to the DDR. Once this is done, the system is initialized and can start to operate normally. During system operation, the MicroBlaze is not only responsible for setting up data memory transfers between DDR and the FPGA logic fabric, but also for triggering DPR procedures. In our system, the access to configuration memory functions is provided by the Xilinx Internal Configuration Access Port (ICAP) primitive. The partial bitstream transfers from the DDR to the ICAP primitive are set up by the MicroBlaze, but controlled by a dedicated 32-bit DMA controller. The main motivation for using this dedicated DMA controller is the acceleration of partial bitstream transfers and, consequently, the reduction of reconfiguration latency.

The ICAP primitive has a 32-bit input port and a 32-bit configuration data output that can be used to report and monitor the configuration status. Based on the ICAP output port values [43], two flag signals are generated: pr_dec and pr_error. The former indicates whether a reconfiguration process is still ongoing and, if so, resets the decoupling registers placed between the static logic and the RP; the latter informs the MicroBlaze whether a configuration error occurred.

5 Results and Discussion

In this section, both baseband modulator designs are evaluated and compared in terms of throughput performance, resource utilization and power consumption. Moreover, results on DPR latency and power consumption overhead are presented and discussed.

5.1 Processing Throughput

The static multi-mode and the DPR-based designs operate at a clock frequency of 100 MHz and, in a steady state, both achieve a data throughput of at least 92 MSample/s. In both designs, the worst-case latency for OFDM is 9278 clock cycles (B20 mode), whereas for FBMC it is 7743 clock cycles. The similarities in the processing throughput and latency between static multi-mode and DPR-based designs are due to the fact that, for each mode of operation, the active logic path used for baseband processing is equivalent.

The obtained throughputs are compatible with the data rates imposed by the communication requirements. Regarding the OFDM implementations, the sampling rate requirements imposed by LTE depend on the mode of operation and its highest value is 30.72 MSample/s (for B20). So, the data rates of the implemented designs cover the LTE requirements with a comfortable margin. There are no FBMC-based commercial standards and most of the proposed 5G physical layer numerologies are designed for OFDM [1]. Nevertheless, the processing throughput of our FBMC designs is still higher than the LTE requirements (4G) and also compatible with the throughput defined in [1] for 5G communications in spectrum bands below 6 GHz.
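The figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original evaluation flow; the function names are illustrative assumptions, and only the numerical inputs (100 MHz clock, 9278 and 7743 cycles, 92 and 30.72 MSample/s) come from the text.

```python
CLOCK_HZ = 100e6  # clock frequency of both designs (Section 5.1)


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a latency expressed in clock cycles to microseconds."""
    return cycles / clock_hz * 1e6


def throughput_margin(achieved_msps, required_msps):
    """Ratio between the achieved and the required sample rate."""
    return achieved_msps / required_msps


# Worst-case processing latencies reported in the text.
ofdm_latency_us = cycles_to_us(9278)  # OFDM, B20 mode
fbmc_latency_us = cycles_to_us(7743)  # FBMC mode

# Steady-state throughput (lower bound) versus the highest LTE sampling rate.
lte_margin = throughput_margin(92.0, 30.72)

print(f"OFDM B20 latency: {ofdm_latency_us:.2f} us")
print(f"FBMC latency:     {fbmc_latency_us:.2f} us")
print(f"Margin over LTE:  {lte_margin:.2f}x")
```

Under these inputs, the worst-case latencies are below 100 µs and the throughput is roughly three times the most demanding LTE sampling rate.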

5.2 Resource Utilization

The resource analysis is presented at different levels to give a better overview of resource usage in each design. First, the resources used for constellation and waveform modulator cores are quantified and discussed. Then, the resource utilization for the top-level architecture (without constellation and waveform modulator cores) in both static multi-mode and DPR-based designs is analyzed.

In both static multi-mode and DPR-based designs, the digital constellation mapper uses 43 look-up tables (LUTs) and 40 flip-flops (FFs). This represents a very low resource usage (below 0.02% of available LUTs and FFs) and supports the decision to keep a static multi-mode constellation mapper in the DPR-based design (Section 4.2). Regarding the logic for waveform modulation, resource utilization results are provided in Table 4. Each of the seven configurations of the DPR-based design is less resource-demanding than the static multi-mode waveform modulator core. The resource savings are mainly due to the circuit specialization achieved through DPR, and these savings become more evident when comparing OFDM configurations for lower channel bandwidth scenarios with the static multi-mode core. In the DPR-based design, the OFDMB15 mode of operation is the most demanding in terms of slices (1297) and DSPs (23), whereas the most demanding configuration regarding BRAM tiles (21) is the FBMC mode. For each resource type, a comparison with the figures of the static multi-mode design shows that the DPR-based waveform modulator achieves a resource utilization reduction of at least 54% of slices, 31% of BRAM tiles and 28% of DSP blocks.

Table 4 Post-Implementation resource utilization for the waveform modulator core

The top-level architecture for the DPR-based design requires some extra modules for reconfiguration. The main differences between the top-level architectures for static multi-mode and DPR-based designs reside in the inclusion of the ICAPE2 primitive and a dedicated DMA controller for partial bitstream data transfers. With the addition of another memory-mapped slave device to the system (DMA controller for DDR-ICAPE2 transactions), the IP core responsible for the interconnection between the MIG core (master device) and slave devices (MicroBlaze, DMA controllers) also becomes more complex and resource demanding. A comparison of the resources used by the top-level architecture in both designs is presented in Table 5.

Table 5 Post-Implementation resource utilization for the Top-level System Architecture (without the waveform modulator core)

For the DPR-based design, the top-level architecture requires more resources than the static multi-mode alternative. This overhead consists of 963 slices and 9.5 BRAM tiles. However, combining the values from Tables 4 and 5, one observes that this overhead is compensated and surpassed by the savings obtained from circuit specialization in all DPR-based design configurations.

DPR techniques allow for circuit specialization and increased flexibility at the cost of some area overhead. The main cause is the need to reserve resources for the RP. Additionally, the interconnections between the RP and static logic also consume some resources and preclude possible place-and-route optimizations. The resources reserved by the waveform modulation RP in the DPR-based baseband modulator core are shown in Table 6. It is important to note that the RP has to be dimensioned to accommodate enough resources for the most demanding reconfigurable module to be placed in the partition. Regarding the RP resource utilization, the maximum utilization of slices and BRAM tiles is about 93% and 70%, respectively, whereas for DSP blocks it drops to about 38%.

Table 6 Resource reservation for the Reconfigurable Partition (R - Reserved; MU - Maximum Utilization) and its comparison with the number of resources used by the Static Multi-mode waveform modulator core

From an overall perspective and combining the figures for slices, BRAM tiles and DSP blocks, the total amount of resources reserved for the RP in the DPR-based design is about half the amount of resources used by the static multi-mode waveform modulation core. This clearly shows the improvement in resource efficiency brought by the use of DPR in our application. Moreover, DPR allows for hardware virtualization, which has considerable potential in this application domain. For instance, by reusing RP resources and creating new configurations, the current baseband modulator could be extended to support other standards or other waveforms (e.g., Generalized Frequency Division Multiplexing, Universal Filtered Multicarrier). Other potential DPR applications in this context are the support for baseband processing in MIMO systems and non-contiguous spectrum aggregation scenarios. Here, dynamic replication and customization of baseband processing chains is a valuable feature. This could be enabled by considering the FPGA logic fabric as a tiled array of reconfigurable blocks, in which multiple baseband processing chains could be instantiated, customized or deactivated on-the-fly according to communication demands.

5.3 Reconfiguration latency

In the static multi-mode implementation, system reconfiguration simply consists of setting the values of sel_const and sel_wf. After that, the modulator is ready to operate with the new configuration. In contrast, the reconfiguration of the waveform modulation in the DPR-based implementation has an associated latency. During the DPR process, the region being reconfigured is not available. If it is too long, the reconfiguration latency can have a negative impact on system performance. Thus, it is important to reduce the DPR latency and evaluate its impact on the considered application domain.

Vivado offers a bitstream compression feature that can also be used with partial bitstreams [43]. This feature reduces the partial bitstream size and consequently allows for smaller reconfiguration latencies. Thus, compressed partial bitstreams were generated and used for reconfiguration purposes. It is worth noting that the size of compressed bitstreams for the same RP can vary. The worst-case DPR latency for waveform adaptation is 1.051 ms and corresponds to changing to the OFDMB20 mode. In this case, the amount of data transferred to the ICAP is 383 kB. In our DPR-based design, the ICAP primitive is clocked at 100 MHz, which, with its 32-bit input port, means that the upper limit of the reconfiguration throughput is 400 MB/s. The lowest reconfiguration throughput observed for waveform adaptation was 369 MB/s, which is 92.25% of the ICAP limit. According to the recent ITU-R report about 5G performance requirements [44], the control plane latency in 5G radio devices should be less than 10 ms. The measured DPR latencies are around 10% of the ITU-R requirement, leaving a considerable margin for other control plane tasks to be performed in the radio system. These observations attest to the viability of exploiting DPR techniques in the design of baseband processing hardware infrastructures for wireless systems.
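The throughput bound and the latency budget discussed above follow from simple arithmetic. The sketch below is only illustrative: the helper name is an assumption, 1 MB is taken as 10^6 bytes, and the inputs (32-bit port, 100 MHz, 369 MB/s, 1.051 ms, 10 ms) are the values quoted in the text.

```python
def icap_peak_throughput_mb_s(clock_hz, port_width_bits=32):
    """Upper bound on reconfiguration throughput in MB/s (1 MB = 1e6 bytes),
    assuming one configuration word is written per clock cycle."""
    return clock_hz * (port_width_bits / 8) / 1e6


peak_mb_s = icap_peak_throughput_mb_s(100e6)  # ICAP clocked at 100 MHz
efficiency = 369.0 / peak_mb_s                # lowest observed throughput vs. bound
latency_share = 1.051 / 10.0                  # worst-case DPR latency vs. 10 ms budget

print(f"ICAP upper bound:      {peak_mb_s:.0f} MB/s")
print(f"Throughput efficiency: {efficiency:.2%}")
print(f"Share of 10 ms budget: {latency_share:.1%}")
```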

5.4 Functional Density

Combining the results from Sections 5.2 and 5.3, an analysis based on the functional density (D) metric proposed by Wirthlin and Hutchings [45] can be done. Here, constellation and waveform modulator cores will be analyzed separately, to assess the resource utilization advantages of DPR versus its reconfiguration latency in each reconfiguration scenario. Functional density is defined as:

$$ D = \frac{1}{A(T_{e}+T_{r})} , $$
(1)

where Te is the execution/operation time under a certain configuration, Tr is the maximum reconfiguration latency and A is the amount of hardware resources needed to implement the required functionalities. For the static multi-mode design, the functional density is:

$$ D_{s} = \frac{1}{A_{s}T_{e}} $$
(2)

as there is no reconfiguration latency (Tr = 0). The functional density for the DPR-based design is:

$$ D_{r} = \frac{1}{A_{r}(T_{e}+T_{r})}. $$
(3)

In these expressions, As is the amount of resources (slices, BRAM tiles and DSP blocks) used in the static multi-mode design and Ar is the amount of resources reserved in the DPR-based design. Te is considered to be the same in both expressions because both static multi-mode and DPR-based designs showed the same processing throughput performance (Section 5.1). In communication environments, it is hard to know for how long a device will operate in a certain mode of operation, so we do not have precise values for Te. However, it is reasonable to assume that, in our application scenario, typical values of Te are in the range of minutes or at least tens of seconds.

Wirthlin and Hutchings [45] present a condition for a DPR-based design to have a higher functional density than its static multi-mode counterpart:

$$ \frac{D_{r max}}{D_{s}} - 1 \geq \frac{T_{r}}{T_{e}} \Leftrightarrow \frac{A_{s}}{A_{r}} - 1 \geq \frac{T_{r}}{T_{e}}, $$
(4)

where Drmax is the upper limit for the DPR-based design functional density (when Tr = 0): Drmax = 1/(ArTe). Condition (4) can be used to determine for which values of Te we have Dr ≥ Ds. For the waveform modulator core, As = 2855.5, Ar = 1490 and Tr = 1.051 ms, and we have:

$$ 0.916 \gtrsim \frac{T_{r}}{T_{e}} \Leftrightarrow T_{e} \gtrsim {1.147}~ms. $$
(5)

Condition (5) states that if Tr is less than about 92% of Te, the DPR-based waveform modulator has a higher functional density than its static multi-mode counterpart. Equivalently, it states that when Te is greater than 1.147 ms we have Dr ≥ Ds. For the typical values of Te previously mentioned in this section, condition (5) is satisfied and, regarding the waveform modulation core, the DPR-based design is better than the static multi-mode design in terms of functional density.
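The break-even point in condition (5) can be reproduced numerically. The following is a minimal sketch based on Eqs. (1) to (4); the function names are illustrative assumptions, while As, Ar and Tr are the values given above. Times are expressed in milliseconds.

```python
def functional_density(area, t_exec, t_reconf=0.0):
    """Functional density D = 1 / (A * (Te + Tr)), as in Eq. (1)."""
    return 1.0 / (area * (t_exec + t_reconf))


def break_even_exec_time(area_static, area_dpr, t_reconf):
    """Smallest Te for which Dr >= Ds, obtained from condition (4):
    As/Ar - 1 >= Tr/Te  <=>  Te >= Tr / (As/Ar - 1)."""
    gain = area_static / area_dpr - 1.0
    if gain <= 0:
        raise ValueError("the DPR-based design must use fewer resources")
    return t_reconf / gain


A_S = 2855.5   # resources used by the static multi-mode core
A_R = 1490.0   # resources reserved for the RP
T_R_MS = 1.051 # worst-case reconfiguration latency (ms)

te_min_ms = break_even_exec_time(A_S, A_R, T_R_MS)
print(f"As/Ar - 1     = {A_S / A_R - 1.0:.3f}")
print(f"Break-even Te = {te_min_ms:.3f} ms")

# For an operation time of 10 s (within the typical range assumed in the
# text), the DPR-based design has the higher functional density.
te_ms = 10_000.0
print(functional_density(A_R, te_ms, T_R_MS) > functional_density(A_S, te_ms))
```

At the break-even point both densities coincide; for any longer operation time the ratio Dr/Ds approaches As/Ar, which is about 1.92 for these values.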

5.5 Power Consumption

The power analysis is divided in two parts: first, power consumption estimates for the dual-waveform baseband modulator cores described in Sections 4.1 and 4.2 are presented; then, the impact of DPR in terms of energy overhead is quantified through real-time power consumption measurements.

Power estimates for both static multi-mode and DPR-based modulator core designs were obtained using the power analysis capabilities of Vivado 2015.2 (Figure 7). In order to achieve power estimates with a high confidence level, node switching activity values were derived from post-implementation simulation of fully routed netlists. Regarding the static power consumption, both designs present similar values (721 mW in the static multi-mode design and 720 mW in the DPR-based design), and no significant variations were observed across different modes of operation. The static power consumption is between 68% and 78% of the total power consumption, and this dominance is mainly due to the fact that the FPGA device is considerably larger than the area used by the modulator cores. For instance, the most resource-demanding DPR-based configuration and the static multi-mode modulator core occupy less than 2% and 4% of the available slices on the xc7vx485t, respectively (Table 4).

Figure 7 Dynamic power consumption estimates per on-chip component for the dual-waveform baseband modulator cores

The dynamic power consumption exhibits considerable variations across modulator core designs and modes of operation. Although more evident for the DPR-based implementation, a general observation is that dynamic power consumption is higher for modes of operation with higher resource utilization (Table 7). For all modes of operation, the DPR-based modulator core presents a reduced dynamic power consumption compared with its static multi-mode counterpart. These power savings vary from 26 mW (13% reduction in the FBMC mode) to 90 mW (52% reduction in the OFDMB1.4 mode).

Table 7 Dynamic power consumption estimates for both dual-waveform modulator core designs

Figure 7a and b provide a deeper look at the dynamic power behavior by presenting the consumption per on-chip component type for the static multi-mode and DPR-based modulator cores, respectively. The power savings in the DPR-based design are mainly associated with clock signal propagation and BRAM primitive utilization. In the static multi-mode core, the clock-related dynamic power could be reduced through clock gating techniques. However, due to the fixed nature of the FPGA clock trees, clock gating can become challenging and potentially affect design timing closure. It is difficult to reduce the dynamic power associated with BRAM primitives, as the static multi-mode modulator core has to be dimensioned for the most demanding scenario. In turn, the run-time circuit specialization afforded by the DPR-based implementation reduces the amount of node activity within the baseband modulator core, leading to significant dynamic power savings.

Figure 8 Power consumption behavior during DPR (real-time measurement)

For the DPR-based implementation it is important to quantify the energy consumption overhead introduced by the reconfiguration and evaluate its impact on the system. Liu et al. [46] stated that the minimization of the reconfiguration energy overhead relies on the maximization of the reconfiguration speed. However, the reconfiguration speed is not the only factor that influences the reconfiguration energy overhead. In their research about power consumption models for DPR, Bonamy et al. [47] observed that the differences between the previous and the next configuration also affect the power consumption behavior during reconfiguration. More concretely, the authors observed that the Hamming distance between the bitstreams for the previous and next configurations has a strong direct correlation with the overconsumption during reconfiguration.

Simulation and power estimation for DPR are extremely challenging. Moreover, DPR is a system-level feature and its energy impact should be evaluated for the whole system design. For these reasons, real-time measurements were the basis for the quantitative analysis of the DPR energy overhead. In the VC707 board, the power rail that feeds the FPGA core provides a voltage of 1 V (VFPGA). The board also includes a circuit setup that allows the measurement of the current fed by this power rail and, consequently, the FPGA core power consumption. This circuit setup consists of a current-sensing shunt resistor (5 mΩ ± 1%, 3 W) and a Texas Instruments INA333 instrumentation amplifier. The current driven into the FPGA core is the same that flows through the shunt resistor, and the instrumentation amplifier configuration has a gain G = 24.7. Thus, the current through the shunt resistor can be obtained from the INA333 output voltage, Vo, by the following relation:

$$ I_{R} = \frac{V_{\mathit{o}}}{G \times 5 \text{ m}{\Omega}} \qquad\textit{(A)}. $$
(6)

In turn, the FPGA core power consumption is given by:

$$ P = V_{\mathit{FPGA}} \times I_{R} \qquad\textit{(W)}. $$
(7)
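Equations (6) and (7) translate directly into a conversion routine for the acquired voltage samples. The sketch below is illustrative only: the function names are assumptions, and the 0.1235 V reading is a hypothetical value chosen so that the resulting current is 1 A; the shunt resistance, gain and rail voltage are those stated in the text.

```python
R_SHUNT_OHM = 5e-3  # current-sensing shunt resistor (5 mOhm)
GAIN = 24.7         # INA333 instrumentation amplifier gain
V_FPGA = 1.0        # FPGA core rail voltage (V)


def shunt_current_a(v_out):
    """Eq. (6): current through the shunt from the INA333 output voltage."""
    return v_out / (GAIN * R_SHUNT_OHM)


def core_power_w(v_out):
    """Eq. (7): FPGA core power from the INA333 output voltage."""
    return V_FPGA * shunt_current_a(v_out)


v_out = 0.1235  # hypothetical oscilloscope reading (V), not a measured value
print(f"I_R = {shunt_current_a(v_out):.3f} A")
print(f"P   = {core_power_w(v_out):.3f} W")
```

Since VFPGA is 1 V, the power in watts is numerically equal to the current in amperes.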

A Keysight Infiniium DSO90254A 2.5 GHz digital oscilloscope was used to measure the INA333 output voltage and, thus, the FPGA core real-time power consumption. The INA333 output voltage data was acquired at a sampling rate of 1.25 MSample/s and, during all experiments, room temperature and FPGA device temperature were kept around 23 °C and 33 °C, respectively.

In the definition of a scenario to evaluate the worst-case DPR energy overhead, both reconfiguration latency and differences between previous and next configuration bitstreams were considered. From Section 5.3, the worst-case reconfiguration latency is 1.051 ms, when reconfiguring the waveform modulation RP to the OFDMB20 mode. Using non-compressed partial bitstreams, Hamming distances within constellation and waveform configurations were computed. OFDMB20 and B15 waveform modes are the ones with the largest Hamming distance. Thus, the scenario considered for DPR energy overhead measurement was the reconfiguration between OFDMB20 and OFDMB15 modes. For measurement purposes, the system operation loop was: remain 4.5 ms in the OFDMB20 mode (without baseband processing); reconfigure to the OFDMB15 mode; remain 2.5 ms in the OFDMB15 mode (without baseband processing); reconfigure to the OFDMB20 mode. Figure 8 shows the obtained power consumption curve. It is possible to identify periods of time corresponding to DPR procedures, where a peak in the power consumption is observable. The duration of these peaks matches the reconfiguration latencies for OFDMB20 and OFDMB15 (approximately 1 ms).
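The Hamming distance between two partial bitstreams is the number of bit positions in which they differ. A minimal sketch of such a comparison is given below; it is not the tool used by the authors, the function name is an assumption, and it assumes both non-compressed bitstreams are available as byte strings of equal length. The two short byte strings are toy data, not real bitstreams.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("bitstreams must have the same length")
    # XOR each byte pair and count the bits set in the result.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Toy example: each byte pair differs in exactly four bit positions.
previous_cfg = bytes([0x0F, 0xF0])
next_cfg = bytes([0xFF, 0x00])
print(hamming_distance(previous_cfg, next_cfg))  # 8
```

In practice, the two inputs would be read from the partial bitstream files of the previous and next configurations for the same RP.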

Table 8 DPR latency and energy consumption overhead

In this application domain, reconfiguration is a rather sporadic event of short duration compared with the periods of time during which the system remains in a certain mode of operation. For this reason, the DPR impact is quantified in terms of energy consumption. Table 8 presents values for the DPR energy consumption overhead, together with DPR latency and partial bitstream sizes for the reconfiguration scenarios considered. The DPR energy overhead is less than 1.5 mJ for both configuration transitions; it is slightly higher when reconfiguring from OFDMB15 to OFDMB20. This transition has a higher associated latency, which can contribute to the higher energy overhead.
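One way to obtain an energy overhead from a sampled power trace is to integrate the power in excess of a baseline over the reconfiguration window. The sketch below illustrates this approach under explicit assumptions: the text does not state how the overhead was computed, so taking the baseline as the power drawn just before reconfiguration is an assumption, as are the function name and the synthetic trace (a constant 1 W excess lasting 1 ms).

```python
def energy_overhead_mj(times_s, power_w, baseline_w):
    """Trapezoidal integration of (P - baseline) over the samples, in mJ."""
    if len(times_s) != len(power_w):
        raise ValueError("time and power vectors must have the same length")
    energy_j = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        excess = 0.5 * (power_w[i] + power_w[i - 1]) - baseline_w
        energy_j += excess * dt
    return energy_j * 1e3


# Synthetic trace: 1 W above a 1 W baseline during 1 ms, i.e. 1 mJ overhead.
times = [0.0, 0.5e-3, 1.0e-3]
power = [2.0, 2.0, 2.0]
print(f"{energy_overhead_mj(times, power, baseline_w=1.0):.3f} mJ")
```

With the 1.25 MSample/s acquisition described earlier, consecutive samples of a real trace would be 0.8 µs apart.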

The use of FPGAs to implement multi-standard, flexible and reconfigurable radio devices, such as the one presented in this work, is particularly suited for base stations [48], where the constraints on power consumption are less stringent than in user terminal equipment. The DPR energy consumption overheads measured in this work are acceptable in the context of commercial base stations.

6 Conclusions

This paper presents a dynamically reconfigurable FPGA-based dual-waveform baseband modulator implemented on a VC707 board, which supports OFDM and FBMC modulation. Through DPR, the system adapts both constellation scheme and waveform mode of operation on-the-fly. The DPR-based baseband modulator is compared with a functionally equivalent static multi-mode design in terms of processing throughput, resource utilization, functional density and power consumption. Moreover, DPR latency and energy overhead are measured. Both static multi-mode and DPR-based designs can process data at rates above 92 MSample/s, surpassing the requirements typically imposed by 4G and by some draft modes of operation for 5G communications below 6 GHz.

The circuit specialization in the DPR-based baseband modulator core allows for resource savings of at least 54% of slices, 31% of BRAM tiles and 28% of DSPs, compared with the static multi-mode baseband modulator core. Additionally, the amount of resources reserved for the DPR-based baseband modulator core implementations is roughly half the amount used by its static multi-mode counterpart. The dynamic power consumption for the dual-waveform modulator core is between 26 mW and 90 mW lower in the DPR-based design, due to a reduction in circuit node activity compared with the static multi-mode design. These savings correspond to 13% and 52% of the static multi-mode baseband modulator core dynamic power consumption, respectively.

The worst-case DPR latency and energy overhead observed are 1.051 ms and 1.31 mJ, respectively. Considering reactivity requirements reported by several wireless communication institutions, the latency introduced by DPR is within an acceptable range. In turn, the DPR energy overhead is tolerable in communication systems such as commercial base stations, where power consumption constraints are not as strict as in user terminals. Apart from its low performance and energy penalty, DPR improves resource utilization efficiency and makes the system more flexible and upgradeable. This can contribute to a longer service time of base station equipment and thus lead to long-term cost savings.

This work not only contributes a DPR-based baseband processing design for the transition and/or coexistence between 4G and 5G radio systems, but also discusses the advantages and drawbacks of applying DPR in modern communications, complemented with a quantitative analysis of the DPR impact on system performance concerning latency and energy consumption.