
1 Introduction

The benefits of flexible applications are pushing developers, as well as industry, to design new techniques, methodologies, and tools. The target of this effort is to develop more flexible systems capable of adapting their performance dynamically, saving hardware resources and power while maximizing performance. However, achieving an optimal design remains a challenge, since most real task workloads depend on run-time system conditions [14] and environmental issues. In this sense, there are two basic aspects that facilitate reaching a compromise solution. The first one involves improving the level of parallelism of the system. The second one requires exploiting the benefits of dynamic and partial reconfiguration (DPR) that SRAM-based FPGAs offer, mostly Xilinx ones [2, 23]. DPR allows designing IP cores with run-time adaptable parallelism and achieving a flexible assignment of resources. This chapter presents a hardware, modular, and scalable deblocking filter (DF) architecture that is capable of modifying its number of modules in one or two dimensions, following a highly modular and regular matrix template of functional units, previously presented in [17]. Among the features that characterize this architecture, the regularity of the communication patterns across the array stands out, since it reduces the number of distributed external memory accesses.

On the other hand, the latest video standards, like scalable video coding (SVC) [10, 22], support different levels of scalability and profiles [16, 26], incorporating a higher degree of flexibility. The disadvantage of such flexibility is an increase in design and implementation costs to deal with the multitude of possibilities offered by these standards. In this context, those costs can be reduced by breaking the video encoder up into flexible hardware modules whose size can be easily changed independently. Moreover, if this modification is carried out at run time, as proposed in this chapter, a highly adaptable scenario can be envisaged. It might be composed of different blocks in charge of video decoding tasks, each capable of dynamically adapting its size, and accordingly its performance, to the type and levels of scalability selected by the users, or to other run-time environmental conditions.

Going further into the profiling of the SVC decoding process, the deblocking filter is not only one of the most computationally intensive tasks of the standard, but it is also highly dependent on the selected video profile [12]. In addition, a parallelization scheme is proposed that performs the scaling process consistently with the data locality restrictions imposed by the architecture. Furthermore, the proposed architecture has been designed to be reusable and flexible, such that its general framework might be adapted to process different kinds of applications exhibiting certain kinds of data dependencies.

The rest of this chapter is organized as follows. In Sect. 2, a review of the state of the art on parallel and scalable architectures is given. Then, the role of the deblocking filter within the H.264/AVC and SVC video standards is described in Sect. 3. In Sect. 4, the possibilities of parallelizing the DF are discussed. Section 5 describes all the modules belonging to the proposed approach, which as a whole form a DF. Section 6 focuses on different implementation issues. Section 7 shows how the proposed scalable DF is integrated as part of an embedded system, while in Sect. 8, dynamic reconfiguration details are presented. In turn, Sect. 9 generalizes the proposed approach to solve different kinds of problems. In Sect. 10, implementation results are presented. Finally, in Sect. 11, the main conclusions of this chapter are highlighted.

2 Related Work

There exists significant research interest in developing applications able to reach a good trade-off between flexibility and real-time performance. In this sense, numerous works have focused on addressing the challenges introduced by parallel and distributed computing applications. This has led to a diverse number of solutions, in terms of tools, methodologies, and architectures. The key target is to reduce the complexity of the design and implementation stages of these resource-demanding applications while maximizing the final reachable performance. In any of these cases, the solution goes through increasing the level of parallelism of the processing cores, independently of whether these cores are software or hardware based.

Some existing approaches explore scheduling and optimal parallelism selection rather than architectural challenges. An example of this kind of approach is PARLGRAN [1], a framework that deals with the mapping and scheduling aspects derived from using dynamic parallelism selection. The purpose of this solution is to maximize the performance of an application task chain by selecting an appropriate parallelism granularity for each task. Different levels of granularity are achieved by instantiating several copies of the same task, allowing all of them to work concurrently. As a consequence, the task workload is shared equitably among the instances, which means that the execution time is reduced proportionally to the number of instances. The disadvantage of this solution is that it only considers tasks without dependencies between disjoint data blocks, which prevents its application to data-parallel tasks with inter-block dependencies, such as the DF. Like et al. [14] tackle a similar problem, but in this case, the solution considers dynamically reconfigurable adaptive accelerators. The authors propose balancing area and execution time while unrolling application loops, depending on the inputs and on the tasks running concurrently in the reconfigurable area. The more the loop is unrolled, the higher the parallelism.

Since that work focuses on accelerator selection, it does not provide architectural novelties on how the different accelerators are built. Recently, Salih et al. have proposed in [21] a scalable embedded multiprocessor architecture implemented on an FPGA. Although this architecture is implemented on an FPGA in order to reduce the time to market, it does not exploit the flexibility of this hardware device as much as it could. The architecture is based on a core processor, which controls, synchronizes, and manages the whole system and the memory accesses, and on several processing modules working in parallel. These modules are implemented as a scalable embedded concurrent computer (ECC) architecture, where one or several modules might work simultaneously. Once again, this architecture is conceived using software techniques, and its scalability refers to parallelizing different kinds of independent tasks, with no data dependencies among them. As a consequence, this solution is not well suited to accelerate algorithms like the DF.

Following the same tendency, other examples of architectures that provide scalable characteristics are [8, 13]. Both of them have been developed to support real-time multimedia applications. The authors in [8] introduce a new framework, P2G, formed by a CPU and several GPUs, that is, execution nodes. The topology between the CPU and the nodes is flexible, since it can change at run time according to the dynamic addition or removal of nodes. The workload is distributed following two scheduling approaches: one of them distributes data throughout the nodes (task parallelism), while the other is responsible for maximizing the performance locally in each node (data parallelism). This approach presents an interesting concept of dynamic adaptability, though the idea is again based on software concepts. On the other hand, in [13], a parallel processor for real-time image processing is presented. This solution, named MX-2 Core, is composed of a small processor and a parallel processing unit (PPU). The design of the PPU is based on 2048 4-bit processing elements (PEs) combined with SRAM cells. The scalability of this approach is explored within the PPU, since the number of enabled PEs can vary in steps of 256, in accordance with the task to be processed. However, the number of PEs is set statically, which means that it is fixed depending on the specific task that is going to be performed. As an alternative to these scalable software-based approaches, hardware solutions provide higher levels of flexibility and adaptability by exploiting the DPR capabilities of some FPGAs.

Achieving scalability by means of DPR allows reusing freed configurable areas for other cores within the reconfigurable logic. However, most existing scalability-related works aim to adapt core functionality or operation quality rather than to set area-performance trade-offs. For instance, the scalable FIR filter in [5] offers the capability of adapting the number of taps to adjust the filter order, providing a compromise between filtering quality and required resources. Likewise, the scalable DCT implementation in [11] allows varying the number of DCT coefficients calculated by the core, in order to adapt the number of coefficients that will subsequently be quantized and, therefore, to adjust the video coding quality.

Regarding DF implementations, a scalable DF architecture exploiting DPR is proposed in [12]. That work explores a block-level parallelism implementation, where varying the number of processing units working in parallel impacts the execution time of a single MB, without dealing with MB dependencies. This approach, while still interesting for flexibility purposes, is limited by the maximum number of 4 × 4 blocks that can be processed simultaneously within one MB, with no further scalability levels. In addition, the designed floor planning does not allow easy reuse of the area released when shrinking the filter, so real area-performance trade-offs cannot be easily achieved.

3 H.264/AVC and SVC Deblocking Filter

The DF algorithm is part of the H.264/AVC and SVC video standards, and it is responsible for improving the visual perception of a reconstructed image by smoothing the blocking artifacts generated by previous tasks within the encoding and decoding loops. These artifacts are mainly due to the division of each video frame into disjoint sets of pixels, called macroblocks (MBs), which are processed independently.

The DF is a highly adaptive algorithm: the filtering operations performed on the MBs of a decoded image vary greatly depending on the image content and on the decisions made during the encoding process. An MB is organized as one 16 × 16 pixel matrix of luminance and two 8 × 8 pixel matrixes of chrominance (blue and red) when using the 4:2:0 coding format, as is the case in almost all consumer video encoders. These pixels are organized into units of 4 × 4 pixels, named blocks, or into 4 lines of pixels (LOPs). The blocks are numbered from 0 to 23, where 16 belong to the luminance, 4 to the blue chrominance, and the remaining 4 to the red chrominance.

The DF data processing is mainly executed by two blocks: the boundary strength and the filter block. The former calculates the filtering mode, a value between zero and four; the difference among these values, or strengths, lies in the number of pixels that might be modified during the filtering process. The latter is responsible for filtering the data, modifying the current pixels and their neighbors. List et al. [15] describe the operations performed by the DF in more detail. In addition, the aforementioned standards constrain the DF behavior, specifying how the data of an MB must be processed depending on the filter strength, as well as the order in which these data have to be filtered, as Fig. 1 represents.

Fig. 1 DF behavior constraints. (a) Data distribution into an MB. (b) Filtering order

As Fig. 1a shows, an MB is split into vertical and horizontal edges by considering the edges of its blocks. Thus, the vertical edges (Vi, 0 ≤ i ≤ 7) are determined by the left borders of a column of blocks. Similarly, the horizontal edges (Hi, 0 ≤ i ≤ 7) are the top boundaries of a row of blocks. As a result, filtering a vertical edge between two blocks corresponds to horizontal filtering, whereas filtering a horizontal edge implies vertical filtering. According to the H.264/AVC and SVC standards, for an MB to be considered completely processed, all its blocks have to be filtered, first horizontally (from left to right) and afterwards vertically (from top to bottom). Figure 1b depicts how the pixels of a LOP are processed, starting from the pixel closest to the edge and moving away from it. In this figure, q0, q1, q2, and q3 represent the pixels of the block being filtered, while p0, p1, p2, and p3 represent its neighboring pixels.

4 Strategies of DF Algorithm Parallelization

The level of parallelism of the DF depends directly on the data granularity adopted (LOPs, blocks, or MBs). In this sense, a fine-grain architecture is focused on processing LOPs, whereas a medium-grain one works at block level, and a coarse-grain one at MB level. In the first two granularity levels, the maximum level of parallelism is limited to a full MB, which means that only one MB can be processed concurrently, even when it is split into the smallest units. This forces all the MBs to be processed sequentially, one by one, following a raster-scan pattern. This pattern implies a strict order, starting from the top-left corner of an image and moving to the right, repeating the process row by row until image completion. This is the strategy followed by all the fine- and medium-grain approaches, understanding granularity from a data perspective, independently of the device characteristics. Moreover, static architectures are also constrained to follow this kind of strategy. However, in order to overcome this limitation and to process several MBs in parallel, it is necessary to explore new techniques. It is in this scenario that the wavefront strategy comes into play. A wavefront pattern allows several MBs to be processed simultaneously out of order, but not randomly, while fulfilling the data dependencies. Further details about the general wavefront approach are given in Sect. 9, while the dependencies existing in the DF algorithm are discussed below.
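
As a simple illustration of the difference between the two orders, the following Python sketch (an illustration, not part of the actual hardware design) groups the MBs of a small frame into data fronts, assuming each MB must wait one MBcycle for both its left and its upper-right neighbor, which is the dependence pattern detailed in Sect. 4.1. All MBs sharing a step can be filtered in parallel, whereas the raster scan processes exactly one MB per step.

```python
# Illustrative sketch: raster-scan order vs. wavefront data fronts for the
# dependence pattern of Sect. 4.1 (left and upper-right neighbors, one
# MBcycle apart). Not part of the chapter's VHDL design.

def raster(cols: int, rows: int):
    # One MB per step, row by row from the top-left corner.
    return [[(x, y)] for y in range(rows) for x in range(cols)]

def wavefront(cols: int, rows: int):
    # MB (x, y) becomes ready at step x + 2y, so each front groups
    # the MBs that can be filtered in parallel.
    fronts = {}
    for y in range(rows):
        for x in range(cols):
            fronts.setdefault(x + 2 * y, []).append((x, y))
    return [fronts[s] for s in sorted(fronts)]

if __name__ == "__main__":
    for step, front in enumerate(wavefront(6, 3)):
        print(step, front)  # up to 3 MBs become independent per step
```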

4.1 MB-Level DF Parallelization Strategies

As follows from the DF algorithm description, horizontal filtering must precede vertical filtering, and in both cases, information on the data currently being filtered and on its neighbors is necessary. These constraints limit the level of parallelism of the DF.

The specific dependencies between MBs are depicted in Fig. 2. In this figure, MBH refers to an MB once it has been filtered horizontally, and MBHV represents the MB after it has been filtered both horizontally and vertically. An MB is completely processed when MBHV is filtered again: horizontally, during the horizontal filtering of its right neighbor (after this process, the MB is represented as [MBHV]), and vertically, during the filtering of its bottom neighbor.

Fig. 2 Data dependencies between MBs for a 6-MB-width video frame

Following the example shown in Fig. 2, MB7 needs the MB6HV information to be filtered horizontally. Subsequently, it requires both MB7H and [MB2HV] to be filtered vertically. [MB2HV] is ready once MB3 has finished its horizontal filtering. Finally, MB7 is completely processed once MB8 and MB12 have been filtered horizontally and vertically, respectively.

A possible solution to exploit MB-level parallelism is to use a wavefront order, in the same way as state-of-the-art multiprocessing solutions [24], as can be observed in Fig. 2. Nevertheless, this approach has to wait twice the number of clock cycles needed to filter a full MB (one MBcycle), that is, until [MBHV] is available, before the subsequent core can start processing. To overcome this limitation, an optimized wavefront strategy was proposed by the authors in [3], in which horizontal and vertical filtering are separated into sequential stages. Thus, the top MB horizontal filtering has already finished when the MB vertical filtering begins, so that [MBHV] is available. As a result, one MBcycle is saved with respect to previous approaches, as Fig. 3 shows.

Fig. 3 Proposed wavefront strategy for a 6-MB-width video frame

Larger architectures intended to process more MBs at a time require the parallelization strategy itself to be scalable in order to obtain consistent results. In fact, an adaptable MB-level parallelism is mandatory to respect the data dependencies between MBs, no matter the size of the architecture in terms of processing units.

Compared with previously reported architectures, this chapter proposes a dynamically scalable DF that exploits a coarser granularity than the mentioned state-of-the-art approaches. The architecture works at MB level and allows reusing the released area in order to balance area-performance trade-offs. The scalable DF follows a parallelization strategy fully compatible with the scaling process of the architecture, where the data dependencies between MBs are respected in all cases. The level of scalability varies from just one functional unit (FU) up to the maximum number of resources available in the reconfigurable logic area.

5 Global Architecture

The core of the proposed architecture is a coarse-grain homogeneous array of processing elements, called functional units (FUs), as depicted in Fig. 4. Each unit is able to carry out a complete filtering operation on an MB, such that the full array can process a region of the image in parallel. A more detailed description of each FU can be found in [4]. The main strengths of the proposed structure are its inherent parallelism, regular connections, and data processing capabilities. The purpose of the rest of the modules included in the architecture, described below, is to feed each FU with the required MB as well as to synchronize the array.

Fig. 4 Processing array structure

A mechanism responsible for generating the valid sequence of MB addresses has been included in a hardware module called input controller (IC), in order to respect the processing order defined by the proposed parallelization pattern. This module receives the corresponding sequence of MBs from the external memory and sends them to the array of FUs. In addition, modules named input memories (IMs) have been included at the top of each column of the processing array in order to parallelize data provision to the FUs. The main components of these blocks are FIFO memories that distribute the appropriate MBs vertically along each column. To capture the MB associated with each unit, a module called router has been attached to each FU. Once all the FUs have captured their corresponding unfiltered MBs, these data blocks are processed in parallel. Once processed, the routers send them back to the output memories (OMs). These modules are based on FIFO memories that store the MBs received from the vertical connection and transmit them in sequential order to the output controller (OC), which sends the processed MBs back to the external memory. The data sending, processing, and result transmission stages have been pipelined and overlapped.
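
The following sketch is an assumption-level behavioral model of this distribution (the real modules are VHDL hardware): the IC fetches MBs in wavefront order, the IMs feed their own columns, and each router captures the first MB that reaches it, forwarding the rest downward.

```python
# Behavioral model of the IC -> IM -> router distribution described above.
# This is an illustrative abstraction, not the actual VHDL implementation.

def distribute_mbs(mbs, n_cols: int, n_rows: int):
    """Return grid[col][row]: the MB captured by each FU in one stage."""
    grid = [[None] * n_rows for _ in range(n_cols)]
    for i, mb in enumerate(mbs):
        col, row = divmod(i, n_rows)  # each IM holds n_rows MBs for its column
        grid[col][row] = mb           # the router at (col, row) captures it
    return grid

# One processing stage of a 2 x 3 array consumes 6 MBs:
print(distribute_mbs([f"MB{i}" for i in range(6)], n_cols=2, n_rows=3))
```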

All the mentioned modules of the architecture include distributed control logic to manage data transmission while allowing architecture scalability. Thus, the different modules automatically communicate only with their neighbors using shared signals, without a centralized control, which would reduce scalability, since the control structure would have to be designed ad hoc for each possible size of the array.

5.1 MB to FU Allocation Policy

Once the architecture has been described, the policy defining which MB is processed in each FU of the array is discussed, considering the data locality of the algorithm and the regularity of the architecture.

Since the horizontal and vertical processes for each MB have to share the data of the full MB, both are carried out in the same FU to minimize communication costs. In addition, according to the existing dependencies, semifiltered data (MBHV and [MBHV], following the nomenclature of Fig. 2) have to be shared between the FU filtering an MB and the FUs in charge of its top and left neighbors. To reduce this overhead, an MB is always filtered in the same unit as its left neighbor, while it is filtered in the unit immediately below the one in charge of its top neighbor. As explained in the implementation section, specific connections between FUs have been created to allow the exchange of this semifiltered information without involving the routers. The final allocation sequence for filtering all the MBs within a frame depends on the total number of FUs of the array and on the height of the frame in MBs. Thus, each FU filters all the MBs contained in a particular row of the frame, always respecting the data dependencies. However, if the number of FUs is smaller than the height of the frame in MBs, the filtering process is modified: the frame is processed in stripes whose height equals the number of FUs in the array. This proposal not only reduces the amount of information transferred among FUs; unlike state-of-the-art proposals, only the current MB must be requested from the external memory for each filtering process, since the information related to the neighboring MBs is received from other FUs during the horizontal and vertical filtering stages, as explained above. Consequently, the data transferred between the external memory and the DF are largely reduced. This allocation strategy is shown in depth in Fig. 9a, included in Sect. 9, where the generalization of the architecture is addressed, and is illustrated by the sketch after the figure.

Fig. 9 Possible wavefront patterns
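
A minimal sketch of this row-based allocation policy (illustrative Python, not the hardware implementation), assuming frames taller than the array are cut into stripes of N MB rows as described above:

```python
# Hypothetical helper mirroring the allocation policy of Sect. 5.1:
# each FU filters one MB row; frames taller than the array are cut
# into stripes whose height equals the number of FUs.

def allocate(mb_row: int, n_fus: int):
    stripe = mb_row // n_fus  # horizontal stripe of the frame
    fu = mb_row % n_fus       # FU in charge of this row within the stripe
    return stripe, fu

# Example: a 1080p frame has 68 MB rows (1088 / 16); with 3 FUs,
# row 7 is handled by FU 1 while processing stripe 2.
assert allocate(7, 3) == (2, 1)
```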

6 DF Implementation Details

In this section, details regarding the implementation of the architecture are described, focusing on the issues related to its run-time scalability, including methodological aspects.

6.1 Architectural Design for Run-Time Scalability

One of the main advantages of using this kind of highly modular and parallel architecture, when considering the use of DPR, is that the design partitioning follows straightforwardly from the modular design flow [28]. Thus, each different module (IM, OM, and FU) is treated as a reconfigurable element on its own. The VHDL description of each module is synthesized, mapped, placed, and routed independently. As a result, three separate partial configuration bitstreams are generated. Following this methodology, since all instantiations of a given component are identical, only one version of each has to be generated, which can afterwards be reused in different positions of the array.

Adapting the proposed architecture to be dynamically scalable implies the addition of bus macros (BMs) [27], as well as the design of an independent floor planning for each module. According to traditional reconfigurable design flows, the BMs must be placed as part of the interface between the static design and the reconfigurable one. Furthermore, due to the scalability of this DF architecture, all the reconfigurable modules (IM, FU, and OM) require BMs at their input and output ports. The latest Xilinx dynamic reconfiguration design flow, from the v12 release onward, avoids the use of these fixed macros to guarantee the correctness of communications across module frontiers [19]. Instead of BMs, elements called partition pins are automatically inserted by the tool in all frontier pins of every reconfigurable module type to guarantee communication validity. Even though it reduces the DPR overhead, this approach does not allow module relocation to different positions of the FPGA, which makes it unsuitable for this kind of scalable architecture, as it relies on module replication.

As will be shown, each module has been designed to occupy more area than its static equivalent. This extra area allows routing all the internal signals within the reconfigurable region. Thus, even though not all the logic resources within the reconfigurable region are occupied, the entire region is necessary to come up with a fully routed design. The area usage of each module directly impacts the size of the corresponding configuration bitstream and, consequently, the DF reconfiguration time itself. Thanks to the regularity of this architecture and the exploitation of DPR, only eight different configuration bitstreams are necessary to configure any m × n size. The DF is scaled by replicating and relocating the configuration bitstreams of the different modules to other positions in the FPGA. As an example, in the case of adding an extra row to a DF with two columns, five modules have to be configured (the two FUs of the new row, plus the two OMs and the OC shifted to their new positions), while the others remain unchanged.

6.2 Implementation of the DF Modules

In the following subsections, design issues corresponding to each module, as well as main floor planning decisions taken to achieve scalability, will be described.

Router and Functional Unit (FU) Since each FU is always attached to its router, both the router and the FU have been implemented inside a single reconfigurable module. As mentioned before, the role of the router is to capture the first MB received from the IM in each processing stage and then transmit the subsequent MBs, unchanged, to the FUs below. All these transactions are repeated periodically with each new MBcycle. Both the data and the control vertical connections among the routers and the IM/OM have been included at the module border, as shown in Fig. 5. Additionally, specific point-to-point connections with the FU below have been created to exchange semifiltered information, as described in the previous section. In the case of the last FU of a column, the next FU is located at the top of the next column. To tackle this issue, bypass logic has been included in both the OM and the FU blocks to send the semifiltered data upward. In the case of the FU, this bypass connection is highlighted in Fig. 5.

Fig. 5 Router and FU design

The north and south connections of the module are completely symmetric, since both use BMs in the same positions. Consequently, the FU and router module can be stacked in vertically aligned positions inside the FPGA. However, it cannot be replicated horizontally, because the BRAM columns are located at different positions on the right and left sides of a region. To overcome this limitation, a different module, compatible with the FPGA area on each side, has been created, with the same local behavior but a different floor planning design. Consequently, two different versions of the FU exist. This is shown in Fig. 7, which represents the full architecture.

Fig. 7 Complete architecture design

Input Memory (IM) Unfiltered MBs come from the external video frame memory and are transmitted through the IM FIFOs. During this process, each IM holds the MBs that will be processed by the column of FUs immediately below it. Consequently, the memory size of the IM limits the maximum vertical size of the architecture. In the final implementation, due to the physical restrictions of the FPGA, the architecture was limited to a 2-column × 3-row size. Once this memory has been filled, the IM distributes the MBs vertically to the FUs of its column. Consequently, both horizontal and vertical BMs have been included, as shown in Fig. 6.

Fig. 6 IM floor planning design

Lines to transmit semifiltered MBs have been included for the worst-case scenario, as explained before. The IM receives semifiltered data from the FU bypass output and sends them to the IM on its right side. In addition, this module sends the semifiltered data from the left IM to the first unit of the column below. Furthermore, in the case of the last column, a special communication module has been implemented to send these data back to the IC. To make this feasible, a special horizontal semifiltered bypass has been included through the IM.

Output Memory (OM) This module includes the logic in charge of receiving the completely filtered MBs from the FU column above, as well as transmitting the data to the output controller. It also includes inputs and outputs for transmitting semifiltered MBs. Specifically, it bypasses the data coming from the semifiltered MB output of the FU immediately above to its semifiltered bypass input, so they can be transmitted to the next column.

Input Controller (IC) The IC is the input module of the architecture and the communication point with the static part of the system. This module is not dynamically reconfigured when the DF is scaled. However, some of its registers are written by the external embedded processor in order to indicate the current dimensions of the DF and the size of the video frame. With these dimensions, the IC generates the correct MB reading address sequence for any size of the architecture. Bus macros corresponding to the MB data and control signals have been included to communicate with its adjacent IM.

Output Controller (OC) The OC is also part of the static design. It receives data from the OMs and sends them back to the video memory. Consequently, BMs have to be located across the static-reconfigurable border to allow for different architecture sizes. Future work will communicate the OC outputs with the IMs, so as to have a unique communication module with the static area.

The scalability of the full architecture is shown in Fig. 7. Since the columns are different, eight independent bitstreams have been generated: two for each of the FU, IM, and OM, as well as two for the communication modules included to close open connections. Both the IC and the OC belong to the static side of the system, so no partial bitstreams are generated for these modules.

Thus, the maximum size achieved is 2 × 3, using the right half of a medium-size Virtex-5 FPGA (xc5vlx110t).

7 Embedded System Integration

The purpose of this section is to show how a full H.264/AVC video decoder, together with the DF, is integrated as an autonomous embedded system. In this scenario, the DF is implemented with the hardware architecture explained in the previous sections, and the rest of the decoder is executed in software by an embedded microprocessor (MicroBlaze). To guarantee the autonomy of the system, the interface selected to access the configuration memory is the internal configuration access port (ICAP). This port is embedded in the very same configurable logic of the FPGA, so that the system is able to modify its own configuration. Thus, the final integrated embedded system is composed of the microprocessor, the reconfiguration manager that controls the ICAP, the reconfigurable region (where the DF is implemented), the PLB buses [20] that connect the different blocks, and two input/output embedded buffers. It is therefore necessary to solve the integration issues related to supplying unfiltered MBs from the external memories to the DF core, as well as transmitting the filtered MBs to a specific buffer.

Two isolated buffers have been included in the system to simplify the transmission of MBs between the DF and the rest of the system. One of these dedicated buffers stores unfiltered MBs, whereas the other one handles filtered MBs. The former corresponds to the input buffer of the DF: it is written through a PLB bus interface by other modules of the embedded decoder and can be read directly from inside the DF. The latter is an output buffer written by the DF and read by other modules through the PLB interface. The interconnections between the DF and those buffers have been implemented using a direct memory access (DMA) approach. This avoids introducing additional overhead on the PLB bus, which is shared with other blocks of the system and whose congestion would reduce the performance of every module, including the DF itself. Furthermore, with the DMA, any word belonging to an MB can be read from the memory in one clock cycle. Therefore, the DMA is able to supply data fast enough to support the highest data rate demanded by the system, even when the DF core is scaled up to its maximum size. This could not be guaranteed with a bus approach, since a bus cannot scale its bandwidth according to the data rate necessary to feed different DF sizes. In addition, the DMA can be accessed directly by the IC of the DF core, without requiring extra logic. As mentioned above, the IC generates the number of the required MB, which maps directly to its address in the input buffer, as sketched below.
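
As a simple illustration, assuming the MB record size of 108 words given in Sect. 10.4 and a flat buffer layout (an assumption made here for illustration only), the mapping could look as follows:

```python
# Hedged sketch of the MB-number-to-address mapping used by the IC.
# The 108-word record size comes from Sect. 10.4; the flat layout of the
# input buffer is an assumption made for illustration.

MB_WORDS = 108  # 32-bit words holding one MB and its filtering parameters

def mb_word_address(mb_number: int) -> int:
    return mb_number * MB_WORDS

# The DMA can then fetch one word per clock cycle starting at this address.
assert mb_word_address(2) == 216
```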

Although all data are received and sent through the DMA interface, the DF is also connected to the PLB interface in order to receive configuration values and commands. The required values define the dimensions of both the processing array (M × N) and the image (W × H), whereas the configuration commands signal the beginning and the end of each video frame.

8 DF Dynamic Reconfiguration Management

In this section, some design aspects related to the DPR controller block, which accelerates the scaling process of the DF at run time, are introduced.

The process of scaling the architecture dynamically is carried out by the customized ICAP controller, or reconfiguration engine (RE), proposed by the authors in [18]. Its main task is to control low-level details of the reconfiguration port, including the relocation of configuration bitstreams.

Several features of the reconfiguration engine, together with the regularity and modularity of the DF proposed in this chapter, facilitate a faster scaling of the DF architecture. The RE is implemented in hardware for speed and provides readback, relocation, and writeback capabilities. Moreover, it has been designed to work at 250 MHz, including online relocation, beyond the maximum theoretical throughput of the Virtex-5 reconfiguration port, which, according to the manufacturer, is 100 MHz.

In addition, the configuration bitstreams corresponding to the basic modules of the scalable DF architecture are simplified versions of the traditional bitstreams generated by vendor tools. In this case, the bitstreams do not include configuration commands; that is, only the body of the logic configuration is stored, while both the header and the tail are omitted. Thus, the final position of a reconfigurable module is not fixed to a determined location in the FPGA. The complete partial bitstream is composed at run time, as shown in [18], by adding the header and tail information, as well as the specific frame addresses, which are not included either. This strategy provides two advantages:

  1. Reducing the bitstream size. Consequently, the time required to transfer data from the external memory decreases, and the memory storage is minimized.

  2. Accelerating relocation and increasing its feasibility.

This relocation technique is much faster than previous proposals in the state of the art. For example, approaches like [6] are based on bitstream parsers instead of a run-time composition of the bitstream.
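
A rough sketch of the composition step is given below. All helper functions and packet layouts here are hypothetical placeholders, not the real Virtex-5 configuration format; see [18] for the actual engine.

```python
# Assumption-level sketch of run-time bitstream composition: only the
# command-free body of each module is stored, and header, frame address,
# and tail are generated per target location. The packet layout below is
# a hypothetical stand-in, not the real Virtex-5 format.
import struct

def make_header(frame_address: int, length: int) -> bytes:
    return struct.pack(">II", frame_address, length)  # hypothetical layout

def make_tail() -> bytes:
    return b"\x00\x00\x00\x00"  # hypothetical padding commands

def compose(body: bytes, frame_address: int) -> bytes:
    return make_header(frame_address, len(body)) + body + make_tail()

def relocate(body: bytes, targets: list) -> list:
    # The same stored body is reused for every target region, so external
    # memory is read once no matter how many copies are placed.
    return [compose(body, addr) for addr in targets]
```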

Moreover, the reconfiguration engine incorporates a direct, dedicated data link to the external DDR2 memory, which acts as a bitstream repository, using the native port interface (NPI). Through this link, the reconfiguration port obtains new configuration data fast enough to perform the reconfiguration process without introducing stalls.

On the other hand, taking advantage of specific features of modular and regular architectures, it is possible to accelerate the dynamic reconfiguration even more. In most cases, the scaling process implies the replication of the same element, which means relocating that element to different positions of the architecture. Under these conditions, the possibility of pasting the same reconfigurable module in different positions, without having to read its configuration bitstream from the external memory each time, gains relevance. Moreover, the module might already be configured in other positions of the device. Consequently, internal memories have also been included to allow this readback/relocation/writeback approach for full modules of the architecture. This approach eliminates the overhead of reading the configuration information from the external memory, sharply reducing the reconfiguration overhead.

The RE includes a software application programming interface (API) that simplifies the reconfiguration process. Further information about this API can be found in [18].

9 Scalable Wavefront Architecture Generalization

The benefits of the architecture proposed in this chapter go beyond the hardware implementation of the H.264 DF algorithm. The whole system might be reused to implement diverse scalable applications and algorithms that follow the same parallelization pattern, the wavefront approach. This pattern appears in scientific applications such as [9, 25], as well as in several video processing tasks [24], among others. This section analyzes and brings up general considerations that should be taken into account in order to reuse the proposed architecture and adapt its modules to cope with other problems.

9.1 General Wavefront Pattern and Dependencies Characterization

The wavefront template is based on the arrangement of data in multidimensional grids, so that the elements located in certain positions are fully independent. Hence, these independent data, called the data front, can be processed simultaneously. The data dependencies among operations determine the data-front shape and, consequently, the processing order of all the elements. In any case, whenever the data front respects the given reading order, the whole array is parallelized while respecting the data dependencies among the elements. A popular example of the wavefront pattern is the diagonal model, where computation starts at the top-left corner of an array of data, and the data front is then propagated diagonally until the bottom-right corner is reached. This is the pattern followed, for instance, by the modified DF proposed in this chapter. The same pattern shows up whenever the computation of each data element depends on its upper and left neighbors. In this particular scenario, all the elements belonging to an antidiagonal can be processed in parallel. However, the wavefront pattern is not limited to this diagonal shape. Some examples are shown in Fig. 8, including horizontal data fronts, with each element depending on two or three cells simultaneously.

Fig. 8 Possible wavefront dependence schemes

Each of the patterns shown in Fig. 8 corresponds to several problems whose data dependencies have been evaluated to parallelize their execution following the wavefront pattern, as shown in [7].

According to the dependence schemes described in Fig. 8, each case differs in the shape of the data front or in the order in which the data fronts are explored. It is always possible to process all the elements located in these fronts at the same time, by assigning each element to a different FU of the processing array. Moreover, the rule established in the case of the DF algorithm to distribute the data among the FUs can be kept without loss of generality: the first element of the data front is always processed in the first FU of the array, and the rest of the elements are distributed across the subsequent FUs of the architecture. The number of elements that can be processed in parallel coincides with the number of FUs implemented in the array.

Regarding the characterization of the architecture, for each neighboring element on which a given one depends, a vector with two components is defined. The first component (D_t) describes the distance, in processing cycles, between the moments when the current and the referenced blocks are processed. The second component (D_FU) defines the distance between the FUs in charge of processing both blocks:

$$D = (D_{t},\;D_{\mathrm{FU}}).$$
(1)

Thus, D_t is an integer greater than or equal to one, and D_FU is an integer that can take positive or negative values, depending on whether the FU in charge of processing the neighboring block occupies an upper or a lower position in the array, respectively.

For instance, in the case of Fig. 8a, each element (except those located at the borders) depends on its left and upper-right neighbors. Considering both the parallelization and the FU allocation diagrams shown in Fig. 9, the vector defining the dependence with respect to the left neighbor is D1 = (1,0), since it is always processed in the same FU, but during the previous processing cycle. Regarding the upper-right neighbor, the vector is D2 = (1,1), since it is always processed in the previous functional unit, during the previous cycle. The same vectors apply to case (b), in spite of its different data dependencies. For the other cases, the existing relationships are listed below.

For (c): D1 = (1,0), corresponding to the upper neighbor, and D2 = (1,−1), corresponding to the upper-right neighbor

For (d): D1 = (1,0), corresponding to the upper neighbor, and D2 = (1,1), corresponding to the upper-left neighbor

For (e): D1 = (1,1), corresponding to the upper-left neighbor, D2 = (1,0), corresponding to the upper neighbor, and D3 = (1,−1), corresponding to the upper-right neighbor

For (f): D1 = (1,0), corresponding to the upper neighbor, D2 = (1,1), corresponding to the upper-left neighbor, and D3 = (1,2), corresponding to the upper-left-left neighbor

Regardless of the specific dependence pattern, the data mesh is always processed in independent disjoint blocks, corresponding to the dashed box shown in Fig. 9.
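
The sketch below (illustrative Python) derives the (D_t, D_FU) vectors of Fig. 8a from the earliest schedule compatible with the dependencies, assuming the row-based FU allocation of Sect. 5.1, that is, element (x, y) runs on FU y:

```python
# Deriving the dependence vectors of Eq. (1) for the pattern of Fig. 8a.
# Assumes the row-based allocation of Sect. 5.1: element (x, y) runs on
# FU y. Offsets (dx, dy) point toward the referenced neighbor.
from functools import lru_cache

DEPS = [(-1, 0), (1, -1)]  # left and upper-right neighbors (Fig. 8a)

@lru_cache(maxsize=None)
def cycle(x: int, y: int) -> int:
    """Earliest processing cycle of element (x, y)."""
    preds = [(x + dx, y + dy) for dx, dy in DEPS]
    preds = [(a, b) for a, b in preds if a >= 0 and b >= 0]
    return 0 if not preds else 1 + max(cycle(a, b) for a, b in preds)

x, y = 3, 2
for dx, dy in DEPS:
    nx, ny = x + dx, y + dy
    D = (cycle(x, y) - cycle(nx, ny), y - ny)  # (D_t, D_FU)
    print(D)  # -> (1, 0) for the left neighbor, (1, 1) for the upper-right
```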

9.2 Architecture Customization

As explained above, the proposed architecture might be generalized to tackle different kinds of problems, such as those characterized by the data fronts shown in Fig. 9, by reusing and adapting its structure. This section explains the main changes that should be introduced into the modules of the architecture, as well as into its management principles, in order to adapt their behavior and functionality to the new conditions.

Functional Unit These modules are the processing cores of the architecture, and it is compulsory to adapt them in order to address different processing problems. Just as the specific design of the processing units for the DF is out of the scope of this chapter, the design of specific units for other problems is not covered in this section; such units can even be reused from existing nonparallel implementations. This chapter thus provides a general scalable framework suited to parallelizing the hardware execution of these tasks.

In addition to changing the FU behavior, it is necessary to adapt the FU internal memories to rearrange data blocks according to the application demands. Considering that each independent memory is able to store a basic data unit, the number and type of the required memories follow from the dependence vectors described in the previous section. Thus, two kinds of memories, called M_p and M_q, are defined. On the one hand, M_p memories store the required data blocks received from the upper and lower FUs in previous processing cycles, using the direct FU-to-FU connections; that is, they store data blocks whose dependence vector has D_FU ≠ 0. On the other hand, M_q memories store both the current block and previous blocks processed in the same unit, that is, dependencies with D_FU = 0. The general sequence of operations carried out in each functional unit is

  • M_q[0] = DataIn

  • DataTemp = FU(M_p[0], …, M_p[p_max], M_q[0], …, M_q[q_max])

  • M_q[i+1] = M_q[i], ∀ i ∈ (1, q_max − 1)

  • M_p[i+1] = M_p[i], ∀ i ∈ (0, p_max − 1)

  • M_q[1] = DataTemp

  • M_p[0] = DataInVertical

DataOut and DataOutVertical might be equal to DataTemp, to DataIn, or even to partial results of the processing task. In addition, the processing stage might be divided into subsequent phases, with data sharing in between, as in the case of the deblocking filter algorithm (H and V phases). In those cases, temporal memories have to be included in the architecture. With respect to DataInVertical and DataOutVertical, if more than one dependence exists with a fixed D_t and |D_FU| > 0, several DataInVertical and DataOutVertical channels will exist. The sketch below models this sequence.
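
An assumption-level Python model of this sequence follows; fu_op is a placeholder for the application-specific processing, and the shift assignments transcribe the bullet list above:

```python
# Behavioral model of the generic FU memory management described above.
# M_p holds blocks received over the vertical links (D_FU != 0); M_q holds
# the current block and locally produced ones (D_FU = 0). fu_op is a
# placeholder for the application-specific processing.

class FunctionalUnit:
    def __init__(self, p_max: int, q_max: int, fu_op):
        self.Mp = [None] * (p_max + 1)
        self.Mq = [None] * (q_max + 1)
        self.fu_op = fu_op

    def cycle(self, data_in, data_in_vertical):
        self.Mq[0] = data_in
        data_temp = self.fu_op(self.Mp, self.Mq)
        self.Mq[2:] = self.Mq[1:-1]   # M_q[i+1] = M_q[i]
        self.Mp[1:] = self.Mp[:-1]    # M_p[i+1] = M_p[i]
        self.Mq[1] = data_temp
        self.Mp[0] = data_in_vertical
        return data_temp              # DataOut/DataOutVertical may equal this
```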

Data Reading Order The main purpose of the IC is to read unprocessed data from the external memory. Before reading these data, it is necessary to generate the sequence of addresses defining the order in which the data will be requested from the external memory. Since the data dependencies vary with each data-front shape, the reading addresses have to be adapted to fulfill the data parallelization scheme. This change can be performed by modifying the internal rules that define the address generation process inside the IC; the general structure of the address generator, however, can remain unaltered.

Data Block Allocation and Distribution The allocation strategy of unprocessed data blocks inside the architecture can be kept regardless of the wavefront processing problem. Provided that the data reading sequence is modified as described in the previous section, the data distribution rule across the array can be kept as well. Thus, each IM will hold the first N input data, N being the number of vertical FUs, and will transmit the subsequent ones to the following IMs. Regarding the routers, each one keeps the first data block and transmits the rest to the following FUs. This approach guarantees that each unit is fed with the required data. Once the data have been processed, they are transmitted to the external memory without further modifications to the current architecture. In the proposed architecture, all the FUs work in a synchronous way, that is, all the units are in the same state during each time period. Therefore, null data blocks have to be used to deal with dead times, as shown in gray in Fig. 10.

Fig. 10 Possible wavefront processing structures

Semiprocessed Data Sharing As depicted in Fig. 9, different data reuse requirements arise from each pattern of data dependencies. One of the main features of the proposed architecture is that it is able to tackle diverse kinds of data dependencies by exploiting local point-to-point connections, without requiring accesses to the external memory. The term semiprocessed refers to data that, after being processed, are needed again to process other data. The number of times and the order in which these semiprocessed data are required depend on the data dependencies. In the case of the DF, to process each MB, the left neighbor is stored in the same FU, while the upper one is received from the FU above. This strategy can be modified if it is necessary to include data dependencies with the element processed in the FU below [required, for instance, in example (e)], as well as to change the number of cells on which each direction, north, south, or horizontal, depends.

Considering the description vectors introduced in the previous sections, the number of vertical connections is equal to the number of dependencies with a fixed D_t and |D_FU| > 0. If D_FU > 0, those dependencies pass through the array from north to south and, if D_FU < 0, from south to north. Together with the channels through the FUs, a number of specific resources have to be included in the IC to store temporal data belonging to different disjoint sets of blocks. Mainly, it is necessary to include a FIFO memory and an FSM to control the process, for each vertical line implemented in the scalable architecture.

9.3 H.264/SVC Generalization

The MB wavefront parallelization pattern present in the DF is not exclusive to this algorithm. Many functional blocks of the H.264 video coding standard, except for the entropy decoder, might be adapted to follow the same pattern [24]. In the case of the DF, the dependence pattern is the one shown in Fig. 8a. Therefore, considering the dependence vectors, one M_p and one M_q memory are required. In addition, another M_q is used to store the current MB, and an extra memory is required to store temporal semifiltered data between the H and V filtering phases. Furthermore, the FU provides two results. The basic operations carried out in each functional unit are described below, following the general template of the previous sections.

  • M_q[0] = DataIn

  • (DataTemp1, DataTemp2) = FU(M_q[0], M_q[1])

  • DataOutVertical = DataTemp2

  • M_p[0] = DataInVertical

  • (DataTemp1, DataTemp2) = FU(DataTemp1, M_p[0])

  • DataOut = DataTemp1

  • M_q[1] = DataTemp2

In the general case of H.264, the possible dependencies at MB level are described in [24]. In this general case, the dependence scheme can include, in addition to the elements required by the DF, the upper-right and upper-left macroblocks. These elements are used in certain modes of intra prediction and of motion vector prediction. Therefore, the dependence vectors are

  • D1 = (1,0), corresponding to the left neighbor

  • D2 = (1,1), corresponding to the upper-right neighbor

  • D3 = (2,1), corresponding to the upper neighbor

  • D4 = (3,1), corresponding to the upper-left neighbor

Consequently, two M_q memories are required to implement those blocks: one for the current MB and the other for the left neighbor, whose dependence vector is (1,0). With respect to M_p memories, three units are required, as well as a single descending communication channel.
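
Under the counting rule of Sect. 9.2 (one M_p per dependence with D_FU ≠ 0; one M_q per dependence with D_FU = 0, plus one for the current block), this budget can be checked mechanically:

```python
# Sketch: deriving the memory budget of Sect. 9.2 from a dependence list.
# One M_p per vector with D_FU != 0; one M_q per vector with D_FU = 0,
# plus one M_q slot for the current block.

def memory_budget(dep_vectors):
    mp = sum(1 for dt, dfu in dep_vectors if dfu != 0)
    mq = 1 + sum(1 for dt, dfu in dep_vectors if dfu == 0)
    return mp, mq

# General H.264 MB dependencies listed above:
assert memory_budget([(1, 0), (1, 1), (2, 1), (3, 1)]) == (3, 2)
```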

10 Implementation Details and Results

This section presents implementation results obtained after the synthesis and implementation stages, considering both the modular architecture itself and the DPR issues.

10.1 Architectural Details

Within the reconfigurable part of this architecture, the array might be formed by a different number of FUs. In this section, the architecture has been implemented without considering DPR issues; that is, the different sizes are obtained by resynthesizing the full architecture. The number of FUs varies on demand, according to performance or environmental constraints. Depending on how the FUs are distributed in the processing array (M × N), different configurations are obtained, where M and N are the width and height of the array. For a specific number of FUs there exist several configurations, characterized by the same computational performance but different hardware resource demands and data transfer delays. In more detail, the resource occupancy of each basic block of the proposed architecture is shown in Table 1.

Table 1 Resources occupancy

Regarding synthesis results, the FU limits the maximum operating frequency of the whole architecture to 124 MHz. On the other hand, performance requirements are shown in Table 2. These data represent the minimum operating frequency necessary for processing different video formats at 30 frames per second (fps), that is, in real time. Each column corresponds to a given number of FUs, while each row gives the frequency values for a specific video format (W × H), where W and H are its width and height in pixels, respectively.

Table 2 Maximum frequency for real time (30 FPS)

Using many FUs is not efficient for processing the smallest video formats, since some units remain idle during the filtering execution. The maximum useful number of FUs is determined by the height of the frame in pixels (H). As an example, the SQCIF and QCIF formats are 96 and 144 pixels in height, respectively; as a consequence, configurations with more than 6 or 9 FUs, respectively, are not appropriate for these formats. Conversely, with a low number of FUs it is not possible to process the largest multimedia formats, like UHDTV, since the associated configurations need more than 200 MHz for real-time performance, whereas the maximum frequency of this architecture is 124 MHz. The required frequency can be estimated as shown below.
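
As a back-of-the-envelope check, the minimum frequency scales with the MB rate of the format divided by the number of FUs. With C_MB denoting the number of clock cycles one FU spends per MB (a parameter not stated explicitly in this section), the relation is

$$f_{\min } = \frac{W}{16}\cdot \frac{H}{16}\cdot 30\;\mathrm{fps}\cdot \frac{C_{\mathrm{MB}}}{N_{\mathrm{FU}}},$$

so, for instance, a 1920 × 1088 frame contains 120 × 68 = 8160 MBs, that is, 244,800 MBs per second at 30 fps, to be shared among the available FUs.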

10.2 Dynamic Reconfiguration Details

In this section, details regarding the reconfigurability of the architecture are explained. Adapting the architecture to be dynamically scalable implies the addition of BMs, as well as the design of an independent floor planning for each module. As shown in the previous section, each module has been designed to occupy extra area in order to be able to route all the internal signals within the reconfigurable region. Thus, even though not all the logic resources within the region are occupied, the entire region is necessary to come up with a fully routed design. Consequently, the resource occupancy increases, as can be seen in Table 3 for each element and in Table 4 for different sizes of the full core. The information in Table 3 can be compared with Table 1 to see the overhead of designing the architecture for DPR: the area of the IM is increased about 15 times, that of the OM about 9 times, and that of the FU about 1.7 times.

Table 3 Resources impact of designing for dynamic scalability
Table 4 Resources occupancy of the full reconfigurable array

The area occupancy of each element directly determines the size of the corresponding bitstream and, consequently, the DF reconfiguration time itself. These aspects are evaluated in the following section.

10.3 Reconfiguration Time

The DF reconfiguration time is a consequence of the area occupied by each module of the architecture. More specifically, it depends on the type of configurable resources used by the modules, since the number of frames, the basic reconfigurable units within an FPGA, differs among BRAM, DSP, and CLB columns.

Reconfiguration times are given in Table 5 for the case of creating each architecture from scratch. The operating frequency of the reconfiguration engine is 200 MHz.

Table 5 Reconfiguration time

The values shown in Table 5 already include the latency due to the software API of the reconfiguration engine, which depends on the number of reconfiguration operations that have to be carried out.

The reconfiguration time is about one or two orders of magnitude above the processing time of a single video frame. This fact has to be taken into account when implementing the policy, inside the decoder, that decides when to modify the size of the architecture.

10.4 Performance Limits

The main constraint on the size of the architecture has to do with the relationship between the time each unit requires to process a single data unit (T_p) and the time required to transmit the input data to the FUs. Only if T_load × N ≤ T_p, N being the number of FUs in each column of the processing array and T_load the time required to load one data unit, is no overhead introduced in the performance of the core. To fulfill this condition, the input/output data channels have to be dimensioned according to the block data size.

In the case of the DF, the size of the data unit, including all the information required to process each MB, is 108 words of 32 bits. Therefore, 64-bit-wide data connections have been included horizontally, between IMs and OMs, and vertically, between routers, allowing up to three rows of functional units to be enabled without compromising the DF performance.
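
With these figures, the loading time per MB and the resulting bound can be checked directly:

$$T_{\mathrm{load}} = \frac{108\ \text{words}\times 32\ \text{bits}}{64\ \text{bits/cycle}} = 54\ \text{cycles},\qquad N = 3\ \Rightarrow\ N\cdot T_{\mathrm{load}} = 162\ \text{cycles},$$

which stays below T_p as long as one FU needs at least 162 cycles per MB, an assumption consistent with the three-row limit stated above.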

11 Conclusions and Future Work

This chapter has addressed the design of spatially scalable architectures that are able to adjust their size at run time, in terms of the number of elements implemented on the FPGA, by means of the DPR feature of commercial FPGAs. The exploitation of this feature allows achieving a variable data-level parallelism that adjusts its properties to the performance fluctuations demanded by the running application. Thus, the area-performance trade-off can be balanced at run time. This chapter takes advantage of these benefits by developing a dynamically scalable deblocking filter architecture, where the number of computation units (FUs) might be adapted online to fulfill the variable requirements of the different profiles and scalability levels of the H.264/AVC and SVC video coding standards.

Given a frame size and certain DF dimensions, each FU will always process the same MBs. This strategy simplifies the DF control, but a new video frame cannot be processed until the previous one has been completely filtered, which introduces an extra overhead. In future work, the memory consumption and its distribution within the FU will be optimized, reducing the area of each FU module. This improvement will also impact the results of designing for DPR, since the routing inside each module will be simplified. Accordingly, the FUs will be floor planned in narrower regions, seeking the homogeneity of both FU columns. Furthermore, the results from the output controller will be sent upward to the input controller instead of being transmitted to the static region. Thus, the DF will have a unique communication point with the rest of the system.