Novel one-dimensional and two-dimensional forward discrete wavelet transform 5/3 filter architectures for efficient hardware implementation

Savić, Goran; Prokin, Milan; Rajović, Vladimir; Prokin, Dragana

doi:10.1007/s11554-016-0656-1

Original Research Paper
Published: 29 November 2016

Volume 16, pages 1459–1478, (2019)

Journal of Real-Time Image Processing

Goran Savić¹,
Milan Prokin¹,
Vladimir Rajović¹ &
…
Dragana Prokin²

Abstract

We implemented a more efficient circuit for the one-dimensional (1-D) forward discrete wavelet transform (DWT) 5/3 filter. Our design utilizes processing and memory resources that are wasted in some other state-of-the-art solutions. Compared with previously published designs, it is at least 33% simpler in terms of used registers, 17% simpler in terms of used logic elements, has 7% higher maximum operating frequency and has 2% lower total power dissipation. These advantages are achieved by a novel non-stationary filter topology which, thanks to feed-forward and feedback paths, reuses the same registers for generating both low-pass and high-pass output coefficients in different time slots. Our design is suitable for image compression systems which use the 5/3 filter, e.g., JPEG 2000. We also propose a two-dimensional (2-D) DWT 5/3 architecture built on the implemented 1-D DWT filter design. The proposed 2-D DWT architecture outperforms all previously published architectures in terms of required memory capacity, which is at least 20% lower than in any other reported solution.

1 Introduction

The advantages of the wavelet transform over conventional transforms were recognized by Daubechies [1]. The wavelet transformation has become a standard technique in audio signal processing and image compression since Mallat [2] proposed the multiresolution representation of signals based on wavelet decomposition. Two-dimensional (2-D) discrete wavelet transform (DWT) has been adopted in the well-known JPEG 2000 still image compression standard [3].

According to the JPEG 2000 standard, the source image or its partitions are decomposed into different decomposition levels using a wavelet transform. These decomposition levels contain a number of subbands, which consist of coefficients that describe the horizontal and vertical spatial frequency characteristics of the original image or its partitions. To perform the forward DWT, the JPEG 2000 standard utilizes a one-dimensional (1-D) subband decomposition of a 1-D set of samples into low-pass coefficients and high-pass coefficients. The choice of filters is an important issue, since Caglar et al. [4] and Egger and Li [5, 6] have shown that the filters strongly influence the performance of the decomposition for compression purposes. The default reversible transformation, according to the JPEG 2000 standard, is implemented by means of Le Gall’s 5-tap/3-tap filter [7].

The standard supports two filtering modes: a convolution-based mode and a lifting-based mode. Convolution-based filters perform a series of multiplications and additions between low-pass and high-pass filter coefficients and extended 2-D pixels forming a window matrix in case of non-orthogonal filters, or separate horizontal and vertical processing, in case of orthogonal filters. Lifting-based filtering consists of a sequence of alternating update steps: pixels with odd indexes are updated with a weighted sum of pixels with even indexes, and pixels with even indexes are then updated with a weighted sum of pixels with odd indexes.
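The two alternating lifting steps can be illustrated with a minimal Python sketch. It assumes the linear (non-rounding) form of the 5/3 filter, an even-length input and whole-sample symmetric extension at the boundaries; the function name `lifting_53` and these boundary choices are illustrative assumptions, not taken from the standard or from any of the cited designs.

```python
def lifting_53(x):
    """One level of linear 5/3 lifting on an even-length list (no rounding)."""
    n = len(x)
    assert n % 2 == 0 and n >= 2

    def sample(i):
        # Whole-sample symmetric extension at both ends (an assumption here).
        if i < 0:
            i = -i
        if i >= n:
            i = 2 * (n - 1) - i
        return x[i]

    half = n // 2
    # Step 1: odd-indexed samples are updated with a weighted sum of their
    # even-indexed neighbours, giving the high-pass coefficients.
    d = [sample(2 * k + 1) - 0.5 * (sample(2 * k) + sample(2 * k + 2))
         for k in range(half)]
    # Step 2: even-indexed samples are updated with a weighted sum of the
    # neighbouring high-pass values, giving the low-pass coefficients.
    # By symmetry d[-1] equals d[0], so index 0 is reused at the left edge.
    s = [x[2 * k] + 0.25 * (d[max(k - 1, 0)] + d[k])
         for k in range(half)]
    return s, d
```

Away from the boundaries, substituting step 1 into step 2 gives the 5-tap weights −1/8, 1/4, 3/4, 1/4, −1/8 for the low-pass output and −1/2, 1, −1/2 for the high-pass output.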

DWT implemented by convolution requires a high number of arithmetic computations and an extensive usage of logic and memory resources, which is not desirable for high-speed and low-power image processing applications. Filter circuits for efficient hardware implementation of DWT, which are mainly convolution-based, are proposed by Parhi and Nishitani [8], Wu and Chen [9], Cheng and Parhi [10], Usha and Chilambuchelvan [11] and Ghantous and Bayoumi [12].

Lifting-based DWT filters have many advantages compared to convolution-based filters, such as simpler design, less required logic and memory resources, lower computational complexity and lower power consumption. The lifting scheme also allows “in-place” computation of DWT as it can be seen in many existing implementations. Various efficient hardware implementations of lifting schemes have been developed.

A direct mapped filter design was implemented by Liu et al. [13, 14]. However, for the 5/3 filter and a single read port memory, the odd and even samples are read serially in alternate clock cycles and buffered, which slows down the overall pipelined filter design by 50%. Liu et al.’s [13, 14] design was further improved by Lian et al. [15], who folded the last two pipeline stages into the first two stages. For the 5/3 filter, no folded computing is necessary since there is only one stage for lifting-based operations. The generalized filter design proposed by Andra et al. [16] is an example of a highly programmable design that can support a large set of filters, including the 5/3 filter. Although some conventional lifting-based filter designs require fewer arithmetic operations, they sometimes have long critical paths. Solving this issue by pipelining would result in a significant increase in the number of registers. However, Huang et al.’s [17] flipping design solved the timing accumulation problem in an efficient way. A multiply-and-accumulate-based programmable filter design has been proposed by Chang et al. [18]. Unlike most traditional DWT filter designs, which compute the next level of decomposition upon completion of the previous level of decomposition, Liao et al.’s [19] recursive DWT filter design is able to process multiple levels of decomposition simultaneously. Liao et al. [20] presented a dual-scan filter design for DWT which processes two independent data streams together using shared functional blocks in an interleaved fashion. A filter-independent DSP-type parallel design, which can be programmed to support a wide range of filters, including the 5/3 filter, has been proposed by Martina et al. [21]. Recently, Meher et al. [22] presented an optimized adder-based formulation for low-area and low-power implementation of 1-D DWT using 5/3 filters.

As it has been shown by Acharya and Chakrabarti [23], the folded design [15] is the simplest while the DSP-based design [21] is the most complex in terms of hardware complexity. Designs [13, 16–20, 22] have comparable hardware complexity and differ mostly in the number of registers and adders. The control complexity of the filter design [13] is very simple, unlike the control complexity of the designs [20, 21]. Other mentioned designs have the moderate control complexity. In terms of timing performance, the filter designs [13, 15–18] have the highest throughput. Design [19] has fewer cycles but its clock period is higher, while filter design [17] has the lowest computation delay. All designs except [19] compute all the outputs of one level of decomposition before starting computations of the next level, while only design [19] interleaves the computations of the higher levels with those of the first level. That is the reason why memory requirements for [19] are lower than for the others.

Among mentioned lifting-based DWT filter designs, in terms of complexity of basic building blocks (1-D 5/3 filter blocks), those proposed in [13–16, 22] have the simplest realizations, which leads to important savings in logic and memory resources. Designs presented in [17–21] have greater hardware complexity, but also greater flexibility including the support for a wide range of different types of filters.

Several 2-D DWT hardware architectures have recently been proposed. A straightforward implementation of the 2-D DWT (direct architecture), as well as the implementation which employs two systolic array filters and two parallel filters (systolic-parallel architecture), has been suggested by Vishwanath et al. [24]. Chrysafis and Ortega [25] proposed the line-based architecture for 2-D DWT with reduced use of memory resources. Chang et al.’s [18] filter design is also employed in the appropriate programmable 2-D DWT architecture. Wu and Chen [9] used their convolution-based filter design and developed a line-based architecture for the 2-D DWT in which they employed polyphase decomposition technique and the coefficient folding technique in order to increase the hardware utilization and to decrease the total computing time. Andra et al. [16] utilized their generalized filter design and developed block-based implementation of four-processor architecture for 2-D DWT which is highly programmable, but which requires a large embedded memory. Liao et al. [19] proposed the lifting-based 2-D DWT recursive architecture (RA) which can process multiple levels of decomposition simultaneously and the 2-D DWT dual-scan architecture (DSA) [20] which uses an interleaving scheme for multilevel decomposition with reduced size of memory and decreased number of memory accesses. A hybrid of level-by-level and line-based 2-D DWT architecture in which the image is scanned into the row processor in a raster format was proposed by Barua et al. [26]. Pipelined architecture (PA), efficient for high-speed and/or low-power applications, has been developed by Xiong et al. [27]. Xiong et al. [28] have also proposed fast 2-D DWT architecture (FA) mainly composed of two horizontal filter modules and one vertical filter module, which employs parallel and pipeline techniques, and high-speed 2-D DWT architecture (HA) which exploits parallelism among four subband transforms. 
The block-based design of 2-D DWT architecture which eliminates the requirement of frame buffer, but which uses larger on-chip memory and has significant overhead due to its input interface units, was presented by Mohanty and Meher [29]. Parallel multilevel lifting-based 2-D DWT architecture with a single processing unit which calculates both predict and update values was proposed by Aziz and Pham [30]. Hsia et al. [31] have designed memory efficient 2-D DWT architecture, which exploits mixed row-wise and column-wise signal flow. For computing 2-D DWT, Darji et al. [32] presented high-performance folded multilevel architecture (FMA) and pipelined multilevel architecture (PMA) with dual-pixel scanning method with higher operational frequency, lower latency and lower power consumption in comparison with other existing DWT architectures with the same specifications. In the same paper, hardware efficient recursive multilevel architecture (RMA) is also described. Highly efficient lifting-based 2-D DWT architecture based on the parallel and folding scheme processing was proposed by Hsia et al. [33, 34].

In terms of memory requirements, 2-D DWT architectures [18, 19, 25, 27, 29, 30], PMA [32,33,34] require lower memory capacity compared to the other architectures and do not require off-chip memory at all. Computing time for 2-D DWT architectures [9, 16, 20, 25–28, 32–34] is lower than for the other mentioned architectures. The lowest output latency is present in systolic-parallel architecture [24] and architectures [28, 29, 31, 32], while output latency of other architectures is several times higher, but still comparable with mentioned ones. In terms of hardware utilization efficiency, the best performances (efficiency close to or approximately equal to 1) are present in architectures [9, 16, 26, 28], FMA [32,33,34], while other mentioned architectures have efficiency lower than 1.

This paper is structured as follows. Section 2 describes the design of the novel 1-D DWT 5/3 filter. Section 3 presents comparison with other state-of-the-art 1-D filter designs. Analysis, synthesis and fitting results obtained in implementation process for our and state-of-the-art 1-D filter designs are described and compared in Sect. 4. In Sect. 5, 2-D DWT 5/3 architecture, which exploits 1-D filter design from Sect. 2, is proposed. Complexity and performance comparisons with other 2-D DWT 5/3 architectures are described in Sect. 6. Finally, a brief conclusion is stated in Sect. 7.

Some of the initial research on which this paper is based has been presented in [35, 36].

2 Design of the 1-D novel forward DWT 5/3 filter

Many state-of-the-art 1-D forward DWTs provide low-pass filtering with transfer function H₀(z) followed by downsampling by two, in the upper branch in Fig. 1, as well as high-pass filtering with transfer function H₁(z) followed by downsampling by two, in the lower branch in Fig. 1. Since these filters are followed by decimators which discard samples, final coefficients y′₀[n] comprise every second sample from the array of low-pass filtered coefficients y₀[n] and coefficients y′₁[n] comprise every second sample from the array of high-pass filtered coefficients y₁[n]. This means that every second time slot used by the low-pass H₀(z) or high-pass H₁(z) filter for the generation of coefficients is wasted, together with the associated logic and memory resources.
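The waste in a filter-then-decimate branch can be made concrete with a small Python sketch. It uses the 5/3 tap values that appear later in Eqs. (3) and (6), assumes zero initial conditions, and keeps the even-indexed outputs of each branch; the helper names `fir` and `conventional_branch` are hypothetical.

```python
H0 = [-1/8, 1/4, 3/4, 1/4, -1/8]  # low-pass taps of the 5/3 filter
H1 = [-1/2, 1.0, -1/2]            # high-pass taps of the 5/3 filter

def fir(x, taps):
    """Causal FIR filter: y[n] = sum_i taps[i] * x[n - i], with x[n] = 0 for n < 0."""
    return [sum(t * x[n - i] for i, t in enumerate(taps) if n - i >= 0)
            for n in range(len(x))]

def conventional_branch(x, taps):
    """Filter every sample, then keep only the even-indexed outputs."""
    y = fir(x, taps)
    kept = y[0::2]
    return kept, len(y) - len(kept)  # (retained coefficients, discarded count)
```

For an 8-sample input, each branch computes 8 filtered values and discards 4 of them, so half of the filtering work in both branches produces values that are never used.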

The approach disclosed in this paper is based on the idea that logic and memory resources of 1-D forward DWT 5/3 filter, which are wasted in state-of-the-art approach from Fig. 1, must be used for generating the coefficients which will not be rejected. That means, no coefficients generated by the filter may be discarded, despite the decimation process.

Namely, unlike the state-of-the-art DWTs from Fig. 1 which pass further every even coefficient generated by H₀(z), while discarding every odd coefficient generated by H₀(z) and pass further every even coefficient generated by H₁(z), while discarding every odd coefficient generated by H₁(z), our filter uses both even time slots and odd time slots for generating and further passing the transformation coefficients (Fig. 2).

In this concept, even time slots are used for generating low-pass coefficients, while odd time slots are used for generating high-pass coefficients. Therefore, the filter utilizes logic and memory resources which have been wasted in state-of-the-art designs from Fig. 1.

Additional savings of memory resources in the filter, compared to state-of-the-art DWTs from Fig. 1, are obtained by using the same memory blocks (or registers) for the process of generating low-pass and high-pass coefficients, which is feasible since low-pass and high-pass coefficients are generated in different time slots. However, since the transfer function of the low-pass filter is different from the transfer function of the high-pass filter, our approach requires non-stationary topology for the forward DWT 5/3 filter. The first filter configuration should be applied in even time slots (when low-pass coefficients are generated), and the second filter configuration should be applied in odd time slots (when high-pass coefficients are generated). The configuration change will be accomplished using the switches. The same memory blocks (or registers) are reused for generating both low-pass and high-pass coefficients due to feed-forward and feedback paths.

Schematic block diagram of the novel forward DWT 5/3 filter is shown in Fig. 3. Each unit delay (a block with z⁻¹ operator) is implemented using the appropriate register.

Control signal c controls four switches responsible for providing non-stationary filter topology. Time diagram of control signal c is shown in Fig. 4.

Whenever the control signal c is at low level (c = 0), for every input sample x[n] with even index n = 2k, two upper switches are opened while two lower switches are closed.

Whenever the control signal c is at high level (c = 1), for every input sample x[n] with odd index n = 2k + 1, two upper switches are closed while two lower switches are opened.

The set of equations which describes signals inside the filter in time instances from n = 0 to n = 4 is presented in Table 1.

Table 1 Equations for signals inside the filter in time instances from n = 0 to n = 4

Based on the equations for time instance n = 4, y[n] can be expressed as:

$$\begin{aligned} y[n] &= x[n - 2] + p \cdot s \cdot x[n - 2] + s \cdot x[n - 3] \\ & \quad + q \cdot s \cdot x[n - 4] + r \cdot x[n - 1] \\ & \quad + r \cdot q \cdot x[n - 2] + r \cdot p \cdot x[n] \\ \end{aligned}$$

(1)

which finally leads to:

$$\begin{aligned} y[n] &= p \cdot r \cdot x[n] + r \cdot x[n - 1] \\ & \quad + (1 + p \cdot s + q \cdot r) \cdot x[n - 2] \\ & \quad + s \cdot x[n - 3] + q \cdot s \cdot x[n - 4] \\ \end{aligned}$$

(2)

From n = 4 onward, the operation of the filter repeats with a period of two time slots, one even and one odd.

For every even index n, y[n] satisfies Eq. (2), which has the same form as Eq. (3) describing H₀(z) in JPEG 2000 standard.

$$\begin{aligned} y_{0} [n] &= - \frac{1}{8}x[n] + \frac{1}{4}x[n - 1] + \frac{3}{4}x[n - 2] \\ & \quad + \frac{1}{4}x[n - 3] - \frac{1}{8}x[n - 4] \end{aligned}$$

(3)

Equations (2) and (3) become identical if p, q, r and s satisfy conditions (4):

$$\begin{aligned} 1 + p \cdot s + q \cdot r &= \frac{6}{8}, \\ p \cdot r &= - \frac{1}{8}, \\ q \cdot s &= - \frac{1}{8}, \\ r = s &= \frac{1}{4}, \\ p = q &= - \frac{1}{2} \\ \end{aligned}$$

(4)

For every odd index n, y[n] satisfies Eq. (5), which has the same form as Eq. (6) describing H₁(z) in JPEG 2000 standard.

$$y[n] = p \cdot x[n - 1] + x[n - 2] + q \cdot x[n - 3]$$

(5)

$$y_{1} [n] = - \frac{1}{2}x[n - 1] + x[n - 2] - \frac{1}{2}x[n - 3]$$

(6)

Equations (5) and (6) become identical if p and q satisfy condition (7), which is in line with the condition (4):

$$p = q = - \frac{1}{2}$$

(7)
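Conditions (4) and (7) can be checked with exact rational arithmetic. The following Python sketch only substitutes the stated values of p, q, r and s into the coefficients of Eqs. (2) and (5) and compares them with Eqs. (3) and (6); the list names are illustrative.

```python
from fractions import Fraction as F

p = q = F(-1, 2)
r = s = F(1, 4)

# Coefficients of x[n], x[n-1], x[n-2], x[n-3], x[n-4] in Eq. (2)
eq2 = [p * r, r, 1 + p * s + q * r, s, q * s]
# Coefficients of the same samples in Eq. (3)
eq3 = [F(-1, 8), F(1, 4), F(3, 4), F(1, 4), F(-1, 8)]

# Coefficients of x[n-1], x[n-2], x[n-3] in Eq. (5)
eq5 = [p, F(1), q]
# Coefficients of the same samples in Eq. (6)
eq6 = [F(-1, 2), F(1), F(-1, 2)]

low_pass_matches = eq2 == eq3
high_pass_matches = eq5 == eq6
```

Both comparisons hold, and the middle low-pass coefficient evaluates to 1 − 1/8 − 1/8 = 3/4 = 6/8, as required by the first condition in (4).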

Finally, by adding the values from expressions (4) and (7) to the product operators shown in Fig. 3, our filter obtains its final form. This filter generates low-pass filtered coefficients y[n] for even index n and high-pass filtered coefficients y[n] for odd index n.

The input samples x[n] are low-pass filtered within time slots with even indexes n = 2k without any need for downsampling by two, in order to produce output samples y₀[n], which actually represent the output samples y[n] with even indexes n = 2k (8).

$$\begin{aligned} y_{0} [n] &= - \frac{1}{8}x[n] + \frac{1}{4}x[n - 1] + \frac{3}{4}x[n - 2]\\ & \quad + \frac{1}{4}x[n - 3] - \frac{1}{8}x[n - 4] \end{aligned}$$

(8)

The input samples x[n] are high-pass filtered within time slots with odd indexes n = 2k + 1 without any need for downsampling by two, in order to produce output samples y₁[n], which actually represent the output samples y[n] with odd indexes n = 2k + 1 (9).

$$y_{1} [n] = - \frac{1}{2}x[n - 1] + x[n - 2] - \frac{1}{2}x[n - 3]$$

(9)

The transfer function of the filter is fully appropriate to the Le Gall’s 5-tap/3-tap filter [7] which is used as default filter for reversible transformation in JPEG 2000 standard.
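The input-output behavior described by Eqs. (8) and (9) can be sketched as a behavioral Python model. This is not a register-level model of the circuit in Fig. 3: it only reproduces which equation applies in which time slot, assumes x[n] = 0 for n < 0, and uses the hypothetical function name `time_multiplexed_53`.

```python
def time_multiplexed_53(x):
    """Return y[n]: Eq. (8) in even time slots, Eq. (9) in odd time slots."""
    def xs(i):
        # Samples before the start of the sequence are taken as zero.
        return x[i] if 0 <= i < len(x) else 0.0

    y = []
    for n in range(len(x)):
        if n % 2 == 0:
            # Even slot: low-pass output centred on x[n - 2]
            y.append(-xs(n) / 8 + xs(n - 1) / 4 + 3 * xs(n - 2) / 4
                     + xs(n - 3) / 4 - xs(n - 4) / 8)
        else:
            # Odd slot: high-pass output centred on x[n - 2]
            y.append(-xs(n - 1) / 2 + xs(n - 2) - xs(n - 3) / 2)
    return y
```

The model produces exactly one output per input sample, so no computed coefficient is discarded. For a linear ramp x[n] = n, the high-pass outputs from n = 3 onward are zero and the low-pass outputs from n = 4 onward equal x[n − 2], as expected for a filter pair that annihilates and preserves linear trends, respectively.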

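Since the multiplier coefficients p, q, r and s are all negative or positive powers of two (−1/2 and 1/4), the multipliers in our filter can be realized as permanently shifted hardware connections between output and input bit lines, thus removing any need for hardware multipliers.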
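As a software analogy of these hard-wired shifts, the following Python sketch scales fixed-point integers by −1/2 and 1/4 using arithmetic right shifts only. It assumes 12 fractional bits (the format used in Sect. 4) and relies on Python's `>>` being an arithmetic shift for negative integers; all function names are illustrative.

```python
FRAC_BITS = 12  # fractional bits, assumed to match the 12.12 format of Sect. 4

def to_fixed(value):
    """Convert an integer sample to fixed point by appending fractional bits."""
    return value << FRAC_BITS

def times_minus_half(v):
    """Multiply by p = q = -1/2 with a one-bit arithmetic right shift."""
    return -(v >> 1)

def times_quarter(v):
    """Multiply by r = s = 1/4 with a two-bit arithmetic right shift."""
    return v >> 2

def high_pass(x1, x2, x3):
    """Eq. (9) on fixed-point values x[n-1], x[n-2], x[n-3], without multipliers."""
    return times_minus_half(x1) + x2 + times_minus_half(x3)
```

For integer-valued input samples the shifts are exact, because the discarded low-order bits are zero.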

In order to keep the number of wavelet coefficients the same as the number of input data samples, the well-known symmetric extension of input samples at image boundaries can be used.
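A minimal Python sketch of this idea follows. It assumes whole-sample symmetric extension (the boundary sample itself is not repeated) and applies the tap weights of Eqs. (8) and (9) centered on even and odd samples, respectively; the function names and the choice of extension variant are assumptions for illustration.

```python
def symmetric_extend(x, pad):
    """Whole-sample symmetric extension: x[pad]..x[1] | x[0]..x[N-1] | x[N-2]..."""
    n = len(x)
    assert 0 <= pad < n
    left = [x[i] for i in range(pad, 0, -1)]
    right = [x[n - 1 - i] for i in range(1, pad + 1)]
    return left + list(x) + right

def analyze(x):
    """Return (low, high) coefficient lists whose total length equals len(x)."""
    e = symmetric_extend(x, 2)

    def c(i):
        # c(i) is the extended signal at original index i, for -2 <= i < N + 2
        return e[i + 2]

    low = [-c(i - 2) / 8 + c(i - 1) / 4 + 3 * c(i) / 4 + c(i + 1) / 4 - c(i + 2) / 8
           for i in range(0, len(x), 2)]
    high = [-c(i - 1) / 2 + c(i) - c(i + 1) / 2
            for i in range(1, len(x), 2)]
    return low, high
```

With an 8-sample input, `analyze` returns 4 low-pass and 4 high-pass coefficients, so the coefficient count matches the sample count without any zero padding at the borders.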

3 Comparison with other 1-D forward DWT 5/3 filter designs

In order to illustrate the advantages of described filter (Fig. 3), its design is compared with state-of-the-art convolution-based forward DWT 5/3 filter design [8–12, 25] (shown in Fig. 5) and with the most efficient among the state-of-the-art lifting-based forward DWT 5/3 filter designs [13–16, 26, 27, 33] (shown in Fig. 6) in terms of hardware complexity. Also, some other 1-D DWT 5/3 filter designs, such as [22, 28, 30] are included in comparison (these designs are not presented in figures since they have greater complexity than designs shown in Figs. 3, 5, 6). The design proposed in [16] is a bit more complex than it is shown in Fig. 6, but it can be reduced to the form in Fig. 6 after removing pipeline registers. Comparison with designs proposed in [17–21] is not presented, since these designs have greater hardware complexity, as a price paid for greater flexibility including the support for a wide range of different types of filters, as it has been shown in [23]. Also, comparison with some other 1-D DWT 5/3 filter designs, which are used as basic building blocks in 2-D DWT architectures described in Sect. 1, is not presented since those filter designs have greater hardware complexity as a compromise made in order to achieve greater flexibility of 2-D DWT architecture.

Actual implementation of the proposed filter from Fig. 3 is shown in Fig. 7. Four switches in the proposed filter design are implemented using 3 multiplexers in total. The part of the circuit which contains the upper left switch is implemented using the left multiplexer, the part of the circuit which contains the lower right switch is implemented using the right multiplexer, and finally, the part of the circuit which contains the lower left switch and the upper right switch is implemented using the multiplexer in the middle, since these switches are connected to the same node.

Table 2 provides the list of used hardware components and the estimated critical path delay (where T_A and T_MUX represent the delay time of the adder and multiplexer, respectively) for the aforementioned filter designs. It can be seen that the proposed filter requires the lowest memory size. The state-of-the-art lifting-based 5/3 filter design (shown in Fig. 6) requires two input data samples x₀[n] and x₁[n] at the filter input in the same clock cycle, and generates, after a processing delay, two resulting data samples y₀[n] and y₁[n] at the filter output in the same clock cycle. However, if input data samples are received serially in alternate clock cycles, an additional logic circuit at the filter input is necessary in order to ensure proper rearrangement (i.e., timing adjustment) of input samples, so that the lifting-based 5/3 filter design can process them properly. Also, if output data samples have to be generated serially in alternate clock cycles, an additional logic circuit at the filter output is necessary. Similarly, the state-of-the-art convolution-based 5/3 filter design (shown in Fig. 5) requires an additional logic circuit at the filter output when output data samples have to be generated serially in alternate clock cycles. The proposed filter design does not require any additional logic for input data splitting when the odd and even input samples are received serially in alternate clock cycles, nor any additional logic for output data combining when the odd and even output coefficients have to be generated serially in alternate clock cycles.

Table 2 Used hardware components for 1-D forward DWT 5/3 filters

It also can be seen that the convolution-based design and the proposed filter design have the shortest estimated critical path delays, compared to other designs, while the lifting-based design has the longest estimated critical path delay.

4 Experimental results for 1-D forward DWT 5/3 filter designs

In order to carry out functional verification, presented filter has been simulated with 24-bit two’s complement fixed point number format, with 12 integer bits and 12 fractional bits using Altera Quartus II software. This data format has been chosen since it ensures correct representation of generated coefficients for at least 4 levels of decomposition. Simulation results confirmed the perfect match between our filter and Le Gall’s 5/3 filter in terms of generated output coefficients.
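The number format can be illustrated with a short Python sketch that encodes and decodes 24-bit two's complement words with 12 fractional bits. It assumes that the 12 integer bits include the sign bit, giving a representable range of [−2048, 2048) with a resolution of 2⁻¹²; the function names are hypothetical and unrelated to the Quartus II tool flow.

```python
TOTAL_BITS = 24
FRAC_BITS = 12

def to_q12_12(value):
    """Encode a real number as a 24-bit two's complement word in 12.12 format."""
    raw = round(value * (1 << FRAC_BITS))
    lo = -(1 << (TOTAL_BITS - 1))
    hi = (1 << (TOTAL_BITS - 1)) - 1
    if not lo <= raw <= hi:
        raise OverflowError("value does not fit in the 12.12 format")
    return raw & ((1 << TOTAL_BITS) - 1)

def from_q12_12(word):
    """Decode a 24-bit two's complement word in 12.12 format to a float."""
    if word & (1 << (TOTAL_BITS - 1)):  # sign bit set: value is negative
        word -= 1 << TOTAL_BITS
    return word / (1 << FRAC_BITS)
```

For example, the filter coefficient −1/8 is encoded as the word 0xFFFE00 and decodes back to −0.125 exactly.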

The presented filter design, the state-of-the-art convolution-based filter design, the state-of-the-art lifting-based filter design, as well as some other recently published 5/3 filter designs, have been implemented for 24-bit data samples usage in Altera FPGA EP4CE115F29I8L chip. The analysis, synthesis and fitting results obtained in the implementation process using Altera Quartus II 10.0 software are represented in Table 3. The second column contains data for state-of-the-art convolution-based forward DWT 5/3 filter design [8–12, 25], the third column contains data for the most efficient state-of-the-art lifting-based forward DWT 5/3 filter design [13–16, 26, 27, 33] without any additional logic for input data splitting nor for output data combining, while the fourth column contains data for the most efficient state-of-the-art lifting-based forward DWT 5/3 filter design [13–16, 26, 27, 33] with additional logic for input data splitting and output data combining. The fifth, sixth and seventh column contain the results for 5/3 filter designs from [22, 28, 30], respectively. Finally, the eighth column represents the results for the filter design described in this paper.

Table 3 Synthesis results of 1-D forward DWT 5/3 filters in Altera FPGA EP4CE115F29I8L

Fitting results clearly show that in terms of used registers, the presented filter is 71% simpler than convolution-based 5/3 filter design, 33% simpler than lifting-based 5/3 filter design without splitting and combining parts, and 50% simpler than lifting-based design with splitting and combining parts. The proposed filter design utilizes 50% less memory resources than recently published filter designs from [22, 30].

In terms of used logic elements, the presented filter is 37% simpler than the convolution-based 5/3 filter design and 28% simpler than the 5/3 filter design from [22], while it has the same complexity as the lifting-based 5/3 filter design without splitting and combining parts. Although the theoretical estimations from Table 2 show that the proposed design has slightly higher complexity than the lifting-based design in terms of the total number of adders and multiplexers, the proposed design is more amenable to optimizations made during compilation: the compiler merges the multiplexers with neighboring adders into optimized combinational logic circuits, which leads to the same number of utilized logic elements. However, when additional logic for input data splitting and output data combining is needed, the described design is 17% simpler than the lifting-based design in terms of used logic elements.

Maximum operating frequency is 7% higher for presented filter in comparison with convolution-based 5/3 filter, and between 70 and 77% higher in comparison with lifting-based 5/3 filter (depending on whether lifting-based design has additional parts for data splitting/combining or not). Presented filter design has 44% higher maximum operating frequency than recently published filter design from [22]. The convolution-based 5/3 filter design and the presented filter design have the shortest critical path delay, filter design from [22] has 35% longer critical path delay, while lifting-based design with splitting and combining parts has 60% longer critical path delay and lifting-based design without splitting and combining parts has 65% longer critical path delay. Although the estimated critical path delay for the proposed design is slightly longer than for the convolution-based design, since the compiler merges the multiplexers with neighboring adders and makes the optimized combinational logic circuits for the proposed design, these two designs have the same resulting critical path delay.

The proposed filter design allows 7% higher throughput than the convolution-based design, 70% higher throughput than the lifting-based design with splitting and combining parts and 44% higher throughput than the recently published filter design from [22]. However, due to its ability to generate two output samples in the same clock cycle, the lifting-based design without splitting and combining parts allows the highest throughput (13% higher than the proposed filter design).
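The relation between the frequency and throughput figures can be checked with a short Python calculation. It assumes that throughput is the number of output samples per clock cycle multiplied by the clock frequency, that the 77% frequency advantage refers to the lifting-based design without splitting and combining parts, and that the 70% advantage refers to the variant with them; this assignment is inferred from the reported numbers rather than stated explicitly.

```python
def relative_throughput(samples_per_cycle, freq_ratio):
    """Throughput relative to the proposed design (1 sample per cycle, ratio 1.0)."""
    return samples_per_cycle * freq_ratio

# Lifting-based design without splitting/combining: 2 samples per cycle,
# proposed design assumed to run at 1.77 times its clock frequency.
lifting_no_split = relative_throughput(2, 1 / 1.77)

# Lifting-based design with splitting/combining: 1 sample per cycle,
# proposed design assumed to run at 1.70 times its clock frequency.
lifting_split = relative_throughput(1, 1 / 1.70)
```

Under these assumptions `lifting_no_split` is about 1.13, i.e., 13% above the proposed design, and the proposed design is 1.70 times faster than the variant with splitting and combining parts, in agreement with the figures above.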

Also, described filter design has the lowest total power dissipation, compared with other designs.

5 The proposed 2-D DWT 5/3 architecture

The overall structure of the proposed 2-D DWT 5/3 architecture with J = 7 decomposition levels, which exploits 1-D DWT filter design presented in Sect. 2, is shown in Fig. 8. Seven levels of decomposition have been chosen since that number of levels ensures the excellent compression quality for high-definition (HD) resolution images (1920 × 1080 pixels).
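To give a sense of scale for seven levels, the following Python sketch lists the size of the LL subband after each level for an HD input. It assumes that each dimension is halved and rounded up at every level, which is a common convention but is not stated in the text; the function name `ll_sizes` is hypothetical.

```python
import math

def ll_sizes(width, height, levels):
    """Return the (width, height) of the LL subband after each decomposition level."""
    sizes = []
    for _ in range(levels):
        # Each level halves both dimensions; odd sizes are rounded up (assumption).
        width, height = math.ceil(width / 2), math.ceil(height / 2)
        sizes.append((width, height))
    return sizes
```

For 1920 × 1080 pixels and J = 7, the LL subband shrinks from 960 × 540 at level 1 to 15 × 9 at level 7 under this rounding assumption.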

Input data samples p[m, n] are received by “HF Input Register Level 1” and then horizontally filtered by “Horizontal Filter Level 1” line by line. All horizontal filters are implemented as 1-D DWT filters described in Sect. 2. As a result, coefficients y_A[m, n] are generated:

$$y_{A} [m,n] = \left\{ {\begin{array}{*{20}l} {y_{H}^{(1)} [m,k],} \hfill & {{\text{for}}\;n = 2k} \hfill \\ {y_{L}^{(1)} [m,k],} \hfill & {{\text{for}}\;n = 2k + 1} \hfill \\ \end{array} } \right.$$

(10)

where y ⁽¹⁾_H [m, k] represent high-pass horizontally filtered coefficients at level 1, and y ⁽¹⁾_L [m, k] represent low-pass horizontally filtered coefficients at level 1. Since 1-D DWT filter from Sect. 2 generates high-pass coefficient as first valid coefficient, in notation in Eq. (10) y_A[m, 0] represents the high-pass coefficient.

Coefficients y_A[m, n] are then vertically filtered by “Vertical Filter A,” producing coefficients z_A[m, n]:

$$z_{A} [m,n] = \left\{ {\begin{array}{*{20}l} {z_{LH}^{(1)} [m,k],} \hfill & {{\text{for}}\;m = 2l\;{\text{and}}\;n = 2k} \hfill \\ {z_{LL}^{(1)} [m,k],} \hfill & {{\text{for}}\;m = 2l\;{\text{and}}\;n = 2k + 1} \hfill \\ {z_{HH}^{(1)} [m,k],} \hfill & {{\text{for}}\;m = 2l + 1\;{\text{and}}\;n = 2k} \hfill \\ {z_{HL}^{(1)} [m,k],} \hfill & {{\text{for}}\;m = 2l + 1\;{\text{and}}\;n = 2k + 1} \hfill \\ \end{array} } \right.$$

(11)

Resulting coefficients at level 1: z ⁽¹⁾_LH [m, n], z ⁽¹⁾_LL [m, n], z ⁽¹⁾_HH [m, n] and z ⁽¹⁾_HL [m, n] belong to level 1 subbands LH, LL, HH and HL, respectively.
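The index mapping of Eq. (11) amounts to separating the interleaved coefficients by row and column parity. A minimal Python sketch, assuming the coefficients z_A[m, n] are held in a list of rows and using the hypothetical function name `split_subbands`:

```python
def split_subbands(z):
    """Split interleaved coefficients z[m][n] into subbands according to Eq. (11)."""
    even_rows, odd_rows = z[0::2], z[1::2]
    return {
        "LH": [row[0::2] for row in even_rows],  # m = 2l,     n = 2k
        "LL": [row[1::2] for row in even_rows],  # m = 2l,     n = 2k + 1
        "HH": [row[0::2] for row in odd_rows],   # m = 2l + 1, n = 2k
        "HL": [row[1::2] for row in odd_rows],   # m = 2l + 1, n = 2k + 1
    }
```

Each subband receives one quarter of the coefficients, and only the "LL" entry is passed on to the next decomposition level.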

Coefficients z ⁽¹⁾_LL [m, n] are received one by one by “HF Input Register Level 2,” then horizontally filtered by “Horizontal Filter Level 2” and routed through a multiplexer, generating coefficients y_B[m, n]:

$$y_{B} [m,n] = \left\{ \begin{array}{l} y_{H}^{(j)} [m,k],\quad {\text{for}}\;n = 2k \hfill \\ y_{L}^{(j)} [m,k],\quad {\text{for}}\;n = 2k + 1 \hfill \\ \end{array} \right.$$

(12)

where y ⁽ʲ⁾_H [m, k] represent high-pass horizontally filtered coefficients at level j (j = 2, 3, …, 7), and y ⁽ʲ⁾_L [m, k] represent low-pass horizontally filtered coefficients at level j (j = 2, 3, …, 7).

Coefficients y_B[m, n] are then vertically filtered by “Vertical Filter B,” producing coefficients z_B[m, n]:

$$z_{B} [m,n] = \left\{ {\begin{array}{*{20}l} {z_{LH}^{(j)} [m,k],} \hfill & {{\text{for}}\;m = 2l\;{\text{and}}\;n = 2k} \hfill \\ {z_{LL}^{(j)} [m,k],} \hfill & {{\text{for}}\;m = 2l\;{\text{and}}\;n = 2k + 1} \hfill \\ {z_{HH}^{(j)} [m,k],} \hfill & {{\text{for}}\;m = 2l + 1\;{\text{and}}\;n = 2k} \hfill \\ {z_{HL}^{(j)} [m,k],} \hfill & {{\text{for}}\;m = 2l + 1\;{\text{and}}\;n = 2k + 1} \hfill \\ \end{array} } \right.$$

(13)

Resulting coefficients at level j: z ⁽ʲ⁾_LH [m, n], z ⁽ʲ⁾_LL [m, n], z ⁽ʲ⁾_HH [m, n] and z ⁽ʲ⁾_HL [m, n] belong to level j (j = 2, 3, …, 7) subbands LH, LL, HH and HL, respectively.

Coefficients z ⁽ʲ⁻¹⁾_LL [m, n] from the previous level are received one by one by “HF Input Register Level j,” then horizontally filtered by “Horizontal Filter Level j” (j = 3, …, 7) and multiplexed with the coefficients generated by the horizontal filters of the other levels in the range 2–7, producing coefficients y_B[m, n].

The time diagram describing the dynamics of 2-D filtering at the beginning of even lines (numbered from 0) is presented in Fig. 9. It shows lines in the first three levels of decomposition for the case in which the line at each presented level is an even line, since coefficients of the LL subbands are generated only in even lines, and only these coefficients are filtered further at the next decomposition level.

The horizontal filtering of a line starts as soon as the first pixel (i.e., input data sample) of the line is received by the 2-D DWT system (at time instance n = 0, as denoted in Fig. 9). The first valid coefficient produced by the horizontal filter at level 1, the high-pass coefficient y ⁽¹⁾_H [m⁽¹⁾, 0], appears at time instance n = 3. It is followed by the first valid low-pass coefficient y ⁽¹⁾_L [m⁽¹⁾, 0] in the next time slot. The third valid coefficient produced by the horizontal filter at level 1 is the high-pass coefficient y ⁽¹⁾_H [m⁽¹⁾, 1], followed by the low-pass coefficient y ⁽¹⁾_L [m⁽¹⁾, 1] in the next time slot. The remaining horizontally filtered coefficients are produced in subsequent cycles.

As soon as a horizontally filtered coefficient at level 1 is produced, it is vertically filtered at the same level in the next time slot. Every even (starting from 0) horizontally and vertically filtered coefficient in the line belongs to subband LH (z ⁽¹⁾_LH [m⁽¹⁾, n⁽¹⁾]), while every odd one belongs to subband LL (z ⁽¹⁾_LL [m⁽¹⁾, n⁽¹⁾]).

The process of horizontal filtering of the line at level 2 starts as soon as the first coefficient from subband LL from level 1 (z ⁽¹⁾_LL [m⁽¹⁾, 0]) is received by “Horizontal Filter Level 2.” The first valid coefficient produced by that filter, the high-pass coefficient y ⁽²⁾_H [m⁽²⁾, 0], appears in time instance n = 10. It is followed by the first valid low-pass coefficient y ⁽²⁾_L [m⁽²⁾, 0] in time slot n = 12. The third valid coefficient produced by horizontal filter at level 2 is the high-pass coefficient (y ⁽²⁾_H [m⁽²⁾, 1]) in time slot n = 14, followed by the low-pass coefficient y ⁽²⁾_L [m⁽²⁾, 1] in time slot n = 16. The other horizontally filtered coefficients at level 2 are produced in every second cycle.

As soon as a horizontally filtered coefficient at level 2 is produced, it is vertically filtered at the same level in the next time slot. Every even (starting from 0) horizontally and vertically filtered coefficient in the line belongs to subband LH (z ⁽²⁾_LH [m⁽²⁾, n⁽²⁾]), while every odd one belongs to subband LL (z ⁽²⁾_LL [m⁽²⁾, n⁽²⁾]).

The horizontal filtering of the line at level 3 starts once the first coefficient from subband LL from level 2 (z ⁽²⁾_LL [m⁽²⁾, 0]) is received by “Horizontal Filter Level 3.” The first valid coefficient produced by that filter, the high-pass coefficient y ⁽³⁾_H [m⁽³⁾, 0], appears in time instance n = 23. It is followed by the first valid low-pass coefficient y ⁽³⁾_L [m⁽³⁾, 0] in time slot n = 27. The third valid coefficient produced by horizontal filter at level 3 is the high-pass coefficient (y ⁽³⁾_H [m⁽³⁾, 1]) in time slot n = 31, followed by the low-pass coefficient y ⁽³⁾_L [m⁽³⁾, 1] in time slot n = 35. The other horizontally filtered coefficients at level 3 are produced in every fourth time slot. Although these horizontally filtered coefficients could be generated one time slot earlier (i.e., in time instances n = 22, n = 26, n = 30 and n = 34), this scenario is avoided in order to utilize the same vertical filter for all coefficients at levels from 2 to 7, which is possible due to the appropriate interleaving of time slots generating vertically filtered coefficients.

As soon as a horizontally filtered coefficient at level 3 is produced, it is vertically filtered at the same level in the next time slot. Every even (starting from 0) horizontally and vertically filtered coefficient in the line belongs to subband LH (z ⁽³⁾_LH [m⁽³⁾, n⁽³⁾]), while every odd one belongs to subband LL (z ⁽³⁾_LL [m⁽³⁾, n⁽³⁾]).

The described pattern of 2-D filtering at the beginning of even lines continues at all other levels (j = 4, 5, 6 and 7), which are not shown in the simplified Fig. 9. Horizontal filtering of a line at a given level starts once the first coefficient of subband LL from the previous level is received by the horizontal filter of the current level. As soon as a horizontally filtered coefficient at the current level is produced, it is vertically filtered at the same level in the next time slot. Coefficients at level 4 are generated in every eighth time slot, at level 5 in every sixteenth time slot, at level 6 in every thirty-second time slot, and so on (i.e., in every 2^(j−1)-th time slot at level j).

The starting time instance for the generation of the first coefficient at each level is chosen in a manner that allows appropriate interleaving of the time slots in which vertically filtered coefficients are generated. This approach allows one vertical filter to be used for level 1 (“Vertical Filter A”) and another for all other decomposition levels (“Vertical Filter B”), since any overlap of the time slots in which “Vertical Filter B” is used is avoided.
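The interleaving argument can be checked numerically for levels 2 and 3 using only the slot numbers quoted above (first horizontal outputs at n = 10 and n = 23, with periods of 2 and 4 slots, and vertical filtering one slot later). The following Python sketch is illustrative only: the helper name `vertical_slots` is hypothetical, and the start slots of levels 4 to 7 are not modeled because they are not stated in the text. It applies to the steady pattern at the beginning of a line, not to the end-of-line behavior.

```python
def vertical_slots(first_horizontal: int, period: int, count: int) -> set:
    """Slots in which the vertical filter is busy: one slot after each horizontal output."""
    return {first_horizontal + 1 + i * period for i in range(count)}


# Level 2: horizontal outputs at n = 10, 12, 14, ... (every second slot)
level2 = vertical_slots(10, 2, 64)
# Level 3: horizontal outputs at n = 23, 27, 31, ... (every fourth slot)
level3 = vertical_slots(23, 4, 32)
# Rejected alternative: level 3 outputs one slot earlier (n = 22, 26, 30, ...)
level3_early = vertical_slots(22, 4, 32)

if __name__ == "__main__":
    print(sorted(level2 & level3))            # []
    print(sorted(level2 & level3_early)[:4])  # [23, 27, 31, 35]
```

With the chosen start slots, level 2 occupies odd slots of “Vertical Filter B” and level 3 occupies even ones, so the two never collide; the earlier schedule for level 3 would collide with level 2 in every one of its slots.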

The time diagram which illustrates the dynamics of 2-D filtering at the end of even lines of HD resolution images, for lines whose beginning is shown in Fig. 9, is presented in Fig. 10. For level 1 of decomposition, the pattern of filtering is the same as at the beginning of the line. For all other levels of decomposition, the pattern of filtering is the same as at the beginning of the line until the time slot when the last coefficient from LL subband at previous level is generated. After that time slot, all remaining coefficients at current level of decomposition are generated at successive time slots.

The time diagram which describes the dynamics of 2-D filtering at the beginning of odd lines (starting from 0) is presented in Fig. 11.

The pattern of filtering is almost the same as in case of filtering at the beginning of even lines. Only two differences can be noticed. First, every even (starting from 0) both horizontally and vertically filtered coefficient in the line belongs to the subband HH (z ⁽¹⁾_HH [m⁽¹⁾, n⁽¹⁾]), while every odd both horizontally and vertically filtered coefficient in the line belongs to the subband HL (z ⁽¹⁾_HL [m⁽¹⁾, n⁽¹⁾]). Second, the first level of decomposition is always the only level of decomposition, since neither the coefficients from HH subband nor the coefficients from HL subband are further filtered at the next decomposition level.

The time diagram which illustrates the dynamics of 2-D filtering at the end of odd lines of HD resolution images, for lines whose beginning is shown in Fig. 11, is presented in Fig. 12. The pattern of filtering is the same as at the beginning of the line. Also, the first level of decomposition is always the only level of decomposition, since neither the coefficients from HH subband nor the coefficients from HL subband are further filtered at the next decomposition level.

The illustration of the beginning of line-wise filtering in the proposed 2-D DWT architecture is shown in Fig. 13.

After horizontal filtering of line 0 of the input image, “Vertical Filter A” generates “temp result 1,” which is a set of zeros, while simultaneously calculating internal intermediate results that are later used for the generation of valid resulting coefficients at level 1.

After horizontal filtering of line 1 of the input image, “Vertical Filter A” generates “temp result 2,” which is also a set of zeros, while simultaneously calculating internal intermediate results that are later used for the generation of valid resulting coefficients at level 1. After horizontal filtering of line 2 of the input image, “Vertical Filter A” generates the first line of valid resulting coefficients at level 1.

That line contains coefficients alternately from subbands LH and LL, i.e., notation “z ⁽¹⁾_LH [0, n⁽¹⁾], z ⁽¹⁾_LL [0, n⁽¹⁾]” from Fig. 13 represents the following sequence of coefficients: z ⁽¹⁾_LH [0, 0], z ⁽¹⁾_LL [0, 0], z ⁽¹⁾_LH [0, 1], z ⁽¹⁾_LL [0, 1], z ⁽¹⁾_LH [0, 2], z ⁽¹⁾_LL [0, 2], etc. After horizontal filtering of line 3 of input image, “Vertical Filter A” generates the second line of valid resulting coefficients at level 1. That line contains coefficients alternately from subbands HH and HL, i.e., notation “z ⁽¹⁾_HH [1, n⁽¹⁾], z ⁽¹⁾_HL [1, n⁽¹⁾]” from Fig. 13 represents the following sequence of coefficients: z ⁽¹⁾_HH [1, 0], z ⁽¹⁾_HL [1, 0], z ⁽¹⁾_HH [1, 1], z ⁽¹⁾_HL [1, 1], z ⁽¹⁾_HH [1, 2], z ⁽¹⁾_HL [1, 2], etc. This pattern continues for all remaining lines at level 1, i.e., alternately lines with resulting coefficients from subbands LH and LL and lines with resulting coefficients from subbands HH and HL are generated.

The pattern of line-wise filtering at all other levels is almost the same as for level 1. Only two differences can be noticed: (1) filtering of particular line at current level is performed after valid LL coefficients from corresponding line from previous level are generated, which means that filtering of successive lines at current level is interleaved with every second line of generated coefficients from previous level; (2) after horizontal filtering, the vertical filtering is performed by “Vertical Filter B.”

The illustration of the end of line-wise filtering for HD resolution images in the proposed 2-D DWT architecture is shown in Fig. 14. For level 1 of decomposition, the pattern of filtering is the same as at the beginning of the line-wise filtering. For all other levels of decomposition, the pattern of filtering is the same as at the beginning of the line-wise filtering until the line where the last coefficients from LL subband at previous level are generated. Starting with that line, all remaining lines at current level of decomposition are generated successively one after another without empty time slots between successive lines.

The detailed structure of “Vertical Filter A” and “Vertical Filter B” from Fig. 8 is shown in Fig. 15.

Equation (14) describes the relations between input and output signals for “Zero Line Block”:

$$\begin{aligned} T0[m,n] &= y[m,n] \\ T1[m,n] &= 0 \\ z[m,n] &= 0 \\ \end{aligned}$$

(14)

This block receives the input line 0 as input signal y[m, n]. At the output z[m, n], this block generates “temp result 1” which represents the set of zeros.

Equation (15) describes the dependences between input and output signals for “First Line Block”:

$$\begin{aligned} T0[m,n] &= y[m,n] - \frac{1}{2}IT0[m,n] \\ &= y[m,n] - \frac{1}{2}T0[m - 1,n] \\ &= y[m,n] - \frac{1}{2}y[m - 1,n] \\ T1[m,n] &= IT0[m,n] = T0[m - 1,n] = y[m - 1,n] \\ z[m,n] &= IT1[m,n] = T1[m - 1,n] = 0 \\ \end{aligned}$$

(15)

This block receives the input line 1 as input signal y[m, n]. At the output z[m, n], this block generates “temp result 2” which also represents the set of zeros.

Equation (16) describes the relations between input and output signals for “Second Line Block,” which receives input signal y[m, n] via input line 2. The output signal z[m, n] of this block is described by the equation corresponding to the special form of the low-pass Le Gall 5/3 filter used for vertical filtering near image boundaries, which replaces the well-known symmetric extension of input pixels at image boundaries. This output signal z[m, n] represents line 0 of the valid resulting coefficients generated by the vertical filter.

$$\begin{aligned} T0[m,n] &= y[m,n] \\ T1[m,n] &= IT0[m,n] - \frac{1}{2}y[m,n] \\ &= T0[m - 1,n] - \frac{1}{2}y[m,n] \\ &= - \frac{1}{2}y[m,n] + y[m - 1,n] - \frac{1}{2}y[m - 2,n] \\ z[m,n] &= IT1[m,n] + \frac{1}{2}IT0[m,n] - \frac{1}{4}y[m,n] \\ &= T1[m - 1,n] + \frac{1}{2}T0[m - 1,n] - \frac{1}{4}y[m,n] \\ &= - \frac{1}{4}y[m,n] + \frac{1}{2}y[m - 1,n] + \frac{3}{4}y[m - 2,n] \\ \end{aligned}$$

(16)

Equation (17) describes the relations between input and output signals for “Odd Line Block,” which receives input signal y[m, n] via any odd input line except input line 1 and the last input line. The output signal z[m, n] of this block is described by the equation corresponding to the high-pass Le Gall 5/3 filter. This output signal z[m, n] represents any odd line (starting from 0) of valid resulting coefficients, except the odd lines among the last three lines.

$$\begin{aligned} T0[m,n] &= y[m,n] - \frac{1}{2}IT0[m,n] \\ &= y[m,n] - \frac{1}{2}T0[m - 1,n] \\ &= y[m,n] - \frac{1}{2}y[m - 1,n] \\ T1[m,n] &= IT0[m,n] + \frac{1}{4}IT1[m,n] \\ &= T0[m - 1,n] + \frac{1}{4}T1[m - 1,n] \\ &= \frac{7}{8}y[m - 1,n] + \frac{1}{4}y[m - 2,n] - \frac{1}{8}y[m - 3,n] \\ z[m,n] &= IT1[m,n] = T1[m - 1,n] \\ &= - \frac{1}{2}y[m - 1,n] + y[m - 2,n] - \frac{1}{2}y[m - 3,n] \\ \end{aligned}$$

(17)

Equation (18) describes the relations between input and output signals for “Even Line Block,” which receives input signal y[m, n] via any even input line except input line 0 and input line 2. The output signal z[m, n] of this block is described by the equation corresponding to the low-pass Le Gall 5/3 filter. This output signal z[m, n] represents any even line (starting from 0) of valid resulting coefficients, except line 0 and the even line among the last three lines.

$$\begin{aligned} T0[m,n] &= y[m,n] \\ T1[m,n] &= IT0[m,n] - \frac{1}{2}y[m,n] \\ &= T0[m - 1,n] - \frac{1}{2}y[m,n] \\ &= - \frac{1}{2}y[m,n] + y[m - 1,n] - \frac{1}{2}y[m - 2,n] \\ z[m,n] &= IT1[m,n] + \frac{1}{4}IT0[m,n] - \frac{1}{8}y[m,n] \\ &= T1[m - 1,n] + \frac{1}{4}T0[m - 1,n] - \frac{1}{8}y[m,n] \\ &= - \frac{1}{8}y[m,n] + \frac{1}{4}y[m - 1,n] + \frac{3}{4}y[m - 2,n] \\ & \quad + \frac{1}{4}y[m - 3,n] - \frac{1}{8}y[m - 4,n] \\ \end{aligned}$$

(18)

The remaining three blocks are described for the case in which the total number of lines in the image is even.

Equation (19) describes the relations between input and output signals for “Last Line Block,” which receives input signal y[m, n] via the last input line. The output signal z[m, n] of this block is described by the equation corresponding to the high-pass Le Gall 5/3 filter.

$$\begin{aligned} T0[m,n] &= y[m,n] - IT0[m,n] \\ & = y[m,n] - T0[m - 1,n] \\ & = y[m,n] - y[m - 1,n] \\ T1[m,n] &= IT0[m,n] + \frac{1}{4}IT1[m,n] \\ &= T0[m - 1,n] + \frac{1}{4}T1[m - 1,n] \\ &= \frac{7}{8}y[m - 1,n] + \frac{1}{4}y[m - 2,n] - \frac{1}{8}y[m - 3,n] \\ z[m,n] &= IT1[m,n] = T1[m - 1,n] \\ &= - \frac{1}{2}y[m - 1,n] + y[m - 2,n] - \frac{1}{2}y[m - 3,n] \\ \end{aligned}$$

(19)

Equation (20) describes the relations between input and output signals for “Last Plus 1 Line Block,” which is responsible for vertical filtering of the remaining intermediate results IT0[m, n] and IT1[m, n]. The output signal z[m, n] of this block is described by the equation corresponding to the special form of the low-pass Le Gall 5/3 filter used for vertical filtering near image boundaries, which replaces symmetric extension of input pixels at image boundaries.

$$\begin{aligned} T0[m,n] &= y[m,n] \\ T1[m,n] &= IT0[m,n] = T0[m - 1,n] \\ &= y[m - 1,n] - y[m - 2,n] \\ z[m,n] &= IT1[m,n] + \frac{1}{4}IT0[m,n] \\ &= T1[m - 1,n] + \frac{1}{4}T0[m - 1,n] \\ &= \frac{1}{4}y[m - 1,n] + \frac{5}{8}y[m - 2,n] \\ & \quad + \frac{1}{4}y[m - 3,n] - \frac{1}{8}y[m - 4,n] \\ \end{aligned}$$

(20)

Finally, Eq. (21) describes the relations between input and output signals for “Last Plus 2 Line Block,” which is responsible for vertical filtering of the remaining intermediate results IT1[m, n]. The output signal z[m, n] of this block is described by the equation corresponding to the special form of the high-pass Le Gall 5/3 filter used for vertical filtering near image boundaries, which replaces symmetric extension of input pixels at image boundaries. This output signal z[m, n] represents the last line of valid resulting coefficients generated by the vertical filter.

$$\begin{aligned} T0[m,n] &= y[m,n] \\ T1[m,n] &= IT0[m,n] = T0[m - 1,n] = y[m - 1,n] \\ z[m,n] &= IT1[m,n] = T1[m - 1,n] \\ &= y[m - 2,n] - y[m - 3,n] \\ \end{aligned}$$

(21)

All these equations are derived taking into account that the intermediate results T0[m, n] and T1[m, n] are stored in on-chip memory, which produces the dependences:

$$\begin{aligned} IT0[m,n] &= T0[m - 1,n] \\ IT1[m,n] &= T1[m - 1,n] \\ \end{aligned}$$

(22)
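The block schedule of Eqs. (14)–(22) can be exercised on a single image column with the Python sketch below. It is a behavioral model under stated assumptions, not the hardware description: the function names are hypothetical, floating-point arithmetic replaces the fixed-point data path, the number of lines is assumed even and at least 4, and zeros are fed as y[m, n] during the two flush passes. The reference function uses the lifting form of the 5/3 filter with mirrored boundary samples purely as an independent check of the closed-form expressions.

```python
def vertical_filter_column(y):
    """Run one image column through the blocks of Eqs. (14)-(21)."""
    M = len(y)
    if M < 4 or M % 2:
        raise ValueError("sketch assumes an even number of lines, at least 4")
    it0 = it1 = 0.0                     # IT0, IT1: values read back from the line buffer
    out = []
    for m in range(M + 2):              # two extra passes flush the stored intermediates
        x = y[m] if m < M else 0.0      # assumption: no real input during the flush
        if m == 0:                      # Zero Line Block, Eq. (14)
            t0, t1, z = x, 0.0, 0.0
        elif m == 1:                    # First Line Block, Eq. (15)
            t0, t1, z = x - it0 / 2, it0, it1
        elif m == 2:                    # Second Line Block, Eq. (16)
            t0, t1, z = x, it0 - x / 2, it1 + it0 / 2 - x / 4
        elif m == M - 1:                # Last Line Block, Eq. (19)
            t0, t1, z = x - it0, it0 + it1 / 4, it1
        elif m == M:                    # Last Plus 1 Line Block, Eq. (20)
            t0, t1, z = x, it0, it1 + it0 / 4
        elif m == M + 1:                # Last Plus 2 Line Block, Eq. (21)
            t0, t1, z = x, it0, it1
        elif m % 2:                     # Odd Line Block, Eq. (17)
            t0, t1, z = x - it0 / 2, it0 + it1 / 4, it1
        else:                           # Even Line Block, Eq. (18)
            t0, t1, z = x, it0 - x / 2, it1 + it0 / 4 - x / 8
        it0, it1 = t0, t1               # Eq. (22): stored for the next line
        out.append(z)
    return out[2:]                      # drop "temp result 1" and "temp result 2"


def lifting_53_reference(y):
    """Check only: 5/3 lifting steps with mirrored samples at both boundaries."""
    M = len(y)
    d = []
    for i in range(M // 2):
        nxt = 2 * i + 2 if 2 * i + 2 < M else M - 2   # mirror past the last line
        d.append(y[2 * i + 1] - (y[2 * i] + y[nxt]) / 2)
    s = [y[2 * i] + (d[i - 1 if i else 0] + d[i]) / 4 for i in range(M // 2)]
    out = []
    for low, high in zip(s, d):
        out += [low, high]              # low-pass and high-pass lines alternate
    return out


if __name__ == "__main__":
    column = [3, 1, 4, 1, 5, 9, 2, 6]
    print(vertical_filter_column(column))
    print(lifting_53_reference(column))
```

In this model only the two values `it0` and `it1` are carried from one line to the next for each column, which corresponds to the two lines of intermediate results kept in on-chip memory per level.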

On-chip memory is shown in Fig. 16. For successful 2-D DWT filtering and decomposition of an N × N image, two lines of intermediate results have to be stored in on-chip memory at each level. “On-chip memory A” stores the intermediate results from level 1 of decomposition and contains one buffer with a capacity of 2N coefficients. “On-chip memory B” stores the intermediate results from the other levels of decomposition and contains six buffers (in the case of J = 7 levels of decomposition) whose capacity is halved at every succeeding level, starting from a capacity of N coefficients at level 2. All these buffers are FIFO memories.

6 Complexity and performance comparisons of various 2-D DWT 5/3 architectures

For J levels of decomposition of N × N image, the proposed 2-D DWT 5/3 architecture utilizes J FIFO buffers for storing the intermediate results T0[m, n] and T1[m, n], where the capacity of FIFO buffer for level 1 is 2N coefficients. The capacity of FIFO buffer for every succeeding level is half of the capacity of FIFO buffer for the preceding level. Also, each level of decomposition requires one input register for horizontal filter and horizontal filter itself, which contains 2 registers (delay elements). Therefore, the total on-chip memory used by the proposed 2-D DWT architecture can be calculated as follows:

$$\begin{aligned} & 2N + N + \frac{N}{2} + \frac{N}{4} + \cdots + \frac{N}{{2^{J - 2} }} + 3J + 2 \\ & \quad = 4N\left( {1 - 2^{ - J} } \right) + 3J + 2 \\ \end{aligned}$$

(23)

The proposed 2-D DWT architecture does not require any off-chip memory. Since J << N in all practical image processing applications, the total memory used is approximately 4N(1 − 2^−J).
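As a worked check of Eq. (23), the Python sketch below sums the FIFO capacities term by term and compares the result with the closed form. The function names are hypothetical, and the 3J + 2 register term is taken from Eq. (23) as given.

```python
def on_chip_memory(N: int, J: int) -> float:
    """Total on-chip storage in coefficients, summed term by term as in Eq. (23)."""
    fifo = sum(2 * N / 2 ** (j - 1) for j in range(1, J + 1))  # 2N, N, N/2, ...
    registers = 3 * J + 2                                      # register term of Eq. (23)
    return fifo + registers


def on_chip_memory_closed_form(N: int, J: int) -> float:
    """Right-hand side of Eq. (23)."""
    return 4 * N * (1 - 2 ** (-J)) + 3 * J + 2


if __name__ == "__main__":
    print(on_chip_memory(512, 5))              # 2001.0
    print(on_chip_memory_closed_form(512, 5))  # 2001.0
```

For a 512 × 512 image and J = 5, the FIFO buffers account for 1984 of the 2001 coefficients, which illustrates why the register term can be neglected when J << N.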

Based on time diagrams shown in Figs. 9 and 10 it can be concluded that computing time per line is N + 4J clock cycles. Based on line-wise diagrams shown in Figs. 13 and 14, the total number of time slots for line processing can be calculated as N + J + 1. Therefore, the total computing time for the proposed 2-D DWT architecture is:

$$\left( {N + 4J} \right) \cdot \left( {N + J + 1} \right) \approx N^{2}$$

(24)

Based on Figs. 9 and 13, the output latency for the proposed architecture can be calculated as:

$$2N + 4 \approx 2N$$

(25)
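Equations (24) and (25) can be evaluated directly to see how close the approximations are for a concrete case. The Python sketch below is illustrative, and the function names are hypothetical.

```python
def computing_time(N: int, J: int) -> int:
    """Total computing time in clock cycles, Eq. (24)."""
    cycles_per_line = N + 4 * J   # from the time diagrams (Figs. 9 and 10)
    line_slots = N + J + 1        # from the line-wise diagrams (Figs. 13 and 14)
    return cycles_per_line * line_slots


def output_latency(N: int) -> int:
    """Output latency in clock cycles, Eq. (25)."""
    return 2 * N + 4


if __name__ == "__main__":
    N, J = 512, 5
    print(computing_time(N, J))            # 275576
    print(computing_time(N, J) / (N * N))  # overhead factor relative to N^2
    print(output_latency(N))               # 1028
```

For N = 512 and J = 5 the exact cycle count exceeds N² by roughly 5%, so the approximation in Eq. (24) is reasonable for typical image sizes.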

Hardware utilization efficiency HUE(PA) of a parallel architecture (PA) can be defined as:

$${\text{HUE}}({\text{PA}}) = \frac{{\sum\nolimits_{i = 1}^{M} {A_{i} \times {\text{HUE}}(A_{i} )} }}{{\sum\nolimits_{i = 1}^{M} {A_{i} } }}$$

(26)

where A_i denotes the ith computing unit, HUE(A_i) denotes the hardware utilization efficiency of computing unit A_i and M denotes the total number of computing units. The proposed 2-D DWT 5/3 architecture represents the parallel architecture which consists of two computing units: A₁ (which comprises the part of the proposed architecture responsible for filtering and decomposition of level 1) and A₂ (which comprises the part of the proposed architecture responsible for filtering and decomposition of all other levels j, where j > 1).

Hardware utilization efficiency HUE(A_i) of the ith computing unit A_i can be defined as the ratio of the actual computation time to the total processing (computing) time, with time expressed in numbers of clock cycles. The actual computation time in case of the proposed architecture is equal to the number of data samples to be processed in computing unit A_i.

Therefore, the hardware utilization efficiency of computing unit A₁ can be expressed as:

$${\text{HUE}}(A_{1} ) = \frac{N \cdot N}{(N + 4) \cdot (N + 2)} \approx 1$$

(27)

The hardware utilization efficiency of computing unit A₂ can be expressed as:

$$\begin{aligned} {\text{HUE}}(A_{2} ) &= \frac{{\frac{1}{3}N^{2} \left( {1 - 4^{ - J + 1} } \right)}}{(N + 4J - 5) \cdot (N + J - 1)} \\ &\approx \frac{1}{3}\left( {1 - 4^{ - J + 1} } \right) \\ \end{aligned}$$

(28)

Finally, the hardware utilization efficiency for the proposed 2-D DWT 5/3 architecture can be calculated as follows:

$$\begin{aligned} {\text{HUE}}({\text{PA}}) &= \frac{{1 \cdot 1 + 1 \cdot \frac{1}{3}\left( {1 - 4^{ - J + 1} } \right)}}{1 + 1} \\ &= \frac{2}{3}\left( {1 - 4^{ - J} } \right) \\ \end{aligned}$$

(29)

Therefore, when only one level of decomposition is performed, the hardware utilization efficiency of the proposed 2-D DWT 5/3 architecture is approximately equal to 1. When J (J > 1) levels of decomposition are performed, however, the hardware utilization efficiency drops to approximately 0.66 (it approaches 2/3 as J increases), as can be calculated from Eq. (29).
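The hardware utilization figures of Eqs. (27)–(29) can be tabulated with the Python sketch below. The function names are hypothetical; the two computing units are weighted equally, as in Eq. (29).

```python
def hue_a1(N: int) -> float:
    """Eq. (27): utilization of the level-1 computing unit."""
    return N * N / ((N + 4) * (N + 2))


def hue_a2(N: int, J: int) -> float:
    """Eq. (28): utilization of the unit shared by levels 2..J."""
    samples = N * N * (1 - 4 ** (1 - J)) / 3
    return samples / ((N + 4 * J - 5) * (N + J - 1))


def hue_pa_limit(J: int) -> float:
    """Eq. (29): large-N approximation with equally weighted units."""
    return (2 / 3) * (1 - 4 ** (-J))


if __name__ == "__main__":
    for J in (2, 3, 5, 7):
        exact = (hue_a1(512) + hue_a2(512, J)) / 2   # Eq. (26) with A1 = A2
        print(J, round(hue_pa_limit(J), 4), round(exact, 4))
```

Under Eq. (29) the efficiency is 0.625 for J = 2 and rises toward 2/3 as J increases, because each additional level adds only a quarter of the samples of the previous one to the shared unit.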

Both computing time and output latency are represented in number of clock cycles, while the capacity of total required memory is represented in number of coefficients.

The performance of the proposed 2-D DWT architecture and architectures reported in [9, 16, 18–20, 24–34] are compared in Table 4 in terms of required on-chip memory capacity, required off-chip memory capacity, computing time, output latency and hardware utilization efficiency, for J levels of decomposition of N × N image.

Table 4 Comparison of 2-D DWT 5/3 architectures


It can be noticed that, compared to other architectures, the proposed architecture has medium computing time, medium output latency and comparable hardware utilization efficiency. Due to very high level of regularity which can be seen in time diagrams in Figs. 9, 10, 11, 12, 13 and 14, the proposed architecture has medium control complexity compared to other architectures. However, the proposed architecture has the lowest total used memory in comparison with all other published architectures. For J → ∞ levels of decomposition of N × N image, the proposed 2-D DWT 5/3 architecture requires the memory capacity of only 4N, which is 20% lower capacity than required capacity for the best previously published architecture.

In order to compare the synthesis results of the proposed 2-D DWT 5/3 architecture with the best available synthesis results of other 2-D DWT 5/3 architectures reported in the literature, the proposed 2-D DWT 5/3 architecture is implemented on Xilinx Virtex-4 XC4VFX100 and Virtex-5 XC5VLX110T FPGA target devices.

Synthesis results for 16-bit word length are presented in Table 5. Slice-delay product (SDP) is calculated based on Eq. (30):

$${\text{SDP}} = N_{\text{SL}} \cdot \frac{\text{CT}}{{f_{\text{MAX}} }}$$

(30)

where N_SL denotes the number of used CLB slices, CT denotes the computing time represented in number of clock cycles and f_MAX denotes the maximum operating frequency of 2-D DWT 5/3 architecture. It can be noticed that the proposed 2-D DWT 5/3 architecture requires the lowest number of CLB slices in comparison with architecture [30] and PMA [32]. The proposed architecture also requires comparable number of CLB slices with RMA [32], even though the proposed architecture is implemented for 512 × 512 image size and 5 levels of decomposition, while RMA from [32] is implemented for 256 × 256 image size and 3 levels of decomposition. Due to very high maximum operating frequency, PMA [32] has the lowest slice-delay product.

Table 5 FPGA synthesis results of 2-D DWT 5/3 architectures


Memory usage (for 10-bit word length in order to make proper comparison with results available in [30]) is presented in Table 6. It can be seen that the proposed 2-D DWT 5/3 architecture requires the lowest memory size in comparison with architectures [16, 30, 37–39], even though the proposed architecture is implemented for 512 × 512 image size and 5 levels of decomposition, while architectures from [16, 37–39] are implemented for only 1 level of decomposition and some of them for smaller image size.

Table 6 Comparison of memory usage


Finally, the FPGA post-synthesis power analysis at 100 MHz for image size 512 × 512, 16-bit word length and Virtex-5 XC5VLX110T FPGA target device (presented in Table 7) clearly shows that the proposed architecture has comparable dissipation with architecture [30] and PMA [32], even though the power dissipation for the proposed architecture is estimated for design with 5 levels of decomposition, while power dissipations for other architectures are estimated for designs with only 1 level of decomposition.

Table 7 FPGA post-synthesis power analysis at 100 MHz (for image size 512 × 512 and Virtex-5 XC5VLX110T FPGA chip)


7 Conclusion

The one-dimensional filter presented in this paper allows each time slot to be used for the generation of output coefficients. Time slots with even-indexed input samples are used for the generation of low-pass output coefficients, and time slots with odd-indexed input samples are used for the generation of high-pass output coefficients. This approach saves memory resources, since the same memory blocks are used for generating both low-pass and high-pass coefficients. The same filter components are reused, which reduces the logic resources used and allows a higher operating frequency, a lower critical path delay and a lower total power dissipation in comparison with other published designs. The two-dimensional DWT 5/3 architecture proposed in this paper, which exploits the implemented 1-D filter design, is a memory-efficient solution that requires lower storage capacity than any other previously published architecture. The proposed architecture does not require any off-chip memory.

References

Daubechies, I.: The wavelet transform time-frequency localization and signal analysis. IEEE Trans. Inf. Theory 36(5), 961–1005 (1990)
Article MathSciNet Google Scholar
Mallat, S.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
Article Google Scholar
Acharya, T., Tsai, P.S.: JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures. Wiley, Hoboken (2004)
Book Google Scholar
Caglar, H., Liu, Y., Akansu, N.: Optimal PR-QMF design for subband image coding. J. Vis. Commun. Image Represent. 4(3), 242–253 (1993)
Article Google Scholar
Egger, O., Li, W.: Subband coding of images using asymmetrical filter banks. IEEE Trans. Image Process. 4(4), 478–485 (1995)
Article Google Scholar
Li, W., Egger, O.: Improved subband coding of images using unequal length PR filters. In: 14th Gretsi Symposium Signal and Image Processing, pp. 451–454 (1993)
Le Gall, D., Tabatabai, A.: Subband coding of digital images using symmetric short kernel filters and arithmetic coding techniques. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 761–764 (1988)
Parhi, K.K., Nishitani, T.: VLSI architectures for discrete wavelet transforms. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 1(2), 191–202 (1993)
Article Google Scholar
Wu, P.C., Chen, L.G.: An efficient architecture for two-dimensional discrete wavelet transform. IEEE Trans. Circuit Syst. Video Technol. 11(4), 536–545 (2001)
Article Google Scholar
Cheng, C., Parhi, K.K.: High-speed VLSI implementation of 2-D discrete wavelet transform. IEEE Trans. Signal Process. 56(1), 393–403 (2008)
Article MathSciNet Google Scholar
Usha, B.N., Chilambuchelvan, A.: Efficient VLSI architecture for discrete wavelet transform. Int. J. Comput. Sci. Issues (Online) 1(1), 32–36 (2011)
Google Scholar
Ghantous, M., Bayoumi, M.: P²E-DWT: a parallel and pipelined efficient VLSI architecture of 2-D discrete wavelet transform. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 941–944 (2011)
Liu, C.C., Shiau, Y.H., Jou, J.M.: Design and implementation of a progressive image coding chip based on the lifted wavelet transform. In: 11th VLSI Design/CAD Symposium (2000)
Jou, J.M., Shiau, Y.H., Liu, C.C.: Efficient VLSI architectures for the biorthogonal wavelet transform by filter bank and lifting scheme. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 529–532 (2001)
Lian, C.J, Chen, K.F., Chen, H.H., Chen, L.G.: Lifting based discrete wavelet transform architecture for JPEG2000. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 445–448 (2001)
Andra, K., Chakrabarti, C., Acharya, T.: A VLSI architecture for lifting-based forward and inverse wavelet transform. IEEE Trans. Signal Process. 50(4), 966–977 (2002)
Article Google Scholar
Huang, C.T., Tseng, P.C., Chen, L.G.: Flipping structure: an efficient VLSI architecture for lifting-based discrete wavelet transform. IEEE Trans. Signal Process. 52(4), 1080–1089 (2004)
Article MathSciNet Google Scholar
Chang, W.H., Lee, Y.S., Peng, W.S., Lee, C.Y.: A line-based, memory efficient and programmable architecture for 2D DWT using lifting scheme. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 330–333 (2001)
Liao, H., Mandal, M.K., Cockburn, B.F.: Efficient implementation of lifting-based discrete wavelet transform. Electron. Lett. 38(18), 1010–1012 (2002)
Article Google Scholar
Liao, H., Mandal, M.K., Cockburn, B.F.: Efficient architectures for 1-D and 2-D lifting-based wavelet transform. IEEE Trans. Signal Process. 52(5), 1315–1326 (2004)
Article MathSciNet Google Scholar
Martina, M., Masera, G., Piccinini, G., Zamboni, M.: Novel JPEG 2000 compliant DWT and IWT VLSI implementations. J. VLSI Signal Process. Syst. Signal Image Video Technol. 35(2), 137–153 (2003)
Article Google Scholar
Meher, P.K., Mohanty, B.K., Swamy, M.N.S.: Low-area and low-power reconfigurable architecture for convolution-based 1-D DWT using 9/7 and 5/3 filters. In: International Conference on VLSI Design, pp. 327–332 (2015)
Acharya, T., Chakrabarti, C.: A survey on lifting-based discrete wavelet transform architectures. J. VLSI Signal Process. 42(3), 321–339 (2006)
Article Google Scholar
Vishwanath, M., Owens, R.M., Irwin, M.J.: VLSI architectures for the discrete wavelet transform. IEEE Trans. Circuits Syst. II 42(5), 305–316 (1995)
Article Google Scholar
Chrysafis, C., Ortega, A.: Line-based, reduced memory, wavelet image compression. IEEE Trans. Image Process. 9(3), 378–389 (2000)
Article MathSciNet Google Scholar
Barua, S., Carletta, J.E., Kotteri, K.A., Bell, A.E.: An efficient architecture for lifting-based two-dimensional discrete wavelet transform. Integr. VLSI J. 38(3), 341–352 (2005)
Article Google Scholar
Xiong, C.-Y., Tian, J., Liu, J.: Efficient high-speed/low-power line-based architecture for two-dimensional discrete wavelet transform using lifting scheme. IEEE Trans. Circuits Syst. Video Technol. 16(2), 309–316 (2006)
Article Google Scholar
Xiong, C.-Y., Tian, J.-W., Liu, J.: Efficient architecture for 2-D discrete wavelet transform using lifting scheme. IEEE Trans. Image Process. 16(3), 607–614 (2007)
Article MathSciNet Google Scholar
Mohanty, B.K., Meher, P.K.: Memory efficient modular VLSI architecture for highthroughput and low-latency implementation of multilevel lifting 2-D DWT. IEEE Trans. Signal Process. 59(5), 2072–2084 (2011)
Article MathSciNet Google Scholar
Aziz, S.M., Pham, D.M.: Efficient parallel architecture for multi-level forward discrete wavelet transform processors. Comput. Electr. Eng. 38(5), 1325–1335 (2012)
Article Google Scholar
Hsia, C.-H., Chiang, J.-S., Guo, J.-M.: Memory-efficient hardware architecture of 2-D dual-mode lifting-based discrete wavelet transform. IEEE Trans. Circuits Syst. Video Technol. 23(4), 671–683 (2013)
Darji, A.D., Kushwah, S.S., Merchant, S.N., Chandorkar, A.N.: High-performance hardware architectures for multi-level lifting-based discrete wavelet transform. Eurasip J. Image Video Process. 47, 1–19 (2014)
Hsia, C.-H., Chiang, J.-S., Chang, S.-H.: An efficient VLSI architecture for 2-D dual-mode SMDWT. In: IEEE International Conference on Networking, Sensing and Control (ICNSC), pp. 775–779 (2013)
Hsia, C.-H.: A new VLSI architecture for symmetric mask-based discrete wavelet transform. J. Internet Technol. 15(7), 1083–1090 (2014)
Rajović, V., Savić, G., Prokin, M.: Hardware realization of fast image encoder with minimum memory size. In: 22nd Telecommunications Forum (TELFOR), pp. 717–724 (2014)
Savić, G., Prokin, M., Rajović, V., Prokin, D.: Hardware realization of direct subband transformer with minimum used resources. In: 4th Mediterranean Conference on Embedded Computing (MECO), pp. 220–223 (2015)
Dillen, G., Georis, B., Legat, J.D., Cantineau, O.: Combined line-based architecture for the 5–3 and 9–7 wavelet transform of JPEG2000. IEEE Trans. Circuits Syst. Video Technol. 13(9), 944–950 (2003)
Lan, X., Zheng, N., Liu, Y.: Low-power and high-speed VLSI architecture for lifting-based forward and inverse wavelet transform. IEEE Trans. Consum. Electr. 51(2), 379–385 (2005)
Liu, L., Chen, N., Meng, H., Zhang, L., Wang, Z., Chen, H.: A VLSI architecture of JPEG2000 encoder. IEEE J. Solid State Circuits 39(11), 2032–2040 (2004)


Acknowledgements

This work was partially supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia under Grant No. TR32039.

Author information

Authors and Affiliations

School of Electrical Engineering, University of Belgrade, Bul. kralja Aleksandra 73, Belgrade, 11120, Serbia
Goran Savić, Milan Prokin & Vladimir Rajović
School of Electrical and Computer Engineering of Applied Studies, Vojvode Stepe 283, Belgrade, 11000, Serbia
Dragana Prokin

Authors

Goran Savić
Milan Prokin
Vladimir Rajović
Dragana Prokin

Corresponding author

Correspondence to Goran Savić.

About this article

Cite this article

Savić, G., Prokin, M., Rajović, V. et al. Novel one-dimensional and two-dimensional forward discrete wavelet transform 5/3 filter architectures for efficient hardware implementation. J Real-Time Image Proc 16, 1459–1478 (2019). https://doi.org/10.1007/s11554-016-0656-1

Received: 06 April 2016
Accepted: 19 November 2016
Published: 29 November 2016
Issue Date: October 2019
DOI: https://doi.org/10.1007/s11554-016-0656-1
