1 Introduction

The discrete wavelet transform (DWT) has established itself as an efficient tool for many applications, such as speech analysis, signal analysis, numerical analysis, and video processing and compression, owing to its time-frequency localization characteristics [1]. It has been adopted in the JPEG-2000 standard because of its ability to decorrelate large images. The DWT is computationally intensive, which makes it challenging to implement, as it requires a sizable amount of arithmetic resources and memory. Therefore, to reduce the operational complexity, Daubechies and Sweldens proposed the lifting scheme [2] for constructing the DWT. Lifting-based architectures have many advantages over the classical convolution-based structures [3] in computational complexity, power consumption, and memory.

At present, many VLSI architectures for the 2-D DWT based on both the convolution and the lifting scheme are available for real-time processing. However, designing a highly efficient structure at low hardware cost remains an exacting task. In this work, two DWT architectures for the biorthogonal 9/7 and 5/3 wavelets are proposed based on the modified lifting scheme. The advantages of the proposed structures are 100% hardware utilization, low critical path delay, hardware efficiency, and low control complexity. The structures are flipping based, with a modest memory requirement.

The rest of this paper is organized as follows: Sect. 2 briefly reviews the different DWT structures based on the lifting scheme. Section 3 describes the classical lifting scheme and the modified lifting scheme algorithm. The proposed DWT architectures are presented in Sect. 4. Implementation and performance analysis are covered in Sect. 5. Finally, conclusions are drawn in Sect. 6.

2 Related Work

The traditional DWT architectures are based on the convolution scheme [4,5,6,7]. These structures are not suitable for VLSI implementation because of their larger chip-area requirement. Hence, in recent years, a number of novel architectures based on the lifting algorithm have been proposed to reduce the operations involved in computing the DWT. Jou et al. [8] proposed a structure that is a direct implementation of the lifting steps. It has the advantage of low implementation complexity. In order to increase the hardware utilization, Lian et al. [9] proposed a folded architecture based on the lifting algorithm, but both of these architectures are limited by their critical path delay. Chen [10] implemented a 1-D wavelet transform for the 5/3 and 9/7 filters. Since the structure is a multi-level lifting-based transform, it cannot be extended directly to the 2-D transform. The structure is folded to achieve higher hardware utilization. It avoids the use of external memory to store the intermediate results, thus reducing the delay required to access memory. Barua et al. [11] developed an architecture for the 2-D DWT of the biorthogonal 9/7 wavelet. The update and the predict steps are modified to change the computation for the first and the last pixel, respectively, in each row and column. The structure can achieve 100% hardware utilization. It has a regular data flow, low complexity, and high throughput, and it utilizes 30% less memory than the conventional filter bank scheme. Shi et al. [12] proposed an efficient architecture employing the folding technique. The hardware complexity of the structure is moderate. Although it requires minimal hardware resources, the critical path delay is the sum of \(T_m\) and \(T_a\), where \(T_m\) and \(T_a\) are the multiplier and adder delays respectively. The throughput rate of the structure is one, with 100% hardware utilization. To speed up the computation, Lai et al. [13] introduced a two-input/two-output pipelined architecture. The predict and update stages are merged to reduce the critical path delay. The 1-D DWT requires 22 registers, with eight pipeline stages.

The flipping-based structure is another prominent architecture for the DWT. It has the advantage of reducing the critical path delay. Huang et al. [14] proposed a flipping structure for the DWT. It has a minimum critical path of one multiplier delay (\(1T_m\)) with five pipeline stages. The Daubechies 9/7 filter, the integer 9/7 filter, and the 6/10 filter are used in the design to demonstrate the performance of the flipping structure. Wu and Lin [15] proposed a memory-efficient pipelined architecture by modifying the lifting algorithm. The predictor and updater are merged to minimize the critical path delay to \(1T_m\). The throughput of the structure is one-input/one-output, which limits the processing speed. An inverse DWT is also proposed, adopting the same architecture as the forward transform. Xiong et al. [16] presented a parallel-based lifting scheme (PLS) to build a VLSI architecture for the DWT. This scheme reduces the critical path delay and the number of registers used to implement a 1-D DWT. Using this scheme, the implementations of the forward and inverse DWT can be identical. Xiong et al. [17] proposed novel 1-D and 2-D DWT architectures. An embedded decimation technique is utilized to optimize the DWT structure. A line-based 2-D DWT structure called the fast architecture (FA) is developed using parallel and pipelined techniques, and another line-based DWT structure called the high-speed architecture (HA) is proposed based on the parallelism technique. Cao et al. [18] presented the decomposed lifting scheme (DLS) algorithm to perform a 1-D DWT. Three structures are proposed with throughputs of one-input/one-output, two-input/two-output, and four-input/four-output. Two memory-efficient structures, FA and HA, are constructed for the 2-D transform. The computing time is reduced with these structures at the cost of additional multipliers and adders. Lin et al. [19] proposed a pipeline architecture for the 2-D DWT. The 1-D DWT structure requires fewer pipeline registers to achieve a critical path of \(1T_m\). A rescheduling algorithm is proposed to merge the lifting steps to attain the same critical path as the other architectures, but with fewer pipeline registers. Tian et al. [20] proposed a multi-input/multi-output structure for the 2-D DWT. This architecture is intended for high-speed applications with a computing time of \(N^2/M\) for an \(N \times N\) image, where M is the throughput. It has a simple control procedure, and the hardware cost is minimal. Zhang et al. [21] developed a high-speed DWT architecture for the 2-D transform. The critical path delay is \(1T_m\) with three pipeline stages. It is a two-input/two-output structure. The lifting scheme is modified, and the intermediate data is recombined to reduce the number of pipeline stages. A novel transpose module is developed to meet the input order of the data flow for the row processor. Hsia et al. [22] built an efficient architecture for the 2-D DWT to address the issues of memory requirement and critical path delay. To reduce the transpose memory, an interlaced read scan method is introduced. The structure has the advantages of regular signal flow, low transpose memory, and low latency. A multiplication and accumulation cell (MAC) unit is used in the 1-D DWT architecture. Nagesh [23] developed a lifting-based DWT structure for the 1-D transform. The structure is developed by applying folding and pipelining methods. Serial-in parallel-out (SIPO) shift registers are used at the input of the architecture and parallel-in serial-out (PISO) shift registers at the output. Darji et al.
[24] presented three architectures for the multi-level 2-D DWT. These structures use different types of input scanning methods. They reduce the total computing cycles, and the hardware utilization efficiency ranges from 60 to 100%. Darji et al. [25] proposed a flipping architecture for the 2-D DWT. The structure optimizes the lifting algorithm by computing the intermediate data in advance. The structure has a minimum critical path delay and 100% hardware utilization. Ang et al. [27] developed a scalable architecture that supports multiple DWT blocks. Each block operates independently. The throughput rate of the design is m, where m is the number of parallel DWT blocks. The advantages of the architecture are simple control complexity, low memory, fast computation, and minimal power consumption. Basiri et al. [28] proposed lifting-based 1-D and 2-D DWT architectures for the 5/3 and 9/7 filters to optimize the area, power, and delay requirements. The area is reduced by using only one processing element (PE) to perform the entire DWT operation. Multiple PEs operate in parallel to reduce the delay. The PE consists of floating-point adders, floating-point multipliers, and a fused multiply-add (FMA) unit. Wang and Choy [29] presented a novel data scan method for the 2-D DWT. The stripe-based method accesses adjacent even and odd rows simultaneously to increase the throughput rate. A systolic array structure is proposed for the row transform, and the conventional two-input/two-output lifting-based architecture is adopted for the column transform. Pipelining is employed to reduce the critical path delay to \(1T_m\). The structure gives good area, power, and delay results for stripe sizes of 2, 4 and 8.

A multi-level 2-D DWT structure using the Haar wavelet is presented by Al-Azawi [30]. The architecture operates at 209 MHz for a three-level 2-D DWT. The hardware resource and memory requirements are constant. Transposition memory is required at every level of decomposition. The throughput of the structure is one-input/one-output. A five-level 2-D DWT structure for the 5/3 filter is presented by Aziz and Pham [26], in which multiple processing blocks operate in parallel for a five-level transform. The designs in [31,32,33] present memory-efficient structures for the multi-level 2-D DWT. These architectures are based on the lifting scheme and employ different scanning methods. Inverse DWT architectures are presented in [34, 35] for the Daubechies 5/3 and 9/7 wavelets. They are high-performance, memory-efficient structures that support the JPEG 2000 decoder.

The prominence of the lifting-based DWT has led to the development of several designs in recent times, ranging from parallel and flipping types to folded types. In this work, a pipelined structure with a low critical path delay of \(1T_m\) and minimal hardware usage is presented, based on the modified lifting scheme.

3 DWT Based on Lifting Scheme

The lifting scheme [2] is a simple and efficient algorithm to compute the DWT. This scheme is an alternative way of building the wavelet filters through a sequence of lifting steps. The block diagram in Fig. 1 depicts the three steps of the lifting scheme.

Fig. 1

Block diagram of the lifting scheme [2]

The input samples are first split into even and odd samples. The predict function is then applied to the even samples to predict the odd samples. The difference between the predicted odd samples and the actual odd samples forms the high-pass, or detail, coefficients. Applying the update function to the detail coefficients and combining the result with the even samples forms the low-pass, or approximation, coefficients. Finally, the coefficients are scaled by the normalization factors, K1 and K2, to obtain the high-pass and the low-pass coefficients respectively. The following section details the conventional lifting and the modified lifting scheme.
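
As a concrete illustration of these three steps, the Python sketch below performs one level of the 1-D 5/3 lifting transform with the coefficients \(\alpha = -0.5\) and \(\beta = 0.25\) listed in Sect. 3.1. The even-length input, the symmetric boundary extension, and the omission of the scaling step are simplifying assumptions of the sketch, not part of the architecture description.

```python
import numpy as np

def lifting_dwt_53(x):
    """One level of the 1-D 5/3 DWT via lifting (illustrative sketch).

    Follows Fig. 1: split, predict, update; the scaling step is omitted here.
    An even-length input and symmetric extension at the edges are assumed.
    """
    x = np.asarray(x, dtype=float)
    s = x[0::2].copy()                      # even samples s_i^0
    d = x[1::2].copy()                      # odd samples  d_i^0

    alpha, beta = -0.5, 0.25                # 5/3 lifting coefficients (Sect. 3.1)

    # Predict: d_i^1 = d_i^0 + alpha * (s_i^0 + s_{i+1}^0)
    s_next = np.append(s[1:], s[-1])        # s_{i+1}, mirrored at the right edge
    d = d + alpha * (s + s_next)

    # Update: s_i^1 = s_i^0 + beta * (d_{i-1}^1 + d_i^1)
    d_prev = np.insert(d[:-1], 0, d[0])     # d_{i-1}, mirrored at the left edge
    s = s + beta * (d_prev + d)

    return d, s                             # high-pass (detail), low-pass (approximation)
```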

3.1 Conventional Lifting Scheme for Discrete Wavelet Transform

The Euclidean algorithm can be used to factorize every wavelet transform into the lifting scheme [2]. The polyphase matrix of a DWT filter is resolved into a sequence of alternating upper and lower triangular matrices and a diagonal matrix. The polyphase matrix of the wavelet transform can be represented as

$$\begin{aligned} P(z) = \begin{bmatrix} g_e(z) & g_o(z) \\ h_e(z) & h_o(z) \end{bmatrix} \end{aligned}$$
(1)

where \(h_e(z), h_o(z)\) and \(g_e(z), g_o(z)\) are the even and odd polyphase components of the low-pass filter h(z) and the high-pass filter g(z) respectively. P(z) of the 9/7 filter can be factorized as

$$\begin{aligned} P(z) &= \begin{bmatrix} 1 & \alpha (1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta (1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma (1+z^{-1}) \\ 0 & 1 \end{bmatrix} \\ &\quad \times \begin{bmatrix} 1 & 0 \\ \delta (1+z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \end{aligned}$$
(2)

where \(\alpha (1+z^{-1})\) and \(\gamma (1+z^{-1})\) are predict polynomials, \(\beta (1+z)\) and \(\delta (1+z)\) are update polynomials and K is a constant.

Given the input sequence x(n) with \(n=0, 1,\ldots , N-1\), the lifting scheme of the 9/7 filter is mathematically described in the following three steps [15].

  1.

    Split step

    $$\begin{aligned} d_i^0&= x_{2n+1} \end{aligned}$$
    (3)
    $$\begin{aligned} s_i^0&= x_{2n} \end{aligned}$$
    (4)
  2.

    Lifting step

    (a)

      First lifting step

      $$\begin{aligned} d_i^1&= d_i^0 + \alpha \left( s_i^0 + s_{i+1}^0\right) \qquad {\textit{(predict)}} \end{aligned}$$
      (5)
      $$\begin{aligned} s_i^1&= s_i^0 + \beta \left( d_{i-1}^1 + d_i^1\right) \qquad {\textit{(update)}} \end{aligned}$$
      (6)
    (b)

      Second lifting step

      $$\begin{aligned} d_i^2&= d_i^1 + \gamma \left( s_i^1 + s_{i+1}^1\right) \qquad {\textit{(predict)}} \end{aligned}$$
      (7)
      $$\begin{aligned} s_i^2&= s_i^1 + \delta \left( d_{i-1}^2 + d_i^2\right) \qquad {\textit{(update)}} \end{aligned}$$
      (8)
  3.

    Scaling step

    $$\begin{aligned} d_i&= 1/k \times d_i^2 \end{aligned}$$
    (9)
    $$\begin{aligned} s_i&= k \times s_i^2 \end{aligned}$$
    (10)

\(d_i\) and \(s_i\) are the high-pass and the low-pass coefficients respectively, \(0 \le i \le M-1\), where M is the data length. The 9/7 filter coefficients are \(\alpha = -1.586134342\), \(\beta = -0.05298011854\), \(\gamma = 0.8829110762\) and \(\delta = 0.4435068522\), and \(k=1.149604398\) is the scaling coefficient. Similarly, the 5/3 filter lifting algorithm consists of one lifting step (the first lifting step) along with the split step and the scaling step. Its two filter coefficients are \(\alpha = -0.5\) and \(\beta = 0.25\).
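
A minimal Python sketch of these conventional lifting steps (Eqs. 3-10) is given below, using the coefficient values listed above; the even-length input and the symmetric boundary extension are assumptions of the sketch rather than part of the architecture.

```python
import numpy as np

# 9/7 lifting and scaling coefficients as listed above
ALPHA, BETA  = -1.586134342, -0.05298011854
GAMMA, DELTA =  0.8829110762, 0.4435068522
K            =  1.149604398

def lifting_dwt_97(x):
    """One level of the 1-D 9/7 DWT via conventional lifting (Eqs. 3-10)."""
    x = np.asarray(x, dtype=float)
    s = x[0::2].copy()                       # s_i^0 (even samples), Eq. (4)
    d = x[1::2].copy()                       # d_i^0 (odd samples),  Eq. (3)

    def predict(d, s, coef):
        s_next = np.append(s[1:], s[-1])     # s_{i+1}, mirrored at the right edge
        return d + coef * (s + s_next)

    def update(s, d, coef):
        d_prev = np.insert(d[:-1], 0, d[0])  # d_{i-1}, mirrored at the left edge
        return s + coef * (d_prev + d)

    d = predict(d, s, ALPHA)                 # Eq. (5)
    s = update(s, d, BETA)                   # Eq. (6)
    d = predict(d, s, GAMMA)                 # Eq. (7)
    s = update(s, d, DELTA)                  # Eq. (8)

    return d / K, s * K                      # Eqs. (9)-(10): high-pass, low-pass
```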

The design bottleneck of the 9/7 DWT is its large critical path of one multiplier and two adder delays \((T_m + 2T_a)\). This can be improved by modifying the lifting algorithm. The data flow graph of the conventional lifting scheme for the 9/7 filter, shown in Fig. 2, requires a maximum of four adders and two multipliers per clock cycle. The latency of the scheme to produce the 2-D DWT is 11 cycles. The high-pass and low-pass coefficients are obtained after scaling by factors of 1/k and k respectively.

Fig. 2

Data flow graph of the conventional lifting scheme

3.2 Modified Lifting Scheme for Discrete Wavelet Transform

In the conventional lifting scheme, the intermediate data is processed serially, resulting in an extended critical path delay. Pipeline registers can be employed to reduce the critical path delay to one multiplier delay, but the internal memory size of the 2-D DWT structure then increases. Hence, the lifting algorithm is redesigned by modifying the predictor and the updater to bring down the critical path delay and to reduce the latency. Two modified lifting schemes are discussed in the following subsections.

3.2.1 Scheme-1

In order to minimize the critical path delay of the lifting-based DWT structure, the predict function (Eq. 11) of the first lifting step is modified and merged with the update function (Eq. 12) to obtain a modified equation for the low-pass coefficient of the first lifting step. In the second lifting step, both the predict and the update functions are modified and merged. The coefficients are then scaled with the appropriate scaling functions to obtain the high-pass and the low-pass coefficients.

  1.

    First lifting step

    $$\begin{aligned} d_i^1&= d_i^0 + \alpha \left( s_i^0 + s_{i+1}^0\right) \end{aligned}$$
    (11)
    $$\begin{aligned} s_i^1&= s_i^0 + \beta \left( d_{i-1}^1 + d_i^1\right) \end{aligned}$$
    (12)

    Multiplying Eq. 11 by \(\beta\) and substituting the resulting expressions for \(\beta d_{i}^1\) and \(\beta d_{i-1}^1\) into Eq. 12 gives

    $$\begin{aligned} s_i^1= s_i^0 + \left( \beta d_{i-1}^0 + \alpha \beta \left( s_{i-1}^0 + s_{i}^0\right) \right) + \left( \beta d_i^0 + \alpha \beta \left( s_i^0 + s_{i+1}^0\right) \right) \end{aligned}$$
    (13)

    Similarly \(\beta d_{i+1}^1\) and \(s_{i+1}^1\) can be given as

    $$\begin{aligned} \beta d_{i+1}^1= \beta d_{i+1}^0 + \alpha \beta \left( s_{i+1}^0 + s_{i+2}^0\right) \end{aligned}$$
    (14)
    $$\begin{aligned} s_{i+1}^1= s_{i+1}^0 + \left( \beta d_{i}^0 + \alpha \beta \left( s_{i}^0 + s_{i+1}^0\right) \right) + \left( \beta d_{i+1}^0 + \alpha \beta \left( s_{i+1}^0 + s_{i+2}^0\right) \right) \end{aligned}$$
    (15)
  2.

    Second lifting step

    $$\begin{aligned} d_i^2&= d_i^1 + \gamma \left( s_i^1 + s_{i+1}^1\right) \end{aligned}$$
    (16)
    $$\begin{aligned} s_i^2&= s_i^1 + \delta \left( d_{i-1}^2 + d_i^2\right) \end{aligned}$$
    (17)

    Dividing Eq. 17 by \(\delta \gamma\) and replacing \(d_{i-1}^2\) and \(d_i^2\) using Eq. 16 gives

    $$\begin{aligned} s_i^2/\delta \gamma= s_i^1/\delta \gamma + \left( d_{i-1}^1/\gamma + s_{i-1}^1 + s_{i}^1\right) + \left( d_i^1/\gamma + s_i^1 + s_{i+1}^1\right) \end{aligned}$$
    (18)

    Similarly \(d_{i+1}^2/\gamma\) and \(s_{i+1}^2/\delta \gamma\) can be given as

    $$\begin{aligned} d_{i+1}^2/\gamma= d_{i+1}^1/\gamma + s_{i+1}^1 + s_{i+2}^1 \end{aligned}$$
    (19)
    $$\begin{aligned} s_{i+1}^2/\delta \gamma= s_{i+1}^1/\delta \gamma + \left( d_{i}^1/\gamma + s_{i}^1 + s_{i+1}^1\right) + \left( d_{i+1}^1/\gamma + s_{i+1}^1 + s_{i+2}^1\right) \end{aligned}$$
    (20)
  3.

    Scaling step

    $$\begin{aligned} d_i&= \gamma /k \times d_i^2/\gamma \end{aligned}$$
    (21)
    $$\begin{aligned} s_i&= k \delta \gamma \times s_i^2 / \delta \gamma \end{aligned}$$
    (22)

The data flow graph of the modified lifting scheme for the 9/7 filter is shown in Fig. 3. Two input samples are fetched in each cycle for processing. The filter coefficients in this method are \(\beta\), \(\alpha \beta\), \(1/\beta \gamma\) and \(1/\delta \gamma\). The outputs of the first lifting step are \(\beta d_i^1\) and \(s_i^1\). Due to the modification of the lifting steps, one cycle is saved in each of the first and second lifting steps. The maximum number of cycles required to produce the one-level 2-D DWT in this method is nine. The output coefficients after the second lifting step are \(d_i^2/\gamma\) and \(s_i^2 / \delta \gamma\). These coefficients are then scaled by factors of \(\gamma /k\) and \(k \delta \gamma\) to obtain the high-pass and low-pass wavelet coefficients respectively.
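
A behavioural Python sketch of Scheme-1 is given below. It mirrors the data flow of Fig. 3 at the sample level (not cycle by cycle), again assuming an even-length input and symmetric edge extension; only the four flipped constants \(\beta\), \(\alpha \beta\), \(1/\beta \gamma\) and \(1/\delta \gamma\) multiply the data path.

```python
import numpy as np

def flipped_dwt_97_scheme1(x, alpha=-1.586134342, beta=-0.05298011854,
                           gamma=0.8829110762, delta=0.4435068522, k=1.149604398):
    """One level of the 1-D 9/7 DWT following Scheme-1 (Eqs. 13-22), illustrative only."""
    x = np.asarray(x, dtype=float)
    s0, d0 = x[0::2].copy(), x[1::2].copy()

    c1, c2 = beta, alpha * beta                            # first-step constants
    c3, c4 = 1.0 / (beta * gamma), 1.0 / (delta * gamma)   # second-step constants

    # First lifting step: outputs beta*d_i^1 and s_i^1
    s0_next = np.append(s0[1:], s0[-1])
    bd1 = c1 * d0 + c2 * (s0 + s0_next)             # beta * d_i^1
    bd1_prev = np.insert(bd1[:-1], 0, bd1[0])
    s1 = s0 + bd1_prev + bd1                        # s_i^1, Eq. (13)

    # Second lifting step: outputs d_i^2/gamma and s_i^2/(delta*gamma)
    s1_next = np.append(s1[1:], s1[-1])
    d2g = c3 * bd1 + s1 + s1_next                   # d_i^2 / gamma
    d2g_prev = np.insert(d2g[:-1], 0, d2g[0])
    s2dg = c4 * s1 + d2g_prev + d2g                 # s_i^2 / (delta*gamma), Eq. (18)

    # Scaling step, Eqs. (21)-(22)
    return (gamma / k) * d2g, (k * delta * gamma) * s2dg
```

Apart from the reordering of the multiplications, the outputs coincide with those of the conventional lifting scheme of Sect. 3.1, which provides a simple functional check of the derivation.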

Fig. 3

Data flow graph of the modified lifting scheme with two-input/two-output

3.2.2 Scheme-2

In this method, at the first lifting step, the predict function (Eq. 23) is merged with the modified update function (Eq. 24) to obtain an equation for the low-pass coefficient. In the second lifting step, the update function (Eq. 29) is modified, and the predict function (Eq. 28) is merged with it to obtain a modified low-pass coefficient equation. Later, the high-pass and the low-pass coefficients are scaled appropriately.

  1.

    First lifting step

    $$\begin{aligned} d_i^1&= d_i^0 + \alpha \left( s_i^0 + s_{i+1}^0\right) \end{aligned}$$
    (23)
    $$\begin{aligned} s_i^1&= s_i^0 + \beta \left( d_{i-1}^1 + d_i^1\right) \end{aligned}$$
    (24)

    Dividing Eq. 24 by \(\beta\) and substituting \(d_{i-1}^1\) and \(d_i^1\) from Eq. 23 gives

    $$\begin{aligned} s_i^1 / \beta = s_i^0 / \beta + \left( d_{i-1}^0 + \alpha \left( s_{i-1}^0 + s_i^0\right) \right) + \left( d_i^0 + \alpha \left( s_i^0 + s_{i+1}^0\right) \right) \end{aligned}$$
    (25)

    Similarly \(d_{i+1}^1\) and \(s_{i+1}^1\) can be given as

    $$\begin{aligned} d_{i+1}^1= d_{i+1}^0 + \alpha \left( s_{i+1}^0 + s_{i+2}^0\right) \end{aligned}$$
    (26)
    $$\begin{aligned} s_{i+1}^1 / \beta= s_{i+1}^0 / \beta + \left( d_{i}^0 + \alpha \left( s_{i}^0 + s_{i+1}^0\right) \right) + \left( d_{i+1}^0 + \alpha \left( s_{i+1}^0 + s_{i+2}^0\right) \right) \end{aligned}$$
    (27)
  2.

    Second lifting step

    $$\begin{aligned} d_i^2= d_i^1 + \gamma \left( s_i^1 + s_{i+1}^1\right) \end{aligned}$$
    (28)
    $$\begin{aligned} s_i^2= s_i^1 + \delta \left( d_{i-1}^2 + d_i^2\right) \end{aligned}$$
    (29)

    Dividing Eq. 29 by \(\beta\) and substituting Eq. 28 for \(d_{i-1}^2\) and \(d_i^2\) gives

    $$\begin{aligned} s_i^2 / \beta= s_i^1 / \beta + (\delta /\beta )\left[ d_{i-1}^1 + \gamma \left( s_{i-1}^1 + s_{i}^1\right) \right] + (\delta /\beta )\left[ d_i^1 + \gamma \left( s_i^1 + s_{i+1}^1\right) \right] \end{aligned}$$
    (30)

    Similarly \(d_{i+1}^2\) and \(s_{i+1}^2\) can be given as

    $$\begin{aligned} d_{i+1}^2= d_{i+1}^1 + \gamma \left( s_{i+1}^1 + s_{i+2}^1\right) \end{aligned}$$
    (31)
    $$\begin{aligned} s_{i+1}^2 / \beta= s_{i+1}^1 / \beta + (\delta /\beta )\left[ d_{i}^1 + \gamma \left( s_{i}^1 + s_{i+1}^1\right) \right] + (\delta /\beta )\left[ d_{i+1}^1 + \gamma \left( s_{i+1}^1 + s_{i+2}^1\right) \right] \end{aligned}$$
    (32)
  3.

    Scaling step

    $$\begin{aligned} d_i&= 1/k \times d_i^2 \end{aligned}$$
    (33)
    $$\begin{aligned} s_i&= k \beta \times s_i^2 / \beta \end{aligned}$$
    (34)

The data flow graph of this modified lifting scheme for the 9/7 filter is shown in Fig. 4. The filter coefficients in this method are \(\alpha\), \(1/ \beta\), \(\beta \gamma\) and \(\delta /\beta\). Four input samples are fetched simultaneously to give two high-pass and two low-pass coefficients. The outputs of the first lifting step are \(d_i^1\) and \(s_i^1 / \beta\). The high-pass and low-pass coefficients after the second lifting step are \(d_i^2\) and \(s_i^2 / \beta\) respectively. The latency to produce the output is 13 cycles. The high-pass and low-pass coefficients are scaled by factors of 1/k and \(k \beta\) respectively in the scaling step.
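
For comparison, a corresponding Python sketch of Scheme-2 (Eqs. 25-34) is shown below under the same assumptions (even-length input, symmetric edge extension); the data path multiplies only by \(\alpha\), \(1/\beta\), \(\beta \gamma\) and \(\delta /\beta\), and the four-sample-per-cycle I/O of the hardware is not modelled.

```python
import numpy as np

def flipped_dwt_97_scheme2(x, alpha=-1.586134342, beta=-0.05298011854,
                           gamma=0.8829110762, delta=0.4435068522, k=1.149604398):
    """One level of the 1-D 9/7 DWT following Scheme-2 (Eqs. 25-34), illustrative only."""
    x = np.asarray(x, dtype=float)
    s0, d0 = x[0::2].copy(), x[1::2].copy()

    # First lifting step: outputs d_i^1 and s_i^1/beta
    s0_next = np.append(s0[1:], s0[-1])
    d1 = d0 + alpha * (s0 + s0_next)                 # Eq. (23)
    d1_prev = np.insert(d1[:-1], 0, d1[0])
    s1b = s0 / beta + d1_prev + d1                   # s_i^1 / beta, Eq. (25)

    # Second lifting step: outputs d_i^2 and s_i^2/beta
    s1b_next = np.append(s1b[1:], s1b[-1])
    d2 = d1 + (beta * gamma) * (s1b + s1b_next)      # Eq. (28) written in terms of s^1/beta
    d2_prev = np.insert(d2[:-1], 0, d2[0])
    s2b = s1b + (delta / beta) * (d2_prev + d2)      # Eq. (30)

    # Scaling step, Eqs. (33)-(34)
    return d2 / k, (k * beta) * s2b
```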

Fig. 4

Data flow graph of the modified lifting scheme with four-input/four-output

4 Proposed Architecture for the 2-D Lifting Based DWT

4.1 Overall Architecture

The overall block diagram of the proposed 2-D DWT architecture is shown in Fig. 5. The structure consists of a row processor, a column processor, a transpose block and temporal memory. The row and the column processor are designed to perform the 2-D transform with a row-column-wise scanning scheme. The intermediate data is stored in the temporal memory during column processing. For the 9/7 filter, the first and the second lifting steps together constitute the row/column processor, and for the 5/3 filter, the row/column processor comprises only the first lifting step.

Fig. 5

Block diagram of the proposed 2-D DWT structure

4.2 Data Scanning Method

Z-scanning [24] is employed to process the row and column elements simultaneously. The data scanning method for an \(N \times M\) image is shown in Fig. 6. Two elements, one odd and one even, are read in unison at every rising edge of the clock, and all the elements of the image are read in a Z fashion. This scanning method reduces the latency and allows a transpose block whose size is independent of the number of elements in the image.
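
The Python sketch below reproduces the pixel ordering implied by this description and by the data flow of Sect. 4.3; the exact traversal of Fig. 6 is assumed rather than reproduced, and even image dimensions are assumed.

```python
def z_scan_order(n_rows, n_cols):
    """Yield pairs of (row, col) pixel addresses in an assumed Z-scan order.

    Two horizontally adjacent pixels are issued per clock, alternating between
    the upper and lower row of the current row pair before stepping two
    columns to the right; row pairs are traversed from top to bottom.
    Even n_rows and n_cols are assumed.
    """
    for r in range(0, n_rows, 2):
        for c in range(0, n_cols, 2):
            yield (r, c), (r, c + 1)          # two pixels of the upper (odd) row
            yield (r + 1, c), (r + 1, c + 1)  # two pixels of the lower (even) row

# Example: address pairs for a 4 x 4 image, one pair per clock cycle
for pair in z_scan_order(4, 4):
    print(pair)
```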

Fig. 6

Z-scan method [24]

4.3 First Lifting Step Processing Architecture

The first lifting step processing block shown in Fig. 7 is based on the data flow graph described in Fig. 3. It consists of two input lines and two output lines with four adders and two multipliers. The structure has four pipeline stages to reduce the critical path delay to \(1T_m\). Input scanning is done in a Z fashion with a throughput rate of two. The first and the second elements of the odd row are read at the rising edge of the clock. Next, when the elements of the even row are read and processed, the intermediate results of the odd row are stored in first-in first-out (FIFO) registers, P3 and P5. These intermediate results are later utilized when the third and the fourth elements of the odd row are read. The P3 and P5 registers are used to store the processed data of the odd and the even row elements. The data flow of the first lifting step is shown in Table 1, where the subscript i indexes the elements of the image. The outputs obtained at the end of the cycle are \(\beta d^1_i\) and \(s^1_i\).

Fig. 7

Proposed first lifting step processing block

Table 1 Data flow of the first lifting step

4.4 Second Lifting Step Processing Architecture

Based on the data flow graph in Fig. 3, the second lifting step processing block is constructed as shown in Fig. 8. The inputs to this block are the outputs of the first lifting step processing architecture. The intermediate results are stored in the FIFO registers, P3 and P5. Similar to the first lifting step, the four pipeline stages reduce the critical path delay to \(1T_m\). The outputs of this block are \(d_i^2/\gamma\) and \(s_i^2/\delta \gamma\). Table 2 shows the data flow of the second lifting step.

Fig. 8

Proposed second lifting step processing block

Table 2 Data flow of the second lifting step

The first and the second lifting architectures are cascaded to form the row/column processing block, which constitutes the structure for the 1-D DWT. In the column processing block, the FIFO registers are replaced by an internal memory of size N. For the 2-D transform, the transposed output of the 1-D DWT is fed back to the processing block. Since the scanning is in a Z fashion, the intermediate results of the two rows are stored in a temporal memory of size 4N for the 9/7 filter and 2N for the 5/3 filter during column processing.

4.5 Transpose Block

The column processing block accepts the column-wise data managed by the transpose block. An efficient transpose block is developed to rearrange the output sequence of the row processor for the use of the column processor. The transpose block, as shown in Fig. 9, consists of three registers (\(R1{-}R3\)) and one 4-to-2 multiplexer. The input and output data flow in the transpose block is shown in Table 3. h and l are the high-pass and the low-pass coefficients after 1-D transform. The structure of the transpose block is independent of the size of the image.

Fig. 9

Block diagram of the transpose block

Table 3 Data flow of the transpose block

4.6 Four-Input/Four-Output 1-D DWT Architecture

The 1-D DWT architecture for the 9/7 filter shown in Fig. 10 processes four elements per clock cycle. This structure is designed from the data flow graph outlined in Fig. 4. The input data flow is line based, with four pixels being read simultaneously from a row of the image. The structure consists of sixteen adders and eight multipliers with a two-stage pipeline. For the 2-D DWT, the column-wise decomposition is done after the row-wise decomposition with the same structure. However, a temporal buffer and a transpose block are necessary to perform the 2-D transform.

Fig. 10

Proposed four-input/four-output 1-D DWT architecture

5 Implementation and Performance Analysis

5.1 Implementation

In order to obtain a systematic and standard description, the proposed structures are described in structural VHDL following uniform coding guidelines. The structures are synthesized using the Cadence RTL Compiler with a commercial 90 nm low-power technology library (\({\textit{slow}}\_{\textit{lib}}\)) at the lowest possible voltage and the worst-case delay corner. The power consumption of the design is obtained by simulating \(256\times 256\) vectors in Cadence NCsim to produce a toggle count format (TCF) file. Full-hierarchy TCF dumps are performed, and the data is fed to the Cadence tool for power estimation. MATLAB is used to read the input image and to convert the transformed coefficients back to an image.

The derived netlist and design constraints are fed to the Cadence Encounter tool to generate a layout, from which RC parasitic information is extracted. The interconnect stack consists of one polysilicon layer and six metal layers. The layout is optimized and routed with NanoRoute. Post-route static timing analysis and hold timing analysis are performed. It is observed that the worst negative slack (WNS) is positive and the total negative slack (TNS) is zero.

The implementation summary of the proposed single-level 2-D DWT architecture with two-input/two-output is given in Table 4. For an image size of \(256 \times 256\), the power dissipation and delay of the structure for the 5/3 filter are 8.45 mW and 4.26 ns respectively. Similarly, the power and delay figures for the 9/7 filter design are 35.93 mW and 7.64 ns respectively. Further, four \(256 \times 16\) two-port RAMs and two \(256 \times 16\) two-port RAMs are employed for the 9/7 and 5/3 filters respectively to store the intermediate results from the column processor.

Table 4 Design specification of the proposed structure

In the hardware implementation, floating-point computation needs a larger chip area and more computation time. Hence, the numbers are converted into a fixed-point representation [15]. The 9/7 filter coefficients are irrational numbers; they are converted into a 12-bit fixed-point representation to resolve the overflow issues [13]. The 16-bit data path is allocated as 1 sign bit, 10 integer bits and 5 fractional bits to preserve the image quality. Although the truncation compromises the quality of the image, its effect is negligible.
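
The Python sketch below illustrates this kind of fixed-point conversion. The 12-bit coefficient split used here (1 sign, 1 integer, 10 fractional bits) is an assumed, plausible allocation, since the text above states only the total word length; the 16-bit sample format follows the 1/10/5 allocation given above.

```python
def to_fixed(value, frac_bits, total_bits):
    """Round a real value to a two's-complement fixed-point word (illustrative).

    `frac_bits` of the `total_bits`-bit word are fractional; the result is
    saturated to the representable range.
    """
    q = int(round(value * (1 << frac_bits)))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q))

def from_fixed(word, frac_bits):
    """Convert a fixed-point word back to a real value."""
    return word / float(1 << frac_bits)

# 9/7 lifting coefficients quantized to the assumed 12-bit format
coeffs = {'alpha': -1.586134342, 'beta': -0.05298011854,
          'gamma': 0.8829110762, 'delta': 0.4435068522}
for name, c in coeffs.items():
    w = to_fixed(c, frac_bits=10, total_bits=12)
    print(name, w, from_fixed(w, 10))        # quantized word and its rounded value

# A sample on the 16-bit data path: 1 sign, 10 integer, 5 fractional bits
pixel_word = to_fixed(183.40625, frac_bits=5, total_bits=16)
```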

5.2 Performance Analysis

The comparison of the proposed DWT structure with other architectures for the 9/7 and 5/3 filters is presented in Tables 5, 6 and 7. The performance evaluation is done in terms of hardware requirement, memory, critical path delay and throughput. The proposed two-input/two-output 1-D 9/7 DWT architecture requires 4 multipliers and 8 adders. The memory requirement of the structure is 22 registers. It has a critical path delay of \(1T_m\) with simple control complexity, in contrast to the architectures proposed by Lai et al. [13] and Zhang et al. [21]. The effective folded architecture [12] reduces the hardware cost and memory requirement by re-utilizing the hardware in the structure; however, its critical path and throughput limit its application. It can be observed that the proposed structure is better than the direct structure [8] and the flipping structure [14] in terms of critical path delay. The high-performance structure [15] has the minimum adder and multiplier requirement due to the merging of the updater and the predictor steps, but this benefit comes at the cost of throughput. The pipeline structure [19] and the multi-input/multi-output structure [20] have reasonable hardware and memory requirements, but their critical path delay is high compared to the proposed structure. The 1-D DWT architecture [28] with a single processing element (PE) utilizes minimal hardware resources, but has a high critical path delay, which is the sum of the fused multiply-add unit delay and the floating-point adder delay; the PEs are connected in parallel for a higher throughput rate. The proposed four-input/four-output 1-D 9/7 DWT architecture based on the modified lifting scheme can be used for high-speed applications. It has a critical path delay of \(T_m+2T_a\) and a hardware requirement of 8 multipliers and 16 adders. The quad-input/quad-output structure [18] needs the same number of adders and multipliers as the proposed structure, but its memory requirement is higher.

Table 5 Comparison of 1-D DWT architectures for 9/7 filter (Thrp: throughput, \(T_m\): multiplier delay, \(T_a\): adder delay, \(T_{fma}\): fused multiply-add unit delay, \(T_{fadd}\): floating point adder delay, \(N\times N\): image size)
Table 6 Comparison of 2-D DWT architectures for 9/7 filter (Thrp: throughput, \(T_m\): multiplier delay, \(T_a\): adder delay, \(T_{fma}\): fused multiply-add unit delay, \(T_{fadd}\): floating point adder delay, \(N\times N\): image size)
Table 7 Comparison of 2-D DWT architectures for 5/3 filter (Thrp: throughput, \(T_m\): multiplier delay, \(T_a\): adder delay, \(N\times N\): image size)

As for the 2-D DWT, the proposed structure for the 9/7 filter requires 16 adders and 8 multipliers for processing and 2 multipliers for scaling. The critical path delay of the structure is \(1T_m\) with a throughput rate of two. The computation time for an \(N \times N\) image is \(N^2/2\). The temporal memory needed is 4N and the transpose buffer requires 3 registers. Compared with other structures [14, 19, 36, 37], the proposed structure has the lowest hardware cost and memory. The critical path delay of the flipping structure [14] with its 5-stage pipeline is the same as that of the proposed structure, but its transpose buffer and temporal buffer requirements are larger. The dual-scan architecture [38] reduces the transpose buffer to 4 registers, but its critical path delay limits its application. Even though the high-performance structure [15] is hardware cost effective, it is evident that the proposed 2-D structure has better throughput. The parallel-based lifting scheme structure [16] for the DWT aims at reducing the critical path to \(1T_m\); however, its transpose buffer is 1.5N. The FA [17] reduces the hardware requirement by utilizing the same hardware for the predict and update steps. Because of the hardware reuse, pipelining cannot be applied, and the critical path delay rises to \(T_m+2T_a\). The memory-efficient structure [18] requires 32 extra registers along with 4N temporal memory to bring the critical path delay down to \(1T_m\). The two-input/two-output module [20] has a high memory requirement, and its critical path delay is \(T_m+2T_a\). Although the hardware and memory utilized in [21] are the same as those of the proposed structure, its control procedure is complex. Even though the 2-D dual-mode architecture [22] is hardware efficient, its critical path delay of \(2T_m+4T_a\) and computing time of \((3/4)N^2 + (3/2)N + 7\) do not make it a fast and effective structure for real-time applications. Shifters are used instead of multipliers in that design, and the 1-D DWT structure consists of a multiply-accumulate (MAC) unit, multiplexers and de-multiplexers. The proposed design reduces the transpose buffer to 3 registers as compared to the dual-scan parallel flipping structure [25], where 5 registers are used to rearrange the output from the row processor into the order required for the column processor. Darji and Limaye [39] proposed a memory-efficient structure requiring 2N temporal memory, but the hardware resources essential for processing are high. The arithmetic resources required in the 2-D DWT architecture [27] are the lowest among all the compared architectures, but the computation time required for the two DWT engines in parallel is \(3N^2/32 + 3N^2/8\), which is more than that of the other structures. The N-parallel DWT architecture [28] uses a single PE for the transform. For the 9/7 2-D transform, nine architectures are connected in parallel, containing 9 PEs and 18 FMAs. Transpose memory is not needed in this structure; however, the critical path delay is \(T_{fma}+T_{fadd}\).

A trade-off exists between the critical path delay and the complexity in the design of a 2-D DWT architecture. The proposed 2-D 5/3 filter DWT structure utilizes 4 multipliers and 8 adders. The temporal memory required is 2N and the computation time is \(N^2/2\). The critical path delay is limited to \(1T_m\) with simple control complexity. Diou et al. [40] presented a 2-D DWT structure for the 5/3 filter using an interleaving technique. Compared to the proposed structure, the arithmetic resources required are higher, and the critical path delay is the sum of \(T_m\) and \(T_a\). Chen and Wu [41] presented a folded structure for the 2-D 5/3 filter DWT. The throughput rate is one-input/one-output, and the temporal memory required to store the intermediate data is 2.5N for an \(N \times N\) image. The high-speed VLSI architecture [42] requires 4 multipliers and 8 adders for computation. Its critical path delay is minimal, but the temporal memory required is 3N.

6 Conclusion

In this paper, efficient two-input/two-output and four-input/four-output 1-D DWT architectures are proposed based on the modified lifting scheme. A 2-D DWT structure for the 9/7 and 5/3 filters is also proposed with a throughput rate of two per cycle. The DWT architecture is energy efficient and has a low critical path delay of \(1T_m\). The structure shows the desired performance in terms of hardware usage and memory. Based on the comparative analysis, the proposed structure achieves good speed at low hardware cost, which makes it an efficient alternative for high-speed applications. The proposed one-level 2-D DWT can easily be extended to multiple levels by storing the LL band in an external memory for the next level of transformation. From the ASIC synthesis results, it can be observed that the proposed structure is area and power efficient, which makes it suitable for real-time image and video processing.