1 Introduction

Digital Image Processing (DIP) is used in almost every field of modern society, including medicine, astronomy, entertainment and computer vision (Gonzalez and Woods 2008; Jayaraman et al. 2009). Video processing is an extension of digital image processing (Marques 2012) in which a sequence of still images changes at a high rate in the proper order, creating the illusion for the viewer that the objects in the scene are moving. In a video, each still image is called a frame, and the rate at which frames change is measured in frames per second (fps). As a result, image processing techniques (Jayaraman et al. 2009) can also be applied to video processing. For a good quality image, the number of pixels in the image must be high; similarly, for a good quality video, both the number of pixels per frame and the frame rate must be high.

In the field of image and video processing, wavelet transforms are widely used in various applications due to their advantages over other similar transformation techniques (Vaidyanathan 1993; Bhairannawar et al. 2018b). Image compression, image denoising, image fusion and image recognition are the most common applications of wavelet transforms. (i) Image Compression: A wavelet transform separates the redundant components present in the image, so the size of the generated bands is reduced. The bands generated by the low-pass filter directly form a compressed version of the input image, which makes the wavelet transform suitable for image compression (Nashat and Hussain 2016; Rajasekhar et al. 2014). (ii) Image Denoising: The filters used in wavelet transforms are designed using the Perfect Reconstruction (PR) condition (Mallat 2008), which introduces de-noising capability within the mathematical model of the wavelet transform itself, so the wavelet transform can also be used for image de-noising (Aravind and Suresh 2015; Gupta et al. 2013). (iii) Image Fusion: Merging two images into a single composite image is known as image fusion. Most existing fusion techniques use wavelet transforms with different types of fusion equations to retain the properties of both images (Aishwarya et al. 2016; Zhang and Zhang 2015). (iv) Image Recognition: This is used to identify predefined objects present in an input image, and wavelet transforms are used in most existing recognition algorithms (Elakkiya and Audithan 2014; Alsubari et al. 2017).

To implement a real-time, high-speed discrete wavelet transform, hardware-level implementation techniques are essential. However, most existing wavelet transform techniques are implemented either in software or on embedded-system-based hardware, which are not suitable for real-time high-speed applications, mainly due to their lower processing speed. Some ASIC-based hardware architectures have also been presented for the same purpose, but they use complex architectural models and are therefore not suitable for real-time high-speed applications due to their architectural complexity.

Contributions The novel concepts of this paper are listed as follows:

(i) The Kogge–Stone Adder architecture is optimized using a Modified Carry Correction block.

(ii) A novel Clock Dividers block generates the different frequencies required to synchronize the filtered data at the output side in an optimized way.

(iii) A novel Reset Controller block discards the overlapped data at the output side in an efficient manner to generate the proper output coefficients.

2 Related works

The existing Adder and Haar Wavelet Transform techniques, along with their advantages and disadvantages, are discussed briefly in this section.

2.1 Adders

Addition is a commonly used mathematical operation in most digital implementations. The Ripple Carry Adder (Koyada et al. 2017) is the basic architecture normally used to implement addition in digital logic; it works well for a small number of bits, but for adders with a large number of bits the delay, which is proportional to the number of bits, becomes a major issue. To overcome this delay problem, various adder architectures have been introduced, generally categorized as Fast Adders (Smith et al. 2004). The Carry Select Adder (Tyagi 1993) is a modified version of the Ripple Carry Adder in which the carry propagation time is reduced. This architecture generates separate Sum and Carry for each bit for both possible values of Cin (i.e., 0 and 1) using separate Ripple Carry Adder blocks; the correct Sum and Carry values are then selected by a multiplexer using the Carry signal generated at the previous stage. The main disadvantage of this architecture is its large area and power requirement. To reduce the carry propagation delay without greatly increasing the area, the Carry LookAhead Adder (Lee and Oklobdzija 1990) was introduced, in which the intermediate stages are calculated separately using Carry Propagation and Carry Generation logic regardless of the previous-level Carry. The Sum is then calculated from the Carry Propagation and Carry Generation values together with the corresponding previous-level Carry. Some extra hardware is needed to implement the Carry Propagation and Carry Generation equations, which slightly increases the area requirement. Instead of immediately transferring the previous-stage carry to the next stage, the Carry Save Adder (Vamsi et al. 2018) saves the Carry and then adds it to the next Sum generated by the architecture using a simple Full Adder (Koyada et al. 2017) circuit; this architecture is similar to the Ripple Carry Adder, where the stored Sum and Carry of each bit are added separately. For a small number of bits (i.e., 4 or fewer), the delay of a Ripple Carry Adder is almost negligible. The Carry Skip Adder (Arora and Niranjan 2017) therefore adds the total number of bits using a finite number of small Ripple Carry Adders, and the error introduced in the intermediate carry blocks is corrected by an AND gate; to obtain a proper trade-off between hardware parameters, it is crucial to select a proper size for the intermediate Ripple Carry Adders. When the Carry LookAhead Adder is arranged in parallel-prefix form, an adder with a good trade-off between different parameters is obtained, known as the Kogge–Stone Adder (Xiang et al. 2018; Kogge and Stone 1973). Among the existing adder architectures, the Kogge–Stone Adder shows the best speed with moderate area utilization (Koyada et al. 2017), making it suitable for high-speed and area-efficient architectures. Soares et al. (2019) presented an approximation-based adder architecture which is then used to design an efficient multiplier architecture; the overall carry propagation signals are divided into a finite number of blocks depending on the generated values, which are then segregated and approximated. This reduces the overall area and power consumption with some loss of output accuracy, and the architecture is implemented using a 45-nm standard-cell ASIC technique. Mohammadi et al. (2010) presented a power- and area-efficient fault-tolerant adder architecture; to achieve this, a Berger code checker with multivalued logic in current mode is used, which can detect faults more effectively. This architecture is implemented using 90-nm ASIC technology, and its main drawback is the large delay required to detect and correct errors.

2.2 Haar wavelet transform

Talukder and Harada (2007) presented wavelet-transform-based image compression, which is used to check the quality of the compressed image with respect to different thresholding techniques. The basic Haar wavelet transform is used to perform this compression, and the entire algorithm is implemented using software-based simulation. The results show that soft thresholding produces a better quality image in terms of PSNR than hard thresholding. Nedunuri et al. (2006) presented a hardware-based discrete wavelet transform architecture; to reduce the memory requirements, a novel diagonal scan method is used to read the input image, and a recursive pyramid hierarchical approach is considered to design efficient filter banks. A three-level DWT is implemented, coded in VHDL and synthesized for a Virtex-2 FPGA board. Hasan et al. (2013) presented multilevel decomposition of images through the discrete wavelet transform for image compression. The decomposition is performed through hardware architectures derived from the fast Haar wavelet transform, which reduces the hardware required to decompose the input image; the architecture is coded in VHDL and implemented using Quartus-II on FPGA. Mamatha et al. (2015) presented image fusion of satellite images using a wavelet decomposition method. The fusion technique is implemented using co-simulation, where the built-in blocks of the System Generator tool are used, which increases the overall hardware utilization. Vijendra and Kulkarni (2016) used the Haar wavelet transform to filter ECG signals. The entire filtering process is performed through co-simulation, where the architecture is designed in VHDL and the input ECG signal is fed through MATLAB via the System Generator tool; this architecture is inefficient in terms of hardware utilization due to the use of built-in blocks without proper optimization. Gafsi et al. (2016) presented a hardware-based architecture to implement a watermarking technique in which the Haar wavelet is the main component, and a modified lifting scheme is proposed to reduce the design complexity. The complete architecture is built using the built-in functions available in the System Generator tool, which results in an inefficient hardware architecture. Harender and Sharma (2017) presented de-noising of ECG signals based on the Haar wavelet transform and universal thresholding; the entire architecture is designed using the built-in functions of the System Generator tool and synthesized using the XST tool. Khan et al. (2019) presented image compression based on the Haar transform, DCT and Run Length Encoding separately for JPEG images. All these techniques were simulated separately in MATLAB and the sizes of the compressed images were compared; the Haar wavelet transform showed a better compression ratio in terms of image size and PSNR than the existing techniques. Bhardwaj and Khunteta (2017) presented a video watermarking scheme designed using the Haar DWT and DCT algorithms, where the conventional algorithms were modified to process videos frame by frame and simulated using MATLAB.
Chakraborty and Banerjee (2020) presented a tunable VLSI architecture for the DWT; to achieve an area- and memory-efficient architecture, the Distributed Arithmetic technique was used along with some degree of parallelism. The authors tested this architecture for the Daubechies 9/7 and LeGall 5/3 filters, and the entire architecture was implemented on a Xilinx FPGA board using a high-level synthesis method. Talukder et al. (2020) presented an efficient integer wavelet transform architecture used for QRS detection in ECG signals, where the wavelet transform mainly de-noises the ECG signal. Haar wavelet, zero-crossing detector, threshold and decision blocks are used to implement the entire architecture, which is coded in Verilog and implemented on a Digilent Nexys-4 FPGA.

3 Mathematical background

In this section, the mathematical models of the existing Kogge–Stone Adder and Haar Wavelet Transform are discussed.

3.1 Kogge–Stone adder

The Kogge–Stone Adder (Kogge and Stone 1973) is a modified version of the Carry Look-Ahead Adder (Wang et al. 1993). The modification is made to reduce the delay in generating the carry signal for large adder architectures, and this adder produces its output faster than other existing adders with a small area overhead (Koyada et al. 2017). The operation of this adder is divided into three parts: Pre-processing, Carry LookAhead Network and Post-Processing (Kogge and Stone 1973), as listed below; a small behavioral sketch of the three stages follows the list.

  1.

    Pre-processing: In this stage, the propagate (p) and generate (g) signals are computed for each bit of the input signals ‘A’ and ‘B’. The logical equations of this block can be written as

    $$\begin{aligned} p_{i}= & {} A_{i} \oplus B_{i} \end{aligned}$$
    (1)
    $$ \begin{aligned} g_{i}= & {} A_{i} \& B_{i} \end{aligned}$$
    (2)

    where i \(\leftarrow \) bit index, ranging over the length of the adder.

  2.

    Carry Lookahead Network: The carries of the corresponding bits are computed separately in this stage, which increases the maximum operating speed of the adder. This stage uses the propagate (p) and generate (g) signals to determine the corresponding carry signals. The logical equations of this stage are given as

    $$ \begin{aligned} p_{i:j}= & {} p_{i:k+1} \& p_{k:j} \end{aligned}$$
    (3)
    $$ \begin{aligned} g_{i:j}= & {} g_{i:k}|(p_{i:k+1} \& g_{k:j}) \end{aligned}$$
    (4)

    where {j, k} \(\leftarrow \) intermediate bit indices used to combine the group signals.

  3.

    Post-processing: The final sum bits are computed in this stage. The logical equation for this stage is

    $$\begin{aligned} S_{i}= p_{i}\oplus c_{i-1} \end{aligned}$$
    (5)

    where \(c_{i-1}\) \(\leftarrow \) Generated carry from previous adder block.
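To make the three stages concrete, the following is a minimal behavioral sketch in Python (an illustration only, not the authors' VHDL); it evaluates the per-bit signals of Eqs. (1)–(2), the parallel-prefix network of Eqs. (3)–(4) and the sum of Eq. (5) for two unsigned operands, with the bit width as an assumed parameter.

def kogge_stone_add(a: int, b: int, width: int = 16) -> int:
    """Behavioral model of a Kogge-Stone adder (carry-in assumed to be 0)."""
    # Pre-processing: per-bit propagate and generate, Eqs. (1)-(2)
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    # Carry lookahead network: log2(width) prefix levels, Eqs. (3)-(4)
    P, G = p[:], g[:]
    step = 1
    while step < width:
        P_new, G_new = P[:], G[:]
        for i in range(step, width):
            G_new[i] = G[i] | (P[i] & G[i - step])
            P_new[i] = P[i] & P[i - step]
        P, G = P_new, G_new
        step <<= 1
    # Post-processing: sum bits, Eq. (5); the carry into bit i is G[i-1]
    s = 0
    for i in range(width):
        c_in = G[i - 1] if i > 0 else 0
        s |= (p[i] ^ c_in) << i
    return s & ((1 << width) - 1)

assert kogge_stone_add(40000, 12345) == (40000 + 12345) & 0xFFFF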

3.2 Haar wavelet transform

For many image analysis tasks, it is necessary to convert the input image into the frequency domain to overcome various issues that arise in time-domain analysis (Prasad and Iyengar 1997). Normally, various types of Fourier Transforms such as the DFT, FFT and STFT are used (Gonzalez and Woods 2008), in which complex sinusoidal basis functions are considered. In many real-time scenarios, the input data is effectively infinite and the information is spread over the whole time axis of the signal, making it difficult to model with the regular Fourier Transform. To overcome such problems, windowing methods are used (Sateesh Kumar et al. 2015). The windowed version of the Fourier Transform is known as the windowed Fourier Transform (Gonzalez and Woods 2008), given in Eq. (6) as

$$\begin{aligned} X(\tau ,\omega )=\int _{-\infty }^{\infty } \omega (t-\tau )\cdot x(t)\cdot e^{-j{\omega }t} dt \end{aligned}$$
(6)

where \(\omega (\cdot )\) \(\leftarrow \) window function of appropriate size.

Here \(X(\tau ,\omega )\) is the Fourier Transform of x(t) in which the window \(\omega (\cdot )\) is shifted by an amount ‘\(\tau \)’; this windowed, modulated transform is known as the Short-Time Fourier Transform (STFT). Due to the use of a single window, the resolution of the analysis is the same for all locations in the time-frequency plane. By varying the window size, the resolution in both the time and frequency domains can be changed, which is what the wavelet transform achieves. In the case of the Wavelet Transform, the basis functions are obtained from a mother wavelet through dilation/contraction and translation operations (Prasad and Iyengar 1997) as

$$\begin{aligned} h{_{a,b}}(t)={\frac{1}{\sqrt{a}}}\cdot h{\bigg (\frac{t-b}{a}\bigg )} \end{aligned}$$
(7)

where a \(\leftarrow \) dilation (scale) parameter, b \(\leftarrow \) translation parameter.

For large ‘a’ values the basis function becomes a low-frequency function, and for small ‘a’ values it becomes a high-frequency function. The equation of the basic Wavelet Transform (Gonzalez and Woods 2008) is therefore

$$\begin{aligned} X(a,b)={\frac{1}{\sqrt{a}}}{\int _{-\infty }^{\infty } x(t)\cdot h{^{*}}{\bigg (\frac{t-b}{a}\bigg )} dt} \end{aligned}$$
(8)

Different trade-offs between Eqs. (7) and (8) yield different time-frequency resolutions. To obtain the Discrete Wavelet Transform, the dilation and translation parameters of Eq. (7) are discretized as \(a={{a{_{0}}}^m}\) and \(b=n{{a{_{0}}}^m}{b{_{0}}}\), giving

$$\begin{aligned} h{_{m,n}}(t)={a{_{0}}^{-\frac{m}{2}}}\cdot h({a{_{0}}^{-m}}t - n{b{_{0}}}) \end{aligned}$$
(9)

where \(\{m,n\} \in z, a_{0} > 1, b_{0} \ne 0\).

Now the equation for Discrete Wavelet Transform is

$$\begin{aligned} X(m,n)={{a_{0}}^{-{\frac{m}{2}}}}{\int _{-\infty }^{\infty } x(t)\cdot h\left( {{a_{0}}^{-m}}t - n{b_{0}}\right) dt} \end{aligned}$$
(10)

In the case of wavelets, it is possible to design the wavelet function h(t) such that the set of translated and scaled versions of h(t) forms an orthogonal basis (Prasad and Iyengar 1997). Using this orthogonality, the Haar basis function (Gonzalez and Woods 2008; Prasad and Iyengar 1997) can be defined as

$$\begin{aligned} h(t)= {\left\{ \begin{array}{ll} ~~~ 1,&{} 0\le t < {\frac{1}{2}}\\ - 1,&{} {\frac{1}{2}}\le t < 1\\ ~~~0, &{} \text {Otherwise} \end{array}\right. } \end{aligned}$$
(11)

The non-separable (Bamerni et al. 2019) version of the two-dimensional Haar transform of an M\(\times \)N matrix in discrete form is given as

$$\begin{aligned} B={\omega }_{M}{\cdot }A{\cdot }{{\omega }_{N}^T} \end{aligned}$$
(12)

where A \(\leftarrow \) Input matrix of M\(\times \)N size, \(\{{\omega }_{M}, {{\omega }_{N}}\}\) \(\leftarrow \) Haar basis function of corresponding axis in XY coordinate.

For simplicity of the mathematical modelling, a \(4 \times 4\) matrix ‘A’ is considered (Bhairannawar et al. 2016) as

$$\begin{aligned} A={\left[ \begin{array}{cccc} a_{11} &{} a_{12} &{} a_{13} &{} a_{14}\\ a_{21} &{} a_{22} &{} a_{23}&{} a_{24} \\ a_{31} &{} a_{32} &{} a_{33} &{} a_{34} \\ a_{41} &{} a_{42} &{} a_{43} &{} a_{44} \end{array} \right] } \end{aligned}$$
(13)

Any Wavelet Transform uses two filter banks, namely a high-pass and a low-pass filter, which generate the high- and low-frequency coefficients of the input image respectively. For such a partitioning, let us consider

$$\begin{aligned} {\omega }_{4}={\left[ \begin{array}{cc} H \\ G \end{array} \right] } \end{aligned}$$
(14)

where H \(\leftarrow \) Low-pass filter coefficient matrix for Haar Wavelet, G \(\leftarrow \) High-pass filter coefficient matrix for Haar Wavelet.

For Haar Wavelet transform, the value of \({\omega }_{4}\) (Prasad and Iyengar 1997) becomes

$$\begin{aligned} {\omega }_{4}={\left[ \begin{array}{cccc} {\frac{1}{2}}&{}{\frac{1}{2}}&{}0&{}0 \\ 0&{}0&{}{\frac{1}{2}}&{}{\frac{1}{2}}\\ -{\frac{1}{2}}&{}{\frac{1}{2}}&{}0&{}0\\ 0&{}0&{}-{\frac{1}{2}}&{}{\frac{1}{2}} \end{array} \right] } \end{aligned}$$
(15)

By using the relation of Eqs. (14) and (15), the value of ‘H’ and ‘G’ becomes

$$\begin{aligned} H= & {} {\left[ \begin{array}{cccc} {\frac{1}{2}}&{}{\frac{1}{2}}&{}0&{}0 \\ 0&{}0&{}{\frac{1}{2}}&{}{\frac{1}{2}} \end{array} \right] } \end{aligned}$$
(16)
$$\begin{aligned} G= & {} {\left[ \begin{array}{cccc} {-\frac{1}{2}}&{}{\frac{1}{2}}&{}0&{}0 \\ 0&{}0&{}{-\frac{1}{2}}&{}{\frac{1}{2}} \end{array} \right] } \end{aligned}$$
(17)

Then Eq. (12) becomes

$$\begin{aligned} B= & {} {{\left[ \begin{array}{cc} H \\ G \end{array} \right] }A{\left[ \begin{array}{cc} H^T&G^T \end{array} \right] }} \end{aligned}$$
(18)
$$\begin{aligned} B= & {} {\left[ \begin{array}{cc} (HA{H^{T}}) &{} (HA{G^{T}}) \\ (GA{H^{T}}) &{} (GA{G^{T}}) \end{array} \right] } \end{aligned}$$
(19)
$$\begin{aligned} B= & {} {\left[ \begin{array}{cc} y_{LL} &{} y_{HL} \\ y_{LH}&{} y_{HH} \end{array} \right] } \end{aligned}$$
(20)

where \( y_{LL}=HA{H^{T}}\) \(\leftarrow \) Approximation (LL-Band), \(y_{LH}=GA{H^{T}}\) \(\leftarrow \) Vertical Difference (LH-Band), \(y_{HL}=HA{G^{T}}\) \(\leftarrow \) Horizontal Difference (HL-Band), \(y_{HH}=GA{G^{T}}\) \(\leftarrow \) Diagonal Difference (HH-Band).

By substituting the corresponding values into Eq. (20), the equations for all four bands, namely LL, LH, HL and HH, are obtained as

$$\begin{aligned} y_{LL}= & {} {\frac{1}{4}}\times {\left[ \begin{array}{cc} {(a_{11}+a_{12}+a_{21}+a_{22})} &{} {(a_{13}+a_{14}+a_{23}+a_{24})} \\ {(a_{31}+a_{32}+a_{41}+a_{42})} &{} {(a_{33}+a_{34}+a_{43}+a_{44})} \end{array} \right] } \end{aligned}$$
(21)
$$\begin{aligned} y_{HL}= & {} {\frac{1}{2}}\times {\left[ \begin{array}{cc} {(a_{12}+a_{22}-a_{11}-a_{21})} &{} {(a_{14}+a_{24}-a_{13}-a_{23})} \\ {(a_{32}+a_{42}-a_{31}-a_{41})} &{} {(a_{34}+a_{44}-a_{33}-a_{43})} \end{array} \right] } \end{aligned}$$
(22)
$$\begin{aligned} y_{LH}= & {} {\frac{1}{2}}\times {\left[ \begin{array}{cc} {(a_{21}+a_{22}-a_{12}-a_{11})} &{} {(a_{23}+a_{24}-a_{13}-a_{14})} \\ {(a_{41}+a_{42}-a_{31}-a_{32})} &{} {(a_{43}+a_{44}-a_{33}-a_{34})} \end{array} \right] } \end{aligned}$$
(23)
$$\begin{aligned} y_{HH}= & {} {\frac{1}{2}}\times {\left[ \begin{array}{cc} {(a_{11}+a_{22}-a_{12}-a_{21})} &{} {(a_{13}+a_{24}-a_{23}-a_{14})} \\ {(a_{31}+a_{42}-a_{32}-a_{41})} &{} {(a_{33}+a_{44}-a_{43}-a_{34})} \end{array} \right] } \end{aligned}$$
(24)

By observing Eqs. (21)–(24), the computation can be reduced from the \(4 \times 4\) input matrix to independent \(2 \times 2\) blocks. Considering an input block \({\left[ \begin{array}{cc} {a} &{} {b} \\ {c} &{} {d}\end{array} \right] }\), the above equations can be rewritten as Eqs. (25)–(28):

$$\begin{aligned} y_{LL}= & {} {\frac{1}{4}}\times {[(a+b)+(c+d)]} \end{aligned}$$
(25)
$$\begin{aligned} y_{HL}= & {} {\frac{1}{2}}\times {[(b+d)-(a+c)]} \end{aligned}$$
(26)
$$\begin{aligned} y_{LH}= & {} {\frac{1}{2}}\times {[(c+d)-(b+a)]} \end{aligned}$$
(27)
$$\begin{aligned} y_{HH}= & {} {\frac{1}{2}}\times {[(a+d)-(b+c)]} \end{aligned}$$
(28)
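As a quick illustrative check of Eqs. (25)–(28), the following Python sketch (a software illustration only, not part of the hardware design) computes the four sub-band coefficients of a single non-overlapped \(2 \times 2\) block.

def haar_2x2(a, b, c, d):
    """Sub-band coefficients of one 2x2 block [[a, b], [c, d]], per Eqs. (25)-(28)."""
    y_ll = ((a + b) + (c + d)) / 4   # Eq. (25), LL band
    y_hl = ((b + d) - (a + c)) / 2   # Eq. (26), HL band
    y_lh = ((c + d) - (b + a)) / 2   # Eq. (27), LH band
    y_hh = ((a + d) - (b + c)) / 2   # Eq. (28), HH band
    return y_ll, y_hl, y_lh, y_hh

# A constant block yields zero detail coefficients and an LL value equal to the pixel value.
print(haar_2x2(10, 10, 10, 10))  # (10.0, 0.0, 0.0, 0.0)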

4 Proposed architecture

The equations of the non-separable Haar Wavelet Transform are given in Eqs. (25)–(28), which contain constant division factors of 4 and 2 respectively. In binary arithmetic, a constant division by a power of two can be replaced by a shifter (Bhairannawar et al. 2016, 2018b) to obtain an optimum hardware architecture. As a result, these equations can be rewritten as

$$\begin{aligned} y_{LL}= & {} {RS_2}\times {[(a+b)+(c+d)]} \end{aligned}$$
(29)
$$\begin{aligned} y_{HL}= & {} {RS_1}\times {[(b+d)-(a+c)]} \end{aligned}$$
(30)
$$\begin{aligned} y_{LH}= & {} {RS_1}\times {[(c+d)-(b+a)]} \end{aligned}$$
(31)
$$\begin{aligned} y_{HH}= & {} {RS_1}\times {[(a+d)-(b+c)]} \end{aligned}$$
(32)

where \(RS_2\) \(\leftarrow \) logical right shift by 2 bit positions, \(RS_1\) \(\leftarrow \) logical right shift by 1 bit position.
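As a small sanity check (assuming non-negative operands, which the Buffer block described with the architecture below guarantees), a logical right shift by k bits equals integer division by 2^k, which is why the divisors of Eqs. (25)–(28) reduce to \(RS_2\) and \(RS_1\):

x = 228
assert (x >> 2) == x // 4   # RS_2 replaces the division by 4
assert (x >> 1) == x // 2   # RS_1 replaces the division by 2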

The proposed hardware architecture of the Optimized Haar Wavelet Transform is shown in Fig. 1; it consists of Pre-processing, Reset Controller, Data Format Conversion, Optimized Controller, Moving Window Architecture, Optimized Kogge–Stone Adder/Subtractor, Buffer, Shifter and D_FF blocks. First, the input video is converted into a finite number of frames of standard size (\(256 \times 256\)) by the Pre-processing block using MATLAB (Gonzalez et al. 2009) and the Simulink/System Generator (Karris 2006) tool. The pixel values of these frames are then converted into the corresponding user-defined format by the Data Format Conversion block to increase the data accuracy, and the formatted data is used to generate \(2 \times 2\) overlapped sub-matrices through the Moving Window Architecture (Sateesh Kumar et al. 2015; Bhairannawar et al. 2016) block. These sub-matrix pixel values are then processed by the Optimized Kogge–Stone Adder/Subtractor blocks to generate all four sub-bands (i.e., LL, LH, HL and HH). Among these bands, the HL, LH and HH bands produce some negative coefficients, which are removed by the Buffer (Bhairannawar et al. 2018b; Sateesh Kumar et al. 2015) block. The intermediate signals are then shifted using separate Shifter (Bhairannawar et al. 2018b; Sateesh Kumar et al. 2015; Palnitkar 2003) blocks to implement the corresponding division factors. However, the non-separable Haar Wavelet Transform requires the input sub-matrices to be non-overlapped. Therefore, the Optimized Controller and D_FF blocks are used in an interdependent manner to discard the intermediate values generated by the overlapped matrix pixels; the D_FF (D-Flipflop) block also implements the Downsample-by-2 operation on the intermediate values. The extra output signals clk_out and rst_out are used for proper synchronization. The entire architecture is synchronized to the specific video parameters through the Reset Controller block, which resets the entire architecture to its initial condition and makes it ready to process the next video frame in the same manner for video applications.

Fig. 1 Proposed architecture of optimized Haar wavelet transform

4.1 Data format conversion

From the Haar Wavelet Transform equations given in Eqs. (25)–(28), it is clear that fractional numbers appear during the computations. To handle this, the IEEE 754 format (IEEE 754 format for floating number 2020) is normally used, but it increases the hardware requirement to a great extent. Therefore, a Q-format (Singh and Srinivasan 2003) representation is used for the fractional part to provide a good trade-off between hardware utilization and data accuracy. The Data Format Conversion block converts each pixel value into this user-defined format, shown in Fig. 2, where the 16 bits on the MSB side are the integer part and the 8 bits on the LSB side are the fractional part of the fixed-point number. The concatenation operation of binary arithmetic (Palnitkar 2003) is used to convert the input pixel values into this user-defined format.
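As an illustration of this conversion (a software mirror of the concatenation, with hypothetical helper names, not the hardware description itself), an 8-bit pixel can be placed in the 16-bit integer field of the 24-bit Q16.8 word as follows.

def pixel_to_q16_8(pixel: int) -> int:
    """Concatenate a zeroed 8-bit fractional field below the pixel value."""
    assert 0 <= pixel <= 255
    return (pixel & 0xFFFF) << 8          # integer field in bits 23..8, fraction in bits 7..0

def q16_8_to_float(word: int) -> float:
    """Interpret a 24-bit Q16.8 word as a real value."""
    return (word >> 8) + (word & 0xFF) / 256.0

w = pixel_to_q16_8(57)
print(hex(w), q16_8_to_float(w))          # 0x3900 57.0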

Fig. 2 Bit arrangement of proposed user-defined data format

4.2 Optimized Kogge–Stone adder

The general Kogge–Stone Adder (Kogge and Stone 1973) architecture shows an increase in delay when larger bit widths are used for addition, which is not acceptable for many high-speed real-time applications. In such cases, the total number of bits is divided into a finite number of blocks, each consisting of 4 bits of data, which are then added using separate 4-bit Kogge–Stone adder blocks, and the carry continuity problem is resolved by adding some extra logic at the output stage (Tapasvi et al. 2015). Such an implementation is shown in Fig. 3, where a Modified Carry Correction block is used with the existing Kogge–Stone Adder architecture (Kogge and Stone 1973) to optimize its performance.

Fig. 3 Proposed optimized Kogge–Stone adder architecture (16-bit)

To reduce the carry propagation delay, the Modified Carry Correction block is implemented with basic logic gates in a parallel architecture. The equations used to design the Modified Carry Correction block are given in Eqs. (33)–(37) (Tapasvi et al. 2015).

$$\begin{aligned} S\_C_{4n+4}= & {} S_{4n+4}\oplus C_{4n+3}; \end{aligned}$$
(33)
$$ \begin{aligned} S\_C_{4n+5}= & {} S_{4n+5}\oplus \{S_{4n+4} \, \& \, C_{4n+3}\}; \end{aligned}$$
(34)
$$ \begin{aligned} S\_C_{4n+6}= & {} S_{4n+6}\oplus \{S_{4n+5}\, \& \,S_{4n+4} \, \& \, C_{4n+3}\}; \end{aligned}$$
(35)
$$ \begin{aligned} S\_C_{4n+7}= & {} S_{4n+7}\oplus \{S_{4n+6}\, \& \,S_{4n+5}\, \& \,S_{4n+4} \, \& \, C_{4n+3}\}; \end{aligned}$$
(36)
$$ \begin{aligned} C\_C_{4n+7}= & {} C_{4n+7}\oplus \{S_{4n+7}\, \& \,S_{4n+6}\, \& \,S_{4n+5}\, \& \,S_{4n+4} \, \& \, C_{4n+3}\}; \end{aligned}$$
(37)

where \(\{S\_C_{4n+4}, S\_C_{4n+5}, S\_C_{4n+6}, S\_C_{4n+7}\}\) \(\leftarrow \) corrected sums of the corresponding bits, \(\{S_{4n+4}, S_{4n+5}, S_{4n+6}, S_{4n+7}\}\) \(\leftarrow \) intermediate sums of the corresponding stage, \(\{C_{4n+3}, C_{4n+7}\}\) \(\leftarrow \) intermediate carries of the corresponding stage, \(C\_C_{4n+7}\) \(\leftarrow \) corrected carry of the corresponding stage, n \(\leftarrow \) stage index \((i.e., 0,1,2,\ldots ,(\frac{m}{4}-1))\), m \(\leftarrow \) number of bits used for the addition.
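A behavioral Python sketch of Eqs. (33)–(37) (assuming 4-bit sub-adders; signal names are illustrative, not the RTL) can clarify how the sum bits of a block are corrected using the carry produced by the previous block.

def carry_correct(s, c_prev, c_blk):
    """Apply Eqs. (33)-(37) to one 4-bit block.

    s      : intermediate sum bits [S_{4n+4}, S_{4n+5}, S_{4n+6}, S_{4n+7}]
    c_prev : carry C_{4n+3} from the previous 4-bit block
    c_blk  : intermediate carry C_{4n+7} of this block
    """
    s_c0 = s[0] ^ c_prev                                  # Eq. (33)
    s_c1 = s[1] ^ (s[0] & c_prev)                         # Eq. (34)
    s_c2 = s[2] ^ (s[1] & s[0] & c_prev)                  # Eq. (35)
    s_c3 = s[3] ^ (s[2] & s[1] & s[0] & c_prev)           # Eq. (36)
    c_c  = c_blk ^ (s[3] & s[2] & s[1] & s[0] & c_prev)   # Eq. (37)
    return [s_c0, s_c1, s_c2, s_c3], c_c

# Example: the previous block produced a carry of 1 and this block's raw sum is 0b0111 (LSB first).
print(carry_correct([1, 1, 1, 0], 1, 0))  # ([0, 0, 0, 1], 0), i.e. the corrected value 0b1000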

The structure of the Modified Carry Correction block is given in Fig. 4, where a parallel architecture is used to reduce the carry propagation delay as much as possible without significantly increasing the hardware utilization.

Fig. 4 Proposed hardware architecture of modified carry correction

4.3 Optimized Kogge–Stone subtractor

In binary arithmetic, negative numbers are represented in 2’s complement format (Roth 1992). As a result, any subtraction can be realised using addition only, and can be written as

$$\begin{aligned} y= & {} a - b; \end{aligned}$$
(38)
$$\begin{aligned} y= & {} a + (- b); \end{aligned}$$
(39)
$$\begin{aligned} y= & {} a + \{({\sim }b) + 1\}; \end{aligned}$$
(40)

where \(\{a,\,b\}\) \(\leftarrow \) Input Signals, y \(\leftarrow \) Output Signal.
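Assuming the 24-bit word width used elsewhere in this design, the following minimal sketch shows Eq. (40) realised with an unsigned adder; in the architecture, the Optimized Kogge–Stone Adder plays the role of that adder.

WIDTH = 24
MASK = (1 << WIDTH) - 1

def subtract_via_add(a: int, b: int) -> int:
    """y = a + (~b + 1) within a fixed word width, per Eqs. (38)-(40)."""
    return (a + ((~b & MASK) + 1)) & MASK

print(subtract_via_add(100, 36))   # 64
print(subtract_via_add(36, 100))   # 16777152, the 24-bit 2's-complement encoding of -64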

The general binary subtractor architecture (Roth 1992) is used to implement the Optimized Kogge–Stone Subtractor, where the normal adder block is replaced by the Optimized Kogge–Stone Adder block, which increases the operating speed and moderately reduces the hardware requirements.

4.4 Optimized controller

The block diagram of the Optimized Controller is shown in Fig. 5; it consists of the novel Clock Dividers and novel Reset Counter blocks. This block controls the operation of the entire architecture by controlling its intermediate datapath, in proper synchronization, with the help of the D_FF block. The use of a simple architectural model designed with basic logic elements simplifies the entire architecture compared to the existing design (Bhairannawar et al. 2016), in which a complex Finite State Machine (FSM) model is used.

Fig. 5 Proposed optimized controller architecture

4.4.1 Clock dividers

The Clock Dividers block generates the different control signals through separate clock-division-like logic; it mainly consists of two blocks, namely Clock Divider 1 and Clock Divider 2.

  1.

    Clock Divider 1: Clock Divider 1 is basically a simple clock divider circuit with a division factor of 2, used together with the D_FF blocks to perform the Downsample-by-2 operation of the wavelet algorithm (Mallat 2008). A D-Flipflop and NOT gate arrangement (Roth 1992) is used to implement the Clock Divider 1 block on the FPGA.

  2.

    Clock Divider 2: The Moving Window Architecture (Sateesh Kumar et al. 2015; Bhairannawar et al. 2016) block generates \(2 \times 2\) sub-matrices from the serial data. However, to calculate proper sub-band coefficients, the input sub-matrices must be non-overlapped (Mallat 2008). To generate non-overlapped image sub-matrices, the novel Clock Divider 2 and D_FF blocks are used synchronously: the Clock Divider 2 block generates the control signal (‘en’), which is then used by the D_FF block to eliminate the overlapped data at the output side. The algorithm used to build this block is shown in Algorithm 1, and a small behavioral sketch of the divider logic is given after it.

Algorithm 1
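Since Algorithm 1 appears only as a figure, the following behavioral Python sketch (illustrative names and toy data, not the authors' VHDL) conveys the intent: a toggle flip-flop halves the input clock (Clock Divider 1), and the same divide-by-2 pattern produces an enable ‘en’ that keeps every other \(2 \times 2\) window, shown here along one dimension only.

def clock_divider_1(cycles: int):
    """Toggle flip-flop behaviour: the output inverts on every input clock edge."""
    q = 0
    for _ in range(cycles):
        q ^= 1              # D = not Q, clocked by the input clock
        yield q

def keep_non_overlapped(windows):
    """Emulate the 'en' control: keep every second window so windows no longer overlap."""
    en = 0
    for w in windows:
        en ^= 1             # 'en' toggles once per window
        if en:
            yield w

print(list(clock_divider_1(6)))                             # [1, 0, 1, 0, 1, 0]
print(list(keep_non_overlapped(['w0', 'w1', 'w2', 'w3'])))  # ['w0', 'w2']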

4.4.2 Reset counter

The entire architecture needs some time to perform all the calculations and generate the proper output bands, and it is also essential to synchronize the output frequency with the display devices, because the output frequency of any wavelet transform is a fixed fraction of the input frequency. For this frequency synchronization, the novel Reset Counter and AND gate blocks are used. The algorithm used to design the Reset Counter is shown in Algorithm 2, followed by a small behavioral sketch.

Algorithm 2
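Algorithm 2 is likewise shown only as a figure; one plausible behavioral reading (the latency value below is purely an illustrative assumption) is a counter that holds a gating signal low until the architecture has produced valid data, after which the AND gate passes the synchronized signal.

def reset_counter(latency_cycles: int, total_cycles: int):
    """Hold a gating signal low for an assumed pipeline latency, then release it."""
    count = 0
    for _ in range(total_cycles):
        if count < latency_cycles:
            count += 1
            yield 0          # outputs not yet valid
        else:
            yield 1          # gate opens; an AND gate passes the synchronized signal

# Toy example: with an assumed latency of 4 cycles, the gate opens from cycle 5 onward.
print(list(reset_counter(4, 8)))  # [0, 0, 0, 0, 1, 1, 1, 1]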

5 FPGA implementation

In this section, the implementation results of the proposed architecture and its sub-blocks are discussed. For hardware implementation, a Digilent Spartan-6 (xc6slx45-3csg324) FPGA (Datasheet of Digilent ATLYS FPGA board 2020) is used, and the architecture is described in standard VHDL (Roth 2006) and simulated and synthesized with the Xilinx ISE 14.7 Design Suite (Xilinx ISE In-Depth Tutorial 2020) using the balanced synthesis option. The images and video frames are then fed to the proposed architecture in real time using the Simulink/System Generator (Karris 2006) tool with MATLAB (Gonzalez et al. 2009) through the standard interfacing procedure.

5.1 Optimized Kogge–Stone adder

The hardware utilization of the Optimized Kogge–Stone Adder for different input bit widths, including the Modified Carry Correction blocks, is given in Table 1. From Table 1, it can be seen that the delay and hardware requirements do not increase rapidly with the input bit width, which helps achieve a faster adder for larger bit sizes.

Table 1 Hardware utilizations of different size of optimized Kogge–Stone Adder

5.2 Optimized Kogge–Stone subtractor

The hardware utilization of the Optimized Kogge–Stone Subtractor for different input bit widths is given in Table 2, from which it can be seen that the delay and hardware utilization do not increase rapidly with the bit width, which helps achieve a faster subtractor for larger bit sizes.

Table 2 Hardware utilizations of different size of Optimized Kogge–Stone Subtractor

5.3 Optimized Haar wavelet transform

The hardware utilization of the two-dimensional non-separable Optimized Haar Wavelet Transform and its internal components is given in Table 3, in which 24-bit Optimized Kogge–Stone Adders and Subtractors are used to obtain optimum results with optimum hardware utilization.

Table 3 Hardware utilizations of Optimized Haar DWT with its individual components

The Technology Schematic and FPGA Schematic of the Optimized Haar Wavelet generated after synthesis are shown in Fig. 6a, b respectively. The Technology Schematic shows how the designed architecture is mapped onto the FPGA building blocks (LUTs, Slice Registers, LUT-FF pairs etc.), while the FPGA Schematic shows how effectively the design is placed and routed inside the FPGA under all design constraints. Both diagrams are indirectly related to the area and resource utilization at hardware level when the design is implemented at chip level using various back-end VLSI techniques (Weste and Eshraghian 1994).

Fig. 6 Generated schematic diagrams of the proposed optimized Haar DWT after synthesis

The four sub-bands (LL, LH, HL and HH) generated by the proposed architecture for both standard videos (Standard Video Database 2020) and standard images (Gonzalez and Woods 2008; Jayaraman et al. 2009) are given in Figs. 7, 8, 9 and 10 for different video frames and images respectively. In all cases, it can be seen that a small portion of the edges is distorted, mainly due to the use of overlapped image sub-matrices for the internal processing. The processed video frames of the Sample 1 video are shown in Figs. 7 and 8, where the sub-bands of the 36th frame are shown in Fig. 7 and those of the 75th frame in Fig. 8.

Fig. 7 Generated sub-bands by the optimized Haar DWT architecture for the 36th frame of a standard video

Fig. 8 Generated sub-bands by the optimized Haar DWT architecture for the 75th frame of a standard video

In a similar manner, the sub-bands of the standard images (Lena and Mandrill) are shown in Figs. 9 and 10 respectively.

Fig. 9 Generated sub-bands by the optimized Haar DWT architecture for the Lena image

Fig. 10 Generated sub-bands by the optimized Haar DWT architecture for the Mandrill image

6 Performance analysis

The performance of the proposed Optimized Haar Wavelet Transform architecture is compared with various existing architectures in different aspects to assess its overall performance.

6.1 Performance parameters

The performance parameters used here are mainly divided into two categories: Data Accuracy and Hardware Parameters.

6.1.1 Data accuracy

The data accuracy of any image or video processing system depends mainly on the accuracy of all the pixels present in the image or video frame. Normally, frame or image comparisons are used to check the data accuracy of the system; they indicate the quality of the processed image with respect to the input image or frame in terms of a mathematical parameter and therefore have a large impact on assessing the accuracy of image and video processing systems (Bhairannawar et al. 2016, 2018b). In this paper, the Euclidean Distance is used to compare the data accuracy of frames and images. The equation used to calculate the Euclidean Distance between two images/frames of the same size (Basics of Euclidean Distance for Image Processing 2020) is given in Eq. (41) as

$$\begin{aligned} Euclidean \,Distance=\sqrt{{\sum _{i=0}^{(M-1)}}{\sum _{j=0}^{(N-1)}}{{(I_{i,j} - O_{i,j})}^{2}}} \end{aligned}$$
(41)

where I \(\leftarrow \) Input image/frame pixel values, O \(\leftarrow \) Output image/frame pixel values, \(\{M, N\}\) \(\leftarrow \) Image/frame size in both direction (XY) respectively, \(\{i, j\}\) \(\leftarrow \) Integer values.
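As a reference for how Eq. (41) is applied, the following short Python sketch (using NumPy; a software check only, outside the hardware flow) computes the Euclidean Distance between two equal-sized grayscale images.

import numpy as np

def euclidean_distance(img_in: np.ndarray, img_out: np.ndarray) -> float:
    """Eq. (41): square root of the summed squared pixel differences."""
    assert img_in.shape == img_out.shape
    diff = img_in.astype(np.float64) - img_out.astype(np.float64)
    return float(np.sqrt(np.sum(diff ** 2)))

# Toy example with two 2x2 "images".
a = np.array([[10, 20], [30, 40]])
b = np.array([[11, 20], [30, 43]])
print(euclidean_distance(a, b))  # sqrt(1 + 9) = 3.1622...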

6.1.2 Hardware parameters

The hardware parameters of an architecture depend directly on the design efficiency of that architecture, for example the total area required on the silicon wafer, the maximum operating frequency and the power consumption. Accordingly, the main hardware parameters such as Slice LUTs, Occupied Slices, LUT-FF Pairs, Path Delay, Slice Registers, Memory and Maximum Frequency (Wolf 2004) are considered for comparison.

6.2 Performance comparison

In this section, the performance of the proposed architecture is compared with existing architectures to evaluate its effectiveness.

6.2.1 Data accuracy

The Haar Wavelet Transform of an \((M{\times }N)\) frame or image produces four sub-bands of size \((\frac{M}{2}\times \frac{N}{2})\). Among those sub-bands, only the LL band contains the major frame data and can be viewed as a compressed version of the input frame. To compare the data accuracy of the frame or image generated by the proposed architecture with the existing implementation, the input frame or image is first resized to the size of the generated LL band \((\frac{M}{2}\times \frac{N}{2})\), and then the Euclidean Distances to this resized frame or image are calculated separately for the output of the proposed architecture and for the output of the existing software (MATLAB) implementation (Gonzalez et al. 2009). The Euclidean Distance values for different standard frames and images are tabulated in Tables 4 and 5 respectively, where \(\mid {Difference}\mid \) (Bhairannawar et al. 2018a, b) is calculated using Eq. (42) as

$$\begin{aligned} \mid {Difference}\mid =\mid {{A} - {B}}\mid \end{aligned}$$
(42)

where A \(\leftarrow \) LL-Band generated by the hardware architecture, B \(\leftarrow \) LL-Band generated by the existing software algorithm.

From Tables 4 and 5, it can be seen that the Euclidean Distance of the hardware-generated LL band is very close to that of the existing software-generated LL band (Gonzalez et al. 2009), resulting in a very small, almost negligible \(\mid {Difference}\mid \) value. This shows that the proposed architecture produces results very close to the actual results. The small differences are due to the truncation error (Roth 2006; Weste and Eshraghian 1994) that normally occurs in any fixed-point binary arithmetic.

Table 4 Euclidean distance comparisons of the LL band of proposed Haar DWT with actual LL band for different standard video frames
Table 5 Euclidean distance comparisons of the LL band of proposed Haar DWT with actual LL band for different standard images

6.2.2 Hardware comparisons

The hardware parameters explained in Sect. 6.1.2 are used to compare the proposed Optimized Kogge–Stone Adder and Optimized Haar Wavelet Transform with the corresponding existing techniques in terms of hardware utilization.

  1.

    Optimized Kogge–Stone Adder: The comparison of the proposed 8-bit Kogge–Stone Adder with existing designs is given in Table 6, and the comparisons for the proposed 16-bit and 32-bit Kogge–Stone Adders are given in Tables 7 and 8 respectively. From the tables, it can be observed that the proposed architecture requires fewer hardware resources and has a shorter delay than the existing designs, which makes it more efficient. The main reason for this improvement is the Modified Carry Correction architecture, which reduces the worst-case path delay (Smith et al. 2004; Weste and Eshraghian 1994) without increasing the area utilization.

Table 6 Hardware comparisons of the existing and proposed 8-bit Kogge–Stone Adder
Table 7 Hardware comparisons of the existing and proposed 16-bit Kogge–Stone Adder
Table 8 Hardware comparisons of the existing and proposed 32-bit Kogge–Stone Adder
  2.

    Optimized Haar Wavelet Transform: The comparison of the hardware utilization of the proposed Haar DWT with existing designs is given in Table 9. From Table 9, it is clear that the proposed Haar Wavelet Transform architecture requires fewer hardware resources and can operate at a higher speed than the existing designs, which makes it superior. The main reason for this superiority is the optimization of each sub-block of the architecture through various techniques: the constant division factors are replaced by corresponding shifters, the existing Kogge–Stone Adder/Subtractor architecture is optimized by the Modified Carry Correction block, and the Controller architecture is optimized through the use of basic gates.

Table 9 Hardware comparisons of the existing and proposed Haar DWT

Limitation: The entire architecture is designed to operate on specific video parameters, such as the frame size and frame rate discussed in this paper. To process videos with a different frame size or frame rate, the proposed architecture must be modified according to the required parameter values.

7 Conclusion

In this paper, an efficient FPGA architecture to implement the Haar wavelet transform for image and video processing is proposed, optimized using the Moving Window Architecture, Modified Kogge–Stone Adder/Subtractor, Optimized Controller, Buffer and Shifter blocks. The Modified Carry Correction block, through its parallel architecture, is used to build the Optimized Kogge–Stone Adder/Subtractor block. The Optimized Controller is modelled using a Counter and Clock Dividers to reduce the architectural complexity and hardware requirements, and for the same reason the dividers are replaced by shifters. The Reset Controller block makes this architecture suitable for video processing by resetting it after each frame. To generate nearly accurate results, sufficient bit widths with Q-notation are used at the intermediate stages. In future work, a high-resolution camera will be interfaced with the FPGA so that real-time high-speed video can be captured and processed directly by this architecture with some modifications.