1 Introduction

Wavelets are widely used to solve image processing problems in various fields of science and technology, such as denoising [1], color image processing [2], and video analysis [3]. However, modern imaging systems have not kept pace with the rapid growth in the amount of digital visual information that must be processed, stored, and transmitted. Many approaches have been developed to speed up the computations in image processing methods. The authors of [4] focus on the evolution and application of various hardware architectures. A fast decomposition algorithm based on a different representation, called product convolution extension, is proposed in [5]. This decomposition can be estimated efficiently when multiple impulse responses of the operator are available. A new simple adjacent sum method for constructing multidimensional wavelets is developed in [6]. This method provides a systematic way to build multidimensional non-separable wavelet filter banks from two 1D low-pass filters, one of which is interpolating, to increase image processing speed. The authors of [7] describe an asymmetric 2D Haar transform and extend it to wavelet packets containing an exponentially large number of bases. A basis selection algorithm is also proposed for finding the optimal basis in wavelet packets. Various modern GPU optimization strategies for implementing the discrete wavelet transform, such as the use of shared memory, registers, warp shuffling instructions, and thread- and instruction-level parallelism, are presented in [8]. A mixed memory structure for the Haar transform is proposed in which a multilevel transform can be performed with a single launch of the combined kernel. The paper [9] proposes a new algorithm for the 2D discrete wavelet transform of high-resolution images on low-cost visual sensors and Internet of Things nodes. The main advantages of the proposed segmented modified fractional wavelet filter are reduced computational complexity and power consumption compared with modern low-memory 2D discrete wavelet transform methods.

However, all of these methods are based on pixel-by-pixel image processing. The Winograd method (WM) reduces image processing time through group pixel processing. The processed image is assembled from fragments of a certain size, which reduces the number of multiplications at the cost of an increased number of additions.

The purpose of this paper is to accelerate wavelet image processing using WM on modern microelectronic devices.

2 Wavelet Image Processing Using the Direct Implementation and the Winograd Method

Wavelet filtering using direct implementation (DI) has the form

$${I}_{2}\left(x\right)={\sum }_{i=0}^{f-1}{I}_{1}\left(x-i\right)K\left(i\right),$$
(1)

where \({I}_{1}\) and \({I}_{2}\) are the original and processed 2D images, respectively, and \(x\) is the row number of the pixel processed by the \(f\)-tap wavelet filter \(K\). The wavelet transform extracts local information about the signal in both frequency and time. High computational complexity is a significant disadvantage of this transform. The scheme of 1D wavelet filtering of an image fragment using DI is shown in Fig. 1a, where \({S}_{I}\) is the original image fragment, \(L\) and \(H\) are the low- and high-pass wavelet filters, and \({P}_{A}\) and \({P}_{D}\) are the processed image pixels carrying the approximation and detail information of the image, respectively.
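As an illustration of Eq. (1), the following Python sketch applies a low-pass and a high-pass filter to a 1D row of pixels by direct summation. The function name `direct_filter`, the use of NumPy, and the choice of the Daubechies 4-tap coefficients are assumptions made for this example; the text does not specify a particular wavelet or implementation. Only positions where the filter fully overlaps the row are computed.

```python
import numpy as np

def direct_filter(row, K):
    """Eq. (1): I2(x) = sum_{i=0}^{f-1} I1(x - i) * K(i), for x = f-1 .. n-1."""
    f, n = len(K), len(row)
    out = np.zeros(n - f + 1)
    for x in range(f - 1, n):
        # f multiplications and f - 1 additions per output pixel
        out[x - (f - 1)] = sum(row[x - i] * K[i] for i in range(f))
    return out

# Assumed example filters: Daubechies 4-tap low-pass L and its
# quadrature-mirror high-pass H (not specified in the text).
s3 = np.sqrt(3.0)
L = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
H = np.array([L[3], -L[2], L[1], -L[0]])

row = np.array([10.0, 12.0, 11.0, 13.0, 40.0, 42.0, 41.0, 43.0])
P_A = direct_filter(row, L)   # approximation pixels
P_D = direct_filter(row, H)   # detail pixels
```

Each channel spends \(f\) multiplications on every output pixel here, which is the cost that group processing of pixels aims to lower.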

Image filtering using WM in matrix form [10] can be presented as

$$Z={A}^{T}\left(\left(GK\right)\odot \left({B}^{T}S\right)\right),$$
(2)

where: \(Z\) is the processed image fragment of size \(z\times 1\); \(K\) is the wavelet filter of size \(f\times 1\); \(S\) is the original image fragment of size \(s\times 1\), where \(s=z+f-1\); \({A}^{T}\), \(G\), \({B}^{T}\) are the transformation matrices of sizes \(z\times s\), \(s\times f\), \(s\times s\), respectively; \(\odot\) is the element-wise matrix multiplication. Algorithms for obtaining the matrices \({A}^{T}\), \(G\), \({B}^{T}\) are described in [11]. WM is denoted as \(F(z,f)\). During wavelet image processing, digital filtering is performed on two computational channels corresponding to the low- and high-frequency wavelet filters. The products \(GL\) and \(GH\) are calculated in advance for a specific wavelet. The product of the transformation matrix \({B}^{T}\) and \(S\) can be computed before the calculations are split into two channels because it does not depend on the choice of wavelet. Next, the element-wise multiplications of \({B}^{T}S\) by \(GL\) and \(GH\) and the multiplications of the results by the transformation matrix \({A}^{T}\) are performed on the two computational channels. The scheme of 1D wavelet filtering of an image fragment using WM is shown in Fig. 1b, where \(S\) is the original image fragment, \(L\) and \(H\) are the low- and high-frequency wavelet filters, \({B}^{T}\), \({A}^{T}\), \(G\) are the transformation matrices, and \({S}_{A}\) and \({S}_{D}\) are the processed image fragments carrying the approximation and detail information of the image, respectively.
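The two-channel flow of Eq. (2) can be sketched in Python for the small case \(F(2,3)\). The matrices below are the commonly used \(F(2,3)\) transformation matrices from the fast convolution literature, not the matrices for the 4-, 6-, and 8-tap wavelets considered in this paper, and the 3-tap filters `L` and `H` are placeholders rather than real wavelet filters. These particular matrices evaluate the sliding dot product \(\sum_{k} S(j+k)K(k)\); the convolution form of Eq. (1) corresponds to passing the filter in reversed order. The function name `winograd_two_channel` is hypothetical.

```python
import numpy as np

# Commonly used F(2,3) transformation matrices (illustrative case only).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_two_channel(S, L, H):
    """Eq. (2) on two channels: Z = A^T((G K) * (B^T S)) for K = L and K = H."""
    GL, GH = G @ L, G @ H      # depend only on the wavelet, so precomputable
    BS = BT @ S                # shared by both channels
    S_A = AT @ (GL * BS)       # s = 4 element-wise multiplications
    S_D = AT @ (GH * BS)       # s = 4 element-wise multiplications
    return S_A, S_D

S = np.array([1.0, 2.0, 3.0, 4.0])     # original fragment, s = z + f - 1 = 4
L = np.array([0.25, 0.5, 0.25])        # placeholder low-pass filter
H = np.array([-0.25, 0.5, -0.25])      # placeholder high-pass filter
S_A, S_D = winograd_two_channel(S, L, H)
```

In this small case each channel uses 4 element-wise multiplications to produce 2 pixels, whereas direct summation would use 6.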

Fig. 1. The schemes of 1D wavelet filtering of an image fragment using: a) the direct implementation; b) the Winograd method

The results on accelerating wavelet image processing using WM are presented below.

3 Acceleration of Wavelet Image Processing Using the Winograd Method

The time complexity of wavelet filtering using WM \(F(z,f)\) depends on \(z\) and \(f\) and on the choice of points \({s}_{0},{s}_{1},\ldots,{s}_{n-2},{s}_{n-1}\). These values determine the form of the transformation matrices \({A}^{T}\), \(G\), \({B}^{T}\). The set of Lagrange polynomial points \(L=0,1,-1,2,-2,4,-4,\ldots,{2}^{l},-{2}^{l},{2}^{l+1},-{2}^{l+1},\ldots,\infty\) was used to construct the Vandermonde matrix \(V\) and the matrices \({A}^{T}\), \(G\), \({B}^{T}\) [11]. The cases of 4-, 6-, and 8-tap wavelets and processed image fragments of size \(z=2,3,4,5,6,7\) are considered. Table 1 is based on the transformation matrices and contains the numbers of multiplications and additions required for wavelet filtering of images using DI and WM. The table values are obtained as follows.

  1. The number of DI multiplications is equal to the number of wavelet filter coefficients.

  2. The number of WM multiplications is equal to twice the number of pixels in the image fragment being processed, \(S\) (that is, \(2s\)).

  3. The number of DI additions is 2 less than the number of DI multiplications.

  4. The number of WM basic additions is equal to the number of additions of the nonzero elements of the matrices \({A}^{T}\) (counted twice) and \({B}^{T}\), taken row by row.

  5. The number of WM complementary additions is equal to the number of ones in the binary notation of a matrix element, reduced by 1 and summed over all elements of the matrices \({A}^{T}\) (counted twice) and \({B}^{T}\).

  6. The total number of additions is equal to the sum of the basic and complementary additions.

  7. WM produces several pixel values of the processed image in one iteration. Obtaining a single pixel brightness value requires the entire iteration, as does obtaining the entire fragment. To compare the computational complexity of the methods correctly, we introduce the pixel specific value (PSV). PSV is the number of required operations (multiplications or additions) divided by the number of pixels in the processed image fragment.
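The multiplication rules and the PSV can be checked with a short Python sketch. It rests on two readings that are assumptions rather than statements from the text: the DI count covers the \(f\) coefficients of both filters \(L\) and \(H\) (giving \(2f\)), and the WM count is \(2s=2(z+f-1)\), one element-wise product per element of \(S\) in each channel. The function names are hypothetical. Under these readings the sketch reproduces the 72.9% reduction the paper reports for \(F(6,8)\).

```python
def multiplication_counts(z, f):
    """Return (DI multiplications, DI additions, WM multiplications)."""
    di_mul = 2 * f              # assumption: f coefficients in each of L and H
    di_add = di_mul - 2         # 2 less than the number of DI multiplications
    wm_mul = 2 * (z + f - 1)    # assumption: 2s with s = z + f - 1
    return di_mul, di_add, wm_mul

def psv(operations, pixels):
    """Pixel specific value: operations per pixel of the processed fragment."""
    return operations / pixels

def multiplication_reduction(z, f):
    """Relative PSV reduction of WM F(z, f) versus DI, multiplications only."""
    di_mul, _, wm_mul = multiplication_counts(z, f)
    # Assumption: one DI iteration yields one pixel per channel, WM yields z.
    return 1.0 - psv(wm_mul, z) / psv(di_mul, 1)

print(round(100 * multiplication_reduction(6, 8), 1))  # 72.9
```

The addition counts depend on the actual entries of \({A}^{T}\) and \({B}^{T}\) and are therefore not reproduced here.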

Table 1. The number of additions and multiplications in wavelet filtering of an image fragment using the direct implementation and the Winograd method

Table 1 shows that the greatest reduction in PSV for multiplications is observed for the 8-tap wavelet using WM \(F(6,8)\). The computational complexity decreases asymptotically by 72.9% compared with DI. The asymptotic estimate does not take addition operations into account, since their complexity is an order of magnitude lower than that of multiplication. This estimate is predominantly theoretical and may correlate poorly with the results obtained in the practical design of wavelet image processing devices. Therefore, the unit-gate model (UGM) was used to calculate the operating time of a microelectronic device. UGM is a method for the theoretical evaluation of device characteristics based on counting the basic logic elements “and” and “or” [12]. The response time of one such element is taken as a conventional unit (CU). We now describe the principles of the calculations used in the theoretical estimation of the delay of wavelet filtering devices according to the schemes in Fig. 1a and Fig. 1b for DI and WM, respectively. All multiplications are performed in parallel in both methods.

Multiplications by the matrices \({B}^{T}\) and \({A}^{T}\) can be replaced by shift and addition operations. The number of ones in the binary representation of each element of the matrices \({A}^{T}\) and \({B}^{T}\) was calculated to determine the number of terms in the rows of these matrices (Table 2). The products \(GL\) and \(GH\) are computed in advance. The products of \({B}^{T}S\) with \(GL\) and \(GH\) are realized by element-wise multiplications. Multiplications and additions are implemented using a generalized multiplier (GM) and a multi-operand adder (MOA), respectively [13]. The delays of GM and MOA for \(k\)-bit numbers on computing devices are \(6.8\,{\mathrm{log}}_{2}\,N+2\,{\mathrm{log}}_{2}\,k+4\) and \(8.8\,{\mathrm{log}}_{2}\,k+4\), respectively [14], where \(N\) is the largest number of elements in the rows of the matrices \({A}^{T}\) and \({B}^{T}\), and \(k\) is both the image color depth and the bit-width of the coefficients of the wavelet filters used. The calculations are performed for \(k=8\). The results of the device delay evaluation for wavelet image processing using DI and WM are presented in Table 2.
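A minimal Python sketch of this bookkeeping follows. It counts the shift-and-add terms in each matrix row as the number of ones in the binary representation of the absolute values of the entries, takes \(N\) as the largest row count, and evaluates the two delay expressions quoted above. The integer \(F(2,3)\) matrices serve only as a small illustration and are not the matrices used for Table 2; all function names are hypothetical, and the sketch does not attempt to reproduce how the paper combines the delays into the Table 2 totals.

```python
import math

def row_terms(matrix):
    """Shift-and-add terms per row: ones in the binary form of |entry|."""
    return [sum(bin(abs(v)).count("1") for v in row) for row in matrix]

def delay_expr_nk(N, k):
    """First quoted expression: 6.8*log2(N) + 2*log2(k) + 4, in CU."""
    return 6.8 * math.log2(N) + 2 * math.log2(k) + 4

def delay_expr_k(k):
    """Second quoted expression: 8.8*log2(k) + 4, in CU."""
    return 8.8 * math.log2(k) + 4

# Illustrative integer matrices for F(2,3) (assumption, not from Table 2).
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]

N = max(row_terms(AT) + row_terms(BT))   # largest number of terms in a row
k = 8                                    # bit-width used in the paper
d_nk = delay_expr_nk(N, k)
d_k = delay_expr_k(k)
```

An entry such as 5 (binary 101) contributes two shifted terms to its row, which is why larger fragment sizes with larger matrix entries increase \(N\).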

Table 2. UGM-based evaluation results of the device delay for wavelet processing of 8-bit image using the direct implementation and the Winograd method

The following conclusions are drawn based on the results in Table 2.

  1. According to UGM, WM reduced the device delay of wavelet image processing by up to 66.9%, 73.6%, and 68.8% for 4-, 6-, and 8-tap wavelets, respectively, compared with DI.

  2. The larger the size \(z\) of the processed image fragments, the less time is spent on wavelet filtering. However, the larger the transformation matrices, the more difficult it is to construct them and to design WM on modern microelectronic devices.

  3. According to UGM, the greatest reduction in device delay as the size of the resulting image fragments increases is achieved at \(z=2\) and \(z=3\). For example, for the 4-tap wavelet, the device delay is reduced by \(55.0-39.3=15.7\) CU and \(39.3-26.9=12.4\) CU at \(z=2\) and \(z=3\), respectively, but only by 4.3 CU and 4.5 CU at \(z=4\) and \(z=5\), respectively.

4 Conclusion

A scheme for 1D wavelet image processing using WM has been developed, and the image filtering time has been compared with that of DI. WM reduced the computational complexity of wavelet image processing asymptotically by up to 72.9%, depending on the size of the filters used and of the processed image fragments. According to UGM, WM reduced the device delay of wavelet image processing by up to 66.9%, 73.6%, and 68.8% for 4-, 6-, and 8-tap wavelets, respectively. The larger the size \(z\) of the processed image fragments, the less time is spent on wavelet filtering. However, the larger the transformation matrices, the more difficult it is to construct them and to design WM on modern microelectronic devices. The obtained results can be used to improve the performance of wavelet image processing devices for image compression and denoising. Hardware implementation of WM on FPGAs and ASICs to accelerate wavelet image processing is a promising direction for further research.