1 Introduction

Continuous improvements in the technologies used to build computers have recently made possible the manufacturing of extremely low-cost general-purpose single-board computing devices. Nowadays, one can buy one of these tiny computers for a few dollars and make it run the Windows 10 or Ubuntu Linux operating systems [1]. Today, one of the most renowned examples of these single-board computers is the Raspberry Pi family of devices. Although the initial aim of these devices was to promote the teaching of basic computer science in schools [2] and developing countries [3], the recent appearance of single-board computers (SBCs) with multicore CPU chips has attracted the interest of a multitude of projects trying to take advantage of their extremely low energy consumption and cost-performance ratio (e.g., in data centers [4]).

Raspberry Pi SBCs are priced between US$20 and US$35. The development of the Raspberry Pi SBCs has gone through several generations. The first generation (Raspberry Pi 1) was released in February 2012 in a basic model A and a higher-specification model B. The A+ and B+ models were released a year later. All these Raspberry Pi's had a single-core ARM-based CPU chip, which restricted their use to that of toy computers. The Raspberry Pi 2 model B was released in February 2015 and its main novelty was the introduction of a 4-core ARM-based CPU chip, which brought a sixfold performance increase compared to its predecessor. The latest family member, the Raspberry Pi 3 model B, was launched in February 2016 and adds wireless connectivity (2.4 GHz WiFi 802.11n and Bluetooth 4.1).

In the last few years, a very attractive area of research involves the proposal and evaluation of different transform functions that may overcome the limitations that the discrete cosine transform (DCT) used by MPEG-2 presents for some particular types of video. Wavelet techniques have recently generated much interest in applied areas, and the wavelet transform has mainly been applied to images. Several coders have been developed using the 2D wavelet transform [5,6,7]. The latest image compression standard, JPEG-2000 [8, 9], is also based on the 2D wavelet transform with a dyadic mother wavelet transform. The 3D wavelet transform has also been applied to video compression. Since one of the three dimensions can be regarded as time, Chen and Pearlman developed a three-dimensional subband coding scheme to code video sequences [10], and later improved it with an embedded wavelet video coder using 3D set partitioning in hierarchical trees (SPIHT) [11]. Today, the MPEG-4 standard [12, 13] supports an ad-hoc tool for encoding textures and still images based on a wavelet algorithm. In a previous work [14], we presented the implementation of a lossy encoder for medical video based on the 3D fast wavelet transform. This encoder achieves both high compression ratios and excellent quality, so that medical doctors can no longer find differences between the original and the reconstructed video. Furthermore, the execution time achieved by this encoder allows for real-time video compression and transmission. Regarding the parallelization of the 2D or 3D fast wavelet transform, we have previously shown that contemporary GPGPUs can achieve dramatic speed-ups [15,16,17,18]. Unfortunately, automatic parallelization methods do not yield enough benefits, while manual parallelization methods pose a considerable burden in software development [19].

In this work, we study the parallelization of the 3D fast wavelet transform (3D-FWT) on a cluster of Raspberry Pi 2 SBCs. In particular, we have implemented three different parallelization strategies for the 3D-FWT (namely, decomposition by rows, decomposition by rows and columns, and decomposition by video sequences) using both POSIX Threads [20] and MPI [21], and we have evaluated them on a cluster of 4 Raspberry Pi 2 SBCs. While the POSIX Threads versions are restricted to running on a single board, the MPI versions can span several Raspberry Pi 2 SBCs. Our results show that noticeable speed-ups can be obtained when all MPI processes or POSIX Threads run on the cores of a single Raspberry Pi 2 SBC. However, in the case of the MPI versions, we observe that performance drops drastically when the MPI processes are spread across several boards. The reason for this is the limited bandwidth that the onboard LAN port (rated as 10/100 Fast Ethernet, and driven through the onboard USB 2.0 bus) can deliver, which proves insufficient for the fine-grained, high-volume communication requirements of the parallelization strategies.

We have also considered the execution of the POSIX Threads and MPI versions on a very high-performance but power-hungry 4-core Intel Xeon CPU E5606, finding that the Raspberry Pi 2 SBC can complete the task with much lower total energy consumption (up to 4 times lower). Overall, our results show that the Raspberry Pi 2 SBC emerges as an appealing platform for supporting 3D-FWT-based applications with low-cost and energy efficiency requirements [22]. As a use case, we believe that Raspberry Pi 2 SBCs could be used in the development of different telemedicine specialities based on the 3D-FWT [23,24,25] in rural zones of developing countries.

The rest of the paper is organized as follows. Section 2 contains important background about the wavelet transform. The parallelization strategies that we have implemented and evaluated in this work are explained in Sect. 3. In Sect. 4, we give the details of the cluster of Raspberry Pi 2 SBCs used for the evaluation and, then, we present the results. Finally, Sect. 5 concludes the paper and draws some lines of future work.

2 Wavelet transform foundations

The basic idea of the wavelet transform is to represent any arbitrary function f as a weighted sum of functions, referred to as wavelets. Each wavelet is obtained from a mother wavelet function by conveniently scaling and translating it. The result is equivalent to decomposing f into different scale levels (or layers), where each level is then further decomposed with a resolution adapted to that level.

In a multiresolution analysis, there are two functions: the mother wavelet and its associated scaling function. Therefore, the wavelet transform can be implemented by quadrature mirror filters (QMF), \(G = g(n)\) and \(H = h(n)\), with \(n \in \mathbb{Z}\). \(H\) corresponds to a low-pass filter, and \(G\) is a high-pass filter. For a more detailed analysis of the relationship between wavelets and QMF, see [26].

The filters H and G correspond to one step in the wavelet decomposition. Given a discrete signal, s, with a length of \(2^{n}\), at each stage of the wavelet transformation the G and H filters are applied to the signal, and the filter outputs are downsampled by two, thus generating two bands, G and H. The process is then repeated on the H band to generate the next level of decomposition, and so on. This procedure is referred to as the 1D fast wavelet transform (1D-FWT).
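As an illustration of a single decomposition step, the following C sketch (not taken from the paper's implementation; the filter length of 4 and the periodic boundary handling are assumptions) convolves the signal with the H and G filters and downsamples the outputs by two:

```c
#include <stddef.h>

/* One 1D-FWT decomposition step (illustrative sketch). s has length n
 * (assumed a power of two); the n/2 low-pass (H band) and n/2 high-pass
 * (G band) outputs are written to low[] and high[]. */
void fwt1d_step(const float *s, size_t n,
                const float h[4], const float g[4],
                float *low, float *high)
{
    for (size_t i = 0; i < n / 2; i++) {        /* downsampling by two */
        float lo = 0.0f, hi = 0.0f;
        for (size_t k = 0; k < 4; k++) {
            size_t idx = (2 * i + k) % n;       /* periodic extension at the border */
            lo += h[k] * s[idx];                /* low-pass filter H */
            hi += g[k] * s[idx];                /* high-pass filter G */
        }
        low[i]  = lo;
        high[i] = hi;
    }
}
```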

It is not difficult to generalize the one-dimensional wavelet transform to the multi-dimensional case [26]. The wavelet representation of an image, \(f(x, y)\), can be obtained with a pyramid algorithm. It can be achieved by first applying the 1D-FWT to each row of the image and then to each column, that is, the G and H filters are applied to the image in both the horizontal and vertical directions. The process is repeated several times, as in the one-dimensional case. This procedure is referred to as the 2D fast wavelet transform (2D-FWT).
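A minimal sketch of one 2D-FWT decomposition level, reusing the hypothetical fwt1d_step above (the row-major image layout and power-of-two dimensions are assumptions), could look as follows:

```c
#include <stdlib.h>
#include <string.h>

void fwt1d_step(const float *s, size_t n, const float h[4], const float g[4],
                float *low, float *high);

/* One 2D-FWT level: horizontal filtering (1D-FWT on each row) followed by
 * vertical filtering (1D-FWT on each column), as in the pyramid algorithm. */
void fwt2d_level(float *img, size_t rows, size_t cols,
                 const float h[4], const float g[4])
{
    size_t maxdim = rows > cols ? rows : cols;
    float *low  = malloc(sizeof(float) * maxdim / 2);
    float *high = malloc(sizeof(float) * maxdim / 2);
    float *col  = malloc(sizeof(float) * rows);

    /* Horizontal filtering: 1D-FWT on each row. */
    for (size_t r = 0; r < rows; r++) {
        fwt1d_step(&img[r * cols], cols, h, g, low, high);
        memcpy(&img[r * cols], low, sizeof(float) * cols / 2);
        memcpy(&img[r * cols + cols / 2], high, sizeof(float) * cols / 2);
    }
    /* Vertical filtering: 1D-FWT on each column. */
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++)
            col[r] = img[r * cols + c];
        fwt1d_step(col, rows, h, g, low, high);
        for (size_t r = 0; r < rows / 2; r++) {
            img[r * cols + c] = low[r];                  /* H band */
            img[(r + rows / 2) * cols + c] = high[r];    /* G band */
        }
    }
    free(low); free(high); free(col);
}
```

The next decomposition level would then be applied to the low-pass sub-band only, as described above.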

As in the 2D case, we can generalize the one-dimensional wavelet transform to the three-dimensional case. Instead of one image, there is now a sequence of images. Thus, a new dimension emerges: time (t). The 3D-FWT can be computed by successively applying the 1D wavelet transform to the value of the pixels in each dimension.

Based on previous work [27], we consider Daubechies' \(W_{4}\) mother wavelet [28] as an appropriate baseline function. This selection determines the memory access pattern for the entire 3D-FWT process. Let us assume an input video sequence consisting of a number of frames (3rd dimension), each composed of a certain number of rows and columns (1st and 2nd dimensions). The 1D-FWT is performed across all frames for each row and column, that is, we apply the 1D-FWT \(rows \times cols\) times in the third dimension. The first 1D-FWT instance requires four elements to calculate the first output element of the reference video and the detailed video, these elements being the first pixel of the first four frames. The second output elements of the reference and detailed video are calculated using the first pixel of the third, fourth, fifth and sixth video frames. We continue this way until the entire reference and detailed videos have been calculated, and these data are the input used for the next stage.
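The access pattern just described can be sketched in C as follows (illustrative only; the frame-pointer layout and variable names are assumptions): for a given pixel position, each output element of the reference (low-pass) and detailed (high-pass) video is computed from four consecutive frames, advancing two frames at a time.

```c
#include <stddef.h>

/* Temporal (3rd-dimension) 1D-FWT for one pixel position (r, c).
 * video[f] points to frame f (row-major, `cols` columns); ref[] and det[]
 * receive frames/2 reference and detailed samples, respectively. */
void fwt_time_pixel(float **video, size_t frames, size_t cols,
                    size_t r, size_t c,
                    const float h[4], const float g[4],
                    float *ref, float *det)
{
    for (size_t i = 0; i < frames / 2; i++) {
        float lo = 0.0f, hi = 0.0f;
        for (size_t k = 0; k < 4; k++) {
            size_t f = (2 * i + k) % frames;   /* frames 1-4, then 3-6, ... */
            float p = video[f][r * cols + c];
            lo += h[k] * p;
            hi += g[k] * p;
        }
        ref[i] = lo;    /* reference (low-pass) video */
        det[i] = hi;    /* detailed (high-pass) video */
    }
}
```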

The 2D-FWT is performed \(frames\) times, i.e., once per frame. This transform is performed by first applying the 1D-FWT on each row (horizontal filtering) of the image, followed by the 1D-FWT on each column (vertical filtering). The fact that vertical filtering computes each column entirely before advancing to the next one forces the cache lines belonging to the first rows to be evicted before they can be reused. Meerwald et al. [29] propose two techniques to overcome this problem: row extension and aggregation (tiling).
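As a rough sketch of the aggregation/tiling idea (not the paper's exact implementation; the tile width and the out-of-place layout are assumptions), vertical filtering can advance a group of adjacent columns together so that the cache lines holding each visited row are reused before being evicted:

```c
#include <stddef.h>

#define TILE 32   /* assumed tile width (number of adjacent columns) */

/* Cache-aware vertical filtering: TILE adjacent columns are processed
 * together for each output row, so the rows touched by the 4-tap filter
 * stay in cache while the whole tile is computed. */
void vertical_filter_tiled(const float *img, float *out,
                           size_t rows, size_t cols,
                           const float h[4], const float g[4])
{
    for (size_t c0 = 0; c0 < cols; c0 += TILE) {
        size_t cend = (c0 + TILE < cols) ? c0 + TILE : cols;
        for (size_t i = 0; i < rows / 2; i++) {       /* output row */
            for (size_t c = c0; c < cend; c++) {      /* columns in the tile */
                float lo = 0.0f, hi = 0.0f;
                for (size_t k = 0; k < 4; k++) {
                    size_t rr = (2 * i + k) % rows;
                    lo += h[k] * img[rr * cols + c];
                    hi += g[k] * img[rr * cols + c];
                }
                out[i * cols + c] = lo;               /* low-pass band */
                out[(i + rows / 2) * cols + c] = hi;  /* high-pass band */
            }
        }
    }
}
```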

Other studies [30, 31] also report remarkable improvements when applying the tiling technique to the 2D-FWT algorithm. Our experience implementing the sequential 2D-FWT algorithm on a CPU revealed a reduction of almost an order of magnitude in the overall execution time with respect to a baseline version. This technique can be straightforwardly applied to the 3D case, where solid gains in execution time, ranging from 3\(\times \) to 7\(\times \), have also been reported [14].

In previous works [16, 17], we contributed a CUDA implementation of the 2D-FWT running more than 20 times faster than a sequential C version on a CPU, and more than twice as fast as optimized OpenMP and pthreads versions implemented on a quad-core CPU. We extended the analysis to the 3D-FWT scenario [15, 18], where speed-up factors were further improved using a new set of optimization techniques. We presented different alternatives and programming techniques for an efficient parallelization of the 3D fast wavelet transform on multicore CPUs and manycore GPUs. OpenMP and pthreads were used on the CPU to build different implementations to maximize parallelism, whereas CUDA and OpenCL were selected for data parallelism exploitation on the GPU with explicit memory handling. The speed-ups of the CUDA version on the Fermi architecture were the highest obtained, improving CPU execution times by factors ranging from 5.3\(\times \) to 7.4\(\times \) for different image sizes, and up to 81\(\times \) when communications are neglected. Meanwhile, OpenCL obtains solid gains in the range from 2\(\times \) for small frame sizes to 3\(\times \) for larger ones.

3 Parallelization on a cluster of Raspberry Pi’s

In this section, we present the three parallelization strategies that we have considered in this work. Each one of them has been implemented using both MPI [21] and POSIX Threads (PThreads) [20]. In the case of the POSIX Threads versions, all threads must execute on the same SBC, since there is no physical shared memory between the different nodes of the cluster. In contrast, MPI processes can run either on the same node or on different nodes.

3.1 Decomposition by rows

Our first parallelization strategy tries to take advantage of the fact that computations on the different rows of each frame in a video sequence can be done in parallel. To do so, in this approach the rows of each frame of the video sequence are divided into blocks among the participating MPI processes. Then, the 1D-FWT in the time and x dimensions is executed by each process over its assigned rows. Next, every process sends its results to the first process (in our case, the process with id 0), which is in charge of applying the 1D-FWT to all the columns of the frame. The low-pass outputs of the 3D-FWT are sent from the first process to the rest of them in order to apply the second iteration of the 3D-FWT, and so on. Figure 1 describes graphically how this first parallelization strategy proceeds when implemented using MPI.
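The structure of this strategy, when implemented with MPI, might look like the following sketch (hypothetical helper names; it assumes that the number of rows is divisible by the number of processes and that the data layout keeps each process's rows contiguous):

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helpers standing in for the actual 3D-FWT kernels. */
void fwt_time_and_rows(float *block, int nrows, int cols, int frames);
void fwt_columns(float *data, int rows, int cols, int frames);

/* Decomposition by rows (sketch): each rank filters its block of rows in the
 * time and x dimensions, then rank 0 gathers the partial results and filters
 * all the columns. */
void fwt3d_rows_mpi(float *video, int rows, int cols, int frames)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int block_rows  = rows / nprocs;
    int block_elems = block_rows * cols * frames;
    float *block = malloc(sizeof(float) * block_elems);

    /* Distribute the blocks of rows among the processes. */
    MPI_Scatter(video, block_elems, MPI_FLOAT,
                block, block_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* 1D-FWT in the time and x dimensions on the local block. */
    fwt_time_and_rows(block, block_rows, cols, frames);

    /* Rank 0 collects the partial results and filters all the columns. */
    MPI_Gather(block, block_elems, MPI_FLOAT,
               video, block_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        fwt_columns(video, rows, cols, frames);

    /* The low-pass output would then be scattered again for the next
     * 3D-FWT iteration (omitted here). */
    free(block);
}
```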

The parallelization procedure followed in the POSIX Threads version is exactly the same: each of the threads is responsible for computing a block of rows of each frame of the video sequence and, subsequently, just one of the threads (the one with id 0) applies the 1D-FWT to all the columns of the frame. These two phases are separated by means of a barrier synchronization. Once the thread with id 0 has finished, the second iteration of the 3D-FWT can proceed. Again, a barrier synchronization guarantees that the second iteration cannot start until thread 0 has gone through all the columns. Finally, it is important to note that while communications must be explicitly defined in the MPI version, they occur implicitly in the POSIX Threads version as the different threads read and write to the shared memory. This favors the POSIX Threads implementation, which is much easier to code.
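A minimal sketch of the corresponding PThreads worker (hypothetical names; the barrier is assumed to have been initialized with pthread_barrier_init for the chosen number of threads) illustrates the two phases and the barrier synchronizations described above:

```c
#include <pthread.h>

/* Shared state (sketch; names are assumptions). */
static pthread_barrier_t barrier;     /* initialized for `nthreads` threads */
static float *video;                  /* shared video data */
static int rows, cols, frames, nthreads;

void fwt_time_and_rows_block(float *v, int first_row, int nrows,
                             int cols, int frames);
void fwt_columns(float *v, int rows, int cols, int frames);

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int block = rows / nthreads;

    /* Phase 1: every thread filters its block of rows (time and x dims). */
    fwt_time_and_rows_block(video, id * block, block, cols, frames);
    pthread_barrier_wait(&barrier);

    /* Phase 2: only thread 0 filters all the columns. */
    if (id == 0)
        fwt_columns(video, rows, cols, frames);
    pthread_barrier_wait(&barrier);

    /* The next 3D-FWT iteration on the low-pass output would start here. */
    return NULL;
}
```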

Fig. 1 Decomposition by rows parallelization strategy

3.2 Decomposition by rows and columns

In an attempt to increase the amount of parallel work, we also consider the decomposition by rows and columns parallelization strategy. This is an evolution of the previous decomposition strategy. In particular, once the 1D-FWT has been applied in the time and x dimensions by each process, instead of sending the results to a particular process, in this alternative the results are exchanged among all participating processes. Once this is done, the 1D-FWT is applied by each process to a block of columns. Then, the low-pass output of the 3D-FWT is also exchanged among all the processes to apply the second iteration of the 3D-FWT, and so on. Figure 2 describes graphically how this second parallelization strategy proceeds. It is clear that in this case more work is done in parallel, but at the expense of increasing communication requirements.
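In MPI terms, the extra exchange step maps naturally onto an all-to-all collective, as in the following sketch (hypothetical helpers; a suitable packing of the exchanged blocks for MPI_Alltoall is assumed):

```c
#include <mpi.h>

void fwt_time_and_rows(float *block, int nrows, int cols, int frames);
void fwt_columns_block(float *block, int rows, int ncols, int frames);

/* Decomposition by rows and columns (sketch): after filtering its block of
 * rows, every rank exchanges results with every other rank so that each one
 * ends up owning a block of columns, which it then filters. */
void fwt3d_rows_cols_mpi(float *row_block, float *col_block,
                         int rows, int cols, int frames)
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1D-FWT in the time and x dimensions on the local block of rows. */
    fwt_time_and_rows(row_block, rows / nprocs, cols, frames);

    /* Exchange partial results among all processes (the expensive step). */
    int chunk = (rows / nprocs) * (cols / nprocs) * frames;
    MPI_Alltoall(row_block, chunk, MPI_FLOAT,
                 col_block, chunk, MPI_FLOAT, MPI_COMM_WORLD);

    /* 1D-FWT on the local block of columns. */
    fwt_columns_block(col_block, rows, cols / nprocs, frames);

    /* The low-pass output is exchanged again before the next iteration
     * (omitted here). */
}
```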

Fig. 2 Decomposition by rows and columns parallelization strategy

As in the decomposition by rows version, the implementation with POSIX Threads divides the blocks of rows of the video sequence among all the available threads. Each thread applies the 1D-FWT in the time and x dimensions over its assigned set of rows (first phase). Once all threads have finished, they apply the 1D-FWT to their assigned block of columns (second phase). Again, the first and second phases are separated using a barrier synchronization, and another barrier synchronization after the second phase guarantees that the second iteration (which is applied to the low-pass output) cannot start until the first one has finished. Developing this parallelization strategy using MPI requires many communication operations to be defined by the programmer, which results in increased complexity compared with the decomposition by rows version. However, the fact that all communications are implicit in the POSIX Threads version does not entail additional programming effort in this case.

3.3 Decomposition by video sequences

In an attempt to minimize communications, this parallelization strategy divides the video stream into several sequences, which are subsequently assigned to each one of the available MPI processes or threads. This way, each MPI process or thread can execute a number of iterations of the 3D-FWT independently of what the rest are doing and, thus, no communication is required between them. In this case, there are no meaningful differences in terms of complexity between the MPI and POSIX Threads versions, since both of them proceed in almost the same way. Figure 3 describes graphically how this third parallelization strategy works.
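Since no data are exchanged, the MPI (or PThreads) code for this strategy reduces to each worker transforming its own sub-sequence, as in this sketch (hypothetical names; the total number of frames is assumed to be divisible by the number of processes):

```c
#include <mpi.h>

void fwt3d_iterations(float *sequence, int nframes, int rows, int cols,
                      int iterations);

/* Decomposition by video sequences (sketch): the stream is split into
 * independent sequences of frames and each rank transforms its own
 * sub-sequence with no communication at all. */
void fwt3d_sequences_mpi(float *my_sequence, int total_frames,
                         int rows, int cols, int iterations)
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int my_frames = total_frames / nprocs;
    fwt3d_iterations(my_sequence, my_frames, rows, cols, iterations);
}
```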

Fig. 3 Decomposition by video sequences parallelization strategy

4 Experiments

We have built a cluster composed of four Raspberry Pi 2 Model B nodes. Each node contains a 900 MHz quad-core ARM Cortex-A7 CPU and 1 GB of RAM. The interconnection network of the cluster is 100 Mbps Ethernet. The operating system is Raspbian Wheezy. In our cluster, we have installed MPICH2 (v1.4.1) as the MPI library implementation for the MPI versions of the parallelization strategies previously described. Additionally, we have also implemented shared-memory versions using PThreads (POSIX Threads) [20].

4.1 Performance results

We have executed and measured the optimized 3D-FWT sequential version previously used in [15, 18] on a Raspberry Pi 2 for 2 iterations of the 3D fast wavelet transform. This will be the baseline throughout the evaluation. As already mentioned, all threads must execute on the same SBC in the case of the POSIX Threads versions, since there is no physical shared memory between the different nodes of the cluster. In contrast, MPI processes can run either on the same node or on different nodes. This way, we have configured different parallel execution scenarios for the three parallel versions of the 3D-FWT explained before. For the POSIX Threads versions, experiments have been done for 2 and 4 PThreads running on one Raspberry Pi 2 (remember that each board has a 4-core CPU). For the MPI versions, we consider parallel executions with 2 and 4 MPI processes, running on the same Raspberry Pi 2 board or on different boards. This allows us to evaluate the impact that the limited bandwidth of the onboard LAN port has on the results.

Figure 4 shows the execution times (in seconds) for 2 iterations of the 3D-FWT on an input video containing 64 frames of \(512\times 512\) pixels. In all cases, the execution times include the time needed to read the input video from disk and write the result back to it (bottom portion of each bar, labeled I/O). Note that the I/O time remains constant when all threads run on the same Raspberry Pi 2. In contrast, when the MPI processes spread across different Raspberry Pi 2 boards, the I/O time grows, because input and output data must go through the slow LAN interface. From left to right, we first present the result obtained for the sequential version (Sequential), then we show the results for the decomposition by rows parallelization strategy implemented both using MPI (Rows-MPI) and PThreads (Rows-PT), the decomposition by rows and columns parallelization strategy implemented both using MPI (Rows&Cols-MPI) and PThreads (Rows&Cols-PT), and the decomposition by video sequences parallelization strategy implemented both using MPI (Seq-MPI) and PThreads (Seq-PT). For each one of the parallelization strategies, we consider 2 and 4 MPI processes or PThreads (2P and 4P, respectively) running on one Raspberry Pi 2 and, only for the MPI versions, running on 2 or 4 Raspberry Pi 2 boards (2P-R and 4P-R, respectively), with just 1 MPI process per board in each case.

Fig. 4 Execution times for 2 iterations of the 3D fast wavelet transform

From the results, we can see that the three proposed parallelization strategies of the 3D-FWT obtain noticeable speed-ups when executed on a single Raspberry Pi 2 with respect to the optimized sequential version. This is regardless of whether we employ MPI or POSIX Threads in the parallelization. However, the executions on different Raspberry Pi 2 boards for the MPI versions, except in the case of the decomposition by video sequences parallelization strategy, show negative outcomes from the performance point of view. What makes the difference is that in the first case (all processes or threads run on the same CPU) all communications take place on the same board and, therefore, can be performed with low latency. The opposite happens when communications take place between several Raspberry Pi boards. In this case, all communications go through the low-bandwidth onboard LAN ports, which cancels out any performance advantage that parallel execution could entail. The lack of communications in the decomposition by video sequences strategy justifies its speed-ups of 1.25 and 1.59 for 2 and 4 MPI processes on several boards, respectively; although, as can be seen in Fig. 4, speed-ups are even greater in this case when all MPI processes run on the same SBC.

Taking a closer look at the results for one Raspberry Pi board, we see that the speed-ups of the decomposition by rows parallelization strategy for 2 and 4 MPI processes are 1.44 and 1.24, respectively. Similarly, the decomposition by rows and columns parallelization strategy obtains speed-ups of 1.50 and 1.16 for 2 and 4 MPI processes, respectively. Therefore, in both cases, fewer MPI processes and, hence, a smaller amount of communication among processes yield the best results. These two parallel versions do not scale due to the decreasing computation/communication ratio that they exhibit as the number of processes grows. Comparing the results for the two parallelization strategies, we observe that the increased potential parallelism exhibited by the decomposition by rows and columns strategy can only be exploited efficiently when the number of involved processes is small (2). In this case, the communication cost that it entails can be significantly amortized by the large amount of floating-point operations that each process performs. However, when the number of processes is 4, the larger communication requirements cannot be compensated for, and lower speed-ups are obtained compared with the decomposition by rows strategy.

Regarding the implementations using PThreads, the speed-ups of the decomposition by rows parallelization strategy for 2 and 4 PThreads are 1.26 and 1.37, respectively. Similarly, the decomposition by rows and columns parallelization strategy obtains speed-ups of 1.40 and 1.63 for 2 and 4 PThreads, respectively. Now, in both cases, a higher number of PThreads leads to better results. We have observed that, for the POSIX Threads versions, communications scale more gracefully than in the MPI cases and, thus, increasing the number of threads in these two parallel versions does not penalize performance as it did for the MPI versions. In this case, communications occur through the shared memory, as every thread reads data blocks previously written by other threads. This allows for efficient exploitation of the higher potential parallelism that the decomposition by rows and columns strategy has, and higher speed-ups are achieved compared with the decomposition by rows strategy for both 2 and 4 PThreads. Additionally, comparing these speed-ups with the ones obtained for the MPI versions, we observe that the highest speed-up corresponds to the PThreads implementation of the decomposition by rows and columns for executions with 4 PThreads (1.63).

Finally, we consider the results for the decomposition by video sequences parallelization strategy when all PThreads or MPI processes run on the same SBC. In this case, speed-ups of 1.90 and 2.84 are obtained for 2 and 4 MPI processes, respectively. In the same way, the speed-ups for 2 and 4 PThreads are 1.51 and 2.10, respectively. Therefore, in this case, for both implementations, a higher number of MPI processes or PThreads leads to better results. It is important to note that the main goal of the decomposition by video sequences parallelization strategy was to eliminate communications between MPI processes or PThreads by making them operate on different frames. As a result, it achieves better speed-up and scalability than the decomposition by rows and the decomposition by rows and columns parallelization strategies.

In general, the speed-up results obtained within one Raspberry Pi remain within the expected range for a class of algorithms like the 3D-FWT, with low arithmetic intensity, pseudo-regular access patterns and intricate loop traversals. In these cases, keeping communication requirements as low as possible is a key factor to obtain substantial gains.

4.2 Energy efficiency results

We also compare the results observed for the POSIX Threads and MPI versions when running on a Raspberry Pi 2 SBC with those obtained by a conventional, high-performance multicore processor. In particular, we consider the case of an Intel® Xeon® CPU E5606 running at 2.13 GHz. To do so, we measure the execution times of each version on each platform and multiply them by the reported Thermal Design Power (TDP) in each case (4 W for the Raspberry Pi 2 and 80 W for the Intel Xeon processor). In all cases, we consider 2 iterations of the 3D-FWT on an input video containing 64 frames of \(512\times 512\) pixels. The results are shown in Fig. 5.
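As an illustration of this metric, the estimated energy is simply \(E = t_{exec} \times TDP\), so the same workload costs \(E_{RPi2} = t_{RPi2} \times 4\) W on the Raspberry Pi 2 and \(E_{Xeon} = t_{Xeon} \times 80\) W on the Xeon; consequently, the Xeon would need to run 20 times faster than the Raspberry Pi 2 just to break even in energy terms.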

From left to right, we first present the results for the decomposition by rows parallelization strategy implemented both using MPI (Rows-MPI) and PThreads (Rows-PT), decomposition by rows and columns parallelization strategy implemented both using MPI (Rows&Cols-MPI) and PThreads (Rows&Cols-PT), and decomposition by video sequences parallelization strategy implemented both using MPI (Seq-MPI) and PThreads (Seq-PT). For each one of the parallelization strategies, we consider 2 and 4 MPI processes or PThreads running on one Raspberry Pi 2 (2T-RP2 and 4T-RP2, respectively), and running on the Intel Xeon (2T-Xeon and 4T-Xeon, respectively).

As can be seen in the figure, the results for the Raspberry Pi 2 are 3 to 4 times lower than those obtained for the Intel Xeon. This means that the Raspberry Pi 2 can do the job with much lower total energy consumption than its higher-performance counterpart. Overall, our results show that the Raspberry Pi 2 SBC emerges as an appealing platform for supporting 3D-FWT-based applications with low-cost and energy efficiency requirements.

Fig. 5 Execution time \(\times \) TDP for 2 iterations of the 3D fast wavelet transform on the Raspberry Pi 2 and Intel Xeon

5 Conclusions and future work

Current low-cost general-purpose single-board computer (SBC) devices are gaining increasing interest in research computing due to their very low cost-performance ratio and energy consumption. In this work, we have analyzed the potential of a cluster built from Raspberry Pi 2 SBCs as a viable platform for accelerating the execution of the 3D fast wavelet transform. To do so, we have implemented three parallelization strategies for the 3D fast wavelet transform using both MPI and POSIX Threads. We found that, although such a cluster can constitute a very appealing solution from the cost point of view, it is unable to improve performance beyond what is obtained when a single Raspberry Pi 2 is employed. That is, while noticeable speed-ups can be obtained when all MPI processes or PThreads run on the cores of a single Raspberry Pi 2 SBC, performance drops drastically when they spread across several boards. The reason for this is the limited bandwidth that the onboard LAN port (rated as 10/100 Fast Ethernet, and driven through the onboard USB 2.0 bus) can deliver, which proves insufficient for the fine-grained, high-volume communication requirements of the first two parallelization strategies. Finally, we have also considered the execution of the POSIX Threads and MPI versions on a very high-performance but power-hungry 4-core Intel Xeon CPU E5606, finding that the Raspberry Pi 2 SBC can complete the task with much lower total energy consumption (up to 4 times lower).

Our plans for future work include the possibility of using SBCs other than the Raspberry Pi for building the cluster. In particular, we have found that the Odroid C2 SBC may be an interesting alternative to consider since it provides, among other things, a Gigabit Ethernet LAN port, which would help alleviate the bandwidth limitations that our cluster configuration currently has.