1 Introduction

Continuous improvements in the technologies used to build computers have recently made possible the manufacturing of extremely low-cost general-purpose single-board computing devices. Nowadays, one can buy one of these tiny computers for a few dollars and make it run the Windows 10 or Ubuntu Linux operating systems [1]. Today, one of the most renowned examples of these single-board computers is the Raspberry Pi family of devices. Although the initial aim of these devices was to promote the teaching of basic computer science in schools [2] and developing countries [3], the recent appearance of single-board computers (SBCs) with multicore CPU chips has attracted the interest of a multitude of projects trying to take advantage of their extremely low energy consumption and cost-performance ratio (e.g., in data centers [4]).

Raspberry Pi SBCs are priced between US$20 and US$35. The development of the Raspberry Pi SBCs has gone through several generations. The first generation (Raspberry Pi 1) was released in February 2012 in a basic model A and a higher-specification model B. The A+ and B+ models were released a year later. All these Raspberry Pi's had a single-core ARM-based CPU chip, which restricted their use to that of toy computers. The Raspberry Pi 2 model B was released in February 2015 and its main novelty was the introduction of a 4-core ARM-based CPU chip, which brought a sixfold performance increase compared to its predecessor. The latest family member, the Raspberry Pi 3 model B, was launched in February 2016 and adds wireless connectivity (2.4 GHz WiFi 802.11n and Bluetooth 4.1).

In the last few years, a very attractive area of research involves the proposal and evaluation of different transform functions that may overcome the limitations that the discrete cosine transform (DCT) used by MPEG-2 presents for some particular types of video. Wavelet techniques have recently generated much interest in applied areas, and the wavelet transform has mainly been applied to images. Several coders have been developed using the 2D wavelet transform [5,6,7]. The latest image compression standard, JPEG-2000 [8, 9], is also based on the 2D wavelet transform with a dyadic mother wavelet transform. The 3D wavelet transform has also been applied to video compression. Since one of the three dimensions can be regarded as time, Chen and Pearlman developed a three-dimensional subband coding scheme to code video sequences [10], and later improved it with an embedded wavelet video coder using 3D set partitioning in hierarchical trees (SPIHT) [11]. Today, the MPEG-4 standard [12, 13] supports an ad-hoc tool for encoding textures and still images based on a wavelet algorithm. In a previous work [14], we presented the implementation of a lossy encoder for medical video based on the 3D fast wavelet transform. This encoder achieves both high compression ratios and excellent quality, so that medical doctors can no longer find differences between the original and the reconstructed video. Furthermore, the execution time achieved by this encoder allows for real-time video compression and transmission. Regarding the parallelization of the 2D or 3D fast wavelet transform, we have previously shown that contemporary GPGPUs can achieve dramatic speed-ups [15,16,17,18]. Unfortunately, automatic parallelization methods do not yield enough benefits, while manual parallelization methods pose a considerable burden in software development [19].

In this work, we study the parallelization of the 3D fast wavelet transform (3D-FWT) on a cluster of Raspberry Pi 2 SBCs. In particular, we have implemented three different parallelization strategies for the 3D-FWT (namely, decomposition by rows, decomposition by rows and columns, and decomposition by video sequences) using both POSIX Threads [20] and MPI [21], and we have evaluated them on a cluster of 4 Raspberry Pi 2 SBCs. While the POSIX Threads versions are restricted to running on a single board, the MPI versions can span several Raspberry Pi 2 SBCs. Our results show that noticeable speed-ups can be obtained when all MPI processes or POSIX Threads run on the cores of a single Raspberry Pi 2 SBC. However, in the case of the MPI versions, we observe that performance drops drastically when the MPI processes are spread across several boards. The reason for this is the limited bandwidth that the onboard LAN port (rated as 10/100 Fast Ethernet, and driven through the onboard USB 2.0 bus) can deliver, which proves insufficient for the fine-grained, high-volume communication requirements of the parallelization strategies.

We have also considered the execution of the POSIX Threads and MPI versions on a very high-performance but power-hungry 4-core Intel Xeon CPU E5606, finding that the Raspberry Pi 2 SBC can complete the task with much lower total energy consumption (up to 4 times lower). Overall, our results show that the Raspberry Pi 2 SBC emerges as an appealing platform for supporting 3D-FWT-based applications with low-cost and energy efficiency requirements [22]. As a use case, we believe that Raspberry Pi 2 SBCs could be used in the development of different telemedicine specialities based on the 3D-FWT [23,24,25] in rural zones of developing countries.

The rest of the paper is organized as follows. Section 2 contains important background about the wavelet transform. The parallelization strategies that we have implemented and evaluated in this work are explained in Sect. 3. In Sect. 4, we give the details of the cluster of Raspberry Pi 2 SBCs used for the evaluation and, then, we present the results. Finally, Sect. 5 concludes the paper and draws some lines of future work.

2 Wavelet transform foundations

The basic idea of the wavelet transform is to represent any arbitrary function f as a weighted sum of functions, referred to as wavelets. Each wavelet is obtained from a mother wavelet function by conveniently scaling and translating it. The result is equivalent to decomposing f into different scale levels (or layers), where each level is then further decomposed with a resolution adapted to that level.

In a multiresolution analysis, there are two functions: the mother wavelet and its associated scaling function. Therefore, the wavelet transform can be implemented by quadrature mirror filters (QMF), \(G = g(n)\) and \(H = h(n)\), with \(n \in \mathbb{Z}\). \(H\) corresponds to a low-pass filter, and \(G\) is a high-pass filter. For a more detailed analysis of the relationship between wavelets and QMF, see [26].

The filters H and G correspond to one step in the wavelet decomposition. Given a discrete signal, s, with a length of \(2^{n}\), at each stage of the wavelet transformation the G and H filters are applied to the signal, and the filter outputs are downsampled by two, thus generating two bands, G and H. The process is then repeated on the H band to generate the next level of decomposition, and so on. This procedure is referred to as the 1D fast wavelet transform (1D-FWT).
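As an illustration of a single decomposition step, the following C sketch (not taken from the paper's implementation; the filter length of 4 and the periodic boundary handling are assumptions) convolves the signal with the H and G filters and downsamples the outputs by two:

```c
#include <stddef.h>

/* One 1D-FWT decomposition step (illustrative sketch). s has length n
 * (assumed a power of two); the n/2 low-pass (H band) and n/2 high-pass
 * (G band) outputs are written to low[] and high[]. */
void fwt1d_step(const float *s, size_t n,
                const float h[4], const float g[4],
                float *low, float *high)
{
    for (size_t i = 0; i < n / 2; i++) {        /* downsampling by two */
        float lo = 0.0f, hi = 0.0f;
        for (size_t k = 0; k < 4; k++) {
            size_t idx = (2 * i + k) % n;       /* periodic extension at the border */
            lo += h[k] * s[idx];                /* low-pass filter H */
            hi += g[k] * s[idx];                /* high-pass filter G */
        }
        low[i]  = lo;
        high[i] = hi;
    }
}
```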

It is not difficult to generalize the one-dimensional wavelet transform to the multi-dimensional case [26]. The wavelet representation of an image, \(f(x, y)\), can be obtained with a pyramid algorithm. It can be achieved by first applying the 1D-FWT to each row of the image and then to each column, that is, the G and H filters are applied to the image in both the horizontal and vertical directions. The process is repeated several times, as in the one-dimensional case. This procedure is referred to as the 2D fast wavelet transform (2D-FWT).
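A minimal sketch of one 2D-FWT decomposition level, reusing the hypothetical fwt1d_step above (the row-major image layout and power-of-two dimensions are assumptions), could look as follows:

```c
#include <stdlib.h>
#include <string.h>

void fwt1d_step(const float *s, size_t n, const float h[4], const float g[4],
                float *low, float *high);

/* One 2D-FWT level: horizontal filtering (1D-FWT on each row) followed by
 * vertical filtering (1D-FWT on each column), as in the pyramid algorithm. */
void fwt2d_level(float *img, size_t rows, size_t cols,
                 const float h[4], const float g[4])
{
    size_t maxdim = rows > cols ? rows : cols;
    float *low  = malloc(sizeof(float) * maxdim / 2);
    float *high = malloc(sizeof(float) * maxdim / 2);
    float *col  = malloc(sizeof(float) * rows);

    /* Horizontal filtering: 1D-FWT on each row. */
    for (size_t r = 0; r < rows; r++) {
        fwt1d_step(&img[r * cols], cols, h, g, low, high);
        memcpy(&img[r * cols], low, sizeof(float) * cols / 2);
        memcpy(&img[r * cols + cols / 2], high, sizeof(float) * cols / 2);
    }
    /* Vertical filtering: 1D-FWT on each column. */
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++)
            col[r] = img[r * cols + c];
        fwt1d_step(col, rows, h, g, low, high);
        for (size_t r = 0; r < rows / 2; r++) {
            img[r * cols + c] = low[r];                  /* H band */
            img[(r + rows / 2) * cols + c] = high[r];    /* G band */
        }
    }
    free(low); free(high); free(col);
}
```

The next decomposition level would then be applied to the low-pass sub-band only, as described above.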

As in the 2D case, we can generalize the one-dimensional wavelet transform to the three-dimensional case. Instead of one image, there is now a sequence of images. Thus, a new dimension emerges: time (t). The 3D-FWT can be computed by successively applying the 1D wavelet transform to the value of the pixels in each dimension.

Based on previous work [27], we consider Daubechies' \(W_{4}\) mother wavelet [28] as an appropriate baseline function. This selection determines the memory access pattern for the entire 3D-FWT process. Let us assume an input video sequence consisting of a number of frames (3rd dimension), each composed of a certain number of rows and columns (1st and 2nd dimensions). The 1D-FWT is performed across all frames for each row and column, that is, we apply the 1D-FWT \(rows \times cols\) times in the third dimension. The first 1D-FWT instance requires four elements to calculate the first output element of the reference video and the detailed video, these elements being the first pixel of the first four frames. The second output elements of the reference and detailed video are calculated using the first pixel of the third, fourth, fifth and sixth video frames. We continue this way until the entire reference and detailed videos have been calculated, and these data are the input used for the next stage.
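The access pattern just described can be sketched in C as follows (illustrative only; the frame-pointer layout and variable names are assumptions): for a given pixel position, each output element of the reference (low-pass) and detailed (high-pass) video is computed from four consecutive frames, advancing two frames at a time.

```c
#include <stddef.h>

/* Temporal (3rd-dimension) 1D-FWT for one pixel position (r, c).
 * video[f] points to frame f (row-major, `cols` columns); ref[] and det[]
 * receive frames/2 reference and detailed samples, respectively. */
void fwt_time_pixel(float **video, size_t frames, size_t cols,
                    size_t r, size_t c,
                    const float h[4], const float g[4],
                    float *ref, float *det)
{
    for (size_t i = 0; i < frames / 2; i++) {
        float lo = 0.0f, hi = 0.0f;
        for (size_t k = 0; k < 4; k++) {
            size_t f = (2 * i + k) % frames;   /* frames 1-4, then 3-6, ... */
            float p = video[f][r * cols + c];
            lo += h[k] * p;
            hi += g[k] * p;
        }
        ref[i] = lo;    /* reference (low-pass) video */
        det[i] = hi;    /* detailed (high-pass) video */
    }
}
```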

The 2D-FWT is performed \(frames\) times, i.e., once per frame. This transform is performed by first applying the 1D-FWT on each row (horizontal filtering) of the image, followed by the 1D-FWT on each column (vertical filtering). The fact that vertical filtering computes each column entirely before advancing to the next one forces the cache lines belonging to the first rows to be evicted before they can be reused. Meerwald et al. [29] propose two techniques to overcome this problem: row extension and aggregation (tiling).
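As a rough sketch of the aggregation/tiling idea (not the paper's exact implementation; the tile width and the out-of-place layout are assumptions), vertical filtering can advance a group of adjacent columns together so that the cache lines holding each visited row are reused before being evicted:

```c
#include <stddef.h>

#define TILE 32   /* assumed tile width (number of adjacent columns) */

/* Cache-aware vertical filtering: TILE adjacent columns are processed
 * together for each output row, so the rows touched by the 4-tap filter
 * stay in cache while the whole tile is computed. */
void vertical_filter_tiled(const float *img, float *out,
                           size_t rows, size_t cols,
                           const float h[4], const float g[4])
{
    for (size_t c0 = 0; c0 < cols; c0 += TILE) {
        size_t cend = (c0 + TILE < cols) ? c0 + TILE : cols;
        for (size_t i = 0; i < rows / 2; i++) {       /* output row */
            for (size_t c = c0; c < cend; c++) {      /* columns in the tile */
                float lo = 0.0f, hi = 0.0f;
                for (size_t k = 0; k < 4; k++) {
                    size_t rr = (2 * i + k) % rows;
                    lo += h[k] * img[rr * cols + c];
                    hi += g[k] * img[rr * cols + c];
                }
                out[i * cols + c] = lo;               /* low-pass band */
                out[(i + rows / 2) * cols + c] = hi;  /* high-pass band */
            }
        }
    }
}
```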

Other studies [30, 31] also report remarkable improvements when applying the tiling technique to the 2D-FWT algorithm. Our experience implementing the sequential 2D-FWT algorithm on a CPU revealed a reduction of almost an order of magnitude in the overall execution time with respect to a baseline version. This technique can be straightforwardly applied to the 3D case, where solid gains in execution time, ranging from 3\(\times \) to 7\(\times \), have also been reported [14].

In previous works [16, 17], we contributed a CUDA implementation of the 2D-FWT running more than 20 times faster than a sequential C version on a CPU, and more than twice as fast as optimized OpenMP and pthreads versions implemented on a quad-core CPU. We extended the analysis to the 3D-FWT scenario [15, 18], where speed-up factors were further improved using a new set of optimization techniques. We presented different alternatives and programming techniques for an efficient parallelization of the 3D fast wavelet transform on multicore CPUs and manycore GPUs. OpenMP and pthreads were used on the CPU to build different implementations to maximize parallelism, whereas CUDA and OpenCL were selected for data parallelism exploitation on the GPU with explicit memory handling. The speed-ups of the CUDA version on the Fermi architecture were the highest obtained, improving CPU execution times by factors ranging from 5.3\(\times \) to 7.4\(\times \) for different image sizes, and up to 81\(\times \) when communications are neglected. Meanwhile, OpenCL obtains solid gains in the range from 2\(\times \) for small frame sizes to 3\(\times \) for larger ones.

3 Parallelization on a cluster of Raspberry Pi’s

In this section, we present the three parallelization strategies that we have considered in this work. Each one of them has been implemented using both MPI [21] and POSIX Threads (PThreads) [20]. In the case of the POSIX Threads versions, all threads must execute on the same SBC, since there is no physical shared memory between the different nodes of the cluster. In contrast, MPI processes can run either on the same node or on different nodes.

3.1 Decomposition by rows

Our first parallelization strategy tries to take advantage of the fact that computations on the different rows of each frame in a video sequence can be done in parallel. To do so, in this approach the rows of each frame of the video sequence are divided into blocks among the participating MPI processes. Then, the 1D-FWT in the time and x dimensions is executed by each process over its assigned rows. Next, every process sends its results to the first process (in our case, the process with id 0), which is in charge of applying the 1D-FWT to all the columns of the frame. The low-pass outputs of the 3D-FWT are sent from the first process to the rest of them in order to apply the second iteration of the 3D-FWT, and so on. Figure 1 describes graphically how this first parallelization strategy proceeds when implemented using MPI.
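The structure of this strategy, when implemented with MPI, might look like the following sketch (hypothetical helper names; it assumes that the number of rows is divisible by the number of processes and that the data layout keeps each process's rows contiguous):

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helpers standing in for the actual 3D-FWT kernels. */
void fwt_time_and_rows(float *block, int nrows, int cols, int frames);
void fwt_columns(float *data, int rows, int cols, int frames);

/* Decomposition by rows (sketch): each rank filters its block of rows in the
 * time and x dimensions, then rank 0 gathers the partial results and filters
 * all the columns. */
void fwt3d_rows_mpi(float *video, int rows, int cols, int frames)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int block_rows  = rows / nprocs;
    int block_elems = block_rows * cols * frames;
    float *block = malloc(sizeof(float) * block_elems);

    /* Distribute the blocks of rows among the processes. */
    MPI_Scatter(video, block_elems, MPI_FLOAT,
                block, block_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* 1D-FWT in the time and x dimensions on the local block. */
    fwt_time_and_rows(block, block_rows, cols, frames);

    /* Rank 0 collects the partial results and filters all the columns. */
    MPI_Gather(block, block_elems, MPI_FLOAT,
               video, block_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        fwt_columns(video, rows, cols, frames);

    /* The low-pass output would then be scattered again for the next
     * 3D-FWT iteration (omitted here). */
    free(block);
}
```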

The parallelization procedure followed in the POSIX Threads version is exactly the same: each of the threads is responsible for computing a block of rows of each frame of the video sequence and, subsequently, just one of the threads (the one with id 0) applies the 1D-FWT to all the columns of the frame. These two phases are separated by means of a barrier synchronization. Once the thread with id 0 has finished, the second iteration of the 3D-FWT can proceed. Again, a barrier synchronization guarantees that the second iteration cannot start until thread 0 has gone through all the columns. Finally, it is important to note that while communications must be explicitly defined in the MPI version, they occur implicitly in the POSIX Threads version as the different threads read and write to the shared memory. This favors the POSIX Threads implementation, which is much easier to code.
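A minimal sketch of the corresponding PThreads worker (hypothetical names; the barrier is assumed to have been initialized with pthread_barrier_init for the chosen number of threads) illustrates the two phases and the barrier synchronizations described above:

```c
#include <pthread.h>

/* Shared state (sketch; names are assumptions). */
static pthread_barrier_t barrier;     /* initialized for `nthreads` threads */
static float *video;                  /* shared video data */
static int rows, cols, frames, nthreads;

void fwt_time_and_rows_block(float *v, int first_row, int nrows,
                             int cols, int frames);
void fwt_columns(float *v, int rows, int cols, int frames);

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int block = rows / nthreads;

    /* Phase 1: every thread filters its block of rows (time and x dims). */
    fwt_time_and_rows_block(video, id * block, block, cols, frames);
    pthread_barrier_wait(&barrier);

    /* Phase 2: only thread 0 filters all the columns. */
    if (id == 0)
        fwt_columns(video, rows, cols, frames);
    pthread_barrier_wait(&barrier);

    /* The next 3D-FWT iteration on the low-pass output would start here. */
    return NULL;
}
```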

Fig. 1 Decomposition by rows parallelization strategy

3.2 Decomposition by rows and columns

In an attempt to increase the amount of parallel work, we also consider the decomposition by rows and columns parallelization strategy. This is an evolution of the previous decomposition strategy. In particular, once the 1D-FWT has been applied in the time and x dimensions by each process, instead of sending the results to a particular process, in this alternative the results are exchanged among all participating processes. Once this is done, the 1D-FWT is applied by each process to a block of columns. Then, the low-pass output of the 3D-FWT is also exchanged among all the processes to apply the second iteration of the 3D-FWT, and so on. Figure 2 describes graphically how this second parallelization strategy proceeds. It is clear that in this case more work is done in parallel, but at the expense of increasing communication requirements.
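In MPI terms, the extra exchange step maps naturally onto an all-to-all collective, as in the following sketch (hypothetical helpers; a suitable packing of the exchanged blocks for MPI_Alltoall is assumed):

```c
#include <mpi.h>

void fwt_time_and_rows(float *block, int nrows, int cols, int frames);
void fwt_columns_block(float *block, int rows, int ncols, int frames);

/* Decomposition by rows and columns (sketch): after filtering its block of
 * rows, every rank exchanges results with every other rank so that each one
 * ends up owning a block of columns, which it then filters. */
void fwt3d_rows_cols_mpi(float *row_block, float *col_block,
                         int rows, int cols, int frames)
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1D-FWT in the time and x dimensions on the local block of rows. */
    fwt_time_and_rows(row_block, rows / nprocs, cols, frames);

    /* Exchange partial results among all processes (the expensive step). */
    int chunk = (rows / nprocs) * (cols / nprocs) * frames;
    MPI_Alltoall(row_block, chunk, MPI_FLOAT,
                 col_block, chunk, MPI_FLOAT, MPI_COMM_WORLD);

    /* 1D-FWT on the local block of columns. */
    fwt_columns_block(col_block, rows, cols / nprocs, frames);

    /* The low-pass output is exchanged again before the next iteration
     * (omitted here). */
}
```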

Fig. 2 Decomposition by rows and columns parallelization strategy

As in the decomposition by rows version, the implementation with POSIX Threads divides the blocks of rows of the video sequence among all the available threads. Each thread applies the 1D-FWT in the time and x dimensions over its assigned set of rows (first phase). Once all threads have finished, they apply the 1D-FWT to their assigned block of columns (second phase). Again, the first and second phases are separated using a barrier synchronization, and another barrier synchronization after the second phase guarantees that the second iteration (which is applied to the low-pass output) cannot start until the first one has finished. Developing this parallelization strategy using MPI requires many communication operations to be defined by the programmer, which results in increased complexity compared with the decomposition by rows version. However, the fact that all communications are implicit in the POSIX Threads version does not entail additional programming effort in this case.

3.3 Decomposition by video sequences

In an attempt to minimize communications, this parallelization strategy divides the video stream into several sequences, which are subsequently assigned to each one of the available MPI processes or threads. This way, each MPI process or thread can execute a number of iterations of the 3D-FWT independently of what the rest are doing and, thus, no communication is required between them. In this case, there are no meaningful differences in terms of complexity between the MPI and POSIX Threads versions, since both of them proceed in almost the same way. Figure 3 describes graphically how this third parallelization strategy works.
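Since no data are exchanged, the MPI (or PThreads) code for this strategy reduces to each worker transforming its own sub-sequence, as in this sketch (hypothetical names; the total number of frames is assumed to be divisible by the number of processes):

```c
#include <mpi.h>

void fwt3d_iterations(float *sequence, int nframes, int rows, int cols,
                      int iterations);

/* Decomposition by video sequences (sketch): the stream is split into
 * independent sequences of frames and each rank transforms its own
 * sub-sequence with no communication at all. */
void fwt3d_sequences_mpi(float *my_sequence, int total_frames,
                         int rows, int cols, int iterations)
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int my_frames = total_frames / nprocs;
    fwt3d_iterations(my_sequence, my_frames, rows, cols, iterations);
}
```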

Fig. 3 Decomposition by video sequences parallelization strategy

4 Experiments

We have built a cluster composed of four Raspberry Pi 2 Model B nodes. Each node contains a 900 MHz quad-core ARM Cortex-A7 CPU and 1 GB of RAM. The interconnection network of the cluster is 100 Mbps Ethernet. The operating system is Raspbian Wheezy. In our cluster, we have installed MPICH2 (v1.4.1) as the MPI library implementation for the MPI versions of the parallelization strategies previously described. Additionally, we have also implemented shared-memory versions using PThreads (POSIX Threads) [20].

4.1 Performance results

We have executed and measured the optimized 3D-FWT sequential version previously used in [15, 18] on a Raspberry Pi 2 for 2 iterations of the 3D fast wavelet transform. This will be the baseline throughout the evaluation. As already mentioned, all threads must execute on the same SBC in the case of the POSIX Threads versions, since there is no physical shared memory between the different nodes of the cluster. In contrast, MPI processes can run either on the same node or on different nodes. This way, we have configured different parallel execution scenarios for the three parallel versions of the 3D-FWT explained before. For the POSIX Threads versions, experiments have been done for 2 and 4 PThreads running on one Raspberry Pi 2 (remember that each board has a 4-core CPU). For the MPI versions, we consider parallel executions with 2 and 4 MPI processes, running on the same Raspberry Pi 2 board or on different boards. This allows us to evaluate the impact that the limited bandwidth of the onboard LAN port has on the results.

Figure 4 shows the execution times (in seconds) for 2 iterations of the 3D-FWT on an input video containing 64 frames of \(512\times 512\) pixels. In all cases, the execution times include the time needed to read the input video from disk and write the result back to it (bottom portion of each bar, labeled I/O). Note that the I/O time remains constant when all threads run on the same Raspberry Pi 2. In contrast, when the MPI processes spread across different Raspberry Pi 2 boards, the I/O time grows, because input and output data must go through the slow LAN interface. From left to right, we first present the result obtained for the sequential version (Sequential), then we show the results for the decomposition by rows parallelization strategy implemented both using MPI (Rows-MPI) and PThreads (Rows-PT), the decomposition by rows and columns parallelization strategy implemented both using MPI (Rows&Cols-MPI) and PThreads (Rows&Cols-PT), and the decomposition by video sequences parallelization strategy implemented both using MPI (Seq-MPI) and PThreads (Seq-PT). For each one of the parallelization strategies, we consider 2 and 4 MPI processes or PThreads (2P and 4P, respectively) running on one Raspberry Pi 2 and, only for the MPI versions, running on 2 or 4 Raspberry Pi 2 boards (2P-R and 4P-R, respectively), with just 1 MPI process per board in each case.

Fig. 4 Execution times for 2 iterations of the 3D fast wavelet transform

From the results, we can see that the three proposed parallelization strategies of the 3D-FWT obtain noticeable speed-ups when executed on a single Raspberry Pi 2 with respect to the optimized sequential version. This is regardless of whether we employ MPI or POSIX Threads in the parallelization. However, the executions on different Raspberry Pi 2 boards for the MPI versions, except in the case of the decomposition by video sequences parallelization strategy, show negative outcomes from the performance point of view. What makes the difference is that in the first case (all processes or threads run on the same CPU) all communications take place on the same board and, therefore, can be performed with low latency. The opposite happens when communications take place between several Raspberry Pi boards. In this case, all communications go through the low-bandwidth onboard LAN ports, which cancels out any performance advantage that parallel execution could entail. The lack of communications in the decomposition by video sequences strategy justifies its speed-ups of 1.25 and 1.59 for 2 and 4 MPI processes on several boards, respectively; although, as can be seen in Fig. 4, speed-ups are even greater in this case when all MPI processes run on the same SBC.

Taking a closer look at the results for one Raspberry Pi board, we see that the speed-ups of the decomposition by rows parallelization strategy for 2 and 4 MPI processes are 1.44 and 1.24, respectively. Similarly, the decomposition by rows and columns parallelization strategy obtains speed-ups of 1.50 and 1.16 for 2 and 4 MPI processes, respectively. Therefore, in both cases, fewer MPI processes and, hence, a smaller amount of communication among processes yield the best results. These two parallel versions do not scale due to the decreasing computation/communication ratio that they exhibit as the number of processes grows. Comparing the results for the two parallelization strategies, we observe that the increased potential parallelism exhibited by the decomposition by rows and columns strategy can only be exploited efficiently when the number of involved processes is small (2). In this case, the communication cost that it entails can be significantly amortized by the large amount of floating-point operations that each process performs. However, when the number of processes is 4, the larger communication requirements cannot be compensated for, and lower speed-ups are obtained compared with the decomposition by rows strategy.

Regarding the implementations using PThreads, the speed-ups of the decomposition by rows parallelization strategy for 2 and 4 PThreads are 1.26 and 1.37, respectively. Similarly, the decomposition by rows and columns parallelization strategy obtains speed-ups of 1.40 and 1.63 for 2 and 4 PThreads, respectively. Now, in both cases, a higher number of PThreads leads to better results. We have observed that, for the POSIX Threads versions, communications scale more gracefully than in the MPI cases and, thus, increasing the number of threads in these two parallel versions does not penalize performance as it did for the MPI versions. In this case, communications occur through the shared memory, as every thread reads data blocks previously written by other threads. This allows for efficient exploitation of the higher potential parallelism that the decomposition by rows and columns strategy has, and higher speed-ups are achieved compared with the decomposition by rows strategy for both 2 and 4 PThreads. Additionally, comparing these speed-ups with the ones obtained for the MPI versions, we observe that the highest speed-up corresponds to the PThreads implementation of the decomposition by rows and columns for executions with 4 PThreads (1.63).

Finally, we consider the results for the decomposition by video sequences parallelization strategy when all PThreads or MPI processes run on the same SBC. In this case, speed-ups of 1.90 and 2.84 are obtained for 2 and 4 MPI processes, respectively. In the same way, the speed-ups for 2 and 4 PThreads are 1.51 and 2.10, respectively. Therefore, in this case, for both implementations, a higher number of MPI processes or PThreads leads to better results. It is important to note that the main goal of the decomposition by video sequences parallelization strategy was to eliminate communications between MPI processes or PThreads by making them operate on different frames. As a result, it achieves better speed-up and scalability than the decomposition by rows and the decomposition by rows and columns parallelization strategies.

In general, the speed-up results obtained within one Raspberry Pi remain within the expected range for a class of algorithms like the 3D-FWT, with low arithmetic intensity, pseudo-regular access patterns and intricate loop traversals. In these cases, keeping communication requirements as low as possible is a key factor to obtain substantial gains.

4.2 Energy efficiency results

We also compare the results observed for the POSIX Threads and MPI versions when running on a Raspberry Pi 2 SBC with those obtained by a conventional, high-performance multicore processor. In particular, we consider the case of an Intel® Xeon® CPU E5606 running at 2.13 GHz. To do so, we measure the execution times of each version on each platform and multiply them by the reported Thermal Design Power (TDP) in each case (4 W for the Raspberry Pi 2 and 80 W for the Intel Xeon processor). In all cases, we consider 2 iterations of the 3D-FWT on an input video containing 64 frames of \(512\times 512\) pixels. The results are shown in Fig. 5.
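As an illustration of this metric, the estimated energy is simply \(E = t_{exec} \times TDP\), so the same workload costs \(E_{RPi2} = t_{RPi2} \times 4\) W on the Raspberry Pi 2 and \(E_{Xeon} = t_{Xeon} \times 80\) W on the Xeon; consequently, the Xeon would need to run 20 times faster than the Raspberry Pi 2 just to break even in energy terms.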

From left to right, we first present the results for the decomposition by rows parallelization strategy implemented both using MPI (Rows-MPI) and PThreads (Rows-PT), decomposition by rows and columns parallelization strategy implemented both using MPI (Rows&Cols-MPI) and PThreads (Rows&Cols-PT), and decomposition by video sequences parallelization strategy implemented both using MPI (Seq-MPI) and PThreads (Seq-PT). For each one of the parallelization strategies, we consider 2 and 4 MPI processes or PThreads running on one Raspberry Pi 2 (2T-RP2 and 4T-RP2, respectively), and running on the Intel Xeon (2T-Xeon and 4T-Xeon, respectively).

As can be seen in the figure, the results for the Raspberry Pi 2 are 3 to 4 times lower than those obtained for the Intel Xeon. This means that the Raspberry Pi 2 can do the job with much lower total energy consumption than its higher-performance counterpart. Overall, our results show that the Raspberry Pi 2 SBC emerges as an appealing platform for supporting 3D-FWT-based applications with low-cost and energy efficiency requirements.

Fig. 5 Execution time \(\times \) TDP for 2 iterations of the 3D fast wavelet transform on the Raspberry Pi 2 and Intel Xeon

5 Conclusions and future work

Current low-cost general-purpose single-board computer (SBC) devices are gaining increasing interest in research computing due to their very low cost-performance ratio and energy consumption. In this work, we have analyzed the potential of a cluster built from Raspberry Pi 2 SBCs as a viable platform for accelerating the execution of the 3D fast wavelet transform. To do so, we have implemented three parallelization strategies for the 3D fast wavelet transform using both MPI and POSIX Threads. We found that, although such a cluster can constitute a very appealing solution from the cost point of view, it is unable to improve performance beyond what is obtained when a single Raspberry Pi 2 is employed. That is, while noticeable speed-ups can be obtained when all MPI processes or PThreads run on the cores of a single Raspberry Pi 2 SBC, performance drops drastically when they spread across several boards. The reason for this is the limited bandwidth that the onboard LAN port (rated as 10/100 Fast Ethernet, and driven through the onboard USB 2.0 bus) can deliver, which proves insufficient for the fine-grained, high-volume communication requirements of the first two parallelization strategies. Finally, we have also considered the execution of the POSIX Threads and MPI versions on a very high-performance but power-hungry 4-core Intel Xeon CPU E5606, finding that the Raspberry Pi 2 SBC can complete the task with much lower total energy consumption (up to 4 times lower).

Our plans for future work include the possibility of using SBCs other than the Raspberry Pi for building the cluster. In particular, we have found that the Odroid C2 SBC may be an interesting alternative to consider since it provides, among other things, a Gigabit Ethernet LAN port, which would help alleviate the bandwidth limitations that our cluster configuration currently has.