1 Introduction

Multicore systems equipped with homogeneous or heterogeneous processing elements are now common in modern computers. For example, world-leading semiconductor chip makers, such as Intel, IBM, AMD, MediaTek, and Qualcomm, add data accelerators, e.g., digital signal processors and graphics processing units, to improve system performance for certain types of data computation jobs. Many programming paradigms have been proposed to facilitate programming for such heterogeneous platforms, such as OpenMP [6], OpenMPI [14], OpenCL [5], OpenVX [15], and CUDA [13].

While these programming facilities help abstract the underlying hardware to a certain level, they are simply too complicated for average programmers or algorithm developers. A more intuitive environment can therefore greatly reduce the burden of parallel programming. Several visual programming frameworks have been proposed to simplify general programming tasks, such as OpenBlocks [7], NetLogo [12], StarLogo The Next Generation (TNG) [11], Scratch [10], Blockly [3], and Scratch Blocks [16]. In addition, several tools have been designed to support specific programming languages, such as Android App Inventor for the Android system [9] and Hopscotch for Apple iOS [4]. However, few tools are available to ease the pain of parallel programming on multicore platforms.

This paper presents a graphical programming tool for CUDA and OpenCL called GPUBlocks. A prototype implementation of GPUBlocks has been built upon the open-source visual programming frameworks OpenBlocks [7] and ArduBlock [1]. Major building blocks have been developed so that CUDA and OpenCL programs can be generated automatically after users graphically specify the array and matrix computations of target applications. Furthermore, several optimization blocks have been developed to quickly optimize program performance by switching from the standard code blocks to the optimized ones. In addition, as the tool can display the generated CUDA or OpenCL programs alongside the corresponding CUDA or OpenCL blocks, beginners can learn CUDA or OpenCL programming by comparing the blocks against the generated code. Experimental results show that the generated CUDA and OpenCL programs achieve reasonable speedups on GPUs.

The main contributions of this paper are as follows:

  • This paper presents a GUI tool called GPUBlocks that applies the visual programming approach to facilitate parallel programming on GPUs.

  • Programmers simply drag and drop blocks, fill in the fields of the blocks, and connect them according to the array or matrix computations specified by their algorithms. GPUBlocks then translates the block-based code into CUDA or OpenCL programs.

  • A prototype visual programming tool, CUDABlock [8], which converts OpenBlocks-based diagrams into CUDA programs, has been implemented upon OpenBlocks and ArduBlock.

  • This paper extends CUDABlock by incorporating OpenCL blocks in the frontend and an OpenCL code generator in the backend. Consequently, programmers can conveniently drag and connect GPUBlocks blocks and then generate OpenCL or CUDA programs.

The rest of the paper is organized as follows. Section 2 surveys the related work. Section 3 highlights the key components of the GPUBlocks tool. Section 4 presents the experimental results and Section 5 concludes this paper.

2 Related Work

OpenCL is an open standard maintained by the Khronos Group for systems that incorporate different types of computing devices, such as CPUs, GPUs, DSPs, and FPGAs [5]. With its platform-independent APIs, OpenCL code is portable across different devices. The standard is now supported by many hardware vendors, and with device-specific drivers, OpenCL-based data-parallel programs can take advantage of the computing power of the underlying parallel hardware.

CUDA is a parallel computing platform and programming model invented by NVIDIA [13]. CUDA is a proprietary, data-parallel programming platform developed specifically to use CUDA-enabled GPUs for general-purpose processing. On NVIDIA platforms, CUDA programs are generally considered more efficient than their OpenCL counterparts.

Overall, the above programming languages facilitate heterogeneous computing by abstracting different hardware architectures and generating performance-oriented parallel code. Still, they remain too sophisticated for average programmers, who write programs for fun or focus on developing application algorithms.

In recent years, many attempts have been made to develop visual programming frameworks that ease programming efforts and/or serve computer education purposes, such as OpenBlocks [7], ArduBlock [1], NetLogo [12], StarLogo The Next Generation (TNG) [11], Scratch [10], Blockly [3], and Scratch Blocks [16]. Furthermore, several tools have been designed to support specific programming languages, such as Android App Inventor for the Android system [9] and Hopscotch for Apple iOS [4].

Scratch is a graphical programming tool developed by MIT [10]. This educational software aims to help young people learn to think creatively, and it can be used to build interactive stories, games, and animations. Blockly [3], a Google project, is an open-source library for building visual code editors that run in web pages and Android apps. Blockly helps users create programs with its visual code editor without worrying about language syntax. It supports the generation of various languages, including JavaScript, Python, PHP, Lua, and Dart, where each language has its own code generator that converts the diagrams into the corresponding program code. Based on Blockly, Scratch Blocks [16] is a new project for building creative learning tools for young people. The ongoing project, a collaboration between Google and the Scratch team at MIT, aims to develop the next generation of graphical programming blocks. Currently, the developer preview code is available on the project website.

CUDABlock is a visual programming tool that converts OpenBlocks-based diagrams into CUDA programs [8]. CUDABlock has a programming interface similar to that of ArduBlock [1], a graphical programming language for Arduino [2]. This work extends CUDABlock by incorporating OpenCL blocks in the frontend and an OpenCL code generator in the backend. Consequently, programmers can conveniently drag and connect GPUBlocks blocks and then generate OpenCL or CUDA programs.

3 GPUBlocks

GPUBlocks has been developed based on the OpenBlocks framework by adding sets of new blocks for CUDA and OpenCL programming: ANSI C Blocks, CUDA Blocks, and OpenCL Blocks.

3.1 OpenBlocks Framework Overview

OpenBlocks is a Java-based visual programming tool that facilitates programming tasks by stacking pre-defined blocks [7]. Programmers simply drag code blocks that denote specific actions and then connect the selected blocks according to their algorithms. Figure 1 shows an example of the OpenBlocks programming environment, which is divided into three areas. The top-left area allows users to choose from different blocks for programming, the bottom-left region displays the categories of the available blocks, and the area on the right is the main canvas for visual programming, where users drag blocks from the left and connect them.

Figure 1. OpenBlocks.

3.2 ANSI C Blocks

Five sets of ANSI C blocks have been integrated into GPUBlocks for C programming, namely Control, Test, Math, System IO, and Variables/Constants:

  • Control blocks denote the control-flow related constructs in C, as shown in Fig. 2.

  • Test blocks evaluate boolean expressions.

  • Math blocks represent pre-built mathematical functions.

  • System IO blocks refer to services offered by the host system; currently only the print function is implemented.

  • Variables/Constants blocks are used to declare constants and variables, as shown in Fig. 3a.

Figure 2. Control Blocks of ANSI C Blocks.

Figure 3. ANSI C Block.

In addition, there is another class of blocks, Code Blocks, which helps build customized C code by allowing users to define specialized code blocks. As shown in Fig. 3b, the Code Blocks category provides three buttons: head, setup, and main, and users can attach customized code to each of them. In particular, the head button is for including header files or declaring functions and variables, the setup button can be dragged into the main function for function or variable initialization, and the main button allows the user to specify customized statements.
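
For illustration, the skeleton below sketches how code attached to the three buttons might be laid out in the generated C program; the identifiers are hypothetical and the actual code emitted by GPUBlocks may differ.

    /* head: header inclusion and global declarations */
    #include <stdio.h>
    int counter;                      /* hypothetical variable declared via the head button */

    int main(void) {
        /* setup: initialization code dragged into the main function */
        counter = 0;

        /* main: customized statements specified by the user */
        counter = counter + 1;
        printf("counter = %d\n", counter);
        return 0;
    }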

Figure 4 depicts an example of programming a 2D array computation. The top part of the figure shows the blocks for initializing and computing a 2D array, while the bottom part shows the corresponding C statements generated by GPUBlocks. This example shows that these operations can be specified easily with two nested for-loop blocks and a variable block.
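
As a point of reference, the listing below is a minimal sketch of the kind of C code such a block diagram generates; the array name and dimension are hypothetical, and the exact code produced by GPUBlocks is the one shown in Fig. 4.

    #include <stdio.h>

    #define N 16                        /* hypothetical array dimension */

    int main(void) {
        int a[N][N];
        /* two nested for-loops, mirroring the two nested for-loop blocks */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                a[i][j] = i * N + j;    /* initialization/computation of the 2D array */
            }
        }
        printf("a[1][2] = %d\n", a[1][2]);
        return 0;
    }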

Figure 4. Example of ANSI C Blocks (top) and Its Corresponding C Code (bottom).

3.3 CUDA and OpenCL Blocks

GPUBlocks supports basic array and matrix operations of linear algebra, e.g., addition, subtraction, multiplication, and convolution, as listed in Fig. 5a. By default, these operations are translated into a sequential C program. When a CUDA block or an OpenCL block is placed before these array and matrix operations, GPUBlocks converts the block program into a CUDA program or an OpenCL program, respectively. In other words, switching among C, CUDA, and OpenCL is as simple as dragging an appropriate language block and placing it before the array and matrix operations.

Figure 5. CUDA and OpenCL Blocks.

In addition to the CUDA and OpenCL blocks, Fig. 5b illustrates that GPUBlocks also includes blocks for two basic and commonly used optimization techniques in both CUDA and OpenCL, i.e., share blocks for shared memory caching and blocking blocks for data tiling. When a share block is used, the GPUBlocks backend generates CUDA or OpenCL code that allocates data in the GPU shared memory. Because shared memory is on-chip, it is much faster than local and global memory, with roughly 100x lower latency than uncached global memory. Therefore, speedups can generally be observed when this optimization is applied. Furthermore, additional speedups may be achieved if a blocking block is chosen, since tiling (or blocking) is a commonly used programming pattern that partitions data into well-sized blocks small enough to be staged in shared memory.

CUDA Example

This section uses matrix multiplication as an example to illustrate GPUBlocks programs and their corresponding CUDA code. Two GPUBlocks variable blocks are first constructed to initialize two random matrices a and b, and then a matrix multiplication block and a CUDA block are added to specify that the matrix multiplication c = a × b be performed in CUDA, as illustrated in Fig. 6a. GPUBlocks then converts the block program into its corresponding kernel code and host program, as listed in Fig. 7. This example demonstrates that GPUBlocks is an intuitive approach that can significantly reduce the complexity of parallel programming in CUDA.
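
The generated kernel and host code for such an example typically follow the pattern sketched below; the function and variable names here are hypothetical, and the exact code produced by GPUBlocks is the one listed in Fig. 7.

    #include <cuda_runtime.h>

    // Naive CUDA matrix multiplication kernel: one thread computes one element of c.
    __global__ void matmul(const float *a, const float *b, float *c, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    }

    // Host side: allocate device buffers, copy a and b in, launch the kernel, copy c back.
    void run_matmul(const float *a, const float *b, float *c, int n) {
        size_t bytes = (size_t)n * n * sizeof(float);
        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
        dim3 block(16, 16);
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        matmul<<<grid, block>>>(da, db, dc, n);
        cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
        cudaFree(da); cudaFree(db); cudaFree(dc);
    }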

Figure 6. CUDA Examples.

Figure 7. Generated CUDA Code of Fig. 6a: CUDA Kernel (left) and Host Code (right).

Optimizing CUDA programs in GPUBlocks is straightforward: the plain CUDA blocks are simply replaced with the corresponding optimization blocks. Figure 6b shows that only one block needs to be replaced in order to utilize shared memory in the CUDA code, and Fig. 8 lists the optimized CUDA code generated by GPUBlocks. This example shows that this intuitive approach can effectively reduce the complexity of writing optimized CUDA programs.
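
For reference, a shared-memory version of the kernel usually follows the standard tiling pattern sketched below; the tile width and identifiers are hypothetical, the sketch assumes the matrix dimension is a multiple of the tile width, and the exact code generated by the share block is the one listed in Fig. 8.

    #define TILE 16                       // hypothetical tile width

    // Launched with TILE x TILE thread blocks on an (n/TILE) x (n/TILE) grid.
    __global__ void matmul_shared(const float *a, const float *b, float *c, int n) {
        __shared__ float as[TILE][TILE];  // operands staged in on-chip shared memory
        __shared__ float bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < n; t += TILE) {
            // each thread copies one element of a and one of b into shared memory
            as[threadIdx.y][threadIdx.x] = a[row * n + (t + threadIdx.x)];
            bs[threadIdx.y][threadIdx.x] = b[(t + threadIdx.y) * n + col];
            __syncthreads();              // wait until the whole tile is loaded
            for (int k = 0; k < TILE; ++k)
                sum += as[threadIdx.y][k] * bs[k][threadIdx.x];
            __syncthreads();
        }
        c[row * n + col] = sum;
    }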

Figure 8. Optimized CUDA Code of Fig. 6b: CUDA Kernel (left) and Host Code (right).

OpenCL Example

Figure 9 depicts the same matrix multiplication example in OpenCL, which looks almost identical to the CUDA example shown in Fig. 6. The unoptimized code in Fig. 9a is translated into the OpenCL code presented in Fig. 10. When generating OpenCL programs, GPUBlocks makes the following assumptions unless otherwise specified. First, GPUBlocks uses the first computing device found by the OpenCL runtime to perform the computation of the generated code. Second, GPUBlocks inserts assertion functions around the OpenCL API calls to ensure that the parallel code executes as expected; by default, this debugging feature is enabled in order to notify programmers of any problems during the execution of the generated programs. Third, GPUBlocks also injects performance debugging code for performance analysis. These default settings can be changed via the configuration file, and the new configuration takes effect when GPUBlocks is restarted.
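
A minimal sketch of the host-side setup implied by these assumptions is shown below; the macro and function names are hypothetical, and the code GPUBlocks actually generates may differ in detail.

    #include <assert.h>
    #include <CL/cl.h>

    #define CL_CHECK(err) assert((err) == CL_SUCCESS)   /* assertion-style error check */

    static cl_command_queue create_default_queue(void) {
        cl_platform_id platform;
        cl_device_id   device;
        cl_int         err;

        /* assumption 1: use the first platform and the first device the runtime reports */
        CL_CHECK(clGetPlatformIDs(1, &platform, NULL));
        CL_CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL));

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        CL_CHECK(err);                                  /* assumption 2: assert on every OpenCL call */

        /* assumption 3: enable profiling so kernel times can be queried for performance analysis */
        cl_command_queue queue = clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, &err);
        CL_CHECK(err);
        return queue;
    }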

Figure 9. Examples of OpenCL Blocks Programs.

Figure 10. Generated OpenCL Code of Fig. 9a: Kernel (bottom-left) and Host Code (right).

Figure 11 depicts the data tiling code of the matrix multiplication program in Fig. 9b. As shown in the OpenCL kernel code, the current thread ID is obtained first to calculate the boundary of the data block to be processed. Inside the for loop, local memory buffers are used to keep the data block in the device local memory, and the data required for the matrix multiplication are read from these local buffers, which accelerates the program. While data tiling is a common technique in parallel computing, it is tedious and costs average programmers a significant amount of time. GPUBlocks simplifies the optimization process and generates the optimized code on the fly.
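
This kind of kernel follows the familiar local-memory tiling pattern; a sketch is given below with hypothetical identifiers, under the assumption that the matrix dimension is a multiple of the tile width. The exact kernel generated by GPUBlocks is the one shown in Fig. 11.

    #define TILE 16                                     /* hypothetical tile width */

    __kernel void matmul_tiled(__global const float *a, __global const float *b,
                               __global float *c, int n) {
        __local float as[TILE][TILE];                   /* data blocks kept in device local memory */
        __local float bs[TILE][TILE];
        int row = get_global_id(1), col = get_global_id(0);
        int ly  = get_local_id(1),  lx  = get_local_id(0);
        float sum = 0.0f;
        for (int t = 0; t < n; t += TILE) {
            as[ly][lx] = a[row * n + (t + lx)];
            bs[ly][lx] = b[(t + ly) * n + col];
            barrier(CLK_LOCAL_MEM_FENCE);               /* wait until the data block is loaded */
            for (int k = 0; k < TILE; ++k)
                sum += as[ly][k] * bs[k][lx];           /* read operands from the local buffers */
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        c[row * n + col] = sum;
    }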

Figure 11. Generated OpenCL kernel code illustrated in Fig. 9b.

4 Experimental Results

Several array and matrix operations have been performed to evaluate the efficiency of the generated OpenCL and CUDA programs. The microbenchmarks, shown in Fig. 12, have been executed on a Linux/x86 system with an Intel Xeon E5506 processor, and their corresponding CUDA and OpenCL programs have been tested on an NVIDIA GeForce GTX 1080 (CUDA SDK 8.0 and OpenCL 1.2). Each program is executed with the following four configurations.

  • cpu. The baseline performance, i.e., the elapsed time of the sequential C code on the main processor.

  • gpu. The total time spent on both the main processor and the GPU by the generated CUDA or OpenCL programs.

  • shared. The total time spent on both the main processor and the GPU by the generated CUDA or OpenCL programs with shared memory caching.

  • blocking. The total time spent on both the main processor and the GPU by the generated CUDA or OpenCL programs with the tiling optimization.

Figure 12. The microbenchmarks and their input data sizes.

Figures 13 and 14 depict the experimental results of the CUDA and OpenCL programs, respectively. Note that the sequential C code, denoted as cpu, serves as the baseline configuration and indicates the performance differences between the CUDA and OpenCL versions. Overall, the NVIDIA device outperforms the AMD device in our experiments. In addition, the optimized CUDA/OpenCL versions are faster than those without optimizations. Nevertheless, optimizations may also lead to poor performance, e.g., the shared optimization for matmul and the blocking optimization for Volve.

Figure 13. CUDA Performance on NVIDIA GPU.

Figure 14. OpenCL Performance on AMD GPU.

The slowdown of the optimized codes can be attributed to program behaviors and software/hardware interactions. For example, although the shared optimization copies the data to local buffers prior to the computation, the matrix multiplication does not reuse the localized data, and hence the data copying adds overhead. The performance delivered by the blocking optimization depends on the number of concurrent threads, which varies across different hardware/software combinations, and extensive experiments are needed to explore the best configuration, e.g., the thread number.

At the current stage, performance tuning for the best configuration of the generated program is not the focus of this work. Still, we have developed some facilities to help profile the performance of the converted programs, and programmers can use them to tune program performance.
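
For example, with a profiling-enabled command queue, the elapsed time of a generated kernel can be read back from its OpenCL event, as in the illustrative sketch below; the function name is hypothetical, and the queue, kernel, and work sizes are assumed to be set up elsewhere.

    #include <stdio.h>
    #include <CL/cl.h>

    static void time_kernel(cl_command_queue queue, cl_kernel kernel,
                            const size_t gws[2], const size_t lws[2]) {
        cl_event ev;
        cl_ulong start, end;
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, lws, 0, NULL, &ev);
        clWaitForEvents(1, &ev);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
        printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
        clReleaseEvent(ev);
    }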

5 Conclusion

This paper introduced a visual programming tool, GPUBlocks, for CUDA and OpenCL programming. The tool helps beginners and average programmers implement CUDA and OpenCL programs by simply dragging and connecting blocks. The generated CUDA and OpenCL programs can be used directly to perform computations on GPUs, or serve as first drafts of CUDA and OpenCL kernels to be further optimized manually.