1 General Introduction to OpenCL

Open Computing Language (OpenCL) is an open framework for parallel computing on many kinds of devices (GPUs, CPUs, FPGAs); it is dissimilar to CUDA, which supports only NVIDIA GPUs. The specification of OpenCL was developed by the Khronos Group [1], an open consortium that develops open standards for software frameworks.

Device vendors supply OpenCL Software Development Kits (SDKs) that comply with the Khronos specifications, but their extensions deviate from the approved specifications. The extensions expose two types of application programming interfaces (APIs): one type is a candidate for future specifications, and the other is vendor dependent; the two can be distinguished by their names [2]. Thus, users must consider the conformance of each API they employ.

Although OpenCL is based on the C language, there are wrappers for other languages, e.g., the official C++ wrapper [3] and PyOpenCL [4] for Python (further information is available on the website of STREAM HPC [5]), enabling many software engineers to utilize GPGPU. This chapter focuses on OpenCL based on C/C++.

The basic techniques for accelerating a program are almost the same in OpenCL and CUDA; thus, this chapter focuses on clarifying how to utilize OpenCL on your devices, as well as its differences from CUDA. However, owing to the page limit, the details of OpenCL (the definitions of its APIs) cannot be discussed; readers who wish to learn OpenCL in detail can refer to the programming guides released by the vendors of the computing devices [6,7,8].

2 Setting Up an OpenCL Environment

Most vendors of OpenCL-supporting devices provide SDKs for developers; these SDKs include the OpenCL library for their devices and the standard headers (.h), as well as additional headers for extended functions that are supported only on their devices. Therefore, prospective users of OpenCL must first download and install the SDK for their devices.

Notably, a Windows 10 64-bit environment was employed in this chapter; readers employing other environments, e.g., macOS and Linux, can substitute the filenames or extensions according to their environment, e.g., OpenCL.dll -> libOpenCL.so for Linux users. The static library “OpenCL.lib” and the dynamic link library “OpenCL.dll” are required for developing and executing an OpenCL program. “OpenCL.lib” is available in the directories of an SDK, while “OpenCL.dll” is preinstalled in the system directories of Windows after the graphics driver is installed. Further, a header file (“cl.h”), which is available in the directory of an SDK, should be included in the program.
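As a minimal sanity check of the setup, the following short program (a sketch that assumes the SDK include path and “OpenCL.lib” have been passed to the compiler and linker) only includes the standard header and queries the number of installed platforms:

```c
/* Minimal setup check: include the standard header shipped with the SDK and
   link against OpenCL.lib (Windows) or libOpenCL.so (Linux). */
#include <CL/cl.h>

int main(void)
{
    cl_uint numPlatforms = 0;
    /* If this call succeeds and reports at least one platform,
       the OpenCL runtime is installed correctly. */
    clGetPlatformIDs(0, NULL, &numPlatforms);
    return (numPlatforms > 0) ? 0 : 1;
}
```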

3 Constructing an OpenCL Program

This section introduces the construction of an OpenCL program, employing a simple computer-generated hologram (CGH) calculation source code as the “Hello, world” program of OpenCL, which is depicted in Listings 6.1 (host program) and 6.2 (device program). Readers who have already set up the OpenCL environment can execute the sample codes by copying Listing 6.1 (with an appropriate name for a C++ file) to their computers and Listing 6.2 with the name “CGH_helloworld.cl,” which should be placed in the same directory as the executable file built from Listing 6.1. After executing the program, a kinoform-type CGH with a resolution of \(1024\times 1024\) is obtained in the “bfh_CGH” buffer, as shown in Fig. 6.1.

Fig. 6.1 Input and output of the example code: (a) a 3D model with 100 point clouds (input, generated in the program); (b) kinoform-type CGH (output)

An OpenCL program comprises two types of source codes, the host (.c or .cpp, .h) and device (.cl) codes. A standard OpenCL program adopts online compilation of the device code to improve its portability. Therefore, a C/C++ compiler, e.g., clang, gcc, or Visual C++, compiles the host code with the OpenCL static library and creates the executable file, which reads and compiles the device code for the devices specified for the program when it is executed. Notably, OpenCL also supports offline compilation.

The most significant differences between CUDA and OpenCL are the concepts of the platform and the devices. Since OpenCL supports many computing devices, an OpenCL program must first enumerate the available devices; users must then specify the desired devices on which the program will execute. Every device belongs to a platform. For example, when executing OpenCL on an Intel Core i7-8700K CPU with an Intel OpenCL SDK environment, the platform would be “Intel OpenCL,” and two devices (the integrated GPU, Intel UHD Graphics 630, and the Intel Core i7-8700K CPU) would be available on the platform. The platforms and devices are specified by IDs; thus, many OpenCL APIs require these IDs to be set in their arguments.

3.1 Creating OpenCL Objects That Are Not Required in CUDA

Dissimilar to CUDA, OpenCL defines many objects, e.g., the memory and kernel objects, to manage device-related information, such as memory addresses and the binary code of an executing program, since OpenCL is assumed to be executed on different platforms and devices. Thus, OpenCL requires the creation of such objects before the execution of a kernel. Table 6.1 and Fig. 6.2 present the objects required in a standard OpenCL program and the roles of and relations between the objects, respectively. The OpenCL objects that are not required in CUDA are introduced in this subsection with reference to the sample code in Listing 6.1.

The context object is the fundamental object for managing all the objects on a platform; thus, it must be created at the beginning of an OpenCL program with the intended platform ID and the number of devices on the platform. The available platforms and devices can be obtained by “clGetPlatformIDs(),” which is employed on Lines 65 and 69 of Listing 6.1, and “clGetDeviceIDs(),” which is used on Line 86 of Listing 6.1, respectively. Detailed information on the platforms and devices can be obtained by “clGetPlatformInfo()” and “clGetDeviceInfo(),” which are employed on Lines 77 and 91, respectively. Here, the program obtains the names of the platforms and devices. The context object is created by the API “clCreateContext(),” which is employed on Line 115 of the list.
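The following is a minimal sketch of this discovery step (it is not a reproduction of Listing 6.1; it simply selects the first platform and its first device, and all variable names are illustrative):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;
    cl_uint numPlatforms = 0, numDevices = 0;
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    /* Take the first available platform and its first device. */
    err = clGetPlatformIDs(1, &platform, &numPlatforms);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, &numDevices);

    /* Query human-readable names, as done on Lines 77 and 91 of Listing 6.1. */
    clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
    printf("Platform: %s\n", name);
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Device  : %s\n", name);

    /* Create a context on this platform that manages the objects for the device. */
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 };
    cl_context context = clCreateContext(props, 1, &device, NULL, NULL, &err);

    /* ... create command queues, programs, kernels, and buffers here ... */

    clReleaseContext(context);
    return 0;
}
```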

Table 6.1 Definition of the objects in OpenCL
Fig. 6.2 Calculation model of OpenCL, showing the relations between the objects

The command-queue object is an interface that manages all the commands, e.g., executing a kernel and transferring the data in a buffer; thus, one must be created for every device to be utilized. A command-queue object is created by “clCreateCommandQueueWithProperties()” with the corresponding device ID, as depicted on Line 118 of the list. The commands to a device are queued by the “clEnqueue***()” APIs via a command-queue object. For example, to copy data from the memory of a device to the host, “clEnqueueReadBuffer(),” on Line 165 of the list, is called with the command-queue object as the first argument. Note that the commands are only enqueued; thus, their execution time is not known in advance and depends on the preceding commands in the queue.
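A minimal sketch of this step, assuming the “context” and “device” variables from the previous sketch:

```c
/* One command queue per device that will receive commands. */
cl_command_queue queue =
    clCreateCommandQueueWithProperties(context, device, NULL, &err);

/* ... enqueue kernel executions and data transfers via clEnqueue***() ... */

clReleaseCommandQueue(queue);
```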

The program object manages the raw (readable text) source code, as well as the compiled binary, of the device functions. Thus, the device source code must be read into a text buffer before the object is created. Lines 122–133 of the list show an example of reading the device source code from a file (CGH_helloworld.cl) into a char buffer (src), as well as creating a program object with “clCreateProgramWithSource()” on Line 128. After creating the program object, it can be built by “clBuildProgram()” for the specified devices, as shown on Line 131 of the list.
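The following is a sketch of this step (not the chapter's code; it assumes <stdio.h> and <stdlib.h> are included and that “context” and “device” were created as in the earlier sketches, with no error handling for brevity):

```c
/* Read the device source code as text. */
FILE *fp = fopen("CGH_helloworld.cl", "rb");
fseek(fp, 0, SEEK_END);
size_t srcSize = (size_t)ftell(fp);
rewind(fp);
char *src = (char *)malloc(srcSize + 1);
fread(src, 1, srcSize, fp);
src[srcSize] = '\0';
fclose(fp);

/* Create the program object from the text and build it online for the device. */
cl_program program =
    clCreateProgramWithSource(context, 1, (const char **)&src, &srcSize, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
free(src);
```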

The kernel object specifies a kernel function within a program object; it is created by the “clCreateKernel()” function from the program object and the name of the kernel function. Thus, one kernel object must be created for each device function to be executed. In the sample code, only one device function is defined in the device code (Listing 6.2); therefore, only one kernel object is created, on Line 136 of Listing 6.1.
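A sketch of this step; the function name “cgh” is an assumption for illustration and does not necessarily match the name used in Listing 6.2:

```c
/* Create a kernel object by naming the __kernel function defined in the .cl file. */
cl_kernel kernel = clCreateKernel(program, "cgh", &err);
```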

The memory object manages a memory buffer on a device and functions as a memory pointer. The memory object is created by “clCreateBuffer()” with the context object and attributes of the memory (size and writability), analogous to “cudaMalloc()” in CUDA. In Listing 6.1, four memory objects are created on Lines 139–142. Note that the hierarchical memory architectures of OpenCL and CUDA are almost the same (Table 6.2), and a memory buffer created by “clCreateBuffer()” is allocated in the global memory.
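A sketch of buffer creation; the buffer names, data types, and layout of the point-cloud data are assumptions for illustration:

```c
/* Device buffers reside in global memory, analogous to cudaMalloc(). */
const size_t width = 1024, height = 1024, numPoints = 100;
cl_mem memCGH = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                               width * height * sizeof(unsigned char), NULL, &err);
cl_mem memObj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               numPoints * 3 * sizeof(float), NULL, &err);
```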

With these objects created, the preparation for executing the kernel is almost complete. The following subsection introduces the procedure for launching the OpenCL kernel.

Table 6.2 Corresponding names of memory

3.2 Executing the Kernel Function

Dissimilar to CUDA, OpenCL requires a two-step setup before the kernel is enqueued for execution. The first step sets the arguments of the kernel function via the “clSetKernelArg()” function (Lines 151–155 of Listing 6.1). Notably, every argument must be passed through a void*-type pointer.
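A sketch of this step; the number, order, and types of the arguments are assumptions that follow the buffers created in the earlier sketches:

```c
/* Each argument is passed through a void* pointer to its value. */
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memCGH);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memObj);
cl_int nPoints = 100;
err = clSetKernelArg(kernel, 2, sizeof(cl_int), (void *)&nPoints);
```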

The second step defines the division units for parallel execution; these units are called the grid, block, and thread in CUDA and correspond to the “NDRange,” “workgroup,” and “workitem” of OpenCL, respectively. The sizes of the NDRange and workgroup are specified by multidimensional size_t-type arrays, as exhibited on Lines 158 and 159 of the list. In the sample code, the size of the NDRange was set equal to the size of the CGH, and the size of the workgroup was set to \(256\times 1\). The maximum number of workitems in a workgroup is defined by the hardware specifications.
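For the sizes used in the sample code, these arrays would look as follows:

```c
/* NDRange covers the whole CGH plane; each workgroup contains 256x1 workitems. */
size_t globalSize[2] = { 1024, 1024 };
size_t localSize[2]  = { 256, 1 };
```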

After the two-step preparation, the command for executing the kernel function can be enqueued by “clEnqueueNDRangeKernel()” employing the sizes of the NDRange (globalSize) and the workgroup (localSize), as well as the command-queue object (Line 162 of the list).
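A sketch of the enqueue call, using the names introduced in the earlier sketches:

```c
/* Launch a two-dimensional, data-parallel execution of the kernel. */
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                             globalSize, localSize, 0, NULL, NULL);
```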

Finally, the result of the kernel function is obtained by transferring the buffer data from the device to the host. “clEnqueueReadBuffer()” is the transfer function; it is executed to transfer the buffer data from the device to the host (Line 165 of the list), and it is equivalent to “cudaMemcpy()” in CUDA. To ensure that the transfer is complete, a function that synchronizes the device with the host must be called before the data in the host buffer (bfh_CGH) are used. In the sample code, the “clFinish()” function, which waits until the last command enqueued in the command queue has been executed, is called. Note that there are other functions, e.g., clWaitForEvents() with an event object, for achieving finer synchronization; those APIs can be referenced by readers who wish to construct a more complex OpenCL program.
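A sketch of the read-back step, assuming an 8-bit host buffer for the hologram (the actual buffer type in Listing 6.1 may differ); alternatively, passing CL_TRUE as the blocking flag makes clEnqueueReadBuffer() itself wait for completion:

```c
/* Copy the hologram back to the host and wait until the copy has finished. */
unsigned char *bfh_CGH = (unsigned char *)malloc(1024 * 1024);
err = clEnqueueReadBuffer(queue, memCGH, CL_FALSE, 0,
                          1024 * 1024 * sizeof(unsigned char),
                          bfh_CGH, 0, NULL, NULL);
clFinish(queue);   /* block until all enqueued commands have completed */
```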

This subsection only discusses the method for executing data-parallel computation. However, OpenCL also provides methods for parallelizing a calculation in task units, as in CUDA. Readers who wish to employ task-parallel programs may refer to the instruction manuals of OpenCL supplied by the device vendors.

To summarize the above, the standard structure of an OpenCL host program is as follows:

  1. Determine an available platform and its devices, and specify the appropriate devices.

  2. Create a context object, which manages all the objects on a platform.

  3. Create a command-queue object, which is connected to a device to manage the commands to be executed on it.

  4. Read a device program as text and build it, treating it as a program object.

  5. Create the kernel objects from a program object by specifying the names of the functions written in the .cl file.

  6. Create the memory objects, which manage the memory space on a device.

  7. Set the arguments and workgroup size to be used by the kernel.

  8. Execute the kernel function.

  9. Copy the result from the device memory.

Table 6.3 Corresponding names of the modifiers of the variables and memories
Table 6.4 Corresponding names of the modifier of the functions
Table 6.5 Corresponding methods for obtaining the index values: N is a dimension

3.3 Writing the Kernel Function

The kernel function is the function that is executed by a device. The grammar and syntax of the kernel functions of OpenCL and CUDA are almost the same, although the names of the modifiers of their variables, memories, and functions, as well as the methods for obtaining their index values, e.g., “gridDim” in CUDA, are different. Tables 6.3, 6.4, and 6.5 present the correspondences of the modifiers and other basic functions of CUDA and OpenCL. N in Table 6.5 indicates the dimension that must be passed to the functions; thus, gridDim.x in CUDA is equivalent to get_num_groups(0).
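As a brief illustration of this correspondence (a sketch; the identifiers are illustrative), the global index of a workitem can be obtained as follows in OpenCL device code:

```c
/* CUDA:   int x = blockIdx.x * blockDim.x + threadIdx.x;
   OpenCL: the same index, obtained directly or assembled from workgroup values. */
int x  = get_global_id(0);
int x2 = get_group_id(0) * get_local_size(0) + get_local_id(0); /* equals x when no offset is used */
```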

The standard kernel function for calculating a CGH is presented in Listing 6.2, which is a simplified version of the sample code for calculating a CGH with CUDA (Listing 10.2). For readers who wish to execute an OpenCL program, modifying Listing 6.2 is an easy way to build a first OpenCL program. Here (Listing 6.2), three preprocessor macros are defined to substitute the constant values: “CNS_255_DIV_2_PI” and “CNS_2_PI_DIV_LAMBDA” correspond to \(\frac{255}{2\pi }\) and \(\frac{2\pi }{\lambda }\), respectively (\(\lambda =532\) [nm]), and “CNS_PITCH” represents the pixel pitch of the display device.
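As a rough illustration only (this is not a reproduction of Listing 6.2; the kernel and argument names, the point-cloud layout, the 8-bit output, and the pitch value are assumptions), a kinoform kernel built around these macros could look like the following:

```c
#define CNS_255_DIV_2_PI    (255.0f / (2.0f * 3.14159265f))
#define CNS_2_PI_DIV_LAMBDA (2.0f * 3.14159265f / 532.0e-9f)
#define CNS_PITCH           8.0e-6f   /* assumed pixel pitch [m] */

__kernel void cgh(__global unsigned char *hologram,
                  __global const float *obj,   /* x, y, z per object point */
                  const int numPoints)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int width = get_global_size(0);

    const float px = x * CNS_PITCH;
    const float py = y * CNS_PITCH;
    float re = 0.0f, im = 0.0f;

    /* Superpose the spherical waves from all object points at this pixel. */
    for (int i = 0; i < numPoints; i++) {
        const float dx = px - obj[3 * i + 0];
        const float dy = py - obj[3 * i + 1];
        const float dz = obj[3 * i + 2];
        const float r  = sqrt(dx * dx + dy * dy + dz * dz);
        const float ph = CNS_2_PI_DIV_LAMBDA * r;
        re += cos(ph);
        im += sin(ph);
    }

    /* Kinoform: keep only the phase and map it to 8 bits via 255/(2*pi). */
    const float phase = atan2(im, re) + 3.14159265f;   /* range [0, 2*pi) */
    hologram[y * width + x] = (unsigned char)(CNS_255_DIV_2_PI * phase);
}
```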

The calculation times for this execution are 95.2 ms with an NVIDIA Quadro P1200 GPU and CUDA 11.0, 1738 ms with an Intel Core i7-8850H CPU, and 324 ms with an Intel UHD Graphics 630 GPU; all were evaluated with OpenCL. The kernel source code (Listing 6.2) has a very simple structure for ease of understanding; thus, applying the optimization techniques mentioned in Chapter 6 will make it considerably faster. Unfortunately, the details of those techniques are beyond the scope of this OpenCL discussion, although readers who already understand the differences and similarities of CUDA and OpenCL can easily apply those techniques in their OpenCL codes.

Moreover, only a few studies in the literature describe the fast calculation of CGHs via OpenCL; readers can refer to [9] as a practical example of implementing a CGH calculation with OpenCL.

Listings 6.1 and 6.2 (source code, rendered as code figures a–e)

Funding This work was supported by JSPS KAKENHI Grant Number 22H03616.