Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

General-Purpose computing on Graphics Processing Units (GPGPU) is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computation across a variety of applications traditionally handled by the CPU. Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model created by NVIDIA and implemented by the GPUs that they produce. Open Computing Language (OpenCL) is a framework executing programs on heterogeneous platforms consisting of CPUs, GPUs and other dedicated processors. It is also a good candidate in comparison with the CUDA approach specifically developed for GPU platforms of NVIDIA.

Fig. 1.
figure 1

The flowchart of proposed prototyping framework

Generally speaking, programming with GPU is one complex, error-prone and time-consuming procedure comparing with sequential dataflow and programming. For computer vision algorithms, there are various programming environments and languages of different platforms. So it is very hard to make one rapid development on various hardware accelerators like GPU. In general, there are two ways to perform the code porting from sequential code to the code executable on GPU. One is to manually write the kernel code of CUDA or OpenCL, which redesigns and compiles the kernel code with the specified compiler. Another way is to automatically generate code executable for hardware accelerators from the high-level description of application with the aid of the prototyping framework. Proposed prototyping framework consists of such three tools: Open Reconfigurable Video Coding Compiler (Orcc) [1], Parallel and Real-time Embedded Executives Scheduling Method (Preesm) [2], and Hybrid Multicore Parallel Programming (HMPP) [3] as shown in Fig. 1. As the high level programming model, Orcc contains an Reconfigurable Video Coding(RVC)-CAL textual editor, a compilation infrastructure, a simulator and a debugger which is proposed by IETR of INSA-Rennes. Lots of works have been proposed with Orcc and many backends are supported in Orcc like C, C++,VHDL, HMPP and so on [68, 11]. As the intermediate level programming model, Preesm offers a fast prototyping tool for parallel implementations used in many applications like LTE RACH-PD algorithm. HMPP directive-based programming model is such a tool that offers a powerful syntax to efficiently offload computations on hardware accelerators and keeps the original C/C++ or Fortran codes [35, 9, 10].

In this paper, we evaluate the prototyping framework with the previous Full Search ME method. We not only implement the ME method on the heterogeneous system with one CPU and one GPU, but also find the balance to distribute the workload in heterogeneous computing system. With the prototyping framework, we can generate code automatically and evaluate the performance rapidly.

2 Full Search Motion Estimation Algorithm

Full Search motion estimation is to search the best candidate of macroblock (MB) in the reference frame for the original macroblock in the current frame. The detailed illustration is introduced in our previous research work [4]. The following two subsection with Pseudo code: Algorithms 1 and 2 simply summarize the procedure of the proposed ME algorithm, which is divided into two OpenCL kernels. One is SADs computation; the other is SADs comparison for the best SAD candidate.

SAD Computation. When an OpenCL program invokes a kernel, N work-groups are enumerated and distributed as thread blocks to the multiprocessors with available Compute Units of CPU and GPU. In kernel_compute, all pixels of current MB are transferred into local memory (local[256]) by the 256 work-items in one work-group. One novelty of this paper is as follows. Until all these work-items in the same work-group reach the synchronous point using barrier() function, all the 256 work-items continue transferring the 2304 pixels of search region concerned of reference image into local memory (\(local\_ref[2304]\)). This differentiates our approach from previous approaches of [12, 13] and obtains better time efficiency (speed-up). In their work, current MB is stored in local memory, but the search region of reference frame still locates in the global memory, which results in inevitable re-fetching from global memory with performance loss. At the end, we adopt Full Search strategies to calculate the 1024 SAD candidates in local memory, in order to avoid re-fetching from the global memory as presented in Algorithm 1. All these 1024 SAD candidates are transferred back to global memory (cost[1024]) for SAD comparison.

SAD Comparison. In kernel_compare, we search the best candidate with the minimum SAD from cost[1024] using 256 work-items as presented in Algorithm 2. First, we transfer the cost[1024] into the local memory. Then, each work-item compares 4 candidates with a stride to find the minimum value. We employ the parallel reduction method [14] which adopts x times iterations (\(2^{x} = 256, x = 8\)) to find the candidate with the minimum SAD value from the remaining 256 candidates, also to obtain the final MV.

figure a
figure b

2.1 Experiments Discussion

We evaluate the performance of the proposed ME algorithm with manual OpenCL kernels in such hardware HASEE environment: Intel I7 2630qm (2.8 GHz), NVIDIA GT540m. We compare the performance of GPU-based FSME implementation and available state-of-the-art fast ME algorithms. To the best of our knowledge, there are two criterions of evaluation: time efficiency, and matching accuracy. Our experimental results mainly focus on PSNR.

The PSNR is defined as:

$$\begin{aligned} psnr= & {} 10\cdot log10(width\times height\times 255\times 255/sse) \end{aligned}$$
(1)
$$\begin{aligned} sse= & {} \mathop {\sum }_{n=0}^{sum}\mathop {\sum }_{i=0}^{blocksize}\mathop {\sum }_{j=0}^{blocksize}(current_{frame}(i,j)-ref_{frame}(i+dx,j+dy))^2 \end{aligned}$$
(2)

where sse is the error sum of square, sum is the total number of macroblocks in one frame, and (dxdy) is the calculated motion vector. Proposed approach has faster speed and better accuracy than the state-of-the-art fast ME methods.

3 Heterogeneous Parallel Computing with OpenCL

Besides of manually writing the kernel code of OpenCL, we target to parallel computing with the aid of prototyping framework. Based on our previous research on motion estimation, we describe the proposed motion estimation approach with one high level description (RVC-CAL). There are two branches in our rapid prototyping framework, one branch is \(RVC-CAL\rightarrow Orcc\rightarrow HMPP\rightarrow GPU (OpenCL/CUDA)\), and another one is \(RVC-CAL\rightarrow Orcc\rightarrow Preesm\rightarrow DSP/ARM (Embedded C)\) as shown in Fig. 1: rapid prototyping framework. In this paper, we choose the first branch to generate and verify C/OpenCL/CUDA implementation on heterogeneous platforms such as multi-core CPU and GPU platforms respectively. With the HMPP-backend of Orcc, we obtain the C code with HMPP directives from the high level description. Then the HMPP compiler will automatically generate the OpenCL/CUDA kernel code. Our prototyping methodology framework can greatly simplify the procedure of implementation target to hardware devices (Fig. 2).

Fig. 2.
figure 2

PSNR comparison of specified state-of-the-art fast ME algorithms

figure c

In the aforementioned discussion, there are paired directives of HMPP: Codelet and Callsite. Algorithm 3 presents the pseudo-code of default HMPP Transformation. Using simple paired directives, HMPP can replace the complex procedure of manually writing the CUDA/OpenCL kernel code. As shown in the above sample code, the codelet is the routine implementation for hardware accelerator and the callsite is the routine invocation of ME function on hardware. When we compile the sample code with HMPP, we can obtain the .cu or .cl kernel of proposed method which targets to GPUs. Instead of manually designing the CUDA/OpenCL kernel, HMPP greatly reduces the development period with GPU and CPU device.

The Reconfigurable Video Coding (RVC) [15] defines a set of standard coding techniques called Functional Units (FUs). A FU is described with a portable, platform-independent language called RVC-CAL. It is one high level language of description which is designed to describe the reconfigure video coding standard. Now RVC-CAL is not only used in video coding, but also in some image and video processing algorithms, like motion estimation and stereo matching [4]. RVC defines a XML-based format called FU Network Language (FNL) that is used for the description of networks, also named the XML Dataflow Format (XDF). The Model of Computation defines the behavior of Full Search ME as a dataflow graph shown in Fig. 3.

Fig. 3.
figure 3

The dataflow diagram of proposed ME approach

In the block diagram of the motion estimation, the block “source” indicates the function of load_frame, a FU that loads video frames; the block “ExtractYRef” and “ExtactY” indicate the function of extractY, a FU that extracts Y channel from the current and reference frames of YUV format video sequence; the block “Search_FS” indicates the function of fullSearchME, a FU that does the Full Search motion estimation; the block “ShowVector” indicates the function of display, a FU that shows the calculated motion vectors. With the high level description of applications, the prototyping framework can directly and rapidly generate the target code like OpenCL/CUDA for heterogeneous systems, which greatly reduces the development period of parallel programming and easily accesses the parallel computing without concentrating on the complex kernel code.

4 Conclusion

We introduce one prototyping framework to implement the proposed ME method, and evaluate the prototyping methodology. Experimental results show that, our implementation has better performance than other GPU-based FSME implementations. One basic method is introduced to find the balance of workload on the heterogeneous parallel system with OpenCL. Additionally, we have found the accurate method to distribute the workload in video applications based on the heterogeneous system. It is also the first prototyping methodology generating target code for OpenCL-supported device from the high level description (different with other code generator like OpenACC), which greatly reduces the development period of parallel programming and easily accesses the parallel computing without concentrating on the complex kernel code.