1 Introduction

Image processing requires a long time, which is tightly limited in real-time applications [1, 2]. Processing time increases with both the number and the resolution of the images [3]. The problem is that serial image processing does not satisfy real-time conditions [3]. Parallel computing techniques, especially multicore and multiprocessor technologies, should be used to solve this problem [4]. Parallel algorithms are much more complex than serial ones. Generally, parallel algorithms are designed by modifying serial algorithms [5]. The resulting parallel algorithms can be further improved and accelerated by taking into account the hardware on which they will run.

Segmentation is one of the steps in image processing, and thresholding is widely used for this purpose. In real-time applications, multicore CPUs and GPGPUs should be used to execute thresholding on the many images covering the entire surface of the same metallic and cylindrical moving object in order to satisfy real-time conditions. A multicore CPU is a single computing component with two or more independent central processing units (called cores) [6, 7]. Pthreads, OpenMP (Open Multiprocessing), TBB (Threading Building Blocks) and Cilk are APIs (Application Programming Interfaces) for efficiently using the capacity of a multicore CPU [8]. In this paper, OpenMP, a general-purpose and platform-independent API that supports shared-memory multiprocessing programming in C, C++ and FORTRAN, will be used.

A Graphics Processing Unit (GPU) is a Single Instruction stream, Multiple Data streams (SIMD) architecture in which the same instruction is performed on all data elements in parallel. The pixels of an image can be considered as separate data elements, so a GPU is a suitable architecture for processing the data elements of an image in parallel [9]. General-Purpose computing on Graphics Processing Units (GPGPU) is a means of increasing the utilization of the GPU. There are many platforms for efficiently using the capacity of GPGPU, such as CUDA, DirectCompute and OpenCL. The CUDA platform, which is the most common one, will be used in this paper [10].

This paper shows that more efficient algorithms and techniques still need to be developed to improve the performance of real-time image processing applications. One of the aims of this study is to contribute to this area using OpenMP and CUDA. To this end, bi-level thresholding is applied in parallel to the images covering the entire surface of the same metallic and cylindrical moving object with the following five techniques. One technique is related to CPU programming with the OpenMP platform. In this context, shared-memory multicore programming with OpenMP, scheduling threads on cores with different parameters, and performance in terms of execution time are analyzed. The other four techniques are related to GPU programming with the CUDA platform:

  1. Single Image Transmission with Single Pixel Processing (SISP), in which the images are transmitted from the CPU to the GPU one by one and the pixels of the images are processed one pixel per GPU core;

  2. Single Image Transmission with Multiple Pixel Processing (SIMP), in which the images are transmitted from the CPU to the GPU one by one and the pixels of the images are processed multiple pixels per GPU core;

  3. Multiple Image Transmission with Single Pixel Processing (MISP), in which multiple images are combined and transmitted from the CPU to the GPU as a single data unit and the pixels of the images are processed one pixel per GPU core;

  4. Multiple Image Transmission with Multiple Pixel Processing (MIMP), in which multiple images are combined and transmitted from the CPU to the GPU as a single data unit and the pixels of the images are processed multiple pixels per GPU core.

Performance analysis related to execution time was performed by comparing the results obtained with these techniques against serial computing. The multicore CPU technique showed that, with small chunk sizes and dynamic or guided scheduling, the execution time decreases approximately four times. All GPU techniques were implemented on GeForce, Tesla K20 and Tesla K40. Tesla K40 gave the best results: improvements of 35 times (SISP), 36 times (SIMP), 54 times (MISP) and 71 times (MIMP) in comparison with serial computing.

The rest of the paper is organized as follows. In Sect. 2, some related works are presented. In Sect. 3, the real-time image processing techniques are proposed. Section 4 describes the image transmission techniques between the CPU and the GPU. The experimental results are given in Sect. 5. Section 6 concludes with the main findings.

2 Related works

Multicore CPU and GPGPU technologies are widely used in non-real-time and real-time image processing applications. It is well known that multithreading, multicore and GPU architectures have advantages over serial computing [11]. A short literature review related to these technologies is given below.

In their works [12,13,14,15,16,17], Thapliyal and Arabnia discuss a historical perspective and relevant context on how hardware and software can work in concert on scalable multiprocessor systems, with a number of illustrative examples and applications in imaging science. In fact, the imaging architecture presented in these works can be considered an early design of GPU processor architectures.

There are many studies reported in the literature on non-real-time image segmentation using the thresholding technique [18, 19]. Performance was and still remains a pressing issue in real-time image processing applications. To this end, different algorithms and techniques have been developed for serial computing [20,21,22]. Despite some performance improvements in these works, it is very difficult to satisfy real-time conditions with serial computing. Researchers have looked into alternative solutions and turned to multicore CPU and GPGPU technologies to address this issue. At the same time, in order to use these technologies efficiently, different platforms, such as OpenMP and CUDA, have been developed and are widely used. For example, the OpenMP platform has been used in multithreaded image processing and image segmentation applications with multicore computing [23]. The CUDA platform has been used for parallel image segmentation with region growing, watershed and Otsu binarization algorithms on the GPU [24,25,26,27]. The reduction sweep algorithm was used for image segmentation on both the CPU and the GPU [28]. In [29], several techniques for image segmentation were implemented using CUDA and a GPU, and the processing time was accelerated about 20 times. The authors of [30] present the results of image segmentation on video with a frame rate of 30 Hz using CUDA and a GPU. Despite the existing works, more efficient techniques and algorithms are needed to satisfy the demand for higher speed and lower cost. This paper tries to meet this need.

To accelerate image thresholding, existing works transfer the images to the GPU one by one and process each pixel in a separate core. In this paper, the images are combined before transmission and multiple pixels are processed in one core. Due to these contributions, a higher acceleration rate is obtained.

3 Real-time image processing techniques

In this section, we present three techniques: (1) Serial thresholding (Sect. 3.1); (2) Parallel thresholding on a multicore CPU with OpenMP (Sect. 3.2); and (3) Parallel thresholding on a GPU with CUDA (Sect. 3.3). The last one is further divided into four techniques: SISP (Sect. 3.3.1), SIMP (Sect. 3.3.2), MISP (Sect. 3.3.3) and MIMP (Sect. 3.3.4).

The real-time applications in this study are related to the inspection of certain defects on the entire surface of metallic and cylindrical objects. Images taken from the entire surface of the same metallic and cylindrical moving object were used to inspect the defects in real time. In order to detect certain defects of a single object, the image processing steps should be applied to K images covering its entire surface. Time is limited in these applications. In this paper, only the first step of image processing, image segmentation, is handled. Thresholding is the simplest and a fast way to perform image segmentation. Parallel programming techniques, such as multicore and multiprocessing technologies, were used to speed up the thresholding of the metallic and cylindrical object.

Firstly, serial thresholding is described. Then, parallel thresholding on a multicore CPU with OpenMP is presented. Finally, parallel thresholding on a GPU with CUDA is discussed.

3.1 Serial thresholding

Image segmentation is the process of dividing the individual elements of an image into a set of groups so that all elements in a group have a common property. Segmentation allows visualization of the structures of interest, removing unnecessary information [31]. Thresholding is the simplest, most commonly used and most popular technique for segmentation. Thresholding techniques can be classified into two categories: bi-level and multilevel. In this paper, bi-level thresholding is used for the segmentation of objects and the background [19]. Thresholding is often used as a preprocessing step, followed by other post-processing techniques [32]. Let us denote by g(x, y) the segmented image obtained from f(x, y). If T is the threshold value, the resulting image is given by the following expression.

$$\begin{aligned} g(x,y) = \left\{ \begin{array}{ll} 255, &{} \hbox {if } f(x,y) \ge T\\ 0, &{} \hbox {if } f(x,y) < T \end{array}\right. \end{aligned}$$
(1)

According to serial thresholding, Eq. (1) should be calculated on each pixel (x, y) of the original image f(x, y), where \(x=1,2,\ldots ,N\) and \(y=1,2,\ldots ,M\). The performance, or processing time, of serial thresholding is defined as follows:

$$\begin{aligned} t_\mathrm{{ST}}= N*M*\Delta t, \end{aligned}$$
(2)

where \(t_\mathrm{{ST}}\) is the processing time of serial thresholding and \(\Delta t\) is the processing time for thresholding on one pixel.
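
For illustration, a minimal serial C++ sketch of this thresholding is given below. The image container, parameter names and row-major pixel layout are assumptions made only for this sketch, not the paper's actual implementation.

```cpp
#include <cstdint>
#include <vector>

// Minimal serial bi-level thresholding sketch implementing Eq. (1).
// Assumes an 8-bit grayscale image of N rows and M columns stored
// row by row; names and layout are illustrative only.
std::vector<uint8_t> thresholdSerial(const std::vector<uint8_t>& f,
                                     int N, int M, uint8_t T)
{
    std::vector<uint8_t> g(f.size());
    for (int x = 0; x < N; ++x) {
        for (int y = 0; y < M; ++y) {
            const int i = x * M + y;          // row-major index of pixel (x, y)
            g[i] = (f[i] >= T) ? 255 : 0;     // Eq. (1)
        }
    }
    return g;
}
```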

3.2 Parallel thresholding on a multicore CPU with OpenMP

In order to accelerate the thresholding process and satisfy the real-time conditions, shared-memory multicore programming with OpenMP is proposed. An OpenMP program always begins with a single thread of control, called the master thread, which exists for the entire run time of the program (Fig. 1). When the master thread encounters a parallel region, it forks new threads, each with its own stack and execution context. At the end of the parallel region, the forked threads are terminated, the intermediate results are joined, and the master thread continues the program execution, as shown in Fig. 1.

Fig. 1 Thread organization with OpenMP

To achieve optimal performance in multithreaded applications, different scheduling types and chunk sizes should be tested. With OpenMP, static, dynamic and guided scheduling can be specified. Static scheduling divides the loop into equal-sized chunks, or chunks as equal as possible when the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. Dynamic scheduling uses an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread finishes, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size for dynamic scheduling is 1. Guided scheduling is similar to dynamic scheduling, but the chunk size starts off large and decreases in order to better handle the load imbalance between iterations; the optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size S for guided scheduling is defined as follows:

$$\begin{aligned} S=\frac{N_\mathrm{{L}}}{N_\mathrm{{T}}}, \end{aligned}$$
(3)

where \( N_\mathrm{{L}}\) is the number of iterations in the loop and \( N_\mathrm{{T}}\) is the number of threads.

The processing time of parallel thresholding with OpenMP (\( t_\mathrm{{MP}}\)) is defined as follows:

$$\begin{aligned} t_\mathrm{{MP}}= \frac{t_\mathrm{{ST}}}{N_\mathrm{{T}}}+t_{0}= \frac{N*M*\Delta t}{N_\mathrm{{T}}}+t_{0}, \end{aligned}$$
(4)

where \( t_{0}\) is the overhead time for forking and joining the threads. One of the factors that affects \( t_{0}\) is the chunk size S.
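
A minimal OpenMP sketch of this parallelization is shown below, using the same flat pixel layout as the serial sketch above. The schedule type and chunk size shown here are illustrative choices; they are exactly the parameters varied in the experiments of Sect. 5.1.

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>

// OpenMP parallel thresholding sketch: the pixel loop is divided among the
// threads of the team according to the selected schedule and chunk size.
// schedule(dynamic, chunk) is one illustrative choice; static and guided
// scheduling can be selected in the same way.
void thresholdOpenMP(const std::vector<uint8_t>& f, std::vector<uint8_t>& g,
                     uint8_t T, int chunk)
{
    const int total = static_cast<int>(f.size());   // N * M pixels
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < total; ++i) {
        g[i] = (f[i] >= T) ? 255 : 0;                // Eq. (1) per pixel
    }
}
```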

3.3 Parallel thresholding on a GPU with CUDA

The CUDA platform consists of functions, called kernels, which can be executed simultaneously by a large number of threads on the GPU. Threads are grouped into warps. A warp consists of 32 threads that are executed in SIMD fashion; warps execute independently of each other, while the threads within a warp execute the same instruction on different data elements in parallel [33].

In order to parallelize the thresholding process, kernels should be used. Streams are used to organize the kernels to work in parallel (Table 1).

As shown in Table 1, firstly, the K streams are defined (Line 1) and created (Lines 2, 3). Then, the data (images) for the created streams are transmitted asynchronously from the CPU to the GPU (Lines 4, 6). After that, the kernels execute the same instructions on the K images asynchronously (Lines 5, 7). Finally, the results are transmitted from the GPU to the CPU (Lines 8, 9).
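
A condensed sketch of this stream organization is given below. It assumes K images of numPixels 8-bit pixels each (so numPixels bytes per image), pinned host buffers (h_img, h_out), preallocated device buffers (d_img, d_out) and a placeholder kernel name (thresholdKernel); error checking is omitted for brevity, and the line comments refer to the lines of Table 1.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel; its body follows Algorithm 3 or Algorithm 4 (Sect. 3.3).
__global__ void thresholdKernel(const unsigned char* in, unsigned char* out,
                                int numPixels, unsigned char T);

// Sketch of the multiple-kernel organization by streams described in Table 1.
void thresholdWithStreams(unsigned char** h_img, unsigned char** h_out,
                          unsigned char** d_img, unsigned char** d_out,
                          int K, int numPixels, unsigned char T)
{
    const int threadsPerBlock = 1024;                  // as used in Sect. 5.2
    const int blocks = (numPixels + threadsPerBlock - 1) / threadsPerBlock;

    std::vector<cudaStream_t> streams(K);              // Line 1: define K streams
    for (int k = 0; k < K; ++k)
        cudaStreamCreate(&streams[k]);                 // Lines 2, 3: create streams

    for (int k = 0; k < K; ++k) {
        cudaMemcpyAsync(d_img[k], h_img[k], numPixels,
                        cudaMemcpyHostToDevice, streams[k]);     // Lines 4, 6
        thresholdKernel<<<blocks, threadsPerBlock, 0, streams[k]>>>(
            d_img[k], d_out[k], numPixels, T);                   // Lines 5, 7
        cudaMemcpyAsync(h_out[k], d_out[k], numPixels,
                        cudaMemcpyDeviceToHost, streams[k]);     // Lines 8, 9
    }
    cudaDeviceSynchronize();                           // wait for all streams

    for (int k = 0; k < K; ++k)
        cudaStreamDestroy(streams[k]);
}
```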

Images can be sent from the CPU to the GPU one by one or in a combined data array. Images can be processed in the GPU cores either one pixel per core or multiple pixels per core. Results can be returned from the GPU to the CPU one by one or in a combined data array. Algorithm 1, for sending K images from the CPU to the GPU one by one, processing them on the GPU and returning the results from the GPU to the CPU one by one, is given in Table 2. Algorithm 2, for sending K images from the CPU to the GPU in a combined data array, processing them on the GPU and returning the results from the GPU to the CPU in a combined data array, is given in Table 3. Algorithm 3, for distributing and processing the images one pixel per GPU core, is given in Table 4. Algorithm 4, for distributing and processing the images P pixels per GPU core, is given in Table 5.

Table 1 Multiple kernel organization by streams
Table 2 Algorithm 1: Single image transmission
Table 3 Algorithm 2: Multiple image transmission
Table 4 Algorithm 3: Single pixel processing in the GPU

Four techniques are proposed to execute thresholding on the GPU with CUDA: (1) SISP; (2) SIMP; (3) MISP; and (4) MIMP (Table 6).

3.3.1 SISP technique

In this technique, the images are transmitted from the CPU to the GPU one by one and the results are returned from the GPU to the CPU one by one using the proposed Algorithm 1 (Table 2). Also, the pixels of the images are distributed and processed one pixel per GPU core using the proposed Algorithm 3 (Table 4).
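
A minimal kernel sketch for this single-pixel case is given below. It assumes that "one pixel per core" maps to one pixel per CUDA thread and that the image is an 8-bit array of numPixels elements; the kernel name is illustrative.

```cuda
// Single-pixel processing sketch in the spirit of Algorithm 3 (Table 4):
// each thread thresholds exactly one pixel of the transmitted image.
__global__ void thresholdOnePixel(const unsigned char* f, unsigned char* g,
                                  int numPixels, unsigned char T)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // global pixel index
    if (i < numPixels)
        g[i] = (f[i] >= T) ? 255 : 0;                       // Eq. (1)
}
```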

3.3.2 SIMP technique

In this technique, the images are transmitted from the CPU to the GPU one by one and the results are returned from the GPU to the CPU one by one using the proposed Algorithm 1 (Table 2). Also, the pixels of the images are distributed and processed as multiple pixels per GPU core using the proposed Algorithm 4 (Table 5): the pixels are distributed among the GPU cores as P pixels per core, where the number of pixels per core depends on the GPU hardware.

3.3.3 MISP technique

In this technique, the images are transmitted from the CPU to the GPU in a combined data array [34] and the results are returned from the GPU to the CPU in a combined data array using the proposed Algorithm 2 (Table 3). After the combined results are transferred to the CPU, they are separated according to the sizes of the images. Also, the pixels of the images are distributed and processed one pixel per GPU core using the proposed Algorithm 3 (Table 4).

3.3.4 MIMP technique

In this technique, the images are transmitted from the CPU to the GPU in a combined data array and the results are returned from the GPU to the CPU in a combined data array using the proposed Algorithm 2 (Table 3). Also, the pixels of the images are distributed and processed as multiple pixels per GPU core using the proposed Algorithm 4 (Table 5).
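
A minimal kernel sketch for the multi-pixel case is given below; it also applies to SIMP, the only difference being whether the input array holds one image or the K combined images. The value of P, the contiguous P-pixel-per-thread layout and the names are assumptions for illustration.

```cuda
// Multi-pixel processing sketch in the spirit of Algorithm 4 (Table 5):
// the K images are packed into one combined array and each thread
// thresholds P consecutive pixels (P = 4, i.e. 32 bits, in Sect. 5.2).
__global__ void thresholdMultiPixel(const unsigned char* combined,
                                    unsigned char* out,
                                    int totalPixels, int P, unsigned char T)
{
    const int base = (blockIdx.x * blockDim.x + threadIdx.x) * P;
    for (int j = 0; j < P; ++j) {
        const int i = base + j;
        if (i < totalPixels)
            out[i] = (combined[i] >= T) ? 255 : 0;   // Eq. (1)
    }
}
```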

Table 5 Algorithm 4: Multi pixel processing in the GPU
Table 6 Proposed techniques

4 Image transmission between CPU and GPU

In real-time applications, data transmission time is also a very important factor. In systems with a GPU, the data transmission time consists of two components: the time spent transmitting the data from the CPU to the GPU and from the GPU to the CPU, respectively. Before executing a kernel on the GPU, all of the data used by the kernel need to be transmitted from the CPU memory to the GPU memory. After execution, the results produced by the kernel most likely need to be transmitted back to the CPU memory. The cudaMemcpy function is used to transmit data in both directions.

The transmission time in each direction consists of two components. The first component is the latency, which includes the preparation overhead; this overhead may occur due to instruction decoding, memory latency, waiting for bus access and other causes. The second component is the propagation time, which depends on the bandwidth (the number of bits transferred per second). This property has a great impact on the performance of a graphics processor, since all data used in the computation must be copied to it.

The Hockney model describes, in its simplest form, how the bandwidth and latency affect the transmission time (t) necessary to transmit a given set of data [35].

$$\begin{aligned} t=L+m/B , \end{aligned}$$
(5)

where L is the latency, B is the bandwidth and m is the size of the transmitted data.

The latency and bandwidth depend on the graphics card, memory allocation, memory architecture, memory speed, CPU architecture, CPU speed, chipset and bus clock frequency. Calculating the transmission time while taking all of these parameters into account is not an easy task. In practice, the measured transmission time is used.
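
As an illustration of how the transmission time can be measured, the sketch below times a single host-to-device copy with CUDA events; the buffer names and sizes are placeholders, and the measured value can then be related to the Hockney estimate t = L + m/B of Eq. (5).

```cpp
#include <cuda_runtime.h>

// Sketch: measure the host-to-device transmission time of one image with
// CUDA events. d_img, h_img and numBytes are illustrative placeholders.
float measureHostToDeviceTime(void* d_img, const void* h_img, size_t numBytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_img, h_img, numBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // measured transmission time (ms)

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```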

5 Experimental results

The experiments were related to the real-time detection of standard defects, such as scratches, dents, wrinkles and crimps, on the surface of military cases [36, 37] (Fig. 2). Eight images covering the entire 360-degree (8 \(\times \) 45 degree) surface of the same moving military case were used to detect the defects (Fig. 3). A multicore CPU with OpenMP and a GPGPU with CUDA were used to perform the parallel segmentation of the military cases and background using thresholding. The speedup rate (r) was used to evaluate the segmentation techniques:

$$\begin{aligned} r=t_\mathrm{{ST}}/t_\mathrm{{PT}}, \end{aligned}$$
(6)

where \(t_\mathrm{{PT}}\) is the processing time of parallel thresholding.

Fig. 2 Military cases

Fig. 3 Images covering the entire 360-degree (8 \(\times \) 45 degree) surface of the same military case

The following platform was used: Intel Core i7-3630QM CPU with 4 cores and hyper-threading technology; 8 GB RAM; Windows 7. The codes were written in C++ using Visual Studio 2012. Images with different resolutions (320 \(\times \) 240, 640 \(\times \) 480, and 1280 \(\times \) 960) were used.

5.1 Parallel thresholding on a multicore CPU with OpenMP

Static, dynamic and guided scheduling types with different chunk sizes were implemented to speed up the segmentation process (Table 7).

Table 7 Experiment results on a multicore CPU with OpenMP

Table 7 presents the experimental speedup results for different scheduling types with different chunk sizes. As seen, the dynamic and guided scheduling types gave the best results. By increasing the chunk size, the speedup decreases for all scheduling types. In summary, in order to obtain the best results with OpenMP, chunk sizes should be as small as possible and dynamic or guided scheduling should be used. There is one important point to be underlined: as shown in Table 7, the speedup values with dynamic and guided scheduling exceed 4. The reason is that the four-core CPU has hyper-threading technology.

5.2 Parallel thresholding on a GPU with CUDA

NVIDIA GeForce GT 635M with 96 cores, Tesla K20 with 2496 cores and Tesla K40 with 2880 cores were used. The number of threads per block was set to 1024. Four techniques were implemented: (1) SISP; (2) SIMP; (3) MISP; and (4) MIMP.

5.2.1 SISP

In this technique, the eight images were sent and processed one by one. The pixels of the images were distributed as one pixel (or 8 bits) per GPU core (Table 8).

Table 8 Data transmission and processing performance evaluation for the SISP technique
Table 9 Data transmission and processing performance evaluation for the SIMP technique
Table 10 Data transmission and processing performance evaluation for the MISP technique
Table 11 Data transmission and processing performance evaluation for the MIMP technique
Table 12 Comparison results of the proposed techniques without transmission time
Table 13 Comparison results of the proposed techniques with transmission time
Fig. 4 Comparison results of the proposed techniques using: a GeForce GT 635M without transmission time; b GeForce GT 635M with transmission time; c Tesla K20 without transmission time; d Tesla K20 with transmission time; e Tesla K40 without transmission time; f Tesla K40 with transmission time

Fig. 5 a The original image; b Segmentation result by parallel thresholding

As seen, Tesla K40 gave the best result: a 35-fold improvement without transmission time and a 12-fold improvement with transmission time, in comparison with serial computing. Another finding is that, in general, increasing the image resolution decreases the speedup rate for all GPUs in both cases (without and with transmission time).

5.2.2 SIMP

In this technique, the eight images were sent and processed one by one. The pixels of the images were distributed as four pixels (or 32 bits) per GPU core (Table 9).

As seen, Tesla K40 gave the best result: a 36-fold improvement without transmission time and a 13-fold improvement with transmission time, in comparison with serial computing. Another finding is that increasing the image resolution decreases the speedup rate for all GPUs in both cases (without and with transmission time).

5.2.3 MISP

In this technique, the eight images were combined into one data array. This data array was sent and processed by a kernel. The pixels of the images were distributed as one pixel (or 8 bits) per GPU core (Table 10).

As seen, Tesla K40 gave the best result: a 54-fold improvement without transmission time and a 16-fold improvement with transmission time, in comparison with serial computing. Another finding is that, without transmission time, increasing the image resolution decreases the speedup rate for GeForce and Tesla K40 and increases it for Tesla K20. With transmission time, increasing the image resolution decreases the speedup rate for GeForce, Tesla K20 and Tesla K40.

5.2.4 MIMP

In this technique, the eight images were combined into one data array. This data array was sent and processed by a kernel. The pixels of the images were distributed as four pixels (or 32 bits) per GPU core (Table 11).

As seen, Tesla K40 gave the best result: a 71-fold improvement without transmission time and a 17-fold improvement with transmission time, in comparison with serial computing. Another finding is that, without transmission time, increasing the image resolution decreases the speedup rate for GeForce and increases it for Tesla K20 and Tesla K40. With transmission time, in general, increasing the image resolution decreases the speedup rate for GeForce, Tesla K20 and Tesla K40.

The comparison results of the proposed techniques with CUDA in terms of speedup are given in Tables 12, 13 and Fig. 4.

Different computers were used to host the GeForce, Tesla K20 and Tesla K40 cards. Due to the differences between the CPUs of these computers, different serial times were measured for processing images of the same resolution. For example, the serial time to process an image with a resolution of 320 \(\times \) 240 was measured as 4.88, 8.25 and 5.79 ms on the different CPUs (see the serial time column in Table 8). The speedup rate of each GPU was therefore affected by the capacity of its CPU.

In general, GeForce gave less improvement than Tesla K20 and K40. This is due to its smaller number of cores (96) in comparison with Tesla K20 and K40, which have 2496 and 2880 cores, respectively. As shown in Tables 12 and 13, the best speedup rates without and with transmission time were obtained with Tesla K40 for all techniques and image resolutions. Among all techniques, MIMP gave the maximum speedup: 71 times without transmission time and 17 times with transmission time. From Tables 8, 9, 10, 11, 12 and 13, it can be summarized that the Tesla K40 GPU and the MIMP technique should be used to obtain the maximum performance. As seen, there is a big difference between the speedup rates without and with transmission time; the reason is the transmission time between the CPU and the GPU.

As seen, Tesla K40 gave the best results for all techniques. With Tesla K40, the speedup rates of the MISP and MIMP techniques were higher than those of SISP and SIMP. Another point with the Tesla cards was that the speedup rate increased as the image resolution increased. In summary, in order to obtain the best results with CUDA, the MISP and MIMP techniques should be used.

An example for segmentation results with parallel thresholding is given in Fig. 5.

6 Conclusion

This paper has presented image processing applications using multicore and multiprocessing technologies to satisfy real-time conditions. To this end, algorithms and techniques were proposed for parallel image segmentation through thresholding of K images covering the entire surface of the same metallic and cylindrical moving object. A multicore CPU with OpenMP and a GPGPU with CUDA were used to implement the thresholding of military cases using eight real images covering their entire surface. The implementation results were compared with the results of serial computing in terms of the speedup metric. The experimental results have shown that a GPU with CUDA has a huge capacity to increase the performance of real-time applications.

The best speedup rates without and with transmission time were obtained with Tesla K40 for all techniques and image resolutions. Four techniques, SISP, SIMP, MISP and MIMP, were proposed to perform real-time thresholding. Among all proposed techniques, MIMP gave the maximum speedup: 71 times without transmission time and 17 times with transmission time in comparison with serial computing. As seen, there is a big difference between the speedup rates without and with transmission time; the reason is the transmission time between the CPU and the GPU. In summary, the Tesla K40 GPU and the MIMP technique should be used to obtain the maximum performance.

As future work, the time to transmit images from the CPU to the GPU and results from the GPU to the CPU will be analyzed and optimized. More studies can be made on the chained-cubic tree and optical chained-cubic tree topologies; it would be interesting to apply our implementation to these topologies [38, 39].