1 Introduction

Image processing requires a long time, which is tightly limited in real-time applications [1, 2]. Processing time increases with both the number and the resolution of the images [3]. The problem is that serial image processing does not satisfy real-time conditions [3]. Parallel computing techniques, especially multicore and multiprocessor technologies, should be used to solve this problem [4]. Parallel algorithms are much more complex than serial ones. Generally, parallel algorithms are designed by modifying serial algorithms [5]. The resulting parallel algorithms can be further improved and accelerated by taking into account the hardware on which they will run.

Segmentation is one of the steps in image processing, and thresholding is widely used for this purpose. In real-time applications, multicore CPUs and GPGPUs should be used to execute thresholding on the many images covering the entire surface of the same metallic and cylindrical moving object in order to satisfy real-time conditions. A multicore CPU is a single computing component with two or more independent central processing units (called cores) [6, 7]. Pthreads, OpenMP (Open Multiprocessing), TBB (Threading Building Blocks) and Cilk are APIs (Application Programming Interfaces) for efficiently using the capacity of a multicore CPU [8]. In this paper, OpenMP, a general-purpose and platform-independent API that supports shared-memory multiprocessing programming in C, C++ and FORTRAN, will be used.

A Graphics Processing Unit (GPU) is a Single Instruction stream, Multiple Data streams (SIMD) architecture in which the same instruction is performed on all data elements in parallel. The pixels of an image can be considered as separate data elements, so a GPU is a suitable architecture for processing the data elements of an image in parallel [9]. General-Purpose computing on Graphics Processing Units (GPGPU) is a means of increasing the utilization of the GPU. There are many platforms for efficiently using the capacity of GPGPU, such as CUDA, DirectCompute and OpenCL. The CUDA platform, which is the most common one, will be used in this paper [10].

This paper shows that more efficient algorithms and techniques still need to be developed to improve the performance of real-time image processing applications. One of the aims of this study is to contribute to this area using OpenMP and CUDA. To this end, bi-level thresholding is applied in parallel to the images covering the entire surface of the same metallic and cylindrical moving object with the following five techniques. One technique is related to CPU programming with the OpenMP platform. In this context, shared-memory multicore programming with OpenMP, scheduling threads on cores with different parameters, and performance in terms of execution time are analyzed. The other four techniques are related to GPU programming with the CUDA platform:

  1. Single Image Transmission with Single Pixel Processing (SISP), in which the images are transmitted from the CPU to the GPU one by one and the pixels of the images are processed one pixel per GPU core;

  2. Single Image Transmission with Multiple Pixel Processing (SIMP), in which the images are transmitted from the CPU to the GPU one by one and the pixels of the images are processed multiple pixels per GPU core;

  3. Multiple Image Transmission with Single Pixel Processing (MISP), in which multiple images are combined and transmitted from the CPU to the GPU as a single data unit and the pixels of the images are processed one pixel per GPU core;

  4. Multiple Image Transmission with Multiple Pixel Processing (MIMP), in which multiple images are combined and transmitted from the CPU to the GPU as a single data unit and the pixels of the images are processed multiple pixels per GPU core.

Performance analysis related to execution time was performed by comparing the results obtained with these techniques against serial computing. The multicore CPU technique showed that, with small chunk sizes and dynamic or guided scheduling, the execution time decreases approximately four times. All GPU techniques were implemented on GeForce, Tesla K20 and Tesla K40. Tesla K40 gave the best results: improvements of 35 times (SISP), 36 times (SIMP), 54 times (MISP) and 71 times (MIMP) in comparison with serial computing.

The rest of the paper is organized as follows. In Sect. 2, some related works are presented. In Sect. 3, the real-time image processing techniques are proposed. Section 4 describes the image transmission techniques between the CPU and the GPU. The experimental results are given in Sect. 5. Section 6 concludes with the main findings.

2 Related works

Multicore CPU and GPGPU technologies are widely used in non-real-time and real-time image processing applications. It is well known that multithreading, multicore and GPU architectures have advantages over serial computing [11]. A short literature review related to these technologies is given below.

In their works [12,13,14,15,16,17], Thapliyal and Arabnia discuss a historical perspective and relevant context on how hardware and software can work in concert on scalable multiprocessor systems, with a number of illustrative examples and applications in imaging science. In fact, the imaging architecture presented in these works can be considered an early design of GPU processor architectures.

There are many studies reported in the literature on non-real-time image segmentation using the thresholding technique [18, 19]. Performance was and still remains a pressing issue in real-time image processing applications. To this end, different algorithms and techniques have been developed for serial computing [20,21,22]. Despite some performance improvements in these works, it is very difficult to satisfy real-time conditions with serial computing. Researchers have looked into alternative solutions and turned to multicore CPU and GPGPU technologies to address this issue. At the same time, in order to use these technologies efficiently, different platforms, such as OpenMP and CUDA, have been developed and are widely used. For example, the OpenMP platform has been used in multithreaded image processing and image segmentation applications with multicore computing [23]. The CUDA platform has been used for parallel image segmentation with region growing, watershed and Otsu binarization algorithms on the GPU [24,25,26,27]. The reduction sweep algorithm was used for image segmentation on both the CPU and the GPU [28]. In [29], several techniques for image segmentation were implemented using CUDA and a GPU, and the processing time was accelerated about 20 times. The authors of [30] present the results of image segmentation on video with a frame rate of 30 Hz using CUDA and a GPU. Despite the existing works, more efficient techniques and algorithms are needed to satisfy the demand for higher speed and lower cost. This paper tries to meet this need.

To accelerate image thresholding, existing works transfer the images to the GPU one by one and process each pixel in a separate core. In this paper, the images are combined before transmission and multiple pixels are processed in one core. Due to these contributions, a higher acceleration rate is obtained.

3 Real-time image processing techniques

In this section, we present three techniques: (1) Serial thresholding (Sect. 3.1); (2) Parallel thresholding on a multicore CPU with OpenMP (Sect. 3.2); and (3) Parallel thresholding on a GPU with CUDA (Sect. 3.3). The last one is further divided into four techniques: SISP (Sect. 3.3.1), SIMP (Sect. 3.3.2), MISP (Sect. 3.3.3) and MIMP (Sect. 3.3.4).

The real-time applications in this study are related to the inspection of certain defects on the entire surface of metallic and cylindrical objects. Images taken from the entire surface of the same metallic and cylindrical moving object were used to inspect the defects in real time. In order to detect certain defects of a single object, the image processing steps should be applied to K images covering its entire surface. Time is limited in these applications. In this paper, only the first step of image processing, image segmentation, is handled. Thresholding is the simplest and a fast way to perform image segmentation. Parallel programming techniques, such as multicore and multiprocessing technologies, were used to speed up the thresholding of the metallic and cylindrical object.

Firstly, serial thresholding is described. Then, parallel thresholding on a multicore CPU with OpenMP is presented. Finally, parallel thresholding on a GPU with CUDA is discussed.

3.1 Serial thresholding

Image segmentation is the process of dividing the individual elements of an image into a set of groups so that all elements in a group have a common property. Segmentation allows visualization of the structures of interest, removing unnecessary information [31]. Thresholding is the simplest, most commonly used and most popular technique for segmentation. Thresholding techniques can be classified into two categories: bi-level and multilevel. In this paper, bi-level thresholding is used for the segmentation of objects and the background [19]. Thresholding is often used as a preprocessing step, followed by other post-processing techniques [32]. Let us denote by g(x, y) the segmented image obtained from f(x, y). If T is the threshold value, the resulting image is given by the following expression.

$$\begin{aligned} g(x,y) = \left\{ \begin{array}{ll} 255, &{} \hbox {if } f(x,y) \ge T\\ 0, &{} \hbox {if } f(x,y) < T \end{array}\right. \end{aligned}$$
(1)

According to serial thresholding, Eq. (1) should be calculated on each pixel (x, y) of the original image f(x, y), where \(x=1,2,\ldots ,N\) and \(y=1,2,\ldots ,M\). The performance, or processing time, of serial thresholding is defined as follows:

$$\begin{aligned} t_\mathrm{{ST}}= N*M*\Delta t, \end{aligned}$$
(2)

where \(t_\mathrm{{ST}}\) is the processing time of serial thresholding and \(\Delta t\) is the processing time for thresholding on one pixel.
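
For illustration, a minimal serial C++ sketch of this thresholding is given below. The image container, parameter names and row-major pixel layout are assumptions made only for this sketch, not the paper's actual implementation.

```cpp
#include <cstdint>
#include <vector>

// Minimal serial bi-level thresholding sketch implementing Eq. (1).
// Assumes an 8-bit grayscale image of N rows and M columns stored
// row by row; names and layout are illustrative only.
std::vector<uint8_t> thresholdSerial(const std::vector<uint8_t>& f,
                                     int N, int M, uint8_t T)
{
    std::vector<uint8_t> g(f.size());
    for (int x = 0; x < N; ++x) {
        for (int y = 0; y < M; ++y) {
            const int i = x * M + y;          // row-major index of pixel (x, y)
            g[i] = (f[i] >= T) ? 255 : 0;     // Eq. (1)
        }
    }
    return g;
}
```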

3.2 Parallel thresholding on a multicore CPU with OpenMP

In order to accelerate the thresholding process and satisfy the real-time conditions, shared-memory multicore programming with OpenMP is proposed. An OpenMP program always begins with a single thread of control, called the master thread, which exists for the entire run time of the program (Fig. 1). When the master thread encounters a parallel region, it forks new threads, each with its own stack and execution context. At the end of the parallel region, the forked threads are terminated, the intermediate results are joined, and the master thread continues the program execution, as shown in Fig. 1.

Fig. 1 Thread organization with OpenMP

To achieve optimal performance in multithreaded applications, different scheduling types and chunk sizes should be tested. With OpenMP, static, dynamic and guided scheduling can be specified. Static scheduling divides the loop into equal-sized chunks, or chunks as equal as possible when the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. Dynamic scheduling uses an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread finishes, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size for dynamic scheduling is 1. Guided scheduling is similar to dynamic scheduling, but the chunk size starts off large and decreases in order to better handle the load imbalance between iterations; the optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size S for guided scheduling is defined as follows:

$$\begin{aligned} S=\frac{N_\mathrm{{L}}}{N_\mathrm{{T}}}, \end{aligned}$$
(3)

where \( N_\mathrm{{L}}\) is the number of iterations in the loop and \( N_\mathrm{{T}}\) is the number of threads.

The processing time of parallel thresholding with OpenMP (\( t_\mathrm{{MP}}\)) is defined as follows:

$$\begin{aligned} t_\mathrm{{MP}}= \frac{t_\mathrm{{ST}}}{N_\mathrm{{T}}}+t_{0}= \frac{N*M*\Delta t}{N_\mathrm{{T}}}+t_{0}, \end{aligned}$$
(4)

where \( t_{0}\) is the overhead time for forking and joining the threads. One of the factors that affects \( t_{0}\) is the chunk size S.
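
A minimal OpenMP sketch of this parallelization is shown below, using the same flat pixel layout as the serial sketch above. The schedule type and chunk size shown here are illustrative choices; they are exactly the parameters varied in the experiments of Sect. 5.1.

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>

// OpenMP parallel thresholding sketch: the pixel loop is divided among the
// threads of the team according to the selected schedule and chunk size.
// schedule(dynamic, chunk) is one illustrative choice; static and guided
// scheduling can be selected in the same way.
void thresholdOpenMP(const std::vector<uint8_t>& f, std::vector<uint8_t>& g,
                     uint8_t T, int chunk)
{
    const int total = static_cast<int>(f.size());   // N * M pixels
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < total; ++i) {
        g[i] = (f[i] >= T) ? 255 : 0;                // Eq. (1) per pixel
    }
}
```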

3.3 Parallel thresholding on a GPU with CUDA

The CUDA platform consists of functions, called kernels, which can be executed simultaneously by a large number of threads on the GPU. Threads are grouped into warps. A warp consists of 32 threads that are executed in SIMD fashion; warps execute independently of each other, while the threads within a warp execute the same instruction on different data elements in parallel [33].

In order to parallelize the thresholding process, kernels should be used. Streams are used to organize the kernels to work in parallel (Table 1).

As shown in Table 1, firstly, the K streams are defined (Line 1) and created (Lines 2, 3). Then, the data (images) for the created streams are transmitted asynchronously from the CPU to the GPU (Lines 4, 6). After that, the kernels execute the same instructions on the K images asynchronously (Lines 5, 7). Finally, the results are transmitted from the GPU to the CPU (Lines 8, 9).
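
A condensed sketch of this stream organization is given below. It assumes K images of numPixels 8-bit pixels each (so numPixels bytes per image), pinned host buffers (h_img, h_out), preallocated device buffers (d_img, d_out) and a placeholder kernel name (thresholdKernel); error checking is omitted for brevity, and the line comments refer to the lines of Table 1.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel; its body follows Algorithm 3 or Algorithm 4 (Sect. 3.3).
__global__ void thresholdKernel(const unsigned char* in, unsigned char* out,
                                int numPixels, unsigned char T);

// Sketch of the multiple-kernel organization by streams described in Table 1.
void thresholdWithStreams(unsigned char** h_img, unsigned char** h_out,
                          unsigned char** d_img, unsigned char** d_out,
                          int K, int numPixels, unsigned char T)
{
    const int threadsPerBlock = 1024;                  // as used in Sect. 5.2
    const int blocks = (numPixels + threadsPerBlock - 1) / threadsPerBlock;

    std::vector<cudaStream_t> streams(K);              // Line 1: define K streams
    for (int k = 0; k < K; ++k)
        cudaStreamCreate(&streams[k]);                 // Lines 2, 3: create streams

    for (int k = 0; k < K; ++k) {
        cudaMemcpyAsync(d_img[k], h_img[k], numPixels,
                        cudaMemcpyHostToDevice, streams[k]);     // Lines 4, 6
        thresholdKernel<<<blocks, threadsPerBlock, 0, streams[k]>>>(
            d_img[k], d_out[k], numPixels, T);                   // Lines 5, 7
        cudaMemcpyAsync(h_out[k], d_out[k], numPixels,
                        cudaMemcpyDeviceToHost, streams[k]);     // Lines 8, 9
    }
    cudaDeviceSynchronize();                           // wait for all streams

    for (int k = 0; k < K; ++k)
        cudaStreamDestroy(streams[k]);
}
```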

Images can be sent from the CPU to the GPU one by one or in a combined data array. Images can be processed in the GPU cores either one pixel per core or multiple pixels per core. Results can be returned from the GPU to the CPU one by one or in a combined data array. Algorithm 1, for sending K images from the CPU to the GPU one by one, processing them on the GPU and returning the results from the GPU to the CPU one by one, is given in Table 2. Algorithm 2, for sending K images from the CPU to the GPU in a combined data array, processing them on the GPU and returning the results from the GPU to the CPU in a combined data array, is given in Table 3. Algorithm 3, for distributing and processing the images one pixel per GPU core, is given in Table 4. Algorithm 4, for distributing and processing the images P pixels per GPU core, is given in Table 5.

Table 1 Multiple kernel organization by streams
Table 2 Algorithm 1: Single image transmission
Table 3 Algorithm 2: Multiple image transmission
Table 4 Algorithm 3: Single pixel processing in the GPU

Four techniques are proposed to execute thresholding on the GPU with CUDA: (1) SISP; (2) SIMP; (3) MISP; and (4) MIMP (Table 6).

3.3.1 SISP technique

In this technique, the images are transmitted from the CPU to the GPU one by one and the results are returned from the GPU to the CPU one by one using the proposed Algorithm 1 (Table 2). Also, the pixels of the images are distributed and processed one pixel per GPU core using the proposed Algorithm 3 (Table 4).
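
A minimal kernel sketch for this single-pixel case is given below. It assumes that "one pixel per core" maps to one pixel per CUDA thread and that the image is an 8-bit array of numPixels elements; the kernel name is illustrative.

```cuda
// Single-pixel processing sketch in the spirit of Algorithm 3 (Table 4):
// each thread thresholds exactly one pixel of the transmitted image.
__global__ void thresholdOnePixel(const unsigned char* f, unsigned char* g,
                                  int numPixels, unsigned char T)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // global pixel index
    if (i < numPixels)
        g[i] = (f[i] >= T) ? 255 : 0;                       // Eq. (1)
}
```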

3.3.2 SIMP technique

In this technique, the images are transmitted from the CPU to the GPU one by one and the results are returned from the GPU to the CPU one by one using the proposed Algorithm 1 (Table 2). Also, the pixels of the images are distributed and processed as multiple pixels per GPU core using the proposed Algorithm 4 (Table 5): the pixels are distributed among the GPU cores as P pixels per core, where the number of pixels per core depends on the GPU hardware.

3.3.3 MISP technique

In this technique, the images are transmitted from the CPU to the GPU in a combined data array [34] and the results are returned from the GPU to the CPU in a combined data array using the proposed Algorithm 2 (Table 3). After the combined results are transferred to the CPU, they are separated according to the sizes of the images. Also, the pixels of the images are distributed and processed one pixel per GPU core using the proposed Algorithm 3 (Table 4).

3.3.4 MIMP technique

In this technique, the images are transmitted from the CPU to the GPU in a combined data array and the results are returned from the GPU to the CPU in a combined data array using the proposed Algorithm 2 (Table 3). Also, the pixels of the images are distributed and processed as multiple pixels per GPU core using the proposed Algorithm 4 (Table 5).
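
A minimal kernel sketch for the multi-pixel case is given below; it also applies to SIMP, the only difference being whether the input array holds one image or the K combined images. The value of P, the contiguous P-pixel-per-thread layout and the names are assumptions for illustration.

```cuda
// Multi-pixel processing sketch in the spirit of Algorithm 4 (Table 5):
// the K images are packed into one combined array and each thread
// thresholds P consecutive pixels (P = 4, i.e. 32 bits, in Sect. 5.2).
__global__ void thresholdMultiPixel(const unsigned char* combined,
                                    unsigned char* out,
                                    int totalPixels, int P, unsigned char T)
{
    const int base = (blockIdx.x * blockDim.x + threadIdx.x) * P;
    for (int j = 0; j < P; ++j) {
        const int i = base + j;
        if (i < totalPixels)
            out[i] = (combined[i] >= T) ? 255 : 0;   // Eq. (1)
    }
}
```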

Table 5 Algorithm 4: Multi pixel processing in the GPU
Table 6 Proposed techniques

4 Image transmission between CPU and GPU

In real-time applications, data transmission time is also a very important factor. In systems with a GPU, the data transmission time consists of two components: the time spent transmitting the data from the CPU to the GPU and from the GPU to the CPU, respectively. Before executing a kernel on the GPU, all of the data used by the kernel need to be transmitted from the CPU memory to the GPU memory. After execution, the results produced by the kernel most likely need to be transmitted back to the CPU memory. The cudaMemcpy function is used to transmit data in both directions.

The transmission time in each direction consists of two components. The first component is the latency, which includes the preparation overhead; this overhead may occur due to instruction decoding, memory latency, waiting for bus access and other causes. The second component is the propagation time, which depends on the bandwidth (the number of bits transferred per second). This property has a great impact on the performance of a graphics processor, since all data used in the computation must be copied to it.

The Hockney model describes, in its simplest form, how the bandwidth and latency affect the transmission time (t) necessary to transmit a given set of data [35].

$$\begin{aligned} t=L+m/B , \end{aligned}$$
(5)

where L is the latency, B is the bandwidth and m is the size of the transmitted data.

The latency and bandwidth depend on the graphics card, memory allocation, memory architecture, memory speed, CPU architecture, CPU speed, chipset and bus clock frequency. Calculating the transmission time while taking all of these parameters into account is not an easy task. In practice, the measured transmission time is used.
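
As an illustration of how the transmission time can be measured, the sketch below times a single host-to-device copy with CUDA events; the buffer names and sizes are placeholders, and the measured value can then be related to the Hockney estimate t = L + m/B of Eq. (5).

```cpp
#include <cuda_runtime.h>

// Sketch: measure the host-to-device transmission time of one image with
// CUDA events. d_img, h_img and numBytes are illustrative placeholders.
float measureHostToDeviceTime(void* d_img, const void* h_img, size_t numBytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_img, h_img, numBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // measured transmission time (ms)

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```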

5 Experimental results

The experiments were related to the real-time detection of standard defects, such as scratches, dents, wrinkles and crimps, on the surface of military cases [36, 37] (Fig. 2). Eight images covering the entire 360-degree (8 \(\times \) 45 degree) surface of the same moving military case were used to detect the defects (Fig. 3). A multicore CPU with OpenMP and a GPGPU with CUDA were used to perform the parallel segmentation of the military cases and background using thresholding. The speedup rate (r) was used to evaluate the segmentation techniques:

$$\begin{aligned} r=t_\mathrm{{ST}}/t_\mathrm{{PT}}, \end{aligned}$$
(6)

where \(t_\mathrm{{PT}}\) is the processing time of parallel thresholding.

Fig. 2 Military cases

Fig. 3 Images covering the entire 360-degree (8 \(\times \) 45 degree) surface of the same military case

The following platform was used: Intel Core i7-3630QM CPU with 4 cores and hyper-threading technology; 8 GB RAM; Windows 7. The codes were written in C++ using Visual Studio 2012. Images with different resolutions (320 \(\times \) 240, 640 \(\times \) 480, and 1280 \(\times \) 960) were used.

5.1 Parallel thresholding on a multicore CPU with OpenMP

Static, dynamic and guided scheduling types with different chunk sizes were implemented to speed up the segmentation process (Table 7).

Table 7 Experiment results on a multicore CPU with OpenMP

Table 7 presents the experimental speedup results for different scheduling types with different chunk sizes. As seen, the dynamic and guided scheduling types gave the best results. By increasing the chunk size, the speedup decreases for all scheduling types. In summary, in order to obtain the best results with OpenMP, chunk sizes should be as small as possible and dynamic or guided scheduling should be used. There is one important point to be underlined: as shown in Table 7, the speedup values with dynamic and guided scheduling exceed 4. The reason is that the four-core CPU has hyper-threading technology.

5.2 Parallel thresholding on a GPU with CUDA

NVIDIA GeForce GT 635M with 96 cores, Tesla K20 with 2496 cores and Tesla K40 with 2880 cores were used. The number of threads per block was set to 1024. Four techniques were implemented: (1) SISP; (2) SIMP; (3) MISP; and (4) MIMP.

5.2.1 SISP

In this technique, the eight images were sent and processed one by one. The pixels of the images were distributed as one pixel (or 8 bits) per GPU core (Table 8).

Table 8 Data transmission and processing performance evaluation for the SISP technique
Table 9 Data transmission and processing performance evaluation for the SIMP technique
Table 10 Data transmission and processing performance evaluation for the MISP technique
Table 11 Data transmission and processing performance evaluation for the MIMP technique
Table 12 Comparison results of the proposed techniques without transmission time
Table 13 Comparison results of the proposed techniques with transmission time
Fig. 4 Comparison results of the proposed techniques using: a GeForce GT 635M without transmission time; b GeForce GT 635M with transmission time; c Tesla K20 without transmission time; d Tesla K20 with transmission time; e Tesla K40 without transmission time; f Tesla K40 with transmission time

Fig. 5 a The original image; b Segmentation result by parallel thresholding

As seen, Tesla K40 gave the best result: a 35-fold improvement without transmission time and a 12-fold improvement with transmission time, in comparison with serial computing. Another finding is that, in general, increasing the image resolution decreases the speedup rate for all GPUs in both cases (without and with transmission time).

5.2.2 SIMP

In this technique, the eight images were sent and processed one by one. The pixels of the images were distributed as four pixels (or 32 bits) per GPU core (Table 9).

As seen, Tesla K40 gave the best result: a 36-fold improvement without transmission time and a 13-fold improvement with transmission time, in comparison with serial computing. Another finding is that increasing the image resolution decreases the speedup rate for all GPUs in both cases (without and with transmission time).

5.2.3 MISP

In this technique, the eight images were combined into one data array. This data array was sent and processed by a kernel. The pixels of the images were distributed as one pixel (or 8 bits) per GPU core (Table 10).

As seen, Tesla K40 gave the best result: a 54-fold improvement without transmission time and a 16-fold improvement with transmission time, in comparison with serial computing. Another finding is that, without transmission time, increasing the image resolution decreases the speedup rate for GeForce and Tesla K40 and increases it for Tesla K20. With transmission time, increasing the image resolution decreases the speedup rate for GeForce, Tesla K20 and Tesla K40.

5.2.4 MIMP

In this technique, the eight images were combined into one data array. This data array was sent and processed by a kernel. The pixels of the images were distributed as four pixels (or 32 bits) per GPU core (Table 11).

As seen, Tesla K40 gave the best result: a 71-fold improvement without transmission time and a 17-fold improvement with transmission time, in comparison with serial computing. Another finding is that, without transmission time, increasing the image resolution decreases the speedup rate for GeForce and increases it for Tesla K20 and Tesla K40. With transmission time, in general, increasing the image resolution decreases the speedup rate for GeForce, Tesla K20 and Tesla K40.

The comparison results of the proposed techniques with CUDA in terms of speedup are given in Tables 12, 13 and Fig. 4.

Different computers were used to host the GeForce, Tesla K20 and Tesla K40 cards. Due to the differences between the CPUs of these computers, different serial times were measured for processing images of the same resolution. For example, the serial time to process an image with a resolution of 320 \(\times \) 240 was measured as 4.88, 8.25 and 5.79 ms on the different CPUs (see the serial time column in Table 8). The speedup rate of each GPU was therefore affected by the capacity of its CPU.

In general, GeForce gave less improvement than Tesla K20 and K40. This is due to its smaller number of cores (96) in comparison with Tesla K20 and K40, which have 2496 and 2880 cores, respectively. As shown in Tables 12 and 13, the best speedup rates without and with transmission time were obtained with Tesla K40 for all techniques and image resolutions. Among all techniques, MIMP gave the maximum speedup: 71 times without transmission time and 17 times with transmission time. From Tables 8, 9, 10, 11, 12 and 13, it can be summarized that the Tesla K40 GPU and the MIMP technique should be used to obtain the maximum performance. As seen, there is a big difference between the speedup rates without and with transmission time; the reason is the transmission time between the CPU and the GPU.

As seen, Tesla K40 gave the best results for all techniques. With Tesla K40, the speedup rates of the MISP and MIMP techniques were higher than those of SISP and SIMP. Another point with the Tesla cards was that the speedup rate increased as the image resolution increased. In summary, in order to obtain the best results with CUDA, the MISP and MIMP techniques should be used.

An example for segmentation results with parallel thresholding is given in Fig. 5.

6 Conclusion

This paper has presented image processing applications using multicore and multiprocessing technologies to satisfy real-time conditions. To this end, algorithms and techniques were proposed for parallel image segmentation through thresholding of K images covering the entire surface of the same metallic and cylindrical moving object. A multicore CPU with OpenMP and a GPGPU with CUDA were used to implement the thresholding of military cases using eight real images covering their entire surface. The implementation results were compared with the results of serial computing in terms of the speedup metric. The experimental results have shown that a GPU with CUDA has a huge capacity to increase the performance of real-time applications.

The best speedup rates without and with transmission time were obtained with Tesla K40 for all techniques and image resolutions. Four techniques, SISP, SIMP, MISP and MIMP, were proposed to perform real-time thresholding. Among all proposed techniques, MIMP gave the maximum speedup: 71 times without transmission time and 17 times with transmission time in comparison with serial computing. As seen, there is a big difference between the speedup rates without and with transmission time; the reason is the transmission time between the CPU and the GPU. In summary, the Tesla K40 GPU and the MIMP technique should be used to obtain the maximum performance.

As future work, the time to transmit images from the CPU to the GPU and results from the GPU to the CPU will be analyzed and optimized. More studies can be made on the chained-cubic tree and optical chained-cubic tree topologies; it would be interesting to apply our implementation to these topologies [38, 39].