1 Introduction

Medical image technologies are being used to help doctors and experts in the medical field. There exist many technologies such as magnetic resonance imaging (MRI), computed tomography (CT), and digital mammography. These technologies produce a huge number of data that can be used to analyze and diagnose tissues abnormalities. Researchers in information technology domain facilitate the analysis process by developing image processing techniques that are accurate, efficient, and fast [22, 31].

Image denoising, image recognition, image rotation and image segmentation are some research areas that apply information technology to medical images [8, 9, 11, 12]. Image segmentation is a method that is used to extract objects from image. It is also used to extract or determine an important region from medical images which is called the region of interest (ROI) [13, 33]. Different techniques are used to segment an ROI such as threshold-based methods, compression-based methods, and histogram-based methods.

Also, there are other methods for determining the ROI which are based on data mining and machine learning. These include clustering methods and region-growing methods [23, 40, 43, 44]. ROI helps medical experts to diagnose the disease and monitor the progress level of cancerous diseases. Doctors can automatically determine the type of a tumor using a classification technique where image segmentation is a key operation [18, 47]. Because of image segmentation’s importance in the field, for instance, computer vision [2729], many different algorithms were proposed in the last decade to enhance image segmentation techniques. Clustering algorithms are among the most popular in image segmentation. The proposed algorithms differ in their accuracy and computational efficiency. This paper studies the most famous and new clustering algorithms and provides an analysis on their feasibility for parallel implementation. We have studied four algorithms which are: fuzzy C-mean, type-2 fuzzy C-mean, interval type-2 fuzzy C-mean, and modified interval type-2 fuzzy C-mean. We have implemented them in a sequential (CPU only) and a parallel hybrid CPU–GPU version.

In the past, parallel programming has been used widely in clusters and distributed systems. Heavy tasks are divided into smaller tasks that are executed by different machines that are connected through a network [15, 16]. Optimization solutions were proposed in the literature to enhance the performance based on the network topology [10, 17]. On the other hand, parallel programming using graphic processing units (GPUs) has emerged which improves the computational efficiency of the algorithms without loosing accuracy. In GPU and CUDA programming, the application can run simultaneously more than 512 threads, while modern CPUs can run up to 32 threads at the same time [6, 7].

In this work, speedup gains of 6\(\times \) to 16\(\times \) were achieved in the parallel implementation over the sequential implementation. Experiments were conducted using medical MRI and digital mammography images [36]. A hybrid CPU–GPU parallel programming was used to achieve high speedup [22]. We detail in this paper our discoveries on the portions of the algorithms that are highly parallel so as to help the image processing community, especially if these algorithms are to be used in real-time processing where efficient computation is critical.

This paper is organized as follows. Section 2 provides a background and related work in the area of image segmentation and parallel programming. Section 3 details the four fuzzy C-mean (FCM)-based algorithms that we considered in this paper. In Sect. 3.2, our parallel implementation of the four algorithms is described. Section 4 shows the results and presents a discussion. Finally, we conclude our work in Sect. 5.

2 Background and related work

This section covers two topics. First is the segmentation in medical images and the different methods in extracting ROI from images. The second topic is the parallel programming using GPUs and and its use to enhance the computation of the segmentation process [21, 41].

2.1 Image segmentation techniques

Many algorithms were proposed in the literature to improve image segmentation operation. Tan and Isa [42] proposed an algorithm to mix histogram-based with fuzzy C-mean technique (HTFCM). The authors compared their algorithm with the ant colony algorithm and showed better results. Also, Tang proposed a segmentation algorithm that uses region-growing methods [43]. The method is based on seed region growing algorithm which was proposed earlier by Adams and Bischof [3]. It selects a seed pixel using the watershed segmentation method that was proposed by Shih and Cheng [40].

Other researchers used clustering methods algorithms due to their high efficiency such as in [42]. Ji et al. proposed in [30] a new segmentation algorithm using Gaussian mixture model (GMM), exception maximization (EM) algorithm and FCM. A local minimum is the main drawback of the GMM which was solved using FCM. The dimension of an image that was used in the authors’ experimental results was 176 \(\times \) 218. They used a 3 % noise and 60 % intensity. The results produced a 39 % accuracy.

Icer [26] proposed another GMM and FCM method to increase the accuracy of image segmentation. Corpus callosum (CC) was used as the dataset in this research. MR brain and midsagittal section of CC were the focus of the effort. Semra Icer used these two methods in two main steps. The first step extracts the GMM to segment the image by a probability density function. The second step applies the FCM. The accuracy of this process was 97 %.

Wang et al. proposed a new modified algorithm for fuzzy C-mean (MFCM) using a filtering method [46]. The accuracy of the classification process is improved due to the filtering method to reduce the effect of the noise. The authors used a diffusion filter with multi-scale fuzzy C-means technique. MRI images for brain were used as the dataset. This effort produces an accuracy of 87 % for images of 12 % noise and 84 % for images of 15 % noise. The FCM and MFCM without filtering produced 77 % accuracy for images of 12 % noise and 70 % accuracy for images of 15 % noise. The multi-scale technique and filtering method improved the classification and robustness.

Ahmed et al. presented a new modified fuzzy C-mean algorithm named bias-corrected fuzzy C-mean (BCFCM) [4]. The inhomogeneity of neighborhood pixels is the main attribute that was used in this paper. They modified the objective function to include the measurement of inhomogeneity of neighborhood pixels. They made a comparison between BCFCM, FCM, and expectation maximization algorithms (EM) for MR brain images. [4, 5] used the FCM version that was used in [46]. BCFCM achieved about 94 % accuracy. The images used in the experiment have a Gaussian noise of \(\sigma \) = 6. FCM achieved 78 % and EM achieved 85 % accuracy for the same MR images of the same Gaussian noise.

Earlier research in clustering technique used the classical FCM algorithm with some modifications or new added steps to improve its accuracy. Some other research suggests to improve the segmentation process by adding new mathematical methods such as the work done by Rhee et al., where the authors developed the FCM using a new equation to measure the membership function of data points with centroids of clusters [23]. After that, Hwang et al. developed an algorithm using hypothesizing upper and lower memberships for each data point [25]. Later, Qiu et al. modified on the algorithm in [25] resulting in a new algorithm that reduces the noise effect of older versions [36].

2.2 Segmentation using parallel programming

After the big improvement of accuracy in image segmentation algorithms, researchers used a new hardware technology to improve the computation time as well. One of the hardware technologies is the graphic processing unit (GPU). GPU is used in high-performance computing (HPC) to reduce the execution time through parallel implementation of algorithms [22, 38].

Rowiska et al. implemented the FCM clustering algorithm in a parallel version [37]. CUDA was used as the programming language. The authors compared the sequential code of the FCM with the parallel version. They tested the two algorithm versions using different colored images with different size. The membership and centroids functions were executed on the GPU side, while the objective function and the termination condition were executed on the CPU side. The sequential implementation was implemented in C++ and MATLAB. The GPU parallel implementation was faster than the CPU implementation by 7\(\times \) (7 times).

Walters et al. proposed a parallel technique for two segmentation algorithms which are: the Markov random fields (MRF) and HMMERs Viterbi [45]. The used dataset was medical images for liver. The hardware that was used in this paper is NVIDIA GPU 8800 GTX. In this research, the speedup of the parallel version was enhanced by 130\(\times \) for MRF algorithm and 38.6\(\times \) for the HMMER algorithm over the sequential CPU implementation.

Pan et al. in [34] parallelized the region-growing method (RGM) and the multi-level watershed method (MLWM) algorithms using GPU hardware. They used abdomen images and brain images as datasets. The hardware that was used in their work was the Geforce 8500 GT. The speedup was enhanced by 8\(\times \) for the region-growing method and 2.3\(\times \) for the multi-level watershed method.

Many researchers such as in [22, 34, 37, 45] utilize the GPU capabilities to increase the computational performance of image processing algorithms. GPUs uses single instruction multiple data (SIMD) technique as the parallel paradigm. While both CPUs and GPUs can run and manage thousands of threads simultaneously via time-slicing, modern CPUs can run 4–12 threads simultaneously whereas GPUs can run a thousand of threads [20, 22].

3 FCM-based algorithms for image segmentation

3.1 Serial version of four FCM algorithms versions

In this section, we present the sequential CPU implementations of the FCM and the type-2 fuzzy C-mean (T2FCM) algorithms. The programming language that was used in this study is C-sharp programming language [24].

3.1.1 Fuzzy C-mean algorithm

FCM algorithm is one of the most famous clustering algorithms that is used to segment data into N clusters [14]. The segmentation process is performed in three main steps. The first step calculates the centroid for each cluster (initially, these centroids are generated randomly). This operation uses Eq. 1.

$$\begin{aligned} V_{j}= \frac{\sum _{i=1}^{n}u^{m}_{ij}\cdot x_{i}}{\sum _{i=1}^{n}u^{m}_{ij}} \end{aligned}$$
(1)

where m is the fuzziness factor, n is the number of points, \(v_j\) is the center of cluster j

The second step calculates the membership of each data point to all clusters’ centroids. This step is carried out using Eq. 2

$$\begin{aligned} u_{ij}= \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| x_{i}-c_{j} \right\| }{\left\| x_{i}-c_{k} \right\| } \right) ^{\frac{2}{m-1}}} \end{aligned}$$
(2)

where C is the number of clusters, \(x_i\) is the object point.

The third step calculates the distance between data points and clusters’ centers. The three steps are repeated until the difference of the total distance between points and centers is less than or equal to some error threshold [37]. This step is called the objective function and is calculated using Eq. 3. Algorithm 1 shows the sequential FCM algorithm.

$$\begin{aligned} Jm = \sum _{i=1}^{n}\sum _{j=1}^{c}u^{m}_{ij}\left\| x_{i}- c_{j} \right\| ^{2}, \quad 1\le m< \infty \end{aligned}$$
(3)

where m is the fuzziness factor, n is the number of points, c is the number of clusters.

figure a

3.1.2 Type-2 fuzzy C-mean algorithm

Rhee et al. in [23] presented the type-2 fuzzy C-mean algorithm (T2FCM) for data classification. It uses the same FCM steps and equations except for the membership function. T2FCM uses Eq. 4 which increases the accuracy of the membership value. Algorithm 2 shows the sequential T2FCM algorithm.

$$\begin{aligned} a_{ij} = u_{ij} - \frac{1 - u_{ij}}{2} \end{aligned}$$
(4)
figure b
figure c

3.1.3 Interval type 2 fuzzy C-mean

Hwang et al. in [25] proposed the algorithm interval type-2 fuzzy C-mean (IT2FCM). This algorithm is based on the traditional FCM but with improvements that produced more accurate results. The algorithm uses two values of fuzziness to calculate the membership of data points. Two membership values are computed which are: the upper and lower membership for each cluster’s center. These two membership values are calculated using Eq. 5 after sorting data points in an ascending order. The centroid values of clusters are then updated using these two memberships. The authors used two values for each center (\(V_\mathrm{left}\) and \(V_\mathrm{Right}\)). The average of the left and right centroid values is the value of the new cluster’s center. The values (\(V_\mathrm{left}\) and \(V_\mathrm{Right}\)) are calculated as follows. First, find the index value of the old cluster’s center K after sorting all data points in an ascending order. Then calculate the center using the same equation of FCM, which is Eq. 1. If index of a point K, then \(u_{ij}=\overline{u_{ij}}\) , otherwise \(u_{ij}=\underline{u_{ij}}\). Equation 6 is used in the algorithm as shown in Algorithm 3.

$$\begin{aligned} \overline{u_{ij}}= & {} \left\{ \begin{array}{l l} \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| X_i - C_j \right\| }{\left\| X_{i}- C_k \right\| } \right) ^{\frac{2}{m_1-1}}} &{} ,\quad {if \sum _{k=1}^{c}\left( \frac{\left\| X_i - C_j \right\| }{\left\| X_i - C_k \right\| } \right) \le P}\\ \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| X_i - C_j \right\| }{\left\| X_i - C_k \right\| } \right) ^{\frac{2}{m_2-1}}} &{} \quad \text {Otherwise} \end{array} \right. \end{aligned}$$
(5)
$$\begin{aligned} \underline{u_{ij}}= & {} \left\{ \begin{array}{l l} \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| X_i - C_j \right\| }{\left\| X_i - C_k \right\| } \right) ^{\frac{2}{m_1-1}}} &{} ,\quad {if \sum _{k=1}^{c}\left( \frac{\left\| X_i - C_j \right\| }{\left\| X_i - C_k \right\| } \right) \ge P}\\ \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| X_i - C_j \right\| }{\left\| X_i - C_k \right\| } \right) ^{\frac{2}{m_2-1}}} &{} \quad \text {Otherwise} \end{array} \right. \end{aligned}$$
(6)

where \(P = \frac{j}{255}\), j is the index of a cluster, and i is the index of point

The objective function is calculated using Eq. 7. Clusters’ centers are updated using Eqs. 8 and 9.

$$\begin{aligned} J_\mathrm{m}= & {} \sum _{k=1}^{C}\sum _{i=1}^{N}\frac{x_i}{\overline{u_{ik}}}+\frac{x_i}{\underline{u_{ik}}} \end{aligned}$$
(7)
$$\begin{aligned} V_\mathrm{Left}= & {} \frac{\sum _{i=1}^{N}u_{ij}^m \times x_i}{\sum _{i=1}^{N}u_{ij}^m} \end{aligned}$$
(8)

where \(u_{ij}\) = \(\overline{u_{ij}}\) if the value of \(i \le K\), and K is the index of center value after sorting the data points. Otherwise, \(u_{ij} = \underline{u_{ij}}\).

$$\begin{aligned} V_\mathrm{Right} = \frac{\sum _{i=1}^{N}u_{ij}^m \times x_i}{\sum _{i=1}^{N}u_{ij}^m}, \end{aligned}$$
(9)

where \(u_{ij} = \underline{u_{ij}}\) if the value of \(i \le K\), and K is the index of center value after sorting the data points. Otherwise, \(u_{ij}\) = \(\overline{u_{ij}}\). Finally, the new center values are calculated using Eq. 10:

$$\begin{aligned} C(k) = \frac{V_\mathrm{Left}+V_\mathrm{Right}}{2} \end{aligned}$$
(10)
$$\begin{aligned} u_{ij} = \frac{\overline{u_{ij}}+\underline{u_{ij}}}{2} \end{aligned}$$
(11)

3.1.4 Modified interval type-2 fuzzy C-mean

Pixels feathers in FCM and IT2FCM algorithms are supposed to be independent of each other. However, in medical images, pixels feathers are dependent, especially the neighboring pixels. Qiu et al. in [36] modified the IT2FCM algorithm in [25] by adding a feature of dependency for neighboring pixels, which produced better accuracy. The authors used a local spatial interaction between adjacent pixels. They used 3 \(\times \) 3 local window to simplify the computation of dependency and named their algorithm modified interval type-2 fuzzy C-mean (MIT2FCM).

In MIT2FCM, the upper and lower memberships were used similar to [25]. The new step that was introduced is a local spatial of adjacent pixels. It is calculated using Eq. 12, where \(u_{ij}=\frac{\overline{u_{ij}}+\underline{u_{ij}}}{2}\). Equations 13 and 14 are used in MIT2FCM as shown in Algorithm 4.

$$\begin{aligned} u_{i \mathrm{spatial}}= & {} \left( \sum _{l=1}^{N_k}\frac{1}{d^2(d_l,d_k)^\frac{1}{m-1}}\right) ^{-1}\times \left( \sum _{a=1}^{N}\frac{u_{ij}}{d^2(x_a,x_k)^\frac{1}{m-1}}\right) \end{aligned}$$
(12)
$$\begin{aligned} \overline{u_{ij}}= & {} \left\{ \begin{array}{ll} \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| d^*_{ij} \right\| }{\left\| d^*_{ik} \right\| } \right) ^{\frac{2}{m_1-1}}} &{} ,\quad {\mathrm{if} \sum _{k=1}^{c}\left( \frac{\left\| d^*_{ij} \right\| }{\left\| d^*_{ik} \right\| } \right) \le C}\\ \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| d^*_{ij} \right\| }{\left\| d^*_{ik} \right\| } \right) ^{\frac{2}{m_2-1}}} &{} \quad \text {Otherwise} \end{array} \right. \end{aligned}$$
(13)
$$\begin{aligned} \underline{u_{ij}}= & {} \left\{ \begin{array}{l l} \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| d^*_{ij} \right\| }{\left\| d^*_{ik} \right\| } \right) ^{\frac{2}{m_1-1}}} &{} ,\quad {\mathrm{if} \sum _{k=1}^{c}\left( \frac{\left\| d^*_{ij} \right\| }{\left\| d^*_{ik} \right\| } \right) \ge C}\\ \frac{1}{\sum _{k=1}^{c}\left( \frac{\left\| d^*_{ij} \right\| }{\left\| d^*_{ik} \right\| } \right) ^{\frac{2}{m_2-1}}} &{} \quad \text {Otherwise} \end{array} \right. \end{aligned}$$
(14)

where \(d^*_{ik} = \frac{d^2_{ik}}{u_{i \mathrm{Spatial}}}\).

figure d
figure e

3.2 Parallel version of the FCM algorithms

This section presents the hybrid CPU–GPU implementation of the four segmentation algorithms: FCM, T2FCM, IT2FCM, and MIT2FCM. The hybrid strategy of using the CPU and GPU together provides a powerful tool for programmers to achieve efficient computation [35]. In the four cases, our parallel implementations improve the execution time over the sequential implementations.

The membership function is implemented in a parallel fashion to execute on the GPU card. Also, the Do Segmentation Function was parallelized to run on the GPU side as shown in Algorithm 6. This function is used to update and segment the image pixels based on the strength of the membership value compared to the center of a cluster.

figure f
figure g

The calculations of the centroids and the objective functions are executed on the CPU side, because they need to calculate the summation as shown in Eqs. 1 and 3. The summation operation needs to run a number of iterations equal to the number of points multiplied by the number of clusters. If we parallelize this using threads, synchronization of all threads is needed. However, thread synchronization produces high delay on the GPU. Hence, performing this operation on the CPU is faster. Also, transferring data back and forth between the GPU and the CPU memories is avoided. In our previous work in [39], a parallel version without this CPU optimization achieved a speedup of 6\(\times \). However, in this work, we improved it by 9\(\times \) after using optimization techniques as follows. In this work, we performed some mathematical operation in a different way of implementation. For example, the square function can be represented as a multiplication operation. Another optimization is to store the objective value using another variable. When the the operation needs the new value, we used the XOR operation to swap between the two values of the objective function. XOR operation increased the improvement by 1\(\times \). Multiplication operation increased the performance by 2\(\times \). Hence, the new version of parallel FCM is 9\(\times \) faster than the sequential version. Those two optimization techniques were used for the four parallel versions presented in this paper.

As shown in Eq. 2, the membership function has a summation operation that iterates same as the number of clusters. Also, each pixel needs to calculate the Euclidean distance between itself and clusters’ centroids. This can be done separately. Hence, we improved the T2FCM algorithm performance by calculating the membership U in Eq. 2, the new membership A in Eq. 4, and do segmentation function on the GPU side. The new parallel version of T2FCM is shown in Algorithm 8.

figure h

For the hybrid CPU–GPU implementation of the IT2FCM algorithm, all functions that calculate the membership values were converted to a parallel implementation. This algorithm needs to perform data sorting at the beginning of the code. We used a build-in sorting algorithm on the CPU side for two reasons: first, it runs only once at the beginning of the code. Second, the sorting is performed on a data structure type that contains many attributes such as: X-axis, Y-axis, Red, Green, Blue, and alpha values for each pixel point. Hence, if we are to create memory allocation on the GPU memory, it would have required to transfer more than one memory allocation which produced a high delay in time. For the aforementioned two reasons, the sorting algorithm is performed on the CPU side.

Also, at each iteration of the IT2FCM and MIT2FCM algorithms, a search for the index value of a cluster’s center is needed. Thus, we used binary search algorithm which is fast for sorted data.

figure i

The MIT2FCM algorithm is similar to the IT2FCM algorithm except for the filtration operation. Hence, we used the same hybrid CPU–GPU implementation of IT2FCM with a parallel modified membership function of IT2FCM. This step is a filtering operation and the GPU can execute it faster than the CPU because it is parallelizable. The mean filter is used which updates the value of a pixel using the average of neighborhood pixels. Table 1 shows the list of functions of MIT2FCM and their locations whether on the GPU side or the CPU side.

Table 1 Hybrid CPU–GPU functions

In calculating the objective value for FCM, T2FCM, IT2FCM, and MIT2FCM, we compute the exponential function on the CPU side by performing multiplication if the power is two. Also, for IT2FCM and MIT2FCM, we calculate the value \(\sum _{k=1}^{C}\frac{d_{ij}}{d_{ik}}\) in Eq. 5 only one time and then saved it in a memory array for later use when calculating the upper and lower values in: \(\sum _{k=1}^{C}{\bigg (\frac{d_{ij}}{d_{ik}}}\bigg )^{\frac{2}{m_1-1}}\). This optimization saved a large amount of time on the CPU side.

Memory optimization is another challenge that we faced in our parallel implementation. As mentioned in [20], the usage of memory is critical in GPU programming and can result in slow execution time. Hence, we calculate the memory size to be used using Eq. 15. This equation optimizes memory size usage and speeds up GPU computations. Also, large number of threads can slow down GPU computations when many threads try to read and/or write to the memory [19]. Also, since many models of NVIDIA GPU hardwares cannot support 1024 threads, we used a fixed number of threads of 256. We have used different numbers of threads as shown in Table 2. 256 threads produced the best GPU utilization as shown in the table.

(15)
Table 2 Percentage of GPU utilization

4 Results and discussion

This section presents the execution time comparison between the hybrid CPU–GPU implementation and the sequential implementation. We calculate the speedup using Eq. 16.

(16)

1. Dataset

We have used one MR image and one mammography image for both FCM and T2FCM algorithms as in [32] and [1]. The MR is a brain image with size of 512 \(\times \) 512. The Mammography images is also of size 512 \(\times \) 512 and both images are gray-scale type.

2. Results

The GPU card that was used in this experiment is NVIDIA GT 740M with 2GB memory. The CPU is Intel core I7 with 6GB RAM. The platform is Windows 8.1 with C# and CUDA libraries installed. An integration library with Visual Studio 2013 was used to run C# on GPU [2]. Our implementations did not reduce the image segmentation accuracy in all cases. However, the speedup is improved. The number of clusters that was used is five.

(a) Results of MR and mammography images dataset:

Table 3 shows the experiments of sequential FCM and our hybrid FCM. We ran our experiments for ten times and the average CPU time is 167.4988 s (sequential version). However, the average execution time for our hybrid CPU–GPU is 16.8915 s for MR images. The results of the two FCM versions are the same with respect to the segmented image and accuracy. Table 4 shows the results for T2FCM algorithm versions using the same image that was used in the FCM experiment. The average CPU time is 134.5872 s and the average for hybrid CPU–GPU time is 21.993 s.

Table 3 10 runs of FCM versions for the MR image
Table 4 10 runs of T2FCM versions for the MR image
Table 5 10 runs of IT2FCM versions for the MR image

As shown in Tables 5 and 6, IT2FCM algorithm, using MR images, produced an average CPU time of 428.1371 s for the sequential version and 51.4126 s in the case of the hybrid CPU–GPU. The MIT2FCM results show 316.8554 s for the CPU average execution time and 24.0313 s for the hybrid CPU–GPU with a speedup of more than 13\(\times \). This is because of the filtering operation in MIT2FCM which improves the classification operation [46]. Figure 1 summarizes these results.

For the mammography images case, Tables 7, 8, 9, and 10 show the results for the four algorithms. The average FCM CPU time and average hybrid CPU–GPU time are 172.8874 and 20.3478 s, respectively. The average T2FCM CPU time and average hybrid CPU–GPU time are 166.3168 and 24.5534 s, respectively. The average IT2FCM CPU time and average hybrid CPU–GPU time are 461.6416 and 49.6946 s, respectively. The average MIT2FCM CPU time and average hybrid CPU–GPU time are 2427.1017 and 117.778 s, respectively with a speedup of more than 20\(\times \). Figure 2 summarizes these results.

Table 6 10 runs of MIT2FCM versions for the MR image
Fig. 1
figure 1

Execution time of CPU and hybrid CPU–GPU for MR

Table 7 10 runs of FCM for the mammography image
Table 8 10 runs of T2FCM for the mammography image
Table 9 10 runs of IT2FCM for the mammography image
Table 10 10 runs of MIT2FCM for the mammography image

(b) Speedup discussion

From the execution time shown previously, we can see that the speedup improvement for the parallel FCM is almost 10\(\times \) for the MR image and more than 8\(\times \) for the mammogram image compared to the sequential algorithm. We have used optimization techniques to speed up the execution time by reducing the data transfer rate from and to the GPU memory. For example, we transfer cluster index values to the GPU one way, and we copy it when the code terminates to visualize the results in our application GUI. Also, T2FCM speedup is more than 6\(\times \) by moving special functions to the GPU and keeping others on the CPU side where they execute faster. The IT2FCM algorithm achieved a speedup of more than 8\(\times \) for the MR image and more than 9\(\times \) in the case of mammogram image. Finally, MIT2FCM achieved more than 13\(\times \) for the MR image and more than 20\(\times \) for the mammogram image. Figure 3 summarizes these results.

The filter step increases the image segmentation accuracy. However, it incurs a high execution-time overhead of about 53.5 s. Since the filtering operation can be parallelized perfectly, we can move it to the GPU side which results in an overhead of only 0.38 s for the hybrid CPU–GPU versions.

Fig. 2
figure 2

Execution time of CPU and hybrid CPU–GPU for mamogram

Fig. 3
figure 3

Speedup gain of FCM algorithms

Also, IT2FCM algorithm achieved higher speedup than the T2FCM because it uses two membership matrices (upper and lower memberships). Using one membership matrix in T2FCM results in more dependency for the data points. Consequently, calculating the memberships on the GPU side is better than the CPU side because of the low dependency between the elements of the matrices. The GPU can run the upper and lower matrices in parallel, so that each matrix is calculated separately.

In the last segmentation algorithm which is the MIT2FCM, the authors of the algorithm improved the accuracy of the segmentation process by adding a filtering step. However, this step increases the execution time. With the parallel implementation of the filtering operation on the GPU, we solved its problem of having high execution-time overhead. Our aforementioned discoveries of optimization techniques help the image processing community, especially if these algorithms are to be used in real-time processing where efficient computation is critical.

Table 11 Examples of the accuracy for each FCM algorithm
Table 12 5 clusters of MR images for FCM algorithms

Table 11 shows sample examples of the images that were used in our experiments and the accuracy of each segmentation algorithm. Also, Tables 12 and 13 show sample clusters that were used in the experiments.

Table 13 5 clusters of mammography images for FCM algorithms

5 Conclusion

Image segmentation process is used to extract objects from images. Because of its importance, many methods were proposed in the literature to improve the segmentation process. In this paper, we studied four important clustering algorithms which are: FCM, T2FCM, IT2FCM, and MIT2FCM. Parallel implementations of the algorithms were developed and investigated to improve their execution time without penalizing the accuracy. GPU hardware was used to execute the parallel implementations and compared with sequential CPU-only implementations. We used MR and mammography images in conducting our experiments. Results show that speedup of 6\(\times \) to 20\(\times \) can be achieved with parallel hybrid CPU–GPU implementations. Also, we have discussed optimization steps in the algorithms to enhance their execution time. This is critical for real-time processing where efficient calculations are needed.