1 Introduction

The segmentation of images is one of the most important stages in the digital image analysis. Segmentation is a process in which the image is divided into regions that in general have irregular shape and that are separated by the contours that bound those regions. Conceptually, the objective of the segmentation process is to separate the different shapes, structures or objects that are located inside the image [1]. Some applications of segmentation are face recognition, traffic control systems, fingerprint recognition and medical image analysis.

The applications of medical image segmentation include the detection of tumors, the measurement of tumor volume and its response to therapy over time and surgery simulations, to name a few. In some cases, segmentation dictates the outcome of the entire analysis, since measurements and other processing steps are based on the segmented regions. Most of the segmentation methods are based on the intensity of image pixels, though neural networks and model-based algorithms are used as segmentation tools too [1, 2].

Deformable models, also known as active contours, are the most commonly model-based technique [35] used in medical image processing. Their spread use stems from their ability to incorporate a priori information about the regions of interest within the image. Furthermore, deformable models support interaction mechanisms that enable medical personnel to modify their behavior when necessary [6]. These characteristics allow the use of deformable models for different image processing tasks such as segmentation, tracking and matching [7, 8]. Deformable models can be applied to different medical images sources, e.g. magnetic resonance (MRI), computed tomography (CT) or radiography.

Although the computational load of their algorithm is commonly considered as a drawback to the use of deformable models in some applications, with the advent and widespread use of high-performance computing (HPC) platforms like graphics processing units (GPUs), parallel real-time applications for deformable models are nowadays attainable.

There are many works that involve the implementation of deformable models on GPU cards. In [9], one of the first OpenGL attempts to parallelize on a GPU the gradient vector flow (GVF) field computing is presented. Zheng and Zhang [10] made a GVF deformable model implementation on a GPU that uses texture arrays for the computation of the GVF field and for the iterative solution of the evolution equations. Perrot et al. [11] presented a large image segmentation algorithm based on a statistical deformable model implemented on a GPU. In Li et al. [12], implemented a variation of the Geodesic deformable model on a GPU. Smistad et al. [13] made a performance analysis for the GPU memory spaces applied to the GVF field computation. The segmentation on large images using a tile approach with the GVF active contour is presented in [14]. Other GPU-based image processing works are Češnovar et al. [15] where the authors used semantic classification on large datasets of aerial-images and Valero et al. [16] where the authors presented a Markov Random Fields (MRF) classification of MRI images. Despite its wide use, some research groups and vendors have made attempts to define APIs and languages that simplify the GPU programming in an effort to make it accessible to a large audience, as can be seen in [17]. Besides these GPU-based implementations, some other HPC architectures had been used in the implementation of deformable models, e.g. Lenkiewicz et al. [18] presented a deformable model segmentation algorithm designed for computer clusters or multi-core architectures.

In general, an author presents tables with execution times as proof of the success in the GPU implementations. Nevertheless, although the timing values are the most important issue to consider in parallel implementation, a set of other computational metrics will provide a better evaluation of a given implementation. In [19] the authors defined four important criteria for the evaluation of GPU implementations: performance, programming comfort, accessibility, and cost-effectiveness. Pallipuram et al. [20] made a comparative study of GPU architectures and programming models to determine which platform is the most suitable for a given application. In order to improve the efficiency of GPU applications that involve a large quantity of data transfers between the CPU and GPU, in [21] a convolution for audio applications with the overlapping of data transfer and computational work is presented.

However, some of the cited papers only focus on the GVF field computing and not in a segmentation process. Other works use image processing techniques outside the deformable models theory and other researches only focus on the GPU implementation evaluation. Although a segmentation technique comparison is beyond the scope of this paper, in this investigation we focus on the efficient computation of the GVF field and also in the segmentation results applying deformable models. To address the performance analysis we make a comparison between the parallelization of the GVF field computing with OpenMP and CUDA; to improve the efficiency of the segmentation process of the GVF active contour we proposed a snaxel reallocation approach based on a dynamic mesh adaptation.

Therefore, the main contributions of this work are (1) the objective analysis of the GVF field computing on CPU and GPU and (2) the improvement of the segmentation process with the proposed dynamic snaxel reallocation. Besides the OpenMP and CUDA comparison, we also present an analysis of the GVF field computing using textures and global memory. It is important to clarify that though the GVF snake could be formulated as an interactive segmentation tool, we did not consider it that way because we focus on the acceleration of the algorithm, so we designed it as an autonomous segmentation tool.

In Sect. 2 an introduction to the theory of deformable models is presented. The original active contour of Kass et al. [24] as well as the GVF active contour [22] is described. The proposed snaxel reallocation approach is described in this section too. Section 3 presents a brief introduction to the CUDA programming model and the GPU general architecture. Section 4 details the parallel implementations that we proposed and the implementation of our snaxel reallocation method. Section 5 presents the experimental results of the proposed reallocation technique, the medical images segmentation and the metrics that we defined to evaluate our implementations; finally, these results are discussed in Sect. 6.

2 Deformable models

Deformable models are differential equations that determine the shape and movement of curves (in the case of 2-dimensional signals) or surfaces (in the case of 3-dimensional signals) built of an abstract elastic material. The physical interpretation of a deformable model is that of an elastic body that responds to the forces applied on it [6, 23]. In a segmentation process, the interpretation of a deformable model is that of an elastic curve that is introduced into an image plane. The curve deforms from its initial configuration due to external and internal forces applied to it, until its shape resembles the boundary of a region of interest within the image. In [24], due to its behavior, the most common deformable model is called snake.

2.1 Traditional active contour

The traditional active contour or snake is defined in [6] as a contour that is embedded in the image plane \(I_M(x,y) \in \mathbf {R}^{2}\). The position of the active contour is \(\mathbf {v}(s)=(x(s),y(s))^\mathrm{T}\), where \(x\) and \(y\) are coordinate functions and \(s \in [0,1]\) is the parametric domain. Snakes can be formulated as open or closed contours. The shape of the active contour within the image \(I_M(x,y)\) is determined by the energy functional [6, 24]

$$\begin{aligned} \mathcal {\varepsilon }(\mathbf {v})=\mathcal {S}(\mathbf {v})+\mathcal {P}(\mathbf {v}), \end{aligned}$$
(1)

where \(\mathcal {S}(\mathbf {v})\) is the internal deformation energy and \(\mathcal {P}(\mathbf {v}(s))\) represents the external deformation energy. The internal deformation energy of the active contour is defined in [6] as

$$\begin{aligned} \mathcal {S}(\mathbf {v})=\int _{0}^{1} ( \alpha (s) \vert \mathbf {v}_\mathrm{s} \vert ^{2}+\beta (s) \vert \mathbf {v}_\mathrm{ss} \vert ^{2} )\, \mathrm{d}s. \end{aligned}$$
(2)

The term \(\mathbf {v}_\mathrm{s}\) denotes the first derivative of \(\mathbf {v}\) with respect to \(s\) and represents the elasticity of the active contour. The term \(\mathbf {v}_\mathrm{ss}\) represents its rigidity and denotes the second derivative of \(\mathbf {v}\) with respect to \(s\). Two non-negative functions define the behavior of the physical energies that are simulated on the active contour: \(\alpha (s)\) controls the elastic tension of the contour and \(\beta (s)\) controls its rigidity [6]. The tension energy models the behavior of a rubber band and the rigidity energy models the behavior of a flexible bar.

Minimizing \(\alpha (s) \vert \mathbf {v}_\mathrm{s} \vert ^{2}\) causes the contraction of the active contour down to a point; the minimization of \(\beta (s) \vert \mathbf {v}_\mathrm{ss} \vert ^{2}\) causes the active contour to take a circular shape in the case of a closed contour or a straight line in the case of an open contour [25]. This implies that, given an initial size of the active contour and assuming that the external force is null, the snake will never grow; it will only tend to form a circle, in the case of a closed contour, while shrink towards its center.

The second term of Eq. (1) is the external deformation energy of the active contour that couples the snake to the image. It is defined as

$$\begin{aligned} \mathcal {P}(\mathbf {v})=\int _{0}^{1}P(\mathbf {v}(s))\,\mathrm{d}s, \end{aligned}$$
(3)

where \(P(x,y)\) is a scalar function defined on the image plane [6]. \(P(x,y)\) is called external energy because is obtained from sources outside the contour and is commonly calculated as

$$\begin{aligned} P(x,y)=-|\nabla [G_{\sigma }(x,y)*I_M(x,y)]|, \end{aligned}$$
(4)

where \(I_M(x,y)\) is the image and \(G_{\sigma }(x,y)\) is the Gaussian filter

$$\begin{aligned} G_{\sigma }(x,y)=\frac{1}{2 \pi \sigma ^2}\exp {\left( -\frac{|x|^2+|y|^2}{2\sigma ^2}\right) }. \end{aligned}$$
(5)

The active contour \(\mathbf {v}(s)\) that minimizes the energy functional of Eq. (1) satisfies the Euler–Lagrange equation

$$\begin{aligned} -\alpha \mathbf {v}_\mathrm{ss}(s)+\beta \mathbf {v}_\mathrm{ssss}(s)+\nabla P(v(s,t))=0. \end{aligned}$$
(6)

Equation (6) expresses the balance of the internal and external forces of the snake in equilibrium state, i.e. when the active contour is steady. Each term of Eq. (6) corresponds to a force produced by the respective energy. The first two terms represent the internal stretching and bending forces respectively, while the third term represent the external force that couples the snake to the image [6]. To completely specify the mathematical model of Eq. (6), it is assumed that the boundary conditions are known (see e.g. the conditions in [3, 26, 27])

$$\begin{aligned} \mathbf {v}(0), \mathbf {v}(1), \mathbf {v}'(0), \mathbf {v}'(1). \end{aligned}$$
(7)

In the discrete domain, the contour \(\mathbf {v}\) is represented by a set of points called snaxels. The discretization of Eq. (6) produces the iterative equations that calculate the Cartesian coordinates of the snaxels. These discrete iterative equations are

$$\begin{aligned} x^{t+1}&= \left( \frac{I}{\Delta t}+A\right) ^{-1}\left( \kappa f_{{x}^{t}}+\frac{x^{t}}{\Delta t}\right) \!, \end{aligned}$$
(8)
$$\begin{aligned} y^{t+1}&= \left( \frac{I}{\Delta t}+A\right) ^{-1}\left( \kappa f_{{y}^{t}}+\frac{y^{t}}{\Delta t}\right) \!, \end{aligned}$$
(9)

where \(I\) is the identity matrix, \(A\) is the sparse differentiation matrix that represents \(\mathbf {v}_\mathrm{ss}(s)\) and \(\mathbf {v}_\mathrm{ssss}(s), \Delta t\) is the time step of the iterations and \(\kappa \) is a constant used to control the external force influence [22]. The addition of the matrix \(\frac{I}{\Delta t}\) to the matrix \(A\) makes it non-singular so it can be inverted. The matrices \(I\) and \(A\) are of size \(N^2\) with \(N\) the number of points of the initial active contour. The terms \(f_{x}=-\frac{\partial E_I}{\partial x_{i}}\) and \(f_{y}=-\frac{\partial E_I}{\partial y_{i}}\) of Eqs. (8) and (9) respectively depend on the calculation of the external energy of Eq. (4) [6].

The use of a gradient-based external energy function like Eq. (4) produces large energy values over and in the close neighbourhood of the contours of the image. This characteristic, together with the behavior of the internal forces, gives the traditional snake two main drawbacks: (1) the need for an initialization process that places the snake close to the region of interest, and in the case of a closed contour outside the region of interest so that eventually the snake will find it and (2) the incapacity of the snake to progress into boundary concavities [5]. The reason behind these limitations is the small capture range produced by Eq. (4). The capture range is defined as the area of influence of the contour in which the snake is able to find a local minimum; its extension is closely related to the external energy.

To address the problems of small capture range and convergence in boundary concavities, the GVF snake proposed in [22] is one of the most used methods [57]. This is due to its effective improvement in capture range extension and its ability to converge inside boundary concavities. In fact, some investigations about deformable models are related to modifications and improvements to the original GVF snake proposal [6, 28].

2.2 Gradient vector flow active contour

The GVF active contour represents an improvement to the snake defined by Eq. (6). The improvement is a vector field \(\mathbf {w}(x,y)=[u(x,y),v(x,y)]\) that represents the external force. Therefore, the equation that models the GVF active contour is [22]

$$\begin{aligned} -\alpha \mathbf {v}_\mathrm{ss}(s)+\beta \mathbf {v}_\mathrm{ssss}(s)+\mathbf {w}=0. \end{aligned}$$
(10)

Equation (10) is solved numerically in the same way as Eq. (6). In the case of the GVF model, \(\mathbf {w}\) is a vector field, moreover, it is the gradient vector flow field; this is the main difference with respect to the traditional active contour model of Eq. (6). The GVF field solves the problems of small capture range and convergence in boundary concavities associated to the active contour of Kass. It is a dense vector field derived from an image by the minimization of an energy functional. The vector nature of \(\mathbf {w}\) is the improvement of the GVF model over the original active contour model that makes the former more efficient [22].

Unlike the traditional snake, a GVF snake could be initialized inside, across or outside the region of interest. This is because the GVF field produces a wide capture range which is the product of an isotropic diffusion process that does not blur the edges within the image, as a Gaussian filter would do [5, 22].

The external force used in the formulation of the traditional snake is an irrotational field and that is why the snake is unable to converge inside boundary concavities. The GVF field incorporates an irrotational component and also a curl component so it outperforms the traditional snake [22].

The GVF field formulation focuses on keeping the gradient properties nearby the edges within the image and on extending the scope of the normal vectors of the edges beyond their nearby regions through a diffusion process. This process produces an external force with a significative magnitude over the whole image plane, not only in the edges neighborhood [3]. Due to the competition among the vectors of edges that involves a diffusion process on an image, some of the vectors of the GVF field, according to the geometry, point inward the concavities. This particular feature solves the problems of the capture range and the lack of convergence inside the concavities of the snake of Kass [22].

The first step to calculate the vector field \(\mathbf {w}\) is to obtain the contour map \(f(x,y)\) derived from the image \(I_M(x,y)\). The contour map is a potential function defined over the image whose value is larger near the image edges. In this work, the Canny edge detector [29] is used to generate the contour map.

The GVF field is defined as the vector field \(\mathbf {w}(x,y)=[u(x,y),v(x,y)]\) that minimizes the energy functional

$$\begin{aligned} \varepsilon =\int \int \mu (u_{x}^{2}+u_{y}^{2}+v_{x}^{2}+v_{y}^{2})+|\nabla f|^{2}|v-\nabla f|^{2}\,\mathrm{d}x\mathrm{d}y \end{aligned}$$
(11)

where \(f(x,y)\) is the contour map of the image; \(\nabla \) is the gradient operator; \(\mu \) is a positive parameter of regularization; \(u(x,y)\) and \(v(x,y)\) are functions that represent the GVF field components; and \(u_{x}\), \(u_{y}\), \(v_{x}\) and \(v_{y}\) represent the partial derivatives of \(u(x,y)\) and \(v(x,y)\) with respect to \(x\) and \(y\), respectively [5, 22]. The energy functional of Eq. (11) keeps \(\mathbf {w} \approx \nabla f\) when \(\nabla f\) is large; otherwise, it produces a slowly varying field in the homogeneous regions [22].

Applying the calculus of variations to Eq. (11), the equation that solves \(u(x,y)\) (Eq. 12) and the equation that solves \(v(x,y)\) (Eq. 13) are obtained [1, 22]

$$\begin{aligned} \mu \nabla ^{2}u-(u-f_{x})(f_{x}^{2}+fy^{2})&= 0, \end{aligned}$$
(12)
$$\begin{aligned} \mu \nabla ^{2}v-(v-f_{y})(f_{x}^{2}+f_{y}^{2})&= 0, \end{aligned}$$
(13)

where \(\nabla ^{2}\) is the Laplacian operator. The solution of Eqs. (12) and (13) generate the vector field \(\mathbf {w}\). To implement the GVF snake model, the vector field \(\mathbf {w}\) is computed first. Next, Eqs. (8) and (9) are used to minimize Eq. (10).

From Eqs. (12) and (13), the components \(u(x,y)\) and \(v(x,y)\) of the GVF field are

$$\begin{aligned} u_{t}(x,y,t)&= \mu \nabla ^{2}u(x,y,t)-[u(x,y,t)-f_{x}(x,y)][f_{x}(x,y)^{2}+f_{y}(x,y)^{2}],\nonumber \\ \end{aligned}$$
(14)
$$\begin{aligned} v_{t}(x,y,t)&= \mu \nabla ^{2}v(x,y,t)-[v(x,y,t)-f_{y}(x,y)][f_{x}(x,y)^{2}+f_{y}(x,y)^{2}], \nonumber \\ \end{aligned}$$
(15)

where the Laplacian is approximated in the discrete domain using centered finite-difference equations

$$\begin{aligned} \nabla ^{2}u&= u_{i-1,j}+u_{i+1,j}+u_{i,j-1}+u_{i,j+1}-4u_{i,j}, \end{aligned}$$
(16)
$$\begin{aligned} \nabla ^{2}v&= v_{i-1,j}+v_{i+1,j}+v_{i,j-1}+v_{i,j+1}-4v_{i,j}. \end{aligned}$$
(17)

The iterative condition of Eqs. (14) and (15) is denoted by \(t\). The capture range of the GVF field is determined by these iterations, i.e. the produced field is proportional to the number of iterations. This way of controlling the capture range enables the GVF method to eliminate the location dependency for the initial active contour. If a certain shape of interest is not included or is partially included in the initial snake, the GVF field will allow deformations of the initial active contour to include the shape. If the shape of interest is located completely out of the initial active contour, it is even possible that the complete snake moves towards the shape to capture it, provided there are not additional shapes in the neighbourhood.

Because the mathematical formulation of the GVF snake is similar to the formulation of the traditional snake, the GVF snake inherits some limitations, e.g. the sensitivity of the internal energy parameters [6, 30]. However, the main drawback of GVF snake is the large number of arithmetic operations involved, which directly affects its computation time [28].

As the evolution of Eqs. (14) and (15) depends only on current information, the numerical solution of these equations produce explicit discrete iterative equations. At each iteration, there is no data dependency between the elements of the component \(u(x,y)\); this is also true for the \(v(x,y)\) elements. For this reason, the GVF field can be computed in parallel on a high-performance computing platform and this way reduce the computation time of GVF snake algorithm.

2.3 Dynamic snaxel reallocation

One problem associated with the snake formulation is the inadequate grouping of snaxels of a closed contour. At certain sections of a boundary, the snaxels tend to group together into clusters whereas in some other regions they tend to separate from each other. This inadequate distribution is illustrated in Fig. 1.

Fig. 1
figure 1

Typical distribution of snaxels over a border

As shown in Fig. 1, the white snaxels are relatively apart from each other whereas the black snaxels are clustered. This effect is not contemplated in the energy functional of the snake although it can derive into unwanted behavior of the active contour, leading to an incorrect final segmentation result. Even when the tension term \(\alpha \) of Eq. (6) can be viewed as the internal force that somehow controls the spacing between snaxels, the overall effect of the total forces that affect each snaxel leads to the spacing problems described.

Consider the contour of Fig. 1 and the case where the tension force is set to a relatively higher value than the other forces. Under this condition, large spaces between the snaxels are not expected but on the other hand, the snake will not be able to converge inside the concave region. This is because the internal tensional force between the snaxels is greater than the external force of the boundary of interest and therefore, the snaxels placed over the contour in the proximity of the concave region will not allow the snaxels placed near the concave region enter it. On the contrary, if \(\alpha \) is set to a relatively lower value than the other forces, the snake will be able to converge inside concave regions, but the spacing problems shown in Fig. 1 will be present. In order to solve this problem, a mesh adaptation approach for the snaxels is necessary.

In general, a mesh adaptation can be static or dynamic. In the static method, also known as local mesh refinement, some nodes are added wherein the solution has high gradient values and are removed from regions where the solution is almost constant. In the dynamic approach, the number of nodes is fixed and they are only re-allocated based on the characteristics of the solution.

Fig. 2
figure 2

a Parametric closed contour and b normalized arc length of a contour

Consider the parametric contour \(\mathbf {v}\) of Fig. 2a. As we are considering only closed contours, \(\mathbf {v}(s_1)=\mathbf {v}(s_9)\). In Fig. 2a it is clear that the points over the plane are not equally spaced and this condition is reflected over the \(s\) domain. It is also clear that each point of \(\mathbf {v}\) is defined as \(\mathbf {v}(s_i)=\mathbf {v}_i(x_i,y_i)\) and thus, any movement of the \(s_i\) points will be reflected in the configuration of the contour \(\mathbf {v}\). Considering these conditions, a mesh adaptation technique can be applied over the \(s\) domain to equidistribute the \(s_i\) points and hence, the points that form the contour \(\mathbf {v}\). However, the problem with this approach is the transformation of the \(s_i\) values into the corresponding \((x_i,y_i)\) coordinates, i.e. a \(\mathbb {R} \rightarrow \mathbb {R}^{2}\) transformation.

To solve the mesh adaptation issue the normalized arc length value between the points of \(\mathbf {v}\) contour is used; the slope value between the ordered pairs \((x,y)\) of consecutive \(\mathbf {v}\) points is used to perform the \(\mathbb {R} \rightarrow \mathbb {R}^{2}\) mapping. According to the \(s\) domain sequence, the arc length and the slopes values of the \(\mathbf {v}\) points are computed in a counterclockwise sense on the plane.

The arc length is calculated as \(l_{i}=\sqrt{\Delta x_{i}^{2}+\Delta y_{i}^{2}}\) where \(\Delta x_{i}=x_{i+1}-x_{i}\) and \(\Delta y_{i}=y_{i+1}-y_{i}\). The normalized arc length is defined as \(l_{i}^{\prime }=\frac{l_{i}}{L_{T}}\) and is used to determine the position of each \(s_i\) point over the \(s\) domain. \(L_\mathrm{T}\) is defined as the total length of the contour. Figure 2b shows the normalized arc length for the contour \(\mathbf {v}\) illustrated in Fig. 2a.

To perform the mesh refinement over the \(s\) domain a nodal reallocation approach is used. The arc length is utilized as the mesh adaptation function that drives the mesh adaption process, with larger values denoting a region for refinement and smaller values denoting a region for mesh coarsening. The arc length function is passed through the equation

$$\begin{aligned} l_{i}=\frac{1}{4}(l_{i-1}+2l_{i}+l_{i+1}) \end{aligned}$$
(18)

to promote smoothness of the mesh adaptation. Once the adaptation function has been smoothed, it is used to define a spring constant \(k_{i+\frac{1}{2}}=0.5(l_{i}+l_{i+1})\) that connects the \(s_i\) points. The relation of the \(k\) values that affects each \(s_i\) point is \(k_{i-\frac{1}{2}}(s_{i}-s_{i-1})=k_{i+\frac{1}{2}}(s_{i+1}+s_{i})\). Points with higher arc length values will have higher spring constants and thus promote refinement in that region. The new positions are then solved for according to

$$\begin{aligned} s_{i}^{t+1}=\frac{k_{i-\frac{1}{2}}s_{i-1}^{t}+k_{i+\frac{1}{2}}s_{i+1}^{t}}{ k_{i-\frac{1}{2}}+k_{i+\frac{1}{2}}}. \end{aligned}$$
(19)

It has to be noted that the proposed nodal reallocation allows using the same differentiation matrix of the snake throughout the process of deformation because the number of snaxels remains constant. Also, this dynamic mesh adaptation approach makes relatively unnecessary the use of a large number of snaxels due to the better distribution of them over the boundary of a region of interest. Therefore, it is not necessary to use a large number of snaxels and as a consequence, the linear systems to be solved are relatively small and do not represent a high computational load.

Once the mesh adaptation is completed, the \(\mathbb {R} \rightarrow \mathbb {R}^{2}\) mapping is computed. The slope values between each point of the contour \(\mathbf {v}\) and the recently adapted mesh are used to perform this transformation. Refer to Fig. 3 where the mesh adaptation process is illustrated. The superindices \(t\) and \(t+1\) are used to differentiate the old and new positions of the \(s_i\) points. Note that the \(s_{i}^{t}\) points correspond to the contour \(\mathbf {v}\) of Fig. 2a.

Fig. 3
figure 3

\(s_{i}^{t}\) and \(s_{i}^{t+1}\) after mesh adaptation

The first step is to determine the interval where each \(s_{i}^{t+1}\) point is located, i.e. to find out between which \(s_{i}^{t}\) points is placed. After that, the corresponding \((x, y)\) ordered pairs of these adjacent \(s_{i}^{t}\) points are determined. The abscissa and ordinate values can be represented as functions of \(s_i\) with their respective slope \(m_i\). These slopes are used to compute the \((x, y)\) values corresponding to the \(s_{i}^{t+1}\) points. Refer to Fig. 4 where the approximate value of the ordered pair \((x, y)\) for each of the \(s_{i}^{t}\) points of Fig. 2a is presented as function of \(s\). The slopes formed by the \((x,y)\) values and the \(s_{i}^{t+1}\) points are shown too.

Fig. 4
figure 4

Abscissa and ordinate values as functions of \(s\)

The abscissa and ordinate values corresponding to each \(s_{i}^{t+1}\) point are found by

$$\begin{aligned} x-x_{i} =m_\mathrm{abs}(s^{t+1}-s_{i}^{t}), \quad y-y_{i} = m_{\mathrm{ord}}(s^{t+1}-s_{i}^{t}) \end{aligned}$$
(20)

where

$$\begin{aligned} m_\mathrm{abs} =\frac{x(s_{i+1}^{t})-x(s_{i}^{t})}{s_{i+1}^{t}-s_{i}^{t}}, \quad m_{\mathrm{ord}} = \frac{y(s_{i+1}^{t})-y(s_{i}^{t})}{s_{i+1}^{t}-s_{i}^{t}} \end{aligned}$$
(21)

and the sub-indexes \(i\) and \(i+1\) correspond to the left and right adjacent \(s_{i}^{t}\) points respectively.

The result of the \(\mathbb {R} \rightarrow \mathbb {R}^{2}\) mapping described is shown in Fig. 5. It can be seen that the new positions of the snaxels are better distributed over the image than the originals (Fig. 2a) and therefore, represent a better approximation of the border of the region of interest within the image.

Fig. 5
figure 5

Result for snaxel reallocation

3 Graphic processing units

High-performance computing refers to the use of parallel processing platforms in the solution of computational intensive problems efficiently, reliably and quickly. Parallel platforms include clusters, multicore architecture and hybrid systems [31].

Recently, semiconductor industry has settled on two main design lines for microprocessors: the multicore processors and the manycore processors. By their definition, a multicore processor can have up to tens of cores, whereas a manycore processor has hundreds and even thousands of cores [32]. A CPU is an example of a multicore processor; GPUs are an example of a manycore processor.

The ratio between GPUs and multicore processors for peak floating-point operations (FLOPS) is about 10–1 [33]. The reason for this gap in their performance lies in their architecture: a CPU is a processor designed for sequential code execution whereas the GPU design is conceived to increase the execution throughput of parallel applications. The higher memory bandwidth of the GPU is also an important feature that contributes to the performance gap [32].

3.1 CUDA

CUDA is a parallel computing architecture for general purpose as well as a parallel programming model that offers high-level access to the GPU. CUDA allows the use of a GPU to solve computationally intensive problems in a more efficient way than with a CPU [34].

The CUDA programming model allows the transparent scalability of applications through GPUs with different number of cores. The base to get this transparent scalability are three key abstractions: a hierarchy of threads groups, shared memory and barrier synchronization. These abstractions allow the programmer to partition the problem into thread blocks that can be solved independently. The thread blocks can be considered as sub-problems and are executed in the available GPU cores, in any order and in parallel or sequential manner. This independence of the thread block execution is the characteristic that allows the scalability of the CUDA applications [34, 35].

The CUDA programming model enables the use of a GPU as a co-processor of the CPU. In this context, the GPU is called device and the CPU is called host, as it is shown in Fig. 6. A CUDA program is composed of sequential code sections for the host and parallel code sections for the device. In the parallel code sections, thousands of threads are executed concurrently to reduce the computation time [32].

Fig. 6
figure 6

CUDA heterogenous programming model

In a CUDA program the main thread is executed in the host, as it is shown in Fig. 6. When a kernel is called, the execution is moved from the host to the device where a massive number of threads are executed concurrently to perform the parallel operations. A kernel is a subroutine that is executed \(K\) times by \(K\) threads within the device [32]. The threads generated by a kernel are grouped in thread blocks, and the total of thread blocks within a kernel is called a grid. The kernel calls are asynchronous, which implies that after the invocation of a kernel, the host can execute the rest of the sequential code or just wait for the termination of the kernel execution [32].

3.2 Optimization strategies

The optimization of a CUDA program is based on three main aspects: maximize parallel execution, optimize memory usage to achieve maximum memory bandwidth, and optimize instruction usage to achieve maximum instruction throughput [34, 36].

The most important rule to optimize the memory space usage of the GPU is to minimize the data transfer operations between the CPU and the GPU, because these memory operations have a lower memory bandwidth than the internal transfers within the GPU. It is also important to minimize the kernel access to the global memory and maximize the use of the shared memory. For a detailed description of many other optimization strategies see [36].

4 Parallel implementation of the GVF snake

The proposed scheme for the parallel implementation of the GVF snake is shown in Fig. 7. We implemented this scheme twice with the parallel frameworks OpenMP and CUDA to do a performance comparison between them. Our approach focuses on the parallelization of the GVF field computation, which is denoted as the A-box of Fig. 7. Due to the dynamic reallocation feature proposed, the solution of Eqs. (8) and (9) is performed sequentially as only relatively small linear systems are used to compute the snake deformation. The other less demanding operations, e.g. image reading and writing, are executed sequentially too. The GVF field is computed via the Eqs. (14) and (15). The OpenMP and CUDA codes for this parallel section are detailed below.

Fig. 7
figure 7

General scheme of GVF parallel computation

4.1 OpenMP computation of the GVF field

For the case of the OpenMP implementation, the A-box of Fig. 7 represents a parallel for. Algorithm 1 contains the pseudo-code used.

figure a

We tested many configurations for the OpenMP implementation following the optimizations recommended in [37] and the structure presented in Algorithm 1 produced the best results in terms of overhead, CPU usage, thread concurrency and execution time. Table 3 comprises the timing value results for each of the images used.

The omp parallel directive is placed outside the main loop to reduce the overhead produced by the creation of parallel regions. The computation of the boundary values of the \(u\) and \(v\) components and also the update process of these components is performed sequentially by the master thread.

The omp for directive is used to make the computing of the \(u\) and \(v\) components in parallel. Each thread computes chunks of 16 loop iterations and the scheduling of these chunks is dynamic, i.e. the thread executes the chunk of iterations then requests another chunk until there are no more chunks to work on [37].

An OpenMP implementation represents a coarse-grain parallel approach, in which each thread computes the GVF field of hundreds of pixels sequentially, i.e. only few pixels are computed concurrently.

4.2 CUDA computation of the GVF field

For the CUDA implementation, the A-box in Fig. 7 represents a loop that contains kernel calls. The loop is controlled by the host whereas the kernels called within it are executed on the device. Algorithm 2 contains the pseudo-code used.

figure b

The computing of the GVF field with a GPU involves the solution of Eqs. (8) and (9) on each pixel of the image matrix \(I_M(x, y)\) using a scheme of one pixel per thread. This approach of data mapping facilitates the pixel distribution into thread blocks, which is the base of the CUDA programming model. This way, unlike the sequential implementation of the GVF snake in which the execution thread must calculate the GVF field of one pixel at a time or the OpenMP implementation described in Sect. 4.1, the GPU implementation allows the parallel computation of the GVF field on thousands of pixels concurrently. Figure 8 shows the model of one pixel per thread for an image of \(10 \times 10\) pixels. It is shown how the image is divided into blocks and the way each block is assigned a set of threads.

Fig. 8
figure 8

One pixel per thread scheme

Because the equations for the components \(u\) and \(v\) of the GVF field involve access to data in a regular pattern (a central pixel and 4 of its neighbors), the necessary data are copied to the GPU memory as textures. Textures facilitate access to two-dimensional data, automatically handle the non-valid spatial locations and in some cases improve the computing time as a result of their cache. The kernels that are called to compute the GVF field (A-box of Fig. 7) are detailed in Algorithm 2. After the memory transfer from the host to device, the host starts a loop and the kernel GVF \(\mathbf {<<<\cdots >>>}\) is called once per iteration. This kernel computes the Eqs. (14) and (15). After completing the specified number of iterations, the kernel normalization \(\mathbf {<<<\cdots >>>}\) is called to normalize the \(u\) and \(v\) components to the range \([-1, 1]\).

The Algorithm 3 presents the pseudo-code for the global memory version. To properly compute the Laplacian of the boundaries of the image, we used temporal arrays \(u_\mathrm{temp}\) and \(v_{temp}\) with an increased size by two pixels on each axis, i.e. if the original image is \(256 \times 256\) pixels in size, we set the temporal images as \(258 \times 258\) pixels in size. The extra columns and rows are filled with zeros. To identify each thread correctly a padded thread ID is used. By utilizing this approach, the use of if statements is not required to avoid non-valid spatial conditions and this prevents thread divergence in the kernels. In Algorithm 3, the index id points to the original image locations. The index ida points to the temporal arrays of increased size.

figure c

4.3 Computation of the snake deformation and dynamic snaxel reallocation

After computing the GVF field, the initial contour for \(\mathbf {v}\) and the differentiation matrix \(A\) are created.

The snake deformation is driven by the iterative solution of Eqs. (8) and (9) which can be represented as the linear systems \(x^{t+1}=Bb\) and \(y^{t+1}=Bc\) respectively where

$$\begin{aligned} B=\left( \frac{I}{\Delta t}+A\right) ^{-1}, \quad b=\left( \kappa f_{{x}^{t}}+\frac{x^{t}}{\Delta t}\right) , \quad c=\left( \kappa f_{{y}^{t}}+\frac{y^{t}}{\Delta t}\right) , \end{aligned}$$
(22)

\(I\) is the identity matrix and \(\Delta t\) is the time step.

The root mean square error (RMSE) is proposed as the stopping criterion for the deformation of the snake. The equation used is

$$\begin{aligned} \mathrm{RMSE}=\frac{\sqrt{\sum _{i=1}^{k}( \mathbf {v}_{i}^{t+1}-\mathbf {v}_{i}^{t})^{2}}}{k} \end{aligned}$$
(23)

where \(k\) is the number of snaxels.

To ensure that the snake has found the minimum energy region, three results of consecutive values of the RMSE are stored. The snake deformation loop is stopped when

$$\begin{aligned} \mathrm{RMSE}_\mathrm{thresh}>\mathrm{RMSE}_{1}>\mathrm{RMSE}_{2}>\mathrm{RMSE}_{3}, \end{aligned}$$
(24)

where \(\mathrm{RMSE}_\mathrm{thresh}\) is a predetermined value. The RMSE is computed before the snaxels reallocation process.

While the stopping criterion is not met, the snaxels reallocation process is performed. After each solution of Eq. (22), (18) and (19) are used to perform the mesh adaptation over the \(s\) domain and then, the reallocation of the snaxels is computed via the Eqs. (20) and (21). The new vectors \(x\) and \(y\) are then used in Eq. (22) to compute the next solution of the snake deformation. Although in all the experiments presented in this work the mesh adaptation process is computed just once after each solution of Eq. (22), it can be implemented iteratively to enhance the snaxel distribution.

5 Results

In this section, we present the results of our investigation. First, we discuss the performance of our snaxel reallocation approach by means of a synthetic test image. Then, we present the results of our GVF snake parallel implementations.

5.1 Snaxel reallocation implementation

Figure  9a shows the binary synthetic image that we used to test our snaxels reallocation method. The image presents a rounded star shape defined by the contrast difference. The star features five concave regions and is symmetric about the \(y\)-axis. The star image is 8 bits per pixel and \(256 \times 256\) pixels in size.

Fig. 9
figure 9

a Initial contour, b plain segmentation, c segmentation with static refinement and d segmentation with the proposed snaxel reallocation method

In this part of the investigation we implemented the algorithm described in Fig. 7 sequentially because the computation performance is not the subject under analysis. The initial contour is a circle with center at \([128,135]\) and radius \(r=118\). This initial snake configuration is the same for the three implementations presented in this section.

Figure  9b shows the result of plain segmentation with the active contour. The snake parameters used are \(\alpha =0.2\), \(\beta =0\), \(\kappa =1\), 3,000 iterations for the GVF computation, \(\mu =0.15\) and \(N \mathrm{(number~of~snaxels)}=1{,}300\). The active contour was indeed able to converge inside the concave regions after 2,000 iterations of the deformation Eq. (22) but the snaxels distribution over the edge of interest presents spacing problems as there are regions with clustering and some other regions where the snaxels are separated from each other. This behavior is the reason why a large number of snaxels is needed in order to let the snake converge in the concave regions. The large number of iterations required for the GVF field is another important issue for this particular case.

The final active contour shape for the case of static refinement of snaxels is shown in Fig. 9c. We used \(\alpha =0.01\), \(\beta =0\) and \(\kappa =0.6\), 70 iterations for the GVF computation, \(\mu =0.2\), \(N_\mathrm{ini}=50\) and \(N_\mathrm{fin}=605\). In fact, with this approach the active contour deformation process starts with a given \(N\) value \((N_\mathrm{ini})\) but ends with a different number of snaxels \((N_\mathrm{fin})\). The deformation process took 2,000 iterations of Eq. (22). The arc length between snaxels is the criterion for the refinement process, i.e. if the snaxels tend to cluster some of them are removed, if the snaxels tend to separate from each other a given number of snaxels is added. It can be seen that the snake was able to converge inside the concave regions but with a high computational cost, because the deformation matrix \(A\) needs to be calculated every time the number of snaxels is changed. This is an important drawback of this refinement approach as it directly affects the computational load of the algorithm.

Figure  9d shows the final snake configuration for our proposed snaxels reallocation approach. The snaxels are placed almost symmetrically about the \(y\)-axis and the active contour was able to effectively converge inside the concave regions. The settings used are \(N=70\), \(\alpha =0.01\), \(\beta =0\), \(\kappa =0.6\), 70 iterations for the GVF field and \(\mu =0.2\).

Considering the segmentation results, the number of snaxels used and the iterations for the GVF field, we can state that our reallocation approach is more efficient than the other two methods. The advantage is clear not only in the segmentation results but also in a performance computational point of view because with the proposed mesh adaptation method we obtain better results with fewer snaxels or GVF iterations and these conditions directly impact the execution time.

It has to be mentioned that, if needed, our approach can be parallelized as there are no data dependencies in the computing of the mesh adaptation. Since only a relatively small number of snaxels is utilized in the experiments presented in this work, we estimate that the parallel implementation was not necessary.

5.2 GVF snake parallel implementation

We used MRIs, PET/MRIs and CTs medical images only for illustrative purposes to show the performance of active contours. In order to analyze the efficiency of our algorithms, we have used six different sizes of images within the range of \(64\times 64\) to \(2{,}048\times 2{,}048\) pixels in size. The implementation time for the sequential version of each experiment is presented for reference. All the timing values presented are result of optimized versions of our algorithms and an average of six runs.

The experiments have been conducted under Ubuntu 13.04 using C language on a computer with an intel i7-930 processor at 2.8 GHz and 6 GBytes of RAM memory. We disabled the Hyperthreading and Turboboost features to reduce variations between runs of the programs. The CUDA implementation is made with a Kepler GeForce GTX 670 GPU that has 2 GBytes of global memory. All the results presented are single-precision floating point. The OpenCV library is used to perform the image reading, writing and Canny edge detection operations. The same initial contour configuration is used for the three implementations of each experiment. Depending on the region of interest, we used ellipses or circumferences as the initial contour. All the medical images used in the experiments are 8 bits per pixel. The contour map for all the experiments is created with the Canny edge detector [29]. The final segmentation is equal for the sequential and parallel implementations, so only one image is shown.

We used the Intel Inspector and Intel Vtune Amplifier software to evaluate our OpenMP algorithms. Intel Inspector software can be used to locate and fix memory-related problems, e.g. data races and deadlocks presented in shared memory platforms; Intel Vtune Amplifier is used to determine the thread concurrency level, CPU usage, FLOPS and many other parameters. Therefore, when used together, the Amplifier and Inspector software provide a complete profile of an OpenMP application. To profile and debug the CUDA algorithms the Nvidia Visual Profiler and nvprof are used.

To compute the GVF field on the GPU we used blocks of \(16\times 16\) or \(32 \times 32\) threads depending on the image size and these block sizes produce respectively 256 and 1,024 CUDA threads. For the OpenMP implementation we used four threads, i.e. one thread per physical core.

Figure 10a, b shows Image1 which is a binary synthetic image that is \(64\times 64\) pixels in size. Image2 is a \(128\times 128\) pixel-size positron emission tomography/magnetic resonance image that is showed in Fig. 10c, d [38]. The image highlights a brain lesion. Moving one step forward, Image3 is a magnetic resonance image which is \(256\times 256\) pixels in size and it is shown in Fig. 10e, f [39]. Image3 shows a section of the axial plane of the brain. Our fourth experiment is conducted with Image4 which is a \(512\times 512\) pixels in size computed tomography shown in Fig. 11a, b [39]. The image presents a section of the sagittal plane of a knee. Figure 11c, d shows Image5 which is a computed tomography that is \(1{,}024\times 1{,}024\) pixels in size [40]. This CT shows a section of the axial plane of a colon. Figure 11e, f shows a computed tomography (Image6) that we have used to perform our last experiment. The image is \(2,048\times 2,048\) pixels in size and shows a section of the axial plane of the brain [40]. For each experiment, the initial active contour and its final shape are shown. The GVF field parameter \(\mu \), the settings for the initial contour and the parameters of the snake for all the experiments presented are summarized in Table 1.

Fig. 10
figure 10

a Initial and b final contour for synthetic image Image1; c initial and d final contour for PET/MRI Image2; e initial and f final contour for MRI Image3

Fig. 11
figure 11

a Initial and b final contour for CT Image4; c initial and d final contour for CT Image5; e initial and f final contour for CT Image6

Table 1 GVF \(\mu \) value, settings for initial contours and snake parameters

First, we present in Table 2 the comparison of timing values between the texture and global memory implementations of our algorithm. From these results, it is clear that the texture implementation is faster than the one with global memory, but the speedup is not as high as it could be with older GPU architectures. This is due of the L1 and L2 caches that feature the GPU but in addition, because the texture cache in the Kepler architecture is now available for general load operations as a read-only memory without using texture units. These results agree with the conclusions presented in [13]. Based on this timing results, we use the texture implementation in the performance comparison against the openMP version of the GVF computing.

Table 2 Timing values for global memory and texture computing of GVF field

The number of iterations for the GVF field, the best computation times for the sequential, OpenMP and CUDA implementations and the ratio between the OpenMP and CUDA timing values of all the experiments are shown in Table 3.

Table 3 Execution time comparison of GVF field computing between the sequential, OpenMP and CUDA implementations

In order to make a more objective comparison and better evaluation between the implementations, we have taken into account three key aspects: (1) execution time, (2) occupancy and (3) FLOPS. Table 4 presents these results for the OpenMP and CUDA implementations. We used the occupancy as a metric because the GVF computation is memory-bound and for this type of algorithm, we required to increase the occupancy to hide latencies. The FLOPS are used to roughly estimate the computational peak of the algorithms.

Table 4 Metrics of GVF field computing for the OpenMP and CUDA implementations

For the GPU, occupancy is defined as the ratio between active warps and the maximum number of warps. For the OpenMP implementation, we defined occupancy as the ratio between the thread concurrency (number of active threads) and the number of physical cores, which in our particular case is four.

Besides the metrics presented in Table 4, there are many other parameters that can be considered in the evaluation of an application but not all of them represent a good aspect to take into account because all the problems have specific issues. However, the execution time is the most representative metric to evaluate a parallel application and, in addition with other selected metrics, gives a complete performance profile of an algorithm.

6 Conclusions

The proposed snaxel reallocation method improves the efficiency of the segmentation process with the GVF snake. From a HPC point of view, it improves the performance of the snake deformation computing by significantly reducing the number of snaxels required to perform the segmentation of a given region of interest. This condition has a direct impact on the computation time because the deformation of the snake can be driven with the use of small linear systems. Furthermore, the proposed reallocation technique in some cases reduces the necessity of a large number of iterations for the GVF field, decreasing the overall computation time of the segmentation.

Our implementation provides better results if it is used on images that have been processed with noise reduction and enhancement techniques, and present high contrast. If it is used in images without these characteristics, there is a probability that the active contour will not converge to the region of interest due to the many local energy minima.

We present an objective comparison between the parallel implementation of our algorithm on a CPU and on a GPU. OpenMP is used to implement the CPU parallel version. Although the elapsed time of an application is not the best way to evaluate a CUDA algorithm, it is indeed the most important aspect. That is why it is the most common metric for assessing a program. However, execution time alone is not the best choice because is not objective, fair nor qualitative. Many factors are involved in a speedup such as the capacity of the processors or available resources, among others. That is why we think that an execution time comparison plus an objective measurement of the GPU and CPU resources utilization is the best way to evaluate a GPU application. For this reason, we used three metrics for the evaluation of our CUDA application: execution time, occupancy and FLOPS.

The results encountered in this work show three important aspects. First, the OpenMP version of the GVF field computing is in general just slightly faster than the sequential version of the algorithm. This somehow unexpected result can be explained by the auto-vectorization feature of the icpc compiler which reduces the computational load of loops using SSE packed (vector) instructions rather than scalar instructions. This process can be seen as a type of loop parallelization. Due the characteristics of the GVF field computing loop, the auto-vectorization feature greatly reduces the sequential computation time. Second, in our analysis of textures and global memory performance for the Kepler architecture, we did find out that the GVF computing using textures is faster than the global memory version. However, the difference is not as significant as it could be in older GPU architectures. Kepler texture objects were not used in this work because they are designed for applications where a large number of textures is required and the binding/unbinding overhead of texture references represents a time execution problem. Third, the difference in the execution time of the OpenMP and CUDA implementations coincides with the theoretical gap between the computational capacity of a CPU and a GPU. This result indirectly indicates that our implementations are correct considering that speedups of orders of magnitude are not consistent with the peak capacity of the processors.