1 Introduction

Edge detection is an essential operation in many fields, such as image processing, computer vision and pattern recognition. Over the years, many edge detection algorithms have been proposed, including classical approaches, such as the Roberts [1], Sobel [2], Prewitt [3] and Canny [4] methods, as well as more recent methods based on soft computing techniques, such as fuzzy logic [5], artificial neural networks [6], genetic algorithms [7], particle swarm optimization [8], ant colony optimization [9] and adaptive neuro-fuzzy inference systems [10].

The Canny algorithm [4], also known as the optimal detection method, is still one of the most widely used edge detection techniques due to its superior performance. It consists of the following four stages: (1) noise reduction, (2) gradient computation, (3) non-maximum suppression, and (4) hysteresis thresholding. First, the image noise is reduced by a Gaussian convolution. Next, first derivatives are calculated in both the horizontal (\(d_x\)) and vertical (\(d_y\)) dimensions. From these two images, the gradient magnitude (G) and direction (\(\theta\)) are computed for each pixel by the formulas \(G = \sqrt{d_x^2 + d_y^2}\) and \(\theta = \tan^{-1}(\frac{d_y}{d_x})\). In the third stage, possible edges are obtained by suppressing all pixels that are not local maxima in the gradient direction. In the last stage, hysteresis thresholding determines which of the possible edges are actual edges using two threshold values, low and high. First, the set of pixels with \(G \ge high\) and the set of pixels with \(G \le low\) are directly classified as edges and non-edges, respectively. Then, the remaining possible edges (i.e., those with \(low< G < high\)) are classified as edges if and only if they are connected (directly or via other possible edges) to pixels with \(G \ge high\). In the rest of the paper, the set of pixels with \(low< G < high\) will be referred to as the instability zone [11], and their classification process as the linking process [11]. Additionally, we define the instability map as a binary image of the same dimensions as G, in which the value of pixel (i, j) is 1 if pixel (i, j) of G belongs to the instability zone, or 0 otherwise.
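
As an illustration of stage 2, the following minimal CUDA sketch computes the gradient magnitude and direction per pixel from precomputed first derivatives; the array names and launch configuration are ours and are not part of the original algorithm description.

```cuda
// Minimal sketch (not any specific paper's kernel): per-pixel gradient magnitude and
// direction from precomputed first derivatives d_x and d_y. Names are illustrative.
__global__ void gradient_mag_dir(const float* d_dx, const float* d_dy,
                                 float* d_mag, float* d_theta, int n_pixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pixels) return;

    float dx = d_dx[i];
    float dy = d_dy[i];
    d_mag[i]   = sqrtf(dx * dx + dy * dy);   // G = sqrt(dx^2 + dy^2)
    d_theta[i] = atan2f(dy, dx);             // theta = arctan(dy / dx)
}
```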

The main drawback of the Canny edge detector is that it is time consuming, due to its high computational complexity. To overcome this limitation, many implementations of the algorithm have been presented on different accelerating platforms, such as ASICs [12,13,14], FPGAs [15,16,17,18,19,20,21,22,23] and GPUs [25,26,27,28,29,30,31,32].

There are several ASIC implementations of Deriche filters, which have been derived from Canny’s criteria. Deriche [12] presented a network with four transputers that took 6 s to detect edges in a \(256\times 256\) image, which is far from real-time requirements. Torres et al. [13] proposed a faster solution that processed 25 frames/s at 33 MHz, but the area overhead was increased by the use of Last-In First-Out (LIFO) stacks in off-chip SRAM memories. Lorca et al. [14] presented a new design that improved that of [13] by reducing the memory size and the computation cost by a factor of two. Nevertheless, the number of clock cycles per pixel varies with the image size, so the processing time increases with the size of the image.

Some efforts have been made to accelerate Canny edge detection using FPGAs [15,16,17,18,19,20,21,22,23]. The proposals in [15] and [16] translated the software designs directly into hardware description languages (Handel-C and VHDL, respectively), which resulted in timing performance degradation. Gentsos et al. [17] presented a parallel architecture with simultaneous 4-pixel calculation that reduced the latency of the implementations of [15] and [16]. He et al. [18] proposed a self-adapting threshold Canny algorithm to overcome the drawback of setting the hysteresis thresholds manually in existing hardware implementations. In their method, the hysteresis thresholds are calculated from the histogram of the gradient magnitude. Their algorithm required about 2.5 ms to detect the edges of a 360\(\times\)280 image on a platform based on the FPGA chip EP1C60240C8 (Altera Cyclone). Li et al. [19] presented another solution for the self-adapting threshold Canny algorithm, which adopted a shifting-LUT-based direction calculation algorithm to improve the processing speed. The processing time was 5.24 ms for a 512\(\times\)512 image on a Xilinx Virtex-5 FPGA. Peng et al. [20] proposed an improved high-speed Canny edge detection algorithm based on FPGA, in which the gradient is calculated by the second harmonic of the variable parameters (SHOVP) to simplify complex arithmetic into logic operations. The feasibility and effectiveness of the algorithm were tested on an Altera DE2 platform. Abdelgawad et al. [21] proposed an implementation of the Canny algorithm on the Zynq platform using Vivado High Level Synthesis (HLS). The achieved results showed that the collaboration of CPU and FPGA enabled up to a 100x performance improvement. CPU utilization dropped and the frame rate reached 60 fps for 1280\(\times\)1024 resolution. Xu et al. [22] presented a distributed Canny edge detection algorithm that adaptively computes the edge detection thresholds based on the block type and the local distribution of the gradients in the image block. In addition, their method uses a non-uniform gradient magnitude histogram to compute block-based hysteresis thresholds. The implementation of the algorithm on a Xilinx Virtex-5 FPGA platform takes only 0.721 ms (including the SRAM read/write time and the computation time) to detect edges of 512\(\times\)512 images from the USC SIPI database when clocked at 100 MHz. Sangeetha et al. [23] proposed a cost-effective robust Canny edge detection algorithm, whose key contributions are the following: (1) computation of gradient magnitude and orientation using an approximate method, (2) block classification techniques, and (3) adaptive threshold calculation for each block. Results on a Xilinx Virtex-5 FPGA showed that the algorithm requires only 0.672 ms to detect the edges of a 512\(\times\)512 image when clocked at 100 MHz.

In the area of general-purpose computing on graphics processing units (GPGPU), several efficient implementations of the Canny algorithm have been proposed [25,26,27,28,29,30,31,32]. Luo and Duraiswami [25] presented the first implementation of the Canny algorithm on the popular NVIDIA CUDA framework [33]. They mapped the entire algorithm to the GPU, and improved previous similar implementations on NVIDIA Cg [34] and Khronos Group GLSlang [24] that did not include the hysteresis stage. The convolution steps (Gauss and Sobel filtering) are efficiently implemented using a separable filter algorithm, similar to the one supplied with the CUDA toolkit [35]. The gradient magnitude and direction are easily obtained by calculating the L2 norm and the arctangent, respectively, of the first derivatives using a simple pixel-to-thread mapping. The gradient direction of each pixel is quantized to one of the eight directions corresponding to the neighboring pixels (\(\{\pi /8 + k\pi /4\}\)). Non-maximum suppression is performed in a straightforward way by setting to 0 the gradient magnitudes that are not local maxima in the gradient direction. Hysteresis is performed by a kernel of 16\(\times\)16 thread-blocks, each of which processes a separate 16\(\times\)16 pixel-block of the gradient along with a one-pixel-wide apron around the 16\(\times\)16 pixel-block, resulting in a 17\(\times\)17 pixel-block. Each thread-block loads its assigned 17\(\times\)17 pixel-block into shared memory, and executes a breadth first search (BFS) algorithm on it to classify the pixels of the internal 16\(\times\)16 pixel-block as edges or non-edges. This classification is carried out by assigning \(-2\) to the gradient magnitude if the pixel is an edge, or 0 otherwise. Once a thread-block finishes the BFS process, it writes the edge states of all non-apron pixels in shared memory back into the gradient magnitude space in global memory. Subsequent calls to the hysteresis kernel allow the linking among pixels that belong to different 16\(\times\)16 pixel-blocks, thanks to the reloading of the updated edge states of apron pixels into shared memory. Due to this multi-pass approach, the implementation speed is dominated by the hysteresis process. Experimental evaluation showed that it occupies more than 70% of the total runtime. For testing purposes, the hysteresis kernel was called four times per iteration, as no significant improvement was observed with higher values for the test images. Experiments showed a significant speedup against straightforward CPU functions, but a moderate improvement against multi-core multi-threaded CPU functions taking advantage of special instructions. The measured execution time for a 512\(\times\)512 image was 3.40 ms. Ogawa et al. [26] presented a solution based on the work of Luo and Duraiswami [25], in which they described an issue in the traversal of all weak edge pixels, and proposed a stack-based mechanism to fix it. In the hysteresis thresholding stage, if the pixel assigned to a thread is a strong edge pixel, the thread uses a stack to traverse the adjacent weak edge pixels, which are labeled as final edge pixels. Experimental evaluation showed a runtime of 364.389 ms for a 10240\(\times\)10240 image. The logarithmic image processing (LIP) model is a robust mathematical framework that is compatible with what is known about the human visual process [36]. In [27], Palomar et al.
presented the implementation of two LIP-Canny methods, one operating on images in LIP space with traditional operators, and the other operating on images in natural space with modified operators. The work of Palomar et al. [27] was based on those of Palomares et al. [37] and Luo and Duraiswami [25]. As in [25], the number of iterations of the hysteresis kernel was fixed to 4. Experimental evaluation showed that the CUDA implementations are 10–16 times faster than the corresponding C++ implementations. Moreover, LIP-Canny using modified operators is slightly faster than the alternative approach based on classical operators. The average runtimes for 512\(\times\)512 images were 26.448 ms and 28.848 ms for the first and second method, respectively. Lourenço et al. [28] developed a CUDA implementation of the Canny algorithm for the Insight Segmentation and Registration Toolkit (ITK) using second-order derivatives (instead of Sobel filtering [25]) and a hybrid CPU-GPU approach for the hysteresis stage that closely followed the method proposed in [25]. Experimental evaluation showed that the CUDA implementation on three generations of NVIDIA GPGPUs was between 3.6 and 50 times faster than the standard ITK Canny implementation on two CPU models. The main novelties of the CUDA implementation proposed by B. M. L. P. Vigil [29] are the application of Otsu's method for the automatic calculation of the hysteresis thresholds, and the use of interpolation in the non-maximum suppression step to improve the quality of edge detection. The hysteresis thresholding is performed by the same hybrid CPU-GPU technique used in previous works, and, hence, it occupies a considerable percentage of the total execution time (more than 50%). The execution times of the CUDA Canny detector for the 512\(\times\)512 Lena, Mandrill and Peppers images were 8.49 ms, 9.84 ms and 10.90 ms, respectively. Huang et al. [30] presented a CUDA implementation on NVIDIA's embedded CPU-GPU heterogeneous computing platform Jetson TK1. Noise reduction, gradient computation and non-maximum suppression are efficiently implemented in a similar way to that of [25]. However, the linking process is replaced by a simpler scheme, which classifies a pixel of the instability zone as an edge pixel if at least one of its eight neighboring pixels is an edge pixel. Additionally, the hysteresis thresholds are obtained from the histogram of the gradient magnitude. Experimental evaluation showed that the runtimes for the 512\(\times\)512 Lena and Peppers images were approximately 3 ms. In [32], Emrani et al. presented a CUDA implementation of the Canny algorithm in which the main novelty was the replacement of Luo and Duraiswami's BFS algorithm [25] with a more efficient method. The kernel corresponding to this method checks whether a pixel belongs to the instability zone or not. If so, it checks its neighboring pixels. If a strong edge is found, the current pixel is classified as an edge pixel. A flag in global memory is used to indicate whether any pixel of the instability zone has been classified as an edge pixel. The kernel is relaunched as long as the flag remains set. The execution time of the CUDA Canny detector for a 512\(\times\)512 image was 37.35 ms on a GeForce GTX 550 Ti GPU.

As we have just seen, the main bottleneck of GPU-based implementations of the Canny algorithm is the hysteresis step, due to the need to call the hysteresis kernel, from the host side, an indeterminate number of times (at least 4). On the other hand, in all implementations except B. M. L. P. Vigil’s [29] and Huang et al.’s [30], the hysteresis thresholds are adjusted manually. In this work, we propose a novel GPU-based implementation of the Canny algorithm on CUDA that overcomes these limitations. As in [22] and [23], the image is partitioned into sub-images, and the following steps are performed on each sub-image in parallel: (1) calculation of the optimal hysteresis thresholds, and (2) hysteresis process using the parameters obtained in the previous step. As each sub-image is processed independently, the costly hybrid CPU-GPU approach to the hysteresis stage used in previous implementations is not necessary. The calculation of the hysteresis thresholds is carried out with Medina-Carnicer’s method [11], which, at present, is relevant for unsupervised edge detection because, since its introduction, it has been used to automatically find the hysteresis thresholds in many works [38,39,40,41,42,43,44,45,46,47,48,49,50,51]. Medina-Carnicer’s method [11] outperforms those used in previous implementations of the Canny algorithm [18, 19, 22, 23, 29], because the former searches for the optimal values of both hysteresis thresholds, low and high, while the latter do not, since they assume a constant low/high ratio. Experimental evaluation showed that our GPU-based unsupervised and distributed Canny edge detector, which we have named GUD-Canny, requires only between 0.33 and 0.48 ms to detect edges in 512\(\times\)512 images, which fully satisfies real-time requirements and outperforms the reported runtimes of existing FPGA and GPU solutions.

The rest of the paper is organized as follows. Section 2 gives a brief overview of Medina-Carnicer’s method. Section 3 presents GUD-Canny. Section 4 shows the experimental evaluation of our solution, and, finally, the main conclusions are stated in Sect. 5.

2 Medina-Carnicer’s method for unsupervised determination of hysteresis thresholds

2.1 Background

In [11], Medina-Carnicer et al. presented a novel method to determine the hysteresis thresholds in an unsupervised way. Given a set of candidate thresholds pairs, the key idea is to combine the gradient information with that obtained from applying the linking process for all the candidate thresholds pairs. Experimental evaluation showed that the performance of Medina-Carnicer’s algorithm is better than that of previous methods [52, 53]. The computational complexity of Medina-Carnicer’s algorithm [11] is lower than that of the solution presented in [53], but higher than that of the proposal in [52]. Nevertheless, the approach in [52] only finds an approximate edge map and is not able to find the hysteresis thresholds. The results obtained by Medina-Carnicer’s method [11] have been validated only for the Canny edge detector, but nothing prevents its application to any other edge detector whose strategy is based on the hysteresis mechanism.

The main innovations presented in [11] are the following:

  1. In contrast to previous works [53,54,55,56], which are aimed at directly searching for the hysteresis thresholds, it follows an indirect way, which consists of looking for the instability zone and then determining the hysteresis thresholds from it.

  2. Unlike previous proposals [53, 55, 56], which only use gradient information, it combines the latter with that of the linking process.

2.2 Steps summary of Medina-Carnicer’s method

Let I be an image, G its gradient magnitude after non-maximum suppression normalized in the interval [0,1], and C a set of candidate thresholds pairs \(\{(low,high), low,high \in (0,1)\}\).

Given a hysteresis thresholds pair (low, high), we define the following edge maps:

  • Hysteresis map (\(G_{low,high}\)), which is obtained by performing the hysteresis process on G with (low, high).

  • High map (\(G_{high}\)), which is the result of thresholding G with high.

  • Linking map (\(\Delta G_{low,high}\)), which is composed exclusively of the edges added by the linking process using (low, high). Note that \(\Delta G_{low,high} = G_{low,high} - G_{high}\).

The steps of Medina-Carnicer’s method are the following:

  1. Calculate a set H of linking maps corresponding to the candidate thresholds pairs of C:

     $$H = \{\Delta G_{low,high},\; (low,high) \in C\} \qquad (1)$$
  2. Compute the sum \(SM_H\) of the linking maps:

     $$SM_H = \sum H \qquad (2)$$

     In this matrix, the value of each element is the number of times that the corresponding pixel of G is classified as an edge by the linking process over all the candidate thresholds pairs.

  3. Divide \(SM_H\) by the cardinality of C; the result will be denoted as \(Prob(SM_H)\):

     $$Prob(SM_H) = SM_H / |C| \qquad (3)$$

     Each element of \(Prob(SM_H)\) represents the probability that the corresponding pixel of G is classified as an edge by the linking process.

  4. Compute the distribution \(P(F(x)), \forall x \in (0,1)\), defined as follows:

     $$P(F(x)) = \begin{cases} \dfrac{|F(x)|}{|Prob_x(SM_H)|} & \text{if } |Prob_x(SM_H)| > 0 \\ 0 & \text{if } |Prob_x(SM_H)| = 0 \end{cases} \qquad (4)$$

where

  • \(Prob_x(SM_H)\) is the binary edge map obtained by thresholding \(Prob(SM_H)\) with \(x \in (0,1)\). Its elements with value 1 correspond to the pixels of G that have a probability equal to or greater than x of being classified as edges by the linking process.

  • \(|Prob_x(SM_H)|\) is the number of elements with value 1 in \(Prob_x(SM_H)\).

  • \(F(x) = G \circ Prob_x(SM_H)\), where \(\circ\) is the Hadamard product.

  • |F(x)| is the number of elements with value x in F(x).

The distribution P(F(x)) represents the probability that a pixel has gradient level x given that it has a probability equal to or greater than x of being added by the linking process. It is the combined information used by Medina-Carnicer’s method.

5. Compute the histogram of \(Prob(SM_H)\) for the set \(D = \{x \in (0,1) \mid P(F(x)) \ne 0\}\), which represents the instability zone. The hysteresis thresholds are the values of D corresponding to the first and last local maxima of the histogram.

The set C is obtained by sampling an interval \([0.01, MAX\_HIGH]\), where \(0.01 < MAX\_HIGH \le 1.0\). In [11], Medina-Carnicer et al. showed that two selections of C that ensure a good performance of their method are those obtained by sampling the interval [0.01, 0.25] with steps 0.01 and 0.03. Furthermore, the results presented in [53] indicate that their approach, in general, depends less on the initial set than the method of Yitzhaky and Peli [56] does.
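
To make the steps concrete, the following sequential host-side sketch outlines steps 1–5 under simplifying assumptions: images are stored row-major in vectors of P floats, the helpers hysteresis() and threshold() (declared but not shown) return binary maps, and values are matched to the sampling grid with a tolerance of half a step. The function and variable names are ours; this is an illustration of the method, not the GPU implementation described in Sect. 3.

```cuda
// Sequential sketch of Medina-Carnicer's method (steps 1-5), for illustration only.
// G is the gradient magnitude after non-maximum suppression, normalized to [0,1].
#include <cmath>
#include <utility>
#include <vector>

std::vector<int> hysteresis(const std::vector<float>& G, float low, float high); // G_{low,high}
std::vector<int> threshold(const std::vector<float>& G, float high);             // G_{high}

std::pair<float, float> medina_carnicer(const std::vector<float>& G,
                                        const std::vector<std::pair<float, float>>& C,
                                        float max_high = 0.25f)
{
    const std::size_t P = G.size();
    const float step = 0.01f, tol = step / 2.0f;   // tolerance to match values to the sampling grid

    // Steps 1-2: SM_H = sum of the linking maps Delta G_{low,high} = G_{low,high} - G_{high}.
    std::vector<int> SM_H(P, 0);
    for (const auto& lh : C) {
        std::vector<int> hyst = hysteresis(G, lh.first, lh.second);
        std::vector<int> high_map = threshold(G, lh.second);
        for (std::size_t i = 0; i < P; ++i) SM_H[i] += hyst[i] - high_map[i];
    }

    // Step 3: Prob(SM_H) = SM_H / |C|.
    std::vector<float> Prob(P);
    for (std::size_t i = 0; i < P; ++i) Prob[i] = static_cast<float>(SM_H[i]) / C.size();

    // Steps 4-5: P(F(x)) and the histogram of Prob(SM_H) restricted to D = {x : P(F(x)) != 0}.
    std::vector<float> D, hist;
    for (float x = step; x <= max_high + tol; x += step) {
        std::size_t n_selected = 0, n_equal = 0, h = 0;
        for (std::size_t i = 0; i < P; ++i) {
            if (Prob[i] >= x) {                                    // Prob_x(SM_H) == 1
                ++n_selected;
                if (std::fabs(G[i] - x) < tol) ++n_equal;          // elements of F(x) with value x
            }
            if (std::fabs(Prob[i] - x) < tol) ++h;                 // histogram bin of Prob(SM_H) at x
        }
        float PFx = (n_selected > 0) ? static_cast<float>(n_equal) / n_selected : 0.0f;
        if (PFx != 0.0f) { D.push_back(x); hist.push_back(static_cast<float>(h)); }
    }
    if (D.empty()) return {0.0f, 0.0f};                            // degenerate case: empty instability zone

    // Hysteresis thresholds: values of D at the first and last local maxima of the histogram.
    float low = D.front(), high = D.back();
    for (int k = 1; k + 1 < static_cast<int>(hist.size()); ++k)
        if (hist[k] > hist[k - 1] && hist[k] > hist[k + 1]) { low = D[k]; break; }
    for (int k = static_cast<int>(hist.size()) - 2; k >= 1; --k)
        if (hist[k] > hist[k - 1] && hist[k] > hist[k + 1]) { high = D[k]; break; }
    return {low, high};
}
```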

3 GPU-based unsupervised and distributed Canny edge detector (GUD-Canny)

In this section, we describe GUD-Canny, our GPU-based unsupervised and distributed Canny edge detector, which has been developed using the popular NVIDIA CUDA framework [33]. In the presented algorithms, the following notation is employed:

  • Prefixes d_, s_ and c_ in the names of the variables indicate that they are allocated in global, shared and constant memory spaces, respectively.

  • Symbols &, |, \(\sim\), \(<<\) and \(>>\) are the bitwise AND, OR, NOT, left shift and right shift operators, respectively.

Algorithm 1 provides a high-level description of GUD-Canny. As can be seen, the inputs of our method are the following. First, a \(W \times H\) image, which is provided in a vector of P 8-bit unsigned integers (\(d\_image\)), where P is the number of pixels. Second, the standard deviation \(\sigma\). Third, a set of NCTP candidate thresholds pairs, which is supplied in a vector of float pairs (\(c\_C\)). On the other hand, the output of GUD-Canny is the edge map of the input image, which is written in a vector of P 8-bit unsigned integers (\(d\_edges\)).

Steps 1–3 correspond to the classic first stages of Canny edge detection. To apply Medina-Carnicer’s method (steps 4 to 7), the non-maximum suppression returns the gradient magnitude normalized in the interval [0, 1] (\(d\_G\)). The gradient magnitude is partitioned horizontally into \(NS = W / 32\) sub-images of dimension \(32 \times H\), and Medina-Carnicer’s method [11] is used to calculate an optimal pair of hysteresis thresholds for each sub-image. Finally, in step 8, the hysteresis map is computed for each sub-image using its assigned hysteresis thresholds pair, and written in the output vector \(d\_edges\).

Since the original width of the input image may not be a multiple of 32, the CUDA function cudaMemcpy2D [57] is used to copy the input image from host to device memory adding the necessary padding to each row, and the same function is called to copy the output edges from device to host memory.
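
As an illustration, the padded transfers could be written as follows; the buffer and variable names are ours, and the pitch handling is a sketch rather than the exact code of GUD-Canny.

```cuda
#include <cuda_runtime.h>

// Illustrative host-side transfers: copy a W x H 8-bit image to the device with each
// row padded to a multiple of 32 pixels, and copy the edge map back without padding.
// Buffer names and error handling (omitted) are not part of the paper's code.
void copy_with_padding(const unsigned char* h_image, unsigned char* h_edges,
                       unsigned char** d_image, unsigned char** d_edges, int W, int H)
{
    size_t padded_W = ((W + 31) / 32) * 32;                 // device row pitch in bytes
    cudaMalloc(d_image, padded_W * H);
    cudaMalloc(d_edges, padded_W * H);
    cudaMemset(*d_image, 0, padded_W * H);                  // padding columns stay at zero

    // cudaMemcpy2D(dst, dpitch, src, spitch, width_in_bytes, height, kind)
    cudaMemcpy2D(*d_image, padded_W, h_image, W, W, H, cudaMemcpyHostToDevice);

    // ... steps 1-8 of Algorithm 1 run here on the padded device buffers ...

    cudaMemcpy2D(h_edges, W, *d_edges, padded_W, W, H, cudaMemcpyDeviceToHost);
}
```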

In the following subsections, each step of GUD-Canny is described in detail.

[Algorithm 1]

3.1 Gaussian filtering

To reduce the impact of noise, the input image is smoothed by convolving it with two one-dimensional Gaussian filters in the horizontal and vertical dimensions.

Each Gaussian filtering pass is performed by a different CUDA kernel, in which each output pixel is computed by a different thread. The kernel implementations are similar to those presented in [35], with the difference that shared memory is not used for caching data. Since the hardware cache system ensures good performance [57], all read/write operations are performed directly on global memory.

Each thread initializes the elements of the input image vector used to perform the convolution dot product as follows. If an element corresponds to an existing pixel, i.e., the position of the pixel is not outside the borders of the image, it is read from the input image. Otherwise, it is assigned the value zero.

As in [27], the length of Gaussian filters is variable and depends on the standard deviation \(\sigma\). Each kernel obtains the Gaussian filter from a table in constant memory, which stores the Gaussian filters corresponding to \(\sigma\) values between 0.1 and 2.0. The first table entry corresponds to \(\sigma\) = 0.1, the second one to \(\sigma\) = 0.2, and so on up to 2.0.
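
A minimal sketch of the horizontal pass is given below, assuming a constant-memory table of filters indexed by \(\sigma\) (here c_gauss, with an illustrative maximum length) and zero padding outside the image borders; the vertical pass is analogous, and all names are ours.

```cuda
// Minimal sketch of the horizontal Gaussian pass: one thread per output pixel, reads
// served directly by the global-memory caches, out-of-border samples treated as zero.
#define MAX_FILTER_LEN 17

__constant__ float c_gauss[20][MAX_FILTER_LEN];   // one filter per sigma in {0.1, 0.2, ..., 2.0}

__global__ void gauss_row(const unsigned char* d_image, float* d_smooth,
                          int W, int H, int sigma_idx, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    float acc = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int xs = x + k;                                                      // neighbor column
        float v = (xs >= 0 && xs < W) ? (float)d_image[y * W + xs] : 0.0f;   // zero outside borders
        acc += v * c_gauss[sigma_idx][k + radius];                           // convolution dot product
    }
    d_smooth[y * W + x] = acc;
}
```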

3.2 Gradient computation

After Gaussian filtering, each gradient tuple \((d_x, d_y)\) is calculated using the first difference operator \((-1,0,1)\), and the associated gradient magnitude by the formula \(\sqrt{d_x^2 + d_y^2}\). The results are written in the output vectors \(d\_grad\_x\), \(d\_grad\_y\) and \(d\_grad\_mag\), respectively. As in the Gaussian filtering step, all read/write operations are made directly on global memory, and the border conditions are carefully checked.

Additionally, to compute the maximum gradient magnitude, each thread performs an atomic maximum operation (using the CUDA function atomicMax [57]) between the calculated gradient magnitude and a global memory variable (\(d\_max\_grad\_mag[0]\)), which has been initialized to zero.
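
The following sketch illustrates this step. CUDA provides no atomicMax for float operands, so the sketch uses the common idiom of reinterpreting the non-negative magnitudes as signed integers, whose ordering matches the IEEE-754 ordering for non-negative values; the actual kernel of GUD-Canny may differ in these details, and all names are illustrative.

```cuda
// Sketch of the gradient kernel: first differences, magnitude, and a global maximum
// of the magnitude accumulated with an integer atomicMax (valid for values >= 0).
// The host reads the int back and reinterprets it as a float before normalization.
__global__ void gradient(const float* d_smooth, float* d_grad_x, float* d_grad_y,
                         float* d_grad_mag, int* d_max_grad_mag, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    // First difference operator (-1, 0, 1); out-of-border samples treated as zero.
    float left  = (x > 0)     ? d_smooth[y * W + x - 1]   : 0.0f;
    float right = (x < W - 1) ? d_smooth[y * W + x + 1]   : 0.0f;
    float up    = (y > 0)     ? d_smooth[(y - 1) * W + x] : 0.0f;
    float down  = (y < H - 1) ? d_smooth[(y + 1) * W + x] : 0.0f;

    float dx  = right - left;
    float dy  = down - up;
    float mag = sqrtf(dx * dx + dy * dy);

    d_grad_x[y * W + x]   = dx;
    d_grad_y[y * W + x]   = dy;
    d_grad_mag[y * W + x] = mag;

    // Global maximum of the gradient magnitude, needed later for normalization.
    atomicMax(d_max_grad_mag, __float_as_int(mag));
}
```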

3.3 Non-maximum suppression

In this step, one kernel computes a new version of gradient magnitude (\(d\_G\)) by performing non-maximum suppression and normalization on the gradient magnitude obtained in the previous stage (\(d\_grad\_mag\)). Given a pixel of value p in \(d\_grad\_mag\), the value of the corresponding pixel in \(d\_G\) is \(p / d\_max\_grad\_mag[0]\) if the pixel is a maximum in the gradient direction, or zero otherwise. Each pixel of \(d\_G\) is computed by a different thread of the kernel.

The method used for non-maximum suppression is the one employed in [29], which quantizes the gradient direction to one of the eight directions \(\{\pi /8 + k\pi /4\}\), and uses linear interpolation to calculate the values of the two neighboring pixels in the gradient direction.

Global memory operations and border conditions management are executed as in previous steps.
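
The sketch below shows a simplified version of the suppression and normalization test, in which the direction is quantized and the two neighbors along it are read directly, without the interpolation refinement of [29]; max_mag is assumed to have been read back from the previous step, and the names are ours.

```cuda
// Simplified non-maximum suppression and normalization: the gradient direction is
// quantized and the pixel is kept only if it is not smaller than its two neighbors
// along that direction. The kernel described in the text additionally interpolates
// the neighbor values as in [29].
__global__ void non_max_suppression(const float* d_grad_x, const float* d_grad_y,
                                    const float* d_grad_mag, float max_mag,
                                    float* d_G, int W, int H)
{
    const float PI = 3.14159265f;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    if (x == 0 || y == 0 || x == W - 1 || y == H - 1) { d_G[y * W + x] = 0.0f; return; }

    float m = d_grad_mag[y * W + x];
    float angle = atan2f(d_grad_y[y * W + x], d_grad_x[y * W + x]);
    if (angle < 0.0f) angle += PI;                            // fold direction to [0, pi)

    int dx, dy;                                               // offsets along the gradient line
    if      (angle < PI / 8.0f || angle >= 7.0f * PI / 8.0f) { dx = 1;  dy = 0; }
    else if (angle < 3.0f * PI / 8.0f)                        { dx = 1;  dy = 1; }
    else if (angle < 5.0f * PI / 8.0f)                        { dx = 0;  dy = 1; }
    else                                                      { dx = -1; dy = 1; }

    float n1 = d_grad_mag[(y + dy) * W + (x + dx)];
    float n2 = d_grad_mag[(y - dy) * W + (x - dx)];
    d_G[y * W + x] = (m >= n1 && m >= n2) ? m / max_mag : 0.0f;   // keep maxima, normalize
}
```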

3.4 Hysteresis thresholds computation

As stated previously, the gradient magnitude is partitioned horizontally into \(NS = W / 32\) sub-images, and Medina-Carnicer's method is applied to each one in parallel.

Images are processed by dividing them into groups of 32 consecutive pixels in the horizontal dimension, which will be referred to as regions. The numbers of regions of an image and of a sub-image will be denoted by NRI and NRS, respectively. For simplicity, the regions of instability/hysteresis/high/linking maps will be referred to as instability/hysteresis/high/linking regions, respectively.

Each of the steps 4–7 of Algorithm 1 is performed by a different kernel, whose actions are specified in the following subsections.

3.4.1 Calculation of the matrices \(SM_H\)

Algorithm 2 presents the pseudo-code of the kernel calc_SM_H, which calculates the matrix \(SM_H\) for each sub-image of G. The inputs are G, which is provided in a vector of P 32-bit floats (\(d\_G\)), and C, which is supplied in a vector of NCTP 32-bit float pairs, initialized statically in constant memory (\(c\_C\)). The outputs are the NS matrices \(SM_H\) corresponding to the NS sub-images of G, which are written in a vector of P 32-bit unsigned integers (\(d\_SM\_H\)), initialized to 0. Map regions are represented by 32-bit unsigned integers, where the i-th bit stores the binary value of the i-th pixel of the region. Although the gradient region reads in step 1 are not coalesced, as the CUDA literature [57, 58] recommends, they satisfy the principle of spatial locality because each thread reads 32 consecutive elements of \(d\_G\), which are properly aligned. Therefore, the transparent cache hierarchy of modern GPU architectures ensures a good performance while reading the gradient regions. On the other hand, the writes in step 5 are carried out atomically using the CUDA function atomicAdd [57].

[Algorithm 2]

In step 3 of Algorithm 2, each thread gets its hysteresis region (\(hyst\_reg\)) by calling the function \(calc\_hyst\_map\), which receives as inputs the high and instability regions of the calling thread (\(high\_reg\) and \(inst\_reg\), respectively). The actions performed by this function are presented in Algorithm 3. As can be seen, each thread-block computes its hysteresis map in a shared memory vector of NRS 32-bit unsigned integers (\(s\_hyst\_map\)). Each hysteresis region i is managed by thread i, and held in the element \(s\_hyst\_map[i]\).

[Algorithm 3]
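
To illustrate the bit-packed linking, the following device function sketches how the hysteresis map of one \(32 \times H\) sub-image can be grown in shared memory from the high regions into the instability regions; it conveys the idea behind \(calc\_hyst\_map\) rather than reproducing Algorithm 3 literally, and all names are ours.

```cuda
// Sketch of the shared-memory linking for one 32 x H sub-image. Each 32-pixel region
// (one row of the sub-image) is packed into a 32-bit word: bit i is pixel i. Edges
// grow from the high map into the instability map until no region changes.
__device__ unsigned int calc_hyst_region(unsigned int high_reg, unsigned int inst_reg,
                                         unsigned int* s_hyst_map)
{
    int i   = threadIdx.x;            // one thread per region (row of the sub-image)
    int nrs = blockDim.x;             // number of regions in the sub-image
    __shared__ int s_changed;
    int changed;

    s_hyst_map[i] = high_reg;         // start from the strong edges
    __syncthreads();
    do {
        // Read phase: snapshot this region and its vertical neighbors.
        unsigned int cur   = s_hyst_map[i];
        unsigned int above = (i > 0)       ? s_hyst_map[i - 1] : 0u;
        unsigned int below = (i < nrs - 1) ? s_hyst_map[i + 1] : 0u;
        if (i == 0) s_changed = 0;
        __syncthreads();              // reads and flag reset done before any write

        // A pixel's neighbors are the adjacent bits of its own row and of the rows
        // above and below; growing a row mask by one pixel is a pair of shifts.
        unsigned int grown = cur | (cur << 1) | (cur >> 1)
                           | above | (above << 1) | (above >> 1)
                           | below | (below << 1) | (below >> 1);
        unsigned int updated = cur | (grown & inst_reg);   // link instability pixels
        if (updated != cur) { s_hyst_map[i] = updated; s_changed = 1; }
        __syncthreads();              // all writes done before the flag is read

        changed = s_changed;          // each thread takes a private copy of the flag
        __syncthreads();              // copies taken before the next iteration resets it
    } while (changed);

    return s_hyst_map[i];             // hysteresis region of this thread
}
```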

An alternative way to divide the gradient magnitude into sub-images is to partition it vertically into \(NS = H / 32\) sub-images of dimension \(W\times 32\). In this case, the spatial locality of accesses to global memory is improved, because consecutive threads access consecutive regions. On the other hand, the advantage of the horizontal partition is that the number of operations in the linking process is reduced (step 5 of the function calc_hyst_map). The reason is that it is only necessary to examine the top and bottom regions; in the case of a vertical partition, the six remaining neighbor regions (left, top left, bottom left, right, top right and bottom right) also have to be taken into account. As will be shown in Sect. 4, GUD-Canny is slightly faster for sub-images of dimension \(32 \times H\).

3.4.2 Calculation of the matrices \(Prob(SM_H)\)

The matrix \(Prob(SM_H)\) for each sub-image of G is obtained by dividing each element of the corresponding matrix \(SM_H\) by NCTP. The matrices \(Prob(SM_H)\) are written in a vector of P 32-bit floats (\(d\_Prob\_SM\_H\)).

The number of threads of the grid equals P divided by 4, and each thread i performs the following actions:

  1. Reads the group i of four consecutive elements from \(d\_SM\_H\) through one vectorized load.

  2. Divides each element by NCTP.

  3. Writes the four computed float values to the 4-element group i of \(d\_Prob\_SM\_H\) through one vectorized store.

Vectorized accesses are an important GPU optimization, because they increase bandwidth and reduce both instruction count and latency [59].
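
A minimal sketch of this kernel, assuming P is a multiple of 4 and passing the reciprocal of NCTP so the division becomes a multiplication (names are ours):

```cuda
// Each thread converts four counters at once: one 128-bit load (uint4) from d_SM_H
// and one 128-bit store (float4) to d_Prob_SM_H.
__global__ void calc_prob_SM_H(const uint4* d_SM_H, float4* d_Prob_SM_H,
                               int n_groups, float inv_NCTP)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_groups) return;                    // n_groups = P / 4

    uint4 sm = d_SM_H[i];                         // one vectorized load
    float4 prob;
    prob.x = sm.x * inv_NCTP;                     // division by NCTP as a multiplication
    prob.y = sm.y * inv_NCTP;
    prob.z = sm.z * inv_NCTP;
    prob.w = sm.w * inv_NCTP;
    d_Prob_SM_H[i] = prob;                        // one vectorized store
}
```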

3.4.3 Calculation of the distributions P(F(x)) and the histograms of the matrices \(Prob(SM_H)\)

For each sub-image of G, the distribution P(F(x)) and the histogram of \(Prob(SM_H)\) are computed by one kernel for \(x \in \{0.01, 0.02, ..., MAX\_HIGH\}\). The number of x values, which is \(MAX\_HIGH / 0.01\), will be denoted by NX.

The number of thread-blocks of the grid is \(NS \times NX\). Each thread-block calculates P(F(x)) and the histogram of \(Prob(SM_H)\) for one sub-image of G and one x value. The size of thread-blocks is NRS.

The actions performed by the kernel are shown in Algorithm 4, where div and mod are the quotient and remainder operators, respectively. The three parallel reductions are efficiently executed using the CUDA function \(\_\_shfl\_down\_sync\) [57] and fast device memory atomic operations, as described in [60].

[Algorithm 4]
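
The reduction pattern can be sketched as follows: each warp folds its partial sums with __shfl_down_sync and only one lane per warp issues the atomic addition, in the spirit of [60]; the helper names are ours.

```cuda
// Warp-level reduction: after log2(32) shuffle steps, lane 0 holds the warp total.
__inline__ __device__ unsigned int warp_reduce_sum(unsigned int val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Per-warp accumulation into a global (or shared) counter: one atomic per warp.
__device__ void block_accumulate(unsigned int val, unsigned int* d_total)
{
    unsigned int sum = warp_reduce_sum(val);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(d_total, sum);
}
```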

3.4.4 Searching of hysteresis thresholds

The search for the hysteresis thresholds of each sub-image of G is performed by the kernel described in Algorithm 5. The number of warps of the grid is NS, and each warp i searches for the hysteresis thresholds pair of sub-image i. It is assumed that NX < 32.

The warp votes are performed by calling the CUDA function __ballot_sync [57], which, given a predicate, evaluates it for all threads in the current warp, and returns a 32-bit binary mask, in which bit j is set if the predicate evaluates to non-zero for lane j.

The searches for bits within the masks are performed efficiently using the CUDA integer intrinsic functions __ffs and __brev [61]. The first one finds the position of the least significant bit set to 1 in a 32-bit integer, and the second one reverses the bit order of a 32-bit unsigned integer.

[Algorithm 5]
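
The combination of these intrinsics can be sketched as follows for a generic per-lane predicate (e.g., "the x value of this lane is a local maximum of the histogram"); the function is illustrative and is not the literal Algorithm 5.

```cuda
// Each lane evaluates its predicate, __ballot_sync packs the results into a mask,
// and the first and last set bits give the first and last qualifying lanes (0-based).
__device__ void find_first_last(int predicate, int* first, int* last)
{
    unsigned int mask = __ballot_sync(0xffffffffu, predicate);
    if (mask != 0u) {
        *first = __ffs(mask) - 1;                 // least significant set bit
        *last  = 31 - (__ffs(__brev(mask)) - 1);  // most significant set bit
    } else {
        *first = *last = -1;                      // no lane satisfied the predicate
    }
}
```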

3.5 Hysteresis thresholding

The hysteresis thresholding is carried out by the kernel presented in Algorithm 6, which is very similar to Algorithm 2. The number of thread-blocks of the grid is NS, and the size of each thread-block is NRS. The i-th thread-block calculates the hysteresis map corresponding to the i-th sub-image of G following the same steps as Algorithm 2. Then, the j-th thread of the thread-block writes the pixel values specified in its hysteresis region (\(hyst\_reg\)) to the j-th region of the corresponding output edges sub-image.

To write the hysteresis region, each thread accesses the output edges image through a pointer to a structure of 32 8-bit unsigned integer members. As in the case of the gradient region reads, although the accesses to global memory are not coalesced, they satisfy the principle of spatial locality and are properly aligned.
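
For illustration, the unpacking of one 32-bit hysteresis region into the 32-byte output structure could look like the following; the structure name and the value used to mark edge pixels are assumptions.

```cuda
// The 32-bit hysteresis region of a thread is unpacked into 32 one-byte pixels and
// written through a pointer to a 32-byte structure, so the whole region is stored
// with one aligned structure store.
struct Region32 { unsigned char px[32]; };

__device__ void write_edge_region(unsigned int hyst_reg, Region32* d_edges_regions,
                                  int region_idx)
{
    Region32 out;
    for (int b = 0; b < 32; ++b)                          // bit b -> pixel b of the region
        out.px[b] = ((hyst_reg >> b) & 1u) ? 255 : 0;     // edge marker value is an assumption
    d_edges_regions[region_idx] = out;                    // one 32-byte store per region
}
```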

[Algorithm 6]
Table 1 MSE values for frame-level Canny edge detection and distributed Canny edge detection using Medina-Carnicer’s method for unsupervised determination of hysteresis thresholds
Table 2 Edge detection times (ms) for candidate thresholds sets \(C_{0.01}\) and \(C_{0.03}\)
Table 3 Statistics of GUD-Canny speedup for \(C_{0.03}\) with respect to \(C_{0.01}\)
Table 4 Statistics of kernels execution time proportions for set \(C_{0.01}\)
Table 5 Statistics of kernels execution time proportions for set \(C_{0.03}\)

4 Experimental evaluation

To evaluate the performance of GUD-Canny edge detection, we used the ground truth images of Heath’s dataset [62], which can be downloaded from ftp://figment.csee.usf.edu/pub/Edge_Comparison/images/results/. The 28 grayscale reference images of this dataset were selected by humans from a limited set of edge maps, which were obtained using the Canny edge detector with different values of its parameters.

We utilized the same two candidate thresholds sets selected in [11], which were those obtained by sampling the interval [0.01, 0.25] with steps 0.01 and 0.03, and that will be denoted by \(C_{0.01}\) and \(C_{0.03}\), respectively.

Our test machine had a 3.50 GHz Intel Core i7-7800X CPU and 32 GB of RAM. The GPU that we used was a GeForce RTX 2080 (Turing architecture with compute capability 7.5), and no optimization flags were utilized in our implementation.

4.1 Quality evaluation

In the first experiment, we compared the quality obtained by applying Medina-Carnicer’s method to the entire W\(\times\)H image (classical frame-level approach) with the quality resulting from executing the same method on each 32\(\times\)H sub-image (distributed approach, which is the one followed by GUD-Canny). Table 1 shows the mean-square errors (MSE) obtained for sets \(C_{0.01}\) and \(C_{0.03}\). In each row, for each candidate thresholds set, the minimum MSE is highlighted in bold. As can be seen, the good performance of Medina-Carnicer’s method is not only preserved in the distributed approach, but the latter even slightly outperforms the frame-level approach. For the set \(C_{0.01}\), the average MSEs for the classical and distributed approaches were 0.0534 and 0.0498, respectively. In the case of the set \(C_{0.03}\), the values were 0.0534 and 0.0502, respectively.

On the other hand, it can be observed that there is no significant difference between the quality obtained using \(C_{0.01}\) and that obtained using \(C_{0.03}\), as the average MSEs are 0.0498 and 0.0502, respectively.

4.2 Temporal efficiency evaluation

Table 2 presents the GUD-Canny edge detection times for sets \(C_{0.01}\) and \(C_{0.03}\). At the end of each column, statistics (average, minimum and maximum) are presented for all images, and for those of size 512\(\times\)512. Additionally, Table 3 shows the statistics of GUD-Canny speedup for \(C_{0.03}\) with respect to \(C_{0.01}\). From the presented results, we can see the following points:

  1. GUD-Canny fully satisfies real-time requirements, as its execution times are on average 1.2736 ms and 0.3637 ms for sets \(C_{0.01}\) and \(C_{0.03}\), respectively.

  2. For the set \(C_{0.03}\), the edge detection times are between 0.2814 and 0.3932 ms for 512\(\times\)512 images. Hence, GUD-Canny outperforms the temporal efficiency of existing GPU and FPGA implementations, such as the solution of Sangeetha et al. [23], whose edge detection time is 0.672 ms for 512\(\times\)512 images.

  3. The speedup obtained using \(C_{0.03}\) instead of \(C_{0.01}\) is significant, with values between 2.99x and 3.90x. The reason is that the number of linking maps that have to be calculated for \(C_{0.03}\) (\(36 \times NS\)) is much smaller than that for \(C_{0.01}\) (\(300 \times NS\)). This contrasts with the small difference in quality between the edge maps obtained with these two candidate thresholds sets.

4.3 Distribution of execution times

Tables 4 and 5 show the statistics (average, minimum and maximum) of kernels execution time proportions (expressed as percentages) for sets \(C_{0.01}\) and \(C_{0.03}\).

Unlike the case of existing GPU-based Canny edge detectors, the hysteresis stage is executed efficiently, as its average time proportions are 2.02% and 7.70% for sets \(C_{0.01}\) and \(C_{0.03}\), respectively.

As expected, due to their higher computational complexity, the most time-consuming operations are the calculation of the matrices \(SM_H\) (whose average time proportions are 82.38% and 39.39% for sets \(C_{0.01}\) and \(C_{0.03}\), respectively), followed by the computation of the distributions \(\{P(F(x))\}\) and the histograms of the matrices \(Prob(SM_H)\) (whose average time proportions are 7.21% and 24.87% for sets \(C_{0.01}\) and \(C_{0.03}\), respectively).

Figure 1 presents the statistics (average, minimum and maximum) of the total GPU time proportions corresponding to memory transfers. As can be seen, the penalty is moderate, because the percentages are less than 12% and 30% for sets \(C_{0.01}\) and \(C_{0.03}\), respectively.

Fig. 1 Statistics of memory transfer time proportions for sets \(C_{0.01}\) and \(C_{0.03}\)

Table 6 Edge detection times (ms) for sub-images sizes \(32 \times H\) (horizontal partition) and \(W \times 32\) (vertical partition)

4.4 Horizontal partitioning vs. vertical partitioning

Table 6 shows the edge detection times (ms) on 512\(\times\)512 images using the candidate thresholds set \(C_{0.03}\) for sub-image sizes \(32 \times H\) (horizontal partition) and \(W \times 32\) (vertical partition). For each image, the minimum execution time is highlighted in bold. In all cases, the number of sub-images is 16 and the number of regions per sub-image is 512.

From the presented results, we can see that the execution times are slightly lower using the horizontal partitioning. The average speedup is 1.14x. As explained in Sect. 3.4.1, although the spatial locality of accesses to global memory is improved using vertical partitioning, the number of operations in the linking process is reduced if the sub-image size is \(32 \times H\). Experimental evaluation has shown that the performance improvement due to the second factor is greater than that of the first.

5 Conclusions

This work has presented GUD-Canny, a novel GPU-based unsupervised and distributed implementation of the Canny edge detector. Our solution overcomes the two main limitations of current implementations of the Canny algorithm, namely the bottleneck caused by the hysteresis process and the use of manually fixed hysteresis thresholds.

Given a W\(\times\)H image, GUD-Canny computes the normalized gradient magnitude, partitions it into 32\(\times\)H sub-images, and calculates the optimal pair of hysteresis thresholds for each sub-image using Medina-Carnicer’s method [11]. Once the hysteresis thresholds are obtained, instead of running one costly multi-pass CPU-GPU hysteresis process on the entire image, hysteresis thresholdings (one per sub-image, using its specific hysteresis thresholds) are executed entirely on the GPU, independently and in parallel. Each thread-block performs the hysteresis process on one sub-image in shared memory, and represents each pixel of the hysteresis map with only one bit to optimize the use of the limited shared memory space.

Experimental evaluation showed that GUD-Canny only requires 0.35 ms on average to detect edges on 512\(\times\)512 images. Hence, it fully satisfies real time constraints, and is faster than existing GPU and FPGA implementations.