1 Introduction

In medicine, diagnosis is as important as treatment. However, providing the fastest and most accurate diagnosis is an important parameter as well. Advances in technology allow us to provide faster and more accurate solutions to current problems. For example, different types of technological devices are used to diagnose a variety of diseases. Generally, x-rays are used for bone and hard tissues, tomography for soft tissues, and fundus cameras for retinal imaging. They are also complementary devices that help doctors to diagnose diseases.

The retina is a light-sensitive tissue that consists of photosensitive cells (photoreceptors) and covers the inner surface of the eye. Photoreceptors convert light into nerve signals that travel along the optic nerves to the brain. Fundus camera systems (retinal microscope) are generally used to obtain images (fundus images) to document the status of the retina. Fundus images contain basic diagnostic information that can be used to determine whether the retina is healthy or damaged. Fundus images are widely used in the diagnosis of vascular and non-vascular pathologies [38] and also provide information on retinal vascular changes that are frequently observed in diabetes, glaucoma, hypertension, cardiovascular diseases, and strokes [10, 11], which generally alter the arrangement of blood vessels [49]. For example, hypertension alters the branching angle or curvature of the vessels [44] while diabetic retinopathy can cause neovascularization, that is, the formation of new blood vessels. If left untreated, such conditions can cause visual impairment and even blindness [42]. Early detection of those symptoms enables us to take measures for the prevention of vision loss and blindness [33]. Retinal blood vessels also play an important role in the early stage diagnosis of numerous diseases. However, manual retinal blood vessel segmentation is a time-consuming and repetitive process that requires training and skill. Therefore, automatic segmentation of retinal vessels is the first step in developing a computer-aided diagnostic system for ophthalmic disorders [12].

For rapid and timely retinal image analysis, ophthalmologists use automated methods, which, therefore, attract the attention of members of the medical image analysis community. Diagnostic analysis of retinal images of blood vessels is of particular interest to eye specialists. Therefore, most studies on computerized methods for retinal image analysis focus on segmentation of retinal blood vessels [39].

Retinal blood vessels are analyzed in many areas - including ophthalmology - for diagnosis, treatment planning and administration, and assessment of clinical outcomes [32]. Methods based on blood vessel segmentation include morphological (structural) filters and multi-scale line detectors. However, those methods fail to solve the main problems of retinal imaging analysis [40]. Therefore, new approaches based on machine learning have been developed in recent years.

Segmentation of retinal blood vessels is of interest in medical image analysis. The studies in this field can be examined under two groups; those using rule-based methods and those using machine learning based methods. Rule-based methods are vessel tracking approaches, adaptation filter, multi-scale approaches, and morphology-based methods. Machine learning based methods are traditional machine learning algorithms where training is carried out with manual segmentation of vessels and deep learning methods thanks to recent developments. Rule-based methods use the natural features of a pixel in a retinal image to determine whether that pixel belongs to the vessel structure. Manual segmentation of vessels has an indirect effect on the success of the algorithm. Previous studies have focused on rule-based methods. For example, Salem et al. proposed a Radius-Based Clustering algorithm, which uses a distance-based principle, to map the distributions of image pixels before segmentation. They made use of three main features of the image: (1) green image channels, (2) maximum gradient magnitude, and (3) local maximum of the largest eigenvalue calculated from the Hessian matrix [36]. Zhang et al. developed a matched filter method with first-order derivative of Gaussian for retinal vessel extraction and reported promising results [50]. Delibasis et al. (2010) proposed a model-based tracing algorithm for vessel segmentation and diameter estimation by using the main geometrical features of blood vessels. They computed the starting points using the Frangi filter and then determined the geometric properties of the vessel such as width, size and orientation. They used a measure of match to quantify the similarity between the image and the developed parametric model [6]. Mendonca et al. (2006) used the DoOG (Difference of Offset Gaussian) filter to detect vessel centerlines and used them as starting points. They used multi-scale morphologically modified top-hat filtering to smooth out the images. The algorithm is complemented with morphological restructuring and vessel filling [31]. Fraz et al. took a step further and combined the morphological bit plane slicing method with the vessel centerlines detection approach. They applied bit plane slicing to the improved gray level images of retinal blood vessels and combined it with vessel centerlines [12]. Ghoshal et al. proposed an improved vessel extraction scheme from retinal fundus images and performed a mathematical morphological process on the image. They converted the original image and the image removed from the vessels into negative gray scale images, which were extracted to balance the image and then were enhanced to obtain thin vessels by transforming the developed image into a binary image. Lastly, they merged the thin vessel image the binary image to obtain the vessel extracted image. They reported that they achieved satisfactory performance results [14].

Studies also recommend machine learning-based methods for retinal blood vessel segmentation. Controlled methods use information on whether a pixel belongs to a vessel or the retinal background. The algorithm learns how to decide whether a pixel belongs to a vessel. Training is carried out using training data along with manually segmented versions. Soares et al. (2006) applied a two-dimensional Gabor wavelet transform and supervised learning and used the two-dimensional Gabor wavelet transform responses at different scales of each pixel as features. They used a Bayesian classifier and classified a complex model quickly [38]. Ricci and Perfetti (2007) proposed a method based on line operator that can be used to calculate the features required for retinal blood vessel segmentation. Their model is faster and requires fewer features than other methods thanks to the line detector that it applies to the green-channel of retinal images [35]. Marin et al. proposed an artificial neural network technique and used a seven-dimensional feature vector to train a multi-layered forward-oriented artificial neural network. They used sigmoid activation function in each neuron of the network with three hidden layers. They reported that the trained network works successfully in other databases as well [29]. Due to recent developments, there is an increase in the number of studies using deep learning for blood vessel segmentation in retinal images. Convolutional neural networks (CNNs), which are robust methods used for image classification and segmentation, also yield successful results in vascular segmentation. CNNs detect and learn features from raw images instead of using manually designed features. Studies on different network structures have reported significant results regarding blood vessel segmentation in retinal images. Liskowski et al. used preprocessing steps to input images and then used a convolutional neural network for training. They used data augmentation and successfully performed vascular segmentation [27]. Melinscak et al. developed a method with an CNN architecture for determining whether each pixel is a vessel or background [30]. Wang et al. proposed a new retinal vessel segmentation framework based on Dense U-net and patch-based learning strategy and reported that the proposed method was promising in terms of common performance metrics [43]. Yao et al. developed a CNN-based blood vessel segmentation algorithm and used a two-stage binarization and a morphological operation to refine the preliminary segmentation results of fundus images. They reported that the proposed algorithm significantly improved the segmentation of blood vessels performance [46]. Guo et al. proposed a CNN-based two class classifier with two convolution layers, two pooling layers, one dropout layer and one loss layer for retinal vessel segmentation. They concluded that the proposed method had high accuracy and training speed without pre-processing [17]. Leopold et al. presented PixelBNN, which is an efficient deep learning method for automatic segmentation of fundus morphologies and reported that it had shorter test time and relatively good performance, considering the reduction of information due to the resizing of images during preprocessing [26].

Image quality and image enhancement play a significant role in object recognition and detection and is an integral part of image and video processing applications [2,3,4]. Advances in technology have provided high resolution images with high pixel density, clear details and rich information. Therefore, high image quality can meet the practical application requirements of image analysis and image understanding [48]. CNN yields successful outcomes in image classification and object detection, which however depends on some factors, such as the network architecture, selection of an activation function, and input image quality. Research shows that the performance of a CNN is negatively affected by poor quality input images, even if not visible [7, 48].

This study proposed a hybrid method based on CNN architectures and image processing techniques for segmentation of blood vessels in retinal images. Different from previous methods, the proposed method utilizes four-step image processing techniques to significantly extract blood vessels from the background and to standardize retinal images for the training of the network. The preprocessed images were randomly selected and cropped to certain sizes, which improved training performance through data augmentation. The post-training network was used to segment blood vessels in the retinal images.

The success of the proposed CNN method depends on several factors. Retinal blood vessels are generally detected by segmenting full-size images. However, the proposed CNN involves detecting regional vascular sections rather than segmenting full-size images. It divides an image into small pieces to detect the retinal blood vessels and then assembles the image back, resulting in an exponential increase in the amount of training data and thus improving the training performance of the CNN. CNNs can learn attributes from raw images and perform well in making sense of images. Moreover, the performance of the proposed CNN was greatly improved by applying preprocessing to the input images. The comparative results also showed that the proposed CNN is an efficient and successful method that can be used for extracting retinal blood vessels.

2 Materials and method

In this study, a fully CNN model was developed, and retinal blood vessels were extracted using semantic segmentation technique. The greatest challenge to detecting retinal blood vessels is differentiating them from surrounding tissues and lesions. There are several points critical for the success of the proposed method, which is successful for several reasons. First, it has transformed the problem of detection of retinal blood vessels from full segmentation to regional segmentation. It divides images into small pieces and detects retinal blood vessels and then brings them together. This leads to an increase in the amount of training data to thousands or even hundreds of thousands, resulting in a significant improvement in the training performance of the CNN. In addition, the method preprocesses input images, thereby, greatly improving the performance of the CNN.

The proposed method consists of three parts. Before processing, the data were divided into groups: training and test. Prior to training and testing, a series of preprocesses were applied to the images, and then, data augmentation was performed, which greatly affected the performance of the full CNN. This both diversified the training data and evolved the problem from full segmentation into regional segmentation of retinal blood vessels in retinal images. The image slices after data augmentation were used as training data. Test data were used to test the performance of the network after training. In the final step, the image slices were combined. The Fig. 1 shows the operational steps and the details of the methods used in Fig. 1 are given in the following sections.

Fig. 1
figure 1

Flow Diagram

2.1 Data set

The retinal images were obtained from the publicly accessible DRIVE database [22], which is the result of a screening program for diabetic retinopathy in 400 diabetic patients 25 to 90 years of age in the Netherlands. The database consists of 40 fundus images, 33 of which exhibit no symptoms of diabetic retinopathy while the rest present symptoms of mild early diabetic retinopathy. The images were captured using a Canon CR5 non-mydriatic 3CCD camera with a 45-degree field of view (FOV). Each image was captured using 24 bits per color plane at 768 by 584 pixels. The FOV of each image is circular with a diameter of approximately 540 pixels. The images are cropped around the FOV. Each image is provided with a mask image that delineates the FOV [34, 41].

The DRIVE database consists of 20 training and 20 test images along with manually segmented versions. Figure 2 shows an example of a training image. The database also has its own gold standard image, which is used both to compare computer-generated vessels with that of an independent observer and as a training outcome. Retinal images were also obtained from a second database, the STARE, consisting of 20 retinal fundus images (10 from healthy individuals and 10 from individuals with pathology). Each image was carefully hand-labeled to produce a ground truth vessels segmentation [20, 21].

Fig. 2
figure 2

Types of images in the DRIVE data set: (left to right), color fundus image, gold standard image and mask image [22]

2.2 Gray level transformation

Gray level transformation is often used in image processing. Several different methods are used for gray level transformation, which is achieved by adding a constant value to or multiplying a constant value by the color channels of a given image. Its generalized version is as follows:

$$ GL={k}_{\begin{array}{c}r\\ {}\ \end{array}}R+{k}_{\begin{array}{c}g\\ {}\ \end{array}}G+{k}_{\begin{array}{c}b\\ {}\ \end{array}}B $$
(1)

where GL is the gray level of a given pixel. R, G and B are the red, green and blue channel values, respectively, and \( {k}_{\begin{array}{c}r\\ {}\ \end{array}} \), \( {k}_{\begin{array}{c}g\\ {}\ \end{array}} \) and \( {k}_{\begin{array}{c}b\\ {}\ \end{array}} \) are the factors of three different channels. These coefficients must satisfy the condition in Eq. 2 [47].

$$ {\mathrm{k}}_{\begin{array}{c}r\\ {}\ \end{array}}+{k}_g+{k}_b=1 $$
(2)

2.3 Gray level normalization

Uneven illumination, local brightness and contrast differences are common problems observed in obtaining fundus images [23]. To overcome those problems, data standardization is applied to image sets that are used as training data. In the first stage of data standardization, the mean value (Eq. 3) and standard deviation (Eq. 4) of the entire image data set are calculated.

$$ \upmu =\frac{\sum_{k=0}^l\ {\sum}_{i=0}^n\ {\sum}_{j=0}^m{I}_{i,j}^k}{N} $$
(3)
$$ \sigma =\frac{\sqrt{\sum_{k=0}^l\ {\sum}_{i=0}^n\ {\sum}_{j=0}^m{\left({I}_{i,j}^k-\upmu \right)}^2}}{\mathrm{N}-1} $$
(4)

where l is the total number of images, n is the image height, m is the image width, and N is the total number of pixels.

Standardization is then calculated using the standard deviation and mean values (Eq. 5).

$$ {I}_{i,j}^k=\frac{I_{i,j}^k-\upmu}{\boldsymbol{\sigma}} $$
(5)

In the next stage, the standardized input image data set is normalized (Eq. 6).

$$ {I}_{i,j}^k=\frac{I_{i,j}^k-\min \left({I}^k\right)}{\mathit{\max}\left({I}^k\right)-\min \left({I}^k\right)}\ast 255 $$
(6)

where min(Ik) and max(Ik) are, respectively, the minimum and maximum pixel gray level values of the first image [37].

2.4 Contrast limited adaptive histogram equalization (CLAHE)

Vessels that are characterized as small and extreme should be accurately segmented in fundus images because they generally have low contrast with the background. The histogram equalization can be used to enhance image contrast.

For a given gray level image, let ni be the number of gray levels at i occurrences. The probability that a pixel of level i occurs in the image is calculated as follows:

$$ p(i)=\frac{n_i}{n},\kern0.75em for\kern0.5em 0\le i<256 $$
(7)

where i is the total number of gray levels (usually 256), n is the total number of pixels in the image, and p(i) is the histogram value of level i in the image. Hence, cumulative distribution function is calculated as follows:

$$ cdf(i)=\sum \limits_{j=0}^ip(j) $$
(8)

A transform is calculated to obtain an image with a flat histogram. The resulting image has a linear increasing cumulative distribution function.

The histogram equalization calculates and normalizes the density distribution of a given image. The use of all pixels in normalization, however, results in noisy data. Therefore, contrast limited adaptive histogram equalization (CLAHE) can be used alternately. CLAHE divides the image into sections and then calculates and equalizes the histogram of each section. Applying histogram equalization to sections may result in noisy data. A contrast limit is set to overcome this.

Rayleigh Distribution is generally used to clip the histogram in CLAHE, which is also used to prevent excessive contrast magnification [23]. Rayleigh Distribution is expressed as follows:

$$ g={g}_{min}+{\left[2\left({\propto}^2\right)\ln \left(\frac{1}{1-P(f)}\right)\right]}^{0.5} $$
(9)

where P(f) is the cumulative probability distribution and gmin is the minimum pixel value.

In the last stage, subregions are joined using bi-linear interpolation method, resulting in an improved image [19, 24]. Fig. 3 shows the pixel density distribution, i.e., histogram, of a randomly selected image before and after CLAHE.

Fig. 3
figure 3

Effect of CLAHE on vessels; (a) Fundus image and histogram before CLAHE, (b) Fundus image and histogram after CLAHE

2.5 Gamma correction

Gamma correction applies a linear or nonlinear mapping between input and output gray values depending on the value of gamma. If gamma is 1, the mapping is linear. If gamma is less than 1, the mapping is weighted towards brighter values. If gamma is greater than 1, the mapping is weighted towards darker values. It is generally expressed as in Eq. 10.

$$ {I}_{out}=A\ast {I}_{in}^{\gamma } $$
(10)

Where A is generally 1 and γ is the gamma value [5].

2.6 Deep learning

Deep learning is a branch of machine learning that has provided significant solutions to many problems in recent years. It is based on artificial neural networks (ANNs), which, however, fell out of favor in the 2000s due to small data sets, low processing capacity of computers, insufficient hardware despite increasing number of layers and relatively ineffective non-linear activation functions [45]. ANNs have become popular again due to the emergence of big data, improvements in hardware technology, new non-linear activation functions and new methods of initialization [15, 18] and new methods of regularization.

Multilayer ANN models as shown in Fig. 4, aim to calculate the weight and bias parameters with which the model will yield the best output score. Multilayer sensors have many deep-layered variants, some of which are CNNs [25]. CNNs are advanced neural networks inspired by the visual cortex of living creatures and specialized for processing data with grid-like topology. Images are regarded as 2-dimensional grid-type structures.

Fig. 4
figure 4

Multilayer ANN Scheme

Convolution (Eq. 11) is the response of an artificial nerve cell to excitations [25]. CNNs employ a special linear mathematical operation referred to as convolution as well, hence the name “convolutional neural network.” In CNN terminology, the first argument is the input, the second argument is the kernel and the output is the feature map [16]. Figure 4 shows the schematic of a given ANN and CNN.

$$ \mathrm{s}\left(\mathrm{t}\ \right)=\left(\ \mathrm{x}\ast \mathrm{w}\right)\left(\mathrm{t}\right) $$
(11)

Deep learning is based on learning multiple levels of data resulting from the interaction of multiple abstract layers. The output of a layer is the input to the next layer. In the first layers, low quality features are derived, from which high-quality features are derived, resulting in a hierarchical representation. Deep learning provides low- and high-quality feature extraction from a given raw image [1].

In a CNN, the convolution layer is the basic layer where the perception of the basic features of an image is carried out. In Eq. 12, the output attribute map is expressed after the convolution in its most general form. Given a convolution layer l, the input consists of \( {kb}_1^{\left(l-1\right)} \) attribute maps of the previous convolution layer. When l = 1, this value is the channel size \( {kb}_{\mathbf{2}}^{\left(l-1\right)} \), \( {kb}_3^{\left(l-1\right)} \) image dimensions of the image input.

$$ {Y}_i^{(l)}={B}_i^{(l)}+{\sum}_{j=1}^{kb_1^{\left(l-1\right)}}{K}_{i,j}^{(l)}\ast {Y}_j^{\left(l-1\right)} $$
(12)

where \( {Y}_j^{\left(l-1\right)}\ \mathrm{is}\ \mathrm{the}\ \mathrm{input} \), \( {Y}_i^{(l)} \) is the ith output attribute matrix, \( {B}_{\boldsymbol{i}}^{\left(\boldsymbol{l}\right)} \) is the bias matrix, and \( {K}_{i,j}^{(l)} \) is the filter used in convolution.

The sharing layer is another important layer of a CNN. Pooling refers to the process in which the value of pixels in an output matrix is calculated using the statistical summary of the pixels around the data at a certain point in the input matrix, thereby creating all the values of the output matrix [16]. Pooling is used to reduce size. The dimensions of the pooled matrix is calculated using Eq. 13:

$$ {G}_2x\ {Y}_2x\ {D}_2=\left(\frac{G_1-F}{S}+1\right)x\ \left(\frac{Y_1-F}{S}+1\right)x\ {D}_1 $$
(13)

where G is the width, Y is the height, D is the depth, S is the number of shifting steps, and F is the filter size.

Multi-layer artificial neural networks can use activation functions to learn nonlinear real-world information. ReLU, which is the most common activation function in deep layered artificial neural networks, is expressed as follows:

$$ f(x)=\kern0.5em \max \left(0,x\right) $$
(14)

2.7 Data augmentation

Data augmentation, such as rotation, scaling, cropping, and mirroring, is used to expand the training of a data set, and thus, to improve training performance in deep learning. In this study, data augmentation was performed by dividing an input image into small regions.

A training set is prepared using random patch generation procedure, where the points (x, y) that satisfy the conditions in Eqs. 15 and 16 are randomly selected. Cropped patches with a dimension of 28 × 28 remain in the center of the image slice.

$$ 14<x<W-14 $$
(15)
$$ 14<y<H-14 $$
(16)

where H and W are the image height and width, respectively. The points are randomly selected, and therefore, the image slices overlap. Moreover, the image slices are completely or partially out of sight, which allows the proposed CNN to distinguish between visual field boundaries and retinal blood vessels [30].

The test set is prepared using ordered patch generation with sliding procedure, where multiple consecutive overlapping patches with a striding step of 5 pixels and a dimension of 28 × 28 are obtained in each testing image.

3 Experiments and results

This study proposed a method based on image processing techniques and deep CNN architectures for segmentation of blood vessels in retinal images. The method primarily applied image processing techniques to extract blood vessels from the background and standardized the retinal images for the training of the network. The preprocessed images were randomly selected and cropped to certain sizes, which improved training performance through data augmentation. After preprocessing, the network was trained. The resulting network was used to perform segmentation of blood vessels in retinal images.

The proposed method was performed in MATLAB environment on a PC Intel Core i7 @ 2.6 GHz with 16 GB RAM and Nvidia GTX 960 M running Windows 10. The fully CNN was trained on a GPU (graphic processing unit). Training lasts an average of 8 h on a single GPU whereas preprocessing steps are completed much faster. The DRIVE database consists of 20 training and 20 test images. It takes an average of 4.5 s to crop 40,000 random packaged images from the training set and to generate and preprocess 8000 consecutive test packages from the test set.

The proposed method used color fundus images as input. Preprocessing began with the process of transforming the color images into gray level images. Gray level transformation was carried out using Eq. 1. Experiments showed that \( {k}_{\begin{array}{c}r\\ {}\ \end{array}}=0.299 \), \( {k}_{\begin{array}{c}g\\ {}\ \end{array}}=0.587 \) and \( {k}_{\begin{array}{c}b\\ {}\ \end{array}}=0.114 \) were the ideal coefficient values for gray transformation. Gray level normalization was then performed. Afterwards, CLAHE was applied to the normalized images. In this process, 8 × 8 pixel dimensions were selected for image sub-regions. Normal distribution was selected for the histogram equalization intensity distribution. Lastly, gamma correction was performed, and the optimal gamma was found to be 0.833. In the last step, the gray level of the image ranged from 0 to 1.

After preprocessing, data augmentation was performed by dividing an input image into small regions in order to increase the amount of data, and thereby, to improve training performance. The data set consisted of 40 retinal images along with manually segmented versions. 2000 images (28 × 28) were extracted from each training image. Figure 5 shows the image slices. A total of 40,000 image slices were prepared for training.

Fig. 5
figure 5

Randomly acquired 28 × 28 image batches and vessel tags

Striding step 5 was selected, and the test images were sequentially cropped from the full image to obtain image slices (28 × 28). A total of 246,340 image slices (12,317 per image) were obtained.

Data augmentation was followed by the generation of fully CNN for segmentation of retinal blood vessels. Figure 6 shows the overall structure of the network. CL is the convolution layers, BNL is the batch normalization layers, MP is the maximum pooling, and TCL is the transpose convolution layer (upsampling).

$$ Input\to \left[ CL1\to BNL\to ReLU\ \right]\to \left[ CL2\to BNL\to ReLU\right]\to \left[ MP\to CL3\to BNL\to ReLU\ \right]\to \left[ CL4\to BNL\to ReLU\ \right]\to \left[ TCL\to CL5\to BNL\to ReLU\ \right]\to \Big[ CL6\to BNL\to ReLU\to 1x1\ CL\to Softmax\to Pixel\ Classification $$
Fig. 6
figure 6

CNN Network Architecture

The first layer consisted of preprocessed image slices (1 × 28 × 28). The inputs were convoluted in the first convolution layer (C1). Layer C1 consisted of 32 filters of size 1 × 28 × 28. The pixel addition method was used to avoid a dimensional difference between the convolution output and input. The ReLU was the activation function, and batch normalization layer was added at the end of the output. The second convolution layer (C2) had the same characteristics as C1. Sequential convolution layers were followed by maximum pooling. The layer dimensions were 2 × 2, and the striding step was 2. After maximum pooling, the input dimensions were reduced by half (14 × 14). Then, there were two more convolution layers (C3, C4). C3 and C4 were 3 × 3 as well, however, the number of filters was increased to 64 because the input sizes were reduced. The activation functions were ReLU and supported by the batch normalization layer. Subsequently, outputs C4 were convoluted using the upsampling layer to achieve a dimensional increase. The size of the transpose convolution layer was 2 × 2. It consisted of 64 filters and the striding step was 2, resulting in a two-fold increase in input size (28 × 28). The next layers were C5 and C6, which were the same as C1 and C2. The last part involved the combination of features for the calculation of class scores and contained the convolution layer with a size of 1 × 1. In that layer, filters as many as classes, that is, two filters (retinal blood vessels and background) were applied. The SoftMax function was applied onto the outputs followed by classification involving pixel-based classification rather than a full image-based scalar score calculation. This layer showed to which class each pixel belonged and labeled background pixels as 0 and retinal blood vessels as 1. The output image matrix of 0 and 1 was converted to black and white image in the next stage.

In the training stage, cross-entropy function was used as cost function (Eq. 17).

$$ {J}_{CE}\left(y,{y}^{\prime}\right)=-\sum {y}_i\mathit{\log}{y}_i^{\prime }+\left(1-{y}_i\right)\log \left(1-{y}_i^{\prime }\ \right) $$
(17)

where yi and \( {y}_i^{\prime } \) are the class label and estimation matrices, respectively. The two matrices have the same dimensions and represent the neighborhood of an (x, y) coordinate of the preprocessed images before they are divided into image slices.

Optimization in an ANN is a process of minimizing the difference between the actual values and the ANN output. Gradient-descent is the most common optimization method. In this study, stochastic gradient descent was used as the optimization method.

In CNNs, weights and biases of neurons are updated using the back-propagation method, in which the difference between predicted and actual outputs is calculated, and then, the difference is multiplied by the learning rate parameter, and the resulting value is subtracted from the weight value. A high learning rate causes the network to be affected by the input data and makes it prone to overshoot the minima, whereas a low learning rate results in prolonged learning. The learning rate of the proposed deep network during training was 0.0001.

The batch size is the number of training examples used per iteration. In general, 32, 64 and 128 iterations are used. The training of an ANN with each batch of data is referred to as 1 iteration. The completion of training of all data batches of a data set is referred to as epoch. In this study, the batch size was 32, and the network was trained for 60 epochs, i.e. 75,000 iterations. Figure 7 shows the feature maps obtained from training.

Fig. 7
figure 7

Feature maps (filter) of convolution layers in proposed CNN architecture (CL1, CL2, CL3, CL4, CL5, CL6)

Striding step was set to 5, and the test was performed on 246,340 image slices (12,317 per image). Each image slice was tested, and the results were stored for later assembly. Overlapping pixels were averaged. These mean values correspond to the likelihood that a pixel belongs to a vessel or not. These values ranging from 0 to 1 were set to 0 or 1 using the thresholding method. As a result of experiments, the threshold value was selected as 0.5.

The difference between estimated and actual values is calculated to evaluate the performance of a network. As mentioned earlier, the DRIVE database was used in this study. The DRIVE database consists of 20 training and 20 test images along with manually segmented versions. Network performance was determined comparing the network output on those images. Five most common performance metrics, accuracy (Eq. 18), specificity (Eq. 19), sensitivity (Eq. 20), precision (Eq. 21), and F1 score (Eq. 22) as well as Structure-measure (SM) proposed by Fan et al. [8], WFb proposed by Margolin et al. [28], E-measure proposed by Fan et al. [9] was used in this study.

$$ \mathrm{ACC}=\frac{TP+ TN}{TP+ TN+ FP+ FN}x\ 100 $$
(18)
$$ \mathrm{SPEC}=\frac{TN}{TN+ FP}x\ 100 $$
(19)
$$ \mathrm{SENS}=\frac{TP}{TP+ FN}x\ 100 $$
(20)
$$ \mathrm{PREC}=\frac{TP}{TP+ FP}\ x\ 100 $$
(21)
$$ {F}_1=2\ x\frac{PREC\ x\ SENS}{PREC+ SENS}\ x\ 100 $$
(22)

Accuracy (ACC), which is the ratio of correctly classified pixels (TP and TN) to the total number of pixels, is the most widely used metric for evaluation. Specificity (SPEC) is a measure of how well a model can detect truly negatives. In other words, it is a measure of the ability of a model to separate vessel pixels from the retinal background pixels. Sensitivity (SENS) is a measure of success of the network in identifying vessel pixels that are truly vessel pixels. The higher the score, the higher the success. Precision (PREC) is a measure of how close values corresponding to inputs are to the average value in repeated measures. Table 1 shows the performance metrics of the proposed method in the DRIVE test images. Figure 8 shows the radar graph, which is a different display technique.

Table 1 Performance Metrics for Test Images
Fig. 8
figure 8

Radar graph comparison between performance metrics of test images; (a) Accuracy, (b) Specificity, (c) Precision, (d) Sensitivity

The proposed fully convolutional ANN yielded a similar accuracy rate on all images and calculated a standard deviation of 0.0042, indicating that it is resistant to uneven distribution of light and to low contrast.

The proposed method was also tested using the STARE database. Unlike the DRIVE database, the STARE database does not contain FOV images. Therefore, color images were filtered to obtain FOV masks. The training and test data for the STARE database were randomly determined as 50 %. All steps were performed on the STARE database in the same order as in the DRIVE database. Table 2 shows the performance metrics of the proposed method in both data sets.

Table 2 Performance Metrics of Proposed Method

CNNs are good at extracting features from and segmenting blood vessels in raw retinal images. The preprocessing steps greatly improved the performance of the proposed CNN on the DRIVE data set (Table 3). Table 4 compares the overall performance metrics of the proposed method with those reported by previous studies for the DRIVE and STARE data sets while Figs. 9 and 10 presents the radar graph comparative representation.

Table 3 Performance with Pre-processing and without Pre-processing
Table 4 Comparison of proposed method and previous studies
Fig. 9
figure 9

Radar graph comparison between performance metrics of test images and previous studies in DRIVE dataset; (a) Accuracy, (b) Sensitivity, (c) Specificity

Fig. 10
figure 10

Radar graph comparison between performance metrics of test images and previous studies in STARE dataset; (a) Accuracy, (b) Sensitivity, (c) Specificity

Table 4, Figs. 9 and 10 show that the proposed method yields better results than some studies. In terms of accuracy, the proposed method is one of the most successful ones. The performance metrics of the proposed method and of other deep learning methods indicate that CNNs can be successfully used for retinal vessel segmentation. Rule-based methods have the lowest performance whereas ANNs in general and CNNs in particular are the most successful methods.

It is not enough to use the criterion “accuracy” to perform performance evaluation. Therefore, rank analysis was performed on the proposed method and other studies for general evaluation in DRIVE and STARE datasets. Table 5 shows the results of the analysis.

Table 5 Rank analysis of proposed method and previous studies (DRV: DRIVE dataset, STR: STARE dataset)

Most studies did not use the precision metric in rank analysis. Therefore, it was also excluded from this study to make the analysis more objective. The mean rank analysis of accuracy, sensitivity, and specificity suggests that the proposed method is overall one of the most successful methods. All the metric scores of the proposed method are higher than the total mean scores, and what is more, the other methods that perform better in total rank analysis have some metric scores far below the total mean scores (Table 4). We can, therefore, conclude that the proposed method is one of the most consistent methods in terms of specificity, sensitivity, and accuracy.

Figure 11 shows the segmentation outputs of the proposed method for retinal vessel segmentation.

Fig. 11
figure 11

Blood vessel segmentation in retinal images using the proposed method; (a) Preprocessed test image, (b) Gold standard image, (c) Output image. First 3 images belong to DRIVE database, the rest of them belongs to STARE database

4 Conclusion & Discussion

Computer-aided algorithms are being used more and more widely on medical images to detect eye diseases. Accurate diagnosis and assessment of diseases depends both on image acquisition and interpretation. There has been significant progress in image acquisition thanks to recent advances in technology. Moreover, advances in computer technologies and improvements in processing performance have made the extraction of data from medical images using computer-aided algorithms more and more popular.

Medical images are generally interpreted by doctors. However, expert interpretation is costly and might differ widely. Computer-aided tools, such as machine learning and image processing, help experts diagnose diseases. One of those tools is deep learning, which is widely used thanks to advances in computer hardware and software.

This study proposed a method of blood vessel segmentation in color retinal images. Although CNNs can learn and classify important features from raw images, preprocessing was conducted to improve classification performance. Preprocessing consists of four steps; (1) gray level transformation, (2) gray level normalization to eliminate noise that adversely affects image quality, such as uneven illumination, local brightness etc., (3) contrast-limited adaptive histogram equalization to eliminate contrast differences, and (4) Gamma correction to better regulate gray levels.

Data augmentation was used to improve the learning capacity and performance of the fully convolutional network. Central points were randomly determined to divide the input images into small slices of size 28 × 28. The image slices were introduced to CNN as input. During learning, networks autonomously extract low quality features that do not change with respect to small geometric variations, and then, gradually transform them into higher quality features. This makes raw images more usable and suitable for blood vessel segmentation. A complex function is used to automatically evaluate the progressively learned features, and input image slices corresponding to estimation outputs are obtained.

The striding step was set to 5, and the image was cut into slices with a size of 28 × 28 in order to improve the network forecasting performance. The overlapping pixels were averaged, and the thresholding method was used to decide whether a pixel belonged to a vessel or not, which improved the network forecasting performance.

The fact that CNNs automatically extract features from images makes it easier to overcome problems in the selection of correct features. In the first layers, simpler features are extracted while, in the following layers, more complex feature maps are extracted. Therefore, the process can express the desired features better than manually extracted features.

In the proposed method, the preprocessing steps greatly improved the performance of the neural network, and the CNNs had good forecasting performance in retinal blood vessel segmentation. The superior performance of the supervised learning networks shows that eliciting information on dependencies between class labels for neighboring pixels is of importance for segmentation of continuous anatomical structures, such as retinal blood vessels. We, therefore, believe that the proposed method is an effective and reliable method that can be used by future studies on artificial intelligence-based disease diagnosis systems to detect retinal blood vessels. In recent years, CNNs have been popular in many areas related to image processing and salient object detection [13, 51]. Most CNN-based salient object detection approaches regard the image patches as input and use multi-scale or multi-context information to obtain a final saliency map. In this context, future studies can use salient edge features for segmentation of retinal blood vessels and detect vessels, especially their boundaries, more accurately.