Introduction

Retinal blood vessel segmentation has been extensively used for diagnosis, screening, and clinical study. For example, the morphological characteristics of retinal blood vessels have been associated with cardiovascular and coronary disease in adult life [23]. Vessel segmentation is the first step in the quantitative analysis of retinal fundus images. The segmented vasculature can assist in the implementation of screening programs for diabetic retinopathy, the detection of the foveal avascular region, and the diagnosis of hypertension and glaucoma [7]. In [12], the retinal vascular tree was adopted for biometric identification because its structure is unique to each individual.

In clinical practice, the retinal vascular tree is usually delineated by ophthalmologists. However, manual annotation of retinal blood vessels requires a high level of skill and is tedious and time-consuming. Automated retinal vessel segmentation methods are therefore urgently required in clinical practice to reduce the annotation time and the workload on ophthalmologists. However, vessel segmentation is a challenging problem. The main difficulties stem from the following aspects [6]: (1) the apparent width of vessels ranges from 1 to 20 pixels, depending on both the anatomical width of the vessel and the image resolution. (2) The shape, size, and local intensity values of vessels can vary greatly, increasing the difficulty of segmentation. (3) Various structures appear in retinal images, including the retinal boundary, optic disk, fovea, and pathologies, which makes it difficult to segment vessels accurately based on hand-crafted features. (4) Microvessels with low contrast against the background further complicate the segmentation task.

Hence, the automated segmentation of retinal blood vessels has received significant attention over recent decades. Existing algorithms for vasculature segmentation can be broadly categorized into unsupervised and supervised methods.

Unsupervised methods for vessel segmentation

Unsupervised methods mine the inherent patterns of blood vessels in retinal images to determine whether a pixel is a vessel or not. With regard to retinal vessel segmentation, matched filtering, mathematical morphology, and vessel tracking algorithms are generally applied. The matched filtering methodology convolves a 2-D kernel with the retinal image and the matched filter response indicates the presence of the feature. For example, Hoover et al. [9] presented a threshold probing technique on a matched filter response image and combined local and region-based properties of the vessels for segmentation. Mathematical morphology is employed in combination with curvature evaluation for vessel segmentation [36]. Vessel tracking algorithms segment a vessel between two points using local information and work at the level of a single vessel [17].

Unsupervised methods can perform vessel segmentation without any prior labeling knowledge, making them suitable for large amounts of unlabeled data. They also have drawbacks, however. Unlike supervised methods, they do not exploit training data or hand-labeled ground truths in the design of the algorithm, and they therefore show some deficiencies in retinal vessel segmentation performance. For example, matched filtering is effective for healthy images but tends to produce many false responses on pathological images. Vessel tracking yields highly accurate vessel widths, but cannot detect vessels or vessel segments that lack seed points. The main disadvantage of relying exclusively on morphological methods is that they do not exploit the known cross-sectional shape of vessels [7]. Thus, supervised methods have been widely studied and applied by an increasing number of researchers.

Supervised methods for vessel segmentation

In supervised methods, the general framework for vessel segmentation contains two steps, i.e., feature extraction and classification. Features extracted from the original images can be further divided into hand-crafted features and automatically learned features. In the classification stage, the extracted features are fed into a classifier such as the K-nearest neighbor (KNN) algorithm, support vector machine (SVM), or artificial neural network (ANN) to predict whether each pixel is a vessel or not. Niemeijer et al. [24] extracted a feature vector for each pixel consisting of the responses of a Gaussian matched filter and its first- and second-order derivatives at multiple scales, augmented with the green plane of the RGB image; the KNN algorithm was then employed to estimate the probability of a pixel belonging to a vessel. Marin et al. [20] proposed a neural network-based methodology for retinal vessel segmentation, designing a 7-D feature vector composed of gray-level and moment invariant-based features as the input of a five-layer feedforward neural network trained for classification. Ricci [26] employed line operators as the feature vector and an SVM for pixel classification. Soares et al. [30] used the 2-D Gabor wavelet and supervised classification for retinal vessel segmentation: each pixel is represented by a feature vector composed of the pixel's intensity and 2-D Gabor wavelet transform responses taken at multiple scales, and a Gaussian mixture model classifier labels each pixel as vessel or non-vessel. Staal [31] presented a ridge-based vessel segmentation methodology in which local features are extracted for every pixel; in total, 27 features are selected from convex set regions and individual pixels, and a KNN classifier is used for classification. Fraz et al. [23] presented an ensemble system of bagged and boosted decision trees for retinal vessel segmentation, constructing a feature vector based on the orientation analysis of the gradient vector field, morphological transformation, line strength measures, and Gabor filter responses.

In the methods above [23, 24], feature selection is based on artificially designed features. Although these approaches are appropriate in some situations, they cannot generalize well and may miss opportunities to discover novel patterns and features. Moreover, designing hand-crafted features is time-consuming, and the resulting features are application-specific.

Recently, deep neural networks (DNNs) have outperformed state-of-the-art methods in many image recognition tasks [3, 5, 11]. This success can be partially attributed to the ability of DNNs to automatically extract hierarchical features from data [37]. In terms of retinal vessel segmentation, Li et al. [15] proposed a cross-modality approach in which the first modality is an image patch of \(16\times 16\) pixels and the second is the label map of all pixels in that patch. Liskowski et al. [16] used a deep convolutional neural network (CNN) as a pixel-wise classifier: the class label of each pixel is predicted by providing a square window centered on that pixel as the input. They also considered variants of the method, including a NO-POOL architecture and structured prediction.

Previous patch-based methods have achieved excellent performance for vessel segmentation, but they have two drawbacks. First, their computational efficiency is poor, because each patch must be processed separately by the network; this is infeasible for large-scale retinal images. Second, the shape, size, and local intensity values of vessels vary widely across retinal images. As the size of the square window is fixed in patch-based methods, they may produce unsatisfactory results in regions where that contextual window size is inappropriate.

To address the above challenges, we propose a novel deep supervised fully convolutional network that fuses multi-level features with different receptive fields to obtain precise segmentation results. Within a single forward propagation, the network produces a probability map of the same size as the original input, and is therefore much faster than previous patch-based methods. In addition, we introduce a deep supervision (DS) mechanism to counteract the problem of vanishing gradients, and we exploit transfer learning to mitigate the problem of insufficient training data. Moreover, the approach learns hierarchical features automatically, reducing the need for time-consuming hand-crafted feature design. The proposed method has been evaluated on three publicly available databases: DRIVE, STARE, and CHASE_DB1. The performance metrics illustrate the advantages of our method, which achieves better or comparable performance to state-of-the-art methods for retinal vessel segmentation.

The remainder of this paper is organized as follows. The proposed method is presented in the “Methods” section. In the “Experimental evaluation” section, performance metrics are defined and the robustness of the algorithm is assessed in a series of experiments. Finally, our conclusions are given in the “Conclusion” section.

Methods

Architecture of the proposed network

In this section, we build a deep network that fuses multi-level hierarchical features and inserts auxiliary supervision at intermediate layers for vessel segmentation. Inspired by recent studies of the fully convolutional network (FCN) [18], which converts the fully connected layers of a traditional CNN into convolutional kernels and can thus be applied to input images of arbitrary size, we design a deep network architecture that works with very little training data and yields more precise segmentations. As shown in Fig. 1, the architecture of the proposed network consists of two parts, i.e., downsampling and upsampling modules.

The downsampling module is composed of convolutional and max-pooling layers, which capture discriminative semantic information and are robust to appearance changes. In a typical convolutional layer, a feature map is produced by convolving the input with convolutional kernels, adding a bias term, and finally applying a nonlinear activation function. By denoting the ith feature map of the lth layer as \(y_i^l\) and the kth feature map of the previous layer as \(y_k^{l-1}\), a convolutional layer is formulated as

$$\begin{aligned} \left\{ \begin{aligned}&y_i^l=f\left( S_i^l\right) ,\\&S_i^l=\sum _{k=1}^{n_{l-1}}y_k^{l-1}\bigotimes W_{ki}^l+b_i^l, \end{aligned} \right. \end{aligned}$$
(1)

where \(S_i^l\) is the net input of the ith feature map in the lth layer, \(f(\cdot )\) is the nonlinear activation function, \(\bigotimes \) denotes the 2D spatial convolution, \(W_{ki}^l\) is the convolutional kernel connecting the kth feature map of the previous layer \(y_k^{l-1}\) to the ith feature map \(y_i^l\) in the lth layer, \(b_i^l\) is the bias term, and \(n_{l-1}\) is the number of feature maps in the \((l-1)\)th layer. During the downsampling stage, the RGB image is passed through a stack of \(3 \times 3\) convolutional layers. The convolution stride is fixed to 1 pixel, and the input of each convolutional layer is spatially padded to preserve the resolution after convolution, i.e., the padding is 1 pixel for the \(3 \times 3\) convolutional layers. Spatial pooling is carried out by three max-pooling layers, each performed over a \(2 \times 2\) pixel window with a stride of 2.
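As an illustration, the downsampling path can be sketched in PyTorch as follows. The channel counts, the number of convolutions per stage, and the framework choice are assumptions for illustration only, not the exact configuration of the original network.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """Stack of 3x3 conv + ReLU layers; each conv realizes Eq. (1) with f = ReLU."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class DownsamplingPath(nn.Module):
    """VGG-style downsampling module with three 2x2 max-pooling layers."""
    def __init__(self):
        super().__init__()
        self.stage1 = conv_block(3, 64, 2)      # full resolution
        self.stage2 = conv_block(64, 128, 2)    # 1/2 resolution
        self.stage3 = conv_block(128, 256, 3)   # 1/4 resolution
        self.stage4 = conv_block(256, 512, 3)   # 1/8 resolution
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(self.pool(f1))
        f3 = self.stage3(self.pool(f2))
        f4 = self.stage4(self.pool(f3))
        return f1, f2, f3, f4   # multi-level features, one per branch output
```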

Fig. 1
figure 1

The architecture of the proposed deep supervised fully convolutional network for vessel segmentation. Each box corresponds to a multi-channel feature map. The number and sizes of feature maps are depicted in the format (Number \(\times \) size). \(l_m(W,w_\mathrm{b}^{(m)})\) is the loss function for the mth branch output and \(\mathcal {L}_\mathrm{fuse}(W,W_\mathrm{b},W_\mathrm{f})\) is the loss function at the fusion layer. W and \(w_\mathrm{b}^{(m)}\) are the parameters of the mainstream network and the weights of the auxiliary classifier in the mth branch \((m=1,2,3,4)\), respectively, \(W_\mathrm{f}\) is the fusion weight, \(W_\mathrm{b}=(w_\mathrm{b}^{(1)},\ldots ,w_\mathrm{b}^{(4)})\). The arrows denote the different operations, where the thin blue and red arrows represent computing the loss functions and weighted-fusion outputs, respectively

After several downsampling stages, the feature maps are gradually reduced in size and become much smaller than the ground truth. Therefore, upsampling is implemented to ensure that the output at every stage has the same size as the ground truth. The upsampling module comprises convolutional and deconvolutional layers that recover a fine spatial resolution for precise localization. Specifically, to integrate the information of the multiple feature maps at each downsampling stage, we append a \(1\times 1\) convolution with a single output channel to selected convolutional layers, followed by a deconvolutional layer that bridges these coarse feature maps to dense probability predictions. The deconvolution is implemented as a backwards strided convolution [18], and its kernel can be learned during training. The proposed network contains three deconvolutional layers (kernel size \(2k\times 2k\) with a stride of \(k\), where \(k=2,4,8\) for the three upsampling layers, respectively), designed to take full advantage of the different feature maps in the hierarchical layers.
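A minimal sketch of one such branch is given below, continuing the PyTorch conventions of the previous sketch: a 1 × 1 convolution produces a single-channel score map, and a transposed (“backwards strided”) convolution with kernel size 2k and stride k restores the input resolution. The padding value is an implementation assumption chosen so that the output is exactly k times larger than the input.

```python
import torch.nn as nn

class BranchOutput(nn.Module):
    """1x1 scoring convolution followed by a learnable 2k x 2k, stride-k deconvolution."""
    def __init__(self, in_channels, k):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)
        if k > 1:
            # output size = (H - 1) * k - 2 * (k // 2) + 2k = k * H, i.e. exact k-fold upsampling
            self.upsample = nn.ConvTranspose2d(1, 1, kernel_size=2 * k,
                                               stride=k, padding=k // 2)
        else:
            self.upsample = nn.Identity()   # the full-resolution branch needs no upsampling

    def forward(self, feature_map):
        return self.upsample(self.score(feature_map))   # coarse features -> dense score map
```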

Our motivation is to train the network in an end-to-end fashion, i.e., to input an image and directly output an equal-sized probability map within a single forward propagation. In this regard, the proposed method is more efficient in both computation and storage than previous patch-based models. In addition, vessel segmentation is a pixel-level classification problem, so hierarchical features and refined localization are required simultaneously: global information from higher layers resolves the problem of what (classification ability), while local information from lower layers addresses the problem of where (precise localization). Finally, the multi-level hierarchical information from each branch of the upsampling module is fused by a weighted-fusion layer in the network, and the fusion weights can be learned during training. The probability map is computed by applying the sigmoid function to the fusion map.
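One possible realization of this weighted-fusion layer is sketched below: each single-channel branch output is scaled by a learnable scalar and the weighted sum is returned as the fusion map, with a sigmoid giving the probability map. Initializing each weight to 1/4 follows the training details reported later; the module structure itself is an assumption.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learnable per-branch fusion weights; sigmoid of the fused map gives probabilities."""
    def __init__(self, num_branches=4):
        super().__init__()
        self.w_f = nn.Parameter(torch.full((num_branches,), 1.0 / num_branches))

    def forward(self, branch_maps):
        # branch_maps: list of (N, 1, H, W) net-input maps s^(m), one per branch
        fused = sum(w * s for w, s in zip(self.w_f, branch_maps))
        return fused   # fusion map; torch.sigmoid(fused) is the vessel probability map
```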

Deep neural networks are notoriously difficult to train. To alleviate the problem of vanishing gradients, we connect auxiliary classifiers (smaller networks) to a few intermediate layers to provide additional supervision during training [14, 32]; that is, we compute a loss function between each branch output and the label of the input image (i.e., \(l_m\left( W,w_\mathrm{b}^{(m)}\right) , m=1,2,3,4\) in Fig. 1). Specifically, we upsample some lower-level and middle-level feature maps using deconvolutional layers and then use sigmoid cross-entropy loss layers to compute the companion losses of these auxiliary classifiers. Finally, we optimize a loss function that is the sum of the loss from the final fusion layer and the branch losses associated with these auxiliary classifiers. The intuitive idea behind these classifiers is to encourage the feature maps at lower layers to be directly predictive of the final labels, and to help propagate the gradients back through the deep network structure. The detailed learning process and the computation of the branch losses from the auxiliary classifiers are introduced in the “Vessel segmentation by fusing multiple branch outputs” section.

Training a deep CNN requires a large amount of labeled training data, which remains a challenge for medical images because of the expense of expert annotation and the scarcity of diseased cases (e.g., lesions). Previous studies have indicated that filters trained on the large-scale, well-annotated ImageNet dataset can be transferred to different recognition tasks in other domains [28, 34]. Therefore, we address our medical task by adapting the off-the-shelf VGGNet model [29], which was trained on the large-scale ImageNet images. The parameters of the convolutional layers in our model are initialized with the pre-trained VGGNet filters. This can be regarded as a pre-training phase that gives the network a good initialization and accelerates its convergence.
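The initialization can be sketched as follows, copying pre-trained VGG-16 convolutional filters into the shape-matching layers of the downsampling path. The use of torchvision and the helper name are assumptions for illustration; the paper does not state the framework used.

```python
import torch.nn as nn
from torchvision.models import vgg16

def init_from_vgg(downsampling_path):
    """Copy pre-trained VGG-16 filters into shape-matching conv layers (assumed helper)."""
    pretrained = [m for m in vgg16(pretrained=True).features if isinstance(m, nn.Conv2d)]
    own = [m for m in downsampling_path.modules() if isinstance(m, nn.Conv2d)]
    for src, dst in zip(pretrained, own):
        if src.weight.shape == dst.weight.shape:
            dst.weight.data.copy_(src.weight.data)
            dst.bias.data.copy_(src.bias.data)
```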

Vessel segmentation by fusing multiple branch outputs

Vessel segmentation can be formulated as a per-pixel classification problem. Specifically, let \(\textstyle \mathcal {S}=\left\{ (X_n,Y_n), n=1,\ldots ,N \right\} \) denote the training set of the network, where \(\textstyle X_n=\{x_j^{(n)},j=1,\ldots ,|X_n|\}\) denotes the raw input image and the index j runs over the spatial dimensions of image \(X_n\). \(|X_n|\) denotes the number of pixels in the input image \(X_n\). \(\textstyle Y_n=\{y_j^{(n)},j=1,\ldots ,|X_n|\}\) denotes the ground truth for image \(X_n\). As we consider each image holistically and independently, we drop the subscript n from our notation for simplicity.

Deep supervision for multi-scale branch output learning

In the task of retinal vessel segmentation, vessel sizes vary greatly and the background is complex. Thus, the size of the receptive field plays a crucial role in pixel-wise classification, as it corresponds to the size of the contextual window of intensity values used for prediction. As shown in Fig. 2, different regions demand different window sizes. For example, regions with clustered vessels require a small window size to separate the individual vessels clearly, whereas regions containing microvessels with low contrast require more contextual information to distinguish vessels from the background, and thus need a large contextual window. The proposed network comprises a single-stream deep network with multiple branch outputs, and these branch outputs have different receptive fields. Therefore, by leveraging multiple branch outputs, the network can handle the various receptive field sizes demanded by different regions. Suppose there are M branch output layers, each connected to an auxiliary classifier. Let \(l_m\left( W,w_\mathrm{b}^{(m)}\right) \) denote the image-level loss function for the mth auxiliary classifier, where W and \(w_\mathrm{b}^{(m)}\) are the parameters of the mainstream network and the weights of the auxiliary classifier in the mth branch, respectively. As the input of the network is a holistic image, the loss function is computed over all pixels in a training image \(X=\{x_j,j=1,\ldots ,|X|\}\) and the corresponding ground truth \(Y=\{y_j,j=1,\ldots ,|X|\}\), \(y_j\in \{0,1\}\).

Fig. 2
figure 2

Different regions demand different contextual window sizes. Left: the original retinal image. Right: the corresponding ground truth (white and black pixels denote vessel and non-vessel regions, respectively)

For retinal images, the distribution of vessel and non-vessel pixels is heavily imbalanced: 85% of the ground truth is non-vessel. Therefore, in the spirit of a cost-sensitive loss function, we introduce a class-balancing weight \(\beta \) to automatically balance the loss between the positive and negative classes [33]. Specifically, \(l_m\) is defined as the following class-balanced cross-entropy loss function:

$$\begin{aligned} l_m\left( W,w_\mathrm{b}^{(m)}\right) = {}&-\beta \sum _{j=1}^{|X|}y_j\log {P\left( y_j=1\Big |X;W,w_\mathrm{b}^{(m)}\right) } \\ {}&-(1-\beta )\sum _{j=1}^{|X|}(1-y_j)\log {P\left( y_j=0\Big |X;W,w_\mathrm{b}^{(m)}\right) }, \end{aligned}$$
(2)

where \(\beta =|X_{-}|/|X|\) and \(1-\beta =|X_+|/|X|\); here \(|X|, |X_+|\), and \(|X_{-}|\) denote the number of pixels, vessel pixels, and non-vessel pixels in an input image \(X\), respectively. The probability \(P(y_j=1\Big |X;W,w_\mathrm{b}^{(m)})=\sigma (s_j^{(m)})\in [0,1]\) is computed by applying the sigmoid function \(\sigma (\cdot )\) to the net input value \(s_j^{(m)}\) at pixel j in the mth branch. Let \(W_\mathrm{b}=(w_\mathrm{b}^{(1)},\ldots ,w_\mathrm{b}^{(M)})\) denote the parameters of the auxiliary classifiers in all branch layers. The loss function for all the branch outputs can then be obtained as

$$\begin{aligned} \mathcal {L}_{b}(W,W_\mathrm{b})=\sum _{m=1}^M l_m\left( W,w_\mathrm{b}^{(m)}\right) . \end{aligned}$$
(3)
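A minimal sketch of the class-balanced cross-entropy of Eq. (2) is given below, with \(\beta\) recomputed from each ground-truth image. The use of a logits-based binary cross-entropy is an implementation choice for the sketch, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(branch_logits, target):
    """Eq. (2): branch_logits are the net inputs s_j; target is the {0,1} vessel map (float tensor)."""
    num_pixels = target.numel()
    num_pos = target.sum()                          # |X+|: vessel pixels
    beta = (num_pixels - num_pos) / num_pixels      # beta = |X-| / |X|
    # weight beta on vessel pixels and (1 - beta) on background pixels
    weights = beta * target + (1.0 - beta) * (1.0 - target)
    return F.binary_cross_entropy_with_logits(branch_logits, target,
                                              weight=weights, reduction='sum')
```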

Multi-scale branch output fusion

The final predicted vessel segmentation map can be obtained by fusing each branch output. Here we add a weighted-fusion layer to the network to fuse the multi-scale segmentation map from each branch output. Moreover, the fusion weight of each branch-output can be learned during training. The loss function at the fusion layer \(\mathcal {L}_\mathrm{fuse}\) becomes

$$\begin{aligned} \mathcal {L}_\mathrm{fuse}\left( W,W_\mathrm{b},W_\mathrm{f}\right) = {}&-\beta \sum _{j=1}^{|X|}y_j\log {P_\mathrm{fuse}\left( y_j=1\Big |X;W,W_\mathrm{b},W_\mathrm{f}\right) } \\ {}&-(1-\beta )\sum _{j=1}^{|X|}(1-y_j)\log {P_\mathrm{fuse}\left( y_j=0\Big |X;W,W_\mathrm{b},W_\mathrm{f}\right) }, \end{aligned}$$
(4)

where \(P_\mathrm{fuse}\left( y_j=1\Big | X;W,W_\mathrm{b}, W_\mathrm{f}\right) =\sigma \left( \sum _{m=1}^M w_\mathrm{f}^{(m)}s_j^{(m)}\right) \), \(s_j^{(m)}\) is the net input value at pixel \(j\) in the mth branch, \(W_\mathrm{f}=\left( w_\mathrm{f}^{(1)},\ldots ,w_\mathrm{f}^{(M)}\right) \) is the vector of fusion weights, and \(P_\mathrm{fuse}\left( y_j=0\Big | X;W, W_\mathrm{b}, W_\mathrm{f}\right) =1-P_\mathrm{fuse}\left( y_j=1\Big | X;W, W_\mathrm{b}, W_\mathrm{f}\right) \). Combining Eqs. (3) and (4), we obtain the following total objective function:

$$\begin{aligned} \begin{aligned} \mathcal {L}(W,W_\mathrm{b},W_\mathrm{f})&= \mathcal {L}_{b}(W,W_\mathrm{b})+\mathcal {L}_\mathrm{fuse}(W,W_\mathrm{b},W_\mathrm{f}) \\&\quad +\, \frac{\lambda }{2}\left( \Vert W\Vert ^2+\Vert W_\mathrm{b}\Vert ^2+\Vert W_\mathrm{f}\Vert ^2\right) , \end{aligned} \end{aligned}$$
(5)

where the last term is a regularization term that helps to prevent overfitting and the hyperparameter \(\lambda \) controls the relative importance of the data loss and weight decay terms. Finally, the parameters \(\scriptstyle \{W,W_\mathrm{b},W_\mathrm{f}\}\) of the proposed network are jointly optimized in an end-to-end way by minimizing the total loss function \(\scriptstyle \mathcal {L}(W,W_\mathrm{b},W_\mathrm{f})\) via standard stochastic gradient descent. For a test image, we apply the overlap-tile strategy to improve the robustness.
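The joint optimization of Eq. (5) can be sketched as a single training step that sums the four branch losses and the fusion loss, with the weight-decay term delegated to the optimizer. The model interface and helper names follow the earlier sketches and are assumptions about the implementation.

```python
import torch

def training_step(model, optimizer, image, target):
    """One SGD update minimizing Eq. (5); class_balanced_bce is the sketch above."""
    branch_logits, fusion_logits = model(image)          # four branch maps and the fused map
    loss = sum(class_balanced_bce(s, target) for s in branch_logits)   # L_b, Eq. (3)
    loss = loss + class_balanced_bce(fusion_logits, target)            # L_fuse, Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The (lambda/2)||.||^2 term of Eq. (5) corresponds to the optimizer's weight decay, e.g.:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-8,
#                             momentum=0.9, weight_decay=0.0005)
```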

Experimental evaluation

Materials

We evaluated our method on three publicly available retinal image databases: DRIVE, STARE, and CHASE_DB1.

The DRIVE (Digital Retinal Images for Vessel Extraction) database [24] consists of 40 color fundus photographs, divided into a training set and a test set of 20 images each. A manual segmentation of the retinal vessels is provided for each image. The STARE (Structured Analysis of the Retina) database [9] contains 20 images for blood vessel segmentation, with no predefined separation into training and testing sets. Two strategies for generating the training and testing sets have been reported in the literature. The first is random sampling, which extracts many patches at random from all 20 images [4, 23, 26]; this creates overlap between the training and testing data and leads to overly optimistic results. The second strategy is leave-one-out validation [15, 16, 31]. In this study, we apply leave-one-out validation to the STARE database: each image is tested using the network model trained on the other 19 images. The CHASE_DB1 database is part of the Child Heart and Health Study in England (CHASE) and contains 28 images from the left and right eyes of 14 children [23, 25]. We use four-fold cross-validation to evaluate performance on the CHASE_DB1 database.
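For reference, the two evaluation protocols can be expressed with standard scikit-learn splitters as below; the image index lists are placeholders, and the shuffled four-fold split is an assumption.

```python
from sklearn.model_selection import LeaveOneOut, KFold

stare_images = list(range(20))    # placeholder indices for the 20 STARE images
chase_images = list(range(28))    # placeholder indices for the 28 CHASE_DB1 images

# Leave-one-out on STARE: train on 19 images, test on the held-out image
for train_idx, test_idx in LeaveOneOut().split(stare_images):
    pass  # train/evaluate the network for this fold

# Four-fold cross-validation on CHASE_DB1: train on 21 images, test on 7
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(chase_images):
    pass  # train/evaluate the network for this fold
```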

Details of training

As very little training data are available for our task, we used various transformations, including rotation, scaling, flipping, and mirroring, to augment the training dataset (by a factor of 10) and thereby reduce overfitting in the deep network. During training, we randomly cropped multiple regions of \(512\times 512\) pixels from each original image as the input of the network: five regions per image of the augmented training dataset for DRIVE and STARE, and ten regions per image for CHASE_DB1, whose images are much larger. This effectively reduces GPU memory usage and takes advantage of efficient batch processing. The proposed model was trained with standard back propagation using stochastic gradient descent (momentum = 0.9, weight decay = 0.0005, learning rate initialized to \(10^{-8}\) and divided by 10 every 30,000 iterations, each element of the fusion weight initialized to 0.25). The parameters of the convolutional layers in the downsampling path were initialized with the pre-trained VGGNet filters, the deconvolutional layers were initialized by bilinear interpolation, and the remaining convolutional layers were initialized using the Xavier algorithm [8].
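The augmentation and cropping pipeline can be sketched as follows. The specific rotation angles, scale range, and crop sampling are assumptions, since only the transformation types and the 512 × 512 crop size are stated above.

```python
import random
import torchvision.transforms.functional as TF

def augment(image, label):
    """Rotation, scaling, flipping, and mirroring applied jointly to image and label tensors."""
    angle = random.choice([0, 90, 180, 270])     # assumed rotation set
    scale = random.uniform(0.9, 1.1)             # assumed scaling range
    image = TF.affine(image, angle=angle, translate=(0, 0), scale=scale, shear=0)
    label = TF.affine(label, angle=angle, translate=(0, 0), scale=scale, shear=0)
    if random.random() < 0.5:
        image, label = TF.hflip(image), TF.hflip(label)   # mirroring
    if random.random() < 0.5:
        image, label = TF.vflip(image), TF.vflip(label)   # flipping
    return image, label

def random_crop(image, label, size=512):
    """Crop a random size x size region from a (C, H, W) tensor and its label map."""
    _, h, w = image.shape
    top, left = random.randint(0, h - size), random.randint(0, w - size)
    return TF.crop(image, top, left, size, size), TF.crop(label, top, left, size, size)
```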

Table 1 Average performance measures with standard deviation for DRIVE, STARE, and CHASE_DB1

Performance measurements

In the retinal vessel segmentation process, every pixel is classified as either vessel or non-vessel. For evaluation purposes, we employ the measures sensitivity (Se), specificity (Sp), and accuracy (Acc), defined as:

$$\begin{aligned}&\hbox {Se} = \frac{\mathrm{TP}}{\mathrm{TP}+{\mathrm{FN}}},\, \hbox {Sp}=\frac{\mathrm{TN}}{\mathrm{TN}+{\mathrm{FP}}}, \\&\hbox {Acc} = \frac{\mathrm{TP}+{\mathrm{TN}}}{{\mathrm{TP}}+{\mathrm{FN}}+{\mathrm{TN}}+{\mathrm{FP}}}, \end{aligned}$$

where the true positives (TP) are correctly identified vessel pixels, false negatives (FN) are vessel pixels incorrectly classified as non-vessel, true negatives (TN) are correctly identified non-vessel pixels, and false positives (FP) are non-vessel pixels incorrectly classified as vessel. In addition, the performance metrics include the Kappa value and the area under the receiver operating characteristic (ROC) curve (AUC). The Kappa value is a measure of the agreement between two observers and is defined as:

$$\begin{aligned} \text {Kappa}=\frac{\mathrm{Acc}-\mathrm{Acc}^{*}}{1-\mathrm{Acc}^{*}}, \end{aligned}$$

where \(\mathrm{Acc}^{*}=\left[ \left( \frac{n_1}{N}\right) \cdot \left( \frac{m_1}{N}\right) \right] + \left[ \left( \frac{n_0}{N}\right) \cdot \left( \frac{m_0}{N}\right) \right] \), \(n_1=\mathrm{TP}+\mathrm{FP}\), \(m_{1}=\mathrm{TP}+\mathrm{FN}\), \(n_0=\mathrm{TN}+\mathrm{FN}\), \(m_0=\mathrm{TN}+\mathrm{FP}\), and \(N=\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}\). The AUC is calculated using the implementation provided in the scikit-learn Python library, which applies the trapezoidal rule to all points of the ROC curve. A perfect classifier has an AUC of 1.
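The following sketch computes the reported measures from a probability map, its binarization, and the ground truth. The AUC relies on scikit-learn's trapezoidal implementation as stated above, while the function interface is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(prob_map, binary_seg, ground_truth):
    """Inputs are flattened arrays; binary_seg and ground_truth take values in {0, 1}."""
    tp = np.sum((binary_seg == 1) & (ground_truth == 1))
    tn = np.sum((binary_seg == 0) & (ground_truth == 0))
    fp = np.sum((binary_seg == 1) & (ground_truth == 0))
    fn = np.sum((binary_seg == 0) & (ground_truth == 1))
    n = tp + tn + fp + fn
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / n
    # chance agreement Acc* for the Kappa statistic
    acc_star = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (acc - acc_star) / (1 - acc_star)
    auc = roc_auc_score(ground_truth, prob_map)   # trapezoidal AUC over the ROC curve
    return se, sp, acc, kappa, auc
```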

Method evaluation

Vessel segmentation results

The segmentation result is a vessel probability map, in which each value represents the probability that the corresponding pixel belongs to the vessel class. To obtain the binary vessel segmentation, a threshold is applied to the probability map to decide whether a particular pixel is a vessel. The optimal threshold is selected as the one that achieves the maximum accuracy on the training dataset. The performance metrics are calculated by taking the first human observer as the ground truth. The average results for DRIVE, STARE, and CHASE_DB1 are presented in Table 1. As illustrated in this table, the average accuracy, specificity, and Kappa values produced by the proposed algorithm exceed those of the second human observer for all three databases, which suggests that the algorithm identifies vessel and non-vessel pixels with a lower false-positive rate than the second human observer. As shown in Fig. 3, the AUC values produced by the proposed method are greater than 0.97 for all three databases, which demonstrates that the proposed network has good generalization ability. Figures 4, 5, and 6 show the segmented images with the best and worst case accuracy from the DRIVE, STARE, and CHASE_DB1 databases, respectively.
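The threshold selection described above can be sketched as a simple grid search over candidate thresholds on the training set; the candidate grid is an assumption.

```python
import numpy as np

def select_threshold(train_prob_maps, train_ground_truths, candidates=np.linspace(0, 1, 101)):
    """Return the threshold that maximizes pixel accuracy over the training images."""
    probs = np.concatenate([p.ravel() for p in train_prob_maps])
    gts = np.concatenate([g.ravel() for g in train_ground_truths])
    best_t, best_acc = 0.5, 0.0
    for t in candidates:
        acc = np.mean((probs >= t) == gts)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# At test time: binary_seg = (test_prob_map >= best_t).astype(np.uint8)
```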

Fig. 3
figure 3

The ROC curves for DRIVE, STARE, and CHASE_DB1 datasets

Fig. 4
figure 4

Segmentation results with the best case (a–d) and the worst case (e–h) accuracy on the DRIVE dataset. a, e Retinal image. b, f Manual segmentation. c, g Probability map. d, h The final segmentation result. The best case Acc, Se, Sp, AUC, and Kappa are 0.9586, 0.9094, 0.9653, 0.9877, and 0.8174, respectively, whereas the worst case measures are 0.9323, 0.5677, 0.9944, 0.9713, and 0.6738, respectively

Fig. 5
figure 5

Segmentation results with the best case (a–d) and the worst case (e–h) accuracy on the STARE dataset. a, e Retinal image. b, f Manual segmentation. c, g Probability map. d, h The final segmentation result. The best case Acc, Se, Sp, AUC, and Kappa are 0.9768, 0.8113, 0.9890, 0.9901, and 0.8154, respectively, whereas the worst case measures are 0.9573, 0.7537, 0.9804, 0.9810, and 0.7588, respectively

Fig. 6
figure 6

Segmentation results with the best case (a–d) and the worst case (e–h) accuracy on the CHASE_DB1 dataset. a, e Retinal image. b, f Manual segmentation. c, g Probability map. d, h The final segmentation result. The best case Acc, Se, Sp, AUC, and Kappa are 0.9691, 0.8473, 0.9792, 0.9870, and 0.7912, respectively, whereas the worst case measures are 0.9502, 0.8339, 0.9616, 0.9737, and 0.7222, respectively

Table 2 Performance comparison of vessel segmentation methods on DRIVE database

Comparison with other methods

We compared the performance of our method with that of state-of-the-art algorithms published in the past decade on the DRIVE, STARE, and CHASE_DB1 databases. The Se, Sp, Acc, and AUC values given by the proposed method were compared with the results reported in the respective publications. From Tables 2, 3, and 4, we can see that our deep segmentation network achieves better performance than most state-of-the-art methods in terms of Se, Sp, and Acc. In addition, the proposed method achieved AUC scores of 0.9782, 0.9885, and 0.9812 on the three databases. These values are higher than those of all other methods except the method in [16], which scored slightly higher. However, the approach described in [16] uses a patch-based method to predict the class assignments of all pixels in an \(s\times s\) window centered on a given image patch. Although this can learn the dependencies between class labels of neighboring pixels, the sliding-window scanning produces \(s^2\) outputs for each pixel in the testing phase, which imposes a heavy computational burden and is time-consuming. In contrast, with only one forward computation, our method produces segmentation results of the same size as the original input, and is therefore significantly faster than the method of [16]. Indeed, a comparison of computation times (see the “Computation time” section) shows that our algorithm is roughly 230 times faster than the method of [16]. In general, our algorithm achieves performance comparable to the state-of-the-art methods at a much faster speed, which illustrates the effectiveness and efficiency of the proposed method.

Table 3 Performance comparison of vessel segmentation methods on STARE database
Table 4 Performance comparison of vessel segmentation methods on CHASE_DB1 database

Cross-training

To test the generalization capability of the proposed algorithm, cross-training experiments were carried out, in which the network was tested on one dataset after being trained on another. The performance metrics of cross-training for the three datasets are presented in Table 5. Compared with Table 1, the accuracy decreases slightly, from 0.9521, 0.9674, and 0.9599 to 0.9460, 0.9570, and 0.9515 for the DRIVE, STARE, and CHASE_DB1 datasets when trained on CHASE_DB1, DRIVE, and STARE, respectively. The average decrease in accuracy is only 0.0083, which demonstrates that our algorithm is robust with respect to the training set. As shown in Table 5, the proposed method achieves better performance than other methods in all cases except DRIVE (trained on CHASE_DB1), where the accuracy is slightly lower than that of the method in [15]. The reason is that images from DRIVE and CHASE_DB1 are quite different: compared to DRIVE, CHASE_DB1 images have non-uniform background illumination, poorer vessel contrast, and wider arterioles. These differences cause the cross-training performance on DRIVE (trained on CHASE_DB1) to be lower than that of the other cross-training configurations.

Table 5 Performance measures of cross-training

Computation time

Training the network takes about 10 hours on a single NVIDIA GTX Titan GPU. After training is completed, testing an image is much faster: segmenting an image of \(584\times 565\) pixels with the trained network takes about 0.4 seconds. We compared the computation time of our algorithm with the latest deep learning-based methods of Liskowski [16] and Li [15]. By taking advantage of fully convolutional networks, our approach requires far less computation than the methods in [15, 16], both of which use a sliding-window approach to obtain the segmentation results and therefore perform redundant computations on neighboring pixels. The results in Table 6 show that our algorithm is roughly 230 times faster than the state-of-the-art method in [16] and 20 times faster than that in [15], while achieving comparable performance. This makes our algorithm suitable for real-world clinical applications.

Table 6 Computation time for processing one image
Table 7 Performance measures with ablation studies

Ablation studies of our method

To investigate the role of each component in the proposed method, we performed extensive ablation studies. Unlike the STARE and CHASE_DB1 databases, the DRIVE database has a specified training set and test set, so the following ablation analysis was carried out on DRIVE. We draw the following conclusions from the results in Table 7. First, the performance obtained by fusing multi-scale contextual information with DS (Fusion-output (with DS)) is significantly better than that of any single branch output (branch-output 1/2/3/4). In addition, with or without DS on the branch outputs, the network initialized with the pre-trained VGGNet parameters outperforms the randomly initialized network by a large margin in all evaluation measurements (Fusion-output (with DS) vs. Random weights (with DS); Fusion-output (w/o DS) vs. Random weights (w/o DS)), illustrating that transfer learning helps to improve performance. Furthermore, to exclude the impact of transfer learning and analyze the effect of DS independently, we initialized the network randomly and compared the performance without and with DS (Random weights (w/o DS) vs. Random weights (with DS)). The performance with DS is clearly better, which verifies that DS guides the propagation of the gradient flow and substantially boosts performance. Finally, combining DS and transfer learning further enhances the network's performance. This suggests that good initialization is very important in deep networks and that the DS mechanism accelerates convergence during learning and improves prediction accuracy.

Conclusion

In this paper, we presented a deep supervised fully convolutional network for vessel segmentation from retinal images. By leveraging the multi-scale hierarchical features with different receptive fields, the network achieves high classification and precise localization abilities simultaneously, enabling it to handle variations in the receptive field size as demanded by different regions. Deep supervision with auxiliary classifiers helps to guide the back propagation of gradient flow and improves the discriminative capability of features in lower layers. To alleviate the issue of insufficient medical training data, the transferred knowledge learned from other domains is used to reduce overfitting on the limited medical dataset and boost performance. Furthermore, our approach does not rely on hand-crafted features and needs no problem-specific preprocessing or postprocessing, which reduces the impact of subjective factors. The results of cross-training also illustrate its robustness with respect to the training set. Extensive experiments on three benchmark datasets demonstrate the effectiveness and efficiency of our method. Compared with state-of-the-art methods, the proposed approach yields better or comparable performance with a much faster computation speed, which makes it suitable for clinical applications.