Introduction

Retinal blood vessel segmentation has been extensively used for diagnosis, screening, and clinical study. For example, the morphological characteristics of retinal blood vessels have been associated with cardiovascular and coronary disease in adult life [23]. Vessel segmentation is the first step in the quantitative analysis of retinal fundus images. The segmented vasculature can assist in the implementation of screening programs for diabetic retinopathy, the detection of the foveal avascular region, and the diagnosis of hypertension and glaucoma [7]. In [12], the retinal vascular tree was adopted for biometric identification because its structure is unique to each individual.

In clinical practice, the retinal vascular tree is usually delineated by ophthalmologists. However, manual annotation of retinal blood vessels requires a high level of skill and is tedious and time-consuming. Automated retinal vessel segmentation methods are therefore urgently required in clinical practice to reduce the annotation time and the workload on ophthalmologists. However, vessel segmentation is a challenging problem. The main difficulties stem from the following aspects [6]: (1) the apparent width of vessels ranges from 1 to 20 pixels, depending on both the anatomical width of the vessel and the image resolution. (2) The shape, size, and local intensity values of vessels can vary greatly, increasing the difficulty of segmentation. (3) Various structures appear in retinal images, including the retinal boundary, optic disk, fovea, and pathologies, which makes it difficult to segment vessels accurately based on hand-crafted features. (4) Microvessels with low contrast against the background further complicate the segmentation task.

Hence, the automated segmentation of retinal blood vessels has received significant attention over recent decades. Existing algorithms for vasculature segmentation can be broadly categorized into unsupervised and supervised methods.

Unsupervised methods for vessel segmentation

Unsupervised methods mine the inherent patterns of blood vessels in retinal images to determine whether a pixel is a vessel or not. With regard to retinal vessel segmentation, matched filtering, mathematical morphology, and vessel tracking algorithms are generally applied. The matched filtering methodology convolves a 2-D kernel with the retinal image and the matched filter response indicates the presence of the feature. For example, Hoover et al. [9] presented a threshold probing technique on a matched filter response image and combined local and region-based properties of the vessels for segmentation. Mathematical morphology is employed in combination with curvature evaluation for vessel segmentation [36]. Vessel tracking algorithms segment a vessel between two points using local information and work at the level of a single vessel [17].

Unsupervised methods can perform vessel segmentation without any prior labeling knowledge, making them suitable for large amounts of unlabeled data. They also have drawbacks, however. Unlike supervised methods, they do not exploit training data or hand-labeled ground truths in the design of the algorithm, and they therefore show some deficiencies in retinal vessel segmentation performance. For example, matched filtering is effective for healthy images but tends to produce many false responses on pathological images. Vessel tracking yields highly accurate vessel widths, but cannot detect vessels or vessel segments that lack seed points. The main disadvantage of relying exclusively on morphological methods is that they do not exploit the known cross-sectional shape of vessels [7]. Thus, supervised methods have been widely studied and applied by an increasing number of researchers.

Supervised methods for vessel segmentation

In supervised methods, the general framework for vessel segmentation contains two steps, i.e., feature extraction and classification. Features extracted from the original images can be further divided into hand-crafted features and automatically learned features. In the classification stage, the extracted features are fed into a classifier such as the K-nearest neighbor (KNN) algorithm, support vector machine (SVM), or artificial neural network (ANN) to predict whether each pixel is a vessel or not. Niemeijer et al. [24] extracted a feature vector for each pixel consisting of the responses of a Gaussian matched filter and its first- and second-order derivatives at multiple scales, augmented with the green plane of the RGB image; the KNN algorithm was then employed to estimate the probability of a pixel belonging to a vessel. Marin et al. [20] proposed a neural network-based methodology for retinal vessel segmentation, designing a 7-D feature vector composed of gray-level and moment invariant-based features as the input of a five-layer feedforward neural network trained for classification. Ricci [26] employed line operators as the feature vector and an SVM for pixel classification. Soares et al. [30] used the 2-D Gabor wavelet and supervised classification for retinal vessel segmentation: each pixel is represented by a feature vector composed of the pixel's intensity and 2-D Gabor wavelet transform responses taken at multiple scales, and a Gaussian mixture model classifier labels each pixel as vessel or non-vessel. Staal [31] presented a ridge-based vessel segmentation methodology in which local features are extracted for every pixel; in total, 27 features are selected from convex set regions and individual pixels, and a KNN classifier is used for classification. Fraz et al. [23] presented an ensemble system of bagged and boosted decision trees for retinal vessel segmentation, constructing a feature vector based on the orientation analysis of the gradient vector field, morphological transformation, line strength measures, and Gabor filter responses.

In the methods above [23, 24], feature selection is based on artificially designed features. Although these approaches are appropriate in some situations, they cannot generalize well and may miss opportunities to discover novel patterns and features. Moreover, designing hand-crafted features is time-consuming, and the resulting features are application-specific.

Recently, deep neural networks (DNNs) have outperformed state-of-the-art methods in many image recognition tasks [3, 5, 11]. This success can be partially attributed to the ability of DNNs to automatically extract hierarchical features from data [37]. In terms of retinal vessel segmentation, Li et al. [15] proposed a cross-modality approach in which the first modality is an image patch of \(16\times 16\) pixels and the second is the label map of all pixels in that patch. Liskowski et al. [16] used a deep convolutional neural network (CNN) as a pixel-wise classifier: the class label of each pixel is predicted by providing a square window centered on that pixel as the input. They also considered variants of the method, including a NO-POOL architecture and structured prediction.

Previous patch-based methods have achieved excellent performance for vessel segmentation, but they have two drawbacks. First, their computational efficiency is poor, because each patch must be processed separately by the network; this is infeasible for large-scale retinal images. Second, the shape, size, and local intensity values of vessels vary widely across retinal images. As the size of the square window is fixed in patch-based methods, they may produce unsatisfactory results in regions where that contextual window size is inappropriate.

To address the above challenges, we propose a novel deep supervised fully convolutional network that fuses multi-level features with different receptive fields to obtain precise segmentation results. Within a single forward propagation, the network produces a probability map of the same size as the original input, and is therefore much faster than previous patch-based methods. In addition, we introduce a deep supervision (DS) mechanism to counteract the problem of vanishing gradients, and we exploit transfer learning to mitigate the problem of insufficient training data. Moreover, the approach learns hierarchical features automatically, reducing the need for time-consuming hand-crafted feature design. The proposed method has been evaluated on three publicly available databases: DRIVE, STARE, and CHASE_DB1. The performance metrics illustrate the advantages of our method, which achieves better or comparable performance to state-of-the-art methods for retinal vessel segmentation.

The remainder of this paper is organized as follows. The proposed method is presented in the “Methods” section. In the “Experimental evaluation” section, performance metrics are defined and the robustness of the algorithm is assessed in a series of experiments. Finally, our conclusions are given in the “Conclusion” section.

Methods

Architecture of the proposed network

In this section, we build a deep network that fuses multi-level hierarchical features and inserts auxiliary supervision at intermediate layers for vessel segmentation. Inspired by recent studies of the fully convolutional network (FCN) [18], which converts the fully connected layers of a traditional CNN into convolutional kernels and can thus be applied to input images of arbitrary size, we design a deep network architecture that works with very little training data and yields more precise segmentations. As shown in Fig. 1, the architecture of the proposed network consists of two parts, i.e., downsampling and upsampling modules.

The downsampling module is composed of convolutional and max-pooling layers, which capture discriminative semantic information and are robust to appearance changes. In a typical convolutional layer, a feature map is produced by convolving the input with convolutional kernels, adding a bias term, and finally applying a nonlinear activation function. By denoting the ith feature map of the lth layer as \(y_i^l\) and the kth feature map of the previous layer as \(y_k^{l-1}\), a convolutional layer is formulated as

$$\begin{aligned} \left\{ \begin{aligned}&y_i^l=f\left( S_i^l\right) ,\\&S_i^l=\sum _{k=1}^{n_{l-1}}y_k^{l-1}\bigotimes W_{ki}^l+b_i^l, \end{aligned} \right. \end{aligned}$$
(1)

where \(S_i^l\) is the net input of the ith feature map in the lth layer, \(f(\cdot )\) is the nonlinear activation function, \(\bigotimes \) denotes the 2D spatial convolution, \(W_{ki}^l\) is the convolutional kernel connecting the kth feature map of the previous layer \(y_k^{l-1}\) to the ith feature map \(y_i^l\) in the lth layer, \(b_i^l\) is the bias term, and \(n_{l-1}\) is the number of feature maps in the \((l-1)\)th layer. During the downsampling stage, the RGB image is passed through a stack of \(3 \times 3\) convolutional layers. The convolution stride is fixed to 1 pixel, and the input of each convolutional layer is spatially padded to preserve the resolution after convolution, i.e., the padding is 1 pixel for the \(3 \times 3\) convolutional layers. Spatial pooling is carried out by three max-pooling layers, each performed over a \(2 \times 2\) pixel window with a stride of 2.
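As an illustration, the downsampling path can be sketched in PyTorch as follows. The channel counts, the number of convolutions per stage, and the framework choice are assumptions for illustration only, not the exact configuration of the original network.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """Stack of 3x3 conv + ReLU layers; each conv realizes Eq. (1) with f = ReLU."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class DownsamplingPath(nn.Module):
    """VGG-style downsampling module with three 2x2 max-pooling layers."""
    def __init__(self):
        super().__init__()
        self.stage1 = conv_block(3, 64, 2)      # full resolution
        self.stage2 = conv_block(64, 128, 2)    # 1/2 resolution
        self.stage3 = conv_block(128, 256, 3)   # 1/4 resolution
        self.stage4 = conv_block(256, 512, 3)   # 1/8 resolution
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(self.pool(f1))
        f3 = self.stage3(self.pool(f2))
        f4 = self.stage4(self.pool(f3))
        return f1, f2, f3, f4   # multi-level features, one per branch output
```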

Fig. 1
figure 1

The architecture of the proposed deep supervised fully convolutional network for vessel segmentation. Each box corresponds to a multi-channel feature map. The number and sizes of feature maps are depicted in the format (Number \(\times \) size). \(l_m(W,w_\mathrm{b}^{(m)})\) is the loss function for the mth branch output and \(\mathcal {L}_\mathrm{fuse}(W,W_\mathrm{b},W_\mathrm{f})\) is the loss function at the fusion layer. W and \(w_\mathrm{b}^{(m)}\) are the parameters of the mainstream network and the weights of the auxiliary classifier in the mth branch \((m=1,2,3,4)\), respectively, \(W_\mathrm{f}\) is the fusion weight, \(W_\mathrm{b}=(w_\mathrm{b}^{(1)},\ldots ,w_\mathrm{b}^{(4)})\). The arrows denote the different operations, where the thin blue and red arrows represent computing the loss functions and weighted-fusion outputs, respectively

After several downsampling stages, the feature maps are gradually reduced in size and become much smaller than the ground truth. Therefore, upsampling is implemented to ensure that the output at every stage has the same size as the ground truth. The upsampling module comprises convolutional and deconvolutional layers that recover a fine spatial resolution for precise localization. Specifically, to integrate the information of the multiple feature maps at each downsampling stage, we append a \(1\times 1\) convolution with a single output channel to selected convolutional layers, followed by a deconvolutional layer that bridges these coarse feature maps to dense probability predictions. The deconvolution is implemented as a backwards strided convolution [18], and its kernel can be learned during training. The proposed network contains three deconvolutional layers (kernel size \(2k\times 2k\) with a stride of \(k\), where \(k=2,4,8\) for the three upsampling layers, respectively), designed to take full advantage of the different feature maps in the hierarchical layers.
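A minimal sketch of one such branch is given below, continuing the PyTorch conventions of the previous sketch: a 1 × 1 convolution produces a single-channel score map, and a transposed (“backwards strided”) convolution with kernel size 2k and stride k restores the input resolution. The padding value is an implementation assumption chosen so that the output is exactly k times larger than the input.

```python
import torch.nn as nn

class BranchOutput(nn.Module):
    """1x1 scoring convolution followed by a learnable 2k x 2k, stride-k deconvolution."""
    def __init__(self, in_channels, k):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)
        if k > 1:
            # output size = (H - 1) * k - 2 * (k // 2) + 2k = k * H, i.e. exact k-fold upsampling
            self.upsample = nn.ConvTranspose2d(1, 1, kernel_size=2 * k,
                                               stride=k, padding=k // 2)
        else:
            self.upsample = nn.Identity()   # the full-resolution branch needs no upsampling

    def forward(self, feature_map):
        return self.upsample(self.score(feature_map))   # coarse features -> dense score map
```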

Our motivation is to train the network in an end-to-end fashion, i.e., to input an image and directly output an equal-sized probability map within a single forward propagation. In this regard, the proposed method is more efficient in both computation and storage than previous patch-based models. In addition, vessel segmentation is a pixel-level classification problem, so hierarchical features and refined localization are required simultaneously: global information from higher layers resolves the problem of what (classification ability), while local information from lower layers addresses the problem of where (precise localization). Finally, the multi-level hierarchical information from each branch of the upsampling module is fused by a weighted-fusion layer in the network, and the fusion weights can be learned during training. The probability map is computed by applying the sigmoid function to the fusion map.
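One possible realization of this weighted-fusion layer is sketched below: each single-channel branch output is scaled by a learnable scalar and the weighted sum is returned as the fusion map, with a sigmoid giving the probability map. Initializing each weight to 1/4 follows the training details reported later; the module structure itself is an assumption.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learnable per-branch fusion weights; sigmoid of the fused map gives probabilities."""
    def __init__(self, num_branches=4):
        super().__init__()
        self.w_f = nn.Parameter(torch.full((num_branches,), 1.0 / num_branches))

    def forward(self, branch_maps):
        # branch_maps: list of (N, 1, H, W) net-input maps s^(m), one per branch
        fused = sum(w * s for w, s in zip(self.w_f, branch_maps))
        return fused   # fusion map; torch.sigmoid(fused) is the vessel probability map
```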

Deep neural networks are notoriously difficult to train. To alleviate the problem of vanishing gradients, we connect auxiliary classifiers (smaller networks) to a few intermediate layers to provide additional supervision during training [14, 32]; that is, we compute a loss function between each branch output and the label of the input image (i.e., \(l_m\left( W,w_\mathrm{b}^{(m)}\right) , m=1,2,3,4\) in Fig. 1). Specifically, we upsample some lower-level and middle-level feature maps using deconvolutional layers and then use sigmoid cross-entropy loss layers to compute the companion losses of these auxiliary classifiers. Finally, we optimize a loss function that is the sum of the loss from the final fusion layer and the branch losses associated with these auxiliary classifiers. The intuitive idea behind these classifiers is to encourage the feature maps at lower layers to be directly predictive of the final labels, and to help propagate the gradients back through the deep network structure. The detailed learning process and the computation of the branch losses from the auxiliary classifiers are introduced in the “Vessel segmentation by fusing multiple branch outputs” section.

Training a deep CNN requires a large amount of labeled training data, which remains a challenge for medical images because of the expense of expert annotation and the scarcity of diseased cases (e.g., lesions). Previous studies have indicated that filters trained on the large-scale, well-annotated ImageNet dataset can be transferred to different recognition tasks in other domains [28, 34]. Therefore, we address our medical task by adapting the off-the-shelf VGGNet model [29], which was trained on the large-scale ImageNet images. The parameters of the convolutional layers in our model are initialized with the pre-trained VGGNet filters. This can be regarded as a pre-training phase that gives the network a good initialization and accelerates its convergence.
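The initialization can be sketched as follows, copying pre-trained VGG-16 convolutional filters into the shape-matching layers of the downsampling path. The use of torchvision and the helper name are assumptions for illustration; the paper does not state the framework used.

```python
import torch.nn as nn
from torchvision.models import vgg16

def init_from_vgg(downsampling_path):
    """Copy pre-trained VGG-16 filters into shape-matching conv layers (assumed helper)."""
    pretrained = [m for m in vgg16(pretrained=True).features if isinstance(m, nn.Conv2d)]
    own = [m for m in downsampling_path.modules() if isinstance(m, nn.Conv2d)]
    for src, dst in zip(pretrained, own):
        if src.weight.shape == dst.weight.shape:
            dst.weight.data.copy_(src.weight.data)
            dst.bias.data.copy_(src.bias.data)
```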

Vessel segmentation by fusing multiple branch outputs

Vessel segmentation can be formulated as a per-pixel classification problem. Specifically, let \(\textstyle \mathcal {S}=\left\{ (X_n,Y_n), n=1,\ldots ,N \right\} \) denote the training set of the network, where \(\textstyle X_n=\{x_j^{(n)},j=1,\ldots ,|X_n|\}\) denotes the raw input image and the index j runs over the spatial dimensions of image \(X_n\). \(|X_n|\) denotes the number of pixels in the input image \(X_n\). \(\textstyle Y_n=\{y_j^{(n)},j=1,\ldots ,|X_n|\}\) denotes the ground truth for image \(X_n\). As we consider each image holistically and independently, we drop the subscript n from our notation for simplicity.

Deep supervision for multi-scale branch output learning

In the task of retinal vessel segmentation, vessel sizes vary greatly and the background is complex. Thus, the size of the receptive field plays a crucial role in pixel-wise classification, as it corresponds to the size of the contextual window of intensity values used for prediction. As shown in Fig. 2, different regions demand different window sizes. For example, regions with clustered vessels require a small window size to separate the individual vessels clearly, whereas regions containing microvessels with low contrast require more contextual information to distinguish vessels from the background, and thus need a large contextual window. The proposed network comprises a single-stream deep network with multiple branch outputs, and these branch outputs have different receptive fields. Therefore, by leveraging multiple branch outputs, the network can handle the various receptive field sizes demanded by different regions. Suppose there are M branch output layers, each connected to an auxiliary classifier. Let \(l_m\left( W,w_\mathrm{b}^{(m)}\right) \) denote the image-level loss function for the mth auxiliary classifier, where W and \(w_\mathrm{b}^{(m)}\) are the parameters of the mainstream network and the weights of the auxiliary classifier in the mth branch, respectively. As the input of the network is a holistic image, the loss function is computed over all pixels in a training image \(X=\{x_j,j=1,\ldots ,|X|\}\) and the corresponding ground truth \(Y=\{y_j,j=1,\ldots ,|X|\}\), \(y_j\in \{0,1\}\).

Fig. 2
figure 2

Different regions demand different contextual window sizes. Left: the original retinal image. Right: the corresponding ground truth (white and black pixels denote vessel and non-vessel regions, respectively)

For retinal images, the distribution of vessel and non-vessel pixels is heavily imbalanced: 85% of the ground truth is non-vessel. Therefore, in the spirit of a cost-sensitive loss function, we introduce a class-balancing weight \(\beta \) to automatically balance the loss between the positive and negative classes [33]. Specifically, \(l_m\) is defined as the following class-balanced cross-entropy loss function:

$$\begin{aligned} l_m\left( W,w_\mathrm{b}^{(m)}\right) = {}&-\beta \sum _{j=1}^{|X|}y_j\log {P\left( y_j=1\Big |X;W,w_\mathrm{b}^{(m)}\right) } \\ {}&-(1-\beta )\sum _{j=1}^{|X|}(1-y_j)\log {P\left( y_j=0\Big |X;W,w_\mathrm{b}^{(m)}\right) }, \end{aligned}$$
(2)

where \(\beta =|X_{-}|/|X|\) and \(1-\beta =|X_+|/|X|\); here \(|X|, |X_+|\), and \(|X_{-}|\) denote the number of pixels, vessel pixels, and non-vessel pixels in an input image \(X\), respectively. The probability \(P(y_j=1\Big |X;W,w_\mathrm{b}^{(m)})=\sigma (s_j^{(m)})\in [0,1]\) is computed by applying the sigmoid function \(\sigma (\cdot )\) to the net input value \(s_j^{(m)}\) at pixel j in the mth branch. Let \(W_\mathrm{b}=(w_\mathrm{b}^{(1)},\ldots ,w_\mathrm{b}^{(M)})\) denote the parameters of the auxiliary classifiers in all branch layers. The loss function for all the branch outputs can then be obtained as

$$\begin{aligned} \mathcal {L}_{b}(W,W_\mathrm{b})=\sum _{m=1}^M l_m\left( W,w_\mathrm{b}^{(m)}\right) . \end{aligned}$$
(3)
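A minimal sketch of the class-balanced cross-entropy of Eq. (2) is given below, with \(\beta\) recomputed from each ground-truth image. The use of a logits-based binary cross-entropy is an implementation choice for the sketch, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(branch_logits, target):
    """Eq. (2): branch_logits are the net inputs s_j; target is the {0,1} vessel map (float tensor)."""
    num_pixels = target.numel()
    num_pos = target.sum()                          # |X+|: vessel pixels
    beta = (num_pixels - num_pos) / num_pixels      # beta = |X-| / |X|
    # weight beta on vessel pixels and (1 - beta) on background pixels
    weights = beta * target + (1.0 - beta) * (1.0 - target)
    return F.binary_cross_entropy_with_logits(branch_logits, target,
                                              weight=weights, reduction='sum')
```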

Multi-scale branch output fusion

The final predicted vessel segmentation map can be obtained by fusing each branch output. Here we add a weighted-fusion layer to the network to fuse the multi-scale segmentation map from each branch output. Moreover, the fusion weight of each branch-output can be learned during training. The loss function at the fusion layer \(\mathcal {L}_\mathrm{fuse}\) becomes

$$\begin{aligned} \mathcal {L}_\mathrm{fuse}\left( W,W_\mathrm{b},W_\mathrm{f}\right) = {}&-\beta \sum _{j=1}^{|X|}y_j\log {P_\mathrm{fuse}\left( y_j=1\Big |X;W,W_\mathrm{b},W_\mathrm{f}\right) } \\ {}&-(1-\beta )\sum _{j=1}^{|X|}(1-y_j)\log {P_\mathrm{fuse}\left( y_j=0\Big |X;W,W_\mathrm{b},W_\mathrm{f}\right) }, \end{aligned}$$
(4)

where \(P_\mathrm{fuse}\left( y_j=1\Big | X;W,W_\mathrm{b}, W_\mathrm{f}\right) =\sigma \left( \sum _{m=1}^M w_\mathrm{f}^{(m)}s_j^{(m)}\right) \), \(s_j^{(m)}\) is the net input value at pixel \(j\) in the mth branch, \(W_\mathrm{f}=\left( w_\mathrm{f}^{(1)},\ldots ,w_\mathrm{f}^{(M)}\right) \) is the vector of fusion weights, and \(P_\mathrm{fuse}\left( y_j=0\Big | X;W, W_\mathrm{b}, W_\mathrm{f}\right) =1-P_\mathrm{fuse}\left( y_j=1\Big | X;W, W_\mathrm{b}, W_\mathrm{f}\right) \). Combining Eqs. (3) and (4), we obtain the following total objective function:

$$\begin{aligned} \begin{aligned} \mathcal {L}(W,W_\mathrm{b},W_\mathrm{f})&= \mathcal {L}_{b}(W,W_\mathrm{b})+\mathcal {L}_\mathrm{fuse}(W,W_\mathrm{b},W_\mathrm{f}) \\&\quad +\, \frac{\lambda }{2}\left( \Vert W\Vert ^2+\Vert W_\mathrm{b}\Vert ^2+\Vert W_\mathrm{f}\Vert ^2\right) , \end{aligned} \end{aligned}$$
(5)

where the last term is a regularization term that helps to prevent overfitting and the hyperparameter \(\lambda \) controls the relative importance of the data loss and weight decay terms. Finally, the parameters \(\scriptstyle \{W,W_\mathrm{b},W_\mathrm{f}\}\) of the proposed network are jointly optimized in an end-to-end way by minimizing the total loss function \(\scriptstyle \mathcal {L}(W,W_\mathrm{b},W_\mathrm{f})\) via standard stochastic gradient descent. For a test image, we apply the overlap-tile strategy to improve the robustness.
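The joint optimization of Eq. (5) can be sketched as a single training step that sums the four branch losses and the fusion loss, with the weight-decay term delegated to the optimizer. The model interface and helper names follow the earlier sketches and are assumptions about the implementation.

```python
import torch

def training_step(model, optimizer, image, target):
    """One SGD update minimizing Eq. (5); class_balanced_bce is the sketch above."""
    branch_logits, fusion_logits = model(image)          # four branch maps and the fused map
    loss = sum(class_balanced_bce(s, target) for s in branch_logits)   # L_b, Eq. (3)
    loss = loss + class_balanced_bce(fusion_logits, target)            # L_fuse, Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The (lambda/2)||.||^2 term of Eq. (5) corresponds to the optimizer's weight decay, e.g.:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-8,
#                             momentum=0.9, weight_decay=0.0005)
```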

Experimental evaluation

Materials

We evaluated our method on three publicly available retinal image databases: DRIVE, STARE, and CHASE_DB1.

The DRIVE (Digital Retinal Images for Vessel Extraction) database [24] consists of 40 color fundus photographs, divided into a training set and a test set of 20 images each. A manual segmentation of the retinal vessels is provided for each image. The STARE (Structured Analysis of the Retina) database [9] contains 20 images for blood vessel segmentation, with no predefined separation into training and testing sets. Two strategies for generating the training and testing sets have been reported in the literature. The first is random sampling, which extracts many patches at random from all 20 images [4, 23, 26]; this creates overlap between the training and testing data and leads to overly optimistic results. The second strategy is leave-one-out validation [15, 16, 31]. In this study, we apply leave-one-out validation to the STARE database: each image is tested using the network model trained on the other 19 images. The CHASE_DB1 database is part of the Child Heart and Health Study in England (CHASE) and contains 28 images from the left and right eyes of 14 children [23, 25]. We use four-fold cross-validation to evaluate performance on the CHASE_DB1 database.
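For reference, the two evaluation protocols can be expressed with standard scikit-learn splitters as below; the image index lists are placeholders, and the shuffled four-fold split is an assumption.

```python
from sklearn.model_selection import LeaveOneOut, KFold

stare_images = list(range(20))    # placeholder indices for the 20 STARE images
chase_images = list(range(28))    # placeholder indices for the 28 CHASE_DB1 images

# Leave-one-out on STARE: train on 19 images, test on the held-out image
for train_idx, test_idx in LeaveOneOut().split(stare_images):
    pass  # train/evaluate the network for this fold

# Four-fold cross-validation on CHASE_DB1: train on 21 images, test on 7
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(chase_images):
    pass  # train/evaluate the network for this fold
```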

Details of training

As very little training data are available for our task, we used various transformations, including rotation, scaling, flipping, and mirroring, to augment the training dataset (by a factor of 10) and thereby reduce overfitting in the deep network. During training, we randomly cropped multiple regions of \(512\times 512\) pixels from each original image as the input of the network: five regions per image of the augmented training dataset for DRIVE and STARE, and ten regions per image for CHASE_DB1, whose images are much larger. This effectively reduces GPU memory usage and takes advantage of efficient batch processing. The proposed model was trained with standard back propagation using stochastic gradient descent (momentum = 0.9, weight decay = 0.0005, learning rate initialized to \(10^{-8}\) and divided by 10 every 30,000 iterations, each element of the fusion weight initialized to 0.25). The parameters of the convolutional layers in the downsampling path were initialized with the pre-trained VGGNet filters, the deconvolutional layers were initialized by bilinear interpolation, and the remaining convolutional layers were initialized using the Xavier algorithm [8].
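The augmentation and cropping pipeline can be sketched as follows. The specific rotation angles, scale range, and crop sampling are assumptions, since only the transformation types and the 512 × 512 crop size are stated above.

```python
import random
import torchvision.transforms.functional as TF

def augment(image, label):
    """Rotation, scaling, flipping, and mirroring applied jointly to image and label tensors."""
    angle = random.choice([0, 90, 180, 270])     # assumed rotation set
    scale = random.uniform(0.9, 1.1)             # assumed scaling range
    image = TF.affine(image, angle=angle, translate=(0, 0), scale=scale, shear=0)
    label = TF.affine(label, angle=angle, translate=(0, 0), scale=scale, shear=0)
    if random.random() < 0.5:
        image, label = TF.hflip(image), TF.hflip(label)   # mirroring
    if random.random() < 0.5:
        image, label = TF.vflip(image), TF.vflip(label)   # flipping
    return image, label

def random_crop(image, label, size=512):
    """Crop a random size x size region from a (C, H, W) tensor and its label map."""
    _, h, w = image.shape
    top, left = random.randint(0, h - size), random.randint(0, w - size)
    return TF.crop(image, top, left, size, size), TF.crop(label, top, left, size, size)
```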

Table 1 Average performance measures with standard deviation for DRIVE, STARE, and CHASE_DB1

Performance measurements

In the retinal vessel segmentation process, every pixel is classified as either vessel or non-vessel. For evaluation purposes, we employ the measures sensitivity (Se), specificity (Sp), and accuracy (Acc), defined as:

$$\begin{aligned}&\hbox {Se} = \frac{\mathrm{TP}}{\mathrm{TP}+{\mathrm{FN}}},\, \hbox {Sp}=\frac{\mathrm{TN}}{\mathrm{TN}+{\mathrm{FP}}}, \\&\hbox {Acc} = \frac{\mathrm{TP}+{\mathrm{TN}}}{{\mathrm{TP}}+{\mathrm{FN}}+{\mathrm{TN}}+{\mathrm{FP}}}, \end{aligned}$$

where the true positives (TP) are correctly identified vessel pixels, false negatives (FN) are vessel pixels incorrectly classified as non-vessel, true negatives (TN) are correctly identified non-vessel pixels, and false positives (FP) are non-vessel pixels incorrectly classified as vessel. In addition, the performance metrics include the Kappa value and the area under the receiver operating characteristic (ROC) curve (AUC). The Kappa value is a measure of the agreement between two observers and is defined as:

$$\begin{aligned} \text {Kappa}=\frac{\mathrm{Acc}-\mathrm{Acc}^{*}}{1-\mathrm{Acc}^{*}}, \end{aligned}$$

where \(\mathrm{Acc}^{*}=\left[ \left( \frac{n_1}{N}\right) \cdot \left( \frac{m_1}{N}\right) \right] + \left[ \left( \frac{n_0}{N}\right) \cdot \left( \frac{m_0}{N}\right) \right] \), \(n_1=\mathrm{TP}+\mathrm{FP}\), \(m_{1}=\mathrm{TP}+\mathrm{FN}\), \(n_0=\mathrm{TN}+\mathrm{FN}\), \(m_0=\mathrm{TN}+\mathrm{FP}\), and \(N=\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}\). The AUC is calculated using the implementation provided in the scikit-learn Python library, which applies the trapezoidal rule to all points of the ROC curve. A perfect classifier has an AUC of 1.
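The following sketch computes the reported measures from a probability map, its binarization, and the ground truth. The AUC relies on scikit-learn's trapezoidal implementation as stated above, while the function interface is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(prob_map, binary_seg, ground_truth):
    """Inputs are flattened arrays; binary_seg and ground_truth take values in {0, 1}."""
    tp = np.sum((binary_seg == 1) & (ground_truth == 1))
    tn = np.sum((binary_seg == 0) & (ground_truth == 0))
    fp = np.sum((binary_seg == 1) & (ground_truth == 0))
    fn = np.sum((binary_seg == 0) & (ground_truth == 1))
    n = tp + tn + fp + fn
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / n
    # chance agreement Acc* for the Kappa statistic
    acc_star = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (acc - acc_star) / (1 - acc_star)
    auc = roc_auc_score(ground_truth, prob_map)   # trapezoidal AUC over the ROC curve
    return se, sp, acc, kappa, auc
```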

Method evaluation

Vessel segmentation results

The segmentation result is a vessel probability map, in which each value represents the probability that the corresponding pixel belongs to the vessel class. To obtain the binary vessel segmentation, a threshold is applied to the probability map to decide whether a particular pixel is a vessel. The optimal threshold is selected as the one that achieves the maximum accuracy on the training dataset. The performance metrics are calculated by taking the first human observer as the ground truth. The average results for DRIVE, STARE, and CHASE_DB1 are presented in Table 1. As illustrated in this table, the average accuracy, specificity, and Kappa values produced by the proposed algorithm exceed those of the second human observer for all three databases, which suggests that the algorithm identifies vessel and non-vessel pixels with a lower false-positive rate than the second human observer. As shown in Fig. 3, the AUC values produced by the proposed method are greater than 0.97 for all three databases, which demonstrates that the proposed network has good generalization ability. Figures 4, 5, and 6 show the segmented images with the best and worst case accuracy from the DRIVE, STARE, and CHASE_DB1 databases, respectively.
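The threshold selection described above can be sketched as a simple grid search over candidate thresholds on the training set; the candidate grid is an assumption.

```python
import numpy as np

def select_threshold(train_prob_maps, train_ground_truths, candidates=np.linspace(0, 1, 101)):
    """Return the threshold that maximizes pixel accuracy over the training images."""
    probs = np.concatenate([p.ravel() for p in train_prob_maps])
    gts = np.concatenate([g.ravel() for g in train_ground_truths])
    best_t, best_acc = 0.5, 0.0
    for t in candidates:
        acc = np.mean((probs >= t) == gts)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# At test time: binary_seg = (test_prob_map >= best_t).astype(np.uint8)
```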

Fig. 3
figure 3

The ROC curves for DRIVE, STARE, and CHASE_DB1 datasets

Fig. 4
figure 4

Segmentation results with the best case (a–d) and the worst case (e–h) accuracy on the DRIVE dataset. a, e Retinal image. b, f Manual segmentation. c, g Probability map. d, h The final segmentation result. The best case Acc, Se, Sp, AUC, and Kappa are 0.9586, 0.9094, 0.9653, 0.9877, and 0.8174, respectively, whereas the worst case measures are 0.9323, 0.5677, 0.9944, 0.9713, and 0.6738, respectively

Fig. 5
figure 5

Segmentation results with the best case (a–d) and the worst case (e–h) accuracy on the STARE dataset. a, e Retinal image. b, f Manual segmentation. c, g Probability map. d, h The final segmentation result. The best case Acc, Se, Sp, AUC, and Kappa are 0.9768, 0.8113, 0.9890, 0.9901, and 0.8154, respectively, whereas the worst case measures are 0.9573, 0.7537, 0.9804, 0.9810, and 0.7588, respectively

Fig. 6
figure 6

Segmentation results with the best case (a–d) and the worst case (e–h) accuracy on the CHASE_DB1 dataset. a, e Retinal image. b, f Manual segmentation. c, g Probability map. d, h The final segmentation result. The best case Acc, Se, Sp, AUC, and Kappa are 0.9691, 0.8473, 0.9792, 0.9870, and 0.7912, respectively, whereas the worst case measures are 0.9502, 0.8339, 0.9616, 0.9737, and 0.7222, respectively

Table 2 Performance comparison of vessel segmentation methods on DRIVE database

Comparison with other methods

We compared the performance of our method with that of state-of-the-art algorithms published in the past decade on the DRIVE, STARE, and CHASE_DB1 databases. The Se, Sp, Acc, and AUC values given by the proposed method were compared with the results reported in the respective publications. From Tables 2, 3, and 4, we can see that our deep segmentation network achieves better performance than most state-of-the-art methods in terms of Se, Sp, and Acc. In addition, the proposed method achieved AUC scores of 0.9782, 0.9885, and 0.9812 on the three databases. These values are higher than those of all other methods except the method in [16], which scored slightly higher. However, the approach described in [16] uses a patch-based method to predict the class assignments of all pixels in an \(s\times s\) window centered on a given image patch. Although this can learn the dependencies between class labels of neighboring pixels, the sliding-window scanning produces \(s^2\) outputs for each pixel in the testing phase, which imposes a heavy computational burden and is time-consuming. In contrast, with only one forward computation, our method produces segmentation results of the same size as the original input, and is therefore significantly faster than the method of [16]. Indeed, a comparison of computation times (see the “Computation time” section) shows that our algorithm is roughly 230 times faster than the method of [16]. In general, our algorithm achieves performance comparable to the state-of-the-art methods at a much faster speed, which illustrates the effectiveness and efficiency of the proposed method.

Table 3 Performance comparison of vessel segmentation methods on STARE database
Table 4 Performance comparison of vessel segmentation methods on CHASE_DB1 database

Cross-training

To test the generalization capability of the proposed algorithm, cross-training experiments were carried out, in which the network was tested on one dataset after being trained on another. The performance metrics of cross-training for the three datasets are presented in Table 5. Compared with Table 1, the accuracy decreases slightly, from 0.9521, 0.9674, and 0.9599 to 0.9460, 0.9570, and 0.9515 for the DRIVE, STARE, and CHASE_DB1 datasets when trained on CHASE_DB1, DRIVE, and STARE, respectively. The average decrease in accuracy is only 0.0083, which demonstrates that our algorithm is robust with respect to the training set. As shown in Table 5, the proposed method achieves better performance than other methods in all cases except DRIVE (trained on CHASE_DB1), where the accuracy is slightly lower than that of the method in [15]. The reason is that images from DRIVE and CHASE_DB1 are quite different: compared to DRIVE, CHASE_DB1 images have non-uniform background illumination, poorer vessel contrast, and wider arterioles. These differences cause the cross-training performance on DRIVE (trained on CHASE_DB1) to be lower than that of the other cross-training configurations.

Table 5 Performance measures of cross-training

Computation time

Training the network takes about 10 hours on a single NVIDIA GTX Titan GPU. After training is completed, testing an image is much faster: segmenting an image of \(584\times 565\) pixels with the trained network takes about 0.4 seconds. We compared the computation time of our algorithm with the latest deep learning-based methods of Liskowski [16] and Li [15]. By taking advantage of fully convolutional networks, our approach requires far less computation than the methods in [15, 16], both of which use a sliding-window approach to obtain the segmentation results and therefore perform redundant computations on neighboring pixels. The results in Table 6 show that our algorithm is roughly 230 times faster than the state-of-the-art method in [16] and 20 times faster than that in [15], while achieving comparable performance. This makes our algorithm suitable for real-world clinical applications.

Table 6 Computation time for processing one image
Table 7 Performance measures with ablation studies

Ablation studies of our method

To investigate the role of each component in the proposed method, we performed extensive ablation studies. Unlike the STARE and CHASE_DB1 databases, the DRIVE database has a specified training set and test set, so the following ablation analysis was carried out on DRIVE. We draw the following conclusions from the results in Table 7. First, the performance obtained by fusing multi-scale contextual information with DS (Fusion-output (with DS)) is significantly better than that of any single branch output (branch-output 1/2/3/4). In addition, with or without DS on the branch outputs, the network initialized with the pre-trained VGGNet parameters outperforms the randomly initialized network by a large margin in all evaluation measurements (Fusion-output (with DS) vs. Random weights (with DS); Fusion-output (w/o DS) vs. Random weights (w/o DS)), illustrating that transfer learning helps to improve performance. Furthermore, to exclude the impact of transfer learning and analyze the effect of DS independently, we initialized the network randomly and compared the performance without and with DS (Random weights (w/o DS) vs. Random weights (with DS)). The performance with DS is clearly better, which verifies that DS guides the propagation of the gradient flow and substantially boosts performance. Finally, combining DS and transfer learning further enhances the network's performance. This suggests that good initialization is very important in deep networks and that the DS mechanism accelerates convergence during learning and improves prediction accuracy.

Conclusion

In this paper, we presented a deep supervised fully convolutional network for vessel segmentation from retinal images. By leveraging the multi-scale hierarchical features with different receptive fields, the network achieves high classification and precise localization abilities simultaneously, enabling it to handle variations in the receptive field size as demanded by different regions. Deep supervision with auxiliary classifiers helps to guide the back propagation of gradient flow and improves the discriminative capability of features in lower layers. To alleviate the issue of insufficient medical training data, the transferred knowledge learned from other domains is used to reduce overfitting on the limited medical dataset and boost performance. Furthermore, our approach does not rely on hand-crafted features and needs no problem-specific preprocessing or postprocessing, which reduces the impact of subjective factors. The results of cross-training also illustrate its robustness with respect to the training set. Extensive experiments on three benchmark datasets demonstrate the effectiveness and efficiency of our method. Compared with state-of-the-art methods, the proposed approach yields better or comparable performance with a much faster computation speed, which makes it suitable for clinical applications.