
1 Introduction

Diabetic retinopathy (DR) is a complication of diabetes and a relatively common fundus disease that can cause blindness. It is estimated that about 93 million people worldwide suffer from DR [1], and this number continues to rise. Since DR can be detected through morphological changes of the retina, many researchers have devoted themselves to disease detection based on fundus images to assist ophthalmologists in diagnosis. However, most methods are based on single-view fundus images, which leads to incomplete lesion information, so some researchers have chosen ultra-widefield (UWF) images as their study objects [2,3,4]. In clinical practice, compared with conventional single-view scanning images, UWF images cover a field of view of 180–200° and contain more regional information, which benefits accurate diagnosis. For instance, Nagasato et al. proposed a central retinal vein occlusion detection method based on UWF fundus images [5]. Pellegrini et al. presented a method for artery/vein classification using UWF images [6].

UWF fundus images of patients with DR and of normal subjects are shown in Fig. 1. There is little contrast between the lesion areas and the normal areas, and the lesions vary in size and are inhomogeneously distributed, which poses great challenges for the processing and analysis of UWF images.

Fig. 1. Display of UWF images from the two categories. The two images on the left are DR images, and the two on the right are normal images.

Deep learning performs well in medical image processing and analysis, and many researchers have applied it to the automatic diagnosis of various diseases. For example, Brown et al. designed an automatic detection method for retinopathy of prematurity using a deep convolutional network [7]. Li et al. used a deep learning method to detect DR and diabetic macular oedema [8]. Diaz-Pinto et al. presented a deep-learning-based method to detect and classify glaucoma [9]. Xie et al. used a cross-attention network for fundus disease classification [10]. However, lesion areas are often small relative to the background, so the background dominates the target area (especially in UWF images). Some researchers have tried to use more complex models or to design multiple networks to extract discriminative features. For instance, Hamwood et al. used a fully convolutional network to determine the positions of cone photoreceptors [11]. Ruan et al. achieved kidney tumor segmentation on CT images using a multi-branch feature sharing network [12]. Such networks have advantages in extracting global features but easily overlook important local information, which hinders the expression of discriminative detailed features. To address this, some researchers have turned to attention mechanisms [13,14,15,16]. An attention mechanism can capture more detailed information about the target, which is very helpful for detecting and discriminating lesions.

In this paper, a dual-branch network with ResNet-34 [17] as the backbone is proposed to extract features. We use an atrous spatial pyramid pooling (ASPP) module to enlarge the receptive field and obtain rich semantic information. Exploiting the spatial and channel dependencies between features, efficient channel attention (ECA) and spatial attention (SA) modules are applied to emphasize useful information after the multiple convolutional layers of the network. Finally, we use a global average pooling (GAP) layer to compress the features and concatenate the features from the two branches to complete the final classification task.

2 Methodology

The architecture of the proposed method is shown in Fig. 2. We propose a dual-branch network to extract features, enlarging the receptive field with ASPP to obtain richer semantic information. At the same time, the ECA and SA modules are employed to emphasize important information. After GAP compresses the features into discriminative vectors, the features from the two branches are concatenated and used for prediction. The modules are described in detail below.

Fig. 2. The architecture of the proposed method. The ResNet-34 model is the backbone and is used in both branches. The features extracted from the ASPP module are fed into the ECA and SA modules to model interactions among the high-dimensional features, and are then compressed by GAP.
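To make the fusion step in Fig. 2 concrete, below is a minimal PyTorch sketch of how the two branch outputs can be combined (our reading of the pipeline, not the authors' released code); the channel sizes and the number of classes are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the fusion head: GAP compresses each branch's feature
# map to a vector, the vectors are concatenated, and a linear layer predicts
# DR vs. normal. Channel sizes (512 each) and num_classes are assumptions.
class FusionHead(nn.Module):
    def __init__(self, channels_a=512, channels_b=512, num_classes=2):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Linear(channels_a + channels_b, num_classes)

    def forward(self, feat_a, feat_b):
        va = self.gap(feat_a).flatten(1)    # (N, channels_a)
        vb = self.gap(feat_b).flatten(1)    # (N, channels_b)
        return self.fc(torch.cat([va, vb], dim=1))
```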

2.1 Dual-Branch Network

Inspired by [9], we propose a dual-branch network to extract deep features. Each branch is based on the ResNet-34 model, which alleviates the vanishing gradient problem in deep networks through residual connections. It is worth noting that the first three layers of the two branches share parameters, which reduces the number of model parameters. The UWF images are fed into the dual-branch network, and the extracted rich deep semantic features are used in the subsequent modules.
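A minimal sketch of one plausible reading of this design is given below, taking the "first three layers" to be the convolutional stem plus the first three residual stages of ResNet-34, with each branch keeping its own randomly initialized final stage. The torchvision `pretrained` flag is used for brevity (newer torchvision versions use the `weights` argument instead).

```python
import torch.nn as nn
from torchvision.models import resnet34

# Sketch of the dual-branch backbone (an interpretation, not the authors'
# code): the stem and first three residual stages are shared and pretrained,
# while each branch keeps its own randomly initialized final stage.
class DualBranchBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        base = resnet34(pretrained=True)
        self.shared = nn.Sequential(          # shared, parameter-saving part
            base.conv1, base.bn1, base.relu, base.maxpool,
            base.layer1, base.layer2, base.layer3,
        )
        # Branch-specific final stages with random initialization.
        self.branch_a = resnet34(pretrained=False).layer4
        self.branch_b = resnet34(pretrained=False).layer4

    def forward(self, x):
        s = self.shared(x)                    # (N, 256, H/16, W/16)
        return self.branch_a(s), self.branch_b(s)  # each (N, 512, H/32, W/32)
```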

Fig. 3. Demonstration of the ASPP module. The ResNet-34 model extracts features and feeds them into the ASPP module. After feature extraction with different dilation rates, all features are concatenated, and the final output is obtained by a 1 × 1 convolution layer.

2.2 Atrous Spatial Pyramid Pooling Module

General neural networks use pooling operations to enlarge the receptive field and capture more spatial information while reducing the resolution, but this process loses some detailed information. To address this issue, some researchers have used atrous convolution (AC) instead of pooling; context information at different scales can be obtained by setting different dilation rates. Meanwhile, spatial pyramid pooling (SPP) [18] can detect complex objects across multiple fields of view, extract features from different angles, and then aggregate them with a 1 × 1 convolution, which improves detection accuracy. Combining the advantages of AC and SPP, an atrous spatial pyramid pooling (ASPP) module is added to one branch of the network to extract more context features, which further improves the classification performance on UWF images, since atrous convolution enlarges the receptive field without adding parameters. Specifically, we perform atrous convolutions with three different dilation rates and integrate all extracted features with a 1 × 1 convolution layer. The structure of the ASPP module is demonstrated in Fig. 3, in which we set the dilation rates to 1, 2, and 3. We choose small rates because a large dilation rate mainly benefits the detection of large objects, whereas the focal area of most fundus diseases is relatively small compared with the global background.
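A minimal PyTorch sketch of such an ASPP module is shown below, assuming 512 input and output channels (the ResNet-34 final-stage width); the batch-norm/ReLU placement is a common convention rather than a detail stated in the paper.

```python
import torch
import torch.nn as nn

# ASPP sketch: three parallel 3x3 atrous convolutions with dilation rates
# 1, 2, and 3, concatenated and fused by a 1x1 convolution.
class ASPP(nn.Module):
    def __init__(self, in_ch=512, out_ch=512, rates=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # 1x1 fusion

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```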

Fig. 4. The structure of the ECA and SA modules. After a GAP layer, ECA generates a channel attention weight matrix with a one-dimensional convolution of size K, where K is adaptively determined by the number of channels. The channel attention map is obtained by multiplying the channel attention weight matrix with the original feature map. After AP and MP, a convolutional layer and a sigmoid function integrate the concatenated features into the spatial attention weight matrix. Finally, the channel-refined map is multiplied by the spatial attention weights to produce the final output.

2.3 Efficient Channel Attention Module and Spatial Attention Module

Inspired by [19,20,21], we use efficient channel attention (ECA) and spatial attention (SA) to enhance the feature representation ability. The ECA and SA modules are shown in Fig. 4. We perform a GAP operation on the feature map and capture cross-channel information from each channel and its K neighbors, which is achieved by a one-dimensional convolution of size K. A matrix multiplication then integrates the output of the previous step with the original feature map. After that, we perform average pooling (AP) and max pooling (MP) operations, and the spatial attention coefficient matrix is obtained with a simple convolutional layer and a sigmoid operation. Finally, all features are integrated to form the output of this module. Moreover, an adaptive method for selecting the one-dimensional convolution kernel size in ECA is employed to find a suitable receptive region for the kernel. The formula is as follows:

$$ K = \phi (C) = \left|\frac{\log_{2} (C) + b}{\gamma }\right|_{odd}, $$
(1)

where K is the kernel size of the one-dimensional convolution, C is the number of channels, $|\cdot|_{odd}$ denotes the nearest odd number, and b and γ are manually set parameters, set to 1 and 2 respectively.
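The following is a minimal sketch of the two modules as described above. K follows Eq. (1), rounded to the nearest odd number so the one-dimensional convolution can be symmetrically padded; the 7 × 7 spatial-attention kernel follows the common CBAM convention and is an assumption.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention with adaptive 1-D kernel size per Eq. (1)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 == 1 else k + 1            # round to nearest odd size
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                    # GAP -> (N, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # mix each channel with K neighbors
        return x * torch.sigmoid(w)[:, :, None, None]

class SpatialAttention(nn.Module):
    """Spatial attention from channel-wise average and max pooling."""
    def __init__(self, kernel_size=7):            # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)         # AP over channels
        mx = x.max(dim=1, keepdim=True).values    # MP over channels
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                              # spatially refined output
```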

3 Experiments

3.1 Dataset and Implementation Details

The experiments are performed on ultra-widefield scanning laser ophthalmoscopy images, referred to as the UWF dataset. The dataset was obtained from a local hospital and includes four categories of images, of which we use the DR and Normal images here. We have 398 DR images and 948 Normal images, divided into training and testing data at a ratio of 3:1. The specific divisions of the training data and testing data are shown in Table 1. In addition, the images in the UWF dataset have a high resolution of 2600 × 2048, which makes it difficult to feed the original images directly into the network for training, so we resize them to 448 × 448. Meanwhile, random vertical and horizontal flipping are used for data augmentation in the training set to prevent overfitting. Accuracy, Precision, Recall, and F1-score are used to evaluate the classification performance.
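A sketch of this preprocessing with torchvision transforms is given below; the normalization statistics are an assumption (ImageNet values, since the backbone is ImageNet-pretrained) and are not stated in the paper.

```python
from torchvision import transforms

imagenet_norm = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

# Training pipeline: resize to 448x448 plus random horizontal/vertical flips.
train_tf = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    imagenet_norm,
])

# Testing pipeline: resize only, no augmentation.
test_tf = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    imagenet_norm,
])
```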

Table 1. The specific distribution of the training data and testing data.

We implement our work in PyTorch and use GPUs to accelerate training. The pre-trained ResNet-34 model is used for the first three layers of the network, and the last two layers are initialized randomly. During training, we choose Adam with default values as the optimizer. We set the maximum number of epochs to 80, the training batch size to 8, and the testing batch size to 1. In addition, the learning rate is 0.0001 and decays by 10% every 50 epochs.
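In PyTorch terms, this setup can be sketched as follows; the placeholder model stands in for the dual-branch network, and "decays by 10% every 50 epochs" is read here as multiplying the learning rate by 0.1 (use gamma=0.9 instead if a 10% reduction is meant).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the dual-branch network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam defaults
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(80):                # max epoch = 80
    # ... one pass over training batches of size 8, with optimizer.step() ...
    scheduler.step()                   # decay the learning rate on schedule
```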

3.2 Experimental Results

The experimental results of the different methods are shown in Table 2. We choose VGG16 [22], ResNet-34 [17], ResNet-50 [17], InceptionV3 [23], and DenseNet121 [24] for comparison. VGG16, ResNet-34, ResNet-50, and DenseNet121 are representative deep networks, while InceptionV3 is characterized by multiple branches and strong feature extraction ability, so these networks serve as good baselines.

From Table 2, we can see that among the backbone networks, InceptionV3 performs best, followed by ResNet-34, which suggests that a multi-branch design helps in the feature extraction stage. Rows 2, 3, and 5 show that the deeper networks do not perform as well as the others; in general, deeper networks overfit easily and are difficult to train. The proposed method extracts multi-scale features and uses the attention mechanism to strengthen the focus on important information. As Table 2 shows, our model performs well compared with the other backbone networks and achieves the highest accuracy of 98.82%.

Table 2. Performance of different models (%).

Ablation experiments are conducted to evaluate the modules in our model. Specifically, we select the following networks for comparison: the ResNet-34 model (ResNet34), the dual-branch model (DB), the dual-branch model with ASPP (DB-ASPP), the dual-branch model with attention modules (DB-AM), and the dual-branch network with all modules (Proposed). The experimental results are shown in Table 3. They show that adding the corresponding modules to the backbone slightly strengthens the classification performance. One exception is that the Recall of the dual-branch model with ECA and SA is lower than that of the backbone: both ECA and SA operate on deep features, so without strengthening the global feature extraction capability, using ECA and SA alone biases the classification results, causing a slight decrease in Recall.

Table 3. Evaluation of every module of our method (%).

To evaluate our method more intuitively, we plot receiver operating characteristic (ROC) curves and use the area under the curve (AUC) as the evaluation metric. The ROC curves are shown in Fig. 5, from which we can observe that our model performs very well.
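For reference, ROC curves and AUC values of this kind can be computed with scikit-learn as sketched below; the labels and scores shown are dummy values, with y_score standing for the model's predicted probability of the DR class.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]            # dummy ground-truth labels (1 = DR)
y_score = [0.1, 0.4, 0.35, 0.8]  # dummy predicted DR probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr, label="model")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```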

Fig. 5. Demonstration of the ROC curves. The AUC value of our proposed model is 0.9720, the highest among the compared methods.

4 Conclusions

We propose a novel dual-branch network for DR classification. The two branches use ResNet-34 as the backbone, which strengthens the feature extraction ability. ASPP enlarges the receptive field and integrates more features, and ECA and SA emphasize important information in the feature space to obtain more discriminative features. The experimental results show that our method performs better than other methods on the UWF dataset.