1 Introduction

Cataract is the leading cause of reversible blindness and vision impairment worldwide [5]. Early treatment can address vision impairment and restore vision, improving the cataract patient’s quality of life. According to the location of the opacities, cataracts are generally classified into three types: nuclear cataract (NC), cortical cataract (CC), and posterior subcapsular cataract (PSC). NC is the most common type of cataract, characterized by increased light scattering in the nucleus region of the crystalline lens. In clinical practice, slit-lamp images are routinely used to diagnose NC based on standard cataract classification systems. The lens opacity classification system III (LOCS III) [4] is a well-accepted slit-lamp image-based cataract classification system. As nuclear opacity progresses, nuclear cataract can be divided into three stages [15]: (1) normal: healthy or without nuclear opacity in the slit-lamp image; (2) mild (grade = 1 or 2 in LOCS III): the nuclear opacity is asymptomatic; (3) severe (grade \(\ge \)3 in LOCS III): the nuclear opacity is symptomatic. Mild NC can be relieved by clinical intervention, while patients with severe NC need to prepare for surgery as soon as possible. Figure 1 shows representative AS-OCT images of the nucleus area at the three stages.

Fig. 1. (a) The whole AS-OCT image, in which the center area is the nucleus. (b) Normal nucleus image. (c) Mild NC image, where the nuclear opacity is asymptomatic. (d) Severe NC image, where the nuclear opacity is symptomatic.

Anterior segment optical coherence tomography (AS-OCT) is a non-contact, high-resolution tomography technique that can objectively and quickly capture the entire lens. AS-OCT images have gradually been adopted for diagnosing various anterior segment ocular diseases such as glaucoma, cataracts, and keratitis [5]. For NC diagnosis, AS-OCT images capture the nucleus region clearly, while other ophthalmic images such as fundus images cannot. A clinical study has shown that the average lens density (ALD) of the nucleus region in AS-OCT images has a strong linear relationship with LOCS III-based nuclear grades [21], which provides clinical support for automatic cataract classification on AS-OCT images. Following [21], subsequent clinical studies [3, 14, 16, 20] reported similar statistical results. Motivated by these preliminary works, [27] studied NC classification on AS-OCT images using a convolutional neural network (CNN), but achieved poor performance.

Fig. 2. The different nuclear opacity stages as reflected in the OCT image. The histogram shows the sample distribution of the three nuclear cataract severity levels (different colors denote different NC stages).

Average nucleus density (AND) is a clinical indicator for nuclear cataract diagnosis on AS-OCT images, defined as the average pixel density in the nucleus region [21]. Figure 2 shows the distribution of AND at different stages of nuclear cataract. The AND distributions differ significantly among NC stages, yet for many images the severity of the cataract is difficult to determine from AND alone (see the overlap area in Fig. 2).
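To make the AND definition concrete, the following is a minimal NumPy sketch that averages the pixel values inside a nucleus mask. The function name, the boolean-mask representation of the nucleus region, and the toy image are assumptions made for illustration; the clinical measurement pipeline in [21] may differ.

```python
import numpy as np

def average_nucleus_density(image, nucleus_mask):
    """Mean pixel intensity inside the nucleus region.

    image: 2-D array of grey-level intensities (hypothetical AS-OCT slice).
    nucleus_mask: boolean array of the same shape, True inside the nucleus.
    Both argument conventions are assumptions, not taken from the paper.
    """
    image = np.asarray(image, dtype=float)
    nucleus_mask = np.asarray(nucleus_mask, dtype=bool)
    if image.shape != nucleus_mask.shape:
        raise ValueError("image and mask must have the same shape")
    if not nucleus_mask.any():
        raise ValueError("mask selects no pixels")
    return float(image[nucleus_mask].mean())

# Toy 4x4 "image": the central 2x2 block plays the role of the nucleus.
toy_image = np.array([[0, 0, 0, 0],
                      [0, 10, 20, 0],
                      [0, 30, 40, 0],
                      [0, 0, 0, 0]])
toy_mask = np.zeros((4, 4), dtype=bool)
toy_mask[1:3, 1:3] = True
and_value = average_nucleus_density(toy_image, toy_mask)  # (10+20+30+40)/4
```

A single scalar of this kind summarizes the whole nucleus, which is why images from neighboring stages can receive nearly identical AND values.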

In recent years, the channel attention mechanism, which directly learns an importance weight for each channel, has become one of the most popular attention mechanisms due to its simplicity and effectiveness. In a channel attention block, global average pooling (GAP) integrates channel-wise information by calculating the mean value of each channel. Because GAP collects the global mean value, it enhances the representation of global information, especially AND. Inspired by this relationship, we propose a simple yet effective gated channel attention network (GCA-Net) for automatic NC classification. In the GCA-Net, we design a novel gated channel attention block, in which a gating operator masks low-value features and a weakly-interacting operator models the global channel information.

The main contributions of this paper are as follows: (1) We develop a novel convolutional neural network (CNN) model named GCA-Net that discriminates opacity information to classify NC into three severity levels. (2) We design a simple yet effective gated channel attention (GCA) block comprising three stages (gating, squeezing, and interacting) to capture the global information. (3) The results on a clinical AS-OCT image dataset show that our GCA-Net surpasses state-of-the-art attention-based networks.

2 Related Work

2.1 Cataract Classification

In recent years, researchers have proposed many advanced machine learning and deep learning methods for automatic cataract classification on different ophthalmic image modalities [26]. [12] proposed an automatic NC classification system with three stages (region detection, pixel feature extraction, and level prediction) based on the ACHIKO-NC slit-lamp dataset and achieved an average error of 0.36. Xu et al. also performed NC classification on the ACHIKO-NC dataset using the group sparse regression (GSR) method and achieved 83.4% accuracy [25]; [24] proposed a semantic similarity method for slit-lamp image-based NC classification and obtained better performance than GSR. [1] achieved an accuracy of 95% using support vector machines (SVM) to classify NC on ultrasound images, but the ultrasound images used in that work were collected from animals. Li et al. achieved accurate cataract screening on fundus images by improving the Haar wavelet transform algorithm [2].

Compared with traditional machine learning methods, deep learning methods are better at capturing useful feature representations. Gao et al. proposed a hybrid model of a convolutional neural network (CNN) and a recurrent neural network (RNN) based on slit-lamp images and achieved 82.5% accuracy for NC classification [6]. A team at Sun Yat-sen University proposed a congenital cataract screening platform based on deep learning [13]. Xu et al. proposed a global-local hybrid CNN that fuses pathological information from different parts of the image and achieves better performance than previous methods on fundus images [23, 26].

There are relatively few NC classification studies on AS-OCT images. Some clinical studies have verified the reliability of AS-OCT images for NC classification based on LOCS III [3, 16, 21]. [27] made a preliminary attempt at NC classification on AS-OCT images using deep learning methods. We combine clinical and methodological research to propose our own method.

2.2 Attention Mechanism

Attention mechanisms have empowered CNN models and achieved state-of-the-art results on various learning tasks [19]. In general, attention mechanisms fall into two groups: channel attention and spatial attention. SENet [10] first proposed the channel attention mechanism. It performs GAP for channel squeeze, reconstructs the inter-dependencies of the channels through fully-connected (fc) layers, and finally applies a Sigmoid layer to generate a weight for each channel. GENet [9] introduces a learnable layer to better exploit the context feature, and FcaNet [19] diversifies the extracted features by extracting multi-band information. The Bottleneck Attention Module (BAM) [17] and the Convolutional Block Attention Module (CBAM) [22] combine the two attention mechanisms to obtain fused attention weights. To improve efficiency, ECANet [18] replaces the fully-connected layers of SENet with one-dimensional convolution layers.

3 Method

In this section, we first revisit the classical channel attention mechanism. Then we elaborate our GCA block in detail.

Fig. 3. A gated channel attention block.

3.1 Revisiting of Channel Attention

Channel attention is one of the most widely used attention modules in CNNs. It uses a learnable block to adjust the importance of each channel and enhance the feature representation ability of the model. Let \(X\in \mathbb {R}^{C\times W\times H}\) be the input feature tensor, where C denotes the number of channels, and H and W denote the height and width of the feature map, respectively. The output \(Y\in \mathbb {R}^{C\times W\times H}\) has the same shape as \(X\), with each channel re-weighted. SENet [10] is the most classic channel attention mechanism, consisting of a squeeze operation and an excitation operation. It can be written as:

$$\begin{aligned} Y = \mathbf {F}_{scale}(W_{att}, X), \end{aligned}$$
(1)
$$\begin{aligned} W_{att}=\mathbf {F}_{ex}(\mathbf {F}_{sq}(X)), \end{aligned}$$
(2)

where \(W_{att}\in \mathbb {R}^{C}\) is the channel attention weight, \(\mathbf {F}_{scale}\) refers to channel-wise multiplication, \(\mathbf {F}_{sq}\) represents the squeeze function (GAP in SENet), and \(\mathbf {F}_{ex}\) is the excitation function that transforms the squeezed information into attention weights. Generally, the squeeze step compresses channel information, and the excitation step calculates the channel weights \(W_{att}\). The first step usually uses a parameter-free function such as global average pooling (GAP) [10] or global max pooling (GMP) [22] to compute channel-statistics information. The second step adopts fc layers to reconstruct inter-channel dependencies.
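As an illustration of Eqs. (1) and (2), here is a minimal NumPy sketch of an SE-style forward pass. The reduction ratio `r`, the ReLU between the two fc layers, the omission of biases, and the random weights are assumptions chosen for the sketch, not details taken from this paper.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def se_block(x, w1, w2):
    """SE-style channel attention on one feature tensor.

    x:  (C, W, H) input feature tensor.
    w1: (C // r, C) weights of the first fc layer (bias omitted, assumption).
    w2: (C, C // r) weights of the second fc layer (bias omitted, assumption).
    """
    z = x.mean(axis=(1, 2))            # F_sq: global average pooling -> (C,)
    hidden = np.maximum(w1 @ z, 0.0)   # first fc layer followed by ReLU
    w_att = sigmoid(w2 @ hidden)       # second fc layer + Sigmoid -> (C,)
    y = x * w_att[:, None, None]       # F_scale: channel-wise multiplication
    return y, w_att

# Random placeholder tensors; the sizes are arbitrary.
rng = np.random.default_rng(0)
C, W, H, r = 8, 5, 5, 4
x = rng.normal(size=(C, W, H))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
y, w_att = se_block(x, w1, w2)
```

The two weight matrices are the only learnable parts here; they are what a parameter-free interaction step would remove.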

In this paper, we find that the dependency among channel-statistics information is weak and that fc layers do not work well for AS-OCT image-based NC classification. We attribute this to the fact that AND is an important indicator for NC diagnosis on AS-OCT images. Hence, we design a simple yet effective channel attention block, named the gated channel attention (GCA) block, which is introduced in the next section.

3.2 Gated Channel Attention Block

Figure 3 shows the diagram of the structure of a gated channel attention (GCA) block, which comprises three stages: gating, squeezing, and interacting.

Gating: To suppress the redundant features in a feature map, we devise a gated unit to mask the irrelevant features. According to the clinical studies in Sect. 2.1, higher-density regions are more relevant to cataract. To this end, we propose a high-value gate that masks the influence of low values. It is an adaptive threshold function that uses the global average value of each feature map as the threshold. This choice is motivated by [11], which demonstrated that pooling values below the average suppress neuron activations in a CNN model. Formally, the gated tensor \(X'\in \mathbb {R}^{C\times W\times H}\) is generated by masking the low values of the input tensor \(X\in \mathbb {R}^{C\times W\times H}\), such that the element at position \((i, j)\) of the \(c\text {-}th\) channel is given by:

$$\begin{aligned} (X'_{c})_{ij}=\mathbf {F}_{gating}(X_{c})_{ij}=Max(Mean(X_{c}), (X_c)_{ij}), \end{aligned}$$
(3)

where the Mean function calculates the mean value of the feature map \(X_c\), and the Max function returns the larger of its two arguments.
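Eq. (3) can be sketched in a few lines of NumPy. The function name `high_value_gate` and the toy tensor are illustrative assumptions; the only logic used is the element-wise maximum with the per-channel mean stated in the equation.

```python
import numpy as np

def high_value_gate(x):
    """Apply Eq. (3) to a (C, W, H) tensor.

    Every value below its channel mean is raised to that mean, so only
    above-average activations keep their original magnitude.
    """
    channel_mean = x.mean(axis=(1, 2), keepdims=True)  # shape (C, 1, 1)
    return np.maximum(channel_mean, x)

# One toy channel with mean (1 + 2 + 3 + 6) / 4 = 3.
x = np.array([[[1.0, 2.0],
               [3.0, 6.0]]])
gated = high_value_gate(x)  # values 1 and 2 are replaced by the mean 3
```

Because the threshold is recomputed from each feature map, the gate adapts to the input and introduces no learnable parameter.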

Squeezing: We use a squeezing operator to follow the gating operator, which is used to compute the channel-statistics feature information from each channel. This paper uses global average pooling (GAP) as squeezing operator, equivalent to the AND indicator for NC diagnosis. It can be written as follows:

$$\begin{aligned} z_{c}=\mathbf {F}_{GAP}(X'_c)=\frac{1}{W\times H}\sum _{i=1}^{W}\sum _{j=1}^{H}(X'_c)_{ij}, \end{aligned}$$
(4)

where \(z_{c}\) denotes the output of GAP in \(c\text {-}th\) channel.
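As a small worked illustration (not part of the original derivation), Eqs. (3) and (4) can be combined. Writing \(m_c=Mean(X_c)\) and using \(Max(m, x)=m+Max(0, x-m)\) gives

$$\begin{aligned} z_{c}=\frac{1}{W\times H}\sum _{i=1}^{W}\sum _{j=1}^{H}Max(m_c, (X_c)_{ij})=m_c+\frac{1}{W\times H}\sum _{i=1}^{W}\sum _{j=1}^{H}Max(0, (X_c)_{ij}-m_c)\ge m_c. \end{aligned}$$

For a toy \(2\times 2\) channel with values 1, 2, 3, and 6, \(m_c=3\), the gated values are 3, 3, 3, and 6, and \(z_c=15/4=3.75\). Under this reading, the squeezed value is the plain channel mean plus the average above-mean excess.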

In the experiments, we test the effects of different pooling operators.

Fig. 4. The schema of the SE residual unit (left) and the GCA residual unit (right).

Interacting: In the third stage, we propose a weakly interacting operator that constructs weak inter-channel dependencies and sets the relative weights of the channels. The fully-connected operator was the first method proposed for channel interaction in a channel attention block. However, it increases model complexity, and [18] simplifies the interacting stage using local connections. We further reduce the interaction complexity and achieve channel interaction based on a Softmax function. We use the following formulation to obtain the attention weights:

$$\begin{aligned} (W_{att})_c=Softmax(z)_{c} = \frac{e^{z_{c}}}{\sum _{i=1}^{C}e^{z_i}}, \end{aligned}$$
(5)

where \(W_{att}\) is the channel attention weight, as in Eq. (2).

As shown in Eq. (5), the attention weight \((W_{att})_c\) of each channel is obtained from the dependency between a single channel (\(z_c\)) and all channels (z). Thus, the Softmax function can be regarded as a weak connection among channels. In contrast, Sigmoid computes each channel weight independently, without any interaction. In the experiments, we compare these two interaction methods.
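The difference between the two interaction choices can be checked numerically. The sketch below, with made-up squeezed values, changes only the last channel and observes the weight of the first channel under Softmax and under Sigmoid.

```python
import numpy as np

def softmax(z):
    shifted = z - z.max()      # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical squeezed vectors; only the last channel differs.
z = np.array([1.0, 2.0, 3.0])
z_changed = np.array([1.0, 2.0, 5.0])

soft_a, soft_b = softmax(z), softmax(z_changed)
sig_a, sig_b = sigmoid(z), sigmoid(z_changed)

# Softmax: raising one channel lowers the weights of all the others.
first_weight_drops = soft_b[0] < soft_a[0]
# Sigmoid: the first channel's weight ignores the other channels.
first_weight_same = sig_a[0] == sig_b[0]
```

Softmax weights always sum to one, so channels compete for a fixed budget; this coupling is the only interaction among channels in Eq. (5).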

The final output of the GCA block is obtained by rescaling \(X'\) with the channel weights \(W_{att}\):

$$\begin{aligned} Y_c=F_{scale}(X'_c, (W_{att})_c)=(W_{att})_cX'_c, \end{aligned}$$
(6)

where \(Y_c\) is the \(c\text {-}th\) channel of the final output, and \(\mathbf {F}_{scale}(X'_c, (W_{att})_c)\) is a channel-wise multiplication between the weight \((W_{att})_c\) and the feature map \(X'_c\).
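Putting Eqs. (3) to (6) together, a forward pass of the block can be sketched as follows. This is a NumPy illustration for a single feature tensor under the equations as written; it is not the authors' implementation, and batching and the residual connection of Fig. 4 are left out.

```python
import numpy as np

def gca_block(x):
    """Forward pass of a gated channel attention block on a (C, W, H) tensor.

    Returns the rescaled tensor and the channel weights.
    """
    channel_mean = x.mean(axis=(1, 2), keepdims=True)
    x_gated = np.maximum(channel_mean, x)      # Eq. (3): gating
    z = x_gated.mean(axis=(1, 2))              # Eq. (4): squeezing with GAP
    e = np.exp(z - z.max())                    # stable Softmax
    w_att = e / e.sum()                        # Eq. (5): interacting
    y = x_gated * w_att[:, None, None]         # Eq. (6): rescaling X'
    return y, w_att

# Random placeholder input; the sizes are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 3))
y, w_att = gca_block(x)
```

Every step is a fixed function of the input, so the sketch contains no learnable weights.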

Discussion: To demonstrate the effectiveness of our GCA block, we use ResNet18 and ResNet34 as the backbone networks, for two reasons: (1) ResNet is a universal backbone, and ResNet18 and ResNet34 have low computational cost; (2) most attention blocks have been verified to be effective on the ResNet backbone. The final GCA-Net is built by stacking the GCA units shown in Fig. 4 (right).

4 Experiments

4.1 Dataset and Evaluation Measures

We use a clinical AS-OCT image dataset collected with the CASIA2 ophthalmology device (Tomey Corporation, Japan). The original AS-OCT image is shown in Fig. 1(a). However, only the nucleus area is associated with NC classification [21], so we manually extracted the nucleus part of the whole AS-OCT image, as shown in Fig. 1(b)(c)(d).

The AS-OCT image dataset contains 17200 AS-OCT images from 543 participants with an average age of 61.3±18.7 years (range: 14 to 95); among the participants with gender information, 135 are male and 335 are female. Images were collected from one or both eyes of each participant, and the total number of collected eyes is 860 (440 left eyes and 420 right eyes). Each eye has 20 AS-OCT images, and we discarded 999 images without a complete nucleus region due to eyelid occlusion during collection. Finally, we use 16201 AS-OCT images for NC classification.

We divide the dataset by participant into three disjoint subsets: training, validation, and testing. Table 1 summarizes the distribution of the three NC stages across these subsets.
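A participant-level split keeps all images of one person in the same subset, which avoids leakage between training and testing. The sketch below shows one way to do this with the standard library. The record layout, the function name, and especially the 60/20/20 ratios are hypothetical, since the split proportions are not stated here.

```python
import random

def split_by_participant(image_records, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split (participant_id, image_id) records into disjoint subsets.

    The ratios are hypothetical placeholders, not the paper's values.
    Participants, not images, are shuffled and partitioned.
    """
    participants = sorted({pid for pid, _ in image_records})
    random.Random(seed).shuffle(participants)
    n_train = int(ratios[0] * len(participants))
    n_val = int(ratios[1] * len(participants))
    groups = {
        "train": set(participants[:n_train]),
        "val": set(participants[n_train:n_train + n_val]),
        "test": set(participants[n_train + n_val:]),
    }
    return {name: [rec for rec in image_records if rec[0] in pids]
            for name, pids in groups.items()}

# Toy data: 10 participants with 2 images each.
records = [(pid, img) for pid in range(10) for img in range(2)]
splits = split_by_participant(records)
train_ids = {pid for pid, _ in splits["train"]}
val_ids = {pid for pid, _ in splits["val"]}
test_ids = {pid for pid, _ in splits["test"]}
```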

Table 1. The AS-OCT image distribution of NC stages on different datasets.

We resize the nucleus images to \(224\times 224\) and apply random rotation and random horizontal flipping for data augmentation. All models are implemented in PyTorch and trained on a TITAN V GPU with 12 GB of memory. We use the stochastic gradient descent (SGD) optimizer with a batch size of 64. The initial learning rate is set to 0.0015 and is decreased by a factor of 10 every 10 epochs after 100 epochs.
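The learning-rate schedule can be read in more than one way. The sketch below encodes one interpretation, stated as an assumption: the rate stays at 0.0015 for epochs 0 to 99 (0-indexed), is divided by 10 at epoch 100, and is divided by 10 again every 10 epochs thereafter. The function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.0015, hold_epochs=100, step=10, factor=10.0):
    """Learning rate at a given 0-indexed epoch.

    Assumed reading of the schedule: constant for the first `hold_epochs`
    epochs, then divided by `factor` once every `step` epochs, with the
    first division applied at epoch `hold_epochs`.
    """
    if epoch < hold_epochs:
        return base_lr
    n_decays = (epoch - hold_epochs) // step + 1
    return base_lr / factor ** n_decays

# Rates at a few epochs under this reading of the schedule.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 99, 100, 109, 110)}
```

If the first division is instead meant to happen at epoch 110, dropping the `+ 1` in `n_decays` gives that variant.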

We use three commonly-used evaluation metrics: \(Acc\), \(F1\) and \(Kappa\) value to evaluate the performance of the model [7]. The calculation formulas are as follows:

$$\begin{aligned} Acc = \frac{TP+TN}{TP+FP+TN+FN}, \end{aligned}$$
(7)
$$\begin{aligned} Recall = \frac{TP}{TP+FN}, \end{aligned}$$
(8)
$$\begin{aligned} Precision = \frac{TP}{TP+FP}, \end{aligned}$$
(9)
$$\begin{aligned} F1 = 2\times \frac{Recall\times Precision}{Recall+ Precision}, \end{aligned}$$
(10)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
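For a three-class problem, Eqs. (7) to (10) are typically evaluated per class in a one-vs-rest fashion. The sketch below does this from a confusion matrix and then macro-averages F1. The averaging scheme and the toy confusion matrix are assumptions, since the aggregation used for the reported numbers is not specified here.

```python
import numpy as np

def per_class_metrics(confusion):
    """One-vs-rest Acc, Recall, Precision and F1 for each class.

    confusion[i, j] = number of samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    results = []
    for k in range(confusion.shape[0]):
        tp = confusion[k, k]
        fn = confusion[k, :].sum() - tp
        fp = confusion[:, k].sum() - tp
        tn = total - tp - fn - fp
        acc = (tp + tn) / total                                  # Eq. (7)
        recall = tp / (tp + fn) if tp + fn else 0.0              # Eq. (8)
        precision = tp / (tp + fp) if tp + fp else 0.0           # Eq. (9)
        denom = recall + precision
        f1 = 2 * recall * precision / denom if denom else 0.0    # Eq. (10)
        results.append({"acc": acc, "recall": recall,
                        "precision": precision, "f1": f1})
    return results

# Made-up confusion matrix for the classes normal, mild, severe.
toy_confusion = [[8, 2, 0],
                 [1, 7, 2],
                 [0, 1, 9]]
metrics = per_class_metrics(toy_confusion)
macro_f1 = sum(m["f1"] for m in metrics) / len(metrics)
```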

$$\begin{aligned} Kappa=\frac{p_{0}-p_{e}}{1-p_{e}}, \end{aligned}$$
(11)

where \(p_{0}\) is the relative observed agreement among raters, and \(p_{e}\) is the hypothetical probability of chance agreement. Furthermore, we use \(\#P\) to denote the number of parameters and GFLOPs [10] to measure the computation.
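The Kappa value can be computed directly from a confusion matrix. The sketch below uses the standard Cohen's kappa definition, \((p_0-p_e)/(1-p_e)\), with \(p_e\) estimated from the row and column marginals; the confusion matrix is made up for illustration.

```python
import numpy as np

def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i, j] = number of samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p0 = np.trace(confusion) / total                   # observed agreement
    row_totals = confusion.sum(axis=1)
    col_totals = confusion.sum(axis=0)
    pe = (row_totals * col_totals).sum() / total ** 2  # chance agreement
    return (p0 - pe) / (1.0 - pe)

# Made-up confusion matrix: p0 = 24/30 = 0.8 and pe = 300/900 = 1/3.
toy_confusion = [[8, 2, 0],
                 [1, 7, 2],
                 [0, 1, 9]]
kappa = cohen_kappa(toy_confusion)  # (0.8 - 1/3) / (1 - 1/3) = 0.7
```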

Table 2. Comparison with state-of-the-art attention blocks.

4.2 Comparison with State-of-the-Art Attention Blocks

Table 2 compares the proposed GCA block with state-of-the-art attention blocks on ResNet18 and ResNet34. Our GCA-Net achieves the best NC classification results among all methods. It obtains accuracies of 94.24% and 94.31% on the two backbones, respectively, and outperforms state-of-the-art attention blocks by more than 3% in accuracy. It also consistently improves on the other methods in F1 and Kappa value, demonstrating the effectiveness of the proposed GCA-Net. Moreover, the number of parameters of GCA-Net equals that of the plain ResNets and is smaller than that of SENet and CBAM. Our GCA-Net also adds no additional GFLOPs compared with other state-of-the-art attention methods. Overall, our GCA-Net achieves a better trade-off between accuracy and complexity.

4.3 Ablation Study

Table 3. Effects of pooling operators in GCA based on ResNet18 ( denotes using gating operator before squeezing and denotes not).

Effects of Different Pooling Operators. Table 3 shows the classification results of three different pooling operators in the GCA block based on ResNet18. Compared with global max pooling and global std pooling, the GAP achieves the best results on three evaluation measures. This is because GAP can be taken as another representation of average nucleus density (AND) from the nucleus region. Furthermore, the results also demonstrate that the gating operator significantly improves the classification results for the GCA block.

Table 4. Classification results of channel interaction operators in GCA block based on ResNet18.

Effect of Different Channel Interactions. Table 4 presents the classification results of four interaction operations: full connection, local connection, no connection (Sigmoid), and weak connection (Softmax). Our weak-connection interaction obtains the best classification results among the four. Two reasons can explain this: (1) the Softmax operation not only sets the relative weights of the channels but also suppresses the unimportant channels; (2) inter-channel dependencies are weak, and it is difficult to build good dependencies among channels during training.

5 Conclusion

This paper proposes a simple yet effective gated channel attention network named GCA-Net to automatically classify the severity levels of nuclear cataract on AS-OCT images. In the GCA-Net, we design a gated channel attention (GCA) block that masks redundant features and uses a Softmax layer to set relative weights for all channels, motivated by the clinical study of average nucleus density (AND). The results on a clinical AS-OCT image dataset demonstrate that our GCA-Net achieves the best classification performance and outperforms advanced attention-based CNN models. Moreover, the computational complexity of our GCA-Net is equal to that of previous methods, which indicates that our method has the potential to be deployed on real devices.

In the future, we will collect more AS-OCT images to verify the overall performance of the GCA-Net and plug the GCA block into other CNN models to test its effectiveness.