1 Introduction

The development of better image acquisition techniques, together with the reduced cost of storage and computation, has led to advances in medical image analysis. Optical coherence tomography (OCT) is increasingly used to diagnose ocular diseases of the human retina. Textural and morphological variations in the retinal layers captured by OCT imaging indicate abnormalities such as diabetic macular edema (DME), age-related macular degeneration (AMD), glaucoma, diabetic retinopathy, and choroidal neovascularization (CNV) [46]. These diseases are among the leading causes of irreversible vision loss.

With the availability of large amounts of data and image augmentation techniques, it is possible to implement convolutional neural network (CNN) based models to build better computer-aided diagnosis systems for ocular diseases. However, medical image datasets usually suffer from the class imbalance problem: an unequal class distribution biases the model towards the majority class and degrades the classification performance of CNN models. This work aims to develop an ocular disease identification algorithm for the classification of OCT images using a CNN. The novelty of this work lies in developing a lightweight CNN that needs fewer parameters to be trained and is also capable of dealing with class imbalance in ocular disease classification. The feature maps learned by the CNN carry important localization and classification information, which can be harnessed by using a global average pooling (GAP) layer as the last layer of the architecture. The GAP layer enables the computation of class activation maps (CAMs), which highlight the image features responsible for a particular disease class. Class weight balancing (CWB) is used to address the class imbalance problem in the proposed method. The proposed method is implemented on two publicly available datasets, ZhangLab and Duke. The ZhangLab dataset consists of 83,484 OCT images for classification into CNV, DME, drusen, and normal classes, and the Duke dataset comprises 3,231 OCT images for classification into AMD, DME, and normal classes.

The work is organized into five sections. The state-of-the-art methods for the classification of OCT images are summarized in Section 2. The proposed methodology for ocular disease classification using class weight balancing in CNN is described in Section 3. Results are discussed in Section 4 and compared with other state-of-the-art methods. Finally, the work is concluded in Section 5.

2 Related work

Ocular disease classification using OCT images is a recently explored research problem in medical image processing. OCT imaging is a non-invasive technique that uses light waves to capture cross-sectional images of the retina [11]. OCT imaging technology was first introduced by Huang et al. [20] to capture important retinal structures such as the choroid, retinal pigment epithelium, posterior vitreous, and retina. OCT images are mainly captured through time-domain (TD) and spectral-domain (SD) imaging techniques. TD-OCT is the conventional technique for acquiring retinal images, whereas SD-OCT is a more recent advancement. The SD-OCT technique provides three times the depth estimation of TD-OCT technology [8], which has made SD-OCT a popular imaging technique for diagnosing AMD, CNV, and DME.

In the literature, OCT classification methods are divided into two main categories: hand-crafted feature extraction and deep learning-based methods. The techniques proposed for OCT image classification are mainly implemented on two publicly available datasets, namely Duke [44] and ZhangLab [28], for three- and four-class classification of ocular diseases. The first method on the Duke dataset was proposed by Srinivasan et al. [44] using histogram of oriented gradients (HOG) features and a support vector machine (SVM) classifier. Hussain et al. [21] used retinal layer segmentation based features to capture the disease-related abnormalities indicated by changes in the retinal layers. The Canny edge detection algorithm was used to segment the retinal layer structure; features were then extracted from the segmented layers and classified into AMD, DME, and normal classes with a random forest classifier. Venhuizen et al. [48] extracted interest points by thresholding the top 3% of values of the first-order vertical Gaussian gradient of OCT images of the Duke dataset [44]. Principal component analysis (PCA) was used to reduce the dimensionality of the features, and unsupervised learning was used to classify the final feature vector into disease class labels. A local binary pattern (LBP) feature based method was proposed by Lemaitre et al. [31] in combination with the bag-of-words technique for classification using an SVM classifier. A CAD system for OCT images of the Duke dataset was developed using a multi-scale convolutional mixture of experts (MCME) ensemble model [40]. The success of hand-crafted feature extraction techniques lies in finding a suitable trade-off between classification accuracy and computational cost. Hand-crafted features represent selected characteristics of a particular set of data, so important features relevant to the task may not be included in the final feature vector produced by a particular hand-crafted feature extraction technique [37].

Recently, deep learning-based methods have become a growing trend in data analysis and computer vision [15]. In particular, CNNs have proven to be a powerful tool for medical imaging tasks like segmentation and classification. The methods proposed for the classification of OCT images into ocular disease classes include transfer learning from a network trained on a general image dataset, combining hand-crafted features with learned features, and training a network from scratch [7, 15]. Qingee et al. [23] proposed a transfer learning-based approach using the InceptionV3 model, extracting middle-level features and training them with a new CNN architecture; the trained model was evaluated on the Duke dataset for ocular disease classification. Another transfer learning-based method, using GoogLeNet, was proposed by Karri et al. [25], where the network was fine-tuned on OCT images of the Duke dataset. A wavelet-based convolutional neural network (WCNN) was proposed to generate representative OCT CNN-codes in the spatial-frequency domain, and the presence of abnormalities was scored over these features [39]. Najeeb et al. [36] extracted the region of interest of OCT images using binary thresholding and applied a computationally inexpensive single-layer CNN to classify OCT images from the ZhangLab dataset. Kermany et al. [27] proposed a transfer learning-based method for the classification of OCT images of the ZhangLab dataset into four ocular disease classes. Alqudah [1] proposed a 19-layer CNN architecture to classify SD-OCT images, obtained from ZhangLab and a dataset released by Duke University, into five disease categories. Ibrahim et al. [22] implemented a hybrid architecture where region-of-interest based and CNN layer based features are concatenated to classify OCT images of the ZhangLab dataset. Thomas et al. [47] proposed a multi-scale CNN architecture, implemented on four publicly available datasets, to classify OCT images into AMD and normal classes. Sunija et al. [45] implemented a lightweight CNN architecture on an earlier version of the ZhangLab dataset. Wu et al. [49] implemented an attention-based CNN model using the ResNet50 architecture to extract features for the ocular classes of ZhangLab and a private dataset.

The sample distribution in an image dataset is an important property to consider in classification problems. If the number of samples in one class is significantly higher than in the other classes, the dataset is termed imbalanced [3]. Class imbalance is a relatively less explored yet important research problem in medical image analysis. Generally, medical datasets contain a higher number of normal samples (majority class) than diseased samples (minority class) [12]. The same is true for OCT datasets like Duke [44], ZhangLab [28], OCTID [13], and EUGENDA [48]. An imbalance in the training set results in over-classification of the majority class due to its higher prior probability, leading to misclassification of the minority class and lower overall prediction accuracy [24]. Various techniques, such as sampling based methods [16], cost-sensitive learning [29], and active learning [9], have been implemented to reduce the effect of class imbalance in medical image analysis. Sampling based methods include random undersampling of the majority class, oversampling of the minority class, and the synthetic minority over-sampling technique (SMOTE) [4, 35]. Instead of resampling the data to create a balanced dataset, cost-sensitive learning defines the cost of misclassifying images of different classes [34]; this approach penalizes misclassification of the minority class by giving more priority to its samples [29]. Random oversampling of minority classes is equivalent to class weight balancing [6]. The advantage of class weight balancing over image augmentation lies in the reduced training time: adding class weights does not affect the number of training parameters, the number of floating-point operations, the size of the input data generator, or the time per epoch, whereas data augmentation increases the training dataset size and therefore the training time. Active learning is an iterative approach, suited to small training sets, that selects data from unlabeled or synthesized samples and thereby reduces the effort experts spend labeling medical data [43].

Since publicly available OCT datasets generally have the normal class as the majority class, the class imbalance issue needs to be addressed for OCT image classification. The contribution of this work lies in embedding cost-sensitive-learning-based CWB in a lightweight CNN that is trained from scratch. The performance of the proposed model is evaluated not only on the ZhangLab and Duke datasets individually but also through cross-dataset validation between them. The final results are compared with the state-of-the-art OCT classification methods discussed in this section.

3 Materials and methods

3.1 Datasets used for experimentation

Creating a well labeled, balanced, and large image dataset is one of the major challenges in the medical domain. Expert annotation is expensive, and diseased images are scarce compared to normal images [15]. The OCT datasets of Zhang et al. [28] (ZhangLab) and Srinivasan et al. [44] (Duke) are used to implement and evaluate the proposed method. Kermany et al. [27] introduced the well labeled and suitably large OCT dataset named ZhangLab. The ZhangLab dataset was created at the Shiley Eye Institute of the University of California San Diego, the California Retinal Research Foundation, Medical Center Ophthalmology Associates, the Shanghai First People’s Hospital, and Beijing Tongren Eye Center. The SD-OCT images in the ZhangLab dataset were captured using the Spectralis OCT imaging protocol, and the images were graded and verified for quality and disease labels. The dataset comprises a total of 83,484 images: 37,216 CNV, 11,348 DME, 8,617 drusen, and 26,316 normal images. Sample images from each class are shown in Fig. 1. The numbers of CNV and normal images are higher than those of the DME and drusen categories; this class imbalance tends to reduce the performance of classification models. The dataset contains 1,000 separate images for testing purposes, with 250 images from each category.

Fig. 1 OCT image examples from the ZhangLab dataset [28], where red arrows indicate abnormalities: (a) choroidal neovascularization (CNV), (b) diabetic macular edema (DME), (c) drusen, and (d) normal image

Similar to the ZhangLab dataset, the Duke dataset comprises SD-OCT volumetric scans of 45 patients captured with Spectralis SD-OCT imaging at Duke, Harvard, and Michigan Universities. The scans are divided into three classes with 15 volumes each of normal, dry AMD, and DME patients. A sample image from each class is shown in Fig. 2. Each volume contains a varying number of B-scans (31–97), giving a total of 3,231 OCT images in the dataset: 723 AMD, 1,101 DME, and 1,407 normal images. The Duke dataset is relatively small and might not be suitable for deep learning on its own, but the number of images in each class is relatively balanced compared to the ZhangLab dataset.

Fig. 2 OCT image examples from the Duke dataset [44], where red arrows indicate abnormalities: (a) age-related macular degeneration (AMD), (b) diabetic macular edema (DME), and (c) normal image

3.2 Convolutional Neural Networks (CNN)

A CNN is a specialized neural network for processing data with a grid-like structure, e.g., time-series data and images. The CNN was first introduced by LeCun et al. [30] to classify handwritten digit images. A CNN employs the mathematical spatial convolution operation in at least one of its layers [14]. The basic architecture of a CNN comprises convolutional (Conv), pooling, rectified linear unit (ReLU), fully connected (FC), batch normalization (BN), and global pooling (GP) layers. CNN architectures incorporate techniques like local receptive fields, weight sharing, and sub-sampling [2]. The convolutional filters are randomly initialized and learned by training the CNN, which gives CNNs a considerable advantage over hand-crafted feature based methods. Convolution combined with neural networks provides high performance in computer vision applications like segmentation, feature extraction, denoising, enhancement, and classification [15]. Although deep learning methods improve classification performance, the interpretability of the models is reduced: the prediction relies on millions of parameters, which makes it difficult to interpret compared to hand-crafted features and classical machine learning classifiers. Interpretability is crucial in medical imaging applications to validate the correctness of trained models. Zhou et al. [50] proposed a technique for obtaining activation maps through class-specific output weights, called class activation mapping (CAM), in which the important regions of an image are projected back by applying the weights of the final layer of the CNN to the last convolutional feature maps.

3.3 The proposed CWB based CNN architecture

Based on traditional CNN architectures, a lightweight CNN model is proposed to classify OCT images into three and four classes of ocular diseases. The OCT images are resized to 256 × 256 × 3 for efficient implementation of the CNN architecture, and each pixel is normalized to bring the intensities to a common scale. As depicted in Fig. 3, the proposed CNN architecture comprises four convolutional layers with 64, 64, 128, and 256 filters in the subsequent layers. Each convolutional layer has a 3 × 3 kernel with the ReLU activation function. The number of feature maps per layer is selected based on architectures in the literature that have shown state-of-the-art performance: LeNet, AlexNet, VGG16, and InceptionV3 all follow the trend of increasing the number of feature maps (filters) in subsequent layers. In the retinal image analysis domain, a lightweight or shallow CNN capable of good classification accuracy is important, so this trend is used to design the four-layer lightweight CNN architecture for OCT classification. The convolutional layers are combined with batch normalization and max-pooling layers.

Fig. 3 Proposed class weight balancing based CNN for ophthalmological disease classification using OCT images

A fully connected layer is usually used as the final layer of a CNN architecture for classification, producing predicted probabilities for each class label. However, fully connected layers do not give any insight into which image features or regions are responsible for the prediction. The global average pooling (GAP) layer is more native to the convolutional structure of the CNN and hence enforces a good correspondence between feature maps and classes [33], which is the main reason the GAP layer lends itself to the construction of class-specific activation maps. The GAP layer also contributes no trainable parameters, which helps reduce the chance of overfitting. Therefore, a GAP layer is used as the last layer of the proposed CNN architecture in place of a fully connected layer [18]. Each feature map constructed by the last convolutional layer is averaged to obtain a single value that serves as a feature for classifying OCT images. For this reason, a large number of filters, i.e., 256 of size 28 × 28, is used in the last convolutional layer so that important information is not lost during the GAP step.
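For concreteness, a minimal Keras sketch of the described architecture is given below. The padding, pooling, and layer-ordering details are assumptions not stated above, so the spatial size of the final feature maps in this sketch need not match the reported 28 × 28 exactly.

```python
# A minimal sketch of the proposed lightweight CNN: four Conv-BN-MaxPool
# blocks with 64, 64, 128, and 256 filters (3x3 kernels, ReLU), a GAP layer,
# and a softmax dense layer. Padding and layer ordering are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_oct_gap_model(num_classes: int = 4) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(256, 256, 3)),
        layers.Rescaling(1.0 / 255),              # pixel intensity normalization
    ])
    for filters in (64, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.GlobalAveragePooling2D())    # one feature value per map
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```

Note that the Keras defaults (glorot_uniform kernels and zero biases) already match the Xavier uniform initialization described in Section 3.4.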

Further, the proposed architecture uses CWB to address the issue of class imbalance. In this technique, higher weights are assigned to classes with fewer OCT image samples and vice versa. The class weight for each class is calculated using (1), given by King et al. [29]:

$$ W_{j} = \frac{N}{k \cdot N_{j}} $$
(1)

where Wj is the weight of class j, N is the total number of OCT images, k is the total number of classes, and Nj is the number of OCT images in class j. The class weights are applied during CNN weight learning and penalize the misclassification of minority-class samples into the majority class. The final features obtained through the GAP layer are fed to a dense layer for classification using the softmax activation function.
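As a sketch, the weights of (1) can be computed directly from the per-class image counts and passed to Keras as a class-weight dictionary; scikit-learn's compute_class_weight with the 'balanced' option implements the same formula. The class ordering below is an assumption.

```python
# Class weights per Eq. (1): W_j = N / (k * N_j).
def compute_class_weights(counts: dict) -> dict:
    n_total = sum(counts.values())
    k = len(counts)
    return {cls: n_total / (k * n) for cls, n in counts.items()}

# Per-class image counts quoted in Section 3.1 for ZhangLab
# (assumed class order: 0=CNV, 1=DME, 2=drusen, 3=normal).
zhanglab_counts = {0: 37_216, 1: 11_348, 2: 8_617, 3: 26_316}
class_weights = compute_class_weights(zhanglab_counts)
# Minority classes (DME, drusen) receive weights > 1; majority classes < 1.
```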

3.4 Experimental setup

The proposed model is implemented using the Keras library on the TensorFlow backend. In the proposed architecture, the final feature vector obtained from the GAP layer is fed to a dense layer with four neurons, one for each class of the ZhangLab dataset. The training and testing sets are pre-partitioned in the ZhangLab dataset. The training set of the ZhangLab dataset is further divided into training and validation splits of 80% and 20% of the OCT images, respectively. The number of images in each set is given in Table 1.
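Assuming the ZhangLab training images are stored in class-named subdirectories under a hypothetical local path, the 80/20 split could be produced in Keras as follows.

```python
# Hedged sketch of the 80/20 training/validation split; "ZhangLab/train"
# is a hypothetical directory layout with one subdirectory per class.
import tensorflow as tf

common = dict(validation_split=0.2, seed=42, label_mode="categorical",
              image_size=(256, 256), batch_size=32)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "ZhangLab/train", subset="training", **common)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "ZhangLab/train", subset="validation", **common)
```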

Table 1 Ocular disease class distribution for ZhangLab dataset

The same CNN architecture is used for the three-class classification problem of the Duke dataset, where the output layer comprises three neurons. The number of images in the Duke dataset is relatively balanced compared to the ZhangLab dataset. In the literature, experiments are performed with different partitions of the Duke dataset, as predefined training and testing sets are not provided with it. In this work, the images of the 15th volume of each class of the Duke dataset are kept separate for testing, and the remaining volumes are used as the training set. The training set is further divided into four partition combinations (starting from 60% training data, in increments of 10%) for training and validation. Table 2 shows the number of training, validation, and testing images for each partition.

Table 2 Ocular disease class distribution for Duke dataset

The model is trained with a batch size of 32 for 100 epochs, with early stopping on the best validation accuracy, using the categorical cross-entropy loss function. The Adam optimizer is used for back-propagation with a learning rate of 0.01. The Xavier uniform initializer is used for the weights of the various layers in the proposed architecture, and the biases are initialized to zero. During the training phase, the training and validation image samples are applied along with the class weights, so the learning is penalized according to the weight assigned to each class. The best model, i.e., the one with the highest validation accuracy, is selected and tested without applying CWB. Finally, the GAP layer is used to construct CAMs.
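A sketch of this training configuration, reusing the model, datasets, and class weights from the earlier sketches (the exact callback settings are assumptions consistent with the early stopping behavior described in Section 4):

```python
# Training sketch: CCE loss, Adam (lr = 0.01), early stopping on validation
# accuracy with a patience of 10 epochs, and Eq. (1) class weights.
# build_oct_gap_model, train_ds, val_ds, and class_weights come from the
# sketches above.
import tensorflow as tf

model = build_oct_gap_model(num_classes=4)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                              patience=10,
                                              restore_best_weights=True)
history = model.fit(train_ds, validation_data=val_ds, epochs=100,
                    class_weight=class_weights, callbacks=[early_stop])
```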

4 Results and discussion

The proposed method is applied to 83,484 OCT images for the four-class classification of the ZhangLab dataset [28] and 3,231 OCT images for the three-class classification of the Duke dataset [44], with the distributions given in Tables 1 and 2. The performance is evaluated and compared using metrics based on the confusion matrix, and the classification results obtained with the proposed method are compared with the state-of-the-art methods discussed in Section 2. A multi-class confusion matrix is constructed from the predicted labels of the testing set for each dataset, where the predicted labels are compared with the expert-provided ground truth for each OCT image. Performance metrics like accuracy (ACC), sensitivity (SENS), specificity (SPEC), and F1-score are calculated from the multi-class confusion matrix [38]. The receiver operating characteristic (ROC) curve is plotted, and the area under the curve (AUC) is calculated [10]. The training is done for 100 epochs with an early stopping criterion (monitoring up to 10 epochs for improvement). Figure 4 shows the training and validation accuracy per epoch for both datasets. Due to the early stopping criterion, the best models were obtained at the 60th and 40th epochs for the ZhangLab and Duke datasets, respectively. The training accuracy increases along with the validation accuracy, which indicates that the models are not overfitting the OCT image data.
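For reference, the sketch below derives these metrics from a multi-class confusion matrix using one-vs-rest counts per class; the macro-averaging at the end is an assumption about how the per-class values are aggregated.

```python
# Confusion-matrix-based metrics for a multi-class problem.
import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_metrics(y_true, y_pred, num_classes):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                  # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp                  # class c predicted as another
    tn = cm.sum() - tp - fp - fn
    sens = tp / (tp + fn)                     # per-class sensitivity (recall)
    spec = tn / (tn + fp)                     # per-class specificity
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    acc = tp.sum() / cm.sum()                 # overall accuracy
    return acc, sens.mean(), spec.mean(), f1.mean()
```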

Fig. 4 Plots depicting the performance of OCT classification on the training and validation sets: (a) ZhangLab dataset (early stopping at epoch 60) and (b) Duke dataset (early stopping at epoch 40)

Performance on the ZhangLab dataset:

CNN models are usually evaluated on metrics like F1-score, accuracy, sensitivity, and specificity, but optimized using loss functions like categorical cross-entropy. Sanyal et al. [42] proposed optimizing non-decomposable metrics such as the F1-score directly as the loss function to obtain better-performing deep learning models on imbalanced datasets. For comparison, this approach is implemented with the proposed architecture, where the CNN is optimized with an F1-score loss function. Results for the different combinations of CWB and F1-score loss optimization on the ZhangLab dataset are summarized in Table 3. The basic model using the categorical cross-entropy (CCE) loss function (OCT_GAP + CCE) performs better than the model with F1-score loss optimization (OCT_GAP + F1). The F1-score loss combined with the CWB technique (OCT_GAP + CWB + F1) shows a clear improvement over F1-score loss optimization alone (OCT_GAP + F1), which indicates the benefit of class weight balancing on the imbalanced ZhangLab dataset. Although the OCT_GAP + CWB + F1 model shows improvement, it underperforms compared to class weight balancing used with the basic model (OCT_GAP + CWB + CCE).
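As an illustration of optimizing the F1-score directly, one common differentiable ("soft") macro-F1 loss is sketched below; this is not necessarily the exact formulation of Sanyal et al. [42].

```python
# Soft macro-F1 loss: per-class TP/FP/FN are computed from softmax
# probabilities instead of hard predictions, making the loss differentiable.
import tensorflow as tf

def soft_macro_f1_loss(y_true, y_pred):
    # y_true: one-hot labels; y_pred: softmax probabilities, shape (batch, k).
    tp = tf.reduce_sum(y_true * y_pred, axis=0)
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=0)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=0)
    soft_f1 = 2.0 * tp / (2.0 * tp + fp + fn + 1e-7)
    return 1.0 - tf.reduce_mean(soft_f1)      # minimizing this maximizes F1
```

In the OCT_GAP + F1 variants, such a function would simply replace "categorical_crossentropy" in model.compile.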

Table 3 Performance of the proposed algorithm with CWB and F1 − score optimization on ZhangLab dataset

For the ZhangLab dataset, the proposed method performs better than the existing OCT image classification methods listed in Table 4. Zhang et al. [28] presented a transfer learning-based method using the InceptionV3 model, which comprises a large number of training parameters. Li et al. [32] also proposed a VGG16 transfer learning based method. A layer-guided CNN was implemented by Huang et al. [19] for the classification of OCT images, and an EfficientNet-B7 model was proposed by Chetoui et al. [5]. The CNN model trained from scratch by Najeeb et al. [36] also underperforms compared to the proposed method. The proposed lightweight CNN has 0.41 M training parameters, compared to the methods of Zhang et al. [28], Kaymak et al. [26], Li et al. [32], Chetoui et al. [5], and Najeeb et al. [36] with 21.8 M, 91.7 M, 15.2 M, 64.8 M, and 102.7 M training parameters, respectively. These techniques do not consider the class imbalance problem in the ZhangLab dataset. The proposed method performs better even without class weight balancing (OCT_GAP + CCE in Table 3), but the results are further improved by addressing the class imbalance issue with CWB. Thus, by taking the class imbalance problem into consideration, the proposed method has an added advantage over the other methods in the literature.

Table 4 Performance of the proposed method on ZhangLab dataset

Performance on the Duke dataset:

As the OCT_GAP + CWB + CCE model performs well on the ZhangLab dataset, the same combination is applied to the Duke dataset, with the results summarized in Table 5. The partition with 70% training data and CWB produced the best results for the Duke dataset, and this partition is used to compare performance with the existing methods on this dataset. The results summarized in Table 6 clearly show that deep learning-based methods give better results than the hand-crafted feature-based techniques of Venhuizen et al. [48], Srinivasan et al. [44], Hussain et al. [21], and Hani et al. [17]. The proposed method is also compared with recently proposed deep learning-based methods: Qingee et al. [23] proposed a transfer learning (TL) based approach using the InceptionV3 model, whereas Rasti et al. [39] and Rong et al. [41] proposed specialized CNN architectures. The proposed method performed better than all the hand-crafted feature-based methods in terms of ACC, SENS, SPEC, and F1-score, except for [44] in terms of SENS and SPEC. The CWB based lightweight architecture proposed in this work also performed better than Rong et al. [41] and yielded performance comparable to [23] and [39]. The advantage of the proposed method over these methods lies in the far fewer training parameters: 0.41 M, compared to 21.8 M for the transfer learning based method of Qingee et al. [23].

Table 5 Performance of the proposed method on Duke dataset for different partitions
Table 6 Performance of the proposed method on Duke dataset

The ROC curve in a classification problem depicts the trade-off between sensitivity and specificity, which is important in medical image analysis. The ROC curves for the ZhangLab and Duke datasets are shown in Fig. 5. The ROC curve for each class is plotted separately, and the area under each curve is calculated and reported in the bottom right corner of the plot. True positive and false positive rates at different classification thresholds are plotted to obtain the ROC curve; a model whose ROC curve has an AUC of 1 is considered ideal. The proposed method achieves an AUC of 0.99 on both the ZhangLab and Duke datasets.
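A sketch of how such per-class (one-vs-rest) ROC curves and AUC values can be computed from the softmax probabilities predicted on the test set:

```python
# Per-class ROC/AUC: binarize the labels and treat each class one-vs-rest.
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def per_class_roc(y_true, y_score, num_classes):
    y_bin = label_binarize(y_true, classes=list(range(num_classes)))
    curves = {}
    for c in range(num_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        curves[c] = (fpr, tpr, auc(fpr, tpr))  # thresholds swept implicitly
    return curves
```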

Fig. 5 ROC plots: (a) ZhangLab dataset - CNV (class-0), DME (class-1), drusen (class-2), and normal (class-3) and (b) Duke dataset - AMD (class-0), DME (class-1), and normal (class-2)

Cross-dataset validation performance:

To check the generalization capability of the proposed CNN model, a cross-dataset validation experiment is performed by keeping only the classes common to both datasets, i.e., DME and normal, for training and testing. The model trained on the ZhangLab dataset and tested on the Duke dataset showed ACC, SENS, SPEC, F1-score, and AUC of 90.78%, 83.10%, 96.80%, 0.90, and 0.91, respectively. The model trained on the Duke dataset and tested on the ZhangLab dataset yielded 77.68%, 82.64%, 73.14%, 0.78, and 0.78 for the same metrics. This analysis shows that the model trained on the ZhangLab dataset generalizes better to OCT images of the DME and normal classes than the model trained on the Duke dataset, demonstrating the effectiveness and better generalization capability of the larger dataset.

Model Interpretation using CAMs:

The output weights learned during the training phase are projected back onto the last convolutional layer filters to obtain the class activation maps of example images from the ZhangLab and Duke datasets, as shown in Figs. 6 and 7, respectively. In OCT images, the important features responsible for ocular diseases are found near the retinal layers; these features are highlighted in red and yellow. The dip in the retinal layers of a normal image is visible in Figs. 6(a) and 7(a), while the elevation of the retinal layers in CNV and drusen images can be seen in Fig. 6(b) and (d). The DME, drusen, and AMD class activation maps also show abnormalities in the intermediate layers of the retina.
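A minimal sketch of this CAM construction, assuming the Sequential layout from the architecture sketch in Section 3.3 (the layer indices and normalization step are assumptions):

```python
# CAM per Zhou et al. [50]: weight the feature maps feeding the GAP layer
# by the dense-layer weights of the class of interest.
import numpy as np
import tensorflow as tf

def compute_cam(model, image, class_idx):
    # Layer -3 is the last max-pooling output (the maps entering GAP) in the
    # Sequential sketch above; adjust the index for other layouts.
    fmap_model = tf.keras.Model(model.inputs, model.layers[-3].output)
    fmaps = fmap_model(image[None, ...])[0].numpy()   # (h, w, 256)
    w = model.layers[-1].get_weights()[0]             # (256, num_classes)
    cam = np.einsum("hwc,c->hw", fmaps, w[:, class_idx])
    cam = np.maximum(cam, 0.0)                        # keep positive evidence
    return cam / (cam.max() + 1e-7)   # normalize; upsample to overlay on image
```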

Fig. 6 Class activation maps obtained for OCT images of the ZhangLab dataset: (a) normal class images, (b) CNV class images, (c) DME class images, and (d) drusen class images

Fig. 7 Class activation maps obtained for OCT images of the Duke dataset: (a) normal class images, (b) DME class images, and (c) AMD class images

5 Conclusion

This work presents an automated OCT classification method for ocular diseases using a deep learning-based CNN with CWB. Medical imaging datasets suffer from an imbalance in the number of samples between the normal and diseased classes, and addressing this class imbalance is crucial in CAD systems for the diagnosis of ocular diseases. A global average pooling based model is proposed in this work to compute feature maps with a CNN. The model is trained with class weights that penalize misclassification of the minority classes more heavily. The trained model is tested on the publicly available ZhangLab and Duke datasets for OCT classification, and the final results are evaluated using confusion matrix based metrics. The proposed method yielded classification accuracies of 99.17% and 98.46% and F1-scores of 0.99 and 0.98 on the ZhangLab and Duke datasets, respectively. The results show that, on the imbalanced ZhangLab dataset, the model using the CWB technique improves upon the model without CWB. The class activation maps computed from the final convolutional layer show the important features responsible for a particular disease. The overall advantage of the proposed method over the other state-of-the-art methods discussed here lies in its lightweight CNN architecture, CWB, and class activation map construction. Such systems can be used to assist ophthalmology experts. Network compression methods for well-known CNN architectures, different class balancing techniques, and the interpretation of CNNs for OCT images through various visualization techniques can be explored in the future.