Introduction

Glaucoma is an optic neuropathy characterized by slow, progressive degeneration of retinal ganglion cells (RGCs), which leads to irreversible visual field defects [1]. Structural alteration of the optic nerve head (ONH) and retinal nerve fiber layer (RNFL) is an early sign of glaucoma [2]. Newer ocular imaging methods, such as spectral domain optical coherence tomography (SD-OCT), allow noninvasive quantitative imaging of these structures and thereby facilitate the detection and monitoring of glaucoma [3]. Because the peripapillary RNFL (pRNFL) comprises the axons of all RGCs from the entire retina, pRNFL measurement is currently the most commonly used OCT parameter in clinical glaucoma assessment [4]. Several studies have reported good diagnostic ability of pRNFL parameters in glaucoma using SD-OCT [5, 6]. These studies, like most previous studies based on hand-crafted features (HCFs), first segmented the object with a hand-designed program, then used domain knowledge to extract features such as pRNFL thickness manually or semi-automatically, and finally made the diagnosis with statistical classifiers or shallow machine-learning classifiers designed specifically for each class of objects [3, 7].

The advantage of HCFs is that they reflect expert knowledge: features are selected and analyzed because they suit a given problem, such as glaucoma. However, hand-designed programs and feature extraction can be difficult and time-consuming. In contrast to hand-crafted features, deep learning (DL) algorithms learn the most predictive features directly from the images. DL methods have been applied to various medical imaging modalities, such as fundus photography, lung radiography, computed tomography (CT), and MRI [8,9,10,11]. Moreover, some studies have reported that a DL model significantly outperformed other machine learning methods for detection of glaucomatous visual field (VF) change [12]. To the best of our knowledge, no study has compared the diagnostic accuracy of a DL model with that of HCFs of pRNFL parameters obtained with SD-OCT in differentiating normal eyes from eyes with glaucoma. Thus, the purpose of this study was to evaluate the efficacy of a DL model for detecting glaucomatous pRNFL changes and to compare the DL model against HCFs based on SD-OCT pRNFL imaging.

Methods

Chinese subjects above the age of 20 years were prospectively recruited from Joint Shantou International Eye Center of Shantou University and the Chinese University of Hong Kong (JSIEC). All participants engaged in an informed consent process and signed a written consent document before study procedures were carried out. This study was conducted according to the tenets of the Declaration of Helsinki and had the approval of the institutional review board.

Examinations and inclusion criteria

All subjects were asked about their medical and ophthalmic history and underwent a standardized ophthalmic examination including: slit-lamp examination, Snellen visual acuity, intraocular pressure (IOP) measurement by Goldmann applanation tonometry, stereoscopic optic disc examination with a 78-D Volk lens, gonioscopy performed in the dark using a Goldmann 2-mirror lens, as well as VF testing (Humphrey Field Analyzer II-750i, Carl Zeiss Meditec).

Glaucomatous eyes were defined by the glaucomatous appearance of the optic nerve head on color fundus photographs, regardless of the presence or absence of glaucomatous VF defects. These characteristic changes in the optic nerve head included glaucomatous optic disc cupping, peripapillary atrophy, neuroretinal rim notching, wedge-shaped defects of the RNFL adjacent to the edge of the optic disc, and optic disc hemorrhage. Glaucomatous VF defects were defined as a cluster of three points with probabilities of < 5% on the pattern deviation map in at least 1 hemifield, including at least 1 point with a probability of < 1%, or a cluster of two points with a probability of < 1%, and/or a glaucoma hemifield test (GHT) result outside normal limits. Other inclusion criteria were (1) age > 40 years and (2) no concomitant disease, such as retinal disease, diabetes mellitus, or significant senile cataract, that could affect the results of SD-OCT examinations.

Inclusion criteria for the normal control group were (1) age > 20 years; (2) best corrected visual acuity of 20/40 or better with spherical equivalent refractive error between − 6.0 and 3.0 diopters; (3) intraocular pressure of 21 mmHg or less; (4) no history of glaucoma, retinal disease, diabetes mellitus, or significant senile cataract that could affect the results of SD-OCT examinations; (5) a normal VF, defined as a mean deviation (MD) and pattern standard deviation (PSD) within 95% confidence limits and a GHT result within normal limits.

SD-OCT imaging and hand-crafted features of pRNFL

SD-OCT imaging of the pRNFL was carried out with a Topcon 3D OCT-2000 (Topcon, Tokyo, Japan, software version: 8.11.003.04) by experienced operators (X.L, B.C, and J.Y). The SD-OCT uses a superluminescent diode with a center wavelength of 840 nm and a bandwidth of 50 nm as a light source. The acquisition rate of the SD-OCT is up to 20,000 A-scans per second. The transverse and axial resolutions were 20 and 5 μm, respectively. A 3D scan disc protocol was used for pRNFL imaging in this study. Only images with a quality factor > 45 were used for analyses. RNFL images with a resolution of 1024 points along a circle 3.46 mm in diameter were exported and saved in .jpg format for quantitative analysis.

Customized software, the Anterior Segment Analysis Program (ASAP), was used to quantify pRNFL parameters, including thickness in four sectors (nasal, temporal, superior, and inferior) and on average. ASAP is a plug-in for image processing software (ImageJ version 1.38x; public domain software, http://imagej.nih.gov/ij/) [13]. Details of this software have been reported by us and by other research groups [14, 15]. After automatically delineating the boundary of the pRNFL, the software automatically calculated the pRNFL thickness, including the overall average thickness over 360° and the thickness in each sector, with the 315° to 45° position designated temporal, 225° to 315° inferior, 135° to 225° nasal, and 45° to 135° superior (Fig. 1). The inter-observer reproducibility of ASAP was assessed on a random subset of 20 images evaluated independently by two examiners (X.L and C.Z). Reproducibility was good to excellent, with intra-class correlation coefficients ranging from 0.792 to 0.979 across all RNFL parameters. We used receiver operating characteristic (ROC) curves to describe the ability of each pRNFL parameter to discriminate glaucomatous from healthy eyes.
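The sector definitions above can be illustrated with a minimal sketch. This is not the ASAP implementation: the function name `sector_means`, the assumption that sample i of the circular profile lies at angle i × 360 / N degrees, and the half-open handling of sector boundaries are all hypothetical choices made for illustration.

```python
def sector_means(thickness):
    """Average a circular thickness profile overall and per quadrant.

    Assumption: sample i lies at angle i * 360 / len(thickness) degrees,
    in the same angular convention as the sector limits quoted in the text.
    Each sector includes its lower limit and excludes its upper limit.
    """
    n = len(thickness)
    sectors = {"temporal": [], "superior": [], "nasal": [], "inferior": []}
    for i, value in enumerate(thickness):
        angle = i * 360.0 / n
        if angle >= 315 or angle < 45:
            sectors["temporal"].append(value)
        elif angle < 135:
            sectors["superior"].append(value)
        elif angle < 225:
            sectors["nasal"].append(value)
        else:
            sectors["inferior"].append(value)
    means = {name: sum(vals) / len(vals) for name, vals in sectors.items()}
    means["average"] = sum(thickness) / n
    return means


# Toy profile with one sample per quadrant (angles 0, 90, 180, 270 degrees).
toy = sector_means([10.0, 20.0, 30.0, 40.0])
```

With a 1024-point profile, this convention assigns 256 samples to each quadrant.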

Fig. 1

Overview diagram of classic machine learning with hand-crafted features vs. deep learning. a Classic machine learning with hand-crafted features (HCFs) involves object segmentation using a hand-designed program, followed by the use of domain knowledge to extract features manually or semi-automatically, and finally a prediction made with statistical or machine-learning classifiers. b A deep learning approach learns the most predictive features through a series of hidden layers that extract increasingly abstract features directly from the image

Datasets

Two datasets were included in this study. The first dataset consisted of 1501 peripapillary RNFL OCT images, 690 from 153 glaucoma patients and 811 from 394 normal subjects, and was selected from a prospective study at JSIEC between September 2013 and August 2014. We have reported some of the results previously [16]. Briefly, that study was designed to investigate the profile and determinants of hand-crafted features in SD-OCT imaging with semi-automatic software. This dataset was further randomly divided into two sets: a training set (80%, with 552 images of glaucoma and 649 images of normal subjects) for the development of the algorithm and a development set (20%, with 138 images of glaucoma and 162 images of normal subjects) for monitoring the performance of the trained model.
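The reported counts are consistent with an 80/20 split applied within each class. The sketch below reproduces those counts under that assumption; the source does not state whether the split was stratified or grouped by patient, so the function `stratified_split`, the seed, and the placeholder image identifiers are hypothetical.

```python
import random


def stratified_split(items_by_class, train_fraction=0.8, seed=0):
    """Randomly split each class separately into training and development parts."""
    rng = random.Random(seed)
    train, dev = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))
        train += [(item, label) for item in shuffled[:n_train]]
        dev += [(item, label) for item in shuffled[n_train:]]
    return train, dev


# Placeholder identifiers standing in for the 690 glaucoma and 811 normal images.
images = {
    "glaucoma": [f"g{i}" for i in range(690)],
    "normal": [f"n{i}" for i in range(811)],
}
train, dev = stratified_split(images)
train_counts = {c: sum(1 for _, lab in train if lab == c) for c in images}
dev_counts = {c: sum(1 for _, lab in dev if lab == c) for c in images}
```

This sketch splits at the image level; a split grouped by patient would need a patient identifier for each image.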

The second dataset, used for testing, included 50 images from 50 glaucoma patients and 52 images from 52 normal subjects. It was used to evaluate the algorithm and to compare the algorithm with RNFL parameters. Subjects in the testing dataset were enrolled prospectively from the same center after the training dataset was established. The inclusion and exclusion criteria were identical for the training and testing datasets, except that normal subjects in the testing dataset were above 40 years of age.

Development of Deep Learning Model

We used transfer learning with fine-tuning to build the classifier in this study. To achieve this, a modified convolutional base of a convolutional neural network (CNN) model with weights pre-trained on ImageNet was used as the DL model [17]. The pre-trained weights of the DL model were further fine-tuned on our dataset for classification. The applied CNN was Inception-V3 with 159 layers (Google Inc) [18]. The DL model was implemented in the TensorFlow framework (Google, version 1.10.0) with the Keras API (version 2.2.4). All images were resized to 224 × 224 pixels as required by the Keras API. Image pixel values were scaled to the range 0 to 1. Data augmentation was performed to increase the amount and type of variation within the training dataset, including horizontal flipping, rotation of 20°, sharpening, and adjustments to saturation, brightness, contrast, and color balance. Training was then performed by minibatch gradient descent with a batch size of 32, using the Adam optimizer with a learning rate of 0.001. Training was run for 100 epochs, over which the loss of the model decreased and the accuracy on the validation set increased.
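Two of the preprocessing steps described above, pixel scaling and horizontal flipping, can be sketched in plain Python. This is an illustration only and not the Keras pipeline used in the study; it assumes 8-bit pixel values (0 to 255) and represents an image as a list of rows, and both helper names are hypothetical.

```python
def scale_pixels(image):
    """Map 8-bit pixel values (0-255, an assumption) to floats in [0, 1]."""
    return [[value / 255.0 for value in row] for row in image]


def flip_horizontal(image):
    """Mirror each row left to right, as in horizontal-flip augmentation."""
    return [row[::-1] for row in image]


# Tiny 2 x 3 grayscale image used purely as an example.
image = [[0, 51, 255],
         [102, 204, 153]]
scaled = scale_pixels(image)
augmented = flip_horizontal(scaled)
```

The flipped copy keeps the label of its source image, so augmentation adds variation to the training set without adding new subjects.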

To visualize the most predictive features for detecting glaucomatous pRNFL changes, a class activation map (CAM) was used to open the “black box” of the DL model [19]. A CAM is a 2D heatmap associated with a specific output class, computed for every location in an input image, that indicates how important each location is with respect to the class considered.
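As a rough illustration of the idea, the basic CAM formulation computes the heatmap as a weighted sum of the final convolutional feature maps, using the weights that connect each map to the class of interest. The source does not state which CAM variant was used, so the sketch below is an assumption; the function names and the toy feature maps are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps: cam[y][x] = sum_k w_k * f_k[y][x]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalize(cam):
    """Rescale the map to [0, 1] so it can be rendered as a heatmap."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Two toy 2 x 2 feature maps and their weights for the class of interest.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
class_weights = [0.5, 1.0]
heatmap = normalize(class_activation_map(feature_maps, class_weights))
```

In practice the low-resolution map is upsampled to the input size and overlaid on the image.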

Statistical analysis

The Student t test or Mann-Whitney test was used to evaluate demographic and clinical differences between normal subjects and glaucoma patients. The mean and standard deviation of the following parameters were calculated: age, spherical equivalent, axial length, visual field (MD and PSD), image quality score, and pRNFL thickness on average and in the four quadrants. The diagnostic accuracy of the DL model and of each pRNFL parameter in differentiating between normal and glaucomatous eyes was determined by computing the area under the ROC curve (AUC), sensitivity, specificity, and F1 score with 95% confidence intervals. Sensitivities at fixed specificities of 80% and 90% were determined for the DL model and all the pRNFL parameters. AUCs were compared using DeLong’s method [20]. All statistical analyses were carried out using commercial statistical software (IBM SPSS Statistics v. 17 for Windows; SPSS Inc. Chicago, IL) and the statistical programming language Python (ver. 3.5.1, Python Software Foundation, Beaverton, USA).
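The two headline metrics, AUC and sensitivity at a fixed specificity, can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the function names and the example scores are hypothetical, and higher scores are assumed to indicate glaucoma.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity_at_specificity(pos_scores, neg_scores, target_specificity):
    """Best sensitivity over thresholds whose specificity meets the target.

    A case is called positive when its score is >= the threshold.
    """
    best = 0.0
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        specificity = sum(n < threshold for n in neg_scores) / len(neg_scores)
        if specificity >= target_specificity:
            sensitivity = sum(p >= threshold for p in pos_scores) / len(pos_scores)
            best = max(best, sensitivity)
    return best


# Hypothetical model outputs, not data from the study.
glaucoma_scores = [0.9, 0.8, 0.7, 0.4]
normal_scores = [0.6, 0.3, 0.2, 0.1, 0.05]
example_auc = auc(glaucoma_scores, normal_scores)
sens_80 = sensitivity_at_specificity(glaucoma_scores, normal_scores, 0.8)
sens_90 = sensitivity_at_specificity(glaucoma_scores, normal_scores, 0.9)
```

For a thickness parameter, where lower values indicate glaucoma, the values would be negated before being passed in as scores.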

Results

Demographics of the training and testing sets are given in Table 1. In the testing set, mean age was 53.57 and 50.70 years for normal [range 40–76 years] and glaucoma [range 24–72 years] subjects, respectively. In the training set, mean age was 41.84 and 49.54 years for normal [range 18–80 years] and glaucoma [range 21–76 years] subjects, respectively. Comparison of HCFs, including pRNFL thickness on average and in the four sectors, is shown in Table 2. pRNFL thickness, both on average and by sector, was significantly smaller in the glaucoma group than in the normal group in both the training and testing datasets (all with p < 0.001), except for the nasal sector.

Table 1 Demographics of the training and testing datasets
Table 2 Comparison of pRNFL parameters between training and testing set

The training curve for the DL model is shown in Fig. 2. We trained each DL model for 100 epochs (iterations through the entire dataset), until there was no further improvement in either accuracy or the loss function. As shown in Fig. 2, the loss function value and accuracy rate of both the training and validation datasets changed dramatically during the first five epochs and then stabilized, showing that our DL models reached good convergence.

Fig. 2

The training curve for the deep learning model. The loss function value and accuracy rate of both the training and validation datasets changed dramatically during the first 5 epochs and then stabilized, showing that our DL models reached good convergence

Using the pRNFL thickness in the four sectors, the AUC obtained in the independent test set varied between 0.661 and 0.938: the lowest AUC was 0.661 (95% CI 0.549 to 0.772) for pRNFL thickness in the temporal sector, and the highest was 0.938 (95% CI 0.894 to 0.982) for pRNFL thickness in the inferior sector, as shown in Table 3 and Fig. 3. The DL model achieved an AUC of 0.990 (95% CI 0.974 to 1.000), which was significantly larger than the AUC values of all HCFs (p < 0.001, DeLong’s method with Holm’s method for multiple comparisons). CAMs were generated to visualize the regions of the SD-OCT images that were most important for the glaucoma prediction. Figure 4 shows two examples of the strongest activations within the DL model for the glaucoma-positive class. In the normal pRNFL image, no area is activated. In the glaucomatous image, most of the pRNFL area is activated by the CNN. This shows that the network focuses on the part of the image where the glaucomatous changes are present.
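The Holm correction named in the AUC comparison above is a step-down adjustment: p-values are sorted in ascending order, the smallest is multiplied by the number of comparisons m, the next by m − 1, and so on, with a running maximum keeping the adjusted values monotone. A minimal sketch follows; the function name and the five p-values are hypothetical and are not results from the study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for five pairwise AUC comparisons.
raw_p = [0.01, 0.04, 0.03, 0.005, 0.2]
adjusted_p = holm_adjust(raw_p)
```

A comparison is declared significant at level α when its adjusted p-value is below α.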

Table 3 The diagnostic performance of DL model as compared with hand-crafted features of pRNFL imaging
Fig. 3

Receiver operating characteristic (ROC) curves in the testing set

Fig. 4

Class activation map (CAM) for glaucomatous and normal peripapillary retinal nerve fiber layer images. a A glaucomatous pRNFL defect with severe visual field defect (MD = − 16.47). Note the highlighted regions in the pRNFL area, which are activated by the CNN. This shows that the network focuses on the part of the image where the glaucomatous changes are present. b A normal pRNFL image without any activated area

Discussion

In this study, we have shown that a DL model was effective at distinguishing glaucomatous from normal SD-OCT images of the pRNFL, and that its accuracy was higher than that of pRNFL parameters extracted by hand-crafted software. Our results demonstrate the possibility of applying DL to assist screening and diagnosis of glaucoma from SD-OCT images.

SD-OCT is widely used for glaucoma detection, and peripapillary RNFL analysis is the scanning protocol most often used in routine patient management. Both manual and semi-automatic methods have been reported for diagnosing glaucoma from SD-OCT images. Using built-in software, Harsha et al. reported AUCs between 0.792 and 0.884 for pRNFL parameters, which is comparable to the current study [21]. Using the RTVue-100 system, Seong et al. reported higher AUC values, from 0.921 to 0.976, for pRNFL parameters in patients with different glaucoma stages [22]. Although the datasets and reference standards differ from those of the previous studies, the present study extends this body of work by using a DL model with high sensitivity and specificity. Our study differs from previous studies in one important respect. Most previous studies used HCFs, which are labor intensive and need to be specified by experts. The supervised DL technology adopted in this study allows training entirely end-to-end. This could aid the implementation of large-scale screening programs by providing quick and objective SD-OCT image assessment.

The application of DL is usually limited by the lack of large training sets, since several tens of thousands of examples are typically needed to train a DL model effectively. Given that only a finite number of OCT images can be prepared in the clinical setting, several authors have suggested using transfer learning to improve the performance of DL models. Transfer learning has proven to be a highly effective technique and has been widely used in multiple medical domains with limited data [7, 23]. Rather than training a completely blank network, we adapted a DL model base pre-trained on the ImageNet dataset, which contains more than 14 million images with 1000 classes. After running the pre-trained CNN base over our dataset, we extended the model by adding dense layers on top and running the whole network end-to-end on the input data. These techniques yield a more accurate model in much less time with a relatively small dataset.

Deep learning models have often been described as black boxes because it is difficult to determine how they make their predictions [24]. CAMs are tools that can aid visualization of a CNN. It is interesting to note that features outside the RNFL region also contribute to the DL model's predictions. Previous studies have reported several other glaucomatous structural defects, such as ganglion cell layer (GCL) and peripapillary microvascular changes [25, 26]. It is possible that CAMs will prove to be even more useful in revealing the structure–function relation in glaucoma.

Our study has several limitations. First, our DL models were developed on images from a Topcon device. It has been reported that pRNFL thickness values obtained by Topcon OCT are significantly higher than those measured by Cirrus OCT in both normal and glaucomatous eyes [27]. Second, SD-OCT macular data were not available in the training set in the current study. A systematic review has reported that pRNFL parameters are still preferable to macular parameters for diagnosing manifest glaucoma, although the differences are small. Third, a previous study reported effects of peripapillary atrophy on the diagnostic ability of SD-OCT parameters. Further studies are needed to explore the effect of peripapillary atrophy on the DL model developed in this study [28]. Fourth, all SD-OCT images were collected at the same center, and external validation is needed to confirm the diagnostic performance at different centers and with different OCT devices. Fifth, most of the glaucoma cases were quite severe, which made classification easier in the current study. Lastly, we used images of Chinese eyes only, so our results may not be applicable to other populations.

In spite of these limitations, the DL model developed in this study achieved higher sensitivity and specificity than traditional HCFs. DL has a high diagnostic capability and may help in the screening and diagnosis of glaucoma in the clinic. Further studies with larger datasets are needed to evaluate whether the DL model also performs well in other clinical settings, including pRNFL imaging from different eye centers and OCT devices, optic discs of different size or appearance, and different glaucoma stages.