Introduction

In Western countries, age-related macular degeneration (AMD) is the leading cause of visual impairment and blindness in elderly patients [1,2,3]. Its prevalence increases with age, and it affects more than 10% of people older than 65 years [1].

In accordance with the fundoscopic findings, the classification of the Beckman Initiative for Macular Research Classification Committee distinguishes between early AMD, intermediate AMD, and late AMD. Late AMD is defined by the presence of any geographic atrophy (GA) or choroidal neovascularization (CNV) [4].

Many different imaging modalities are now used in the diagnosis of AMD. Most of these techniques, e.g., optical coherence tomography (OCT), OCT angiography, and fluorescein angiography, have their main advantage in detecting exudative AMD [5,6,7]. By contrast, fundus autofluorescence (FAF) imaging is particularly useful in the diagnosis of GA [8,9,10]. It provides information about the function of the retinal pigment epithelium (RPE) by displaying ocular fluorophores, mainly lipofuscin [3, 9, 10]. In areas affected by GA, there is a progressive loss of the RPE, the corresponding choriocapillary layer, and the photoreceptor layer [2, 10]. Due to the absence of the RPE cells and their fluorophores, GA areas appear hypoautofluorescent in FAF [8,9,10]. However, the junction area of the GA can also show different hyperautofluorescent patterns [3, 9, 10]. According to these FAF patterns, GA can be subdivided into different phenotypes with differing characteristics [3]. Among these phenotypes, the diffuse-trickling pattern (dt-GA) has been shown to progress extremely rapidly [10, 11].

Deep learning is a research field that is gaining importance in many medical areas, especially those dealing with imaging. Among other things, it enables the automated detection of different structures by self-learning algorithms based on a deep convolutional neural network (DCNN) [12]. Ophthalmology, with its many different imaging modalities, is a potential field of application. Recently, several publications have dealt with the use of machine learning in AMD [13,14,15,16,17,18,19,20,21,22,23]. In these studies, machine learning was used only to interpret fundus photographs or OCT images. To the best of our knowledge, no study has yet used a DCNN for the automated evaluation of fundus autofluorescence images.

Therefore, the aim of our study was to create a deep learning-based classifier for the evaluation of fundus autofluorescence to (1) automatically detect GA and (2) identify eyes exhibiting rapidly progressing dt-GA.

Methods

Deep learning process

For this study, 30° FAF images of GA patients, healthy patients, and patients with other retinal diseases (ORDs), all obtained with the same FAF device (Spectralis, Heidelberg Engineering, Heidelberg, Germany), were used to train and test a DCNN classifier (Fig. 1). Only images whose quality was adequate for manual diagnosis were used. The selection of the images and their subsequent assignment to the training set or the testing set were performed randomly. The training data were strictly separated from the test data to prevent inter-eye and intra-eye correlations: images of patients that were used for training the DCNN classifier were not used to test it, and only one image per patient was used for testing. A DCNN is a self-learning algorithm that performs deep learning by processing input data (e.g., images) within many hierarchical layers, from simple (e.g., lines) to more complex forms [14, 24].
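The patient-level separation of training and test data described above could be sketched as follows. This is only an illustration, not the procedure actually used in the study: the record layout (patient identifier, image path), the helper name split_by_patient, and the fixed random seed are all assumptions.

```python
import random
from collections import defaultdict


def split_by_patient(records, n_test_patients, seed=0):
    """Split (patient_id, image_path) records at the patient level.

    Hypothetical helper: patients are assigned at random to either the
    training or the test set, so no patient contributes to both. Each test
    patient contributes exactly one randomly chosen image.
    """
    rng = random.Random(seed)

    # Group all images by patient.
    images_by_patient = defaultdict(list)
    for patient_id, image_path in records:
        images_by_patient[patient_id].append(image_path)

    patients = sorted(images_by_patient)
    rng.shuffle(patients)
    test_patients = patients[:n_test_patients]
    train_patients = patients[n_test_patients:]

    # One image per test patient; all images of the training patients.
    test = [(p, rng.choice(images_by_patient[p])) for p in test_patients]
    train = [(p, img) for p in train_patients for img in images_by_patient[p]]
    return train, test


# Made-up records: some patients have two eyes or a follow-up image.
records = [
    ("p1", "p1_od.png"), ("p1", "p1_os.png"),
    ("p2", "p2_od.png"),
    ("p3", "p3_od.png"), ("p3", "p3_followup.png"),
    ("p4", "p4_os.png"),
]
train, test = split_by_patient(records, n_test_patients=2)
```

Splitting on the patient identifier rather than on individual images is what keeps fellow-eye and follow-up images of one patient from appearing on both sides of the split.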

Fig. 1

Fundus autofluorescence (FAF) images of a healthy retina (a), a retina with diffuse-trickling pattern (dt-GA) in a patient with geographic atrophy (GA) (b), and a retina with drusen and reticular pseudodrusen in a patient with other retinal diseases (ORD) (c)

For this study, the deep learning framework TensorFlow™ (Google Inc., Mountain View, CA, USA) was used to provide deep learning with a multi-layer DCNN. The first layers of this DCNN had been pretrained with millions of already classified everyday images (e.g., dog, cat, house, and car) from the image database ImageNet [25]. In order to obtain a classifier able to detect GA and GA patterns in FAF images, the last layer of the DCNN was subsequently trained with the above-mentioned FAF images [12, 14, 25,26,27,28].
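The idea of retraining only the last layer on top of frozen, pretrained layers can be illustrated with a small sketch. It does not use TensorFlow and is not the study's implementation: the two-dimensional toy vectors stand in for the features produced by the pretrained layers, and the function names, the plain gradient descent, the learning rate, and the class coding are assumptions made for illustration. The softmax output is one common way to obtain per-class probability scores that sum to 1.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, features):
    """Probability score per class from the trainable final layer."""
    logits = [
        sum(w * x for w, x in zip(weights[c], features)) + biases[c]
        for c in range(len(biases))
    ]
    return softmax(logits)


def train_last_layer(features, labels, n_classes=2, steps=500, lr=0.5):
    """Fit only the final softmax layer; the feature extractor stays frozen."""
    n_features = len(features[0])
    n = len(features)
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(steps):
        grad_w = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(features, labels):
            probs = predict(weights, biases, x)
            for c in range(n_classes):
                # Gradient of the cross entropy with respect to the logit.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(n_features):
                    grad_w[c][j] += err * x[j]
        for c in range(n_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(n_features):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


# Toy stand-ins for features produced by the frozen, pretrained layers.
rng = random.Random(0)
features = [[rng.gauss(1.0, 0.3), rng.gauss(-1.0, 0.3)] for _ in range(20)]
features += [[rng.gauss(-1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(20)]
labels = [0] * 20 + [1] * 20  # 0 = "GA", 1 = "healthy" (hypothetical coding)

weights, biases = train_last_layer(features, labels)
scores = predict(weights, biases, [1.0, -1.0])  # scores for one new feature vector
```

Because only the final layer is fitted, far fewer labeled images are needed than for training a complete network from scratch, which is the usual motivation for this kind of transfer learning.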

Detection of GA

In order not only to differentiate between healthy and pathological FAF images, but also to detect GA in FAF images, two classifiers were built with the DCNN: one to differentiate between GA FAF images and healthy FAF images and another to differentiate between GA and ORD (e.g., AMD without GA, adult-onset foveomacular vitelliform dystrophy, central serous chorioretinopathy, and epiretinal membranes). For this purpose, the DCNN was trained in both cases with 400 FAF images in 500 training steps (GA: n = 200; healthy or ORD: n = 200).

For both classifiers, the quality of the training process was assessed by determining the training accuracy (the ability to correctly classify images already used for training), the validation accuracy (the ability to correctly classify images not yet used for training), and the cross entropy (a function that reflects the training progress and decreases during a successful training process) [12, 14].
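How these three quantities are computed can be shown with a short sketch. The predicted probabilities and labels below are invented for illustration, and the function names are assumptions; the definitions themselves (fraction of correctly classified images, mean negative log-probability of the true class) are the standard ones.

```python
import math


def accuracy(probabilities, labels):
    """Fraction of images whose highest-scoring class equals the label."""
    correct = sum(
        1 for probs, y in zip(probabilities, labels)
        if max(range(len(probs)), key=probs.__getitem__) == y
    )
    return correct / len(labels)


def cross_entropy(probabilities, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return -sum(
        math.log(max(probs[y], eps)) for probs, y in zip(probabilities, labels)
    ) / len(labels)


# Hypothetical predicted probabilities [GA, healthy] and true labels (0 = GA).
train_probs = [[0.97, 0.03], [0.90, 0.10], [0.05, 0.95], [0.20, 0.80]]
train_labels = [0, 0, 1, 1]
val_probs = [[0.80, 0.20], [0.40, 0.60], [0.10, 0.90], [0.30, 0.70]]
val_labels = [0, 0, 1, 1]

train_acc = accuracy(train_probs, train_labels)  # images seen during training
val_acc = accuracy(val_probs, val_labels)        # held-out images
train_ce = cross_entropy(train_probs, train_labels)
```

The cross entropy falls toward zero as the probability assigned to the correct class approaches 1, which is why a decreasing curve indicates a successful training process.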

Finally, the two DCNN classifiers were each tested with 60 FAF images that had not been used for training (GA: n = 30; healthy or ORD: n = 30). For every image, the DCNN automatically calculated a probability score for GA FAF (GA probability score) and a probability score for healthy FAF (normal FAF probability score) or ORD FAF (ORD probability score).

Detection of diffuse-trickling pattern

In a second step, the DCNN was trained to discriminate the dt-GA pattern in FAF from other GA FAF patterns (ndt-GA). For this training process, 72 FAF images with dt-GA and 138 FAF images with ndt-GA were used. As described above, training accuracy, validation accuracy, and cross entropy were determined.

Finally, this classifier was tested with 10 FAF images with dt-GA and 10 images with ndt-GA. By analogy with the GA probability score and the normal FAF probability score described for the GA classifier above, a dt-GA probability score and an ndt-GA probability score were automatically calculated.

Statistics

Statistical analysis was performed with the software SPSS (IBM SPSS Statistics 23.0; IBM, Armonk, NY, USA). The nonparametric Mann-Whitney U test for independent samples was used to compare the automatically calculated probability scores of the two classifiers. The level of significance was defined as p < 0.05. Descriptive statistics were performed with Excel® (Microsoft® Excel® for Mac 2011, 14.6.2; Microsoft®, Redmond, USA).
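For readers unfamiliar with the test, the following sketch shows how a Mann-Whitney U statistic and a two-sided p-value can be computed. It is not the SPSS procedure used in the study: it relies on the normal approximation with tie correction and without continuity correction, so its p-values may differ slightly from those of a statistics package. The function name and the example scores are assumptions.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U, p). Ties receive average ranks and the variance is
    tie-corrected; no continuity correction is applied.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])

    # Walk through groups of tied values and assign average ranks.
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u = min(u_a, n_a * n_b - u_a)

    n = n_a + n_b
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return u, p


# Hypothetical GA probability scores for two image groups.
ga_scores = [0.99, 0.98, 0.97, 0.95, 0.93, 0.91, 0.88, 0.77]
healthy_scores = [0.01, 0.02, 0.03, 0.05, 0.06, 0.08, 0.10, 0.12]
u_stat, p_value = mann_whitney_u(ga_scores, healthy_scores)
significant = p_value < 0.05  # significance level as defined in the text
```

In this made-up example the two groups do not overlap at all, so U is 0 and the difference is significant at the 0.05 level.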

In order to obtain information about the precision and repeatability of the two created classifiers, the whole deep learning procedure, including the final testing, was repeated a second time. For this purpose, the mean absolute probability score difference and the coefficient of variation were calculated. In addition, a Bland-Altman plot was constructed to visualize the repeatability of the testing results.
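The repeatability measures named above could be computed along the following lines. This is a sketch under assumptions: the text does not state exactly how the coefficient of variation was derived, so the per-image definition used here (standard deviation of the two runs divided by their mean) is an assumption, as are the function name and the example scores. The Bland-Altman quantities follow the usual definition (mean difference and limits of agreement at ± 1.96 standard deviations).

```python
import statistics


def repeatability(run_1, run_2):
    """Summarize agreement between two repeated runs on the same images."""
    diffs = [a - b for a, b in zip(run_1, run_2)]
    abs_diffs = [abs(d) for d in diffs]

    # Per-image coefficient of variation (%): SD of the pair / mean of the pair.
    # Assumes no pair has a mean of zero.
    cvs = [
        100.0 * statistics.stdev([a, b]) / statistics.mean([a, b])
        for a, b in zip(run_1, run_2)
    ]

    # Bland-Altman: bias and 95% limits of agreement (bias +/- 1.96 SD).
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return {
        "mean_abs_diff": statistics.mean(abs_diffs),
        "mean_cv_percent": statistics.mean(cvs),
        "bias": bias,
        "limits_of_agreement": (bias - 1.96 * sd, bias + 1.96 * sd),
        # Points for the plot: x = pair mean, y = pair difference.
        "plot_points": [((a + b) / 2.0, a - b) for a, b in zip(run_1, run_2)],
    }


# Hypothetical probability scores of the same test images in two runs.
run_1 = [0.98, 0.95, 0.10, 0.80]
run_2 = [0.97, 0.96, 0.12, 0.80]
summary = repeatability(run_1, run_2)
```

Plotting the returned points with horizontal lines at the bias and at the two limits of agreement yields the Bland-Altman plot; points outside the limits correspond to outliers.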

Results

GA classifiers (GA vs. healthy/GA vs. ORD)

Performance of the training process

During the 500 training steps, the training accuracy and the validation accuracy of the GA classifiers increased rapidly to 99%/98% and 96%/91%, respectively. The cross entropy decreased rapidly to a final value of 0.062/0.100 (Fig. 2a–d; Table 1(a, b)).

Fig. 2

The graphs show the training accuracy, the validation accuracy, and the cross entropy for the GA classifier (GA vs. healthy) (a, b), the GA classifier (GA vs. ORD) (c, d), and the dt-GA classifier (e, f) during 500 training steps

Table 1 Training accuracy, validation accuracy, and cross entropy after 500 training steps for (a) the GA classifier (GA vs. healthy), (b) the GA classifier (GA vs. ORD) and (c) the dt-GA classifier

Performance of the classifier in the final testing

The mean GA probability score of the final testing was 0.981 ± 0.048/0.972 ± 0.043 for the GA FAF images and 0.012 ± 0.016/0.061 ± 0.072 for the healthy/ORD FAF images (Figs. 3 and 4a, b). Accordingly, the mean normal FAF probability score/ORD probability score was 0.012 ± 0.017/0.062 ± 0.072 for the GA FAF images and 0.981 ± 0.047/0.972 ± 0.044 for the healthy/ORD FAF images. The GA probability scores differed highly significantly between the two image groups (p < 0.001).

Fig. 3

Example of tested GA FAF images with a high (0.999) (a) and a low probability score (0.769) (b)

Fig. 4

Box plots showing the GA probability score for the GA group and the healthy comparison group (a), the GA probability score for the GA group and the ORD group (b), and the dt-GA probability score for the dt-GA group and the ndt-GA group (c). In all cases, the difference was highly significant (p < 0.001)

In both cases, all 60 tested FAF images were correctly diagnosed by the GA classifier (Table 2(a, b)). The sensitivity, the specificity, and the accuracy of the GA classifier were 100%.
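The three measures follow directly from the two-by-two table. The short sketch below uses the counts stated in the text (30 GA images and 30 comparison images, all correctly classified); the function name is an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from a two-by-two table."""
    sensitivity = tp / (tp + fn)   # correctly detected GA images
    specificity = tn / (tn + fp)   # correctly rejected non-GA images
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Counts as reported in the text: 30 GA and 30 comparison images, all correct.
sens, spec, acc = diagnostic_metrics(tp=30, fn=0, tn=30, fp=0)
```

With no false negatives and no false positives, all three values equal 1.0, i.e., 100%.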

Table 2 Two by two table showing the results of automated discrimination of GA and healthy eyes (a), of GA and ORD (b), and of diffuse-trickling pattern (dt-GA) and other GA FAF patterns of the GA patients (ndt-GA) (c)

Repeatability and precision

The mean absolute GA probability score difference of the final testing between the two independently performed deep learning procedures was 0.0004 ± 0.0005/0.001 ± 0.009%, and the mean coefficient of variation was 0.17 ± 1.44%/0.44 ± 0.69%. The Bland-Altman plots confirmed this good repeatability: the values were distributed close to the mean difference, with only a few outliers (Fig. 5a, b).

Fig. 5

Bland-Altman plots showing good repeatability with an even distribution for the results of the GA classifier (GA vs. healthy) (a), the GA classifier (GA vs. ORD) (b), and the dt-GA classifier (c)

Diffuse-trickling pattern classifier

Performance of the training process

The training accuracy and validation accuracy curves of the dt-GA classifier rose more slowly than those of the GA classifiers. After the 500 training steps, the training accuracy was 94% and the validation accuracy 77%. The cross entropy curve also decreased more slowly and reached a value of 0.16 at the end of the training process (Fig. 2e, f; Table 1(c)).

Performance of the classifier in the final testing

The mean dt-GA probability score of the final testing was 0.834 ± 0.123 for the dt-GA FAF images and 0.132 ± 0.121 for the FAF images with ndt-GA (Fig. 4c). Accordingly, the mean ndt-GA probability score was 0.166 ± 0.123 for the dt-GA FAF images and 0.868 ± 0.121 for the ndt-GA FAF images. The dt-GA probability scores of the two groups differed highly significantly (p < 0.001). By contrast, there was no significant difference between the dt-GA probability scores of the dt-GA group and the ndt-GA probability scores of the ndt-GA group (p = 0.353). The dt-GA classifier made a correct diagnosis in all 20 tested FAF images (Table 2(c)). This results in a sensitivity, a specificity, and an accuracy of 100%.

Repeatability and precision

For the dt-GA classifier, the mean absolute difference of the dt-GA probability scores between the two independently performed deep learning procedures was 0.003 ± 0.033, and the mean coefficient of variation was 2.53 ± 2.23%. As with the GA classifiers, the Bland-Altman plot, with an even distribution and only one outlier, confirms the good intra-classifier repeatability (Fig. 5c).

Comparison of the GA classifier and the dt-GA classifier

A comparison of the absolute probability scores of the GA classifier and the dt-GA classifier revealed a highly significant difference (p < 0.001).

Discussion

FAF enables the visualization of the RPE and is an established imaging modality in the diagnosis of GA [3, 8,9,10, 29, 30]. The GA classification system for FAF introduced by Bindewald et al. in 2005 makes it possible to estimate the progression rate [3, 10, 29]. In particular, dt-GA is known to have a significantly higher progression rate than other patterns [10, 29].

In this context, we created two DCNN classifier approaches in our study, one to automatically detect GA in FAF images and another to automatically detect dt-GA in FAF images. Both approaches showed excellent performance, with a sensitivity, a specificity, and an accuracy of 100%. With regard to the absolute probability scores of the classifiers’ decision process, the GA classifiers performed significantly better than the dt-GA classifier. The main reasons for this are probably the lower number of FAF images used for the training procedure and the more subtle differences in FAF pattern between dt-GA and ndt-GA compared to those between GA and healthy retina or GA and ORD. This is also in accordance with the different curves of the training accuracy, the validation accuracy, and the cross entropy, which show a more effective training process for the GA classifiers.

With the continuous rise in the number of retinal imaging modalities, both sufficient expertise in the appraisal of these images and the time needed to include the image information in the diagnostic process are becoming limiting factors in the daily routine [31]. Additionally, variable interobserver agreement is a relevant problem in retinal imaging [30, 32, 33]. Biarnés et al. (2012) described this problem for the classification of different GA patterns in FAF imaging. In their study, a high intraobserver agreement was reached, while interobserver agreement was described as “variable” [30]. A deep learning-based tool for automated diagnosis and classification might be a future solution to these problems. As an example, the classifiers in our study reached an extremely good repeatability and precision.

Deep learning algorithms can hierarchically process a huge amount of image data in a way that is comparable to the neuronal microstructure of the brain. By analogy, the performance of these algorithms increases during the “learning” process [12, 24]. For this, a sufficient number of classified images is necessary. If too few training images are available, overfitting can occur during the deep learning process, which dramatically reduces the ability of the classifier to correctly classify unknown images [12, 14, 34]. It is therefore desirable to expand the number of training images in order to further improve the performance of the classifiers and to enable finer subdivisions. To reach this aim, multicenter studies should be pursued.

To the best of our knowledge, our study is the first study that uses a deep learning algorithm in the detection and classification of GA in FAF.

Holz et al. (2007) showed that FAF imaging is the only imaging modality that allows a prognosis of the progression of GA [29, 30]. In their study, they used the classification system by Bindewald et al. (2005) and compared the progression rates of the different patterns after a median follow-up period of nearly 2 years. The diffuse patterns, especially dt-GA, were shown to have the highest spread rate [3, 29]. In the literature, this is explained by an increased accumulation of lipofuscin in postmitotic RPE cells as an important factor for cell death in the pathogenesis of GA [35, 36]. An automated classification of different FAF patterns in GA is therefore of tremendous interest in ophthalmology, both in the context of a more individualized medicine and in the context of a better understanding of the pathology of GA. For a GA classifier, the ultimate goal is to implement an algorithm that correctly detects GA and classifies its phenotype in a first step, performs an accurate calculation of the lesion size in a second step, and provides information on the disease progression in a third step. The first step in this direction was taken with this work.

Our study has some limitations that have to be considered. One is the relatively low number of included FAF images, although it is in the range of other studies using machine learning in ophthalmology [14, 16, 19, 22]. This may influence the quality of the deep learning process. Nevertheless, the continuous increase of the training and validation accuracy curves suggests that the learning process worked well. In line with this, the classifiers show no sign of overfitting, which would lead to an increasing gap between the training and validation accuracy curves [12, 14, 34]. Even so, including more FAF images would probably improve the training performance of the dt-GA classifier.
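The criterion mentioned above, an increasing gap between the training and validation accuracy curves, could be checked with a simple heuristic such as the following. The accuracy curves are invented, and the window size and tolerance are arbitrary illustrative values, not criteria taken from the study.

```python
def accuracy_gap(train_acc, val_acc):
    """Per-step gap between training and validation accuracy."""
    return [t - v for t, v in zip(train_acc, val_acc)]


def gap_is_widening(train_acc, val_acc, window=3, tolerance=0.02):
    """Heuristic: compare the mean gap of the last and first `window` steps."""
    gaps = accuracy_gap(train_acc, val_acc)
    early = sum(gaps[:window]) / window
    late = sum(gaps[-window:]) / window
    return late - early > tolerance


# Hypothetical accuracy curves sampled at a few training steps.
stable_train = [0.70, 0.85, 0.92, 0.96, 0.98, 0.99]
stable_val = [0.68, 0.82, 0.90, 0.93, 0.95, 0.96]
overfit_train = [0.70, 0.85, 0.93, 0.97, 0.99, 1.00]
overfit_val = [0.68, 0.80, 0.82, 0.81, 0.79, 0.78]

stable_flag = gap_is_widening(stable_train, stable_val)
overfit_flag = gap_is_widening(overfit_train, overfit_val)
```

In the first pair of curves, validation accuracy tracks training accuracy and the flag stays off; in the second, validation accuracy stalls and then declines while training accuracy keeps rising, which is the pattern typically associated with overfitting.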

Another limitation is that, in order to obtain a sufficient number of images to build our DCNN classifier, in some cases follow-up images or images of the patient’s other eye were used for training. Therefore, a possible effect on the results cannot be completely excluded, as FAF follow-up images or FAF images of the fellow eye differ less with respect to anatomical features than FAF images of other patients. Nevertheless, due to the operating mode of the DCNN, we believe that this effect can be ignored for our study, since the proportion of images coming from one patient is low. Our DCNN classifier works by detecting patterns that can be recognized in the majority of the images and that differ between the two classified groups. Therefore, when just a few of the FAF images show the same anatomical features, this is ignored by the DCNN. In this context, we believe that the images obtained from the same patient should account for less than 5% of the training dataset of each class and, consequently, less than 2.5% of the entire training dataset of a 2-class classifier. Additionally, overfitting is a sensitive marker of the quality of the dataset. If the classifier recognizes image features due to an insufficient training dataset (e.g., due to inter-eye or intra-eye correlation), overfitting will occur. As mentioned above, overfitting did not occur in the classifiers used in this study, so the composition of the image groups can be considered sufficient.

With regard to clinical relevance, a limitation is that the differentiation was only performed between healthy FAF images or ORD FAF images and FAF images with GA, as well as between the dt-GA pattern and the ndt-GA pattern. Therefore, in a following study, it is necessary to extend the classifier so that it recognizes a wider range of retinal diseases. In this context, a multicenter design is recommended. The feasibility was already shown by Burlina et al. (2017), who extended a 2-class classifier to a 3- and 4-class classifier with still valid results [21].

Conclusions

In conclusion, we created for the first time a deep learning-based classifier for the automated detection and classification of GA in FAF images. The classifiers showed excellent performance and very good repeatability.

GA is a progressive, sight-threatening disease, with a progression rate that varies depending on the pattern in FAF [10, 29, 30, 36]. Therefore, this approach may be helpful in predicting the individual progression risk of GA, identifying biomarkers, and gaining further information for possible future therapeutic approaches. To expand and improve the performance of the classifier, multicenter studies are desirable.