Introduction

Due to the recent progress in artificial intelligence (AI), there are new perspectives in medical research regarding automated data analysis [1, 2]. In this context, deep learning (DL), in particular, is of increasing interest. DL describes algorithms that process and analyze data in a hierarchical manner with artificial neural networks inspired from the organization of the brain [3, 4]. These approaches are currently attracting increasing interest in ophthalmologic imaging research [5,6,7,8,9,10].

Optical coherence tomography angiography (OCTA) is a rapidly evolving approach in multimodal retinal imaging and allows for a depth-selective visualization of blood flow in retinal vessels without dye injection [11]. This technology plays a steadily increasing role in clinical research as well as in clinical practice offering new insights into various chorioretinal diseases [12,13,14]. However, OCTA image quality achieved in healthy probands can often not be reproduced in pathologically altered maculae due to motion artifacts or segmentation errors [15,16,17]. Despite various improvements, such as eye tracking or increasing OCT imaging speeds, image artifacts remain a key limitation in OCTA imaging [18,19,20,21,22,23]. Manufacturers, meanwhile, offer different quality indices given after image recording, yet these indices are non-transparent and do not account sufficiently for image artifacts [17, 23]. At the moment, there is no other software approach that provides automated and reliable OCTA image quality assessment. It appears to be a timely issue whether OCTA devices or users of OCTA devices can be assisted by intelligent software solutions to handle the increasing amount of image data more effectively by avoiding useless image data of insufficient quality. The already existing DL approaches in ophthalmologic imaging rather focus on the machine-based interpretation of disease patterns in different imaging modalities than on the assessment of image quality [24,25,26,27,28,29,30,31]. Interestingly, the first and recently FDA-approved AI imaging device based on color fundus photography features an incorporated AI-based image quality assessment [32,33,34]. Thus, the aim of our study was to pursue a similar approach and create an automated image quality assessment for OCTA imaging based on a DL concept. Therefore, a multi-layer deep convolutional neural network (DCNN) was created and subsequently evaluated in the automated assessment of en-face OCTA image quality by creating a suitable DL classifier.

Methods

Demographics

Two hundred OCTA measurements were randomly selected from the OCTA database of the medical retina clinic of the Department of Ophthalmology at the University of Muenster Medical Center. As numerous patients in the database had received multiple imaging sessions, the selected 200 images had been recorded in merely 82 eyes of 82 subjects. The inclusion of multiple images per subject was accepted since not the individual artifact formation but the detection of artifacts was to be evaluated. Consequently, the diagnosis was registered but not a selection criterion. All subjects had an unambiguous diagnosis of their pathology based on clinical examination, multimodal imaging, perimetry, and, in some patients, electrophysiological examination. Signal strength index (SSI) and implemented OCTA image quality score (IQS) of all included OCTA examinations were recorded. All images included were recorded between 01.01.2016 and 01.04.2018. All procedures adhered to the tenets of the Declaration of Helsinki. Informed consent was obtained from all individual participants included in the study.

Imaging and manual image analysis

All OCTA images were obtained with the same commercial spectral domain OCT system (AngioVue, RTVue XR Avanti SD-OCT, Optovue, Fremont, CA, USA). The device was set to the same calibration for all images. Technical detailed information of the device is described elsewhere [35]. Images were included irrespective of the SSI, IQS, and expression of image artifacts due to motion, projection, or segmentation errors. In contrast to the other quality parameters used here, the SSI is implemented in the proprietary software of the Optovue device. The superficial vascular layer was selected based on the automated segmentation of the device. For image grading, the OCTA en-face image of the superficial plexus and corresponding OCT b-scans were considered. Pathological changes, e.g., drusen or epiretinal membrane, were visible to the graders. Yet, the application of the scores is independent of the presence of pathology. Nevertheless, a certain grader bias remains, which is unavoidable. All images were assigned to an image quality group (Table 1) with regard to the expression of motion artifacts following a motion artifact score (MAS), to the expression of segmentation errors following a segmentation accuracy score (SAS), to the visibility of small capillaries, and to the centering of the fovea. Both scores of MAS and SAS were recently described in detail elsewhere [17, 23]. OCTA image data were separately evaluated by two independent readers for MAS and SAS levels. In case of any discrepancy, a third grader was asked to arbitrate. Group 1 includes all images with sufficient image quality (Fig. 1). If an image did not meet all four criteria for sufficient image quality, it was assigned to group 2 (insufficient image quality) (Fig. 2).

Table 1 Definition of the two main groups of image quality assessment of the superficial vascular plexus
Fig. 1
figure 1

Multiple optical coherence tomography angiography en-face images of the superficial plexus showing sufficient image quality following criteria given in Table 1 with motion artifact score (MAS) 1 or 2, segmentation accuracy score (SAS) 1, centered fovea, and a clear presentation of the small capillaries (ao)

Fig. 2
figure 2

Multiple optical coherence tomography angiography en-face images of the superficial plexus showing insufficient image quality due to motion artifacts (e.g., a, g, j), segmentation errors (only visible in corresponding b-scans), decentration of the fovea (e.g., b), and/ or an insufficient visibility of the small capillaries (e.g., m, d), following criteria given in Table 1

Automated image analysis using deep learning

To automatically assess the quality of OCTA images, a multi-layer DCNN classifier was built, using the open-source DL framework TensorFlow™ (Google Inc., Mountain View, USA) [36, 37]. The method was previously described by our research group for the detection of exudative age-related macular degeneration (AMD) [7, 38]. For the present study, the last layer of the neural network was trained and validated with 160 OCTA en-face images (sufficient image quality: n = 80; insufficient image quality: n = 80) in 500 training steps. For each training step, the performance of the training procedure was evaluated by determining the training accuracy (the performance [%] of the DLA in correctly classifying a pre-trained image set), the validation accuracy (performance [%] of the DLA in correctly classifying a not pre-trained image set), and the cross entropy (function giving information about the training progress and aiming in minimizing value) [39].

In a next step, to evaluate the applicability of the classifier in determining the image quality of untrained images, another 40 randomly selected OCTA en-face scans (sufficient image quality: n = 20; insufficient image quality: n = 20) were to test the performance of the created DCNN classifier. Therefore, for every tested image, a sufficient image quality probability score (SPS) and an insufficient image quality probability score (IPS) were automatically computed.

The scores express the probability of correct pattern recognition. The scores were calculated for each single image. Following, a mean value of all images was given. The value of the score ranges between 0 and 1. A score of 1 describes a probability of 100%, defining the best possible association between an image and an image quality group. The allocation to the diagnosis “sufficient image quality” or “insufficient image quality” was made according to the higher numerical value of the two probability scores in each case.

The entire DL process, including training/validation and testing, was repeated a second time to provide valid information about the repeatability.

Statistical methods

We used SPSS Statistics version 23 (IBM Corporation, Somers, NY, USA) and Microsoft Excel 2010 (Microsoft Excel 14.0.7, Microsoft, Redmond, USA) for data management and statistical calculations. Averaged values are given as mean ± standard deviation. Groups were compared using the non-parametric Mann–Whitney U test for independent samples. Statistical significance was set at p < 0.05.

Results

The 200 randomly selected images (group 1: n = 100; group 2: n = 100) could be attributed to 23 healthy eyes of 23 subjects (61 images) and to 59 eyes of 59 patients with retinal pathologies (139 images). Images from eyes diagnosed with early/intermediate AMD with drusen and/or pigmentary changes (n = 54), late stages of AMD with choroidal neovascularization (n = 35) or geographic atrophy (n = 3), central serous chorioretinopathy (n = 4), retinal vein occlusion (n = 21), retinitis pigmentosa (n = 4), and epiretinal membrane (n = 17) were included. Mean SSI was 68.8 ± 8.63 in group 1 and 55.87 ± 9.27 in group 2 (p < 0.001, range 47–84 (group 1) and 28–86 (group 2)). Mean IQS was 7.79 ± 0.92 in group 1 and 5.55 ± 1.92 in group 2 (p < 0.001, range 5–9 (group 1) and 1–9 (group 2)).

Image quality assessment using deep learning

Training accuracy, validation accuracy, and cross entropy of the DCNN were calculated during the training and validation procedure. The graphical curves of the training and validation accuracy display a steep rise in the beginning of the procedure, followed by a continuous increase up to a final training accuracy of 97% and a final validation accuracy of 100%. The curve of the cross entropy proved to be consistent in a reciprocal manner with a final value of 0.12. Figure 3 displays the curves of training accuracy (a), validation accuracy (a), and cross entropy (b). The mean IPS scores were 0.84 ± 0.19 in group 1 and 0.12 ± 0.21 in group 2. The mean SPS scores were 0.16 ± 0.19 in group 1 and 0.88 ± 0.21 in group 2. The differences between the SPS and the IPS scores for both groups were highly significant (p < 0.001) (Fig. 4). Eighteen of all images with sufficient image quality were correctly detected. Classification was wrong in two images. In case of insufficient quality, there were also two images with a classification differing from the human reader (Fig. 5). The mean difference of the initial testing procedure and the repeated testing procedure was 0.01 ± 0.01 for the values of the IPS and 0.01 ± 0.01 for the SPS. The coefficients of variation were 0.96 ± 1.9% for the insufficient images and 1.14 ± 1.6% for the sufficient images. Sensitivity, specificity, and accuracy of the classifier were 90.0%.

Fig. 3
figure 3

The curves display training accuracy (a), validation accuracy (a), and the cross entropy (b) during the 500 training steps

Fig. 4
figure 4

Boxplots display mean values of insufficient image quality probability scores (IPS) (a) and sufficient image quality probability scores (SPS) (b) for the sufficient and insufficient image quality groups (groups 1 and 2)

Fig. 5
figure 5

Four different optical coherence tomography angiography en-face images of four different subjects that were classified incorrectly by the algorithm. The first two images were classified as sufficient image quality but were assigned to insufficient quality by the algorithm (a + b). The last two images were classified as insufficient image quality but were assigned to sufficient quality by the algorithm (c + d)

Discussion

For the first time, the Food and Drug Administration (FDA) approved of a medical device based on AI in April 2018. In clinical studies based on color fundus photography, it was able to correctly identify the presence of more than mild diabetic retinopathy in 87.4% and it was able to correctly identify those patients who did not have more than mild diabetic retinopathy in 89.5% [32,33,34]. Interestingly, this AI device has two core algorithms—an image quality algorithm and the actual diagnostic algorithm. The image quality algorithm is based on criteria such as retinal area, focus, color balance and exposure to detect sufficient color fundus photography image quality that allows the diagnostic algorithm, in a second step, to rule out diabetic retinopathy. This very first example of an FDA-approved AI imaging device illustrates the indispensable need to incorporate an automated image quality assessment into any intelligent image analysis device.

Therefore, this study aimed at training and subsequently evaluating a DCNN in the automated quality assessment of en-face OCTA images. We found that the trained neural network classifier reached a high prediction accuracy in discriminating sufficient and insufficient OCTA image quality in the validation dataset.

Since the first description of structural OCT imaging in the early nineties, the technology has made remarkable advances with regard to image quality and availability [8, 14, 17, 40,41,42]. Consequently, OCT-based criteria for disease classification and therapeutic indications, such as disease activity criteria in neovascular AMD, are well-established. In contrast, OCTA represents the most recent imaging modality still being in the early stages of development [43]. As there are aspects of OCTA that are not yet fully understood and still technical issues to address, such as projection artifacts, motion, and segmentation artifacts, it appears currently too early to apply DL to OCTA for making diagnoses or predicting therapeutic decisions. However, in the field of OCTA, image quality DL can very well be an interesting tool.

Al-Sheik and co-workers studied the effect of different image quality levels on a quantitative analysis of OCTA images in healthy eyes and found that the frequency of artifacts was higher, and the repeatability of vessel density measurements was lower in images of reduced quality [18]. These phenomena become even more evident in patients with pathologically altered maculae. Thus, warranting high image quality is a key to further establish OCTA [17, 19, 23].

The manufacturers of OCTA devices already offer an easily accessible image quality assessment by displaying parameters like the SSI or a proprietary image quality score. The sensitivity and specificity of the SSI judging OCTA image quality have been questioned in the literature, particularly with regard to image artifacts. It seems that a minimum amount of signal strength is required as a prerequisite for high-quality imaging, although signal strength alone does not guarantee an artifact-free image [17, 20]. Also, the validity of the new IQS of the Optovue device is not clear yet. Both SSI and IQS scores have been analyzed in the present work, and the results show a significant difference between the sufficient and insufficient image quality groups for both scores. The recent introduction of the IQS in the latest update of the Optovue software represents the consequential next step in refining their image quality score. Besides, it proves that manufacturers have recognized the importance of OCTA image quality long ago. However, a mean SSI of about 56/100 and a mean IQS of nearly 5.6/10 in the insufficient image quality group can certainly be misleading as such mean values do not clearly indicate poor quality. Additionally, in some cases, high SSI and/or IQS values were observed along with insufficient image quality, which is also expressed by the large range of values in both groups. For the ordinary user, it remains unclear which quality-defining measures are incorporated in such indices. OCTA manufacturers will certainly further refine the current image quality scores to improve their validity. At the moment, however, an assessment of both segmentation accuracy and presence of artifacts must be performed in other ways, particularly in the presence of retinal pathologies [17, 23].

In this scenario, DL seems predestined to play a supporting role in image quality assessment. The automatic recognition of retinal diseases and the assessment of disease stages or disease progression are undoubtedly exciting. However, a reliable and consistent image quality is necessary in the first step to compare retinal pathologies between patients [18, 44].

The use of deep learning in OCTA image quality assessment aims not at replacing the medical assessment but rather at warranting a minimum image quality standard to enable a reliable medical image evaluation. Obviously, it is desirable to present only images of consistently high quality to the image grader. Currently, the user must rely on their acquired expertise and experience in the evaluation of OCTA images.

In this study, the DLA was able to differentiate between sufficient and insufficient image quality following given criteria (Table 1) with high specificity, sensitivity, and accuracy. A wrong categorization during automated image quality assessment was found in four cases (Fig. 5). Different reasons for that can be discussed. The performance of the algorithm depends on the images provided during training and validation. Errors in image selection and categorization by the human user can misdirect the DLA. Furthermore, criteria of the image evaluation process of the algorithm do not necessarily match those of a human reader. Thus, algorithms can be misdirected by information that may not be noticed by the user during manual image analysis, known as the black-box phenomenon in deep learning [1, 8, 45].

It is possible that the results could be improved with a further increase in the number of patients and images. This would presumably allow even more precise distinctions between the quality categories. Furthermore, the use of DLA seems to save a huge amount of resources, considering the high expenditure of time when performing a manual analysis of all relevant quality parameters. In addition to supporting and accelerating the clinical evaluation of OCTA imaging, there might also be perspectives in the field of training of OCTA users, such as photographers. DLA as a training tool could be useful for sensitizing users to sources of error during the imaging process while learning how to use OCTA devices.

Several aspects limited our study. Our image quality assessment is based on en-face images of the superficial vascular plexus. Due to the well-recognizable vascular structures in the area of the superficial plexus, a categorization of the images with regard to the image quality is in many cases much easier for humans than, for example, at the level of the choriocapillaris. The distinction between artifacts and actual anatomical structures is considerably more demanding here, since the image consists only of a contrast pattern. The prerequisite for using an automated process is a high quality in manual analysis. In this study, this was made possible using established image quality scores like MAS, which is designed for a quality assessment based on the superficial plexus [17]. Furthermore, the data was exclusively obtained with one OCTA device and the results must not be transferred to other devices or other software algorithms. Yet, the OCTA device used in this study is FDA-approved and it has been in widespread clinical use for years. The retrospective study design represents another limitation. Certainly, this also made it possible to evaluate numerous images, and the collective size increases the reliability of the DLA. Concurrently, the inclusion of multiple images of a single subject potentially reduces the diversity within the data set of this study. This would be especially critical if an individual DL-based analysis of disease-specific characteristics is performed. However, in the present study, only the automatic detection of image artifacts was tested. Image artifacts can also vary from one scan to another when performing multiple imaging sessions of a single individual. Furthermore, one must consider that mistakes can be made by the human reader by categorizing the images into sufficient and insufficient that in turn can disturb the DLA. It is known that a DLA can be misdirected even by characteristics that are barely recognizable for a human reader, as discussed previously [1, 8, 45]. Consequently, it would be desirable to test the DLA with an even larger pool of images in follow-up studies. A critical evaluation of the data remains important even when using DL. Humans will continue to play an important role here, both in the development and in the control of the automatic analysis methods used.

OCTA imaging and DL represent two up-to-date and highly dynamic innovations, one in the field of retinal imaging and one in the field of image processing and analysis. The combination of both opens the door for countless new possibilities in retinal imaging and image analysis. A DL algorithm with focus on image quality also paves the way for a reliable application of other AI solutions with a more clinical focus. A timely implementation of DL software solutions for optimized automated quality assessment appears both sensible and realistic. In this study, we primarily aimed at presenting a proof-of-concept for the beneficial use of DL in optimizing and standardizing image quality in OCTA imaging. In any case, the insights gained here encourage further studies focusing on the use of DL in OCTA. In conclusion, DL appears as a suitable approach to improve OCTA image quality assessment as well as to accelerate and facilitate OCTA work flow. DL may contribute to a reliable adherence to quality and analysis standards in this evolving and complex imaging modality.