1 Introduction

Congenital nystagmus (CN) is an ocular-motor disorder that reduces visual acuity (VA) from the first months of life [1]. It consists of involuntary, conjugate and rhythmic horizontal to-and-fro eye movements; affected patients have disrupted fixation because the image of the target moves quickly across the retina [2]. There are different ways to perform eye-tracking [3], but one of the most widely employed is electrooculography (EOG), which exploits measurements of skin potentials. Electrodes placed on the skin sense the corneo-retinal potential (the resting potential between the cornea and the retina of the eye), which is proportional to the eye movement. Since the obtained signal has a low voltage, it is amplified, filtered and processed to remove involuntary blinks, noise and other artefacts [4,5,6]. In contrast, infrared oculography (IROG) is a non-invasive method to validate the time of foveation, an indirect measure of VA: an infrared light illuminates the eye and the sclera reflects it; the difference between the emitted and the reflected infrared light describes the eye position. Both EOG and IROG are still considered good methods for measuring eye movements and for eye-tracking, as reported by Singh and Singh in their review [7].

Relationships among VA, the amplitude and frequency of baseline oscillations (BLO), variability of eye position (SDp), nystagmus foveation periods (Tf), and nystagmus amplitude and frequency have been studied in the literature. Bifulco et al. examined the association between the amplitude of the BLO and the SDp, while Cesarelli et al. suggested an exponential model relating SDp and VA; both studies focused on foveation periods [8, 9]. Other authors focused mainly on the automatic detection of nystagmus and studied its relationships with other parameters [10,11,12]; Sheth et al. proposed several associations, such as those between VA and nystagmus features or the standard deviations of eye velocity and eye position [13]. Regarding the relationships between features and physiological values, Dunn et al. studied the extent to which use of the null zone (as opposed to other gaze angles) affects VA in adults with infantile nystagmus [14]. In a review, Dunn provided researchers with a clinical viewpoint on recent advances in the field [15].

In contrast, Kelly et al. investigated and found a direct relation of VA with visual evoked potential and optic disc diameter in awake children [16]. Moreover, Kelly investigated how much eye velocity in CN deprives the developing visual system of physiological VA [16].

Owing to the large amount of available data, new techniques, namely machine learning, have been used in the literature to discover hidden patterns in datasets [17,18,19]. The growth of information technology has brought many engineers into health facilities to help clinicians with the diagnosis and prognosis of patients, which are often hard tasks [20,21,22]. Machine learning has been employed for a wide range of biomedical applications: Ricciardi et al. employed it in neurology to distinguish Parkinsonisms and in cardiology to help with the diagnosis of coronary artery disease, and it has also been applied to biomedical signals such as cardiotocography [23,24,25]. Of course, some difficulties and challenges have to be faced regarding the management, processing and understanding of big data [26].

Applications of machine learning in ophthalmology tend to address brain-computer interfaces (BCI), particularly through the combination of EOG and electroencephalography (EEG). Witkowski et al. introduced and tested a novel hybrid brain-neural computer interaction system fusing EEG and EOG to enhance the reliability and safety of continuous hand exoskeleton-driven grasping motion; similarly, Punsawad et al. used a system fusing EOG and EEG, but for different purposes. Fatourechi et al., instead, revealed weaknesses in BCI studies due to incorrect handling of electromyography and EOG artefacts [27,28,29]. Nevertheless, some efforts have been made to perform classification based on eye-movement recordings; Zemblys et al. employed machine learning techniques to detect events in eye-tracking data and tested different state-of-the-art algorithms [30, 31]. Other past applications focused on detecting fixations and saccades through velocity-based and dispersion-based algorithms, although they gave no information about events [32, 33].

Therefore, the aim of this paper is to study the relationships between physiological values of people affected by CN and features extracted from their EOG through several machine learning algorithms (Random Forests (RF), Logistic Regression Tree (LRT), Gradient Boosted Tree (GBT), K-Nearest Neighbour (KNN), Multilayer Perceptron (MLP) and Support Vector Machine (SVM)), and to compute evaluation metrics in order to compare the new results with those previously obtained without this approach. The dataset used for this paper is the same as that used by Cesarelli et al. and Bifulco et al. in 2000 and 2002, respectively [8, 9].

2 Materials and methods

The EOG of 20 patients (10 males and 10 females, aged between 6 and 34 years) affected by different forms of CN was recorded for both the right and left eyes, obtaining 40 signals. Binocular horizontal eye movements were recorded for each patient at different gaze positions. Patients, with their head resting on a head support and a chin rest to reduce head motion, watched a light stimulus at a fixation distance of 1 m. An arched, horizontal LED bar was used and adjusted according to the height of the subject. The sequence of light stimuli had the following angles: 0°, 5°, 10°, 20°, 30°, 0°, −5°, −10°, −20°, −30° and 0°. It lasted 2 min, with each position held for 10 s. Both EOG and IROG were used to record eye movements, with a sampling frequency of 200 Hz. IROG was employed to deal with non-collaborative patients such as children. Before the signals were acquired, all patients were allowed to familiarize themselves with the device to obtain a better result.

The classic Landolt C technique was used to measure VA. The Bioengineering Unit (Department of Electronic Engineering, University of Naples ‘Federico II’) designed dedicated software to process the EOG signals and to extract nystagmus features.

The characteristics of the patients according to these features are summarized in Table 1. VA ranges from 0 to 1 in steps of 0.1; the normal value is 1, and values lower than 0.6 are considered pathological. The SDp of healthy subjects is equal to 0 because it is not a physiological phenomenon.

Table 1 Descriptive statistics and unit of measurement of each feature

2.1 Pre-processing of signals and extraction of features

A low-pass filter with a −3 dB cut-off frequency of 70 Hz was applied to all signals, together with a notch filter to reduce power-line noise. A further pre-processing step removed the DC component and any linear trend, in order to eliminate signal components due to electrode polarisation. A low-pass differentiation algorithm for biological systems was used to calculate eye velocities from eye positions [34].
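The removal of the DC and linear components can be sketched as subtracting a least-squares line from the position signal. The function names below are illustrative assumptions, and the central-difference velocity is only a simple stand-in for the low-pass differentiation algorithm of [34], which is not reproduced here.

```python
FS = 200.0  # sampling frequency in Hz, as stated in the text


def remove_linear_trend(x):
    """Subtract the least-squares line (DC offset plus linear drift) from x."""
    n = len(x)
    t_mean = (n - 1) / 2.0
    x_mean = sum(x) / n
    num = sum((i - t_mean) * (v - x_mean) for i, v in enumerate(x))
    den = sum((i - t_mean) ** 2 for i in range(n))
    slope = num / den
    return [v - (x_mean + slope * (i - t_mean)) for i, v in enumerate(x)]


def central_difference_velocity(position, fs=FS):
    """Approximate velocity (deg/s) from position (deg) by central differences.

    Stand-in only: the paper uses a low-pass differentiator [34].
    """
    n = len(position)
    vel = [0.0] * n
    for i in range(1, n - 1):
        vel[i] = (position[i + 1] - position[i - 1]) * fs / 2.0
    # One-sided differences at the two ends of the record
    vel[0] = (position[1] - position[0]) * fs
    vel[-1] = (position[-1] - position[-2]) * fs
    return vel


# Synthetic example (invented data): 2 deg offset plus 0.01 deg drift per sample
drift = [2.0 + 0.01 * i for i in range(400)]
detrended = remove_linear_trend(drift)
velocity = central_difference_velocity(drift)
```

On this synthetic drift the detrended signal is numerically zero, and the estimated velocity equals the drift rate (0.01° per sample at 200 Hz, i.e. 2°/s).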

Signal tracts corresponding to the different gaze positions were extracted from each recording. A specific algorithm was employed to automatically recognise nystagmus cycles and extract nystagmus features such as amplitude, frequency, intensity and waveform shape [35]. The foveation window was localized in the signal as the time interval during which the eye position remained within 0.5° of the local maximum of the nystagmus cycle and the eye velocity was lower than 4°/s. The duration of the foveation window was taken as a measure of Tf. This foveation window differs from the one proposed by Dell’Osso et al. [36]. The SDp was estimated by computing the standard deviation of all the samples in all the foveation windows contained in a single signal tract. In the eye movement recordings of our dataset, a cycle-to-cycle variability of eye position and velocity during the foveation windows was often observed. This variability appears to result from the superimposition of a sinusoidal-like oscillation of the baseline, and the BLO was therefore assumed to be a pure sinusoid. To characterize these sinusoidal oscillations, a common least mean square (LMS) fitting technique was used, starting from an estimate of the BLO frequency obtained directly from the FFT of the signal tract: for each signal tract, the highest peak of the power spectrum of the eye movement signal in the range 0.1–1.5 Hz was taken as the estimator of the BLO frequency. BLO amplitude and phase were then computed using the LMS fitting approach. Reasonably, this slow baseline wandering may cause an increase of the SDp during foveation, which in turn may hamper VA.

Figure 1 represents how some features were extracted.

Fig. 1

Qualitative picture of an eye movement recording including five nystagmus cycles. It exemplifies the computation of the foveation windows, the foveation time Tf and the standard deviation of eye position during foveation SDp
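The computation of the foveation window, Tf and SDp defined in Section 2.1 can be sketched as follows. Several choices are assumptions of this sketch rather than details confirmed by the text: the cycle boundaries are taken as given (they would come from the cycle-recognition algorithm of [35]), the window is grown contiguously around the position maximum of each cycle, the velocity criterion is applied to the absolute velocity, Tf is averaged over the cycles of a tract, and the population standard deviation is used for SDp.

```python
import statistics

FS = 200.0           # sampling frequency in Hz, as stated in the text
POS_TOL_DEG = 0.5    # position tolerance around the cycle maximum (from the text)
VEL_MAX_DEG_S = 4.0  # velocity threshold (from the text)


def foveation_window(pos, vel, start, end):
    """Return (first, last) sample indices of the foveation window of one cycle.

    The window is grown around the position maximum of the cycle [start, end)
    while both criteria hold; None is returned if the maximum itself fails.
    """
    peak = max(range(start, end), key=lambda i: pos[i])

    def ok(i):
        close = pos[peak] - pos[i] <= POS_TOL_DEG
        slow = abs(vel[i]) < VEL_MAX_DEG_S
        return close and slow

    if not ok(peak):
        return None
    first = last = peak
    while first - 1 >= start and ok(first - 1):
        first -= 1
    while last + 1 < end and ok(last + 1):
        last += 1
    return first, last


def tf_and_sdp(pos, vel, cycles, fs=FS):
    """Mean foveation time Tf (s) and SDp (deg) over the cycles of one tract."""
    durations, samples = [], []
    for start, end in cycles:
        win = foveation_window(pos, vel, start, end)
        if win is None:
            continue
        first, last = win
        durations.append((last - first + 1) / fs)
        samples.extend(pos[first:last + 1])
    return statistics.mean(durations), statistics.pstdev(samples)


def make_cycle(plateau):
    """Invented cycle: a 40-sample fast ramp followed by a 10-sample plateau."""
    ramp = [plateau - 4.0 + 0.1 * i for i in range(40)]
    return ramp + [plateau] * 10, [20.0] * 40 + [0.0] * 10


# Two synthetic cycles whose foveation positions differ by 0.2 deg
p1, v1 = make_cycle(0.0)
p2, v2 = make_cycle(0.2)
pos, vel = p1 + p2, v1 + v2
tf, sdp = tf_and_sdp(pos, vel, [(0, 50), (50, 100)])
```

In this example each foveation window spans the 10 plateau samples (0.05 s at 200 Hz), and the 0.2° shift between the two plateaus yields an SDp of 0.1°, which shows how a wandering baseline raises SDp even when each individual foveation is perfectly steady.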

2.2 Tool, algorithm and evaluation metrics

The Knime analytics platform, a “business intelligence and predictive analytics” tool, was chosen to implement the algorithms. This platform allows users to create workflows by combining different nodes and to install plugins for the most popular programming languages and software, such as R, Weka, Matlab and Python. It has been used in the literature for many biomedical applications: in cardiology [37, 38], neurology [39] and radiology [40, 41]. Since this is an investigative analysis, several algorithms exploiting different principles were employed: RF, LRT, GBT, KNN, SVM and MLP. LRT is one of the most common algorithms for regression analysis; RF and GBT are enhancements of the decision tree, which is the easiest and most intuitive algorithm in the literature; KNN works on a different principle, since it is an instance-based algorithm. RF, LRT and GBT are based on the decision tree, whose basic idea is to divide a composite problem into many easier ones. A decision tree is made up of leaves and nodes, which stand for a predicted value and an attribute, respectively. RF consists of an ensemble of decision trees in which each tree considers a random subset of the attributes at each node and is trained on only a random part of the training data. The LRT consists of a decision tree whose leaves contain linear regression models. The last tree-based algorithm is GBT, which uses all the principles of ensemble learning to strengthen the decision tree: randomization and bagging, as in RF, plus boosting. In summary, RF and GBT are both based on the decision tree, but they exploit two different combinations of ensemble learning principles (bagging and randomization for RF; GBT adds boosting). While the decision tree is easy to use and understand, KNN was employed because it is expected to perform well when dealing only with numeric attributes. It is an instance-based algorithm, which assigns the class (or, in regression, the predicted value) to the test data based on their distance from similar training data. SVM can cope with overfitting, small datasets, and non-linear and/or high-dimensional data; it can be used for both classification and regression. It seeks the best hyperplane that divides the dataset into two classes and employs a non-linear mapping that transforms the original data into a higher-dimensional space when they are not linearly separable; it aims at maximizing the margin separating the classes while minimizing the classification errors [42]. The last algorithm, MLP, is based on yet another principle, since it is a neural network with an input layer, one or more hidden layers and an output layer. Training is usually performed through the backpropagation of errors (BP) algorithm or one of its variants. The MLP can characterize complex mappings and address large non-linear problems in an effective and relatively simple way [43].
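As an illustration of the instance-based principle, a KNN regressor can be sketched in a few lines: the prediction for a new record is the mean target of its k nearest training records. This is a generic textbook sketch, not the Knime node used in the study; the function name, the choice of Euclidean distance, the value of k and all numbers in the example are invented for illustration.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets.

    Distances are Euclidean; features are assumed to be on comparable scales.
    """
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = [y for _, y in by_distance[:k]]
    return sum(nearest) / len(nearest)


# Invented records: (SDp in deg, Tf in s) -> VA; these are not values from the dataset
train_x = [(0.2, 0.10), (0.3, 0.08), (1.0, 0.03), (1.2, 0.02)]
train_y = [0.9, 0.8, 0.4, 0.3]
va_hat = knn_predict(train_x, train_y, query=(0.25, 0.09), k=2)
```

Here the two nearest neighbours of the query are the first two records, so the predicted VA is their mean (0.85). In practice the features would need rescaling before computing distances, since SDp and Tf have different units and ranges.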

A leave-one-out cross-validation was employed, since the number of records was quite small (40: 20 patients with two recordings each, for the right and left eyes). The models are trained on all the records except one, which is “left out” for testing, allowing a more honest evaluation. This procedure is repeated a number of times equal to the number of records. The performance was evaluated with the following metrics, recognized in the literature for comparing and assessing classifiers [44]: coefficient of determination (R2), mean absolute error, mean squared error, root mean squared deviation and mean signed difference. The features considered for predicting VA were the same as those of Cesarelli et al. [8], while the features included in the algorithms to predict SDp were the same as those of Bifulco et al. (with Tf in place of VA) [9]; they are all shown in Table 2.
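The evaluation procedure can be sketched as a leave-one-out loop followed by the computation of the five metrics on the held-out predictions. The sketch uses a one-feature least-squares line as a placeholder model; the function names, the placeholder model and the sign convention of the mean signed difference (prediction minus observation) are assumptions of this example, not details taken from the Knime workflow.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def leave_one_out(xs, ys):
    """Predict each record with a model trained on all the other records."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds


def regression_metrics(y_true, y_pred):
    """R2, MAE, MSE, RMSD and mean signed difference (prediction - truth)."""
    n = len(y_true)
    err = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in err) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return {
        "R2": 1.0 - (n * mse) / ss_tot,
        "MAE": sum(abs(e) for e in err) / n,
        "MSE": mse,
        "RMSD": math.sqrt(mse),
        "MSD": sum(err) / n,
    }


# Invented, exactly linear data: every held-out prediction should be exact
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
metrics = regression_metrics(ys, leave_one_out(xs, ys))
```

Because every prediction is made on a record the model has not seen, R2 computed this way can be lower than the R2 of a fit evaluated on its own training data, which is worth keeping in mind when comparing the two kinds of figures.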

Table 2 Features included in the algorithms for both predictions

3 Results

The features shown in Table 2 were used to train and test six algorithms: RF, LRT, GBT, KNN, MLP and SVM. The targets for the regression analyses were VA and SDp, and the performance was evaluated by employing a leave-one-out cross-validation. The results for VA are shown in Table 3; those for SDp are shown in Table 4.

Table 3 Performance in the prediction of VA
Table 4 Performance in the prediction of SDp

On the one hand, RF was the best algorithm for predicting VA, obtaining the highest coefficient of determination (R2 = 0.85) and the lowest errors. MLP and GBT followed, with R2 equal to 0.83 and 0.82, respectively, and errors comparable with those of RF. KNN and SVM obtained lower values of R2, which did not exceed 0.70.

On the other hand, in the prediction of SDp, GBT, RF and LRT obtained low R2 values (0.68, 0.65 and 0.62, respectively). Good results were achieved by KNN and MLP (R2 equal to 0.74 and 0.72, respectively). The best algorithm for predicting SDp was SVM, with an R2 equal to 0.79 and the lowest errors among all the algorithms.

Figures 2 and 3 show the feature importance for the regression analysis of VA and SDp, respectively, according to the results obtained by the RF. The importance was computed based on how often a variable was used to make the splits at the first, second or third level.
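A split-count importance of this kind can be sketched as follows. The data structure, the function name and the normalisation to fractions of all counted splits are assumptions of this sketch, and the tiny two-tree forest is invented; the exact formula used to produce Figures 2 and 3 is not given in the text.

```python
from collections import Counter


def split_count_importance(trees):
    """Fraction of the splits in the first three levels that use each attribute.

    `trees` is a list of trees; each tree is a list whose first three items
    hold the attribute names used for the splits at levels 1, 2 and 3.
    """
    counts = Counter()
    for levels in trees:
        for level in levels[:3]:
            counts.update(level)
    total = sum(counts.values())
    return {name: c / total for name, c in counts.most_common()}


# Invented forest of two trees (attribute names are illustrative only)
forest = [
    [["SDp"], ["Tf", "SDp"], ["Amplitude", "Tf", "SDp", "Frequency"]],
    [["Tf"], ["SDp", "SDp"], ["Tf", "Amplitude"]],
]
importance = split_count_importance(forest)
```

In this invented forest, SDp is used in 5 of the 12 counted splits and Tf in 4, so SDp ranks first. An importance defined this way reflects how often an attribute is chosen near the root, not the size of its effect on the prediction.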

Fig. 2

Feature importance in the prediction of VA

Fig. 3

Feature importance in the prediction of SDp

SDp and Tf were the most important features for the prediction of VA, confirming the strong relationship between VA and SDp, while BLO amplitude, Tf and nystagmus amplitude were the most important features for predicting SDp.

4 Discussion and conclusion

Summarizing the study, the first phase consisted in acquiring the EOG of both eyes of 20 people affected by CN. After pre-processing, some features (frequency, amplitude, intensity, nystagmus foveation periods, and BLO amplitude and frequency) were extracted through custom-made software developed by the Bioengineering Unit of the University of Naples “Federico II”. In a second phase, six algorithms (RF, LRT, GBT, KNN, MLP and SVM) were implemented through the Knime analytics platform, trained and tested with a leave-one-out cross-validation, and their performance was assessed through several evaluation metrics (R2, mean absolute error, mean squared error, root mean squared deviation, mean signed difference).

The first model confirmed the results obtained by Cesarelli et al. and Bifulco et al., with a strong dependence of VA on SDp [8, 9]. Tf was the second most important feature when predicting VA, and this has a physiological explanation: when Tf has an acceptable duration, people affected by CN can have good vision, with a VA of 0.8 to 0.9, which is near the normal value; when Tf goes below a critical value, instead, vision becomes blurred. This model holds if the eye position returns to the foveation position after each cycle.

The second model took into consideration and confirmed the hypothesis made by the authors of [8, 9]. Nystagmus is a periodic function; thus, it should always have the same maximum position. This turned out to be false in the set of patients analysed. Therefore, the hypothesis was that a low-frequency BLO was superimposed on the higher-frequency nystagmus oscillation. This implied that the foveation position changed from cycle to cycle, with a consequent worsening of vision. The feature importance computed with machine learning captured this concept, namely that increasing the Tf or the amplitude of nystagmus makes the SDp increase.

On the one hand, Cesarelli et al. created an exponential model called the Nystagmus Acuity Estimator Function (NAEF), a function of SDp and Tf only, obtaining a coefficient of determination of 0.85 as a measure of linearity between NAEF and VA; this is comparable to the R2 obtained through machine learning, although they only applied linear regression and did not model VA with all the features [8]. On the other hand, Bifulco et al. also introduced the BLO amplitude and frequency and obtained a coefficient of determination of 0.69 as a measure of linearity between SDp and the BLO amplitude, which is lower than the R2 achieved by three of the machine learning algorithms that we employed [9].

Comparing these results with those obtained with the more modern machine learning algorithms, the feasibility of these techniques in the context of EOG is clear. The regression on SDp reached a maximum R2 of 0.79 through SVM, and KNN and MLP also obtained an R2 greater than 0.70. VA was predicted with an R2 always greater than 0.67, with the three highest values of 0.85, 0.83 and 0.82 obtained by RF, MLP and GBT, respectively. RF showed greater potential in predicting nystagmus features. As regards the interpretation of the algorithms, all the tree-based algorithms achieved good results for the prediction of VA, with R2 greater than 0.70, while the instance-based ones (KNN and SVM) did not exceed 0.70. For predicting SDp, the tree-based algorithms did not reach a high coefficient of determination (R2 < 0.70), while the instance-based ones did. Finally, the MLP obtained good results in both cases (R2 > 0.70).

Dunn et al. found a strong relationship between the characteristics of nystagmus and VA [14]. Dunn also observed that a letter chart is typically employed to measure VA in health facilities and that, despite the efforts of clinicians to give patients plenty of time to read the chart, there is a need to move on to the next test [15]. Thus, he encouraged research into restricted-duration optotypes.

Zemblys et al. showed that RF exhibited the best eye-movement event classification performance [31, 32]. Most past algorithms worked well only under assumptions on the data, such as high-quality input or high sampling frequencies, whereas training a classifier on an extensive selection of input data allows machine learning to generalize better than hand-crafted algorithms [45]. Indeed, these classifiers can be applied to many kinds of data, needing only proper training to accomplish their tasks.

Thus, considering the feasibility of machine learning algorithms applied to features extracted from EOG, they could also be used for diagnosis and prognosis tasks in ophthalmology, as has been done in many medical fields [24, 25, 38, 39]. Since CN is a complex pathology, a Nystagmus Care Pathway of 7 phases was developed in the UK; our procedures may be included in some stages of this clinical pathway, particularly in the identification of CN and in the phases of finding and managing underlying causes and associations [46]. This study agrees with the conclusion of Zemblys et al. and strengthens it since, in this case, a regression analysis was performed. Moreover, the findings of Kelly et al. and Dunn et al. were confirmed: a relationship between VA and the characteristics of nystagmus was found [15, 16]. A clear limitation, much discussed in the literature [47, 48], is the black-box nature of machine learning models: the algorithm does not provide researchers with many details, but only gives insight into the possibility of finding relationships or useful classifications (e.g. for a diagnosis or a prognosis). Of course, there is room for future developments: only 20 patients (40 signals) were analysed, and the signals were recorded and processed with old instrumentation, meaning that the quality of signals and features could be improved with more modern instrumentation. Finally, as shown by some researchers [14, 15], VA is not the best way to measure CN; therefore, machine learning algorithms could be a valuable way to find new insights on how to replace this measure.