Introduction

Postural imbalance is an ambiguous and key symptom of various neurological disorders, including acquired loss of sensory function (visual, vestibular or somatosensory), cerebellar or extrapyramidal dysfunction due to degenerative brain disorders, or somatoform functional disorders [1,2,3]. Some of these conditions can be easily and reliably diagnosed by associated neurological symptoms. Others are difficult to differentiate clinically, and require further quantitative assessment of stance and gait. To this end, static posturography [4,5,6] records the stance and sway behavior of patients under increasingly difficult conditions, while standing on a force measurement platform. In static posturography, body sway (i.e., center-of-pressure displacements) is measured and scored based on so-called features, i.e., characteristic parameters quantifying sway path amplitude and variance, or tremors in various frequency bands using Fourier analysis. As shown in our own previous work [7], as well as the work of others [8,9,10], diagnostic classification can be computerized using multi-parametric statistical analysis [11] or supervised machine-learning algorithms like artificial neural networks (ANNs) [7].

Here, we apply modern machine-learning and data-mining techniques to 8 subject classes in a larger clinical dataset of 293 individuals, recruited from our neurology department and from our tertiary interdisciplinary outpatient dizziness unit, the German Center for Vertigo and Balance Disorders. The cohort includes healthy controls and the following pathological classes: acute unilateral vestibulopathy (formerly: acute vestibular syndrome, or vestibular neuritis) [12], cerebellar disorders (anterior lobe cerebellar atrophy [13], downbeat nystagmus syndrome [14]), somatosensory deficits (sensory polyneuropathy) [8], postural tremor (orthostatic tremor) [15, 16], extrapyramidal dysfunction (Parkinson’s disease) [17] and functional disorders (Persistent Postural-Perceptual Dizziness, in its sub-form phobic postural vertigo) [18].

Our analysis of posturographic sway has three objectives. (1) We classify static posturography signal patterns using modern supervised machine-learning techniques, towards computerized diagnosis of the eight classes listed above. In contrast to many works that classify only one or two types of disorders [8, 9, 19, 20], often against healthy controls, we aim at a more general classifier that distinguishes all eight classes. (2) We calculate the discriminative power and perform a ranking of the 10 posturographic stance conditions and their derived individual sway features, to better assess the suitability of the experimental paradigm. (3) We create a mapping of posturographic sway patterns of subjects by non-linear projection of high-dimensional sway patterns into a 2D map. This allows us to collectively and visually analyze the distribution of sway patterns from all subjects in our cohort. In the discussion, we give explanations and pointers towards improvement of the computerized diagnostic scheme for routine clinical use.

Methods

Patient cohorts

For this study, we included a cohort of 293 adults, comprising patients (n = 236) and healthy controls (n = 57), who underwent posturographic examination under the same examination protocol. Group statistics are detailed in Table 1. Presumptive clinical diagnoses were determined by expert physicians from a neurological clinic and policlinic (Dept. of Neurology, University of Munich, Germany) and a tertiary interdisciplinary outpatient dizziness unit (German Center for Vertigo and Balance Disorders, Munich, Germany). Patients in the primary orthostatic tremor (OT) group were diagnosed according to criteria defined in [21, 22], with a focus on a high-frequency body sway (11–19 Hz) present already while standing with eyes open on firm ground. Downbeat nystagmus (DN) was diagnosed with the aid of quantitative video-oculography according to criteria defined in [23]. Acute unilateral vestibulopathy (AVS) was diagnosed following guidelines in [24]. Parkinson’s disease (PD) patients were clinically diagnosed [25] and selected to only include patients with a tremor-dominant type (4–8 Hz). The diagnosis of anterior lobe cerebellar atrophy (CA) was based on cerebellar symptoms during neurological examination and confirmation with MR imaging (atrophy in the vermal and anterior lobe region of the cerebellum). Polyneuropathy (PNP) patients presented a distal, symmetric sensorimotor loss, according to guidelines in [26]. Patients belonging to the phobic postural vertigo (PPV) group were recruited earlier than 2017, i.e., before the official consensus criteria for Persistent Postural-Perceptual Dizziness (PPPD) were defined [3]; hence, the diagnostic process followed guidelines in [27]. For this study, neither disease severity nor comorbidities could be considered; instead, we represented patients with respect to their primary diagnosis.
All patients participated under written consent. The study was authorized by the ethics committee of the medical faculty, Ludwig-Maximilians-University, Munich, and conducted in conformity with the Declaration of Helsinki.

Table 1 Details of patient cohorts

Instrumentation

Posturographic examination

We followed the posturographic examination protocol of [7]. Subjects were examined while standing on a stabilometer platform (Type 9261 A; Kistler, Winterthur, Switzerland). The protocol consisted of ten trials during which subjects had to maintain balance in upright stance, with arms hanging, under conditions of increasing difficulty. Posturographic stance conditions were: (1) eyes open; (2) eyes closed; (3) eyes open, head in neck (backward head extension); (4) eyes closed, head in neck; (5) eyes open, standing on foam-rubber block; (6) eyes closed, foam-rubber block; (7) eyes open, head in neck, foam-rubber block; (8) eyes closed, head in neck, foam-rubber block; (9) eyes open, foam-rubber block, tandem stance; (10) eyes closed, foam-rubber block, tandem stance.

Signal processing and feature extraction

The posturography sensor hardware records raw data of fore-aft (y) and lateral (x) body sway and body weight (z), at a sampling frequency of 40 Hz. Each trial was recorded for a duration of 30 s. Machine-learning classification was performed on a set of discriminative sway features. For each trial, recorded as a 3-dimensional time series of x/y/z raw data, 18 features were computed [7], which are summarized in Table 2. A full examination with 10 trials thus yields a 180-dimensional feature vector (10 trials with 18 features each) for each subject. The features represent accumulated sway path in x, y and z directions [28], root-mean square values in x, y and z directions [29], and spectral energy magnitudes [7] in different frequency bands, which were computed after application of a Hamming window, followed by discrete Fourier analysis (MATLAB, MathWorks Inc., USA). Feature abbreviations, descriptions, and formulae are summarized in Table 2.
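The feature computation described above can be illustrated with a minimal NumPy sketch. It is not the MATLAB implementation used in the study: the exact formulae are those of Table 2, and the band edges below are an assumption taken from the frequency bands named in the Results. The function names `sway_features` and `trial_features` are hypothetical.

```python
import numpy as np

FS = 40.0  # sampling frequency in Hz, as stated in the text

# Assumed band edges in Hz (bands named in the Results); Table 2 is authoritative.
BANDS = [(0.1, 2.4), (2.4, 3.5), (3.5, 8.0), (11.0, 19.0)]


def sway_features(signal, fs=FS, bands=BANDS):
    """Return sway path, RMS and band energies for one 1-D sway channel."""
    signal = np.asarray(signal, dtype=float)
    # Accumulated sway path: sum of absolute sample-to-sample displacements.
    sway_path = np.sum(np.abs(np.diff(signal)))
    # Root-mean-square of the mean-removed signal.
    centered = signal - signal.mean()
    rms = np.sqrt(np.mean(centered ** 2))
    # Hamming window followed by a discrete Fourier transform.
    windowed = centered * np.hamming(len(centered))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(centered), d=1.0 / fs)
    # Spectral energy per band: sum of squared magnitudes inside the band.
    energies = [spectrum[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]
    return np.array([sway_path, rms, *energies])


def trial_features(xyz):
    """Concatenate per-channel features for an (n_samples, 3) x/y/z trial."""
    return np.concatenate([sway_features(xyz[:, k]) for k in range(3)])
```

With these assumed bands, each channel yields six values, so one x/y/z trial yields 18 values and ten trials yield a 180-dimensional vector, which matches the dimensions given in the text.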

Table 2 List of spatial and spectral features extracted from posturographic raw data

Classification using supervised machine learning

Our procedure for supervised machine learning consists of data pre-processing, cross-validation and classification. Pre-processing is performed by normalization of the data range for each sway feature. Normalization applied a transformation to a zero-mean (μ = 0) and unit-variance (σ² = 1) distribution for each sway feature. For assessment of classification robustness, we utilized a stratified k-fold cross-validation, which provides training- and test-set splits that compensate for the class imbalance of our dataset. To obtain robust cross-validation statistics, we utilized k = 50 randomized and stratified splits at 90% training vs. 10% test data for each class.
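As a hedged sketch of this pre-processing and splitting scheme, the snippet below uses scikit-learn's `StandardScaler` and `StratifiedShuffleSplit` on synthetic stand-in data. Whether the study fitted the normalization on the training part only is not stated; doing so here is an assumption, chosen to avoid leaking test-set statistics.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 80 subjects, 180 features, 4 imbalanced classes.
X = rng.normal(loc=5.0, scale=3.0, size=(80, 180))
y = np.repeat([0, 1, 2, 3], [40, 20, 10, 10])

# 50 randomized, stratified splits with 10% of the data held out for testing.
splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.1, random_state=0)
test_class_counts = []
for train_idx, test_idx in splitter.split(X, y):
    # Assumption: fit the scaler on the training part only.
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])  # zero mean, unit variance
    X_test = scaler.transform(X[test_idx])    # same transform, unseen data
    test_class_counts.append(np.bincount(y[test_idx], minlength=4))
test_class_counts = np.array(test_class_counts)
```

Each row of `test_class_counts` shows that the class proportions of the full dataset are preserved in every test split.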

Classification with supervised machine learning and ensemble models

As classifiers, we used different machine-learning algorithms which are often applied in high-dimensional data analysis [30]. Additionally, we compare the selected algorithms’ performance on our new dataset to the previously proposed artificial neural network (ANN) configuration from [7]. We compare logistic regression (LogRegr) [31], k-nearest-neighbors (kNN) [32], a single-hidden-layer perceptron ANN (ANNsingle [7]), a deeper multi-layer perceptron ANN with three hidden layers (ANNmulti) [33], support vector machines (SVM) [34], random forests (RandForest [35]), and extremely randomized trees (ExtraForest [36]). For our experiments, we used implementations provided in the scikit-learn framework for pre-processing, cross-validation and classification [37]. In addition to regular classifiers, we employed an ensemble-learning technique called Stacking Classifier (SC [38]), which combines the probabilistic classifications of all above-mentioned classifiers into a stronger and more robust meta-classifier, with logistic regression as the meta-learning algorithm.
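The following sketch shows how such a stacked ensemble could be assembled. It uses scikit-learn's `StackingClassifier`, which may differ from the implementation cited as [38], and all hyperparameters (layer sizes, tree counts, iteration limits) are placeholder assumptions rather than the settings of the study. The data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the sway feature matrix.
X, y = make_classification(n_samples=240, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

# Base learners named after the abbreviations in the text; settings are assumed.
base_learners = [
    ("logregr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("ann_single", MLPClassifier(hidden_layer_sizes=(20,), max_iter=300,
                                 random_state=0)),
    ("ann_multi", MLPClassifier(hidden_layer_sizes=(20, 20, 20), max_iter=300,
                                random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("rand_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("extra_forest", ExtraTreesClassifier(n_estimators=50, random_state=0)),
]

# The meta-learner is a logistic regression on the base learners' class probabilities.
stack = make_pipeline(
    StandardScaler(),
    StackingClassifier(estimators=base_learners,
                       final_estimator=LogisticRegression(max_iter=1000),
                       stack_method="predict_proba", cv=5),
)
stack.fit(X[:200], y[:200])
accuracy = stack.score(X[200:], y[200:])
proba = stack.predict_proba(X[200:])
```

The meta-learner is trained on out-of-fold predictions of the base learners (here 5 internal folds), which limits over-fitting of the second stage.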

Evaluation scores

Cross-validation results are evaluated according to accuracy, which summarizes the multi-class classification in a single scalar according to the following formula:

$$\text{Acc}(c_{\text{true}}, c_{\text{pred}}) = \frac{1}{n_{\text{samples}}} \sum_{i = 1}^{n_{\text{samples}}} 1(c_{i,\text{true}} = c_{i,\text{pred}}),$$

where \(c_{\text{true}}\) is the true class of a sample, \(c_{\text{pred}}\) is the predicted class, and \(1(x)\) is the indicator function. Furthermore, we report confusion matrices with detailed (mis-)classification rates, as well as sensitivities and specificities, in the Results section.
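A small worked example of these scores, in plain Python, is given below. The accuracy function follows the formula above. Computing sensitivity and specificity per class in a one-vs-rest fashion is an assumption, since the text does not spell out the multi-class definition. The labels are toy values, not study data.

```python
def accuracy(c_true, c_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(c_true, c_pred)) / len(c_true)


def sensitivity_specificity(c_true, c_pred, label):
    """One-vs-rest sensitivity and specificity for a single class.

    Assumes the class occurs at least once as true label and at least
    one sample belongs to another class (otherwise a division by zero occurs).
    """
    pairs = list(zip(c_true, c_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for illustration only.
c_true = ["A", "A", "B", "B", "C", "C"]
c_pred = ["A", "B", "B", "B", "C", "A"]

acc = accuracy(c_true, c_pred)                            # 4 of 6 correct
sens_b, spec_b = sensitivity_specificity(c_true, c_pred, "B")
```

In this toy case, both samples of class B are found (sensitivity 1.0), while one of the four non-B samples is wrongly assigned to B (specificity 0.75).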

Ranking of stance conditions and sway feature importance

Analyses on feature importance and embedding were performed using the ExtraForest algorithm [36]. Feature importance was calculated according to Mean Decrease in Impurity (MDI [39]). In MDI, the importance of a feature is computed as the total impurity decrease achieved by all splits that involve the feature, with each split weighted by the proportion of samples that reach it, averaged over all decision trees in the ensemble. We rank the discriminative power of the ten stance conditions and of the extracted sway features by computing relative MDI coefficients, scaled by the most important feature in each group.
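The ranking procedure can be sketched as follows, using the impurity-based `feature_importances_` attribute of scikit-learn's `ExtraTreesClassifier`. The data are synthetic, and the trial-major ordering of the 180 features (18 features of trial 1, then 18 of trial 2, and so on) is an assumption made for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

N_TRIALS, N_FEATURES = 10, 18

rng = np.random.default_rng(0)
# Synthetic stand-in data: 120 subjects, 3 classes, 180 noise features.
X = rng.normal(size=(120, N_TRIALS * N_FEATURES))
y = rng.integers(0, 3, size=120)
# Make one feature informative: feature 0 of trial 6 (zero-based trial index 5).
X[:, 5 * N_FEATURES] += 4.0 * y

forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based (MDI) importances, reshaped to (trial, feature type).
mdi = forest.feature_importances_.reshape(N_TRIALS, N_FEATURES)

# Accumulate per stance condition and per feature type.
by_condition = mdi.sum(axis=1)
by_feature = mdi.sum(axis=0)

# Relative importance: scale each group by its most important member.
rel_condition = by_condition / by_condition.max()
rel_feature = by_feature / by_feature.max()
```

After scaling, the most important stance condition and the most important feature type each have a relative importance of 1, and all others are expressed as fractions of it.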

Mapping disorders of stance using posturography

We calculate a 2D map of the sway pattern distribution in our cohort using t-distributed Stochastic Neighbor Embedding (t-SNE [40]), a dimensionality reduction technique designed for visualization of high-dimensional datasets. Unlike, e.g., Principal Component Analysis (PCA), t-SNE performs a non-linear projection of the 180-dimensional feature vectors of each subject into the 2D plane for visualization. The projection preserves the stochastic neighborhood distribution of data points from the high-dimensional space in the low-dimensional (2D) space, by aligning their respective distributions. Alignment is achieved by minimization of the Kullback–Leibler divergence between the two distributions [40]. The 2D map allows for simultaneous assessment of the relative sway similarity of all subjects in our dataset, along with the distribution of their diagnostic classes. Two main parameters, initial dimensions and perplexity, affect the layout of the resulting 2D map. As discussed in the original paper, t-SNE is relatively robust to these parameters; we hence use a standard parametrization (see “Results”).
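For reference, the Kullback–Leibler divergence between the two distributions, as used in the t-SNE cost function [40], can be written as

$$C = \text{KL}(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where \(p_{ij}\) denotes the pairwise similarity of subjects \(i\) and \(j\) in the high-dimensional feature space, and \(q_{ij}\) denotes their pairwise similarity in the 2D map.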

Statistical analysis

Data were analyzed using the Python programming language and open-source modules for scientific computing, in particular scipy-stats [41] for statistical testing and scikit-learn [37] for machine-learning experiments. Comparison between the best- and second-best performing classifiers (SC vs. ExtraForest) was performed using the Wilcoxon signed-rank test; the level of significance was set at p < 0.05. Sensitivities and specificities of diagnostic classification were calculated based on 50-fold randomized, stratified cross-validation. The best-performing algorithm’s accuracy is reported with 95% confidence intervals.
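A minimal sketch of this statistical comparison with `scipy.stats` is shown below. The per-fold accuracies are randomly generated placeholders, not results of the study, and the t-distribution-based confidence interval is an assumption, as the text does not state how the interval was computed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-fold accuracies for two classifiers over 50 paired splits.
acc_second_best = np.clip(rng.normal(0.80, 0.06, size=50), 0.0, 1.0)
acc_best = np.clip(acc_second_best + rng.normal(0.02, 0.04, size=50), 0.0, 1.0)

# Paired, non-parametric comparison of the two classifiers.
statistic, p_value = stats.wilcoxon(acc_best, acc_second_best)
significant = p_value < 0.05

# 95% confidence interval of the mean accuracy (assumed: Student's t interval).
mean_acc = acc_best.mean()
ci_low, ci_high = stats.t.interval(0.95, df=len(acc_best) - 1,
                                   loc=mean_acc, scale=stats.sem(acc_best))
```

The test is paired because both classifiers are evaluated on the same 50 train/test splits.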

Results

Supervised machine learning for classification and feature ranking

Classification results

Classification accuracy is evaluated with 50 randomized folds of stratified cross-validation (90% training vs. 10% test data). Mean classification accuracies for single classifiers range from 64.5% (kNN) to 80.7% (ExtraForest). Across the 50 cross-validation runs per algorithm, the maximum accuracy ranges from 76.7% (kNN) to 93.3% (ANNmulti, RandForest, ExtraForest).

The application of a Stacked Classifier (SC) increases the confidence and robustness of the classification further, raising the mean classification accuracy from 80.7% (ExtraForest) to 82.7% (95% CI: [80.9%, 84.5%]). The difference of paired accuracies between the SC and the second-best performing method (ExtraForest) is statistically significant (p = 0.015, Wilcoxon signed-rank test). A box-plot of classification accuracies is depicted in Fig. 1.

Fig. 1

Cross-validated results from different supervised classification algorithms, evaluated on the entire dataset. Classification accuracies (given as fractions of 1) are sorted by mean value in ascending order, with box plots indicating confidence intervals over 50 stratified random-split cross-validation runs with 10% hold-out test data. The most accurate classifier (Stacking Classifier, SC) shows significantly better performance than the second-best algorithm, ExtraForest (p = 0.015, Wilcoxon signed-rank test). Detailed classification results of SC on individual diagnostic classes are depicted in Fig. 2

Confusion matrices

To better analyze class separation, we compute normalized confusion matrices, which show true labels vs. predicted labels with the respective percentages. While the Stacking Classifier (SC) in the 8-class setup yields an overall mean accuracy of 82.7% (95% CI: [80.9%, 84.5%]), confusion matrices reveal three classes with true positive rates below 70%, namely DN, PD and PNP (cf. Fig. 2, left panel).
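For illustration, a row-normalized confusion matrix can be computed as in the following sketch. Normalizing each row by the number of true samples of that class is an assumption about the normalization used. The labels are toy values, and `normalized_confusion` is a hypothetical helper.

```python
import numpy as np


def normalized_confusion(c_true, c_pred, labels):
    """Confusion matrix with rows (true classes) normalized to sum to 1."""
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for t, p in zip(c_true, c_pred):
        counts[index[t], index[p]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without any true samples stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)


# Toy labels for illustration only.
labels = ["A", "B", "C"]
c_true = ["A", "A", "B", "B", "C", "C", "C", "C"]
c_pred = ["A", "B", "B", "B", "C", "C", "A", "B"]

cm = normalized_confusion(c_true, c_pred, labels)
# The main diagonal holds the true positive rate (sensitivity) of each class.
true_positive_rates = np.diag(cm)
```

Here class C has a true positive rate of 0.5, since two of its four samples are assigned to other classes.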

We additionally compare the Stacking Classifier, our best-performing method, to the classification scenario in [7], using the five classes CA/AVS/NC/PPV/OT only. Figure 2 depicts the confusion matrix. Given five classes only, SC classification yields a comparable overall specificity of 97.1%, compared to 98.4% in [7]. Sensitivity, however, fares slightly lower at 88.4%, compared to 93.4% in [7], mainly due to lower sensitivity for OT (80.0%, compared to 100%).

Fig. 2

Confusion matrices (true label vs. predicted label) of the Stacking Classifier algorithm for the 8-class (left panel) and 5-class (right panel) settings (see Table 1 for abbreviations of disorders). Classes CA/AVS/NC/PPV/OT show consistently high classification accuracy (see main diagonal of confusion matrices) with sensitivities above 80%, while classes DN/PNP/PD are more challenging to classify (left panel: classification accuracies below 70%)

Ranking of stance conditions and feature importance

Altogether, 18 features from 10 trials (180 features total) are ranked by their discriminative power derived from the MDI criterion in the ExtraForest classifier. We marginalize their importance by stance condition, and by feature type, and visualize their relative feature importance in Fig. 3.

Fig. 3

Relative accumulated importance of stance conditions 1–10 (top panel) and of posturographic feature types (bottom panel)

As seen in Fig. 3, left panel, trial 6 (eyes closed, frontal stance on foam-rubber block) has the highest accumulated importance. The next two most important trials are trial 5 (eyes open, foam-rubber block) and trial 7 (eyes open, backward head extension, foam-rubber block). Furthermore, features from two of the presumably most difficult trials, 8 (eyes closed, backward head extension, foam-rubber block) and 10 (eyes closed, foam-rubber block, tandem stance), are ranked comparatively low (< 50% of the importance of trial 6).

As seen in Fig. 3, right panel, the highest ranked features are accumulated swaypath lengths in x/y/z direction. The next important feature is rms_z, i.e., vertical (weight) variation. Among spectral features, horizontal x/y energies in the bands 0.1–2.4 Hz and 2.4–3.5 Hz are most influential, along with vertical tremor energy in the band 3.5–8 Hz. In the high-frequency band 11–19 Hz, lateral (x) and vertical (z) tremor energies are more central to classification than fore-aft sway tremor energy.

Mapping sway path features using t-SNE

The t-SNE dimensionality reduction and 2D mapping [40] are computed with an initial dimension of i = 20 and a perplexity of p = 25. The resulting 2D map of posturographic sway behavior is shown in Fig. 4 as one comprehensive plot with color-coded class distributions, together with one-vs-all colorings for each diagnostic class.
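A comparable configuration can be sketched with scikit-learn on synthetic data. scikit-learn's `TSNE` has no "initial dimension" parameter, so this sketch assumes that the parameter refers to a PCA reduction to 20 dimensions before embedding, as in the reference t-SNE implementations. It is not necessarily the code used to produce Fig. 4.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: three groups of 30 subjects in 180 dimensions.
centers = rng.normal(scale=6.0, size=(3, 180))
X = np.vstack([c + rng.normal(size=(30, 180)) for c in centers])
X = StandardScaler().fit_transform(X)

# Assumed equivalent of "initial dimension i = 20": PCA pre-reduction.
X_initial = PCA(n_components=20, random_state=0).fit_transform(X)

# Non-linear embedding into 2D with perplexity p = 25.
embedding = TSNE(n_components=2, perplexity=25, init="pca",
                 random_state=0).fit_transform(X_initial)
```

Each row of `embedding` is the unit-less 2D position of one subject. Class labels are not passed to t-SNE and would only be used afterwards to color the points.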

Fig. 4

Distribution of high-dimensional sway features for all study subjects, projected into 2D space via t-SNE (x-/y-axes represent relative similarity between subjects by proximity, and are unit-less). Left panel: color coding of the eight classes reveals clear clustering of classes CA, AVS, NC and PPV. Right subpanels: one-vs-all color coding for all eight classes highlights weak clustering in disorders DN, PD and PNP

As seen in the colored visualization, and highlighted in the one-vs-all colored subpanels, the well-classified groups CA, AVS, NC and PPV form well-defined clusters in the 2D embedding space. In contrast, subjects in classes DN, PNP and PD do not group together and are sparsely scattered across the 2D space. Implications of these observations are discussed below.

Discussion

Classification results of various stance disorders using supervised machine learning

In the separation of five diagnostic classes, supervised classification with Stacking Classifier meta-learning yielded consistently good classification of CA, AVS, NC, PPV and OT. A mean sensitivity and specificity of 88.4% and 97.1% for all five conditions was obtained, which is comparable with previously published classification accuracies [7] and on the order of inter-observer variability across expert raters (0.86) [20]. Given that these results stem from a three times larger dataset than before and a statistically more robust 50-fold stratified random cross-validation [7], we see this study as a solid confirmation that supervised learning techniques can be used to “differentiate postural sway patterns typical of several distinct clinical balance disorders with sufficiently high sensitivity and specificity for clinical use” [7].

In comparison, regarding the classification into eight diagnostic groups, the three classes DN, PNP and PD have significantly lower true-positive classification rates at 44%, 50% and 67%, respectively (cf. Fig. 2, left panel). We conclude that the choices of classifiers and their parametrization are not the decisive factor for low classification accuracy of classes DN, PD or PNP. Instead, we argue that while the extracted sway features are sufficient to discriminate the main classes, they still fail to provide enough saliency to distinguish classes such as DN, PD or PNP.

Ranking of stance conditions and feature importance

In our study, each patient is represented as a 180-dimensional feature vector. While all these features are derived from extensive expert knowledge [7, 8, 10], they do not carry the same discriminative power. Figure 3 showed the relative feature importance of the ten stance conditions and 18 feature categories (see Table 2). Concerning features, swaypath_x/y/z and spectral energies in lower bands, in particular 2.4–3.5 Hz, are most relevant. Concerning stance conditions, the five most important conditions are 6, 5, 7, 3 and 9. Four of these are acquired with eyes open, and four while standing on the foam-rubber block. Despite their lower importance, trials 8 and 10 still contribute to classification performance, and should only be left out of the diagnostic protocol if examination time needs to be strictly reduced.
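The composition of the five top-ranked conditions can be checked against the examination protocol with a short script. The dictionary below encodes the ten stance conditions as listed in the Methods. The key names are hypothetical and chosen for illustration only.

```python
# Stance conditions 1-10 as listed in the posturographic examination protocol.
CONDITIONS = {
    1: {"eyes_open": True, "head_back": False, "foam": False, "tandem": False},
    2: {"eyes_open": False, "head_back": False, "foam": False, "tandem": False},
    3: {"eyes_open": True, "head_back": True, "foam": False, "tandem": False},
    4: {"eyes_open": False, "head_back": True, "foam": False, "tandem": False},
    5: {"eyes_open": True, "head_back": False, "foam": True, "tandem": False},
    6: {"eyes_open": False, "head_back": False, "foam": True, "tandem": False},
    7: {"eyes_open": True, "head_back": True, "foam": True, "tandem": False},
    8: {"eyes_open": False, "head_back": True, "foam": True, "tandem": False},
    9: {"eyes_open": True, "head_back": False, "foam": True, "tandem": True},
    10: {"eyes_open": False, "head_back": False, "foam": True, "tandem": True},
}

# Five most important conditions according to the ranking in the text.
TOP_FIVE = [6, 5, 7, 3, 9]

n_eyes_open = sum(CONDITIONS[c]["eyes_open"] for c in TOP_FIVE)
n_foam = sum(CONDITIONS[c]["foam"] for c in TOP_FIVE)
```

Conditions 5, 7, 3 and 9 are performed with eyes open, and conditions 6, 5, 7 and 9 on the foam-rubber block.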

Mapping of sway patterns using t-SNE for visualization of disease specific data clustering

The non-linear projection of 180-dimensional features into a 2D map via t-SNE allows us for the first time to assess the distribution of posturographic sway patterns for close to 300 subjects simultaneously in one plot. The resulting 2D map reveals naturally occurring clusters of stance disorders in our cohort. It should be noted that t-SNE achieves this mapping in an entirely unsupervised fashion, i.e., without a-priori knowledge of specific symptoms or disorders of the tested individuals.

Class distributions in t-SNE space, visualized in Fig. 4, reveal clusters for the well-detectable disorders NC, CA, PPV, AVS and a subset of OT. Disease classes get grouped together solely based on the stochastic similarity of sway patterns, which confirms their high discriminative power. In contrast, classes DN, PNP and PD are sparsely scattered across the 2D map, indicating a low similarity of patients with respect to their sway path feature representation. In high-dimensional feature space, patients from these classes are similarly scattered, which is the reason why supervised classification algorithms fail to find sufficiently separating boundaries between diagnoses, regardless of the concrete classifier type.

An interesting special case is class OT, where classification sensitivity dropped from 100 to 80% compared to [7]. In Fig. 4, half of subjects in class OT (n = 13) are well clustered in the lower left of the data plane, while remaining OT subjects (n = 12) are more scattered. This coincides with findings in [16], where up to 50% of OT subjects were identified to lack characteristic discharges at OT-typical peak frequencies of 13 Hz [15]. Furthermore, most of the off-cluster OT subjects reside close to subjects from class CA. Interestingly, a recent study [42] found evidence for cerebellar origins of orthostatic tremor, which is further discussed and explained in [16]. Multimodal analysis based on PET imaging, electromyography and posturography revealed pathological ponto-cerebello-thalamo-cortical activations in primary orthostatic tremor during lying and stance. Similarity of posturographic sway patterns between subjects in classes OT and CA may, therefore, be due to comorbidity, which was not considered in our diagnostic categorization.

Limitations and clinical implications

One limitation of this study is the large age difference between the HC group and the seven disease groups. The high classification accuracy of HC is thus partly explainable by age difference, as aging effects can cause a decline in biomechanical or sensory function, and as such have been shown to considerably affect static postural function [43, 44]. Therefore, a control group with a higher age average would probably lead to a slight decrease in classification accuracy. From a clinical perspective, however, the main classification goal is the differentiation of disease groups, and a relatively high separation of HC to the rest has little consequence on the inter-class separability of the remaining groups. As such, it is safe to assume that the main conclusions of this study are not affected by the age difference. In future work, demographic differences such as age, gender, as well as basic clinical scores could be incorporated into the classification to analyze their effect on differential diagnosis across disease groups.

Apart from HC, our results indicate that not all disorders of stance and gait can be classified with an equally high sensitivity and specificity. We see several potential reasons for this.

First, Fig. 4 showed that while certain disorders are well-clustered (CA/AVS/NC/PPV), the less distinguishable diagnoses (DN, PNP, PD) are distributed sparsely and interleaved with other classes. We hypothesize that there is room for classification improvement, if a more specific feature representation or additional stance conditions (e.g., leaning forward or backward; balancing on one foot) can be found for these diagnoses. Simply put, the number and type of features and conditions might not yet be sufficient to optimally distinguish certain disorders. Spatial sway features (swaypath, RMS) and spectral coefficients (FFT energy integrals) considered in this and previous work are global scalars computed over the entire 30 s examination window for each trial, while the temporal dynamics of motion might be of relevance, such as characteristic sway dynamics under proprioceptive blocks during the foam-rubber block condition. Further features might be required to model external factors, such as the number of holds of a patient by the examiner during examination, to avoid falls, which are particularly frequent in elderly patients and patients with PD [9, 17]. The number of holds, if not well documented by examination staff, could be extracted from the z-amplitude, since a fall-avoiding hold causes a dip in body weight.
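As a purely hypothetical sketch of the last idea, holds could be counted as sustained dips of the vertical (z) signal below the subject's typical body weight. The threshold (`drop_fraction`) and minimum duration (`min_samples`) are illustrative assumptions that have not been validated on patient data.

```python
import numpy as np


def count_weight_dips(z, drop_fraction=0.1, min_samples=4):
    """Count episodes where z stays below (1 - drop_fraction) * median(z).

    Only episodes lasting at least `min_samples` samples are counted.
    """
    z = np.asarray(z, dtype=float)
    below = z < (1.0 - drop_fraction) * np.median(z)
    # Pad with False so that every episode has a detectable start and end.
    padded = np.concatenate([[False], below, [False]])
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return int(np.sum((ends - starts) >= min_samples))


# Synthetic 30 s recording at 40 Hz with two sustained dips and one short glitch.
z = np.full(1200, 700.0)
z[100:120] = 500.0   # sustained dip, counted
z[600:640] = 550.0   # sustained dip, counted
z[900:902] = 400.0   # too short, ignored
n_holds = count_weight_dips(z)
```

The minimum duration is intended to suppress short glitches. Suitable values would have to be determined against holds documented by examination staff.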

Second, as mentioned in the patient cohort description, we neither considered disease severity nor comorbidity in this study. In particular, for those disease groups with low classification accuracy (i.e., DN, PD, and OT), clinically known heterogeneity might explain why they do not cluster as well as the other classes. For example, OT might be caused by a cerebellar pathology and thus linked to CA, as evidenced by multi-modal data with nuclear imaging [42] and longitudinal posturographic examination [45]. Further, there is evidence for a deficit in lower limb somatosensation in OT [46, 47]. Dichotomous classification, as applied in this study, might thus be too simple an assumption for future studies and for application in the clinic. Instead, a guideline for incorporation of machine learning into clinical routine would be to rely on probabilistic outputs of classifiers rather than hard labels and investigation of multi-class, probabilistic prediction for modeling of cases with comorbidity.
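Such a probabilistic readout could, for instance, report all diagnoses whose predicted probability exceeds a threshold, instead of a single hard label. The following sketch assumes a vector of class probabilities, as returned by the `predict_proba` method of scikit-learn classifiers. The probabilities, the class subset and the threshold are made-up illustration values, and `ranked_diagnoses` is a hypothetical helper.

```python
import numpy as np


def ranked_diagnoses(probabilities, labels, threshold=0.2):
    """Return (label, probability) pairs above threshold, most likely first."""
    order = np.argsort(probabilities)[::-1]
    return [(labels[i], float(probabilities[i]))
            for i in order if probabilities[i] >= threshold]


# Made-up class probabilities for one subject, for illustration only.
labels = ["CA", "OT", "NC"]
probabilities = np.array([0.35, 0.55, 0.10])

ranked = ranked_diagnoses(probabilities, labels)
top_labels = [label for label, _ in ranked]
```

In this made-up case, both OT and CA would be reported, which a hard label cannot express.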

Third, a straightforward limitation is that supervised classification in this study is solely determined by a single modality, i.e., features extracted purely from static posturography. As discussed above, balance maintenance in humans is dependent on an interplay of highly multi-modal body sensors and brain functions, which should be matched by a multi-modal examination paradigm. Dynamic posturography [19] for example considers force platform stabilometry, under simultaneous monitor-based visual stimuli and external foot support actuation for balance manipulation. Further modalities can be thought of, such as demographics, questionnaires, video-oculography [48] and medical imaging [49]. Combination of static and dynamic posturography with ANN-based gait analysis [50] is most promising, since patients with postural imbalance also suffer from impairment of gait. As demonstrated in this work, as soon as discriminative feature extraction schemes are in place for each modality, modern supervised machine-learning algorithms such as randomized forests and meta-learners are capable of identifying discriminative patterns in high-dimensional clinical sensor data, for breakthrough performance in computerized diagnosis.

Conclusion

In this study, we performed supervised classification experiments on posturographic sway features from a large clinical cohort of patients with various vertigo, balance and movement disorders. For a subset of classes considered in this study, namely cerebellar ataxia, functional phobic postural vertigo, acute unilateral vestibulopathy, orthostatic tremor and healthy controls, we consider quantitative static posturography a useful tool for computer-aided diagnosis in clinical routine. Other stance disorders remain a challenge, which may be addressed by extraction of more meaningful features, incorporation of further stance conditions, and inclusion of, e.g., video-oculography, gait analysis, questionnaire data, demographics or medical imaging into a multi-modal examination paradigm. Importantly, modern methods from the fields of machine learning and data mining have reached a high level of maturity. A mapping of high-dimensional clinical data into a low-dimensional 2D space can provide an informative visualization of the analyzed data. In particular, it allows clinical interpretation of shortcomings of the automated classification routines with respect to, e.g., heterogeneous disease groups. Overall, machine learning and data science can help in visualizing, understanding, and utilizing high-dimensional clinical data, towards a computerized and more “objective” evaluation of balance disorders.