1 Introduction

The World Health Organization recognizes that dementia is significantly underdiagnosed globally and emphasizes that even when a diagnosis is made, it often occurs at a relatively advanced stage. Delaying the onset of Alzheimer’s disease (AD) can have substantial benefits, including reduced care costs and increased lifespan for individuals. Therefore, early diagnosis of AD is crucial to enhance awareness and provide timely interventions. Mild cognitive impairment (MCI) is an early stage in the progression towards Alzheimer’s disease (AD); its accurate diagnosis is essential for initiating timely treatment to delay the onset of AD. Individuals experiencing MCI may start noticing alterations in their cognitive abilities, yet they are still capable of performing their daily tasks. However, severe levels of impairment can significantly impact the comprehension of events and the significance of information conveyed through speech and writing, ultimately leading to the loss of independent living. Currently, AD diagnosis relies on various methods such as imaging, blood tests, and lumbar punctures (spinal sampling), among others. Recent research has demonstrated that individuals with AD exhibit disrupted spatial organization and impaired motor control. Therefore, the assessment of motor activities, including the analysis of handwriting, which encompasses a complex interplay of cognitive, kinesthetic, and perceptual motor skills, holds significant potential in diagnosing AD. An illustrative example is the occurrence of dysgraphia in both the early and progressive stages of AD [8, 24]. Within this context, numerous studies in medicine and psychology have investigated the relationship between the disease and various handwriting features, utilizing conventional statistical methods [16, 19, 21]. However, these studies often neglect the intricate interactions that can arise among multiple features, failing to capture the complexity inherent in the analysis. 
In many instances, individual features that exhibit weak correlations with the target class can significantly enhance classification accuracy when combined with complementary features. Conversely, features that are individually relevant may become redundant when used alongside other features. In a previous study [4], the authors assessed the efficacy of the extracted features and their relationship with the diseases they may help to predict. The techniques employed a search strategy to identify optimal solutions, i.e., the best feature subsets, based on a predefined evaluation function. The approaches used to define the best features are typically categorized into three main classes: filter, wrapper, and embedded methods. Filter methods primarily rely on statistical properties of the feature subset space. Wrapper methods, on the other hand, assess the performance of a specific classifier when using a particular feature subset. Embedded methods incorporate feature selection as an integral part of the training process. In that previous work, the features extracted from the handwriting of individuals with neurodegenerative diseases and cognitive impairment were analyzed using wrapper methods. In this paper, we present a novel approach based on Bayesian Networks to further investigate the complex interactions that may emerge among multiple features. A Bayesian Network (BN) is a probabilistic graphical model that encodes the joint probability of a set of variables that, in our case, are the features and the disease label. Bayesian networks have been extensively applied to the feature selection problem [2, 18]. Once the BN has been learned from the instances in a dataset, it allows the identification of a reduced set of features conditionally dependent on the disease label. Such a reduced set is also known as the Markov Blanket (MB). 
Existing approaches involve learning a Bayesian network from the given dataset and subsequently using the Markov Blanket of the target feature as the criterion for selecting relevant features. Thus, in this paper, we first explore the relationship between these diseases and individual features and then study the complex interactions that may emerge among multiple features. In this way, we aim to fill the gap in the literature regarding the comprehensive understanding of the interplay among various handwriting features in relation to AD and cognitive impairments (CI) [6]. The remainder of the paper is organized as follows. In Sect. 2 we describe the protocol we used to acquire the handwriting data and the features we extracted. Section 3 introduces BNs and describes how they can be applied to select features. Finally, Sect. 4 reports the experimental results.

2 Acquisition Protocol and Features

We have developed a comprehensive protocol to collect data on handwriting movements from patients with CI and a control group of healthy individuals. The protocol, described in [4], consists of twenty-five tasks categorized as follows: graphic tasks, copy and reverse copy tasks, memory tasks, and dictation tasks. Graphic tasks assess the ability to write basic strokes, connect dots, and draw geometric shapes of varying complexity. Copy and reverse copy tasks evaluate the proficiency in reproducing complex gestures with semantic meaning, such as letters, words, and numbers. Memory tasks examine changes in the writing process for previously memorized words or words associated with depicted objects. Dictation tasks aim to investigate how handwriting performance is influenced by working memory usage.

It is important to note that each task was designed to assess either functional or parametric aspects. For example, in task number 17, participants were asked to write six different words that were analyzed in two different ways: the former, by averaging feature values across the entire word set, and the latter, by averaging feature values for each individual word. This led to the subdivision of the \(17^{th}\) task into six additional sub-tasks, from the \(26^{th}\) to the \(31^{st}\). A similar approach was applied to task number 14, which involved memorizing and rewriting the Italian words “telefono”, “cane”, and “negozio”, adding sub-tasks 32, 33, and 34. The rationale for this approach is to measure the impact of tiredness, i.e., whether writing performance deteriorates more rapidly in participants suffering from neurodegenerative disorders when they are required to write multiple words consecutively. To summarize, we have a collection of thirty-four tasks that characterize each patient. Participants were recruited using standard clinical tests, including the Mini-Mental State Examination (MMSE) [11], Frontal Assessment Battery (FAB) [14], and Montreal Cognitive Assessment (MoCA) [17]. These tests assessed cognitive abilities across various domains, such as orientation, recall, and registration. To ensure unbiased results, the control group was carefully matched with the patient group regarding age, education level, gender, and type of work (manual or intellectual), as shown in Table 1. Participants using psychotropic medication or any other drugs that could influence cognitive abilities were excluded. Moreover, we excluded patients with severely compromised cognitive abilities.

Table 1. Average demographic data of participants. Standard deviations are shown in parentheses.

For data acquisition, we utilized a Wacom Bamboo Folio smart pad paired with a pen that allowed participants to write naturally on A4 white paper sheets placed on it. The smart pad recorded the x-y coordinates of pen movements (at a frequency of 200 Hz) on the paper’s surface. We also captured the pressure applied when the pen tip touched the sheet and the pen’s movements when lifted in the air within a maximum distance of 3 cm. The smart pad was positioned approximately seventy centimeters away from the participants during the data collection process. Importantly, all participants underwent the acquisition under identical conditions.

The features extracted from the raw data available, i.e., (x, y) coordinates, pressure, and timestamps, were calculated on the strokes making up the handwritten traits and then averaged over the entire task. Our goal is to describe, for each task, the behavior of a subject, taking into account a fixed number of features that are described in Table 2.

Considering the significant variation in the number of strokes across different subjects and tasks, we have adopted an averaging approach to consolidate the values extracted from individual strokes. Specifically, the feature denoted as \(f_{22}\) represents the total number of strokes. For each of the features from \(f_1\) to \(f_{21}\), we have calculated the average and standard deviation, symbolized by f and \(\hat{f}\), respectively. Consequently, the first 21 features are duplicated for each patient, encompassing static and dynamic handwriting characteristics. Additionally, features ranging from \(f_{23}\) to \(f_{26}\) consider variations associated with factors such as sex, age, work, and education.
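The per-task aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each stroke has already been summarized as a dictionary of feature values, the function name `aggregate_task` and the feature keys are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pstdev


def aggregate_task(strokes):
    """Collapse per-stroke feature values into one fixed-length task vector.

    `strokes` is a list of dicts, each mapping a feature name to its value
    for one stroke. For every feature the mean (f) and the standard
    deviation (f-hat) over the strokes are returned, together with the
    total number of strokes (f22 in the text).
    """
    if not strokes:
        raise ValueError("a task needs at least one stroke")
    vector = {}
    for name in sorted(strokes[0]):
        values = [stroke[name] for stroke in strokes]
        vector[f"{name}_mean"] = mean(values)
        vector[f"{name}_std"] = pstdev(values)  # assumption: population std
    vector["n_strokes"] = len(strokes)
    return vector


# Hypothetical task with two strokes and two of the stroke-level features.
strokes = [
    {"duration": 0.20, "vertical_size": 1.0},
    {"duration": 0.40, "vertical_size": 3.0},
]
task_vector = aggregate_task(strokes)
```

Whatever the number of strokes, the resulting vector has the same length, which is what makes subjects and tasks comparable.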

Table 2. Feature list description.

As many studies in the literature show significant differences in patients’ motor performance between in-air and on-paper traits, each feature was calculated separately for the in-air or on-paper traits. In particular, we extracted four groups of features:

  • On-paper: The features extracted from the written traits (i.e., during pen-down and the successive pen-up). Note that in this case, forty-seven features represented each sample.

  • In-air: The features extracted from the in-air traits. These movements characterize the planning activity for positioning the pen tip between two successive written traits. Note that in this case, we extracted forty-five features because pressure (feature \(f_{21}\)) is always zero.

  • All: In this scenario, each feature vector includes in-air (\(af_i\)) and on-paper (\(pf_i\)) attributes (where the subscript i indicates the feature number), thus reporting the values of both On-paper and In-air feature vectors. The aim was twofold: facilitating a direct in-air versus on-paper feature comparison and delving into the intricate interplay between the two. It is worth noting that eighty-eight distinct features were considered, excluding repeated personal features and pressure variables regarding the in-air part.

  • In-air-On-paper: The computation of these features disregards the differentiation between in-air and on-paper characteristics. In practice, the value of each feature is obtained by averaging the values derived from both in-air and on-paper traits. The only exception is for the pressure, whose values are obviously obtained considering on-paper traits. This approach represents an alternative method of supplying global information to the classification system, considering as equivalent the motor planning for both handwritten and in-air strokes.

Summarizing, we considered four categories of features: we have forty-five features for In-air category, forty-seven features for both On-paper and In-air-On-paper categories, and eighty-eight for the category All.
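As a quick consistency check, the stated totals can be reproduced from the feature definitions given earlier in this section. The breakdown below is one reading of the text (mean and standard deviation for \(f_1\) to \(f_{21}\), the stroke count \(f_{22}\), and four personal features); the variable names are illustrative only.

```python
STROKE_FEATURES = 21  # f1..f21, each reported as mean and standard deviation
STROKE_COUNT = 1      # f22, total number of strokes
PERSONAL = 4          # f23..f26: sex, age, work, education

# On-paper (and In-air-On-paper): every feature is available.
on_paper = 2 * STROKE_FEATURES + STROKE_COUNT + PERSONAL

# In-air: pressure (f21) is always zero, so its mean and std are dropped.
in_air = on_paper - 2

# All: both vectors side by side, with the personal features kept once.
all_features = on_paper + in_air - PERSONAL

print(on_paper, in_air, all_features)
```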

3 Bayesian Network for Feature Evaluation

The problem of feature evaluation can be handled by estimating the joint probability of each feature and the class label. A Bayesian Network (BN) may effectively solve this problem. A BN is a probabilistic graphical model that allows the representation of a joint probability distribution of a set of random variables through a Directed Acyclic Graph (DAG) [20]. The graph nodes represent the variables, while the arcs encode the statistical dependencies among the variables. An arrow from a node \(f_i\) to a node \(f_j\) encodes the conditional dependence of node \(f_j\) on node \(f_i\), and we can define \(f_i\) as a parent of \(f_j\). In a BN, the i-th node \(f_i\) is associated with a conditional probability function \(p(f_i|pa_{f_i})\), where \(pa_{f_i}\) indicates the set of nodes which are parents of \(f_i\). Such a function quantifies the effect that the parents have on that node. The process of learning a BN entails acquiring knowledge from a training set of examples. This learning phase involves capturing both the network structure, which defines the statistical dependencies among variables, and the parameters of the probability distributions associated with those variables. Among structural learning methods, there are constraint-based methods, such as PC [22] and IAMB [23], that exploit conditional independence relationships in the data to uncover the network structure; there are also score-based methods, which evaluate different network structures based on a scoring metric to find the structure with the highest score or the lowest complexity. Score-based methods include K2 [5] and TAN [12]. The third category of structural learning methods, the hybrid ones, combines the strengths of constraint-based and score-based approaches. These methods balance computational efficiency and the ability to handle complex network structures. Among them, there are also methods based on evolutionary algorithms [7]. 
On the other hand, parameter learning generally uses the Maximum Likelihood Estimation that estimates the parameters of a BN by maximizing the likelihood of the observed data. Once the statistical dependencies among variables have been learned, the DAG structure encodes them, and the joint probability of the set of variables \(F= \{f_1, \dots ,f_L\}\) can be described as:

$$\begin{aligned} p\,(f_1, \dots , f_L)= \prod _{f_i \in F} p(f_i|pa_{f_i}) \end{aligned}$$
(1)
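The factorization in Eq. (1) can be illustrated on a toy network. The sketch below assumes a hypothetical three-node DAG with binary variables and made-up conditional probability tables; none of these values come from the paper.

```python
from itertools import product

# Hypothetical DAG: f1 -> f2 and f1 -> f3 (all variables binary).
parents = {"f1": (), "f2": ("f1",), "f3": ("f1",)}

# Conditional probability tables: cpt[node][parent values][node value].
cpt = {
    "f1": {(): {0: 0.6, 1: 0.4}},
    "f2": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.3, 1: 0.7}},
    "f3": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.5, 1: 0.5}},
}


def joint(assignment):
    """Eq. (1): product over all nodes of p(node | parents of node)."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[node][pa_values][assignment[node]]
    return prob


# Summing the joint over every configuration must give 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
```

For instance, `joint({"f1": 1, "f2": 1, "f3": 0})` multiplies \(p(f_1=1)\), \(p(f_2=1|f_1=1)\) and \(p(f_3=0|f_1=1)\), i.e. 0.4 × 0.7 × 0.5.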

In the feature evaluation framework, this property can be used to infer the true class c of an unknown sample from only a subset of the features. In fact, suppose we have L features; then the class label c and the L features can be modeled as a set of \((L+1)\) variables \(\{c, f_1,\dots ,f_L\}\), and Eq. (1) allows the description of their joint probability as:

$$\begin{aligned} p\,(c, f_1, \dots , f_L) = p\,(\,c\,|\,pa_c\,)\prod _{f_i \in F} \,p\,(\,f_i\,|\,pa_{f_i}\,) \end{aligned}$$
(2)

The node c may be the parent of one or more of the nodes of the DAG. For example, in the BN depicted in Fig. 1, c is the parent of nodes \(f_5\) and \(f_6\), while the nodes \(f_2\) and \(f_3\) are the parents of c. Therefore, it may be useful to divide the set of DAG nodes that are not parents of c into two groups: the first, denoted \(F_c\), contains the nodes having the node c among their parents, and the second, denoted \(F_{\overline{c}}\), the remaining ones. Note that the \(F_c\) group also involves nodes, like \(f_7\), that are not parents of c but appear in a conditional probability that also contains the node c. With this partition, Eq. (2) can be rewritten as:

Fig. 1. An example of a Bayesian Network and its Markov Blanket in the case of 9 available features.

$$\begin{aligned} p(c, f_1, \dots , f_L) = p(c|pa_c)\prod _{f_i \in F_c} p(f_i|pa_{f_i}) \prod _{f_i \in F_{\overline{c}}}p(f_i|pa_{f_i}) \end{aligned}$$
(3)

This property allows a BN to recognize a given sample by considering only the responses provided by the features represented by the nodes that are directly linked to the class node. The groups \(F_c\) and \(pa_c\) together form the Markov Blanket (MB) of node c, whereas the group \(F_{\overline{c}}\) contains the nodes d-separated from c, that is, the features conditionally independent of the class label c. Therefore, the MB of node c consists of its parents, children, and spouses, and c is independent of all other nodes given its MB.
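Extracting the Markov Blanket from a learned DAG reduces to collecting parents, children, and spouses of the class node. The following minimal sketch uses the edges that the text states for Fig. 1; the last two edges in the list are hypothetical and are included only to show that nodes outside the blanket are ignored. The function name `markov_blanket` is illustrative.

```python
def markov_blanket(edges, target):
    """Return parents, children and spouses of `target` in a DAG.

    `edges` is an iterable of (parent, child) pairs.
    """
    parents = {u for u, v in edges if v == target}
    children = {v for u, v in edges if u == target}
    # Spouses: the other parents of the children of the target.
    spouses = {u for u, v in edges if v in children and u != target}
    return parents | children | spouses


edges = [
    ("f2", "c"), ("f3", "c"),  # parents of c, as stated for Fig. 1
    ("c", "f5"), ("c", "f6"),  # children of c, as stated for Fig. 1
    ("f7", "f6"),              # f7 shares the child f6 with c
    ("f1", "f2"),              # hypothetical edge outside the blanket
    ("f4", "f8"),              # hypothetical edge outside the blanket
]

mb = markov_blanket(edges, "c")
print(sorted(mb))
```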

This behavior encoded by the BN is particularly useful in the testing phase when, from the feature vector of all features, the class label is inferred. In fact, given the set of responses concerning a sample, we can use the conditional probability \(p(c|f_1, \dots ,f_L)\) estimated by the BN for assigning the most probable class \(\widehat{c}\) to the unknown sample, as follows:

$$\begin{aligned} \widehat{c}=\arg \max _{c \, \in \, C} p\,(c|f_1,..., f_L) \end{aligned}$$
(4)

where C is the set of classes. Considering the definition of conditional probability, and omitting the terms not depending on the variable c, the above equation can be rewritten as follows:

$$\begin{aligned} \widehat{c}=\arg \max _{c \, \in \, C} \frac{p\,(c,f_1,..., f_L)}{p\,(f_1,..., f_L)}= \arg \max _{c \, \in \, C} p\,(c, f_1,..., f_L) \end{aligned}$$
(5)

which involves only the joint probabilities \(p\,(c,f_1,..., f_L)\). According to Eq. (3), and discarding the terms that do not depend on c, Eq. (5) assumes the form:

$$\begin{aligned} \widehat{c} \,= \,\arg {\displaystyle \max _{c\in C} \,p\,(\,c\,|\,pa_c\,)} \prod _{f_i \in F_c} \,p\,(\,f_i\,|\,pa_{f_i}\,) \end{aligned}$$
(6)

An example of the application of this rule is shown in Fig. 1, where only 9 features have been considered for the sake of clarity. In this case, the above equation becomes \(\widehat{c} = \arg {\displaystyle \max _{c \, \in \, C} \; p(f_6|c,f_7)p(f_5|c)p(c|f_3, f_2)}\). Thus, for the learned BN of Fig. 1, we only need to know features \(f_2\), \(f_3\), \(f_5\), \(f_6\) and \(f_7\) to infer c. In fact, during the learning procedure, the set of features \(F_{\overline{c}}\), which does not add information to the choice of \(\widehat{c}\), is identified, and it is discarded in the testing phase. In our example, the contributions of features \(f_1\), \(f_4\), and \(f_8\) are considered unnecessary, so they can be discarded. Thus, the BN-based feature selection approach uses only the features in the MB of node c.
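The decision rule for the Fig. 1 example can be sketched in a few lines. This is an illustration under stated assumptions: all variables are taken to be binary (the experiments use five bins), and every probability value below is invented for the example rather than learned from the data.

```python
# Hypothetical conditional probabilities for the Fig. 1 structure.
p_c1 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}    # p(c=1 | f2, f3)
p_f5_1 = {0: 0.2, 1: 0.7}                                      # p(f5=1 | c)
p_f6_1 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.8}  # p(f6=1 | c, f7)


def bernoulli(p_one, value):
    """Probability of `value` for a binary variable with p(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def score(c, sample):
    """Eq. (6) for Fig. 1: p(c | f2, f3) * p(f5 | c) * p(f6 | c, f7)."""
    return (
        bernoulli(p_c1[(sample["f2"], sample["f3"])], c)
        * bernoulli(p_f5_1[c], sample["f5"])
        * bernoulli(p_f6_1[(c, sample["f7"])], sample["f6"])
    )


def classify(sample):
    """Return the class with the highest score."""
    return max((0, 1), key=lambda c: score(c, sample))


# Features outside the Markov Blanket (here f1, f4, f8) are never read.
sample = {"f1": 1, "f2": 1, "f3": 0, "f4": 0, "f5": 1, "f6": 1, "f7": 0, "f8": 1}
predicted = classify(sample)
```

Changing the values of \(f_1\), \(f_4\), or \(f_8\) in `sample` leaves `predicted` unchanged, which is the practical meaning of discarding \(F_{\overline{c}}\) at test time.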

4 Experimental Findings

In this section, we describe the experimental setup and procedures implemented to assess the performance of our system. The data were acquired according to the protocol described in Sect. 2 and refer to 180 subjects, including 90 patients and 90 healthy people. The purpose of the experimentation was twofold: to show that our approach allows us to accurately classify patients and healthy subjects using the features extracted from their handwriting, and to highlight which features are more relevant to the diagnosis. To this end, we learned the BN DAG structure and the conditional probabilities among features and the class label, using the available data. Given the DAG structure, we extracted the MB, which highlights the features conditionally dependent on the class node. We performed the aforementioned procedure by running the K2 algorithm 30 times, setting the maximum number of parents for each node to 3. The value selected for the maximum number of parents represents a compromise between the effectiveness of the algorithm and its computational cost, while the number of runs of the K2 algorithm was chosen to average out the effects of the initial ordering of the input variables. The performance of K2, in fact, is strongly dependent on this ordering, and therefore a different initial order of the variables was considered in each run. We used the WEKA software for feature selection and classification, whereas we used the KNIME software for data pre-processing (managing missing values, filtering null columns, and encoding categorical variables). Although real-world data often includes a combination of discrete and continuous variables, BN structure learning algorithms generally assume that all random variables are discrete. As a consequence, continuous variables are typically discretized to comply with this assumption. In our implementation, we applied the sample quantile technique to discretize continuous variables into five bins. 
These pre-processing steps were crucial in preparing the data for the subsequent classification analysis.
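Quantile-based discretization into five bins can be sketched with the Python standard library. This is only an illustration of the general technique: the experiments relied on KNIME and WEKA, and the choice of the `inclusive` quantile method and the function name `quantile_discretize` are assumptions made here.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_discretize(values, n_bins=5):
    """Map continuous values to bin indices 0..n_bins-1 via sample quantiles.

    Returns the list of bin indices and the n_bins-1 cut points.
    """
    cuts = quantiles(values, n=n_bins, method="inclusive")
    bins = [bisect_right(cuts, v) for v in values]
    return bins, cuts


# Hypothetical values of one continuous feature for ten subjects.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
bins, cuts = quantile_discretize(values)
```

With equally spaced values, each of the five bins receives the same number of samples, which is the property that makes quantile binning robust to skewed feature distributions.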

In order to investigate the importance of the features used, we plotted the histogram of the features selected in the MB in more than 10% of the 30 runs (see Fig. 2) for each category, namely In-air, On-paper, In-air-On-paper and All. Even though the results for the feature category All indicate that the features derived from on-paper traits are selected more frequently than those related to in-air traits, the importance of in-air traits is confirmed by our results reported in Table 3 as well as by the results reported in [9]. From the figure, we can see that age, feature \(f_{24}\) (see Table 2), is the most frequently selected in the four categories, with a minimum selection frequency of 0.47 (On-paper). This confirms that age significantly affects the handwriting process of people with Alzheimer’s disease, due to changes in brain structure, such as the decrease in the size of the brain’s memory center (hippocampus). Even if these changes typically worsen with age [10], they are more relevant in subjects with mild cognitive impairment and even more dramatic in people with Alzheimer’s disease. Another feature selected with high frequency in 3 of the 4 categories is education, feature \(f_{26}\), with a minimum selection frequency of 0.43. Comparing these results with the ones obtained in [4], where education was rarely selected, we can say that BNs seem more effective in estimating the correlation among variables, and thus the joint contribution of groups of features in distinguishing patients from healthy people. Two other significant features selected with high frequency are \(f_2\) and \(f_3\), measuring the start vertical position and the vertical size, respectively. The joint selection of these features confirms the importance of evaluating the spatial coordination of subjects, and indicates that their variability differs between healthy people and AD patients. 
As regards the evaluation of the dynamic parameters of the handwriting process, separate evaluations can be made for On-Paper and In-Air data. The peak vertical acceleration \(f_5\) appears with both its mean and its standard deviation in the On-Paper group, underlining the importance of the variability of this feature. In the In-Air group, the presence of \(f_5\) together with the standard deviation of the number of peak acceleration points \(\hat{f}_{20}\) and the standard deviation of the normalized y jerk \(\hat{f}_{17}\) suggests that the variability of in-air movements may highlight anomalies in the handwriting of AD patients. On the other hand, the movements performed with the pen tip touching the paper are also characterized by different dynamics in AD patients and healthy people. In fact, in the On-Paper category, the mean value of jerk \(f_{18}\), the absolute y jerk \(f_{16}\), the average absolute velocity \(f_{14}\) and the standard deviation of peak vertical acceleration \(\hat{f}_4\) are the most selected features, underlining the importance of hesitations in the handwriting process. In the On-Paper category, features like \(f_3\), \(f_{13}\), \(f_1\), \(f_6\), and \(f_9\), which measure space occupation, are also very important. In the case of the In-Air-On-Paper group, the features are obtained by averaging the values from both in-air and on-paper attributes for each task, thus assuming that handwritten strokes are generated by concatenating in-air and on-paper movements. Apart from the already mentioned features, the most important features are the absolute jerk \(f_{18}\), the peak vertical velocity \(f_4\), the road length \(f_{15}\), but also the duration \(f_1\), the average absolute velocity \(f_{14}\), and the start horizontal \(f_6\) and vertical \(f_{2}\) positions, with their mean and standard deviation. These results confirm that, overall, handwriting dynamics and spatial coordination are very important. 
When we apply the BNs to the category All, the effect is to produce a ranking of the features computed on both In-Air and On-Paper traits. In particular, On-Paper features predominate, and the most selected ones are the pen pressure \(f_{21}\), the pen absolute velocity \(f_{14}\), the absolute y jerk \(f_{16}\), the start vertical position \(f_{2}\), the loop surface \(f_{10}\), the road length \(f_{15}\), and the duration \(f_1\). The only feature computed on in-air traits that is present in the histogram is the vertical size \(f_3\). It is interesting to note that the feature \(f_{10}\) emerges only in this group, meaning that it probably assumes importance only in correlation with other features of both the In-Air and On-Paper categories. Finally, note that the road length \(f_{15}\) is present in all the histograms except for the one relating to the In-Air category.

Fig. 2. Features selected in more than 10% of the runs by the proposed approach. Each bar of the histogram shows the fraction of the 30 runs in which the corresponding feature was selected.
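The selection frequencies shown in the histograms can be computed from the Markov Blankets of the individual runs, as in the sketch below. The function name `selection_frequency` and the toy blankets are hypothetical; only the 10% threshold comes from the text.

```python
from collections import Counter


def selection_frequency(blankets, threshold=0.10):
    """Fraction of runs in which each feature belongs to the Markov Blanket.

    `blankets` holds one collection of feature names per run. Only features
    selected in strictly more than `threshold` of the runs are returned.
    """
    n_runs = len(blankets)
    counts = Counter(feature for mb in blankets for feature in set(mb))
    return {
        feature: count / n_runs
        for feature, count in counts.items()
        if count / n_runs > threshold
    }


# Ten hypothetical runs (the experiments use 30).
runs = [{"age", "f2"}] * 3 + [{"age"}] * 5 + [{"f9"}] + [set()]
frequencies = selection_frequency(runs)
```

Here `f9` appears in exactly 10% of the runs and is therefore filtered out, since the threshold is strict.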

For comparison purposes, we used Recursive Feature Elimination with Cross-Validation (RFE) [13] to select the most relevant features. This technique recursively eliminates features and evaluates their impact on the performance of basic classifiers. We used a 10-fold cross-validation setup and evaluated the performance using the XGBoost [3], Decision Tree [15], and Random Forest [1] classifiers. We performed 30 classification runs, each time using the relevant features selected by each method. Table 3 shows the results obtained in all the aforementioned experiments; the first column shows the algorithm used for the feature selection, whereas the second column shows the algorithm used for the classification step. For each method, we report the mean accuracy and its standard deviation over the 30 runs, together with the average number of selected features (NF). From the table, we can observe that the best results are always obtained when the BN is used for both selection and classification. It is also worth noting that applying another classification method to the features selected by BNs never achieves the result obtained by using the BN as a classifier. This behavior indicates that both structural and parametric learning of the BN are very effective in selecting features and classifying healthy subjects and AD patients. Moreover, among the four categories, the best performing is All, followed, in order, by In-air-On-paper, On-paper and In-air. This result shows that the BN best exploits the information coming from the two groups when this information is not averaged. Furthermore, as also shown in Fig. 2(d), the best features are those derived from the On-Paper category, while the In-Air ones are exploited as complementary information to obtain the best recognition performance.

Table 3. Classification results using BN, RF, XGBoost, and DT in terms of average accuracy (Acc) and its standard deviation (in brackets), and average number of selected features (NF).

5 Conclusions

In this study, we presented a novel approach based on Bayesian networks to evaluate the statistical dependencies among different features extracted from handwriting samples, in order to maximize the performance of a system for early AD diagnosis. The data were obtained by administering handwriting tests, according to a protocol including 34 tasks, to a group of 180 subjects including 90 healthy controls and 90 AD patients. From these data, four datasets were obtained, including features relating to on-paper and in-air traits: this choice allowed us to estimate the distinctive power of the different feature categories considered and to study the complex interactions among groups of features.

The results seem very encouraging and support the effectiveness of the proposed approach: in particular, the Bayesian network allowed the selection of about half of the whole set of available features, significantly improving the performance with respect to other state-of-the-art feature selection methods. As future work, we plan to increase the number of parents in the BN structural learning algorithm and to evaluate the sensitivity of the proposed system to variations of this parameter. The number of parents, in fact, has a very strong impact on the computational cost of BN learning algorithms. We also plan to apply hybrid structural learning techniques to Bayesian networks [7].