1 Introduction

Alzheimer’s disease (AD) is an irreversible and progressive neurodegenerative disease. Close monitoring and early diagnosis of AD are essential to prevent a rapid progression of the disease. Therefore, predicting the cognitive performance of participants is an important research topic in the study of Alzheimer’s disease.

Usually, in the machine learning field, EEG signals or MRI images are used for detecting cognitive impairment. However, in recent years, it has become increasingly evident that other signals can be useful predictors of cognitive status. For example, it is generally agreed that early signs of Alzheimer’s disease produce alterations in handwriting [1,2,3,4], which relies on an ensemble of kinesthetic and motor-perceptive skills. Several steps forward have been made in this field, starting from the definition of handwriting protocols, which specify the writing or drawing tests to be performed. However, it should be noted that there is no general agreement on the number and type of tasks to be adopted, and that the few standard databases available for this type of data generally refer to a very limited number of participants.

This aspect represents a further difficulty in the context of machine learning techniques, which typically require large amounts of data. Furthermore, there is no general agreement on the types of features on which researchers should concentrate [5, 6]. Indeed, the problem of detecting effective features that allow the system to distinguish the natural handwriting alterations due to age from those caused by neurodegenerative disorders is still an open issue, which strongly influences the obtainable results and the practical applicability of early diagnosis support techniques. In the large majority of cases, the feature sets are selected by hand, generally considering the dynamics of the handwriting process in order to detect motor disorders closely related to AD. Features directly derived from handwriting generation models have also been used for AD diagnosis.

It should also be remarked that, to the best of our knowledge, the feature sets considered in the studies published in the literature do not include shape information of handwritten traces, which may be very helpful in many cases. (The only exception is the evaluation of micrographia, which is normally used in Parkinson’s disease detection.) The presence, for instance, of irregular or fragmented handwriting, often associated with changes in the thickness of the strokes, can indicate difficulties in fine motor control and possibly the onset of neurodegenerative disorders. Furthermore, these studies do not investigate the correlation between changes in the shape of handwritten strokes and those in the dynamics of the writing process that produced those strokes.

Moving on from these considerations, in a preliminary study [7] we tried to verify whether the combined use of both shape and dynamic features allows a decision support system to improve its performance for AD diagnosis. Starting from a database of online handwriting samples, in which each sample is recorded as a sequence of points acquired at a given frequency, in terms of x-y coordinates and pressure value of each point, we generated a synthetic offline color image (hereafter denoted as RGB) for each of them. The color of each elementary trait encodes, in the three RGB channels, the dynamic information associated with that trait. According to our procedure, a synthetic offline color image is generated by drawing an elementary trait for each pair of consecutive points in the corresponding online handwriting sample: The end points of each elementary trait have the same x-y coordinates as the corresponding pair of points in the online handwriting sample, while the color of the trait is obtained by using as RGB values the velocity, jerk and pressure relative to these points. Thus, in the obtained images, the shape of each elementary trait is correlated with the information about the dynamics with which that trait was produced and the pressure exerted when it was written.

Moreover, we exploited the capability of deep neural networks (DNNs) to automatically extract features from offline color images. More specifically, we used convolutional neural networks (CNNs), because they are particularly suitable for processing raw images [8]. In this way, the presence of significant differences between patients and healthy controls, regarding the shape of the traits or the way in which these traits were produced (speed, jerk, pressure), can be automatically derived through the learning process of the CNN, which produces, for each handwriting sample, a feature vector representation. The results obtained by employing two tasks of the whole protocol presented in [9], with different CNNs pre-trained on the public ImageNet database [10] and then fine-tuned on the synthetic images generated according to the above procedure, were very promising. For comparison purposes, in the above study, we also considered the results obtained by extracting standard dynamic features from the same data.

Although the results obtained were very encouraging, showing an improvement over standard dynamic features, the analysis of the experimental data also showed that in some cases the features obtained with the CNN approach were not able to distinguish healthy controls (HC) from people with AD (PT), especially in the initial phases of the disease. This is probably due to the simplicity of the considered drawing tasks, which do not allow the system to adequately capture the alterations in writing performance.

Moving from these considerations, we extended the set of experiments presented in [7] by selecting other graphic tasks of our protocol with a higher level of difficulty. In particular, we considered tasks requiring a higher level of fine motor control, as well as tasks involving a higher cognitive load and a greater complexity in spatial organization. Also in this study, we chose to consider only graphic tasks that require participants to produce handwritten graphic forms that are not as familiar to them as the characters and words of their native language. Our rationale is that if people suffering from neurodegenerative disorders write regularly, the alterations in their handwriting may be less evident, making it more similar to that of healthy people who do not write regularly. In other words, we selected writing tasks that the participants are not familiar with, and therefore not very automated from the neuromotor control perspective. In this way, the differences between the writing characteristics of healthy participants and those affected by neurodegenerative disorders should emerge more clearly [5, 9].

Furthermore, in order to verify the relevance of the dynamic information associated with each handwritten trait, we generalized the procedure for generating the offline synthetic images by adding a fourth channel to the three previously considered. According to this new procedure, a multi-channel TIFF image (hereafter denoted as MC) is generated for each handwriting sample, where the first three channels encode the dynamic information used in our previous study, namely velocity, jerk and pressure, while the fourth channel encodes the acceleration. Finally, we again exploited the ability of CNNs to automatically extract features, in this case from the TIFF images.

In summary, for each task we generated three different datasets: one based on standard dynamic features [11,12,13,14], one based on the features provided by CNNs applied to the synthetic color images [7] and one based on the features provided by CNNs applied to the MC images. Moreover, for each task and for each dataset, the performance was evaluated using the same classification schemes, namely random forest, K-nearest neighbor, multilayer perceptron and support vector machines. This choice allowed us to easily compare the experimental results relative to the different feature vector representations and, therefore, the role played by the shape and by the combined use of both shape and dynamic information. Finally, a further comparison was made by considering the classification results directly provided by the fully connected layer of each CNN. The main contributions of the paper can be summarized as follows:

  • Assessing the contribution, in terms of performance of an AD diagnosis system, of dynamic information encoded as color values in the RGB channels of specifically generated images. We also compared the results achieved by using these images with those achieved by using multi-channel images, generated by encoding a further dynamic feature in the fourth channel;

  • Assessing the ability of CNNs as automatic feature extractors by comparing their performance with that achieved with widely used handcrafted features;

  • Evaluating the effectiveness of the method presented in [7] on more tasks; these new tasks allowed us to test participants’ long-term motor planning ability;

  • Comparing two different classification approaches. In the first one, we classified the participants considering handcrafted features and applying well-known machine learning algorithms. In the second one, we classified the participants considering the features automatically extracted by CNNs, using both RGB and multi-channel images. For comparison purposes, we also considered the classification results provided by the fully connected layers of the CNNs.

The remainder of the paper is organized as follows. Section 2 discusses the related work, whereas Sect. 3 presents the architecture of our system. In particular, this section details the data acquisition process (Sect. 3.1), the handcrafted feature extraction and the deep feature extraction (Sects. 3.2 and 3.3, respectively), and the classification step (Sect. 3.4). The experimental results are shown in Sect. 4, while discussion and future works are eventually left to Sect. 5.

2 Related work

As anticipated in the Introduction, to the best of our knowledge, this is the first attempt to exploit information about shape changes in handwritten traces, as well as the correlation between changes in the shape and changes in the dynamics of the handwriting process, by means of a deep transfer learning approach. In other words, the studies published in the literature do not address the specific problem of verifying whether shape features extracted from handwriting through deep learning techniques are more suitable for characterizing cognitive impairment than handcrafted features. Indeed, deep learning techniques for the assessment of AD are usually employed starting from MRI images [15,16,17] or EEG signals [18, 19] as input.

Fig. 1 Chain of the whole system

Although many handwriting studies are still conducted in the field of psychology, using standard statistical techniques, the machine learning community is paying increasing attention to the analysis of handwriting data, especially for Parkinson’s disease (PD) diagnosis [20,21,22,23].

Regarding AD diagnosis from handwriting, in [24] the authors performed semi-supervised or unsupervised learning to uncover homogeneous clusters of participants, and analyzed how much information these clusters carry about the cognitive profiles. Furthermore, they introduced a new temporal representation learned from handwriting trajectories that captures a rich set of features simultaneously, such as the full velocity profile, size and slant, fluidity and shakiness, revealing how these features jointly characterize the cognitive profiles.

In [6], the authors used kinematic measures of the handwriting process to assess the importance of features for differentiating groups and for characterizing the handwriting process across five different functional copying tasks. The results showed that the kinematic measures, together with the MMSE score, were able to distinguish effectively between the patients belonging to the different groups considered. As for the feature analysis, pressure and time-in-air obtained the best performance. In [25], the authors analyzed the stability of the offline handwritten word “mamma” (mum in Italian) to distinguish AD patients from healthy controls. The stability of the word was computed by splitting its image into elementary parts and measuring the similarity of adjacent parts. As a classification algorithm, the authors adopted the Yoshimura approach, based on the comparison of the stability features of the sample to be recognized with those of the training samples.

In [26], the authors presented a novel approach in which handwritten signatures were analyzed for the early diagnosis of AD. Patients’ signatures were represented by twelve features derived from Plamondon’s Sigma-Lognormal model.

Finally, the goal of the work reported in [27] was to distinguish participants belonging to three different groups (AD, MCI and control group) by comparing their handwriting kinematics. The authors used discriminant analysis as a classification algorithm and adopted a protocol consisting of seven tasks, which included copying and drawing tasks. For each task, the authors investigated which features were the most discriminating and which groups were best distinguished. They found that: (i) the discriminating features depended on the groups to be discriminated; (ii) some tasks, e.g., the clock drawing test, allowed some groups, e.g., AD vs. MCI, to be well discriminated (100% specificity and sensitivity).

Starting from the results of this research, which highlight the lack of a unique handwriting protocol and the limited number of participants involved, we started a research activity in collaboration with relevant hospitals to define an experimental protocol capable of capturing the most relevant aspects of the onset of neurodegenerative diseases, involving a large number of participants recruited on the basis of rigorous criteria. A first result of this activity is the definition of an experimental handwriting protocol, presented in [9], consisting of 25 tasks designed to record the dynamics of handwriting when different motor skills are employed; according to this protocol, we collected the handwriting samples of one hundred eighty participants, including both AD patients and healthy controls. Using a subset of these tasks, in a first set of experiments, whose results are reported in [11,12,13], we tested 130 participants (both patients and healthy controls) employing two classification algorithms. The tasks considered in those studies were selected in order to evaluate the alterations in kinematic and pressure properties when repeating complex graphic gestures with a semantic meaning, such as letters and words of different lengths and with different spatial organizations. To improve the performance of these systems, in [14] a genetic algorithm was used to select the best subset of tasks among those belonging to the above protocol.

3 The architecture of the system

The architecture of the whole system is shown in Fig. 1: The acquired data are processed to generate both the set of standard dynamic features (denoted as handcrafted features) and the two groups of RGB and MC images. Each group of images is then forwarded to the corresponding CNN to extract a new set of features; thus, at the end of this step, two further feature sets are generated, namely those obtained from the RGB images (denoted as RGB-deep features) and those obtained from the MC images (denoted as MC-deep features). These three feature sets are individually used by each of the considered classification schemes to implement the classification stage. Note that this stage also includes the classification results directly provided by the fully connected layer of each CNN.

A detailed description of each stage of the proposed system will be provided in the following sections.

3.1 Data acquisition

The first step of our system, as shown in Fig. 1, is devoted to the acquisition and recording of the handwriting produced by participants according to a given protocol, in terms of the x-y-z coordinates of each point, acquired at a constant sampling rate of 200 Hz. The first two coordinates are the point position on the two-dimensional writing surface, while the third is a measure of the pressure exerted by the person at that point. This measure assumes a positive value when the pen is resting on the sheet and a null value when the pen is lifted, up to a maximum distance of 3 cm from the sheet, beyond which the system is not able to receive information. The application, developed in the C programming language, drives the graphic tablet and acquires the coordinates of the movements of the participants while they are writing on an A4 sheet fixed to the tablet surface. Furthermore, since writing skills can be influenced by age, education level and type of work, this information is also stored.

For the recruitment of participants involved in the study, with the support of the geriatric ward, Alzheimer unit, of the “Federico II” hospital in Naples, we used standard clinical tests, such as the Mini-Mental State Examination (MMSE), the Frontal Assessment Battery (FAB) and the Montreal Cognitive Assessment (MoCA), to distribute the participants into two groups: healthy people (control group) and patients. All participants were right-handed and comfortably positioned approximately 70 cm from the sheet of paper.

Our database includes 180 participants (90 patients and 90 healthy controls), each performing the 25 tasks defined in our protocol. As anticipated in the Introduction, in this study we only considered the handwriting samples relative to six graphic tasks produced by all the participants.

Starting from the acquired x-y-z coordinates, three different feature extraction approaches were implemented:

  (i) Typical online features are extracted and used for implementing the classification schemes described in Sect. 3.2.

  (ii) The dynamic information relative to each stroke is considered for generating the images used for deep learning classification experiments, as further detailed in Sect. 3.3.1.

  (iii) Similarly, in the third approach, MC images are generated and used for deep learning classification experiments (see Sect. 3.3.1).

We analyzed participants’ handwriting while drawing lines to predict their cognitive status. In particular, we asked the participants to perform six tasks, as detailed in the following.

The first two tasks consisted of joining two points 5 cm apart with a straight continuous horizontal (task 1) or vertical (task 2) line, repeated continuously four times. This kind of task investigates elementary motor functions [28]. Horizontal movements require movements of the arm, keeping the fingers in a fixed position. Vertical movements require small finger and wrist movements. In addition, drawing a single continuous line four times requires long-term motor planning, a function typically compromised in individuals with cognitive impairments.

The third and fourth tasks consisted of retracing a 3 cm (task 3) or 6 cm (task 4) wide circle four times. These tasks assess the continuity of the line obtained by retracing a circular shape of different dimensions. The continuity of the line and its distance from the background figure to be traced are indicative of cognitive deterioration. These tasks make it possible to check the automaticity of the movements and the regularity and coordination of the sequence of movements [29].

The fifth task consisted of retracing a complex form specifically devised to test the participant’s motor control skills. This task investigates the alteration of the handwritten traits independently of any letter, word or related semantic usage. The handwriting movements needed to retrace the form require constant motor re-modulation. The shape of the form consists of a continuous line presenting radii of different curvatures, with the aim of testing both fine motor control and long-term motor planning [30, 31].

Finally, the sixth task was the well-known clock drawing test (CDT): The participant is asked to draw a clock face, including the numbers, and then to draw the hands at five past eleven. The CDT is used for the screening of cognitive impairments and dementia. It is also used to assess participants’ spatial dysfunction and lack of attention. It was originally used to evaluate visuo-constructive abilities, but abnormal clock drawing has been shown to occur in other cognitive impairments as well. The test requires verbal understanding, memory and spatially coded knowledge in addition to constructive skills [32]. Moreover, in [33] the authors found that the CDT shows a high sensitivity for the diagnosis of mild Alzheimer’s disease.

Examples of tasks are shown in Fig. 2.

Fig. 2 Examples of tasks performed by a participant involved in the experiments

3.2 Handcrafted feature extraction

From the acquisition phase, the handwriting trajectories are available in terms of x, y coordinates. For each acquired point, a third piece of information representing the pressure (z coordinate) is also provided. From these coordinates, we calculated the handcrafted features used for the classification step, as detailed below.

Each feature is computed with reference to a stroke, defined as the sequence of traits produced between a pen-down point and the following pen-up, or a change of direction along the y-axis (see Fig. 3). As shown in this figure, from the starting point (\(x_1\), \(y_1\)) to the point (\(x_2\), \(y_2\)) we detect a stroke, since at (\(x_2\), \(y_2\)) there is a change of direction along the y-axis. Similarly, from (\(x_2\), \(y_2\)) to (\(x_3\), \(y_3\)) we detect another stroke because at (\(x_3\), \(y_3\)) a pen-up occurs. We denote such strokes as on-paper, since they are acquired by the system while the pen tip is touching the sheet. Moreover, from (\(x_3\), \(y_3\)) to (\(x_1\), \(y_1\)) the pen tip is lifted from the sheet but remains within the maximum distance that allows the system to receive information: Thus, we can detect a further stroke, traced between a pen-up and the following pen-down, with the pen tip kept close to the sheet. We denote such strokes as in-air.
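To make the stroke definition concrete, the following minimal sketch splits a point sequence into on-paper and in-air strokes; the array layout (one row per sampled point, columns x, y, z) and the function name are illustrative assumptions, not a reproduction of the implementation actually used in this work.

```python
import numpy as np

def segment_strokes(points):
    """Split a sampled handwriting sequence into strokes.

    points: array of shape (n, 3), columns x, y, z; z > 0 means the pen
    tip touches the sheet (on-paper), z == 0 means it is in the air.
    A stroke ends at a pen-up/pen-down transition or, on paper, at a
    change of direction along the y-axis.
    """
    strokes, current, prev_dy = [], [points[0]], 0.0
    for prev, curr in zip(points[:-1], points[1:]):
        pen_transition = (prev[2] > 0) != (curr[2] > 0)
        dy = curr[1] - prev[1]
        y_reversal = prev[2] > 0 and curr[2] > 0 and dy * prev_dy < 0
        if pen_transition or y_reversal:
            strokes.append(np.array(current))
            current = [prev]          # the boundary point opens the next stroke
        current.append(curr)
        if dy != 0:
            prev_dy = dy
    strokes.append(np.array(current))
    # A stroke is on-paper if most of its points carry positive pressure,
    # in-air otherwise (boundary points are shared between adjacent strokes).
    on_paper = [s for s in strokes if np.mean(s[:, 2] > 0) > 0.5]
    in_air = [s for s in strokes if np.mean(s[:, 2] > 0) <= 0.5]
    return on_paper, in_air
```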

Table 1 Feature list
Fig. 3 Example of generated strokes

Many studies in the literature have shown that the analysis of in-air traits can provide significant information for identifying neurodegenerative disorders: In-air movements, indeed, characterize the motor planning activities related to the positioning of the pen tip between two successive written traits. Moving from these considerations, in a previous study [7] we decided to extract the features from both in-air and on-paper strokes. In this study, on the other hand, we decided to discard the features computed from the in-air traits, so that the comparison between the system based on handcrafted features and the CNN-based system is fairer: As we will see in Sect. 3.3, the CNN system uses input images that do not include in-air strokes.

For each stroke, we extracted twenty-two features, which can be grouped into two categories, namely static and dynamic, as detailed in Table 1. A feature vector is obtained for each task performed by each person by averaging the values over all the strokes relative to that task.
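As a small illustrative sketch, the task-level vector can be obtained by averaging the per-stroke values; here `extract_stroke_features` is a hypothetical helper standing in for the computation of the twenty-two features of Table 1 on a single stroke.

```python
import numpy as np

def task_feature_vector(strokes, extract_stroke_features):
    """Average per-stroke feature values over all strokes of a task.

    strokes: list of (n_i, 3) arrays produced by the segmentation step.
    extract_stroke_features: hypothetical callable returning the
    twenty-two static and dynamic feature values for one stroke.
    """
    per_stroke = np.array([extract_stroke_features(s) for s in strokes])
    return per_stroke.mean(axis=0)  # one 22-dimensional vector per task
```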

3.3 Deep feature extraction

In this step, the online handwriting samples, each represented as a sequence of points acquired at a given frequency, in terms of x-y coordinates and pressure value of each point, are processed to generate two groups of images, namely RGB and MC images. The images of each group are forwarded to the corresponding CNN, which operates as a feature extractor. To this aim, the CNNs are pre-trained on the public ImageNet database [10] and then fine-tuned on these images. The result is the production of two feature sets, each representing the whole database of handwriting samples, but including the features extracted from the corresponding group of images. This process is detailed in the following sections.

3.3.1 Image generation

Starting from the same raw data used for handcrafted feature extraction, stored in terms of the x-y coordinates and pressure of the points acquired for each online handwriting sample, we generated two types of images to be submitted to the CNNs.

The traits of both types of synthetic images are obtained by considering the points \((x_i, y_i)\) as vertices of the polygonal chain that approximates the original curve. For the first type, we also encoded kinematic information in the RGB channels. In particular, these synthetic images are obtained by using the triplet of values \((z_i, v_i, j_i)\) as the RGB color components of the ith trait, delimited by the pair of points \((x_i, y_i)\) and \((x_{i+1}, y_{i+1})\). The values of the triplet are obtained as follows:

  • \(z_i\) is the pressure value at point \((x_i,y_i)\) and it is assumed constant along the ith trait;

  • \(v_i\) is the velocity of the ith trait, computed as the ratio between the length of the ith trait and the 5 ms time interval corresponding to the acquisition period of the tablet;

  • \(j_i\) is the jerk of the ith trait, defined as the second derivative of \(v_i\).

Fig. 4 Example of encoding for the trait generation in an RGB image

The values of the triplets \((z_i, v_i, j_i)\) have been normalized into the range [0, 255], in order to match the standard 0-255 color scale, by considering the minimum and maximum values of these three quantities over the entire training set. An example of a trait generated according to this procedure is reported in Fig. 4, where the color of the first trait corresponds to the triplet (z=127, v=127, j=0), while that of the second one corresponds to the triplet (z=127, v=127, j=127).
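As an illustration of how such an image could be rendered, the sketch below derives per-trait velocity and jerk from the 5 ms sampling interval, scales the (z, v, j) triplets to [0, 255] with training-set minima and maxima, and draws each elementary trait as a colored segment. The use of Pillow, the white background, the line width and the assumption that the coordinates have already been rescaled to the target image size are illustrative choices, not necessarily those of the original implementation.

```python
import numpy as np
from PIL import Image, ImageDraw

DT = 0.005  # 5 ms sampling interval (200 Hz tablet)

def trait_kinematics(points):
    """Per-trait velocity, acceleration and jerk from consecutive (x, y) points."""
    lengths = np.linalg.norm(np.diff(points[:, :2], axis=0), axis=1)
    v = lengths / DT               # velocity of each elementary trait
    a = np.gradient(v, DT)         # acceleration (also used for MC images)
    j = np.gradient(a, DT)         # jerk, second derivative of velocity
    return v, a, j

def to_rgb_image(points, z_rng, v_rng, j_rng, size=299, width=2):
    """Draw one colored elementary trait per pair of consecutive points.

    points: (n, 3) array with x, y already rescaled to [0, size-1] and z
    (pressure); *_rng: (min, max) computed on the whole training set.
    """
    v, _, j = trait_kinematics(points)
    z = points[:-1, 2]                       # pressure at the trait start point

    def scale(val, rng):                     # normalize to the 0-255 color scale
        lo, hi = rng
        return int(np.clip(255 * (val - lo) / (hi - lo + 1e-9), 0, 255))

    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for i in range(len(v)):
        color = (scale(z[i], z_rng), scale(v[i], v_rng), scale(j[i], j_rng))
        draw.line([tuple(points[i, :2]), tuple(points[i + 1, :2])],
                  fill=color, width=width)
    return img
```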

As previously mentioned, in order to enrich the encoded dynamic information, we also created multi-channel TIFF images, storing four representations (frames) of the same handwriting sample in a single image file. Each frame is a grayscale representation of the traits obtained according to a procedure similar to that previously described for RGB images. More specifically, considering the points \((x_i, y_i)\) as vertices of the polygonal chain that approximates the original curve, pixel values in each frame are assigned according to the following criteria (see Fig. 5); a minimal assembly sketch is given after the list:

  • The first frame implements the acceleration feature: The acceleration of the ith trait is defined as the derivative of \(v_i\);

  • The second frame implements the jerk feature: The jerk of the ith trait is defined as the second derivative of \(v_i\);

  • The third frame implements the velocity feature: The velocity of the ith trait is computed as the ratio between the length of the ith trait and the 5 ms time interval corresponding to the acquisition period of the tablet;

  • The fourth frame implements the pressure feature, which is assumed constant along the ith trait.
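A possible way to assemble the four grayscale frames into a single multi-page TIFF file, reusing the per-trait quantities computed in the previous sketch, is shown below; Pillow's multi-frame TIFF saving and the drawing parameters are implementation assumptions.

```python
from PIL import Image, ImageDraw

def draw_gray_frame(points, values, size=299, width=2):
    """One grayscale frame: each elementary trait is drawn with the
    normalized (0-255) value of a single dynamic quantity."""
    img = Image.new("L", (size, size), color=255)
    draw = ImageDraw.Draw(img)
    for i, val in enumerate(values):
        draw.line([tuple(points[i, :2]), tuple(points[i + 1, :2])],
                  fill=int(val), width=width)
    return img

def to_mc_image(points, acc, jerk, vel, press, path):
    """Stack the four frames (acceleration, jerk, velocity, pressure),
    each already normalized to [0, 255], into a multi-page TIFF file."""
    frames = [draw_gray_frame(points, q) for q in (acc, jerk, vel, press)]
    frames[0].save(path, save_all=True, append_images=frames[1:])
```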

As stated in Sect. 1 and detailed in Sect. 3.3.2, we adopted four CNN models whose input images are automatically resized to 256x256 for VGG19, 224x224 for ResNet50, and 299x299 for InceptionV3 and InceptionResNetV2. Taking into account these constraints for both types of images, the original x, y coordinates were rescaled into the range [0, 299] for each image, in order to provide ex ante images of suitable size and minimize the loss of information related to possible zooming in or out.
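A small sketch of this rescaling is shown below; the use of a single scale factor for both axes (i.e., preserving the aspect ratio) is an assumption not stated explicitly in the text.

```python
import numpy as np

def rescale_coordinates(points, size=300):
    """Map the x, y coordinates of one sample into [0, size-1]."""
    out = points.astype(float)
    xy = out[:, :2] - out[:, :2].min(axis=0)    # shift to the origin
    scale = (size - 1) / max(xy.max(), 1e-9)    # common scale for x and y
    out[:, :2] = xy * scale
    return out
```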

Fig. 5 Example of encoding for the trait generation in an MC image

3.3.2 Deep transfer learning for feature extraction

Deep transfer learning has gained much popularity for solving image classification problems, as it is possible to employ CNNs trained on public datasets like ImageNet [10] and reach high classification performance in many application fields. In this paper, we adopted four different CNN models: VGG19 [34], ResNet50 [35], InceptionV3 [36] and InceptionResNetV2 [37]. These models differ from one another in several details, such as the introduction of new structural elements (inception, residual, dropout) or the number of layers and, consequently, the number of trainable parameters. VGG19, in fact, is a model with tens of layers and twenty-five million parameters, while the deeper InceptionResNetV2 is made of hundreds of layers and, consequently, the number of parameters increases to sixty-two million (see Table 2).

The adopted CNNs are composed of a convolutional part and a classification part. The first part is conceived for feature extraction (FE) from the images used to feed the network, whereas the second part performs the classification (C) (see Fig. 6).

Table 2 Number of parameters and input/output size of the CNN used in the experiments
Fig. 6 The general structure of the networks used

The transfer learning (TL) step is followed by a retraining of the network using the fine-tuning (FT) approach, which requires the retraining of both parts (FE and C) of the network. In order to apply FT, the parameters of the feature extraction part are initialized with the weights obtained on ImageNet, whereas the classification part is initialized with the weights obtained during the previous TL step.

The original classification part of the network has been replaced with a unique classifier for all the models, as described in the next section.

After the training phase, the CNNs were used both for deep feature extraction and for classification with the final fully connected layers (the classifier section of the deep network). The output of the FE part of the network, for each input, is a vector of features, also denoted as bottleneck (i.e., the last activation map before the fully connected layers in the original model). This is a flattened vector of extracted features, whose size depends on the architecture of the considered CNN (the number of features for each model is shown in Table 2).
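A minimal Keras sketch of this scheme, shown for InceptionV3 on the RGB images only and with API usage typical of the library versions listed in Sect. 4, is given below; the freezing strategy and the global-average-pooling bottleneck are illustrative assumptions rather than a reproduction of the authors' code.

```python
from keras.applications import InceptionV3
from keras.models import Model

# FE part: convolutional base pre-trained on ImageNet, without the original
# 1000-class head; global average pooling yields the flattened bottleneck.
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(299, 299, 3), pooling="avg")

# Transfer learning (TL): freeze the FE part and train only the new
# classification head attached on top of base.output (see Sect. 3.4).
for layer in base.layers:
    layer.trainable = False

# Fine tuning (FT): after the TL step the whole network is retrained,
# i.e., the FE layers are unfrozen again:
#   for layer in base.layers:
#       layer.trainable = True

# Bottleneck extractor: the output of the FE part is the deep feature
# vector later fed to the external classifiers (RF, MLP, SVM, K-NN).
feature_extractor = Model(inputs=base.input, outputs=base.output)
# deep_features = feature_extractor.predict(rgb_images)
```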

Once the CNN architectures were chosen, we assessed them through an experimental phase aimed at maximizing the mean accuracy of each model, selecting the following hyper-parameters and settings (a minimal training-configuration sketch is given after the list):

  • Stochastic gradient descent (SGD) with learning rate 0.001 and momentum 0.9: the optimization method used to minimize the loss function.

  • Categorical cross-entropy: the adopted loss function.

  • Batch size 16 and 20 for RGB and MC images, respectively: the number of training set images considered in each iteration.

  • Max epochs equal to 2,000: One epoch is one pass over the entire training set and contains a number of iterations equal to (training set size)/(batch size).

  • Patience 200: If the validation accuracy does not improve for 200 epochs, the training is stopped.

  • Accuracy as the measure of performance.
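The sketch below shows how these settings map onto a Keras training call, reusing the model built as in the previous sketch and hypothetical `x_train/y_train/x_val/y_val` arrays; monitoring `val_acc` follows the naming of the Keras versions reported in Sect. 4 (newer releases use `val_accuracy`).

```python
from keras.callbacks import EarlyStopping
from keras.optimizers import SGD

# Optimizer, loss and metric as listed above.
model.compile(optimizer=SGD(lr=0.001, momentum=0.9),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Patience 200: stop if validation accuracy does not improve for 200 epochs.
early_stop = EarlyStopping(monitor="val_acc", patience=200)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=16,          # 16 for RGB images, 20 for MC images
          epochs=2000,            # max epochs
          callbacks=[early_stop])
```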

Table 3 Training times, expressed in seconds

During the training phase, a fivefold cross-validation strategy was adopted, using a validation set to reduce or avoid the undesired over-fitting phenomenon. Following the standard cross-validation procedure, the data set was randomly partitioned into five equally sized folds. At each iteration, all the samples belonging to a single fold were used as the test set, while the samples belonging to the other four folds were further divided into two subsets: a validation set obtained by randomly selecting 10% of these samples, and a training set consisting of the remaining ones. In practice, at each iteration, the training procedure exploits the validation set to stop the learning if the performance on such data begins to deteriorate, thus avoiding over-fitting. The cross-validation process is repeated five times, with each of the five folds used exactly once as the test set.
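The fold handling can be sketched as follows, with scikit-learn utilities and a hypothetical `build_and_train`/`accuracy_on` pair standing in for the training and evaluation of one model; `X` and `y` denote the whole data set and the random seeds are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in kfold.split(X):
    # 10% of the non-test samples is held out as the validation set,
    # used only to stop the training when performance deteriorates.
    tr_idx, val_idx = train_test_split(train_idx, test_size=0.10,
                                       random_state=0)
    model = build_and_train(X[tr_idx], y[tr_idx], X[val_idx], y[val_idx])
    fold_accuracies.append(accuracy_on(model, X[test_idx], y[test_idx]))

print("mean accuracy over the five folds:", np.mean(fold_accuracies))
```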

3.4 The classification step

The classification step was carried out considering nine different feature sets, as already mentioned in the previous sections. Specifically, the first set consists of the twenty-two handcrafted features (see Sect. 3.2), while the remaining ones are extracted from both RGB and MC images through the FE part of each of the four CNNs described in Sect. 3.3.2. Summarizing, the first feature set includes the standard dynamic features obtained from the online handwriting samples, four feature sets are provided by the CNNs applied to the RGB images, and the remaining four feature sets are provided by the CNNs applied to the MC images.

The classification was conducted following two different approaches. The first one involved the use of a standard classifier, which takes as input the features provided by the CNN feature extractor. The second approach, on the other hand, consists in using a unique classifier composed of fully connected layers for each CNN (see Fig. 6), properly modified for our purposes, i.e., classifying two classes (healthy control or patient) instead of the thousand classes of the ImageNet dataset.

Regarding the first approach, we decided to consider four well-known and widely used classification schemes: random forest (RF) [38], multilayer perceptron (MLP), support vector machines (SVM) [39] and K-nearest neighbors (K-NN). These classifiers have different characteristics and each one represents a different kind of model: RF is an ensemble of decision trees, MLP is a connectionist network, K-NN is an instance-based nonparametric algorithm and SVM is kernel-based.
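A sketch of this first approach with scikit-learn is shown below; the hyper-parameter values actually used are those in Table 4, so the defaults below are placeholders, and `train_features`/`test_features` stand for any of the nine feature sets.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

classifiers = {
    "RF":   RandomForestClassifier(),   # ensemble of decision trees
    "MLP":  MLPClassifier(),            # connectionist network
    "SVM":  SVC(),                      # kernel-based classifier
    "K-NN": KNeighborsClassifier(),     # instance-based classifier
}

for name, clf in classifiers.items():
    clf.fit(train_features, train_labels)        # handcrafted or deep features
    print(name, clf.score(test_features, test_labels))
```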

The second approach, instead, relies on the use of a classifier whose input layer comprises a number of neurons equal to the number of features reported in Table 2, while the output layer has two neurons, each corresponding to one of the desired classes (healthy controls and patients). Between the input and output layers there are two hidden layers, with two thousand forty-eight neurons each, and an intermediate dropout layer.
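A minimal Keras sketch of this classifier, attached to a pre-trained base such as the one built in Sect. 3.3.2, is given below; the dropout rate and the activation functions are assumed values not specified in the text.

```python
from keras.layers import Dense, Dropout
from keras.models import Model

def add_classification_head(base, dropout_rate=0.5):
    """Unique classifier used for all CNN models: two hidden layers of
    2048 neurons with an intermediate dropout layer, and a 2-neuron
    softmax output (healthy control vs patient)."""
    x = Dense(2048, activation="relu")(base.output)
    x = Dropout(dropout_rate)(x)
    x = Dense(2048, activation="relu")(x)
    outputs = Dense(2, activation="softmax")(x)
    return Model(inputs=base.input, outputs=outputs)
```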

4 Experimental results

The experiments reported in this section have been executed on the following system architecture:

  • CPU and RAM: Intel Core i7-7700 CPU @3.60GHz equipped with 32GB of RAM;

  • Graphics card: GPU Titan Xp;

  • Software: Keras 2.2.2 and TensorFlow 1.10.0.

For the four classification schemes mentioned in Sect. 3.4 (RF, MLP, K-NN and SVM), we performed thirty runs and used the fivefold cross-validation strategy to evaluate the classification accuracy. The results reported in the following were computed by averaging the accuracy achieved over the thirty runs. However, since using the fully connected layer of a CNN as classifier requires retraining the whole network (see Fig. 6), which involves a large amount of time, the FC results were computed by averaging the accuracy achieved on the five test folds, as detailed in Sect. 3.3.2. Table 3 shows the time needed to extract the features, while the values of the parameters used in the experiments are shown in Table 4.

The feature extraction procedure detailed above was applied to the data of the six drawing tasks described in Sect. 3.1. Since the extraction of the deep features requires a training phase of the CNN, to avoid any bias with the fivefold cross-validation strategy, we selected for each sample the feature vector provided by the CNN when that sample was in the test fold, i.e., when it was not included in the folds used for training.

To assess the effectiveness of our system, we performed three sets of experiments. In the first one, we evaluated and compared the results achieved by the CNNs used on the six tasks previously mentioned. In the second one, we compared the classification performance achieved by RGB and MC images. Finally, in the third set we compared the classification performance of our approach with those achieved by using the handcrafted features described in Sect. 3.2. The results of these experiments are detailed in the following subsections.

Table 4 Values of the classifier parameters used in the experiments

4.1 Tasks comparison

In this set of experiments, we tried to answer the following questions: Which, if any, is the best performing task among those considered? Among the four CNNs used with RGB and MC images, is there one that performs better than the others? Do the CNNs exhibit similarities or differences across the tasks?

To report a detailed description of the results achieved, Tables 5 and 6 show the results for each task provided by the five considered classifiers when using RGB-deep features and MC-deep features, respectively, extracted with the different CNNs. From both tables, we can observe that, for each task, the performance of each classifier varies widely as the deep features used vary. Similarly, for each task, the performance obtained by using the features extracted with each CNN varies widely with the classifier. Furthermore, for each classifier, the performance obtained by using the features extracted with each CNN varies significantly across tasks.

Table 5 Classification results achieved using RGB features
Table 6 Classification results achieved using MC features

To summarize the results shown in Tables 5 and 6, we plotted two vertical bar graphs for each feature type: The first graph displays the accuracy achieved by each classifier, whereas the second one displays that achieved by each CNN. The plots of Fig. 7 refer to RGB-deep features, while those of Fig. 8 refer to MC-deep features. In both figures, the left plot shows, for each task, the mean accuracy of each classifier, averaged over the results achieved with the features provided by the four CNNs, while the right plot shows, for each task, the mean accuracy of each CNN, averaged over the results of the five classifiers. The aim is to see “at a glance” whether there is a CNN or a classifier that performs better than the others. From the figures, we can observe that task 2 achieved the worst performance, both for RGB and MC features. This result can be explained by considering that task 1 requires a greater motor load than task 2: Indeed, when joining the points vertically, it is easier to carry out the task with small movements of fingers and wrist, without moving the arm.

Fig. 7 Average accuracy achieved by the classifiers (a) and the CNNs (b) using RGB images

Fig. 8 Average accuracy achieved by the classifiers (a) and the CNNs (b) using MC images

Fig. 9 Accuracy for each task averaged over the results of five classifiers

From the figures showing the classifier performance (left plots), we can observe that RF and SVM outperform the others in most cases. On the one hand, these results confirm the effectiveness of the ensemble-based strategy of RF, as well as that of the kernel-based approach of SVM, specifically devised for two-class problems. On the other hand, they confirm that the simple K-NN algorithm was not able to effectively estimate the probability distributions underlying our data. These results also highlight an important point: The effectiveness of the RGB features extracted by the CNNs is independent of the classification algorithm used to implement the classification layer. Furthermore, the performance of RF and SVM is better than that of the FC classifier trained during the feature extraction process. The same does not occur for MC features.

From Fig. 7b, showing the CNN performance using RGB features, we can observe that InceptionResNetV2 (60M parameters) achieved the best results on the first three tasks, whereas for the remaining tasks different cases occur. In particular, for tasks 4 and 6 the best performance was achieved by InceptionV3 (30M) and VGG19 (25M), respectively. Most probably, this is due to the higher complexity of these tasks compared to the first three: Simpler CNNs allowed a more effective training on the available data, thus providing better results. On the other hand, on task 5 the CNNs achieve similar performance, confirming that in this case the number of parameters did not affect the training process.

Table 7 Classification results achieved by the FC classifier, using RGB and MC features
Table 8 Results of classification with the handcrafted features

Looking at Fig. 8b (CNN performance on MC features), we can see that, except for tasks 2 and 5, InceptionResNetV2 did not achieve the best performance. This result seems to suggest that using a CNN of higher complexity does not allow the system to achieve better performance.

4.2 Comparison of MC and RGB features

In the second set of experiments, we compared the classification performance achieved by using RGB and MC features. The aim was to evaluate, in terms of performance, the contribution of the fourth channel in MC images.

Figure 9 shows the comparison between the classification performance achieved using MC and RGB features. To this aim, we averaged the accuracy achieved by the five classifiers used and plotted a vertical bar per task. From the graphs, we can observe that in most cases the performance achieved using RGB features is slightly better than (or comparable with) that achieved using MC features. This result confirms that the information added by the fourth channel (see Sect. 3.3) did not allow our system to significantly improve its performance.

Moreover, as shown in Fig. 8a, in the case of MC features the FC classifier achieved slightly better or comparable performance compared with the other classifiers considered. This result confirms that, during the training step, to deal with the higher complexity of the MC images it was necessary to exploit the interaction between the feature extractor and the classification layer (see Sect. 3.3). Note that the results provided by the FC classifier show higher standard deviation values than those provided by the other classifiers (see Tables 5 and 6). This is probably due to the fact that, as previously mentioned, the FC results were computed by averaging the accuracy over the five test folds, while the results of the other classifiers were averaged over 30 runs.

To highlight these aspects, we compared the performance of RGB and MC features achieved by using the FC classifier only (see Table 7).

4.3 Comparing deep and handcrafted features

In the last set of experiments we compared the classification performance achieved using RGB features with that achieved using the handcrafted features.

Table 8 shows the accuracy achieved using the handcrafted features. The last row of the table shows, for each task, the average accuracy computed over the four classifiers. From the table, we can observe that with these features the best performance is achieved by RF and K-NN. Thus, while confirming the effectiveness of the RF ensemble-based strategy, K-NN, in contrast to the deep features case, obtained satisfactory results: In this case, the K-NN algorithm was indeed able to effectively estimate the probability distributions underlying our data when represented through the handcrafted features. From the table, we can also observe that, in this case, task 2 allowed us to achieve good performance. These results suggest that for this task, in contrast to the deep features case, the information added by some of the handcrafted features allowed us to effectively distinguish the handwriting of patients from that of the control group. On the contrary, tasks 3 and 6 achieved poor performance with these features. These results suggest that, for these tasks, the handcrafted features do not represent the shape and dynamics of handwriting in such a way as to effectively distinguish the handwriting samples of cognitively impaired people from those of the control group.

To summarize the comparison between deep and handcrafted features, we plotted a vertical bar graph showing the best overall accuracy achieved on each task (see Fig. 10). For each task, we plotted the best overall classification performance achieved by using deep-RGB features (bold values in Table 5) and handcrafted features (bold values in Table 8). From the plot, we can observe that our deep-based approach outperforms the one based on handcrafted features, except for task 2. These results confirm the effectiveness of our approach in combining shape and dynamic information. The slightly lower performance on task 2 is probably due to the fact that the low complexity of this task does not allow the selection of discriminant features, as also confirmed by the poor classification results generally obtained with both deep and handcrafted features.

5 Discussion and conclusions

In this paper, we presented a deep transfer learning approach for feature selection applied to Alzheimer’s disease diagnosis through handwriting analysis.

Fig. 10 Comparison results between deep-RGB and handcrafted features. The accuracy values shown are those highlighted in bold in Table 5

The rationale of our work was to combine information derived from the shape of online handwritten traits with information related to the dynamics of the writing process used to produce such traits. To this aim, we generated synthetic offline multi-channel images, in which each elementary trait of the handwritten trace is represented in each channel with a gray level encoding a single piece of dynamic information associated with that trait. Moreover, we exploited the capability of convolutional neural networks (CNNs) to automatically extract features from offline images.

In this study, we compared the results obtained by generating both three-channel (RGB) images and four-channel (TIFF) images. In the first case, the three channels encode, for each elementary trait, the velocity, jerk and pressure applied in producing that trait. In the second case, a fourth channel was added to encode the acceleration. The experimental results obtained by exploiting the features extracted from these images were also compared with those obtained by using standard dynamic features directly derived from the original online handwriting samples. For the sake of comparison, the performance was evaluated using the same classifiers: This choice allowed us to easily analyze the experimental results relative to the different feature representations and, therefore, the role played by the shape and by the combined use of both shape and dynamic information. Finally, a further comparison was made by considering the classification results directly provided by the fully connected layer of each CNN.

As a first consideration, we can observe that the deep features seem more promising than the handcrafted ones, reaching the best performance in terms of accuracy. Indeed, for each task and for each classification scheme, there is always a CNN model whose features allow us to obtain better results than those obtainable with the handcrafted ones. The only exception is task 2, where the best performance obtained by using handcrafted features was slightly better than that obtained with deep features.

Regarding the comparison between RGB and MC deep features, the analysis of the results showed that the addition of a further channel in the generation of multi-channel images does not seem to allow better feature extraction: In fact, the classification results obtained with the RGB deep features are almost always better than those obtained with the MC deep ones. The only exception is task 5, where the FC classifier, obtained by training InceptionResNetV2 with MC deep features, produced slightly better results. However, it should be noted that these results were obtained using only handwriting samples related to graphic tasks: Thus, as future work, we intend to carry out a wider comparison using the data of all the writing tasks included in our protocol. Considering the whole set of tasks would also allow us to improve the overall performance of the diagnostic system, by combining for each participant the responses provided by the classifiers for each single task [40, 41].

Finally, as a future development of our system, we would also like to include information related to in-air features: As mentioned above, we did not exploit this information here because our aim was to evaluate the combined use of dynamic and morphological features. However, since in-air points are acquired by the tablet during the execution of the writing tasks, we could add these in-air traits when generating the synthetic images and evaluate their effect on the feature extraction process.