1 Introduction

Parkinson’s disease (PD) is a neurodegenerative disorder affecting the substantia nigra, a region of the basal ganglia. PD produces a progressive loss of dopaminergic neurons that causes motor symptoms such as tremor, rigidity, bradykinesia, postural instability, and shuffling gait, as well as non-motor symptoms such as depression, sleep problems, loss of cognitive function, nerve pain, and constipation [1].

PD affects about 1% of the world population over 60 years old. Moreover, according to the World Health Organization (WHO), almost two billion people worldwide are expected to be over 60 years old by 2050. Thus, about 20 million people may be living with PD in the future [2].

PD, unfortunately, remains incurable, and its diagnosis is not simple. The patient must be assessed by means of standardized clinical exams, diagnostic imaging exams, and the response to dopaminergic therapy. However, the cardinal signs of the disease, such as tremor, rigidity, bradykinesia, and postural instability, are features that can mark the presence of the disorder [1, 3, 4].

An adequate understanding of these cardinal signs can help to diagnose the condition. The evaluation of hand tremor is therefore a crucial stage of the clinical assessment of the patient with PD, and it can be done by analyzing drawings that capture tremor. This evaluation may be conducted by a specialist or by a computer-based method [1]. The main advantage of computational methods is that they rely on techniques that favor consistency and reproducibility.

Several studies demonstrate the use of techniques to extract information from handwritten drawings in real time. Smits et al. [5] used a tablet and digitizer pen to draw shapes such as circles, stars, spirals, and characters (e.g., “elel”). The authors quantified tremor, bradykinesia, and micrographia using the writing dynamics (i.e., speed, time, writing size, and frequency) to distinguish people with PD from a control group (CG).

Features extracted from tablet-based spiral drawings were correlated with part III of the Unified Parkinson Disease Rating Scale (UPDRS) score [6]. The study introduced by Almeida et al. [7] used a digitizing tablet to collect spiral drawings to analyze the correlation between physiological tremor and aging. Westin et al. [8] proposed a new web software for viewing and comparing spiral drawing assessments. Several research groups [9,10,11] have used Archimedes’ spiral drawings to assess hand tremor in PD patients via tablet-based data collection.

Recently, deep learning methods, which have a higher computational cost, have been used to analyze handwritten spiral drawings. Khatamino et al. [12] reported a convolutional neural network (CNN) with a highest accuracy of 88%. Similarly, Moetesum et al. [13] reported an accuracy of 83% when discriminating PD from a CG.

Pereira et al. [14] used a smartpen with several attached sensors to obtain data from spiral and meander drawings. The smartpen signals were transformed into images, which were fed to a CNN for training. The CNN reached an accuracy of 83.7% in the classification of PD and CG.

Tolonen et al. [15] used a tablet and a pen with an attached gyroscope to collect data from drawings such as spirals, circles, and zigzag triangles. They classified PD from other movement disorders and achieved the following accuracy: 82% (essential tremor), 69.8% (functional tremor), and 72.2% (physiological tremor). Similarly, the work presented by Matsumoto et al. [16] used a tablet and a digital pen with a 3-axis accelerometer to evaluate spiral drawings of patients with PD and essential tremor (ET).

Although many studies have employed inertial sensors and digitizing tablets, there is a lack of research analyzing digitized drawings made with paper and pen. This type of analysis is relevant because of the simplicity of data collection and its broad availability in the clinical environment.

Kraus and Hoffmann [17] used paper and pencil to collect spiral drawings and then used these images to analyze tremor amplitude. The regression analysis revealed a significant association (88.9%) between the Bain et al. [18] rating scale and tremor amplitude.

Bajaj et al. [19] collected spiral drawings using paper and pen to distinguish cases of tremulous PD from clinical SWEDDs (parkinsonian phenotype with normal presynaptic dopaminergic imaging). The authors assessed the clinical tremor severity (TS), the spiral diameter over 3 turns (3TD), and the spiral density (SD). The sensitivity and specificity in predicting the correct classification were, respectively, 62.5% and 65.0% for TS, 75% and 56.7% for 3TD, and 30.4% and 82.5% for SD.

Pereira et al. [20] used paper and pen to collect spiral drawings from a CG and patients with PD. They reached 78.9 ± 3.5% using the Naïve Bayes classifier. Pereira et al. [21] used a CNN to extract features from handwritten spirals and meanders to identify subjects with PD and reported average overall accuracies of 79.62% for meanders and 89.55% for spirals. Passos et al. [22] reached 96% accuracy in identifying people with PD by using a deep neural network (DNN) (ResNet-50) to learn patterns and extract features from images of spiral drawings, which were then fed to an Optimum-Path Forest (OPF) classifier.

In addition, Gupta and Chanda [23] proposed a Fourier transform-based distance, specific to spiral drawings, to extract features from offline images. The authors discriminated patients with PD from a healthy group with an accuracy of 81.66% using a support vector machine.

Archimedes’ spiral drawing has been widely used to assess tremor. However, the characteristics and methods used in the analysis of Archimedes’ spiral cannot be applied to other types of drawings. Besides, there is a lack of studies comparing the results from Archimedes’ spiral with other drawings for tremor quantification. As presented by Daroff et al. [24], handwriting drawings, e.g., sinusoidal patterns, should be employed for the evaluation of tremor. In contrast to a spiral drawing, which captures the hand movement in the same position, the sinusoidal drawing requires the person to slide the hand from one point to another. This difference in the motion pattern may bring essential characteristics to be analyzed.

Folador et al. [25] introduced the analysis of sinusoidal handwritten drawings for tremor evaluation. The authors showed that it was possible to discriminate drawings made by people with PD from those of a CG by employing computer-vision techniques to extract features from images. A random forest classifier was applied to these features, and an overall mean accuracy of 70% was reached.

The present research extends the investigation of Folador et al. [25] by collecting and analyzing a data set with 960 handwritten drawings of spiral and sinusoidal patterns. Twenty people with PD and 20 healthy subjects participated in the experimental trials. The presented results consider the clinical evaluation of tremor by three distinct examiners and the comparative performance of four classifiers in distinguishing the tremor symptom in people with PD from the control group.

The histogram of oriented gradients (HOG) descriptor has been used in computer vision primarily for human body detection [26], and its use has been extended to tsunami victim detection [27], texture classification [28], traffic sign detection [29], and mammographic image classification [30]. In this research, HOG was introduced for tremor detection in people with PD from spiral and sinusoidal handwritten drawings. The HOG features were fed into three classical machine-learning techniques that have been employed in several related studies, i.e., random forest [31,32,33,34,35], k-nearest-neighbor [33, 36, 37], and support vector machine [20, 23, 27, 33,34,35, 38]. Additionally, the HOG features were fed into a deep learning classifier (convolutional neural network) [12,13,14, 21].

The main purpose of the proposed method was to employ solely paper and pencil to quantify tremor through spiral and sinusoidal handwriting images by HOG feature extraction. Practical advantages of this method are the following: (i) it can be widely employed as it depends only on the availability of paper and pencil; (ii) it does not need an experimental environment with supervision of drawing parameters such as speed, time, and pressure; (iii) it can be extended to the analysis of distinct types of drawing patterns as it does not depend on specific metrics of a particular kind of drawing pattern.

2 Materials and methods

This cross-sectional study was approved by the Research Ethics Committee of the Federal University of Uberlândia (CAAE 07075413.6.0000.5152). The participants were informed about the data collection procedures and signed a consent form before data collection.

2.1 Participants

Data were collected from people with PD (PwPD) and healthy individuals (i.e., the control group, CG). The MDS-UPDRS (Unified Parkinson’s Disease Rating Scale, as modified by the Movement Disorder Society in 2008) [39] was applied to PwPD. The inclusion criteria for the control group were the absence of neurological disorders and of any physical impairment that might prevent the individual from executing the experimental tasks.

The research included 20 PwPD, as shown in Table 1. They were treated with antiparkinsonian drugs. The experiment took place while the participants were in the “ON” period of the medication, in order to obtain more stable handwritten drawings.

Table 1 Characterization of the research groups

The control group consisted of 20 individuals matched to the PwPD by age and sex, as indicated in Table 1. The statistical equivalence of the groups was verified by the t test, in which the null hypothesis (H0) is that the means of the PwPD and CG groups are equal; if H0 is rejected (p value < 0.05), the means are different. The t test is commonly used for data that follow a normal distribution. Thus, normality was confirmed by the Shapiro–Wilk test (H0: the sample has a normal distribution; if p value < 0.05, the null hypothesis is rejected) and by inspection of the quantile-quantile plot (QQ-plot) [40].

2.2 Procedure for data collection

During the experiment, the participants were asked to use the dominant hand to draw a spiral and a sinusoidal pattern. All the participants were evaluated in a blinded manner by three physiotherapists with experience in Parkinson’s disease. Besides confirming the absence of problems or severe comorbidities that could affect the dominant hand, they also applied the MDS-UPDRS part III (specific questions 3.4, 3.5, 3.6, 3.15, 3.16, and 3.17) to evaluate the hand movement, in particular the presence and severity of hand tremor. These specific items of the MDS-UPDRS were applied to the right and left hands.

The participants were asked to trace over a printed pattern before doing the experimental task, as shown in Fig. 1a and b. The purpose of this step was to familiarize the participant with the experimental task. Following this, they were asked to freely draw six patterns of each type on an A4 white sheet of paper using a 6B black Faber-Castell pencil.

Fig. 1

a The spiral and b the sinusoidal patterns drawn by the participants. The printed patterns were traced, while the other images were produced freely

Data collection was performed in two experimental sessions with an interval of 1 week between them, which accounted for the effect of data variability in the analysis and increased the number of data samples from each participant. In total, each participant drew 12 patterns in an experimental session: six spirals and six sinusoidal patterns. There was no clinical complication with the participants in this period and no change in medication.

Therefore, a database of 960 images was created, with 480 spirals and 480 sinusoidal patterns over both sessions, i.e., 240 spiral and 240 sinusoidal images for each experimental session, half of them collected from PwPD and the other half from the CG. Hence, a balanced dataset was set up with the same number of samples per class [41].
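The image counts follow directly from the protocol, and a short sketch makes the bookkeeping explicit. The variable names are illustrative and not part of the original study code.

```python
# Dataset bookkeeping derived from the protocol described in the text.
participants_per_group = 20   # 20 PwPD and 20 CG
groups = 2
drawings_per_pattern = 6      # per participant, per session
patterns = 2                  # spiral and sinusoidal
sessions = 2

# Images of one pattern type collected in one session (both groups).
per_pattern_per_session = participants_per_group * groups * drawings_per_pattern
# Images of one pattern type over both sessions.
per_pattern = per_pattern_per_session * sessions
# Whole database.
total_images = per_pattern * patterns

print(per_pattern_per_session, per_pattern, total_images)  # 240 480 960
```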

The drawings were digitized and preprocessed so that features could be estimated. In the last stage, the extracted features were classified using distinct machine-learning methods to discriminate people with tremor from those without visible tremulous activity. Figure 2 illustrates the main steps of the study.

Fig. 2

Flowchart depicting the main stages of the study. a Recruitment and clinical assessment of healthy volunteers and people with Parkinson’s disease. b The handwritten drawings were collected, digitized, and preprocessed. c Features of the digitized images were extracted. d The set of features was classified with the aim of discriminating people with tremor from those without visible tremulous activity

2.3 Computational and data processing environment

The pencil drawings were digitized at a resolution of 300 dots per inch (dpi) using a scanner (HP Deskjet 3516 multifunctional). Since all the drawings of each participant were on a single sheet of paper, it was necessary to select the individual drawings of each pattern after digitization, using an image manipulation program (GIMP). Each drawing was rescaled to a standard size of 256 by 256 pixels (width and height) at 96 dpi.

Data processing and all other experiments were carried out on a machine with an Intel Core i7 2.40 GHz processor, 8 GB of DDR3 RAM, a 256 GB SSD, and a 2 GB NVIDIA GeForce GT 650 video card, running 64-bit Microsoft Windows 7 Pro.

Python 3.6.5 was used with TensorFlow 2.1 (the core of an open-source library for machine learning) and Keras 2.3.1 (a deep learning framework), and the Scientific Python Development Environment (Spyder 3.3.2) was used for coding. RStudio 1.1.456 was used for statistical analyses as well as for graphical visualization.

2.4 Clinical analysis of tremor

In this study, three examiners applied the MDS-UPDRS to PwPD. The scores of the MDS-UPDRS part III (questions 3.4, 3.5, 3.6, 3.15, 3.16, and 3.17) given by each examiner were summed for each participant. The scores of the left and right hands were computed. The score range was between 0 and 48 points.

The agreement among all examiners was assessed with Kendall’s coefficient of concordance. Kendall’s coefficient is a non-parametric statistic that measures the agreement among several evaluators assessing a given set of n subjects. Its value ranges from 0 (no agreement) to 1 (complete agreement) [42].
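As an illustration of how such a coefficient can be obtained, the sketch below computes Kendall’s coefficient of concordance from the rank sums of each subject. It assumes the uncorrected formula W = 12S / (m²(n³ − n)), i.e., without the correction for tied scores, and the scores are made up for the example; the study’s analysis was run in R and may differ in these details.

```python
def average_ranks(scores):
    """Rank scores from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """ratings: one list of scores per examiner, all over the same n subjects."""
    m = len(ratings)            # number of examiners
    n = len(ratings[0])         # number of subjects
    ranked = [average_ranks(r) for r in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores (not study data): three examiners in full agreement,
# and two examiners who rank three subjects in opposite order.
w_full = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
w_none = kendalls_w([[1, 2, 3], [3, 2, 1]])
print(w_full, w_none)  # 1.0 0.0
```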

2.5 Feature extraction

The histogram of oriented gradients (HOG) descriptor, proposed and detailed in [26, 28, 43], is a method based on evaluating normalized local histograms over a dense grid, using gradient magnitude and angle information for object detection. The distribution of local intensity gradients can represent the appearance and shape of the region analyzed.

Figure 3 illustrates the main steps involved in estimating the HOG. In this study, a window of the digitized drawing was divided into small regions named cells (e.g., the blue square in Fig. 3b) of 16 (width) by 16 (height) pixels, and each cell was represented by a one-dimensional vector containing its histogram of gradients. The histogram of a cell has 9 orientation bins (Fig. 3b), as suggested in [26, 43]. For the normalization process, the block of cells (the green grid in Fig. 3b) was set to 2 by 2 cells.

Fig. 3

Basic steps involved in the HOG descriptor estimation: a input image, b HOG parameter setup, such as the division of the image into cells and blocks, c the gradients gx and gy are computed, d the histograms of each cell are estimated, and the normalization is processed by block, e the HOG descriptor, and f a vector of the normalized cell histograms from all blocks is produced

To estimate the HOG descriptor, the gradients gx and gy must be calculated as in Eq. 1, where x and y are the pixel positions in the image and f is the pixel intensity. The horizontal gradient gx at a target pixel is the difference between the intensities of its right and left neighbors. In the vertical direction, gy is the difference between the intensities of the neighboring pixels above and below the target pixel; an example of the gradients obtained is shown in Fig. 3c [26, 28].

$$ \nabla f\left(x,y\right)=\left[\begin{array}{c}{g}_x\\ {}{g}_y\end{array}\right]=\left[\begin{array}{c}\frac{\partial f}{\partial x}\\ {}\frac{\partial f}{\partial y}\end{array}\right]=\left[\begin{array}{c}f\left(x+1,y\right)-f\left(x-1,y\right)\\ {}f\left(x,y+1\right)-f\left(x,y-1\right)\end{array}\right] $$
(1)

The HOG is obtained as the combination of the local histograms of gx and gy. The parameters that define a local histogram are the following: the magnitude g in Eq. 2 and the edge orientation θ in Eq. 3 [27, 28].

$$ g=\sqrt{g_x^2+g_y^2} $$
(2)
$$ \theta =\arctan \left({g}_y/{g}_x\right) $$
(3)
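Equations 1 to 3 can be sketched for a single interior pixel as follows. The function names are illustrative, the image is assumed to be a list of rows indexed as img[y][x], and the orientation is folded into the unsigned range [0°, 180°) with atan2, which is an assumption about how the 0 to 180° range used for the histogram bins is obtained.

```python
import math


def gradients(img, x, y):
    """Central differences of Eq. 1 at an interior pixel; img[y][x] is intensity."""
    gx = img[y][x + 1] - img[y][x - 1]
    gy = img[y + 1][x] - img[y - 1][x]
    return gx, gy


def magnitude_and_angle(gx, gy):
    """Gradient magnitude (Eq. 2) and unsigned orientation in degrees (Eq. 3)."""
    g = math.hypot(gx, gy)
    # atan2 avoids division by zero when gx == 0; the modulo folds the
    # signed angle into [0, 180), as assumed for unsigned gradients.
    theta = math.degrees(math.atan2(gy, gx)) % 180.0
    return g, theta


# Toy 3x3 patch (not study data): intensity grows to the right and downward.
patch = [
    [0, 0, 0],
    [0, 5, 10],
    [0, 10, 0],
]
gx, gy = gradients(patch, 1, 1)
g, theta = magnitude_and_angle(gx, gy)
print(gx, gy)
```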

The number of bins in local histograms was configured to 9, and as the value of θ ranged from 0 to 180°, the first bin ranged from 0 to 20°, the second from 20 to 40°, and so on. After that, the voting process [26] selects the bin based on the value of θ and then adds the pixel magnitude to the bin (Fig. 3d).
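The voting step described above can be sketched as follows. This is a minimal version in which each pixel votes only for the bin containing its orientation; many HOG implementations instead split each vote between the two nearest bins by interpolation, so this should be read as an illustration of the description in the text rather than of a specific library.

```python
def cell_histogram(magnitudes, angles, n_bins=9):
    """Accumulate gradient magnitudes into orientation bins over 0..180 degrees."""
    bin_width = 180.0 / n_bins          # 20 degrees for 9 bins
    hist = [0.0] * n_bins
    for g, theta in zip(magnitudes, angles):
        # Clamp so that an angle of exactly 180 falls in the last bin.
        b = min(int(theta // bin_width), n_bins - 1)
        hist[b] += g
    return hist


# Toy values (not study data): three pixels with orientations 10, 25, and 179 degrees.
hist = cell_histogram([1.0, 2.0, 3.0], [10.0, 25.0, 179.0])
print(hist)  # [1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0]
```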

To improve invariance to shadows and illumination, the local histograms are normalized within larger regions named blocks, set as 2 by 2 cells. The block scans the dense grid of cell histograms from left to right and from top to bottom, and the block stride is fixed at half of the block size, so that consecutive blocks overlap (Fig. 3d). Each block can be normalized using, for instance, the L1-norm (f = v/(||v||1 + c)), where v is the non-normalized descriptor vector of a block, ||v||1 is its 1-norm, and c is a constant (c = 1) that prevents division by zero [26]. Each value of the cell histogram is divided by the block normalization value. Because the blocks overlap, each cell is represented in the final feature vector several times, each time normalized by a different block. Although redundant, this repetition improves the performance of the descriptor [26, 28].
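The L1-norm block normalization given above, f = v/(||v||1 + c), can be written in a few lines. The function name is illustrative, and the input is assumed to be the concatenated histograms of the four cells of a block.

```python
def l1_normalize(block, c=1.0):
    """Divide every histogram value of a block by (L1-norm of the block + c)."""
    norm = sum(abs(v) for v in block)
    return [v / (norm + c) for v in block]


# Toy block vector (not study data): ||v||1 = 4, so every value is divided by 5.
normalized = l1_normalize([1.0, 3.0], c=1.0)
print(normalized)
```

With c = 1, an all-zero block (a blank region of the page) is mapped to zeros instead of raising a division error.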

The normalized block descriptors are referred to as histogram of oriented gradients (HOG) descriptors (Fig. 3e). The size of the final feature vector is the product of the number of bins (9), the number of cells per block (4), the number of horizontal block positions (15), and the number of vertical block positions (15). In this study, the HOG feature vector, as represented in Fig. 3f, has 8100 elements (a dimensionality reduction of 87.6% compared with the 65,536 pixels of the original image). The features used in the test set of each classifier can be visualized in the “Results” section.
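The length of the feature vector can be checked from the parameters stated above. The sketch assumes a block stride of one cell (half of a 2-cell block), and the function name is illustrative.

```python
def hog_length(image_size=256, cell_size=16, block_cells=2, n_bins=9):
    """Length of the HOG vector for a square image and a block stride of one cell."""
    cells_per_side = image_size // cell_size               # 256 / 16 = 16
    block_positions = cells_per_side - block_cells + 1     # 16 - 2 + 1 = 15
    return block_positions ** 2 * block_cells ** 2 * n_bins


n_features = hog_length()
reduction = 1 - n_features / (256 * 256)
print(n_features, round(100 * reduction, 1))  # 8100 87.6
```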

2.6 Data classification analysis

The set of HOG feature vectors was classified to discriminate drawings with tremor from those without tremulous activity. Although all PwPD in this study had tremor, an individual may not have presented the symptom during the experimental trials. Likewise, some healthy people could present tremor because of anxiety or other stress factors during the experiment.

Four supervised classifiers were employed: random forest classifier (RFC), k-nearest-neighbor (KNN), support vector machine (SVM), and convolutional neural network (CNN).

RFC is a supervised machine learning algorithm based on ensemble learning, which combines different algorithms, or several instances of the same algorithm, to create a more effective prediction model. Here, a combination of multiple decision trees was used [31]. In general, an RFC draws N objects from the database and builds a decision tree with these data; this is repeated for every tree, and each tree predicts the category of a new object. Finally, the new object is assigned to the class that wins the majority vote [31,32,33]. In this work, 500 trees were used, as in [34].

K-nearest-neighbor (KNN) is a classification algorithm that assigns a data point to a category by looking at the labeled data points closest to it. It is a non-parametric method generally used for classification and regression [33]. In this work, k was set to 7 neighbors (k = 7), as in [36].
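A minimal sketch of the KNN decision rule with k = 7 is given below. The Euclidean distance and the toy two-dimensional points are assumptions made for the example; the text does not state the distance metric, and the study’s feature vectors have 8100 dimensions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=7):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance is an assumption; the metric is not stated in the text.
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy data (not study data): label 0 clustered near the origin, label 1 near (5.5, 5.5).
train_x = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
pred_a = knn_predict(train_x, train_y, (0.5, 0.5))
pred_b = knn_predict(train_x, train_y, (5.5, 5.5))
print(pred_a, pred_b)  # 0 1
```

An odd k, such as 7, avoids tied votes in a two-class problem.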

SVM is a supervised machine learning classification algorithm. The algorithm aims to find a boundary that divides the data in such a way that the misclassification error is minimized. The points nearest to the decision boundary, whose distance to the boundary is maximized, are called support vectors. The decision boundary in support vector machines is called the maximum margin classifier or the maximum margin hyperplane [38]. In this work, a linear kernel was used, as in [26].

A common CNN (two-dimensional or 2D CNN) is a deep learning method typically used in the classification of images. Unlike conventional multilayer perceptron architectures, a CNN uses convolution and pooling layers, which reduce the image to its basic features in order to understand and classify it [12, 14, 44].

In this proposal, the HOG features are used to feed a one-dimensional CNN (1D CNN), which is commonly used for sequence processing. In this context, the convolutional layer uses a kernel to extract local patches (subsequences) from the original sequence of features and feeds a fully connected layer to compute the classification [44]. In this study, 1D CNN was configured with two convolutional layers with kernel length of 5, two pooling layers, and with 3 fully connected layers (the first 2 with 16 units and the output layer with 1 unit) to classify PwPD or CG. The configuration was based on [44] and empirically improved.

A 2D CNN was used to extract the features and classify the original images of the database (without HOG) to compare the results. The 2D CNN arrangement was set up with 5 convolutional layers with kernel length of 3 by 3, 5 pooling layers, and with 2 fully connected layers (the first layer with 512 units and the output layer with 1 unit) to classify PwPD and CG. The configuration was based on [44] and empirically improved.

Stratified k-fold cross-validation was used for the evaluation of all the classifiers. This method randomly splits the data into k equally sized groups, or folds, preserving the same percentage of samples of each class; one fold is then used to test the classifier and the others to train it [45]. In this work, 5 folds were employed [34, 46, 47], which means that the classifier is trained and tested 5 times in each experiment.
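The splitting scheme can be sketched as follows. This is a simplified illustration, not the routine used in the study: the function name and the seed are arbitrary, and the split is assumed to be made at the level of individual images.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs that keep class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# One session of one drawing type: 120 PwPD images (1) and 120 CG images (0).
labels = [0] * 120 + [1] * 120
splits = list(stratified_kfold(labels, k=5))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))  # 192 48 24
```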

The training/test was performed separately for session 1 and session 2 for each type of drawing (spirals and sinusoidal waveforms) on a balanced dataset with the same number of samples in each class [41].

The following metrics were employed to assess the quality of the classification results:

  • Accuracy (ACC) is the proportion of correct prediction of a given condition as defined in Eq. 4 [30, 46, 48].

$$ \mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} $$
(4)

Here, TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

  • Sensitivity (SEN) is the measure of the true positive rate, as defined in Eq. 5 [46, 48].

$$ \mathrm{SEN}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(5)
  • Specificity (SPE) is related to the true negative rate, as defined in Eq. 6 [46, 48].

$$ \mathrm{SPE}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$
(6)
  • The receiver operating characteristic (ROC) curve presents the sensitivity on the y-axis and the complement of the specificity (1 − specificity) on the x-axis [30, 48].

  • The area under the curve (AUC) is given by Eq. 7. The higher the AUC is in a range from 0 to 1, the better the model is at distinguishing between individuals with and without tremor symptoms [48].

$$ \mathrm{AUC}={\int}_0^1\mathrm{ROC}(t)\, dt $$
(7)
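The metrics of Eqs. 4 to 7 can be sketched as follows. The labels and scores are toy values, the positive class (1) is assumed to denote tremor, and the integral of Eq. 7 is approximated with the trapezoidal rule over the empirical ROC points, which is one common choice and not necessarily the one used in the study.

```python
def acc_sen_spe(y_true, y_pred):
    """Accuracy, sensitivity, and specificity (Eqs. 4-6) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (tp + tn) / (tp + tn + fp + fn), tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """Trapezoidal area under the ROC curve built by sweeping the threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # (1 - specificity, sensitivity)
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example (not study data).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
acc, sen, spe = acc_sen_spe(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(acc, sen, spe, auc)  # 0.5 0.5 0.5 0.75
```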

The motor fluctuation was assessed by considering data collection in two different experimental sessions. Thus, the following models were created for each classifier considering the distinct types of experimental conditions:

  • M1: a model of spiral drawings of the experimental session 1;

  • M2: a model of spiral drawings of the experimental session 2;

  • M3: a model of spiral drawings of the experimental sessions 1 and 2;

  • M4: a model of sinusoidal drawings of the experimental session 1;

  • M5: a model of sinusoidal drawings of the experimental session 2;

  • M6: a model of sinusoidal drawings of the experimental sessions 1 and 2;

The accuracy values of the models (M1–M6) were tested for normality. The Shapiro-Wilk test was used to test the null hypothesis (H0) that a sample came from a normally distributed population; if p value < 0.05, H0 was rejected. Bartlett’s test was applied to verify that the samples had equal variances (H0); if p value < 0.05, the equality of variances was rejected. Finally, the Bonferroni outlier test was used to check for the presence of outliers; under the null hypothesis, no observation differs from the rest of the observations. If p value < 0.05, outliers are present [49].

The accuracy of the models was assessed by analysis of variance (ANOVA) and the Tukey-Kramer test to understand the differences among the models. ANOVA is a statistical method widely used to explain variations between two or more group means. The null hypothesis states that there are no differences between the group means (H0: μ1 = μ2 = … = μm). If ANOVA indicates significant differences (H0 is rejected, p value < 0.05), the Tukey test is applied to perform multiple pairwise comparisons between the group means, and the means that are significantly different from each other are highlighted [49, 50].
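For reference, the F statistic of a one-way ANOVA can be computed as in the sketch below. The groups are made-up numbers, not accuracies from this study, and obtaining the p value additionally requires the F distribution, which is left to a statistics package (the study used R).

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical groups with means 2, 3, and 6 and equal spread.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F indicates that the variation between group means is large relative to the variation within groups, which is the situation in which the pairwise Tukey comparisons become informative.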

The analysis of the accuracy of distinct models helps to understand and determine whether there is motor fluctuation [51] between different data collection sessions and also to identify the most suitable type of drawing (i.e., spiral or sinusoidal pattern) for tremor evaluation.

3 Results

3.1 Statistics for control and PD groups

The QQ-plot of the age distribution for each group (PD and CG) was inspected to confirm the normality of the variables [49]. Furthermore, the Shapiro-Wilk normality test [49] confirmed that the distribution of the variable age of both groups was normal (W = 0.9517 and p value = 0.3938 for PD; and W = 0.9512 and p value = 0.3853 for the CG).

The t test was applied to verify possible differences between the age of the groups [40, 49]. The estimated t test statistic was 0.2166, with 37.758 degrees of freedom and a p value of 0.8297, which is larger than 0.05, meaning that the null hypothesis that there are no significant differences between the ages of PwPD and CG should not be rejected.

People with PD were evaluated by three experienced evaluators in a blind procedure. The average summed score was 15.57 in the first session and 16.42 in the second session, on a scale of 0–48, indicating that the PwPD presented a slight to mild level of tremor.

Furthermore, the estimated Kendall’s coefficient was 0.6610 with a p value of 0.0065, which means that the coefficient is significantly different from zero. Therefore, the null hypothesis of no agreement among the evaluators was rejected.

3.2 HOG feature visualization

The length of the vector of HOG features is obtained by multiplying the number of bins, the number of cells per block, and the numbers of horizontal and vertical block positions. In this research, the feature vector had 8100 elements. Figure 4 illustrates these HOG features in a 3D plot. The data are from the test set of the PD and CG groups. Visual inspection of the features in Fig. 4 revealed distinct feature magnitude and variability for the spiral and sinusoidal images. For this reason, three equally sized regions delimited by these features were defined as region 1 (feature 0 to 2700), region 2 (feature 2700 to 5400), and region 3 (feature 5400 to 8100).

Fig. 4

Visualization of features for PD and CG groups for the test sets used to evaluate each model. Data (36 images per group) from experimental session 1 (a, b, g, and h) and experimental session 2 (c, d, i, and j) are presented. In addition, data (72 images per group) of sessions 1 and 2 (e, f, k, and l) are shown

The mean and coefficient of variation, together with their respective 95% confidence intervals, were estimated for each region, group (PD and CG), and proposed model (M1–M6). The statistics and their confidence intervals were estimated by bootstrap, a statistical method for estimating the sampling distribution of a statistic (e.g., mean, coefficient of variation) by sampling with replacement from the original sample [49]. In this research, random sampling was executed 1000 times, as suggested in [49].
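A bootstrap confidence interval of this kind can be sketched as below. The percentile method, the fixed seed, the use of the sample standard deviation, and the synthetic sample are all assumptions for illustration; the text does not specify which bootstrap interval was computed in R.

```python
import random
import statistics


def bootstrap_ci(sample, stat, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(sample) for _ in sample])   # resample with replacement
        for _ in range(n_resamples)
    )
    lower = estimates[int(n_resamples * alpha / 2)]
    upper = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean assumed non-zero)."""
    return statistics.stdev(values) / statistics.mean(values)


# Synthetic positive values standing in for the HOG features of one region.
sample = [0.1 * i for i in range(1, 51)]
mean_lo, mean_hi = bootstrap_ci(sample, statistics.mean)
cv_lo, cv_hi = bootstrap_ci(sample, coefficient_of_variation)
print(mean_lo, mean_hi, cv_lo, cv_hi)
```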

Figure 5 shows the statistics mean and coefficient of variation, together with their 95% confidence interval, estimated through Bootstrap. The statistics were computed for HOG features of the spiral and sinusoidal images. In the graphs, the x-axis labels represent the name of the model concatenated with the group and the region delimited by HOG features. For example, the label M2PD3 represents model 2 (M2) of the PD group and region 3 (features from 5400 to 8100). Figure 5a and c show statistics for the features estimated from spiral images, while b and d show statistic values for the sinusoidal images.

Fig. 5

a, b Mean HOG and its 95% confidence interval for spiral and sinusoidal images, respectively. c, d Coefficient of variation of HOG and its 95% confidence interval for the spiral and sinusoidal images, respectively. The statistics are presented for distinct models (M1 to M6), groups (PD and CG), and regions delimited by features (1, 2, and 3). For instance, M1CG1 is the statistic for model 1, group CG, and region 1. The 95% confidence interval was relatively narrow and hence difficult to see in the figure.

In Fig. 5, the gray area highlights the comparison between the PD and CG groups of the same model and region. In general, there is no overlap between the confidence intervals of distinct groups, and when overlaps occur, they are small; for instance, in Fig. 5a, the upper limit of the confidence interval for the mean of M3CG3 (0.02001) is equal to the lower limit of M3PD3. The confidence intervals for the statistics are also narrow, suggesting an accurate estimate. For example, in Fig. 5d, the estimates for M6CG2 (coefficient of variation 3.20853, CI 3.20778, 3.20928) do not overlap with those of M6PD2 (coefficient of variation 3.18846, CI 3.1877, 3.18921). Finally, Fig. 5 highlights differences between feature values in distinct regions (from 1 to 3), which can also be observed in Fig. 4 for each group and type of image.

3.3 Classification results

3.3.1 Random forest classifier

Table 2 shows the results of each session and type of drawing. The italicized results are the highest mean accuracy, sensitivity, and specificity, suggesting that data from both experimental sessions and the sinusoidal pattern are more relevant for the objective evaluation of hand tremor.

Table 2 Random forest classifier (RFC)

Figure 6 shows the ROC curve and the AUC value calculated for each subset. The results for RFC-M6 confirm the identification of the most accurate RFC model shown in Table 2.

Fig. 6

ROC curve and AUC values of the RFC model. The M1, M2, and M3 graphs represent the results of the spiral drawing test set of the data collected from session 1, session 2, and all data together, respectively. Similarly, M4, M5, and M6 show the results of the sinusoidal drawings on the test set of the data collected from session 1, session 2, and all data together

3.3.2 K-nearest-neighbor

Table 3 shows the KNN results of each session and type of drawing. The italicized outcomes in the table are the highest mean accuracy, sensitivity, and specificity, implying that data from both experimental sessions and the sinusoidal drawing are more relevant for the objective evaluation of hand tremor.

Table 3 k-Nearest Neighbor (KNN)

Figure 7 shows the ROC curve and the AUC value calculated for each subset. The results for KNN-M6 confirm the identification of the most accurate KNN model shown in Table 3.

Fig. 7

ROC curve and AUC values of the KNN model. The M1, M2, and M3 graphs represent the results on the spiral drawing test set for the data collected from session 1, session 2, and all data together, respectively. Similarly, M4, M5, and M6 show the results on the sinusoidal drawing test set for the data collected from session 1, session 2, and all data together.

3.3.3 Support vector machine

Table 4 shows the results of each session and the type of drawing from the SVM classifier. The italicized results are the highest mean accuracy, sensitivity, and specificity, suggesting that data from both experimental sessions and the sinusoidal pattern are more relevant for the objective evaluation of hand tremor.

Table 4 Support vector machine (SVM)

Figure 8 illustrates the ROC curve and the AUC value calculated for each subset with the SVM classifier. The results for SVM-M6 confirm the identification of the most accurate SVM model shown in Table 4.

Fig. 8

ROC curve and AUC values of the SVM model. The M1, M2, and M3 graphs represent the results on the spiral drawing test set for the data collected from session 1, session 2, and all data together, respectively. Similarly, M4, M5, and M6 show the results on the sinusoidal drawing test set for the data collected from session 1, session 2, and all data together.

3.3.4 One-dimensional convolutional neural network

Table 5 shows the results of each session and the type of drawing from the 1D CNN classifier. The italicized results are the highest mean accuracy, sensitivity, and specificity, suggesting that data from both experimental sessions and the sinusoidal pattern are more relevant for the objective evaluation of hand tremor.

Table 5 One-dimensional convolutional neural network (1D CNN), test set results

Figure 9 illustrates the ROC curve and the AUC value calculated for each subset with the 1D CNN classifier. The results for 1D CNN-M6 confirm the identification of the most accurate model shown in Table 5.

Fig. 9

ROC curve and AUC values of the 1D CNN model. The M1, M2, and M3 graphs represent the results on the spiral drawing test set for the data collected from session 1, session 2, and all data together, respectively. Similarly, M4, M5, and M6 show the results on the sinusoidal drawing test set for the data collected from session 1, session 2, and all data together.

3.3.5 Two-dimensional convolutional neural network

Table 6 shows the results of each session and the type of drawing from the 2D CNN classifier. The italicized results are the highest mean accuracy, sensitivity, and specificity, suggesting that data from both experimental sessions and the sinusoidal pattern are more relevant for the objective evaluation of hand tremor.

Table 6 Two-dimensional convolutional neural network (2D CNN), test set results

Finally, Fig. 10 illustrates the ROC curve and the AUC value calculated for each subset with the 2D CNN classifier. The results for 2D CNN-M6 confirm the identification of the most accurate model shown in Table 6.

Fig. 10

ROC curve and AUC values of the 2D CNN model. The M1, M2, and M3 graphs represent the results on the spiral drawing test set for the data collected from session 1, session 2, and all data together, respectively. Similarly, M4, M5, and M6 show the results on the sinusoidal drawing test set for the data collected from session 1, session 2, and all data together.

3.3.6 Evaluation of models

Table 7 displays the results of the tests for data normality (Shapiro-Wilk test), homogeneity of variance (Bartlett’s test), and the presence of outliers (Bonferroni outlier test), as well as the analysis of variance (ANOVA) employed to compare the accuracy of each model (M1–M6) for each classifier (RFC, KNN, SVM, 1D CNN, and 2D CNN). An appropriate ANOVA fit requires verifying the normal distribution of the data, the homogeneity of variance across groups, and the absence of outliers [49].

Table 7 Statistical tests to verify normality, variance, and presence of outliers

Table 7 shows that the accuracy values of all models met the requirements for the application of ANOVA. Therefore, one-way ANOVA was applied for each classifier, and for most of them, the p value was smaller than the significance level of 0.05, leading to the conclusion that there are significant differences between the models. However, the 2D CNN did not show significant differences between the mean accuracies of the models.
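The one-way ANOVA decision rests on the F statistic, i.e., the ratio of the between-model to the within-model mean squares. The following minimal sketch computes this statistic and its degrees of freedom for made-up accuracy values (in percent); the function name `one_way_anova_f` and the data are illustrative assumptions, not results of this research. The p value would then be taken from the F distribution with these degrees of freedom, for example with a statistics library such as SciPy (`scipy.stats.f_oneway` runs the same test).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative accuracy values (%) for three hypothetical models; not study data
model_a = [71.0, 72.0, 73.0]
model_b = [72.0, 73.0, 74.0]
model_c = [75.0, 76.0, 77.0]

f_stat, df_b, df_w = one_way_anova_f([model_a, model_b, model_c])
print(round(f_stat, 6), df_b, df_w)  # 13.0 2 6
```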

The Tukey honest significant difference method was employed to evaluate the differences between the accuracy of models. The results are presented in Fig. 11a and b for RFC, Fig. 11c and d for KNN, Fig. 11e and f for the SVM classifier, Fig. 11g and h for 1D CNN, and Fig. 11i and j for 2D CNN.

Fig. 11

Accuracy values of the M1–M6 models for RFC (a and b), KNN (c and d), SVM (e and f), 1D CNN (g and h), and 2D CNN (i and j). The figure shows the boxplots of the accuracy values with a vertical orange line representing the mean and standard deviation. The Tukey multiple pairwise comparisons of the differences between means are also presented. Red and green colors highlight significant differences, whereas blue indicates non-significant differences.

In Fig. 11, the boxplots show the accuracy values for each model, which help to identify the differences in the Tukey graph. The label M2 − M1, for instance, denotes the mean accuracy of model 2 minus that of model 1. Thus, if M2 has a higher mean than M1, the difference is positive; otherwise, it is negative.
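The sign convention of the labels in the Tukey graph can be illustrated as follows. This sketch computes only the pairwise differences between means, not the Tukey confidence intervals or adjusted p values; the accuracy values and the function name `pairwise_mean_differences` are hypothetical.

```python
from itertools import combinations


def pairwise_mean_differences(accuracies):
    """Map labels such as 'M2 - M1' to mean(M2) minus mean(M1)."""
    means = {name: sum(v) / len(v) for name, v in accuracies.items()}
    return {f"{b} - {a}": means[b] - means[a] for a, b in combinations(means, 2)}


# Illustrative accuracy values only; these are not data from the study
accuracies = {
    "M1": [0.70, 0.72],
    "M2": [0.74, 0.76],
    "M3": [0.69, 0.71],
}

diffs = pairwise_mean_differences(accuracies)
for label, value in diffs.items():
    sign = "positive" if value > 0 else "negative"
    print(label, round(value, 2), sign)
```

Here the label `M2 - M1` is positive because the hypothetical M2 has the higher mean, whereas `M3 - M1` and `M3 - M2` are negative.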

4 Discussion

In this study, paper and pencil were used to quantify tremor through handwriting images. The HOG descriptor was introduced as a novel feature for tremor detection from sinusoidal and spiral patterns. The features were extracted using HOG and classified by machine learning methods (i.e., RFC, KNN, SVM, and CNN) to detect tremor in people with PD. As the classification results based on sinusoidal patterns were higher than those based on spiral patterns, this research reinforces the need to look for alternative drawings to the traditional spiral pattern.

The hand tremor of people with PD was confirmed by three distinct examiners through the application of the MDS-UPDRS, which is the gold standard method [10]. Such a clinical assessment is not available in several related studies that analyzed spiral patterns [9, 10, 12, 14, 15, 16, 17, 19, 20, 21, 22, 23, 25], which makes comparative evaluations difficult. In this research, fair agreement was found between the assessments of the evaluators, as indicated by Kendall’s coefficient (66.1%), thus confirming the presence of a slight to mild level of tremor in the participants with PD, as suggested by the mean sum of MDS-UPDRS scores (session 1, 15.57; session 2, 16.42; possible range from 0 to 48).
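For readers unfamiliar with agreement measures, the sketch below shows one common form of Kendall’s coefficient, the coefficient of concordance W, for several raters. It assumes that each rater’s scores have been converted to ranks without ties, which may not hold for MDS-UPDRS scores, and it is not necessarily the exact variant computed in this research; the ranks and the function name `kendalls_w` are made up for illustration.

```python
def kendalls_w(ranks):
    """Kendall's W for a list of raters, each a list of ranks 1..n without ties."""
    m = len(ranks)       # number of raters
    n = len(ranks[0])    # number of rated subjects
    rank_sums = [sum(rater[i] for rater in ranks) for i in range(n)]
    mean_sum = m * (n + 1) / 2.0
    # Sum of squared deviations of the rank sums from their mean
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical ranks given by three raters to four subjects; not study data
example_ranks = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]

w = kendalls_w(example_ranks)
print(round(w, 4))  # 0.7778
```

W ranges from 0 (no agreement) to 1 (identical rankings by all raters).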

A practical and relevant aspect of this research is that there is no need for a supervised environment. Several studies [5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16] use tablet-based devices, which may require strict conditions to ensure the quality and usability of the acquired data. Other studies [12, 13, 14, 21, 22] employ less complex computing methods for data collection. However, they do not compare spiral patterns with additional drawings. In this sense, according to the literature review, this is the first work to present an objective comparison of handwritten drawings through the HOG descriptor.

The findings shown in Table 5 suggest that model M6 with the 1D CNN method yielded the best average values: ACC 83.1%, SEN 85.4%, and SPE 80.8%. These results are more accurate than those obtained from RFC (Table 2), KNN (Table 3), SVM (Table 4), and 2D CNN (Table 6). In addition, the highest AUC values (91%) were obtained for RFC and 1D CNN with M6, as shown in Figs. 6 and 9, respectively. Despite the small differences between the accuracies reached by the classifiers, the average training time differed considerably: 59.8 s for KNN, 105.7 s for RFC, 138.8 s for SVM, 1300.15 s for 1D CNN, and 16,957.88 s for 2D CNN in the environment described in this research.

Table 8 summarizes the results reported by different studies, including those obtained in this research, which is the only one that compares results of sinusoidal and spiral drawings by using a single type of visual feature (HOG). In contrast to other methods of feature extraction for tremor analysis [11, 12, 20, 23], HOG is independent of the type of drawing and does not need to be estimated online, as in [11, 12]. The use of devices for the online estimation of features, such as digitizing tablets, can limit the use of the technology in the clinical scenario.

Table 8 Performance comparison of classifiers reported in different studies

Two critical limitations of related studies are the lack of a UPDRS evaluation of tremor and the lack of balanced groups. This research confirmed, through the assessment of three distinct examiners, that PwPD had a slight to mild level of tremor. It may be more challenging to discriminate these types of tremor from the physiological tremor found in the control group. Furthermore, the lack of balanced groups may introduce bias in the results; thus, the comparison of outcomes from distinct studies is not straightforward. For instance, a study that used visual features estimated from digitized drawings reported an accuracy of 83% [13] on balanced datasets, while 89.5% was reported [21] in a similar study that did not use a balanced dataset. This research obtained, from a balanced dataset, an AUC of 83% for the spiral, as shown in Fig. 9 (1D CNN-M3), and an AUC of 91% for the sinusoidal drawing, as illustrated in Fig. 9 (1D CNN-M6). This suggests that the sinusoidal drawing should be considered in the clinical evaluation of patients.

Table 6 shows the results of the 2D CNN; however, as shown in Fig. 10, the ROC curve and AUC indicate lower classification performance compared with the other tested classifiers employing HOG features. A possible explanation is that the dataset of handwriting drawings does not provide enough visual information for the 2D CNN to learn from. The dataset in this work consists of only 480 pictures per drawing (spiral or sinusoidal), with 240 images per experimental session and 120 pictures per group (PwPD and CG). This is a small dataset for a 2D CNN, which normally needs several thousand images per group [44].

Figure 4 shows important differences between the features of the PD and CG groups, and these differences were confirmed in Fig. 5, which shows the mean (a and b) and the coefficient of variation (c and d) for both groups and both types of handwritten drawings. The differences between groups are larger for features estimated from the sinusoidal images than from the spiral images.

ANOVA was applied to evaluate the differences between the mean accuracies yielded by each model (Fig. 11). For RFC, the ANOVA test and Tukey’s analysis highlighted a significant difference between the means of M6 (the sinusoidal drawing model related to data collected in sessions 1 and 2) and M1 (the spiral drawing model of session 1), as shown in Fig. 11b.

The application of ANOVA to the results of KNN showed significant differences between means of the models M6 and M1, as shown in Fig. 11d; between M6 and M2 (the spiral drawing model related to data collected in session 2), and between M4 (the sinusoidal drawing model related to data collected in session 1) and M2. These results reinforce that the sinusoidal waveform can be more suitable than spiral drawings to evaluate the hand tremor symptom.

The ANOVA results for SVM showed more evidence of significant differences between M6 and M1, M6 and M2, and M6 and M3, as shown in Fig. 11f. In addition, the ANOVA for 1D CNN yielded a stronger significant difference between M6 and M2. All differences relate to spiral versus sinusoidal subset models. The ANOVA and Tukey tests thus confirmed similar significant differences between the means of the spiral and sinusoidal drawing models. Despite the visual differences between the mean accuracies of the models shown in Fig. 11j, the Tukey test applied to the 2D CNN models did not show any statistically significant difference, and this was confirmed in Table 7, which shows that the ANOVA test did not reject the null hypothesis (p value > 0.05). On the other hand, motor fluctuation could not be observed, since there were no significant differences for the same type of drawing between distinct experimental sessions.

The literature reveals a lack of studies on the significant differences between types of drawings. Smits et al. [5] used a tablet and pen to record the movement dynamics from circles, stars, spirals, and letters, and they were able to distinguish PD from CG. However, the authors did not aim to present a statistical comparison between the shapes. Pereira et al. [14, 21] also did not aim to present a statistical comparison between the means of spirals and meanders. Passos et al. [22] reported that they had not observed any significant difference in recognition rates between meanders and spirals.

Furthermore, in Fig. 11f, the differences between the spiral models were close to zero (M2 − M1, M3 − M1, and M3 − M2), and the differences between the means of the sinusoidal models were close to 0.05 (M6 − M4 and M6 − M5). Future research could investigate whether the SVM sinusoidal model is capable of assessing motor fluctuation between different experimental sessions of data collection, which could help in understanding the effects of drugs on patients with PD.

Under the proposed experimental conditions, sinusoidal patterns are more appropriate for the detection of tremor. This may be explained by the fact that the sinusoidal drawing requires different skills from the participants, who have to slide the hand from one point to another on the paper sheet.

Another important point is that this approach does not require the attachment of sensors to the body of the participant, which may be relevant to the prevention of skin irritation, especially in the elderly. It also does not need equipment to perform an online evaluation.

Future work could be related to the development of a practical mobile application. The app can capture pictures from sinusoidal drawings and present quantitative tremor measurements to the clinician responsible for the follow-up of people with PD. Due to the simplicity of the data collection procedure, this strategy could be used to monitor the patient at home without the need to care about controlling the equipment parameters such as speed, time, and pressure as needed in studies that use a tablet, digitizer pen, or inertial sensors.

5 Conclusion

The results of this research showed that HOG features extracted from sinusoidal handwriting drawings allow for better detection of tremor in people with PD. The highest results were obtained from the 1D CNN using HOG features (ACC 83.1%, SEN 85.4%, and SPE 80.8%, AUC 91%).

The methods described in this research were applied to a balanced dataset, and the feature extraction was based on a computer-vision technique, which was able to detect the tremor in the images. In addition, the ANOVA and Tukey analyses showed that the models based on sinusoidal drawings obtained better results than the models based on spiral drawings, which are the most employed pattern in the clinical evaluation of tremor in people with PD.

All these results indicate that a sinusoidal pattern should be considered in routine clinical evaluations. Moreover, the HOG descriptor can be applied to drawing patterns other than the spiral and the sinusoidal, which broadens its use in assessing tremor in people with PD, unlike most studies, which have used features specific to drawings of the Archimedean spiral.