1 Introduction

Parkinson’s disease (PD) involves damage to nerve cells in the brain, which reduces dopamine levels in the region called the substantia nigra. Dopamine is a neurotransmitter that controls the movement and coordination of the body. After Alzheimer's, PD is the second most common neurodegenerative disease, a chronic disease affecting the whole nervous system and causing instability [1, 2]. Because PD affects the nervous system, it in turn affects movement: when the brain’s nerve cells are damaged, dopamine levels fall, resulting in PD symptoms. The first visible symptoms are tremors or slight shakiness in either hand. The symptoms of PD are categorized as motor and non-motor symptoms. Motor symptoms include tremors, bradykinesia, and speech impairment. Non-motor symptoms include fatigue, sleep problems, depression, and loss of smell. Figure 1 shows the motor and non-motor symptoms of PD. Patients experience a change in voice, shaky and small handwriting, slow movement, and postural imbalance. Since PD is progressive, the symptoms worsen over time [3]. Thus, diagnosing PD is crucial for the welfare of patients and for helping them maintain a good quality of life. The Hoehn and Yahr scale and the Movement Disorder Society-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) are the commonly used scales for determining the stage of PD. The Hoehn and Yahr scale has stages ranging from 1 to 5, and the MDS-UPDRS has four parts. Artificial intelligence is being applied in the health sector and has proven efficient. Machine learning (ML) and deep learning (DL) techniques, which are subsets of artificial intelligence, are increasingly popular in the medical field. Many studies have used artificial intelligence to detect various diseases such as knee osteoarthritis [4, 5], Alzheimer's [6, 7], and stroke [8, 9]. ML and DL concepts are applied to enhance prediction. In this systematic review, various approaches for the detection of PD are discussed.
The approaches are categorized into brain observation techniques, motor symptoms and multimodal features (a combination of two or more symptoms). The brain observation techniques considered are single photon emission computed tomography (SPECT) images, magnetic resonance imaging (MRI) and electroencephalography (EEG) signals. The motor symptoms considered are speech impairments, handwriting dynamics, and gait. This systematic review includes research articles published between 2012 and 2023. Of the 28 papers reviewed under brain observation techniques, 13 used SPECT, 7 used MR/MRI and 8 used EEG. Of the 55 papers reviewed under motor symptoms, 22 addressed voice impairment, 20 handwriting dynamics and 13 gait. Three papers were reviewed under multimodal features (Fig. 2). Figure 3 shows the distribution of the literature selected. The number of articles reviewed under each method is shown in Fig. 4.
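As a quick consistency check, the per-modality counts stated above can be tallied. The snippet below is only an illustrative sketch; the dictionary names are hypothetical and the numbers are copied from the paragraph above.

```python
# Paper counts per modality, copied from the text above.
brain_observation = {"SPECT": 13, "MR/MRI": 7, "EEG": 8}
motor_symptoms = {"voice": 22, "handwriting": 20, "gait": 13}
multimodal = 3

brain_total = sum(brain_observation.values())
motor_total = sum(motor_symptoms.values())
categorized_total = brain_total + motor_total + multimodal

print(brain_total, motor_total, categorized_total)
```

The subtotals agree with the 28 and 55 papers reported for the two main categories.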

Fig. 1 PD symptoms are classified into motor and non-motor symptoms

Fig. 2 PRISMA model

Fig. 3 Classification of PD detection analyzed by various studies in this review

Fig. 4 Distribution of literature selected in this review

Therefore, the main research objectives of our review study are listed as follows:

  • Firstly, various ML and DL techniques for diagnosing PD are reviewed.

  • Secondly, the role and importance of diagnosing PD using motor symptoms, multimodal features and brain observation methods are studied.

  • Thirdly, the pros and cons of the studies and the future scope are discussed, which could benefit the PD community.

2 Search strategy and data extraction

The selection of papers plays a major role in a systematic review. The articles were selected based on our research objectives. The search for relevant articles was performed on the SCOPUS, Science Direct, Google Scholar, IEEE Xplore and Springer databases, which are among the most popular and efficient databases for searching articles. This review analyzed literature published from 2012 to 2023. The PRISMA guidelines were followed for the quality assessment of the studies, as depicted in Fig. 2. The search keywords were connected through AND or OR operators to find relevant studies. All the authors contributed equally to the selection of papers as per the inclusion and exclusion criteria. In addition to searching these databases, the authors also searched the reference lists of the selected articles. The keywords used for searching are given below:

The main search keyword was “Detection of Parkinson’s Disease” AND “2012–2023”, which resulted in 5297 documents.

Letters, editorial documents, comments, errata, retracted documents, documents in languages other than English and duplicate documents were excluded. After this refinement, 1252 documents remained.

The search results were further refined by the following keywords: “ML” AND “DL” AND “Intelligent systems” AND (“SPECT” OR “MRI” OR “EEG” OR “Speech” OR “Handwriting” OR “Gait”) AND motor symptoms. This review includes several modalities: SPECT, MRI/MR and EEG under brain observation, and voice impairment, handwriting dynamics and gait under motor symptoms. Therefore, the search was further refined for each specific modality. After extensive and careful selection, a final list of the 131 most relevant documents that met our research objectives was obtained.

2.1 Inclusion and exclusion criteria

The inclusion and exclusion criteria for our study are listed below:

Inclusion criteria:

  • Research papers that involved ML and DL for diagnosing PD.

  • Research papers that aimed at PD diagnosis based on only motor symptoms, namely handwriting, voice impairment, gait, and brain observation techniques, namely SPECT, MR/MRI, EEG recordings and multimodal features.

  • Research papers written in English.

Exclusion criteria:

  • Research papers that aimed at PD diagnosis based on only non-motor symptoms.

  • Research papers published before 2012.

  • Letters, editorial documents, comments, errata and retracted documents.

The organization of this paper is as follows: Sects. 3, 4 and 5 discuss the literature survey on brain observation, motor symptoms and multimodal features, respectively. Section 6 briefly discusses the literature, along with the pros and cons. Lastly, Sect. 7 discusses the conclusion and future scope.

3 Brain observation methods

The brain observation methods are discussed in this section. These methods are SPECT images, MR images, and EEG signals.

3.1 SPECT images

Rumman et al. [10] used dopamine transporter scans (DaTscan) to identify PD at an early stage. A total of 100 healthy control (HC) and 100 PD images were collected from the Parkinson's Progression Markers Initiative (PPMI); 150 images were used to train an artificial neural network (ANN) and 50 images were used for testing. In image preprocessing, spatial normalization was carried out to bring the images into the same orientation, and the images were sharpened with unsharp masking. Dynamic thresholding was used for edge detection and binarization, and a sequential grass fire algorithm was used to detect boundaries and connected pixels. They achieved accuracy, sensitivity, and specificity of 94%, 100%, and 88%, respectively. Ortiz et al. [11] detected PD using a convolutional neural network (CNN) architecture. To improve the performance of the CNN architecture, iso-surfaces were used to extract suitable features. Two CNN architectures, LeNet and AlexNet, were used, obtaining 95.1% accuracy. They observed that the use of iso-surfaces made the input simpler. Prashant et al. [12] used the striatal binding ratio (SBR) of four brain regions obtained from the SPECT images. A Support Vector Machine (SVM) and logistic regression (LR) were used for performance evaluation. A comparison was made between SVM with a Radial Basis Function (RBF) kernel and SVM with a linear kernel. They observed that SVM with the RBF kernel performed best, with accuracy, sensitivity, and specificity of 96.14%, 96.55%, and 95.03%, respectively, using tenfold cross-validation (CV). Prashant et al. [13] produced 97.29% accuracy using shape and surface fitting features obtained from SPECT images.
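Accuracy, sensitivity and specificity recur throughout this review, so a minimal sketch of how they follow from confusion-matrix counts may be useful. The function name and the counts below are hypothetical: the source does not state how the 50 test images in [10] were split, so an even split of 25 PD and 25 HC is assumed purely for illustration.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # share of PD cases correctly detected
    specificity = tn / (tn + fp)  # share of HC cases correctly rejected
    return accuracy, sensitivity, specificity


# Assumed (not reported) test split: 25 PD and 25 HC images.
acc, sens, spec = classification_metrics(tp=25, fn=0, tn=22, fp=3)
print(f"accuracy={acc:.0%}, sensitivity={sens:.0%}, specificity={spec:.0%}")
```

Under this assumed split, three misclassified HC images and no missed PD images would be consistent with the 94%, 100% and 88% figures reported in [10].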

Choi et al. [14] proposed a deep learning-based automated SPECT interpretation system for detecting PD. PD Net, a CNN architecture, was proposed with SPECT images fed as input. They used two datasets obtained from PPMI and Seoul National University Hospital. They obtained 98.8% accuracy on the dataset collected from Seoul National University Hospital. Rojas et al. [15] proposed a computer-aided diagnosis (CAD) system for PD based on empirical mode decomposition. Their approach aimed to improve on the voxels-as-features-based system. A total of 80 DaTscan images were used. Feature extraction was done using Principal Component Analysis (PCA) and Independent Component Analysis. The classification was done using SVM. They found that PCA performed extremely well.
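A feature-extraction-then-classification pipeline of the kind described above (PCA followed by an SVM) can be sketched with scikit-learn. This is not the code of any reviewed study: the data are synthetic, and the number of voxel features (500), the number of components (10) and the linear kernel are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for 80 flattened scans (40 HC, 40 PD).
n_per_class, n_voxels = 40, 500
hc = rng.normal(0.0, 1.0, size=(n_per_class, n_voxels))
pd_group = rng.normal(0.0, 1.0, size=(n_per_class, n_voxels))
pd_group[:, :20] -= 1.5  # assumed intensity reduction in a few voxels

X = np.vstack([hc, pd_group])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Standardize, reduce to 10 principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="linear"))

# Tenfold cross-validation; PCA is refit inside each training fold.
scores = cross_val_score(model, X, y, cv=10)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```

Placing PCA inside the pipeline keeps the held-out fold from influencing the learned components, which would otherwise inflate the reported accuracy.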

Figure 5 shows the SPECT images in which the shape of the striatum region of PD patients seems to be smaller and distorted, whereas the shape is C-shape for HC. Pahuja et al. [16] have used the SBR values and the biomarkers, namely serum, urine, plasma, CSF, and RNA, for diagnosing PD. The performance was tested by taking only SBR, five biomarkers and then SBR + combination of all the biomarkers. SBR with five biological biomarkers provided an accuracy rate of 100%.

Fig. 5 SPECT scan images for (a) HC and (b) PD patient [20]

Bhalchandra et al. [17] used SPECT images to classify PD patients from HC. They had 163 PD images (stage I or II according to the Hoehn and Yahr scale) and 187 HC images. They used three features: SBR, radial and gradient features. The classification was done using SVM (linear and RBF kernels) and Linear Discriminant Analysis (LDA). The combination of the three features performed best, with an accuracy of 99.42% using the SVM classifier for both the linear and RBF kernels. Palumbo et al. [18] observed that non-clinical features like age are essential for classification. SPECT scans of 90 patients were analyzed with the Basal Ganglia V2 software. They used semi-quantitative data and age as features, which were classified by SVM with two validation methods: leave-one-out and fivefold validation. They observed that the addition of age as a feature improved the accuracy. Hajer Khachanoui et al. [19] opted for a clustering method for the diagnosis of PD using 4 SPECT imaging data and 5 clinical data. They obtained 64% accuracy using density-based spatial clustering.

Interpreting a model’s predictions is vital for clinicians and researchers. Magesh et al. [20] used such an interpretability method, Local Interpretable Model-Agnostic Explanations (LIME), which gives a justification for the final prediction (PD or HC) of their proposed model. They used SPECT images to classify PD patients vs. HC. The dataset consisted of 430 PD and 212 HC SPECT images collected from the PPMI database. They used a transfer learning model (Visual Geometry Group 16 (VGG16)) and reported accuracy, specificity, and sensitivity of 95.2%, 90.9%, and 97.5%, respectively. Kurmi et al. [21] used DaTscan with ensemble learning and developed software based on their proposed model for the detection of PD. The dataset was obtained from PPMI, consisting of 432 PD and 213 non-PD patients. They used the DL models VGG16, Xception, ResNet50 and Inception-V3, whose results were ensembled with Fuzzy Rank Level Fusion. They obtained an accuracy of 98.45%, precision of 98.84%, F1-score of 98.84%, and sensitivity and specificity of 98.84% and 97.67%, respectively. Figure 6 shows the studies that achieved an accuracy of more than 90%.

Fig. 6 SPECT scans that achieved more than 90% accuracy

Thakur et al. [22] tested their model with a larger dataset and obtained a high classification performance. They used 1390 DaTscan images from the PPMI repository and a CNN to classify PD. The DenseNet architecture was compared with Inception ResNet, ResNet, MobileNet, EfficientNet V2 and Xception. They obtained an overall accuracy of 99.2% and an area under the receiver operating characteristic curve (AUC-ROC) of 99%.

3.2 MRI/MR images

The 3D CNNs employed in [23, 24] improved the detection accuracy. Chakraborty et al. [23] used 3T T1-weighted MRI images of the brain with a 3D CNN to diagnose early PD. These scans were collected from PPMI, of which 203 were from the PD group and 203 from the HC group. An overall accuracy, average recall, average precision, average specificity, F1-score, and ROC-AUC of 95.29%, 0.943, 0.927, 0.9430, 0.936, and 0.98, respectively, were obtained. Vyas et al. [24] compared a 3D CNN and a 2D CNN trained on MRI images in the axial plane to detect PD. The dataset had 318 MRI images collected from PPMI. The scans were preprocessed using bias field correction, histogram matching, z-score normalization and image resizing to improve the model's performance. The model was evaluated using loss, accuracy, confusion matrix, precision-recall, and ROC curves. The 3D CNN approach classified PD better, with an accuracy of 88.9% and AUC of 0.86, than the 2D CNN, which had an average accuracy of 72.22% and AUC of 0.50.
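Of the preprocessing steps mentioned above, z-score normalization is the simplest to illustrate. The following NumPy sketch is an assumption-laden example rather than the preprocessing code of [24]: the function name, the `eps` guard and the synthetic volume are all hypothetical.

```python
import numpy as np


def zscore_normalize(volume, eps=1e-8):
    """Rescale an image volume to zero mean and unit standard deviation."""
    volume = np.asarray(volume, dtype=np.float64)
    # eps guards against division by zero for a constant image.
    return (volume - volume.mean()) / (volume.std() + eps)


rng = np.random.default_rng(0)
scan = rng.uniform(0, 4095, size=(8, 16, 16))  # synthetic stand-in for an MRI volume
normalized = zscore_normalize(scan)

print(normalized.mean(), normalized.std())
```

Normalizing each scan this way removes scanner-dependent intensity offsets and scaling, so the network sees inputs on a comparable range.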

Kaur et al. [25] used a CNN to detect PD patients. They employed generative adversarial network-based data augmentation and AlexNet on MR images for classification. The dataset consisted of 504 images, of which 360 images were used for augmentation. The augmented training set was applied to a pre-trained AlexNet, and fine-tuning was done at the dense layer. They obtained an accuracy, sensitivity, and specificity of 89.23%, 90.27%, and 89.03%, respectively. Amoroso et al. [26] used MRI images to detect PD. They obtained their dataset from PPMI, of which 374 were PD and 169 were HC. Feature selection was done using Random Forest (RF), and classification was performed using SVM with tenfold CV. They observed that their model performed best when using network + clinical features, with an accuracy, specificity, sensitivity and AUC of 0.93 ± 0.04, 0.92 ± 0.07, 0.93 ± 0.06 and 0.97 ± 0.02, respectively. The ensemble of VGG16 and ResNet50 proposed by Sri Lakshmi et al. [27] produced 96.09% accuracy using MRI images. A hybrid approach of Genetic Algorithm (GA)-based segmentation with CNN was proposed by SreeLakshmi and Mathew [28].

Sivaranjini and Sujatha [29] detected PD using the brain's MR images. Their dataset had 100 PD and 82 HC. The detection was carried out by AlexNet, which has five convolution layers and three fully connected layers. A Rectified Linear Unit was added to all five convolution layers. They obtained an accuracy and AUC of 88.9% and 0.9618, respectively (Fig. 7).

Fig. 7 General CNN architecture

Solana-Lavalle and Rosas-Romero [30] applied voxel-based morphometry (VBM) to MR images to distinguish PD from HC. They performed separate analyses for males and females. VBM was used to extract the area of interest. Features were extracted through first- and second-order statistical approaches and selected through PCA and wrapper feature subset selection, followed by classification using K-Nearest Neighbors (KNN), multilayer perceptron (MLP), SVM, RF, Naïve Bayes, a logistic classifier and Bayesian networks. They observed a high accuracy of 99.01% and 96.97% for males and females, respectively.

3.3 EEG signals

A balanced dataset is important to avoid bias in the model. In most of the EEG analyses, the number of HC and PD subjects was usually between 15 and 20 per group, which was smaller than in the other brain observation approaches studied in this review. Neural network implementations have provided optimized results in EEG analysis. EEG signals also aid in seizure [31], depression [32], pathology [33, 34] and abnormal EEG [35] detection. The studies that detected PD using EEG signals are given below, and Fig. 8 shows the accuracy obtained.

Fig. 8 Best accuracy obtained using EEG recording

Oh et al. [36] used EEG signals to diagnose PD. They implemented a CAD system based on a 13-layer CNN. A total of 20 PD and 20 HC EEG signals were studied. The convolutional layer convolves its kernels with the input to generate a feature map. The CNN includes convolution, max-pooling, and fully connected dense layers. A typical CNN architecture is shown in Fig. 7. To enable fast learning and boost performance, batch normalization was used. The Rectified Linear Unit activation function was applied to every layer, and SoftMax was applied to the final layer. Accuracy, sensitivity and specificity were 88.25%, 84.71%, and 91.77%, respectively. Lee et al. [37] proposed a convolutional Recurrent Neural Network (CRNN) with gated recurrent units. This network classifies PD patients using EEG signals from 20 PD and 21 HC. The 1D CNN layers were used to extract spatiotemporal features from the EEG signal. These features were passed to the gated recurrent units to find temporal features. Their model achieved an accuracy, precision, and recall of 99.2%, 98.9%, and 99.4%, respectively. In another study by Xinjie [38], 3D CNN-RNN and 2D CNN-RNN outperformed the standard CNN and RNN with accuracies of 82.89% and 81.13%, respectively.
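The building blocks named above (convolution, Rectified Linear Unit, max-pooling, a dense layer and SoftMax) can be illustrated with a single forward pass in NumPy. This is a toy sketch with random, untrained weights and a synthetic one-channel signal; it is not the 13-layer network of [36], and all sizes and function names are assumptions.

```python
import numpy as np


def conv1d(signal, kernel):
    """Valid 1D cross-correlation, the operation used by CNN convolution layers."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])


def relu(x):
    """Rectified Linear Unit: negative activations are set to zero."""
    return np.maximum(x, 0.0)


def max_pool1d(x, size=2):
    """Non-overlapping max-pooling; a trailing remainder is dropped."""
    usable = (len(x) // size) * size
    return x[:usable].reshape(-1, size).max(axis=1)


def softmax(z):
    """Numerically stable softmax that maps scores to class probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()


rng = np.random.default_rng(0)
eeg = rng.normal(size=64)         # synthetic one-channel EEG segment
kernel = rng.normal(size=5)       # untrained convolution weights
dense = rng.normal(size=(2, 30))  # untrained dense layer for two classes (PD, HC)

feature_map = conv1d(eeg, kernel)          # length 64 - 5 + 1 = 60
features = max_pool1d(relu(feature_map))   # length 30
probs = softmax(dense @ features)          # two class probabilities

print(probs)
```

In a trained network the kernel and dense weights would be learned from labeled EEG segments; here they only demonstrate how the shapes flow through the layers.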

Khare et al. [39] proposed a Parkinson disease convolutional neural network, which couples the smoothed pseudo-Wigner Ville distribution (SPWVD) with a CNN. Two datasets were used. The first was an OpenNeuro dataset with 15 PD patients and 16 HC taken from the University of California, San Diego. The second contains EEG recordings of 20 PD patients and 20 HC taken from Henan Provincial People’s Hospital (People’s Hospital of Zhengzhou University). The EEG signals were transformed to a time–frequency representation using SPWVD, and this was fed to the CNN, which was evaluated using tenfold CV. For detecting PD patients, the model obtained 100% accuracy on dataset 1 and 99.97% accuracy on dataset 2. Anjum et al. [40] detected PD using EEG recordings of 41 PD patients and 41 HC. The linear predictive coding EEG algorithm for PD was used to transform the recorded EEG time series into features. The in-sample evaluation on 27 PD patients and 27 HC, using multiple CV, gave an accuracy of 85.3 ± 0.1%, AUC of 93.3 ± 0.5%, sensitivity of 87.9 ± 0.9%, and specificity of 82.7 ± 1.1%. The out-of-sample test on 14 PD patients and 14 HC gave an accuracy, AUC, sensitivity and specificity of 85.7%, 85.2%, 85.7% and 85.7%, respectively.

Lee et al. [41] proposed a CRNN model for detecting PD using EEG signals. EEG signals of 20 PD patients and 21 HC were recorded while participants focused on a particular target on a computer screen, once and twice (on and off medication). The CRNN was used to extract features from the EEG. The output of two 1D CNNs provided spatial features, which were given to an RNN with Long Short-Term Memory (LSTM) to find temporal features. They obtained an accuracy, recall, and precision of 96.9%, 93.4%, and 100%, respectively. Loh et al. [42] proposed a 2D-CNN for detecting PD using EEG signals of 15 PD and 16 HC collected from the publicly available OpenNeuro dataset. In the preprocessing stage, the EEG signals were converted to spectrograms using the Gabor transform, which were given as input to the 2D-CNN with tenfold CV. Their proposed model achieved an accuracy of 99.46% in classifying HC and PD (with and without medication).
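The Gabor transform mentioned above is a short-time Fourier transform with a Gaussian window, and its squared magnitude gives a spectrogram that can be treated as a 2D image. The following NumPy sketch shows the idea on a synthetic 20 Hz sine wave; the sampling rate, window length, hop size and Gaussian width are illustrative assumptions, not the settings used in [42].

```python
import numpy as np


def gabor_spectrogram(x, fs, win_len=128, hop=32, sigma=16.0):
    """Power spectrogram from a short-time Fourier transform with a Gaussian window."""
    n = np.arange(win_len) - (win_len - 1) / 2
    window = np.exp(-0.5 * (n / sigma) ** 2)  # Gaussian window centred on the frame
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[s:s + win_len] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    return freqs, power.T  # rows: frequency bins, columns: time frames


fs = 256                          # assumed sampling rate in Hz
t = np.arange(0, 4, 1 / fs)       # four seconds of signal
x = np.sin(2 * np.pi * 20 * t)    # synthetic 20 Hz oscillation

freqs, spec = gabor_spectrogram(x, fs)
peak_hz = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_hz)
```

The resulting array has one row per frequency bin and one column per time frame, which is the kind of image-like input a 2D-CNN can consume.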

Yuvaraj et al. [43] used EEG signals with higher-order spectra features for diagnosing PD. They recorded EEG signals from 20 PD patients on medication and 20 HC, each in a resting state, from which higher-order spectra (bispectrum) features were taken. The classifiers used were KNN, Fuzzy KNN, decision tree (DT), probabilistic neural network and SVM, and the proposed model was validated by tenfold CV. They found that SVM with the RBF kernel performed best, with accuracy, sensitivity, and specificity of 99.62%, 100%, and 99.25%, respectively.

Majid Nour et al. [44] used an ensemble learning approach (Dynamic Classifier Selection in Modified Local Accuracy) and 1D-PDCovNN to diagnose PD using an EEG dataset. Their dataset consists of 15 PD and 16 HC. The ensemble approach obtained the highest accuracy of 99.31%.

Table 1 summarizes the results obtained from brain observation.

Table 1 Summary of brain observation approach reviewed in this study

4 Motor symptoms

Different methods of body observation to detect PD are discussed in this section. The methods are voice signals, handwriting, and gait movements.

4.1 Voice

Parkinson's patients experience vocal impairments such as trouble pronouncing words and breathy, hoarse voices. Voice is one of the noticeable early symptoms. Therefore, many studies have used voice as a symptom to detect PD using ML and DL. Kuresan et al. [45] detected PD using speech signals. The dataset had 20 PD and 20 HC obtained from the UCI ML repository. They used the wavelet packet transform (WPT), Mel-frequency cepstral coefficients (MFCC) and a fusion of MFCC and WPT for feature extraction. The classifiers used were the Hidden Markov Model and SVM. They observed that the fusion of MFCC and WPT with the Hidden Markov Model performed best, obtaining an accuracy, sensitivity, and specificity of 95.16%, 93.55%, and 91.67%, respectively. Gunduz [46] detected PD using vocal disorders of PD patients. The dataset of 252 patients was collected from the UCI ML repository. This dataset had four feature sets: Tunable Q-factor wavelet transform, wavelet, MFCC and Concat. In the first network, all the features were combined and given as input to a 9-layer CNN. In the second network, the features were passed to the input layers in parallel. The performance of the two networks was checked by leave-one-person-out CV. An accuracy of 0.869 was obtained, and the second network provided good results.

Jeancolas et al. [47] used X-vectors for early detection of PD from patients' voices. X-vectors derived from a DNN give excellent speaker recognition given substantial training data. The performance of this technique was compared with MFCC-GMM (Mel-frequency cepstral coefficients-Gaussian mixture model). A total of 221 PD and HC subjects were recorded using microphone and telephone systems. To check whether there were gender effects due to PD, they analyzed men and women separately, which gave more accurate models. They tested various aspects, including the impact of data augmentation, audio segment duration, the kind of speech task, the type of dataset used for neural network training, and back-end analyses. The X-vector method gave the better performance for text-independent speech tasks recorded with a microphone and telephone. Detection of PD in women gave more promising results than in men.

Karaman et al. [48] developed a CNN based on voice biomarkers (sustained vowels). The database was collected from mPower Voice. They performed data preprocessing and fine-tuning-based transfer learning. Three architectures, SqueezeNet1_1, ResNet101 and DenseNet161, were retrained and fine-tuned to classify frequency-time information. The DenseNet architecture had the best performance in detecting Parkinson’s patients, obtaining an accuracy, sensitivity, and precision of 89.75%, 91.50% and 88.4%, respectively. Figure 9 shows the role of the larynx and vocal cords in voice production.

Fig. 9 Sound production in a person [48]

By examining voice samples, Devarajan et al. [49] distinguished PD patients from HC using fog computing, an intelligent layer between a cloud server and end devices. They used a combined Fuzzy K-nearest Neighbor-Case-based Reasoning classifier (FKNN-CBR) for better classification. The UCI Parkinson dataset, with 195 voice recordings of PD patients and healthy subjects, was used to evaluate this technique. They compared FKNN-CBR with other classifiers such as Naïve Bayes, J48, Random tree, SVM, and K-nearest neighbor. The dataset was divided into two groups, one for training and the other for testing. FKNN-CBR had an accuracy of 94.87%, higher than the other classifiers.

Yaman et al. [50] used statistical pooling to increase the features of vowels taken from UCI dataset and used ReliefF for feature selection. They obtained 91.25% and 91.23% accuracy using SVM and KNN, respectively.

Rahman et al. [51] used an MFCC-LDA-SVM technique for PD detection in the cepstral domain of voice signals. MFCC was used for extracting the features. LDA was used for classification and dimensionality reduction of the extracted features, and SVM for classification. The proposed model was validated through a leave-one-subject-out (LOSO) validation scheme, together with 10 unique ML models. The AUC, sensitivity, and specificity were 88%, 73.33%, and 84%, respectively. Sakar et al. [52] detected PD by collecting various sound recordings. The dataset consisted of 20 PD patients (six female, 14 male) and 20 healthy individuals (ten female, ten male), collected from the Department of Neurology in Cerrahpaşa Faculty of Medicine, Istanbul University. To collect the voice data, the PD patients and the healthy group were instructed to read or spell vowels, words, and sentences. The performance was analyzed by KNN (LOSO and SLoo CV) and SVM (LOSO and SLoo CV). They found that sustained vowels were more helpful in detecting PD.
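The LOSO scheme used by several of these voice studies holds out all recordings of one subject per fold, so that no speaker appears in both the training and the test set. A minimal sketch of the split logic is given below; the function name and the layout of four subjects with three recordings each are hypothetical.

```python
import numpy as np


def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) pairs, holding out one subject per fold."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical layout: four subjects with three recordings each.
subjects = np.repeat(["s1", "s2", "s3", "s4"], 3)
splits = list(leave_one_subject_out(subjects))

for train_idx, test_idx in splits:
    print("held out:", subjects[test_idx][0], "train size:", len(train_idx))
```

Splitting by recording instead of by subject would let the classifier recognize a speaker's voice rather than the disease, which tends to overstate accuracy.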

Ali et al. [53] developed a hybrid intelligent system that uses voice impairments to detect PD patients. The dataset was obtained from Sakar et al. [52] at the neurology department of the Cerrahpaşa Faculty of Medicine, Istanbul University. This data had a gender imbalance. To avoid subject overlap, LOSO-CV was used. LDA was used for dimensionality reduction. A classification neural network, with a GA for optimizing hyperparameters, was used to reduce validation loss. For the LDA-neural network-GA model, the evaluation parameters were accuracy, sensitivity, specificity and the Matthews correlation coefficient. The results obtained were 95% and 100% accuracy for training and testing, respectively. By removing gender-dependent features, they obtained 80% accuracy in training and 82.14% in testing.

Gürüler [54] used k-means clustering-based feature weighting (KMCFW) and a complex-valued ANN (CVANN) for the detection of PD. The data consisted of features of speech and sound samples collected from the UCI ML repository, with a total of 195 sound samples. The dataset was preprocessed with KMCFW, which significantly reduces the variance of the features and improves accuracy. A complex value obtained from the feature values was given as input to the CVANN. The performance was evaluated by f-measure, accuracy, specificity, sensitivity, and the kappa statistic. The proposed model obtained 99.52% classification accuracy, and the complex-valued ANN provided better accuracy than a real-valued ANN.

Benba et al. [55] detected PD using 34 sustained vowels, of which 17 were from PD patients. MFCC coefficients 1 to 20 were extracted from each participant. LOSO validation with SVM (different kernels) was used for classification. They obtained the highest accuracy of 91.17% using SVM (linear kernel) when only the first 12 MFCC coefficients were taken. Using a vocal dataset, Kamalakannan et al. [56] detected PD using ML algorithms. The dataset consisted of vocal data of both PD patients and healthy groups collected from the UCI ML repository. A total of 26 data items were available for PD patients and HC. The Minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm was applied during the preprocessing stage for better accuracy. After this, a stacked autoencoder increased the model’s performance further. Artificial Immune Recognition System-Parallel was used for classification, and K-fold CV was used for evaluating performance. This model attained a high accuracy of 97%.

Pramanik et al. [57] detected PD by proposing a model that relies on vocal fold, baseline, and time–frequency features. They had 752 acoustic features from 252 subjects. These features were ranked using correlation feature selection, mutual information-based feature selection and Fisher score feature selection, and then passed to the Naïve Bayes algorithm. They obtained an accuracy and precision of 78.97% and 0.926, respectively, with hold-out CV. Senturk [58] used ML techniques with speech data for early PD identification. A total of 23 speech signal features were extracted from 31 subjects, of which 23 were PD patients and the remainder were HC. Feature selection was then performed using Recursive Feature Elimination (RFE), univariate selection and feature importance. SVM and ANN were the classifiers used. They obtained the best accuracy of 93.84% from SVM with RFE.
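Recursive Feature Elimination with an SVM, as named above, repeatedly fits the classifier and discards the feature with the smallest weight. The scikit-learn sketch below shows the mechanism on synthetic data; it is not the code of [58]. The sample size, the rule that generates the labels and the choice to keep five features are assumptions, while the 23 input features mirror the count reported in the text.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 120 recordings with 23 features each.
n_samples, n_features = 120, 23
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels depend on features 0 and 1 only

# A linear SVM exposes coef_, which RFE uses to rank features.
selector = RFE(SVC(kernel="linear"), n_features_to_select=5, step=1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)
print("selected feature indices:", selected)
```

A linear kernel is used here because RFE needs per-feature weights; an RBF-kernel SVM does not provide them directly.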

Mittal et al. [59] proposed two models with acoustic features from the UCI ML repository for classifying PD and HC. The dataset consisted of 40 HC and 40 PD. The first and second approaches grouped HC and PD into three equal parts. Feature selection was done using PCA, followed by classification using Medium Gaussian Kernel SVM (MGSVM), weighted k-NN (wkNN) and LR. They observed that the second model with wkNN + PCA obtained the highest accuracy of 90.3%. Hariharan et al. [60] proposed a hybrid architecture for the classification of PD. The dataset, taken from the UCI ML repository, had 22 dysphonia features. The features were preprocessed using a Gaussian mixture model (GMM). PCA and LDA were used for feature reduction, and distinctive features were selected using sequential forward and backward selection. Finally, the classification was done using three classifiers: probabilistic neural network, general regression neural network and least-square SVM, with validation performed using conventional validation and tenfold CV. They obtained an accuracy of 100% for their hybrid model.

Yadav et al. [61] used ML and ensemble techniques to detect PD using voice impairments. The dataset contained 20 PD and 20 HC collected from UCI ML repository that contained 26 features. A total of 5 ML classifiers and five ensemble techniques were used. They observed that SVM and bagging provided the highest accuracy of 93.83% and 73.28%, respectively.

Polat et al. [62] detected PD using features obtained from voice signals. The dataset was obtained from the UCI ML repository and consisted of 40 PD and 40 HC. Their data was sampled using the one-against-all method and categorized into five parts. These partitioned data were classified using wkNN, LR, and SVM with a medium Gaussian kernel function. They observed that wkNN classified better, with an accuracy of 88.48% in the first approach and 89.46% in the second approach. Vikas Chaurasia et al. [63] obtained 24 features from the UCI ML repository. The authors proposed four models: a base model, which comprised ML techniques such as LR, KNN, Naïve Bayes (NB), DT and SVM; a meta model, which combines the classifiers used in the base model; an ensemble model consisting of AdaBoost, RF, Bagging and Gradient Boosting; and finally a K-fold model. They observed that the Gradient Boosting ensemble obtained the highest accuracy of 97.43%. Priya Das et al. [64] proposed a voting ensemble weighted extreme learning machine classifier with a binary cuckoo search technique using a voice dataset for PD diagnosis. They obtained an average accuracy of 99.21%, sensitivity of 100% and specificity of 98.90%. Aditya Shastry [65] opted for an ensemble approach for the early detection of PD on a speech dataset. The authors proposed nearest neighbor boosting, combining KNN and Gradient Boosting. They used Feature Permutation, Mean Decrease in Impurity and Pearson’s Correlation for feature selection. They observed an improvement in the performance metrics using their proposed model when compared with several popular models. In another study by Ouhmida et al. [66], KNN outperformed 8 ML classifiers with an accuracy of 97.22% using a speech dataset.

4.2 Handwriting

Parkinson's disease affects the writing ability of the patient. The handwriting of PD patients is often distorted and smaller than that of healthy individuals due to tremors, slowness, and rigidity. The most used handwriting tasks were spirals and meanders. Ali et al. [67] detected PD using the handwriting of PD patients. They used KNN, Gaussian Naïve Bayes (GNB), LDA, and decision tree. Class imbalance led to bias and low accuracy, so they used random under-sampling on the training data to eliminate the bias. A cascaded model (Chi2 with AdaBoost) was used to improve the accuracy. This cascaded system performed better, with accuracy, sensitivity, and specificity of 76.44%, 70.94% and 81.94%, respectively.
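
Random under-sampling, as mentioned for [67], discards samples from the larger class until all classes have the same size. The following is a minimal sketch under stated assumptions: the function name, the fixed seed and the toy data are hypothetical, and it is not the authors' implementation.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class has the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Keep n_min randomly chosen samples of this class.
        out_x.extend(rng.sample(xs, n_min))
        out_y.extend([y] * n_min)
    return out_x, out_y


# Toy data: 7 PD samples (label 1) and 3 HC samples (label 0).
samples = list(range(10))
labels = [1] * 7 + [0] * 3
balanced_x, balanced_y = random_undersample(samples, labels)
print(balanced_y.count(1), balanced_y.count(0))
```

Under-sampling should be applied to the training split only, so that the test data keeps its original class distribution.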

Pereira et al. [68] used handwriting dynamics to diagnose PD. The subjects performed spiral and meander tasks using a smartpen equipped with sensors. The signals obtained from the sensors were transformed into pictures and represented as time-based images. The total dataset consisted of 224 PD and 84 HC. Two experiments were made with image resolutions of 64 × 64 and 128 × 128, each with two splits (75% training, 25% testing and 50% training, 50% testing). To check the accuracy, they used ImageNet, CIFAR-10, LeNet and OPF (Optimum-Path Forest). Gil-Martín et al. [69] applied a CNN architecture to the drawing movements of PD patients to detect Parkinson's disease. The dataset consisted of 62 PD patients and 15 HC. The subjects were asked to perform three tests: static spiral, dynamic spiral, and stability test. They used a fivefold CV. They examined the discrimination capability of different directions during drawing movements; the X and Y directions performed best. The accuracy obtained was 96.5%, the F1-score was 97.7%, and the AUC was 99.2%.

Lamba et al. [70] detected PD patients from handwriting dynamics. The dataset was collected using a digitized graphics tablet from UCI PD spiral drawings. Twenty-nine features were extracted and reduced using GA and the mutual information gain feature selection technique. To tackle class imbalance, they used the synthetic minority oversampling technique (SMOTE). AdaBoost, RF, SVM and XGBoost were evaluated with tenfold CV. With GA, RF obtained an accuracy of 91.34%. With mutual information gain feature selection, AdaBoost had an accuracy of 96.02%. Khatamino et al. [71] detected PD from spiral-test handwriting dynamics using CNN. The data included 72 spiral handwriting samples, of which 57 were PD and 15 were HC, and covered both a dynamic spiral test and a static spiral test. The proposed CNN model was trained with 90%, 75%, and 50% of the data and tested with the remaining 10%, 25%, and 50%, respectively. They observed that their proposed model obtained an 88% accuracy.
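
SMOTE, used in [70], creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The snippet below is a simplified sketch of that interpolation step only; the function name, the choice of k, the seed and the toy points are assumptions, and a maintained library implementation would normally be preferred.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points between minority samples and neighbours.

    minority: list of feature vectors (lists of floats), at least two of them.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: dist2(base, m),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with three two-dimensional samples.
hc_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(hc_features, n_new=5)
print(len(new_points))
```

Because each synthetic point lies on a line segment between two real samples, it stays inside the region already covered by the minority class.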

Moetesum et al. [72] used handwriting samples to detect PD patients. The graphometer samples were obtained from the PD handwriting database (PaHaW). The dataset contained a total of 72 subjects. They proposed a model that extracts visual features from the samples with a convolutional network. Median residual and edge images were used to further enhance the extracted features. These features were passed to an SVM with tenfold CV. They obtained an accuracy of 83%.

Drotár et al. [73] detected PD using handwriting samples. The dataset contained 37 PD and 38 HC. In-air and on-surface movements while writing a sentence were recorded using a digitized tablet. Only the features that satisfied the Mann–Whitney test were taken for processing. For feature selection, mRmR and sequential forward feature selection were used, and for classification, SVM was used with a leave-one-out approach. The in-air movements were found to classify PD patients more accurately, with an accuracy of 84%, than on-surface movements, with an accuracy of 78%. The accuracy of the combination of the two movements was 85%. Drotár et al. [74] used handwriting as the basis for detecting PD. They used the PaHaW database consisting of 75 subjects, of which 37 were PD patients and 38 were HC. The kinematic and pressure features were analyzed. The features that satisfied the Mann–Whitney U test were taken. KNN, ensemble AdaBoost, and SVM were used to evaluate the performance. They obtained an accuracy, sensitivity, and specificity of 81.3%, 87.4%, 80.9%, respectively, with SVM and showed that pressure features were important, with an accuracy of 82.5% for PD detection.
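
The Mann–Whitney U test used for feature screening in [73, 74] compares the values of one feature in the two groups. As a hedged illustration, the snippet below computes only the U statistic by pair counting; the p-value step and the significance threshold are omitted, and the function name and feature values are hypothetical.

```python
def mann_whitney_u(group_a, group_b):
    """Return the Mann-Whitney U statistic (the smaller of U_a and U_b)."""
    u_a = 0.0
    for x in group_a:
        for y in group_b:
            if x > y:
                u_a += 1.0   # pair where the group_a value is larger
            elif x == y:
                u_a += 0.5   # ties count as half
    u_b = len(group_a) * len(group_b) - u_a
    return min(u_a, u_b)


# Hypothetical values of one handwriting feature for PD and HC subjects.
pd_values = [0.9, 1.1, 1.3]
hc_values = [0.4, 0.5, 0.6]
print(mann_whitney_u(pd_values, hc_values))
```

A U value of 0 means the two groups do not overlap at all for that feature; values near half the number of pairs indicate little separation.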

Afonso et al. [75] carried out the detection of PD using handwriting. The dataset was collected from the São Paulo State University medical school, Botucatu, Brazil, and the signals produced during PD and HC handwriting movements were recorded. The two main drawings in their study were meanders and spirals, with 224 PD and 84 HC images, as shown in Fig. 10. The signals were converted to images using recurrence plots, and the images were given as input to a CNN for classification. The three CNN architectures used in their study were CIFAR10_quick, ImageNet and LeNet. In the classification process, OPF was used. They compared the architectures with different image resolutions and training set sizes to evaluate the effectiveness of the CNN. They achieved a recognition rate above 90% with the help of a recursive approach. Naseer et al. [76] used handwriting to detect PD. They used AlexNet (a 25-layer CNN architecture) with transfer learning and data augmentation. They analyzed freezing and fine-tuning using the Modified National Institute of Standards and Technology and ImageNet datasets. They obtained 98.28% accuracy by fine-tuning AlexNet with the ImageNet and PaHaW datasets. Impedovo et al. [77] aimed to detect PD using dynamic handwriting. They used a subset of PaHaW of 75 subjects, of which 38 were HC and 37 were PD (early and mild severity). They used KNN, SVM, LDA, RF, AdaBoost and GNB. A linear SVM classifier evaluated each feature and ranked it according to its predictive accuracy; only the higher-ranked features were used. They considered two cases: merging features from tasks and an ensemble approach. They obtained 74.76% accuracy using the ensemble method.

Fig. 10 Handwriting tasks showing spiral task for (a) HC, (b) PD, and meander task for (c) HC, (d) PD [75]

Kurt et al. [78] detected PD using handwriting tasks. They obtained the dataset from UCI ML repository, which had a spiral dataset of 57 PD and 15 HC, out of which they used SST and DST. The dynamic time warping method was applied to the spiral dataset. They classified using SVM (linear and RBF kernel) and KNN. Using SVM with linear kernel, they obtained the highest accuracy, MCC, and F-score of 97.52%, 0.9150, and 0.9828. Ujjwal et al. [79] used the handwriting dynamics of males and females and their ages to diagnose PD. Their dataset contained 37 PD subjects, of which 19 were male and 18 were female around age 69.3 ± 10.9 years and 38 HC of which 20 were male and 18 were female around 62.4 ± 11.3 years collected from PaHaW dataset. They performed 7 tasks, and the features were divided into kinematic, Entropic and Energetic. The features were selected by Mann–Whitney U test and classified using SVM RBF. The classification was divided into male/female and young/old classes. They observed that the female class obtained a high accuracy of 83.75%.
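
Dynamic time warping, applied to the spiral data in [78], measures the distance between two sequences that may differ in speed or length. The snippet below is a minimal sketch of the classic dynamic-programming formulation for one-dimensional sequences; the absolute-difference cost and the example sequences are assumptions, not details taken from [78].

```python
def dtw_distance(s, t):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # s advances, t repeats
                cost[i][j - 1],      # t advances, s repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The second sequence is the first one drawn more slowly (one value repeated).
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because repeated values can be aligned to a single value, a slower drawing of the same shape yields a small distance, which makes the measure suitable for comparing pen trajectories of different durations.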

Mucha et al. [79] detected PD based on fractional order derivatives in handwriting. The dataset consisted of 33 PD and 36 HC. They extracted kinematic features and features were selected using Spearman’s and Pearson’s correlation techniques. They used RF with sevenfold CV and obtained an accuracy of 89.81%. Kotsavasiloglou et al. [80] used drawing patterns to detect PD. A total of 44 subjects were considered for their experiment. Using a tablet pen, the subjects were asked to draw ten straight lines (horizontal) with both hands. They extracted mean velocity, normalized velocity variability, standard deviation of velocity, and entropy features of signal. The average was calculated for all subjects in six ways to form a dataset. 13 feature selection methods were performed, followed by classification using classifiers: AdaBoost (J48), Naïve Bayes, LR, SVM, J48 and RF. They obtained an accuracy rate of 91%. Zham et al. [81] detected PD through handwriting dynamics. There were 31 PD and 31 HC (62 subjects) whose handwriting samples were recorded using four tasks. The four tasks were writing a sentence task, letter ‘b’ and ‘d’ separately, letter ‘bd’ together and drawing a spiral (Angular and direction features taken from spiral). The dynamic features were extracted from each task and correlation analysis was done using the Spearman rank order correlation coefficient. The feature selection was made using the ReliefF approach, which was then classified using the Naïve Bayes classifier. They obtained an AUC of 0.933. They observed that using a spiral produced a better classification of PD and HC. Figure 11 shows the best accuracy among handwriting dynamics.

Fig. 11 Best accuracy obtained using handwriting dynamics

Ranjan et al. [82] used two handwriting tasks, namely spiral and wave. In their proposed system, Histogram of Oriented Gradients was used for feature extraction, and the features were passed to RF, obtaining accuracies of 86.67% and 83.30% for the spiral and wave tasks, respectively. Saravanan et al. [83] proposed the VGG19-INC DL model, whose inputs were spiral and wave tasks. They also implemented LIME for model interpretability. They obtained a high accuracy of 98.45%. Thakur et al. [84] aimed to detect PD using static and dynamic spiral handwriting tasks. Their dataset consisted of 62 PD and 15 HC. LR and SVM were analyzed for these tasks individually. The tasks were then combined as input to a fusion of a multilayer perceptron and a restricted Boltzmann machine. They obtained an accuracy of 95.32% using their proposed approach. Kamran et al. [85] proposed a method for early diagnosis of PD by transfer learning. They combined handwriting samples collected from HandPD and NewHandPD. They used 6 CNN architectures, of which AlexNet obtained the highest accuracy of 99.22% with fine-tuning.

4.3 Gait movements

Tremor is also one of the early symptoms of PD. Unlike the other motor symptoms discussed above (voice and handwriting), observing gait and extracting features requires sensors that are placed correctly and used under supervision. Arora et al. [86] provided every subject with a smartphone whose accelerometer recorded a self-administered movement test for gait and postural sway, proving its feasibility. The limitation was that they could not test for some gait and postural sway characteristics that could be detected with advanced wearable equipment. In comparison, voice and handwriting can be recorded at home through a smartphone or another device with less computational cost and time. Although gait analysis has cost limitations, it has proved to be an effective symptom for diagnosing PD and predicting its severity. The literature below describes the methods authors have employed to detect PD using gait datasets. Tong et al. [87] classified the severity of PD patients based on the persistent entropy of topological imprints and permutation variable importance (PVI). The data were collected from Physionet and included signals from normal walking and dual-task walking. They used SVM to classify the samples by PD severity level and obtained 98.08% accuracy, evaluated using a 10-fold CV.
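
Many of the reviewed studies report accuracy under k-fold cross-validation, such as the 10-fold CV above. As a rough sketch of how the folds are formed, the snippet below splits sample indices into k contiguous folds; shuffling and stratification by class, which are common in practice, are left out, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples once."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train_idx, test_idx))
        start = stop
    return folds


# 23 hypothetical recordings split into 10 folds.
folds = k_fold_indices(23, 10)
print([len(test) for _, test in folds])
```

Each sample appears in exactly one test fold, and the reported accuracy is usually the mean over the k test folds.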

Setiawan et al. [88] used vertical ground reaction force (VGRF) signals to detect PD patients. Because PD patients suffer from gait impairments, the force varies with the severity of the disease. The dataset was obtained from the Physionet database, which had 93 PD patients. VGRF signals were divided into time windows (10 s, 15 s, 30 s). A continuous wavelet transform was used to convert the time-domain VGRF signals into time–frequency spectrograms; this process was called feature transformation. PCA was used for feature enhancement, and various CNN models were deployed for classification: GoogLeNet, AlexNet, ResNet-101 and ResNet-50. The ResNet-50 model gave an average accuracy of 96.52% with a tenfold CV. Balaji et al. [89] used an LSTM network to identify PD and rate its severity. The VGRF for three unique walking patterns was collected (Fig. 12). The VGRF data were preprocessed and given to the LSTM network. Dropout and L2 regularization were used to prevent overfitting. Adam and stochastic gradient-based optimizers were used to minimize the cost function. UPDRS and the Hoehn and Yahr scale were used to determine the severity level of PD. They observed that the Adam-optimized LSTM performed better, with 98.6% and 96.6% accuracy for binary and multi-class classification, respectively.
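
The division of a VGRF recording into fixed time windows, as described for [88], can be sketched as follows. The 100 Hz sampling rate, the non-overlapping windows and the synthetic signal are assumptions made for illustration; the source does not state how [88] handled overlap or incomplete trailing windows.

```python
def split_into_windows(signal, sampling_rate_hz, window_seconds):
    """Split a signal into non-overlapping windows; drop an incomplete tail."""
    size = int(sampling_rate_hz * window_seconds)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


# Hypothetical 35 s recording sampled at an assumed 100 Hz.
vgrf = [0.0] * 3500
windows = split_into_windows(vgrf, sampling_rate_hz=100, window_seconds=10)
print(len(windows), len(windows[0]))
```

Each window would then be transformed (for example into a spectrogram) and classified on its own, which multiplies the number of training examples obtained from one recording.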

Fig. 12 Proposed architecture to diagnose Parkinson's disease by [89]

Buongiorno et al. [90] concentrated on motor abilities for the detection of PD. The motor abilities examined were gait, finger tapping, and foot tapping. The dataset consisted of 30 people, of which 16 were PD patients (according to MDS-UPDRS scaling) and 14 were HC. A Microsoft Kinect v2 sensor was used for classification. SVM and ANN were evaluated with fivefold CV. In total, 16 features were extracted for gait and 8 for hand and foot tapping. It was observed that with 9 and 6 features as input, the ANN performed better, with accuracies of 89.4% and 95.0%. For hand and foot tapping, SVM achieved an accuracy, sensitivity, and specificity of 87.1%, 87.7% and 86.0%, respectively. They also observed that foot tapping was vital in detecting PD patients, with an accuracy and specificity of 81.0% and 78.0%, respectively, using the SVM classifier.

Abdulhay et al. [91] aimed at classifying PD patients based on gait and tremor symptoms. The dataset consisted of 279 VGRF recordings obtained from Physionet, from 93 PD patients and 73 HC. Eight sensors attached to the feet recorded the values. Preprocessing included a Chebyshev type II high-pass filter to eliminate undesired noise. They used stance time, swing time, stride time, and foot strike profile for classification. They observed an average accuracy of 92.7% for gait. Dang et al. [92] aimed to detect PD utilizing the stooped posture of PD patients. An accelerometer was placed at the upper back and at the neck and compared with the C7-SAR distance. They placed four cameras at the back joints. They concluded that sensors placed at the back provided better results than those placed at the neck. When sensors were placed at the back, they obtained 0.9 degrees (mean absolute error) and -0.96 (R2 value), compared to 1.5 degrees (mean absolute error) and -0.99 (R2 value) when the sensor was placed at the neck.
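
Temporal gait features such as the stance and swing times used in [91] can be derived from a single-foot VGRF signal by detecting when the foot is in contact with the ground. The snippet below is a simplified sketch: the 20 N contact threshold, the 100 Hz sampling rate and the synthetic signal are assumptions, not values reported in [91].

```python
from itertools import groupby


def stance_swing_times(force, sampling_rate_hz, threshold=20.0):
    """Return (stance_times, swing_times) in seconds for one foot."""
    # Collapse the signal into runs of contact (True) and no contact (False).
    runs = [
        (in_contact, len(list(group)))
        for in_contact, group in groupby(f > threshold for f in force)
    ]
    # The first and last runs may be cut off by the recording edges.
    inner = runs[1:-1]
    stance = [n / sampling_rate_hz for contact, n in inner if contact]
    swing = [n / sampling_rate_hz for contact, n in inner if not contact]
    return stance, swing


# Synthetic signal: two stance phases of 0.6 s separated by a 0.4 s swing.
force = [0] * 5 + [100] * 60 + [0] * 40 + [100] * 60 + [0] * 5
stance, swing = stance_swing_times(force, sampling_rate_hz=100)
print(stance, swing)
```

Stride time can then be approximated as one stance time plus the following swing time of the same foot.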

Giovanni et al. [93] proposed a new hybrid neural network architecture based on time series classification to diagnose PD. Their dataset was taken from Physionet with 60 subjects. The subjects were asked to walk with accelerometers fixed on their shoes up to a distance of 77 m for 5 min. Their model had two layers: a classification layer, built from LSTM and DNN, that classified PD as anomalous and HC as usual, and a reduction layer. The reduction of temporal time series was done using under-sampling, autoencoder, Fourier transformation, and a CNN-based approach. They observed high training and testing accuracy. Arora et al. [86] detected PD in the home using smartphones. The dataset consisted of 10 PD and 10 HC. The detection was based on gait and posture symptoms. The accelerometer data was collected and tested with three ML methods: RF, Random Classifier and Conditional Random classifier with tenfold CV. They obtained good performance using RF. Cem Guzelbulut [94] considered age, height, gender, and weight for detecting gait variations using ANN. The gait of an individual can vary over time; hence, considering these factors could play a major role in PD diagnosis.

Richa et al. [95] detected PD using gait, voice and handwriting features with a modified KNN technique. They observed very high performance when using gait features (99.60% accuracy). El Maachi et al. [96] proposed a DNN classifier based on a 1D Convnet to detect PD. The dataset was collected from Physionet, which had 93 PD and 73 HC. The authors recorded the subjects' walk for 2 min through 8 sensors fixed on each foot that produced VGRF signals. These signals were passed individually to 18 1D-CNNs, each connected to a fully connected layer that classified PD and HC. They obtained an accuracy of 98.7% with their model to predict the severity of a subject. Moon et al. [97] aimed to distinguish PD from Essential Tremor (ET) using gait with six sensors. The task was to make the PD patient stand still for 30 s, walk 7 m and return to their original position. Mobility Lab software assessed 130 features from the task, and SMOTE was used to address the class imbalance (524 PD and 43 ET). They observed that, among various ML techniques, neural networks provided the best results, with an F1 score of 0.61.
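
With a class ratio such as the 524 to 43 reported for [97], accuracy alone can be misleading, which is one reason F1 scores are reported. The sketch below computes the usual metrics from confusion-matrix counts. The counts are hypothetical (a classifier that labels every subject as PD, with ET treated as the positive class) and are not results from any reviewed study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical: every one of 524 PD and 43 ET subjects is predicted as PD.
metrics = binary_metrics(tp=0, fp=0, tn=524, fn=43)
print(round(metrics["accuracy"], 3), metrics["f1"])
```

In this illustration the accuracy is above 92% even though no ET subject is detected, while the F1 score for the minority class is zero.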

Table 2 summarizes the result obtained for motor symptoms. Figure 13 shows the accuracy obtained using gait.

Table 2 Summary of motor symptoms reviewed in this study

Fig. 13 Accuracy obtained using gait

5 Based on multimodal features

This section discusses the studies that used more than one feature to detect PD. Arora et al. [98] aimed at detecting PD at home using smartphones. Their dataset consisted of 10 PD patients and 10 HC. Detection was based on 5 symptoms: voice (tested on the pronunciation of ‘aaah’), finger tapping, response time, gait, and posture. A total of 1772 recordings were made. They obtained a mean sensitivity and specificity of 96.2% and 96.9% using RF with tenfold CV. The subjects also performed a modified UPDRS test once, for which the mean error obtained was 1.26. Vásquez-Correa et al. [99] used handwriting, speech and gait symptoms with a CNN-based approach for diagnosing PD. The dataset consisted of 44 PD and 40 HC. A 2D CNN architecture was used for speech and gait analysis and a 1D CNN architecture for handwriting analysis. The languages used for their experiments were Spanish, Czech and German. The features obtained through their proposed architecture were combined to form a multimodal vector per subject, which was later classified using SVM. Their approach was validated using 80%, 10%, and 10% of the data for training, testing, and validation. They obtained an accuracy of 97.6%. While both of the above studies combined several motor features, Das et al. [100] detected PD by using a questionnaire. They analyzed 53 features of PPMI data. The classification was performed using machine learning, ensemble learning and ANN. The feature selection methods used were the Wilcoxon rank-sum test, PCA, the Chi-square test and a low variance filter. The classifiers were RF, DT, KNN, SVM, ANN, AdaBoost, XGBoost and LR. Their ANN model obtained high mean accuracy, specificity, kappa score, AUC, and F1-score of 99.51%, 98.17%, 0.9830, 0.99, and 99.70%, respectively.
Medical imaging techniques with other imaging modalities and multimodal techniques could help enhance the reliability and accuracy of PD diagnoses and play a significant role in clinical applications [101,102,103,104,105,106,107,108,109,110,111,112,113].
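
The multimodal vector per subject described for [99] is a form of feature-level fusion: the feature vectors of the individual modalities are joined into one vector before classification. The snippet below sketches this under stated assumptions; scaling each modality to unit length before concatenation is an illustrative choice, not a step reported in [99], and the feature values are hypothetical.

```python
import math


def fuse_features(*modality_vectors):
    """Scale each modality vector to unit length, then concatenate them."""
    fused = []
    for vec in modality_vectors:
        norm = math.sqrt(sum(v * v for v in vec))
        # Leave all-zero vectors unchanged to avoid division by zero.
        scaled = [v / norm for v in vec] if norm else list(vec)
        fused.extend(scaled)
    return fused


# Hypothetical speech and gait feature vectors for one subject.
speech = [3.0, 4.0]
gait = [0.0, 5.0]
print(fuse_features(speech, gait))
```

Scaling first keeps a modality with large raw values from dominating the fused vector that is passed to the classifier.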

6 Discussion

There is no permanent cure for PD to date, so diagnosing PD plays a considerable role in providing proper medication. Artificial intelligence has been proven effective for diagnosing PD, and these techniques are now a great asset to the medical field. Many researchers have contributed by developing new models for the diagnosis of PD. Image processing plays a key role in brain image analysis; a clear image can significantly improve the detection rate. Samiappan et al. [114] enhanced images to remove noise, mainly using Significant Cluster Identification for Maximum Edge Preservation (SCI-MEP). This study reviews several ML and DL approaches for PD diagnosis using brain observation methods, motor symptoms and multimodal features. SPECT, MR/MRI and EEG come under brain observation methods; voice impairment, handwriting dynamics and gait come under motor symptoms; and multimodal features include two or more symptoms.

Among brain observation methods, SPECT could serve as a powerful method for PD diagnosis. SPECT images have obtained accuracies between 74 and 99%. Although there are many radiopharmaceuticals, most of the studies used 123I-Ioflupane, and the most used database for collecting SPECT images was PPMI. The studies [12, 20, 21] used 600+ images; of these, [21] obtained the highest accuracy of 98.45% using an ensemble approach, but an ensemble approach can lead to higher complexity and computation time than a single neural network model. In Magesh et al. [20], a single neural network model, VGG16, was proposed, and LIME was used for interpretability of the model. Model interpretability has gained popularity in recent years because it makes it easier to understand how the model arrived at a particular conclusion [115,116,117,118]. It also helps to understand the nature and behaviour of the model. Without interpretability, the certainty of the model is unclear and the model is hard to trust. Another study [12] justified the importance of the SBR as a significant PD biomarker. Various studies that extracted SBR values, shape features and surface fitting features from SPECT images have shown improved accuracy [12, 13, 17], proving the significance of these features in PD diagnosis. Moreover, recent studies showed that combining SBR values from SPECT images with other biomarkers can improve the detection rate. For example, the shape features and surface fitting features in combination with SBR values, as proposed in Prashanth [13, 16], showed a significant difference in dopaminergic activity between SPECT images of HC and PD, which helps classification. Radial features and gradient features combined with SBR values in Bhalchandra et al. [17] produced a high accuracy of 99.42% using SVM. Although they achieved higher performance than Prashanth [13, 16], the dataset used in Bhalchandra et al. [17] is smaller than that of Prashanth [13].
The combination of SBR and biological biomarkers mainly plasma resulted in high performance and combination of SBR with five biomarkers namely (CSF, RNA, plasma, serum, urine) produced 100% accuracy [16]. Other factors, such as age as biomarker improved the classification in Palumbo et al. [18] which was the drawback in the study [14].

Another popular neuroimaging technique is MR/MRI. For MR/MRI, the accuracy ranged from 88% to 99%. Although only a few papers were studied, the commonly used database for collecting MR/MRI images was the PPMI repository. Compared to SPECT, MR/MRI showed better performance when the input was given as volumetric data [23, 24]. Both studies showed that 3D CNNs were able to extract key features for the diagnosis of PD since all the slices are considered. Therefore, 3D CNNs produced more reliable performance than 2D CNNs, which use only single slices; this restricts the data size and risks missing the slice that contains an important PD marker. Another study [30] analyzed MRI images separately for men and women. It obtained a highest accuracy of 99.01% for men, compared with 96.97% for women. Therefore, the consideration of gender and age seems to be effective in brain observation methods.

Lastly, for EEG, the common limitation is the limited availability of datasets. Unlike SPECT and MR/MRI, which are imaging techniques, EEG produces a graph that records the brain’s activity. Neural networks are the most used technique for diagnosing PD from EEG recordings. Two studies [36, 43] collected EEG signals from Hospital Universiti Kebangsaan Malaysia. Yuvraj et al. [43] showed that higher-order spectra-based bispectrum features from EEG signals are effective in diagnosing PD and robust to noise. The bispectrum features were seen to decrease in the PD group. Other studies [39, 42] used OpenNeuro, a publicly available dataset; PD subjects with and without medication and HC were considered in their work. Loh et al. [42] and Kuan Li et al. [119] performed 3-class classification (PD with medication, PD without medication, healthy) using CNN, obtaining accuracy greater than 98%, but the limitations are large memory requirements and complexity. Overall, among the brain observation methods, SPECT gave the best result. Most studies used SPECT, which yielded better performance than MRI. SPECT may have been chosen over MRI because SPECT captures the striatum more clearly than MRI, which captures the brain's structure.

The studies related to detecting PD using motor symptoms, especially voice impairment and handwriting, are increasing each year. This might be due to the availability of a larger number of public datasets. These two early visible motor symptoms of PD seem to give promising results. For voice and handwriting, the accuracy ranged from 76% to 100%. Both kinds of sample can be obtained by noninvasive means. For speech, the subject speaks the given task, which is recorded on a smartphone or any other recording device; for handwriting, the subject draws or writes the given task on a smartphone. The most used database for speech analysis was the UCI ML repository. The speech tasks used in the literature include words, sentences, letters and sustained vowel phonation, specifically ‘a’, ‘o’, ‘u’. Of these, sustained vowels seemed to be an effective PD discriminator. Language and gender could be significant biomarkers in assessing PD through voice impairments; this may be because of differences in pronunciation, punctuation, pace and tone of voice across languages. Jeanlocas et al. [47] assessed PD using the French language. The most used feature extraction method for speech signals was MFCC [45, 120, 121]. The choice of MFCC could be due to its low complexity and its suitability for repetitive and sustained phonation, though its performance could degrade in a noisy environment. Although MFCC was widely used, Pankaj et al. [123] concluded that time–frequency based entropy features outperformed MFCC, obtaining high accuracies of 98% and 99% using SVM for two tasks, namely the vowel ‘a’ and the word ‘atleta’. Gürüler et al. [54] and Polat et al. [62] used clustering-based feature weighting, namely K-means clustering and fuzzy c-means clustering; [54] opted for DL-based PD classification, giving a 99.52% accuracy rate and outperforming [62], who opted for ML-based PD classification (97.93% accuracy).
Although most studies used ML techniques and obtained promising results, further research could apply DL techniques with appropriate feature selection, which could effectively enhance accuracy in PD diagnosis. The most used databases for handwriting analysis were PaHaW, HandPD, NewHandPD and the UCI ML repository. As in voice analysis, several handwriting tasks are used in the literature: drawing spirals, meanders, waves and lines, and writing syllables, sentences and words. Static and dynamic spiral tests and the stability test were often used [69,70,71, 78, 84]. Gil Martin et al. [69] and Khatamino et al. [71] analyzed different directions during hand movements and concluded that x and y carry the most information [123] and z carries less information for PD diagnosis. DST [71, 124] or a combination of SST and DST [84] was more significant. The main limitation in all studies was the limited and unbalanced dataset. To overcome the limited dataset, data augmentation such as flipping and rotation improved the accuracy considerably. For example, [76, 83, 85] used various data augmentation techniques and a transfer learning approach and obtained more than 98% accuracy. Features such as kinematics [70, 79], pressure features [74], CNN-based features from handwriting, or a combination of all features with feature selection techniques, Dynamic Time Warping [78], HOG [82], etc. could be implemented in future for better results. Age, gender [79] and language could be significant biomarkers in assessing PD through handwriting dynamics. For gait analysis, the accuracy ranged from 81 to 99%. The most used database for gait analysis is PhysioNet. Both PD diagnosis and severity prediction from gait signals were performed [87, 125, 126]. MDS-UPDRS and the Hoehn and Yahr stage were the most used severity estimation scales.
However, studies that estimated PD severity were limited to subjects with mild to moderate severity levels [88,89,90] and did not consider the high severity level. The tasks that subjects were asked to perform in the literature include walking, finger tapping, foot tapping, gait, kinematics and postural tasks. A few authors included temporal features such as stride, swing and stance time and foot strike profile [91, 93, 99]. GRF/VGRF sensors are widely used in gait analysis. Unlike handwriting and voice, measuring GRF/VGRF requires sensors to be placed on the subject. The main advantage of GRF sensors is that they are useful in both PD diagnosis and severity prediction. The most used VGRF signals from gait tasks require wearable sensors and 3D cameras. For example, in Dang et al. [92], sensors and 3D cameras measured stooped posture. They concluded that sensor measurement at the upper back gives better precision than at the neck. However, these tasks were tested only on HC subjects who imitated the gait characteristics of PD. Other studies found that foot tapping, postural features [90] and temporal features [91] had the most PD-discriminative power. The accuracy improved to greater than 98% when DL techniques were used: CNN, LSTM [89, 93] and a combination of LSTM and neural networks. The main limitation noted by Balaji et al. [89] and Giovanni et al. [93] is that both approaches were time-consuming.
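
The flipping and rotation augmentation mentioned above for limited handwriting datasets can be sketched on a small image stored as a list of pixel rows. This is an illustrative example only: the function names and the 2 × 2 toy image are hypothetical, and the reviewed studies may have used other transformations or library routines.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    # Reversing the row order and transposing gives a clockwise rotation.
    return [list(col) for col in zip(*image[::-1])]


# Toy 2 x 2 "image"; each augmented copy keeps the label of the original.
image = [[1, 2],
         [3, 4]]
augmented = [image, flip_horizontal(image), rotate_90_clockwise(image)]
print(augmented)
```

Each transformed copy is added to the training set with the same PD or HC label, which enlarges a small dataset without collecting new samples.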

In this review, although brain observation, especially SPECT, obtained a very high accuracy, its installation is very expensive. In contrast, the motor symptoms reviewed in this study require only a smartphone or sensors, making them cost-effective and, most importantly, noninvasive. Brain imaging and photoacoustic imaging are also effective techniques for PD detection [129]. In addition, a CAD system can effectively assist clinicians in making more accurate decisions [130,131,132,133,134]. Figure 14 shows the CAD implementation for the diagnosis of PD.

Fig. 14 CAD implementation

Using multiple symptoms of PD has also produced a significant improvement in accuracy. Since multimodal features combine symptoms that may include both motor and non-motor ones, the prediction results could be more accurate because they do not depend on only one modality. Although few articles address multimodal features for PD diagnosis, those articles were reviewed and their significant results discussed. Thus, the use of multiple features could significantly help the PD community.

7 Conclusions

ML and DL approaches with appropriate classifiers have quickly predicted and detected PD. They have been proven to distinguish PD from HC effectively. ML techniques have helped improve the classification and prediction accuracy for PD. The major findings are:

  • From the comparison of studies, CNN was the most used DL architecture. It provided high accuracy, while transfer learning, ensemble learning, and hybrid architectures have provided promising results. Similarly, in ML, SVM classifiers produced the best results compared to other classifiers.

  • In the brain observation approach, SPECT has the highest prediction rate. The SBR values with biomarkers are prominent features that could improve the classification performance.

  • In the case of motor modality, gait and handwriting performed well. Collecting samples for gait is more complicated as it requires sensors to be placed on the patients and must be carried out under the supervision of professionals. Obtaining handwriting samples and voice recordings is comparatively less complex, and the samples can be collected at an individual’s home. A CAD system, which requires an internet connection, could assist clinicians by making it easier to visualize results and make medical decisions.

The future scope in the detection of Parkinson's disease is as follows.

There is a lack of research papers that diagnose PD using multimodal features compared to single-modal features. In the future, research with multimodal rather than single-modal features could increase the detection rate. Considering more than one motor feature helps to provide reliable, accurate results and helps to predict the severity and progression of the disease in later years. Also, implementing more CAD-based systems would help clinicians with further diagnosis and treatment. Ensemble approaches, transfer learning and hybrid models can be effective because of their ability to increase diagnostic rates. These approaches can be applied to several modalities to find the best-performing approach for each specific modality.