
1 Introduction

The early detection of cardiovascular issues is essential for reducing the fatality rate associated with cardiovascular diseases [3, 18], which are among the major causes of death worldwide [6, 27]. For this reason, we believe that experts should adopt automatic electrocardiogram (ECG) analysis to aid in diagnosing cardiovascular diseases [11, 19, 20].

Gender identification from ECG signals is an emerging research topic, and the number of studies available in the literature is still small [2, 24]. Nevertheless, the existing work indicates that ECG recordings can be classified according to gender.

We reviewed the most recent machine learning techniques for the analysis and classification of ECG signals and found that the few available studies have not achieved high precision rates [2, 24]. For this reason, we decided to conduct our research using a dataset we built ourselves and several of the methods identified during this review.

We implemented eight different methods to analyze the data: Nearest Neighbors, Linear SVM, RBF SVM, Decision Tree, Random Forest, Neural Networks, AdaBoost, and Naive Bayes, and used each of them to classify the recordings by gender.

2 Methods

2.1 Study Design and Participants

The ECG recordings used in this study were provided by 219 individuals from Portugal’s continental region and acquired with a BITalino (r)evolution device [10] and the OpenSignals (r)evolution software [7]. All volunteers disclosed at least one previously diagnosed health condition, such as allergies, hypertension, high cholesterol, diabetes, arrhythmia, asthma, unspecified heart problems, or unspecified brain problems; the most frequently disclosed conditions were hypertension and diabetes. During each recording, which lasts at least about 60 s, the participant spends 30 s sitting down and 30 s standing. This protocol is intended to emphasize the differences between individuals (gender) while requiring minimal effort, so that the recording conditions do not significantly change the overall ECG result. The Ethics Committee of Universidade da Beira Interior approved the study under the number CE-UBI-Pj-2021-41.

2.2 Feature Extraction

The NeuroKit Python module [29] was used to automatically extract important features from the ECG recordings for this study (Fig. 1), including the P, Q, R, S, and T peaks, as well as the onsets and offsets of the P, T, and R waves. From the automatically extracted features used in this investigation, the following features were then calculated manually (a code sketch of this computation is given after Fig. 1):

  • RR interval → \(\mathrm{Peak}\ R_{N} - \mathrm{Peak}\ R_{N-1}\)

  • PP interval → \(\mathrm{Peak}\ P_{N} - \mathrm{Peak}\ P_{N-1}\)

  • P duration → \(\mathrm{Offset}\ P - \mathrm{Onset}\ P\)

  • PR interval → \(\mathrm{Onset}\ R - \mathrm{Onset}\ P\)

  • PR segment → \(\mathrm{Onset}\ R - \mathrm{Offset}\ P\)

  • QRS duration → \(\mathrm{Offset}\ R - \mathrm{Onset}\ R\)

  • ST segment → \(\mathrm{Onset}\ T - \mathrm{Offset}\ R\)

  • ST-T segment → \(\mathrm{Offset}\ T - \mathrm{Offset}\ R\)

  • QT duration → \(\mathrm{Offset}\ T - \mathrm{Onset}\ R\)

  • TP interval → \(\mathrm{Onset}\ P - \mathrm{Offset}\ T\)

  • R amplitude → \(\mathrm{Peak}\ R_{N} - \mathrm{Peak}\ S_{N}\)

  • T amplitude → \(\mathrm{Peak}\ T_{N} - \mathrm{Peak}\ S_{N}\)

  • P amplitude → \(\mathrm{Peak}\ P_{N} - \mathrm{Peak}\ Q_{N}\)

Fig. 1. Features from the ECG recordings
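
As a rough illustration of how these features could be derived, the following sketch uses the NeuroKit2 package to delineate one recording and compute a few of the listed features. The column and key names (e.g. ECG_R_Peaks, ECG_P_Onsets) follow NeuroKit2's conventions, the file name is hypothetical, and the exact calls used in this study may differ.

```python
import numpy as np
import neurokit2 as nk

# Load one raw ECG recording (file name is hypothetical).
ecg = np.loadtxt("volunteer_001.txt")

# Clean the signal and locate the R peaks (1 kHz sampling rate, as in the acquisition).
signals, info = nk.ecg_process(ecg, sampling_rate=1000)
r_peaks = info["ECG_R_Peaks"]

# Delineate the remaining waves: P/T peaks and the onsets/offsets of the P, T, and R waves.
_, waves = nk.ecg_delineate(signals["ECG_Clean"], r_peaks,
                            sampling_rate=1000, method="dwt")

# Manually derived features (sample indices; divide by 1000 to obtain seconds).
rr_interval = np.diff(r_peaks)                   # Peak R_N - Peak R_(N-1)
pp_interval = np.diff(waves["ECG_P_Peaks"])      # Peak P_N - Peak P_(N-1)
p_duration = (np.array(waves["ECG_P_Offsets"])
              - np.array(waves["ECG_P_Onsets"]))   # Offset P - Onset P
qrs_duration = (np.array(waves["ECG_R_Offsets"])
                - np.array(waves["ECG_R_Onsets"]))  # Offset R - Onset R
```

The remaining intervals, segments, and amplitudes follow the same pattern of subtracting the corresponding fiducial points.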

2.3 Description of the Method

We implemented eight machine learning methods to classify the dataset and predict the gender of each volunteer; each method is briefly described below.
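
As a minimal sketch of how these eight methods could be configured, assuming a scikit-learn implementation (the hyperparameters shown are illustrative defaults, not necessarily the ones used in this study):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

# One instance per method evaluated in this study.
classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Neural Networks": MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000),
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": GaussianNB(),
}
```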

Nearest Neighbors.

K-Nearest Neighbors (K-NN) is a non-parametric technique that uses labeled data from the different classes to predict how a new sample point will be classified [26]. The algorithm builds no explicit model from the training data points; all computation is deferred until a new point is classified.
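
Formally, for a chosen neighborhood size \(k\), the predicted label is the majority class among the \(k\) training points closest to the query point (a standard formulation, added here for clarity): \(\hat{y}(x) = \operatorname{arg\,max}_{c} \sum_{x_i \in N_k(x)} \mathbb{1}[y_i = c]\), where \(N_k(x)\) denotes the \(k\) nearest neighbors of \(x\) in the training set.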

Linear SVM.

The Support Vector Machine attempts to generate the best line or decision boundary to divide n-dimensional space into classes, so new data points are assigned to the correct category when added [5]. Support vectors are the extreme points/vectors that help this algorithm generate the best decision boundary and give the method its name [5]. The term linear SVM refers to linearly separable data, which can be divided into two classes using a single straight line [25].
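
In the linearly separable case, the decision boundary is a hyperplane \(w \cdot x + b = 0\), and the SVM chooses \(w\) and \(b\) so as to maximize the margin \(2/\lVert w \rVert\) between the two classes (standard SVM formulation, added here for clarity).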

RBF SVM.

The Radial Basis Function (RBF) is the default kernel function in many kernelized learning algorithms [21]. It is conceptually similar to the K-Nearest Neighbors algorithm, but instead of storing the entire dataset during training, the RBF SVM only needs to store the support vectors [12]. Unlike the linear SVM, the RBF SVM is not a parametric model: its complexity grows with the size of the database, and it is more expensive to train [16].
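
For reference, the RBF kernel measures the similarity between two feature vectors \(x\) and \(x'\) as \(K(x, x') = \exp(-\gamma \lVert x - x' \rVert^{2})\), where \(\gamma > 0\) controls the width of the radial basis function (standard definition, added here for clarity).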

Decision Tree.

It is a rule-based supervised machine learning classifier that generates questions based on the dataset attributes and categorizes new entries depending on the answers [4]. It is a tree-based method because every question it creates has a binary answer, which splits the database into two parts [9]. The result of these successive splits can be viewed as a tree-like graph [9].
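
A common splitting criterion (and the scikit-learn default) is the Gini impurity, \(G = 1 - \sum_{c} p_c^{2}\), where \(p_c\) is the fraction of samples of class \(c\) in a node; each question is chosen to reduce this impurity as much as possible (standard formulation, added here for clarity).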

Random Forest.

It is a machine learning algorithm widely applied to classification and regression problems [1]. It works by building multiple decision trees on various samples of the data during the training phase [1]. The predicted class is the one selected by the largest number of trees.
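
With \(B\) trees \(h_1, \dots, h_B\), the forest's prediction is the majority vote \(\hat{y}(x) = \operatorname{mode}\{h_1(x), \dots, h_B(x)\}\) (standard formulation, added here for clarity).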

Neural Networks.

It is commonly a multilayer perceptron with three layers: an input layer, a hidden layer, and an output layer [13, 15]. The last two layers are made up of nodes that function as neurons and apply a nonlinear activation function. A multilayer perceptron can categorize data that is not linearly separable and uses backpropagation for training [22].
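
In such a three-layer perceptron, the hidden activations and the output are computed as \(h = \phi(W_1 x + b_1)\) and \(\hat{y} = \psi(W_2 h + b_2)\), where \(\phi\) and \(\psi\) are nonlinear activation functions and the weights \(W_1, W_2\) and biases \(b_1, b_2\) are learned by backpropagation (standard formulation, added here for clarity).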

AdaBoost.

It is an ensemble learning method that combines the outputs of several classifier algorithms into a weighted sum to increase their effectiveness and predictive ability [14]. The output of the AdaBoost meta-algorithm is the result of this weighted sum.
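
Concretely, in the standard binary formulation (added here for clarity), AdaBoost combines \(T\) weak classifiers \(h_t\) with learned weights \(\alpha_t\) into the final decision \(H(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)\); this is the weighted sum referred to above.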

Naive Bayes.

It is a probabilistic machine learning classifier based on Bayes’ theorem [23, 28]. Its simplicity and lack of a complex iterative parameter estimation make it suitable for diagnosing cardiac patients in medical science [8]. The algorithm performs well and is popular because it frequently outperforms the most advanced classification techniques.
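
For reference, the classifier assigns the class that maximizes the posterior probability under the “naive” assumption that the features are conditionally independent, \(\hat{y} = \operatorname{arg\,max}_{c} P(c) \prod_{j} P(x_j \mid c)\) (standard formulation, added here for clarity).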

3 Results

3.1 Data Acquisition

For this study, we collected data from 219 volunteers (112 men, 106 women, and one other) aged between 12 and 92 years. All participants provided informed consent to the experiments, allowing us to share the results anonymously. The consent form also informed the participants about the risks and purpose of the study. The Ethics Committee of Universidade da Beira Interior approved the study under the number CE-UBI-Pj-2021-041. The dataset used in this research is publicly available at [17].

Data were acquired with the BITalino (r)evolution device at a 1 kHz sampling frequency, connected to a personal computer running the OpenSignals (r)evolution software. Each volunteer’s data was stored in two files: a JSON file with the volunteer’s characteristics and lifestyle information, and a text file with the test data recorded over time. These files were stored in an individual folder per volunteer.

The volunteer needed to stand for 30 s and then sit in a chair for 30 s while the data was collected.

The dataset is available in a Mendeley Data repository and contains 219 folders, one per individual, each with two files. Each folder has a JSON file containing a description of the data acquisition conditions, the individual’s characteristics, and the sensors used, and a TXT text file including the data acquired from the ECG sensor.
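
As a rough sketch of how one volunteer’s folder could be read, assuming hypothetical file layouts (the exact file names in the repository may differ, and OpenSignals text exports may include header lines and extra columns that need to be handled):

```python
import json
from pathlib import Path

import numpy as np

def load_volunteer(folder: Path):
    """Load the metadata (JSON) and the raw ECG samples (TXT) of one volunteer."""
    # Volunteer characteristics, lifestyle, and acquisition conditions.
    meta_path = next(folder.glob("*.json"))
    with open(meta_path, encoding="utf-8") as f:
        metadata = json.load(f)

    # ECG samples acquired at 1 kHz; '#' header lines are skipped by default.
    ecg_path = next(folder.glob("*.txt"))
    ecg = np.loadtxt(ecg_path, comments="#")

    return metadata, ecg

# Example: iterate over the 219 per-volunteer folders (root path is hypothetical).
dataset_root = Path("ecg_dataset")
for folder in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
    metadata, ecg = load_volunteer(folder)
```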

The flowchart in Fig. 2 illustrates the data processing up to the classification results; a sketch of the classification and evaluation stage is given after the figure.

Fig. 2. Data processing flowchart
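
A minimal sketch of the classification and evaluation stage of this pipeline, assuming scikit-learn, a feature matrix X (one row per volunteer, one column per extracted feature), a binary gender label vector y (e.g. 0/1), and the `classifiers` dictionary from the sketch in Sect. 2.3; the split proportion is illustrative and not necessarily the one used in the study:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hold out part of the volunteers for testing (proportion is illustrative).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                     random_state=0)

# Fit and evaluate each of the eight classifiers on the held-out volunteers.
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, y_pred),
          "precision:", precision_score(y_test, y_pred),
          "recall:", recall_score(y_test, y_pred),
          "F1:", f1_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
```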

3.2 Results by Gender

We started by extracting the different variables analyzed from the ECG data, namely the RR interval, PP interval, P duration, PR interval, PR segment, QRS duration, ST segment, ST-T segment, QT duration, TP interval, R amplitude, T amplitude, and P amplitude. Table 1 presents the average of the extracted features. Before the analysis by gender, we found that the data of the individuals with the IDs 20, 25, 31, 33, 35, 38, 39, 54, 153, 195, and 202 were invalid, so it was necessary to exclude these 11 individuals from the analysis.

Table 1. Average of features extracted.

In Table 2, we compare the results of each classifier used during this study. The Decision Tree algorithm achieved the highest accuracy, at 62.90%. Linear SVM, AdaBoost, and Naive Bayes follow closely, each with an accuracy of 61.29%.

Table 2. Performance comparison of the various methods
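
For reference, the metrics reported below follow the standard definitions in terms of the true/false positives and negatives (TP, TN, FP, FN) taken from the confusion matrix:

\[\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\]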

The Nearest Neighbors method correctly identified the gender of 32 of the 62 volunteers in this study. It achieved an accuracy of 51.61%, a precision of 51.51%, a recall of 54.84%, and an F1-score of 53.14%. More details of this method are given in the confusion matrix presented in Table 3.

As seen in Table 3, the Linear SVM classifier correctly identified 38 of the 62 volunteers, making it one of the methods with the highest number of correct identifications. This algorithm reached an accuracy of 61.29%, a precision of 51.72%, a recall of 78.13%, and an F1-score of 62.24%.

From the ECG recordings in our dataset, the RBF SVM method was able, as seen in Table 3, to correctly predict 28 volunteers. It achieved the lowest accuracy of any method, at 48.38%, with a precision of 38.46%, a recall of 16.67%, and an F1-score of 23.26%.

The Decision Tree algorithm identified the correct gender in 39 instances. As seen in Table 3, it correctly classified 19 of the recordings as belonging to males and 20 as belonging to females. Overall, this method achieved the highest percentages: an accuracy of 62.90%, a precision of 61.29%, a recall of 63.33%, and an F1-score of 62.29%.

The Random Forest classifier correctly predicted 35 results. As seen in Table 3, it identified 15 males and 20 females. This method reached an accuracy of 56.45%, a precision of 57.69%, a recall of 48.39%, and an F1-score of 52.63%.

As seen in Table 3, the Neural Network classifier used in this study correctly identified 14 males and 18 females, for a total of 32 correct classifications. This method reached an accuracy of 51.61%, a precision of 45.16%, a recall of 51.85%, and an F1-score of 48.27%.

As seen in Table 3, the AdaBoost algorithm correctly classified 38 of the 62 volunteers. It achieved an accuracy of 61.29%, a precision of 46.67%, a recall of 63.63%, and an F1-score of 53.85%.

As presented in Table 3, the Naive Bayes algorithm correctly identified the gender of 38 of the 62 volunteers. Overall, it achieved an accuracy of 61.29%, a precision of 46.67%, a recall of 63.63%, and an F1-score of 53.85%.

Table 3. Confusion matrix for the results by gender

4 Discussions and Conclusions

We tested eight methods on a dataset of ECG recordings to see how well they could categorize the data. From the confusion matrix produced by applying each method to our dataset, we calculated the accuracy, precision, recall, and F1-score of each of the eight methods. The Decision Tree attained the best accuracy, at 62.90%, followed by the Linear SVM, AdaBoost, and Naive Bayes approaches at 61.29% accuracy.

Based on these outcomes, we concluded that the Decision Tree was the technique that performed best overall, accurately identifying 39 of the 62 results. The outcomes of this investigation met our initial expectations, as the best-performing approach could correctly classify more than 50% of the males and females in these ECG recordings.

This study used a small database, which could influence the results presented here. In the future, we expect to collect more data to consolidate these results.