
1 Introduction

The early detection of cardiovascular issues is essential for reducing the fatality rate associated with cardiovascular diseases [3, 18], which are among the major causes of death worldwide [6, 27]. For this reason, we believe that experts should adopt automatic electrocardiogram (ECG) analysis to aid in diagnosing cardiovascular diseases [11, 19, 20].

Gender identification from ECG signals is an emerging research topic, and the number of studies available in the literature is still small [2, 24]. Nevertheless, the existing work indicates that ECG recordings can be classified according to gender.

We reviewed the most recent machine learning techniques for the analysis and classification of ECG signals and found that the few available studies have not achieved high precision rates [2, 24]. For this reason, we decided to conduct our research using a dataset we built ourselves and several of the methods identified during this review.

We implemented eight different methods to analyze the data: Nearest Neighbors, Linear SVM, RBF SVM, Decision Tree, Random Forest, Neural Networks, AdaBoost, and Naive Bayes, and used each of them to classify the recordings by gender.

2 Methods

2.1 Study Design and Participants

The ECG recordings used in this study were provided by 219 individuals from Portugal’s continental region and acquired with a BITalino (r)evolution device [10] and the OpenSignals (r)evolution software [7]. All volunteers disclosed at least one previously diagnosed health condition, such as allergies, hypertension, high cholesterol, diabetes, arrhythmia, asthma, unspecified heart problems, or unspecified brain problems; the most frequently disclosed conditions were hypertension and diabetes. During each recording, which lasts at least about 60 s, the participant spends 30 s sitting down and 30 s standing. This protocol is intended to emphasize the differences between individuals (gender) while requiring minimal effort, so that the recording conditions do not significantly change the overall ECG result. The Ethics Committee of Universidade da Beira Interior approved the study under the number CE-UBI-Pj-2021-41.

2.2 Feature Extraction

The NeuroKit Python module [29] was used to automatically extract important features from the ECG recordings for this study (Fig. 1), including the P, Q, R, S, and T peaks, as well as the onsets and offsets of the P, T, and R waves. From the automatically extracted features used in this investigation, the following features were then calculated manually (a code sketch of this computation is given after Fig. 1):

  • RR interval → \(\mathrm{Peak}\ R_{N} - \mathrm{Peak}\ R_{N-1}\)

  • PP interval → \(\mathrm{Peak}\ P_{N} - \mathrm{Peak}\ P_{N-1}\)

  • P duration → \(\mathrm{Offset}\ P - \mathrm{Onset}\ P\)

  • PR interval → \(\mathrm{Onset}\ R - \mathrm{Onset}\ P\)

  • PR segment → \(\mathrm{Onset}\ R - \mathrm{Offset}\ P\)

  • QRS duration → \(\mathrm{Offset}\ R - \mathrm{Onset}\ R\)

  • ST segment → \(\mathrm{Onset}\ T - \mathrm{Offset}\ R\)

  • ST-T segment → \(\mathrm{Offset}\ T - \mathrm{Offset}\ R\)

  • QT duration → \(\mathrm{Offset}\ T - \mathrm{Onset}\ R\)

  • TP interval → \(\mathrm{Onset}\ P - \mathrm{Offset}\ T\)

  • R amplitude → \(\mathrm{Peak}\ R_{N} - \mathrm{Peak}\ S_{N}\)

  • T amplitude → \(\mathrm{Peak}\ T_{N} - \mathrm{Peak}\ S_{N}\)

  • P amplitude → \(\mathrm{Peak}\ P_{N} - \mathrm{Peak}\ Q_{N}\)

Fig. 1. Features from the ECG recordings
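
As a rough illustration of how these features could be derived, the following sketch uses the NeuroKit2 package to delineate one recording and compute a few of the listed features. The column and key names (e.g. ECG_R_Peaks, ECG_P_Onsets) follow NeuroKit2's conventions, the file name is hypothetical, and the exact calls used in this study may differ.

```python
import numpy as np
import neurokit2 as nk

# Load one raw ECG recording (file name is hypothetical).
ecg = np.loadtxt("volunteer_001.txt")

# Clean the signal and locate the R peaks (1 kHz sampling rate, as in the acquisition).
signals, info = nk.ecg_process(ecg, sampling_rate=1000)
r_peaks = info["ECG_R_Peaks"]

# Delineate the remaining waves: P/T peaks and the onsets/offsets of the P, T, and R waves.
_, waves = nk.ecg_delineate(signals["ECG_Clean"], r_peaks,
                            sampling_rate=1000, method="dwt")

# Manually derived features (sample indices; divide by 1000 to obtain seconds).
rr_interval = np.diff(r_peaks)                   # Peak R_N - Peak R_(N-1)
pp_interval = np.diff(waves["ECG_P_Peaks"])      # Peak P_N - Peak P_(N-1)
p_duration = (np.array(waves["ECG_P_Offsets"])
              - np.array(waves["ECG_P_Onsets"]))   # Offset P - Onset P
qrs_duration = (np.array(waves["ECG_R_Offsets"])
                - np.array(waves["ECG_R_Onsets"]))  # Offset R - Onset R
```

The remaining intervals, segments, and amplitudes follow the same pattern of subtracting the corresponding fiducial points.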

2.3 Description of the Method

We implemented eight machine learning methods to classify the dataset and predict the gender of each volunteer; each method is briefly described below.
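
As a minimal sketch of how these eight methods could be configured, assuming a scikit-learn implementation (the hyperparameters shown are illustrative defaults, not necessarily the ones used in this study):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

# One instance per method evaluated in this study.
classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Neural Networks": MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000),
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": GaussianNB(),
}
```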

Nearest Neighbors.

K-Nearest Neighbors (K-NN) is a non-parametric technique that uses labeled data from the different classes to predict how a new sample point will be classified [26]. The algorithm builds no explicit model from the training data points; all computation is deferred until a new point is classified.
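
Formally, for a chosen neighborhood size \(k\), the predicted label is the majority class among the \(k\) training points closest to the query point (a standard formulation, added here for clarity): \(\hat{y}(x) = \operatorname{arg\,max}_{c} \sum_{x_i \in N_k(x)} \mathbb{1}[y_i = c]\), where \(N_k(x)\) denotes the \(k\) nearest neighbors of \(x\) in the training set.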

Linear SVM.

The Support Vector Machine attempts to generate the best line or decision boundary to divide n-dimensional space into classes, so new data points are assigned to the correct category when added [5]. Support vectors are the extreme points/vectors that help this algorithm generate the best decision boundary and give the method its name [5]. The term linear SVM refers to linearly separable data, which can be divided into two classes using a single straight line [25].
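
In the linearly separable case, the decision boundary is a hyperplane \(w \cdot x + b = 0\), and the SVM chooses \(w\) and \(b\) so as to maximize the margin \(2/\lVert w \rVert\) between the two classes (standard SVM formulation, added here for clarity).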

RBF SVM.

The Radial Basis Function (RBF) is the default kernel function in many kernelized learning algorithms [21]. It is conceptually similar to the K-Nearest Neighbors algorithm, but instead of storing the entire dataset during training, the RBF SVM only needs to store the support vectors [12]. Unlike the linear SVM, the RBF SVM is not a parametric model: its complexity grows with the size of the database, and it is more expensive to train [16].
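
For reference, the RBF kernel measures the similarity between two feature vectors \(x\) and \(x'\) as \(K(x, x') = \exp(-\gamma \lVert x - x' \rVert^{2})\), where \(\gamma > 0\) controls the width of the radial basis function (standard definition, added here for clarity).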

Decision Tree.

It is a rule-based supervised machine learning classifier that generates questions based on the dataset attributes and categorizes new entries depending on the answers [4]. It is a tree-based method because every question it creates has a binary answer, which splits the database into two parts [9]. The result of these successive splits can be viewed as a tree-like graph [9].
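
A common splitting criterion (and the scikit-learn default) is the Gini impurity, \(G = 1 - \sum_{c} p_c^{2}\), where \(p_c\) is the fraction of samples of class \(c\) in a node; each question is chosen to reduce this impurity as much as possible (standard formulation, added here for clarity).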

Random Forest.

It is a machine learning algorithm widely applied to classification and regression problems [1]. It works by building multiple decision trees on various samples of the data during the training phase [1]. The predicted class is the one selected by the largest number of trees.
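
With \(B\) trees \(h_1, \dots, h_B\), the forest's prediction is the majority vote \(\hat{y}(x) = \operatorname{mode}\{h_1(x), \dots, h_B(x)\}\) (standard formulation, added here for clarity).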

Neural Networks.

It is commonly a multilayer perceptron with three layers: an input layer, a hidden layer, and an output layer [13, 15]. The last two layers are made up of nodes that function as neurons and apply a nonlinear activation function. A multilayer perceptron can categorize data that is not linearly separable and uses backpropagation for training [22].
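
In such a three-layer perceptron, the hidden activations and the output are computed as \(h = \phi(W_1 x + b_1)\) and \(\hat{y} = \psi(W_2 h + b_2)\), where \(\phi\) and \(\psi\) are nonlinear activation functions and the weights \(W_1, W_2\) and biases \(b_1, b_2\) are learned by backpropagation (standard formulation, added here for clarity).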

AdaBoost.

It is an ensemble learning method that combines the outputs of several classifier algorithms into a weighted sum to increase their effectiveness and predictive ability [14]. The output of the AdaBoost meta-algorithm is the result of this weighted sum.
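
Concretely, in the standard binary formulation (added here for clarity), AdaBoost combines \(T\) weak classifiers \(h_t\) with learned weights \(\alpha_t\) into the final decision \(H(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)\); this is the weighted sum referred to above.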

Naive Bayes.

It is a probabilistic machine learning classifier based on Bayes’ theorem [23, 28]. Its simplicity and lack of a complex iterative parameter estimation make it suitable for diagnosing cardiac patients in medical science [8]. The algorithm performs well and is popular because it frequently outperforms the most advanced classification techniques.
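
For reference, the classifier assigns the class that maximizes the posterior probability under the “naive” assumption that the features are conditionally independent, \(\hat{y} = \operatorname{arg\,max}_{c} P(c) \prod_{j} P(x_j \mid c)\) (standard formulation, added here for clarity).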

3 Results

3.1 Data Acquisition

For this study, we collected data from 219 volunteers (112 men, 106 women, and one other) aged between 12 and 92 years. All participants provided informed consent to the experiments, allowing us to share the results anonymously. The consent form also informed the participants about the risks and purpose of the study. The Ethics Committee of Universidade da Beira Interior approved the study under the number CE-UBI-Pj-2021-041. The dataset used in this research is publicly available at [17].

Data were acquired with the BITalino (r)evolution device at a 1 kHz sampling frequency, connected to a personal computer running the OpenSignals (r)evolution software. Each volunteer’s data was stored in two files: a JSON file with the volunteer’s characteristics and lifestyle information, and a text file with the test data recorded over time. These files were stored in an individual folder per volunteer.

The volunteer needed to stand for 30 s and then sit in a chair for 30 s while the data was collected.

The dataset is available in a Mendeley Data repository and contains 219 folders, one per individual, each with two files. Each folder has a JSON file containing a description of the data acquisition conditions, the individual’s characteristics, and the sensors used, and a TXT text file including the data acquired from the ECG sensor.
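
As a rough sketch of how one volunteer’s folder could be read, assuming hypothetical file layouts (the exact file names in the repository may differ, and OpenSignals text exports may include header lines and extra columns that need to be handled):

```python
import json
from pathlib import Path

import numpy as np

def load_volunteer(folder: Path):
    """Load the metadata (JSON) and the raw ECG samples (TXT) of one volunteer."""
    # Volunteer characteristics, lifestyle, and acquisition conditions.
    meta_path = next(folder.glob("*.json"))
    with open(meta_path, encoding="utf-8") as f:
        metadata = json.load(f)

    # ECG samples acquired at 1 kHz; '#' header lines are skipped by default.
    ecg_path = next(folder.glob("*.txt"))
    ecg = np.loadtxt(ecg_path, comments="#")

    return metadata, ecg

# Example: iterate over the 219 per-volunteer folders (root path is hypothetical).
dataset_root = Path("ecg_dataset")
for folder in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
    metadata, ecg = load_volunteer(folder)
```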

The flowchart in Fig. 2 illustrates the data processing up to the classification results; a sketch of the classification and evaluation stage is given after the figure.

Fig. 2. Data processing flowchart
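
A minimal sketch of the classification and evaluation stage of this pipeline, assuming scikit-learn, a feature matrix X (one row per volunteer, one column per extracted feature), a binary gender label vector y (e.g. 0/1), and the `classifiers` dictionary from the sketch in Sect. 2.3; the split proportion is illustrative and not necessarily the one used in the study:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hold out part of the volunteers for testing (proportion is illustrative).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                     random_state=0)

# Fit and evaluate each of the eight classifiers on the held-out volunteers.
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, y_pred),
          "precision:", precision_score(y_test, y_pred),
          "recall:", recall_score(y_test, y_pred),
          "F1:", f1_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
```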

3.2 Results by Gender

We started by extracting the different variables analyzed from the ECG data, namely the RR interval, PP interval, P duration, PR interval, PR segment, QRS duration, ST segment, ST-T segment, QT duration, TP interval, R amplitude, T amplitude, and P amplitude. Table 1 presents the average of the extracted features. Before the analysis by gender, we found that the data of the individuals with the IDs 20, 25, 31, 33, 35, 38, 39, 54, 153, 195, and 202 were invalid, so it was necessary to exclude these 11 individuals from the analysis.

Table 1. Average of features extracted.

In Table 2, we compare the results of each classifier used during this study. The Decision Tree algorithm achieved the highest accuracy, at 62.90%. Linear SVM, AdaBoost, and Naive Bayes follow closely, each with an accuracy of 61.29%.

Table 2. Performance comparison of the various methods
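
For reference, the metrics reported below follow the standard definitions in terms of the true/false positives and negatives (TP, TN, FP, FN) taken from the confusion matrix:

\[\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\]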

The Nearest Neighbors method correctly identified the gender of 32 of the 62 volunteers in this study. It achieved an accuracy of 51.61%, a precision of 51.51%, a recall of 54.84%, and an F1-score of 53.14%. More details of this method are given in the confusion matrix presented in Table 3.

As seen in Table 3, the Linear SVM classifier correctly identified 38 of the 62 volunteers, making it one of the methods with the highest number of correct identifications. This algorithm reached an accuracy of 61.29%, a precision of 51.72%, a recall of 78.13%, and an F1-score of 62.24%.

From the ECG recordings in our dataset, the RBF SVM method was able, as seen in Table 3, to correctly predict 28 volunteers. It achieved the lowest accuracy of any method, at 48.38%, with a precision of 38.46%, a recall of 16.67%, and an F1-score of 23.26%.

The Decision Tree algorithm identified the correct gender in 39 instances. As seen in Table 3, it correctly classified 19 of the recordings as belonging to males and 20 as belonging to females. Overall, this method achieved the highest percentages: an accuracy of 62.90%, a precision of 61.29%, a recall of 63.33%, and an F1-score of 62.29%.

The Random Forest classifier correctly predicted 35 results. As seen in Table 3, it identified 15 males and 20 females. This method reached an accuracy of 56.45%, a precision of 57.69%, a recall of 48.39%, and an F1-score of 52.63%.

As seen in Table 3, the Neural Network classifier used in this study correctly identified 14 males and 18 females, for a total of 32 correct classifications. This method reached an accuracy of 51.61%, a precision of 45.16%, a recall of 51.85%, and an F1-score of 48.27%.

As seen in Table 3, the AdaBoost algorithm correctly classified 38 of the 62 volunteers. It achieved an accuracy of 61.29%, a precision of 46.67%, a recall of 63.63%, and an F1-score of 53.85%.

As presented in Table 3, the Naive Bayes algorithm correctly identified the gender of 38 of the 62 volunteers. Overall, it achieved an accuracy of 61.29%, a precision of 46.67%, a recall of 63.63%, and an F1-score of 53.85%.

Table 3. Confusion matrix for the results by gender

4 Discussions and Conclusions

We tested eight methods on a dataset of ECG recordings to see how well they could categorize the data. From the confusion matrix produced by applying each method to our dataset, we calculated the accuracy, precision, recall, and F1-score of each of the eight methods. The Decision Tree attained the best accuracy, at 62.90%, followed by the Linear SVM, AdaBoost, and Naive Bayes approaches at 61.29% accuracy.

Based on these outcomes, we concluded that the Decision Tree was the technique that performed best overall, accurately identifying 39 of the 62 results. The outcomes of this investigation met our initial expectations, as the best-performing approach could correctly classify more than 50% of the males and females in these ECG recordings.

This study used a small database, which could influence the results presented here. In the future, we expect to collect more data to consolidate these results.