Keywords

1 Introduction

Arrhythmia is a representative type of cardiovascular diseases (CVDs) that refers to any irregular change from normal heart rhythms. Although a single arrhythmia heartbeat may not have a serious impact on life, continuous arrhythmia beats can result in fatal circumstances [1]. The most widely applied solution for arrhythmia detection is electrocardiography (ECG). ECG shows the morbid status of the cardiovascular system by changes in its waveforms or rhythms [2]. The screening of arrhythmias requires careful study of the ECG records by experienced cardiologists and this process is tedious and time-consuming. Moreover, there may be minute changes in the ECG that are overlooked by the naked eye [3]. Automatic arrhythmia detection based on ECG provides important assistance for doctors, and also helps common people to self-monitor their heart conditions using wearable devices. Accurate automatic arrhythmia detection plays as the foundation of machine-aided diagnosis and treatment of cardiovascular diseases [4].

With the improvement of computing power, deep learning has been successfully applied to many areas such as numbers, characters and face recognition, image classification. Deep learning methods are also used effectively in the analysis of bioinformatics signals. Kiranyaz et al. proposed an adaptive CNN model for the detection of ventricular ectopic beats (VEB) and supraventricular ectopic beats (SVEB). This model required small common patient-specific training data and achieved a superior classification performance. Acharya et al. developed a 9-layer deep CNN model to automatically identify five different categories of heartbeats in ECG signals. Their experiment obtained an accuracy of 94.03% in the diagnostic classification of the original ECG heartbeats [14]. Rajpurkar et al. proposed a 34-layer CNN model for arrhythmia detection [5]. It is reported that accuracy exceeds the level of cardiologists.

Fig. 1.
figure 1

An ECG heartbeat with P,Q,R,S and T waves.

In the last decades, there have been some works on ECG for different tasks promoting the application of ECG in clinical practice. Generally speaking, there are four main tasks: (1) ECG data preprocessing, (2) heartbeat segmentation, (3) feature extraction, (4) ECG classification. Among the four tasks, ECG feature extraction and classification are the keys to successfully detect cardiac diseases [7]. Although many researchers achieved almost optimal results for ECG classification and various filtering methods were proposed, owing to the diversities of the ECG dataset and preprocessing methods, as well as the differences in evaluation indicators, there is a lack of horizontal comparison of such research works. Thus, in this paper, we will focus on the comparison of commonly used ECG preprocessing methods and mainstream deep learning networks to promote the clinical practice of diagnosis by ECG (Fig. 2).

Fig. 2.
figure 2

Overall view of the proposed systems.

Major contributions of this research are as follows: 1) Referred to the previous application of deep learning networks structures in the detection of ECG arrhythmia, we developed four mainstream networks such as an 11-layer pure CNN model, a combination of CNN and LSTM model, Attention mechanism model and ResNet model, separately trained to compare the final classification performance. the same ECG dataset and preprocessing methods were used for all reference models 2) Considering the comparison of preprocessing methods, we chose three ECG presentations as the inputs of the classifier, there was a raw signal (without any preprocessing), denoised signal, and hand-crafted ECG features which widely used in machine learning. 3) In this paper, we discussed two major evaluations of heartbeat classification methods, intra-patient, and inter-patient evaluation, presented the classification performance of existing approaches under each evaluation.

2 Dataset

The ECG dataset used in our experiments is from the MIT-BIH arrhythmia database [8] publicly available on PhysioNet. The MIT-BIH dataset contains 48 30-minutes long records from 47 patients. Each ECG record was digitized at 360 samples and annotated by cardiologists. The database is regarded as the benchmark database in arrhythmia detection and classification and has been extensively utilized for algorithm validation [9]. Hence, we used the MIT-BIH arrhythmia database to do contrast with different classification methods.

Table 1. Summary of training and testing datasets in intra-patient

For the sake of fairness, two types of evaluation approaches were investigated in this paper, namely, “inter-patient” and “intra-patient”. In the “intra-patient” evaluation, all 48 records were used and heartbeat segments were provided by the annotated QRS position. The resulting heartbeat dataset was divided into 16 classes [9]. Specifically, 13% of the beats from the normal class, 40% of the beats from each of five bigger arrhythmia classes (i.e., “L,” “R,” “A,” “V,” “P”), and 50% of the beats from each of ten smaller arrhythmias classes were randomly selected to constitute the training dataset, for a total of 21.89% beats of the whole dataset. The remaining heartbeats were used as the test dataset. The details of the 16 heartbeat classes were summarized in Table 1. To be mentioned, in the “intra-patient” evaluation, the training and test dataset will contain heartbeats from the same people.

In consideration of it, the “intra-patient” evaluation is not a realistic measure. Therefore, the “inter-patient” evaluation is conducted in addition to a more realistic estimate of the generalization ability of the algorithm. According to ANSI/AAMI EC57:1998 standard [10], among the 48 records available in the MIT-BIH dataset the training and testing dataset contained 22 records each, the other four paced records (i.e., the records 102, 104, 107, 217) were excluded for experiments. The original 16 heartbeat classes were divided into the five bigger classes, namely “N”, “S”, “V”, “F”, and “Q”. The mapping from the MIT-BIH arrhythmia database heartbeat classes to the AAMI heartbeat classes was shown in Table 2.

The purpose of ECG signal preprocessing is to reduce various types of noise that may be present in the ECG signal. Typical sources of noise in the ECG signal include baseline wander, artifacts caused by muscle contraction, and electrode movement [11]. In our research, the preprocessing of the ECG signal included baseline wander correction and band-pass filtering. The raw ECG signal was first processed to correct the baseline wander using a wavelet-based approach [12]. Following [13], the signal was band-pass filtered at 5–12 Hz to maximize the energy of the QRS complex, removing high-frequency and low-frequency artifacts. The denoised signals were used as one of the classifier inputs.

Table 2. Summary of training and testing datasets in inter-patient and the mapping from MIT-BIH classes to AAMI classes

ECG signals usually show changes over time or position. The classic Fourier transform (FT) can provide us with frequency-domain features, but a lack of analysis of time-domain features. Wavelet transform (WT) can satisfy these two characteristics. Until now, WT has been widely used in ECG signals [18,19,20,21], including denoising, heartbeat detection, and feature extraction. In traditional machine learning classification tasks, WT functions are often used for handcrafted features as the classifier input. The performance of WT features is often better than using the raw ECG signal as the classifier input. In our paper, WT was used as one of the feature extraction methods. Daubechies wavelets of order 8 were selected due to their similarity with the most characteristic QRS waveform [9]. After applying four-level wavelet decomposition, we kept 114 coefficients (32 from A4, 32 from D4, and 50 from D3) as the wavelet features for one of the classifier inputs. In the heartbeat segmentation of ECG signals, we used the annotations of R-peak locations provided by the database. Each heartbeat segment consisted of 100 samples before the R peak location, R peak location, and 200 samples after the R peak, a total of 300 samples corresponding to 0.83 s.

3 Intra-Patient Model Architecture

3.1 PCNN Model

The PCNN (Pure-CNN) model used an 11-layer network structure [6], contained 4 convolutional layers, and the size of convolutional kernels was respectively 27, 14, 3, and 4. Structurally, each convolutional layer was followed by a max-pooling layer of stride 2. The last part of the model has consisted of 3 full-connected layers, the number of neurons of the first two layers were 30 and 10, the last layer neurons were determined by the number of output classes. We used a leaky rectifier linear unit as the activation function of the convolutional layers and the first two fully connected layers, also applied Xavier to initialize the weight of each layer. Softmax function was used for the last layer. During the training of the CNN model, we used Adam optimizer, the initial learning rate was set to 0.001, and the batch size was 20. The architecture for the PCNN model was illustrated in Fig. 3.

Fig. 3.
figure 3

The architecture of the proposed PCNN model.

3.2 CNN-LSTM Model

The model was composed of CNN and LSTM layers. The structure of CNN layers were designed with the reference to VGGNet developed by Visual Geometry Group [15]. VGGNet was chosen as it features smooth information flow and simple implementation. As shown in Fig. 4, the number of convolutional kernels continued to grow with increasing depth of network layers with the kernel size of 3. The stride size as well as kernel size of max-pooling layer is 3. Each convolutional layer were stepped by a batch normalization layer (BN) and a rectified linear unit activation function. The addition of the BN layer can make the distribution of the input data from each layer in the network relatively stable, which is conducive to improving the learning speed of the entire neural network [23]. The RELU function is a widely used activation function [16], which has the ability to prevent the gradient vanishing problem in deep neural networks. Following the convolutional layers were two LSTM layers, the input of the first LSTM layer was determined by the number of features which generated by the convolutional layer, and the output size of both LSTM layers were 32.

Fig. 4.
figure 4

The architecture of the proposed CNN-LSTM model.

The CNN-LSTM model was trained by Adam optimizer with an initial learning rate of 0.0001. Adam has the ability to balance gradient updates between different classes, so it can mitigate the adverse effects caused by data imbalance [24]. In a total of 90 periods of training, the learning rate was multiplied by 0.1 times every 30 periods. We used Kaiming to initialize the weight of each convolutional layer, and orthogonal to initialize the weight of the LSTM layer, which can greatly improve the convergence speed of the model parameters. To prevent overfitting [21], we multiplied the L2 loss of all parameters in the network by 0.004 to the training loss and used a 20% dropout rate in the LSTM layers.

3.3 Attention-Based Model

The Attention-Based model was constructed with CNN, LSTM, and Attention mechanism. The part of CNN and LSTM referred to the CNN-LSTM model. The size and number of kernel functions were consistent with the CNN-LSTM model. To give a fair comparison, the same hyperparameters including learning rate and batch size were used for the proposed model and kept in line with regularization strategies. The Attention-Based Model added the Attention layer after the LSTM layer as shown in Fig. 5 and the attention mechanism would generate a set of independent weights for each possible class in the N classes. Then for each class, the unique weighted average of the inputs was calculated, and finally, the probability of the ECG signal class was outputted.

Fig. 5.
figure 5

The architecture of the proposed Attention-Based model.

The introduction of the attention mechanism in the network structure had two advantages. First, it can help the model to focus on the important part of the heartbeat, thereby improving the classification performance. Second, it can help highlight the location of the abnormal heartbeat, and further research on the location of the heartbeat can increase the interpretability of the deep learning model.

3.4 ResNet Model

For this model, we refer to the ECG diagnosis algorithm proposed by Andrew Ng in Nature Medicine [27], and design a 34-layer network structure, using a connection method similar to the residual network structure. The network consists of 16 residual blocks, each with two convolutional layers, the number of filters in the convolutional layer is 32 * 2k, and the size is 3, where k starts from 1, and every 4 residual blocks Increase by 1. In the case of reaching the same receptive field, the smaller the convolution kernel, the less calculation of the parameter kernel required, so in the experiment, we prefer to use multiple small convolution kernels instead of using one alone. Large convolution kernel. Each spare residual block will subsample the input with a step size of 2. Before each convolutional layer, a batch normalization layer and activation function RELU layer is designed as a pre-activated block. In order to prevent over-fitting, the model adds a Dropout layer between the RELU layer and the convolutional layer, with a loss rate of 20.

In the training process, Adam optimizer is used, the initial learning rate is 0.0001, and the default parameters \(\beta 1= 0.9\) and \(\beta 2= 0.999\). The batch size is 128. We use the Xaiver to initialize the weight of each convolutional layer. In a total of 30 epochs of training, the learning rate is multiplied by 0.1 every 10 epochs.

4 Inter-Patient Model Architecture

Initially, we were consistent in choosing models for the inter-patient and intra-patient model, however, during the training process, we found the gradient vanishing problem by used intra-patient models to do the inter-patient classification. Hence, we referred to the PCNN model to simplify the network structure and parameters, and retraining. The CNN model selection was consistent with the above-mentioned A model. The difference is the construction of the CNN-LSTM framework, we add the two LSTM layers after the PCNN model, no longer applied modified VGGNet as the convolutional structure, and the output size were all 128, and 20% of the dropout rates were used.

5 Experimental Setup

Proposed models were deployed in Python 3.7.6 language with TensorFlow [25] which is an open-source software library for deep learning launched by Google. Since deep learning networks require a lot of free parameters to train, GPU support is strongly recommended to reduce the learning time of the model. Thus, our experimental setup was RTX2060-super GPU on the window10 system. With GPU, TensorFlow is accelerated by using CUDA and CUDNN [26]. Versions of each software are TensorFlowr2.1.0, CUDA 10.1, and CUDNN 7.6.4.

6 Experimental Results

6.1 Performance Metrics

In this research, typical classification metrics, including accuracy, precision, recall, and F1 score were used for each model. They were defined as:

$$\begin{aligned} Accuracy=\frac{TP+TN}{TP+FP+TN+FN} \end{aligned}$$
(1)
$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(2)
$$\begin{aligned} Recall=\frac{TP}{TP+FN} \end{aligned}$$
(3)
$$\begin{aligned} F1=\frac{2\times (Precision\times Recall)}{Precision+Recall} \end{aligned}$$
(4)

where TP refers to the number of correctly classified samples in a certain class, FN refers to the number of samples belonging to a certain class which was misclassified as in other classes, and FP refers to the number of samples misclassified as in a certain class when they belong to other classes.

6.2 Intra-Patient Performance

In the intra-patient experimental performance, from the view of various inputs, the performances of raw signal and manual WT feature were generally better than the denoised signal in Table 3, whether it is from the perspective of accuracy, or from the analysis of F1-score. But there was a special result here that the F1-score of the denoised signal which training on the PCNN and ResNet networks showed the highest performance in all groups.

Table 3. The intra-patient classification performance of three inputs and models

We deliberately compared and analyzed the confusion matrix of the PCNN Denoised model and the lowest F1-score Attention Denoised model, and we found the recognition rate of the heartbeats of class6, class9, class13, class14, class15 in the Attention Denoised model was 0%. Considering that the sample numbers of these classes occupied a very low proportion of the total sample numbers of heartbeats (as shown in Table 1). On the other side, the recognition rate of the PCNN Denoised model included the above-mentioned classes and exceeded 90% (excepted for classes 13).

From the framework point of view, the four mainstream deep learning frameworks can achieve an accuracy rate of more than 95%. Apart from denoising signals, almost all the frameworks can reach an accuracy rate of more than 97%. And in terms of overall performance, the Resnet model performs better than other frameworks but we need to point out that the intra-patient classification is not a real method for the actual evaluation of the performance of the classifier. The time-varying dynamics and the morphological characteristics of ECG signals show significant variations for different patients. Even for the ECG of a healthy subject, which appears to be deterministic, the shapes of QRS complex, P waves, and R-R intervals will not be the same from one beat to the other under different circumstances [17]. However, in real-world scenarios, the trained model must deal with heartbeats from patients that are unseen during training [22].

Table 4. The inter-patient classification performance of three inputs and models

6.3 Inter-Patient Performance

In the inter-patient evaluation, according to the AAMI standard, 22 records were used for training and 22 records were used for testing. For the training set, we counted the heartbeats contained in each record according to the classes, and 21 records were used in the training process, one record for the validation. The validation data was derived from training data that had never been used before.

Fig. 6.
figure 6

The confusion matrix of CNN model.

Fig. 7.
figure 7

The confusion matrix of CNN-LSTM model.

Fig. 8.
figure 8

The confusion matrix of Attention-Based model.

Table 4 compared the three reference models (PCNN, CNN-LSTM, Attention-Based Model) in the classification of arrhythmia. The results showed that the Attention-based model was superior to the PCNN and CNN-LSTM models in the F1-score of all classes. Besides, due to the sample imbalance of the MIT-BIH database, we used the weighted F1-score indicator. In terms of accuracy, the CNN-LSTM model was generally highest, up to 87.0%. On account the Attention-Based model was modified based on the CNN-LSTM model, the two were more comparable, it can be seen that as the introduction of attention mechanism, the model can help us improve the recognition rate of abnormal heartbeats. Observed Figs. 6, 7 and 8, in the recognition rate of “S” heartbeats, Attention-based model exhibited better performance than the CNN and CNN-LSTM networks. Compared to the classification performance of different signal inputs, we are more inclined to recommend using the raw signal as the input of the inter-patient in the future arrhythmia classification research, because of its stable performance upon different deep learning network frameworks.

7 Discussion and Conclusion

As a result of the long history of ECG classification researches, these works used different datasets and detected different types of arrhythmia. It is unfair to compare it with them directly. In this paper, we proposed three different ECG signal inputs and used four different deep learning networks CNN, CNN-LSTM, Attention-Based model to do the intra-patient and inter-patient evaluation on the MIT-BIH dataset. The results showed that deep learning networks can generally achieve good classification performance on the MIT-BIH database. The CNN network is good at capturing spatial features, the LSTM network is suitable for processing time-series signals. Compared with the classification performance of CNN and CNN-LSTM, the combined performance of the two will be superior to the single CNN network. In addition, the introduction of the attention mechanism helps the model locate the important information part of the ECG signal and improve the interpretability of the model. The results of the Attention-Based model also showed that in future ECG classification research, it is important to focus on the subtle information part of the ECG signal. ResNet model has good behavior in intra-patient classification, but due to its complicated network structure and expensive training cost, we tend to recommend the simplified network (layers number did not exceed 15) in the heartbeats classification based on the MIT-BIH database. Due to the particularity of the MIT-BIH dataset, the numbers of normal heartbeats were far more than the abnormal. Therefore, experimental results with high accuracy were more inclined to the normal class. The phenomenon was obvious in the inter-patient experiments. It can be considered that keep a balance of the sample numbers between classes or using weighted evaluation indicators to compare performance more fairly. In this research, we used three different inputs, the raw signal, the denoised signal, and the handcrafted features, discussed the impact and significance of each input on the performance and applicability of our heartbeat classification method. It can be seen that the raw signal performance was more stable than the others. So we recommend using the raw signal in future works.