1 Introduction

In recent years, deep learning has emerged as a powerful computer-based method for solving various recognition problems. Deep learning was first introduced by Hinton [1] and focuses on automatically learning a good feature representation from input data [1,2,3,4]. Typical deep learning architectures include deep belief networks (DBNs) [1], the stacked autoencoder (SAE) [5], and convolutional neural networks (CNNs) [6].

Deep learning methods have provided outstanding results for several benchmark classification problems, such as image classification and segmentation [7,8,9,10,11,12], landmark detection [13], object recognition [14], face detection and recognition [15, 16], and speech recognition [17, 18]. Compared to shallow methods based on handcrafted features, deep learning methods learn powerful representations in a hierarchical way thanks to their sophisticated structure.

In the biomedical engineering field, several authors have recently used these powerful methods to solve various problems, such as breast cancer diagnosis and mass classification [19, 20], abdominal adipose tissue extraction [21], detection and classification of brain tumors in MR images [22,23,24,25], skeletal bone age assessment in X-ray images [26], EEG classification of motor imagery [27], and arrhythmia detection and analysis using ECG signals [28,29,30].

Analysis of ECG signals provides valuable information to cardiologists about the rhythm and function of the heart. Therefore, ECG analysis represents an efficient way to detect and treat different types of cardiac disease [29, 31,32,33,34,35,36,37]. In [28], the authors proposed an approach that learns a suitable feature representation from raw ECG data in an unsupervised way using a de-noising auto-encoder (DAE). They then built a deep neural network (DNN) by adding a soft-max regression layer on top of the resulting hidden representation layer. During the interaction phase, they allowed an expert to label the most relevant ECG beats in the test record and used them to update the weights of the network. In [29], the authors proposed a 1-D convolutional neural network for ECG classification. In [30], the authors used a de-noising auto-encoder (DAE) to enhance the quality of ECG signals by removing different types of residual noise.

Although promising results have been obtained in these studies, further research is required to boost the classification accuracy. In addition, the above works mainly rely on a large initial ECG training set for learning a suitable feature representation of the ECG data in order to mitigate the data-shift problem. However, it is well known that collecting and labeling training data are costly and time-consuming. In this paper, we propose to achieve the above goal using a small training set. To this end, we introduce a transfer learning approach based on pre-trained convolutional neural networks (CNNs). We recall that CNNs were recently introduced for the analysis of ECG signals, where the authors tailored them to one-dimensional signals [29].

While CNNs work well for problems with abundant labeled data, they are prone to over-fitting when dealing with small labeled datasets. A common strategy recently introduced in the computer vision literature is to exploit CNNs pre-trained on a very large labeled dataset and transfer the knowledge to another classification task with limited training data [38]. Examples of pre-trained CNNs include AlexNet [39], VGGNet [40], and GoogLeNet [41], all trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset. Typical transfer methods include fine-tuning the pretrained network using new target data, or using the CNN as a feature extractor and training an external classifier (such as a support vector machine (SVM)) on the new feature representation of the target data.

In this work, we propose using these models to generate a robust feature representation from the raw ECG training data available at hand. Since a pre-trained CNN accepts RGB images, whereas the ECG signals are raw vectors, we propose first converting them into images using the continuous wavelet transform (CWT). The choice of the CWT is motivated by its success at analyzing ECG signals [42,43,44,45,46,47,48,49]. This way of representing the signal can be seen as the over-complete representation used by a special type of network called an auto-encoder, where the dimensionality of the output is higher than the dimensionality of the input [50]. Unlike feature reduction, the over-complete representation allows for the discovery of more robust and sparse feature representations from the data. We then feed the resulting image-like data into the pre-trained CNN to generate their corresponding CNN features. During the learning phase, we train an extra network (placed on top of the CNN) on the available labeled data. Then, we iteratively fine-tune this extra network by allowing the expert to interact with the system and label the most uncertain ECG beats from the records under analysis [28]. In the experiments, we validate the method in a cross-database setting using the MIT-BIH arrhythmia, INCART, and SVDB databases. The obtained results show that the proposed approach provides clear accuracy improvements over recent solutions. This paper conveys the following main contributions: (1) it uses the CWT and a pretrained CNN to learn a robust ECG representation, unlike the method proposed in [28], which is based on a simple DNN initialized with a DAE; (2) it uses a reduced training set (100 ECG beats per class) compared to [28], which relied on a larger training set (50,933 ECG beats); and (3) it is able to achieve better results with fewer ECG beats labeled by the expert.

This paper is organized as follows. A detailed description of ECG data processing is presented in Sect. 2. The proposed approach is presented in Sect. 3, while experimental results are reported in Sect. 4. Finally, conclusions and future directions are reported in Sect. 5.

2 ECG Data Processing

In the experiments, we use three different ECG databases to evaluate the proposed method, as shown in Table 1. We follow the recommendations of the Association for the Advancement of Medical Instrumentation (AAMI) for class labeling. The AAMI standard defines five classes of interest: normal (N), ventricular (V), supraventricular (S), fusion of normal and ventricular (F), and unknown beats (Q). The Q beats are removed from the analysis as they are only marginally represented in these three databases, which were obtained under different acquisition conditions and from different patients.
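For reference, the grouping of raw MIT-BIH annotation symbols into these AAMI classes can be written as a simple lookup; this is a sketch in Python, and the symbol sets follow the usual AAMI convention rather than being listed in this paper:

```python
# Standard grouping of MIT-BIH annotation symbols into the AAMI classes
# (usual AAMI convention; the symbol sets are not listed in this paper).
AAMI_CLASSES = {
    "N": {"N", "L", "R", "e", "j"},   # normal and bundle-branch-block beats
    "S": {"A", "a", "J", "S"},        # supraventricular ectopic beats
    "V": {"V", "E"},                  # ventricular ectopic beats
    "F": {"F"},                       # fusion of normal and ventricular
    # Q = {"/", "f", "Q"} (paced, fusion of paced, unknown) is discarded
}

def to_aami(symbol):
    """Map a raw MIT-BIH annotation symbol to its AAMI class, or None."""
    for aami, symbols in AAMI_CLASSES.items():
        if symbol in symbols:
            return aami
    return None
```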

Table 1 ECG databases used in the experiments

The most commonly used database is the MIT-BIH Arrhythmia Database (MIT-BIH). This database consists of 48 records (taken from 47 patients: 25 men aged 32–89 years and 22 women aged 23–89 years). Each record is slightly over 30 min long and sampled at 360 Hz. The first 23 records, numbered 100 to 124 inclusive with some numbers missing, were chosen at random from a larger set; the remaining 25 records were selected from the same set to include rarer but clinically significant arrhythmias.

To further assess the proposed approach, we use two other databases: the St. Petersburg Institute of Cardiological Technics (INCART) database and the MIT-BIH Supraventricular Arrhythmia Database (SVDB). The INCART database consists of 75 annotated recordings extracted from 32 Holter records. Each record is 30 min long and contains 12 standard leads, each sampled at 257 Hz. The reference annotation files contain over 175,000 beat annotations in all. The original records were collected from patients undergoing tests for coronary artery disease (17 men and 15 women, aged 18–80; mean age: 58). None of the patients had pacemakers, and most had ventricular ectopic beats. In selecting records to be included in the database, preference was given to subjects with ECGs consistent with ischemia, coronary artery disease, conduction abnormalities, and arrhythmias.

The MIT-BIH Supraventricular Arrhythmia Database (SVDB) consists of 78 two-lead recordings of approximately 30 min, each sampled at 128 Hz. The beat type annotations of the recordings were first produced automatically by a Marquette Electronics 8000 Holter scanner and later reviewed and corrected by a medical student, as shown in Table 1. Figure 1 shows the raw ECG signals for the three different datasets (MIT-BIH, INCART, and SVDB).

Fig. 1 Raw ECG data for three different datasets

We preprocess the ECG signals to reduce noise, similar to [31, 34]. To do this, we apply a 200-ms width median filter to remove the P wave and the QRS complex. Then, we use a 600-ms width median filter to remove the T wave. The resulting baseline estimates are subtracted from the original signals, which yields the baseline-corrected ECG signals. Then, we apply a 12th-order low-pass filter with a 35 Hz cut-off frequency to remove power-line and high-frequency noise. To extract the ECG waveforms, we perform QRS detection and ECG wave boundary detection by means of the ecgpuwave software [51]. Then, we resample all segmented ECG signals, which were acquired at different sampling rates, to the same periodic length of 50 uniformly distributed samples, as shown in Fig. 2. Finally, we apply the CWT with three different mother wavelets, as shown in Fig. 3.
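A minimal sketch of this preprocessing chain using SciPy follows; the choice of a Butterworth design for the 12th-order low-pass filter is an assumption, since the text does not name the filter type:

```python
import numpy as np
from scipy.signal import butter, medfilt, resample, sosfiltfilt

def preprocess(sig, fs):
    """Baseline removal and de-noising as described above (a sketch)."""
    # Cascaded 200-ms and 600-ms median filters estimate the baseline
    # wander (window lengths must be odd for medfilt).
    w1 = int(0.2 * fs) | 1
    w2 = int(0.6 * fs) | 1
    baseline = medfilt(medfilt(sig, w1), w2)
    sig = sig - baseline                      # baseline-corrected signal
    # 12th-order low-pass filter with a 35 Hz cut-off frequency.
    sos = butter(12, 35.0, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, sig)

def normalize_length(segment, n_samples=50):
    """Resample a segmented beat to the fixed length of 50 samples."""
    return resample(segment, n_samples)
```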

Fig. 2 Enhanced ECG data for three different datasets

Fig. 3 Representing ECG signals as images using the CWT

Unlike previous studies, we use only 400 ECG training beats (100 from each class) taken from the following 22 records of the MIT-BIH database, termed DS1 = {101, 106, 108, 109, 112, 114, 115, 116, 118, 119, 122, 124, 201, 203, 205, 207, 208, 209, 215, 220, 223, 230} [1]. To extract representative ECG signals from DS1, we independently apply the k-means clustering algorithm to each AAMI class in DS1 to generate 100 clusters. Then, we take the centers of the resulting clusters and form the initial training set. This represents approximately 0.78% of the global training set DS1. The remaining records of this database are grouped into DS2 = {100, 103, 105, 111, 113, 117, 121, 123, 200, 202, 210, 212, 213, 214, 219, 221, 222, 228, 231, 232, 233, 234}, and the other databases are used as test sets for evaluating the capability of the method. Details about the numbers of training and test ECG samples are reported in Table 1.
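The cluster-based selection of the initial training set can be sketched as follows (scikit-learn; the container `beats_by_class` and the random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_training_set(beats_by_class, n_clusters=100, seed=0):
    """Cluster each AAMI class of DS1 into 100 groups and keep the
    cluster centers as representative training beats."""
    Xs, ys = [], []
    for label, beats in beats_by_class.items():   # beats: (n, d) array
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit(beats)
        Xs.append(km.cluster_centers_)            # 100 beats for this class
        ys.extend([label] * n_clusters)
    return np.vstack(Xs), np.asarray(ys)
```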

3 Proposed Methodology

Let \(Tr = \left\{ {{\mathbf{x}}_{i} ,y_{i} } \right\}_{i = 1}^{n}\) be a training set, where \({\mathbf{x}}_{i} \in {\mathcal{R}}^{d}\) is an ECG beat signal, \(y_{i} \in \left\{ {1,2, \ldots ,K} \right\}\) is its corresponding class label, \(K\) is the number of classes, and \(n\) is the number of training samples. Let us also consider \(f^{CNN}\), a CNN model pretrained on the ImageNet dataset. Our aim is to develop a classification system that classifies the test ECG record \(Ts = \left\{ {{\mathbf{x}}_{j} } \right\}_{j = n + 1}^{n + m}\) based on the available training set. Figure 4 shows the flowchart of the proposed method, and detailed descriptions are provided in the next subsections.

Fig. 4 Flowchart of the proposed method: Step (1) apply the CWT to the ECG signals to generate an image-like representation; Step (2) feature extraction; Step (3) classification

3.1 ECG Signal Transformation Using CWT

The pretrained CNN accepts RGB images of dimensions (\(w \times h \times 3\)) as inputs, whereas the ECG signals are raw vectors of dimension \(d\). Thus, it is necessary to first convert them into images using a suitable transformation. This work considers the CWT as a candidate solution, since it has been shown to be efficient for analyzing ECG signals [42,43,44,45,46,47,48,49]. Basically, the CWT maps the signals into a time-scale space. In addition, it allows for a more clearly visible localization of the frequency components in the analyzed signals.

Mathematically speaking, the CWT of a given function \({\mathbf{x}}(t)\) is defined as the integral transform of \({\mathbf{x}}(t)\) with a family of wavelet functions \(\psi_{a,b} (t)\):

$$CWT(a,b) = \frac{1}{\sqrt a }\int\limits_{ - \infty }^{ + \infty } {{\mathbf{x}}(t)\,\psi^{*} \left( {\frac{t - b}{a}} \right)dt}$$
(1)

Here, \(\psi^{*}\) denotes the complex conjugate of \(\psi\). Equivalently, the CWT can be seen as the integral of the signal multiplied by scaled and shifted versions of the wavelet function \(\psi\):

$${\text{CWT}}({\text{scale}},{\text{position}}) = \int\limits_{ - \infty }^{ + \infty } {x(t)\,\psi ({\text{scale}},{\text{position}},t)\;dt}$$
(2)

The function \(\psi (t)\) is known as the mother wavelet, and the functions \(\psi_{a,b} (t)\) are called daughter wavelets. The daughter wavelets are simply obtained by scaling and shifting the mother wavelet. The scale factor \(a\) represents the scaling of the function \(\psi (t)\), while the shift factor \(b\) represents its temporal translation. The modulus \({\mathbf{I}} = {\text{modulus}}(CWT)\) of the complex wavelet coefficients produced by this transformation can then be viewed as an image. In this way, the training set \(Tr = \left\{ {{\mathbf{x}}_{i} ,y_{i} } \right\}_{i = 1}^{n}\) and the test record \(Ts = \left\{ {{\mathbf{x}}_{j} } \right\}_{j = n + 1}^{n + m}\) are transformed into \(Tr = \left\{ {{\mathbf{I}}_{i} ,y_{i} } \right\}_{i = 1}^{n}\) and \(Ts = \left\{ {{\mathbf{I}}_{j} } \right\}_{j = n + 1}^{n + m}\), respectively.
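A minimal sketch of this transformation using PyWavelets follows; note that `pywt.cwt` supports continuous wavelets such as the Morlet used here, whereas the db4, bior3.5, and coif3 wavelets adopted in Sect. 5 are available in, e.g., MATLAB's `cwt`:

```python
import numpy as np
import pywt

def beat_to_image(beat, scales=np.arange(1, 65), wavelet="morl"):
    """Map a 1-D beat to a 2-D time-scale image via the CWT.
    Returns the modulus I = |CWT(a, b)| as a (64, len(beat)) array."""
    coeffs, _ = pywt.cwt(beat, scales, wavelet)
    return np.abs(coeffs)
```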

3.2 Feature Extraction Using a Pretrained CNN

The CNN is a deep learning model that attempts to learn feature representations for signal/image data at different levels of abstraction [12]. It is made up of several alternating convolutional and pooling layers, followed by fully connected layers. The convolutional layer is the main building block of the CNN and the representative structure of deep models. The outputs of this layer are termed feature maps, and their number depends on the number of filters used in the convolutional layer. The feature maps, produced by convolving learnable filters across the input image, are usually fed to a non-linear gating function, such as the rectified linear unit (ReLU) [52]. Mathematically, let \(x^{i}\) and \(y^{j}\) denote the i-th input feature map and the j-th output feature map of a convolutional layer [12]. The activation function applied in the CNN can be expressed as:

$$y^{j} = { \hbox{max} }\left( {0,b^{j} + \mathop \sum \limits_{i} z^{ij} * x^{i} } \right)$$
(3)

where \(z^{ij}\) is the convolutional kernel between \(x^{i}\) and \(y^{j}\), and \(b^{j}\) is the bias. The symbol ∗ indicates the convolution operation. When there are M input maps and N output maps, this layer contains N 3-D kernels of size d × d × M (where d × d is the size of the local receptive field), and each kernel has its own bias.

Furthermore, the output of the activation function can be subjected to normalization to help generalization. The pooling layers shrink the size of the feature maps (reducing the data dimensions) by producing a single output from each block, typically by taking the average or the maximum.

Then, after several convolutional and pooling layers, the high-level reasoning in the neural network is carried out by fully connected layers, which act as the classification stage at the output end of the network. The last layer takes all neurons in the previous layer and connects them to every single neuron it contains. In the case of classification, we add a softmax layer to the end of this network. The complete set of weights is then learned using the back-propagation algorithm.
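The alternating structure described above can be illustrated with a minimal PyTorch sketch; the layer sizes are illustrative and are not those of the pretrained model of Sect. 4.1:

```python
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),  # convolution + ReLU gating
    nn.MaxPool2d(2),                             # pooling shrinks the maps
    nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 54 * 54, 4),                  # fully connected classifier
)
# For a 224 x 224 RGB input, a softmax over the 4 outputs yields the class
# posteriors; all weights are learned with back-propagation.
```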

3.3 ECG Classification

Since the CNN is composed of several layers, we can extract features at different representation layers. Here, we take the output of the last fully connected layer to represent the data. That is, we feed each image \({\mathbf{I}}_{\varvec{i}}\) as an input to the CNN and generate a CNN feature representation \({\mathbf{z}}_{\varvec{i}} \in R^{D}\):

$${\mathbf{z}}_{\varvec{i}} = f^{CNN} \left( {{\mathbf{I}}_{\varvec{i}} } \right), i = 1, \ldots ,n + m$$
(4)
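As a sketch of Eq. (4), torchvision's VGG-16 can stand in for the 8-layer VGG-style model of Sect. 4.1 (which torchvision does not bundle); dropping the final 1000-way classifier leaves the 4096-dimensional activation of the last fully connected layer:

```python
import torch
import torchvision.models as models

# VGG-16 pretrained on ImageNet as a stand-in feature extractor f_CNN.
cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
f_cnn = torch.nn.Sequential(
    cnn.features,                             # convolutional + pooling layers
    cnn.avgpool,
    torch.nn.Flatten(),
    *list(cnn.classifier.children())[:-1],    # keep up to the last 4096-d layer
)

# images: an (n + m, 3, 224, 224) tensor of resized CWT images (assumed).
with torch.no_grad():
    z = f_cnn(images)                         # z_i in R^4096 for each image
```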

We feed these CNN feature vectors into an extra network placed on top of the pretrained CNN, as shown in Fig. 4. This extra network is composed of a hidden layer and a softmax regression layer. The hidden layer takes the input \(\varvec{z}_{i}\) and maps it to another representation \(\varvec{h}_{i}^{(1)} \in R^{{D^{(1)} }}\) of dimension \(D^{(1)}\) through the nonlinear activation function \(f\):

$$\varvec{h}_{i}^{(1)} = f\left( {\varvec{W}^{\left( 1 \right)} \varvec{z}_{i} + {\mathbf{b}}^{(1)} } \right)$$
(5)

where \(\varvec{W}^{(1)} \in {\Re }^{{D^{(1)} \times D}}\) represents the mapping weight matrix and \({\mathbf{b}}^{(1)} \in R^{{D^{(1)} }}\) is the mapping bias vector. A typical choice of the activation function is the sigmoid \(f\left( v \right) = 1/(1 + \exp \left( { - v} \right))\). For ease of analysis, we omit the bias vector in the following expressions, since it can be incorporated as an additional column in the mapping matrix; in that case, the feature vector is augmented by a constant entry of 1.

The softmax regression performs the multiclass classification and takes the resulting hidden representation \(\varvec{h}_{i}^{(1)}\) as input. It produces an estimate of the posterior probability for each class label \(k = 1,2, \ldots ,K\) as follows:

$$p\left( {\left. {\hat{y}_{i} = k} \right|\varvec{x}_{i} } \right) = \frac{{{ \exp }\left( {\left( {\varvec{w}_{k}^{(2)} } \right)^{\text{T}} \varvec{h}_{i}^{(1)} } \right)}}{{\mathop \sum \nolimits_{j = 1}^{K} { \exp }\left( {\left( {\varvec{w}_{j}^{(2)} } \right)^{\text{T}} \varvec{h}_{i}^{(1)} } \right)}}$$
(6)

where \(\varvec{W}^{(2)} = \left[ {\varvec{w}_{1}^{(2)} \varvec{w}_{2}^{(2)} \ldots \varvec{w}_{K}^{(2)} } \right] \in R^{{D^{(1)} \times K}}\) contains the weights of the softmax regression layer and the superscript \(\left( \cdot \right)^{\text{T}}\) refers to the transpose operation.

We use the dropout technique introduced by Hinton et al. [53] to prevent the network from over-fitting and to increase its generalization ability. This regularization technique drops nodes of the hidden layer, along with their weights, during the training phase. This generates a thinned network by temporarily removing nodes from the original fully connected network, together with all their incoming and outgoing connections. Typically, the dropout regularization technique defines \(\varvec{r} \in R^{{D^{(1)} }}\) (of the same dimension as the hidden representation \(\varvec{h}_{i}^{(1)}\)) as a vector of independent Bernoulli random variables, each of which has a probability \(\rho\) (usually set to 0.5) of being 1. At training time, the output of the hidden layer after dropout is:

$$\left\{ {\begin{array}{*{20}c} {\varvec{r} = {\text{Bernoulli}}(\rho )} \\ {\varvec{h}_{i}^{(1)} : = \varvec{h}_{i}^{(1)} \odot \varvec{r}} \\ \end{array} } \right.$$
(7)

with \(\odot\) denoting the element-wise product. At test time, dropout is turned off and all hidden units are used, but the weights are scaled by the retention probability \(\rho\).

To learn the vector of weights \(\varvec{\theta}= \left\{ {\varvec{W}^{(1)} ,{\mathbf{W}}^{(2)} ,{\mathbf{b}}^{(1)} } \right\}\) representing the complete network architecture, we minimize the error between the actual network outputs and the desired outputs on the training data. As the outputs of the network are probabilistic, we maximize the log-posterior probability to learn the network weights, which is equivalent to minimizing the so-called cross-entropy error:

$$E_{\text{net}} = - \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{k = 1}^{K} 1\left( {y_{i} = k} \right){ \ln }\left( {\frac{{\exp \left( {\left( {\varvec{w}_{k}^{\left( 2 \right)} } \right)^{\text{T}} \varvec{h}_{i}^{(1)} } \right)}}{{\mathop \sum \nolimits_{j = 1}^{K} \exp \left( {\left( {\varvec{w}_{j}^{\left( 2 \right)} } \right)^{\text{T}} \varvec{h}_{i}^{(1)} } \right)}}} \right)$$
(8)

where \(1\left( \cdot \right)\) is an indicator function that takes the value 1 if the statement is true and 0 otherwise.
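A minimal sketch of this extra network and its loss in PyTorch follows; the hidden dimension \(D^{(1)}\) is an assumption, and PyTorch implements inverted dropout, which rescales activations during training instead of scaling weights at test time (equivalent in expectation):

```python
import torch
import torch.nn as nn

class ExtraNet(nn.Module):
    """Sigmoid hidden layer with dropout, followed by a softmax regression
    layer. D = 4096 and K = 4 (N, S, V, F) follow the text; the hidden
    size D1 is an assumption."""
    def __init__(self, D=4096, D1=512, K=4, rho=0.5):
        super().__init__()
        self.hidden = nn.Linear(D, D1)      # W(1) and b(1) of Eq. (5)
        self.drop = nn.Dropout(p=1 - rho)   # rho = probability of keeping a unit
        self.out = nn.Linear(D1, K)         # softmax weights W(2) of Eq. (6)

    def forward(self, z):
        h = torch.sigmoid(self.hidden(z))   # Eq. (5)
        h = self.drop(h)                    # Bernoulli masking of Eq. (7)
        return self.out(h)                  # logits; softmax applied in the loss

# CrossEntropyLoss combines log-softmax with the negative log-likelihood,
# i.e. the cross-entropy error E_net of Eq. (8).
criterion = nn.CrossEntropyLoss()
```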

4 Experimental Results

4.1 Experiment Setup

For the pretrained CNN, we use the VGGNet model of [40], which is composed of 8 layers. It uses five convolutional layers with filters of dimensions (number of filters × filter height × filter width: 96 × 7 × 7, 256 × 5 × 5, 512 × 3 × 3, 512 × 3 × 3, and 512 × 3 × 3) and three fully connected layers with the following numbers of hidden nodes (fc1: 4096, fc2: 4096, and softmax: 1000). This network was pretrained on the ILSVRC-12 challenge dataset. We recall that the ImageNet dataset used in this challenge is composed of 1.2 million RGB images of size \(224 \times 224\) pixels belonging to 1000 classes. These classes describe generic objects and scenes, such as beaches, dogs, cats, cars, shopping carts, and minivans. As can clearly be seen, this auxiliary dataset is completely different from the ECG signals used in the experiments.

For training the extra network placed on top of the pretrained CNN, we follow the recommendations of [54] for training neural networks. We set the dropout probability \(\rho\) to 0.5 and use a sigmoid activation function for the hidden layer. For the back-propagation algorithm, we use a mini-batch gradient optimization method with the following parameters: learning rate 0.01, momentum 0.5, and mini-batch size 50. The weights of the network are initialized in the range [− 0.005, 0.005]. Regarding the active learning (AL) step, we allow the expert to add \(N_{AL} = 10\) ECG beats at each iteration of the AL process.
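These settings translate into the following training sketch, reusing `ExtraNet` and `criterion` from the sketch in Sect. 3.3; `z_train`, `y_train`, and `num_epochs` are assumed placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = ExtraNet()
for p in model.parameters():
    torch.nn.init.uniform_(p, -0.005, 0.005)       # weights in [-0.005, 0.005]

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
loader = DataLoader(TensorDataset(z_train, y_train),  # CNN features + labels
                    batch_size=50, shuffle=True)

model.train()                                      # enables dropout
for epoch in range(num_epochs):
    for z_b, y_b in loader:
        optimizer.zero_grad()
        loss = criterion(model(z_b), y_b)          # cross-entropy of Eq. (8)
        loss.backward()
        optimizer.step()
```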

For the performance evaluation, we present the results in terms of VEB [V class versus (N, S, and F)] and SVEB [S class versus (N, V, and F)]. In particular, we use the standard measures of sensitivity \((Se)\), positive predictive value \((Pp)\), specificity \((Sp)\), and overall accuracy \((OA)\) [31, 37, 55].
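These one-vs-rest measures can be computed as follows (a sketch; variable names are illustrative):

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive):
    """Se, Pp, Sp, and OA for a one-vs-rest evaluation, e.g. VEB = class V
    versus {N, S, F}; y_true and y_pred hold AAMI class labels."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    se = tp / (tp + fn)                    # sensitivity
    pp = tp / (tp + fp)                    # positive predictive value
    sp = tn / (tn + fp)                    # specificity
    oa = (tp + tn) / (tp + fp + fn + tn)   # overall accuracy
    return se, pp, sp, oa
```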

5 Results and Discussions

As mentioned in the first step of the method, we first apply the CWT to the ECG signals to represent them as images before feeding them to the CNN. In this context, we explored various wavelet families at different scales to identify a suitable initial representation for the different ECG classes. After an extensive analysis, we experimentally found that the Daubechies wavelet (db4), the biorthogonal wavelet (bior3.5), and the Coiflet wavelet (coif3) represent good choices. Therefore, we apply these wavelets with scales from 1 to 64 to the ECG signals. Then, for every transformation, we compute the modulus of the obtained coefficients. Figure 3 depicts some views of the images obtained by applying these wavelets to the AAMI ECG classes. These views show interesting differences between the classes in terms of their discriminatory ability. We then resize the images to \(224 \times 224\), feed them to the pretrained CNN, and take the output of its last fully connected layer, which produces a CNN feature vector of dimension \(D = 4096\).

For the MIT-BIH database, we present the results by considering three different scenarios for building the test set, as has been done by several works dealing with the AAMI classes [28]. For the first scenario, we use the 11 common testing records for VEB {i.e., 200, 202, 210, 213, 214, 219, 221, 228, 231, 233, 234} and the 14 testing records for SVEB {i.e., 200, 202, 210, 212, 213, 214, 219, 221, 222, 228, 231, 232, 233, 234}. For the second, we use the 24 common testing records from 200 to 234. For the third and last scenario, we use all 48 records (i.e., DS1 + DS2). Figures 5, 6, 7 and 8 show the CNN features obtained from the pretrained CNN for the different AAMI classes. A preliminary inspection shows that the learned features look different across classes. Tables 2 and 3 show the classification results in terms of \(\left( {OA, Se, Sp,{\text{ and }}Pp} \right)\) for VEB and SVEB, respectively, after adding 100 ECG beats per record. As seen here, these results are clearly better than those obtained by recent state-of-the-art methods for all scenarios. For the first scenario, our method yields \(\left( {OA, Se, Sp,{\text{ and }}Pp} \right)\) equal to (99.9, 99.1, 100.0, and 99.3%) for VEB and (99.9, 97.3, 100.0, and 99.3%) for SVEB. For the second scenario, we obtain (99.7, 98.7, 99.9, and 98.8%) for VEB and (99.3, 98.3, 99.4, and 98.1%) for SVEB. Finally, for the last one, we obtain (99.9, 99.0, 100.0, and 99.6%) for VEB and (99.8, 95.9, 100.0, and 98.3%) for SVEB.

Fig. 5 Example of a CNN feature for class N

Fig. 6 Example of a CNN feature for class S

Fig. 7 Example of a CNN feature for class V

Fig. 8 Example of a CNN feature for class F

Table 2 VEB Classification results for the MIT-BIH database
Table 3 SVEB classification results for the MIT-BIH database

For the other databases, the \(\left( {OA, Se, Sp,{\text{ and }}Pp} \right)\) values obtained by the method in terms of VEB are equal to (99.23, 96.5, 99.7, and 98.0%) for the INCART database and (99.4, 91.7, 99.8, and 96.7%) for the SVDB database, as shown in Table 4. For SVEB, the corresponding values are (99.82, 89.30, 99.93, and 97.50%) for the INCART database and (98.4, 80.2, 99.7, and 94.9%) for the SVDB database, as shown in Table 5.

Table 4 VEB classification results obtained for the INCART and SVDB databases
Table 5 SVEB classification results obtained for the INCART and SVDB databases

6 Conclusions

In this paper, we presented a method based on a CNN for the classification of ECG signals. Compared to existing solutions, this method has the following attractive properties: (1) it transfers knowledge from a CNN pretrained on a large labeled image dataset (ImageNet) from a completely different domain, namely computer vision; (2) it exploits the CWT to make the ECG signals suitable inputs for this network; and (3) it uses an efficient AL strategy to fine-tune the extra network placed on top of the CNN by minimizing the cross-entropy error with dropout regularization. Experiments carried out on a non-GPU unit and on three cross-database ECG datasets (obtained under different acquisition conditions) confirmed its efficiency and ability to provide improved classification results versus several other methods. In the future, we plan to improve this method through several modifications aimed at further reducing expert interaction while improving accuracy. These improvements include: (1) learning suitable representations for transforming the ECG data into images instead of applying the CWT, (2) fusing several pretrained CNN models to generate more robust feature representations, and (3) exploring other AL criteria for identifying the most relevant ECG beats in the test records.