1 Introduction

ECG is a non-stationary biomedical signal that represents the electrical activity of the heart. Analysis of ECG signals gives essential diagnostic information about heart diseases, and any abnormality of heart functioning is reflected in the ECG. A typical ECG signal consists of a repetitively occurring series of beats. The features of ECG signals are described by the wave peaks P, Q, R, S, T and U, the intervals P-R, R-R and Q-T, and the segments P-R and S-T [14]. The span between two waves of an ECG is known as a segment, while the time duration inclusive of one or more waves and a segment corresponds to an interval, as shown in Fig. 1. There is considerable ongoing research on generating and analyzing biomedical signals synthetically. Significant work in this area was done by McSharry and Clifford, who proposed a dynamic model (ecgsyn) based on ordinary differential equations that was used to generate realistic ECG waveforms [22]. The model generates a three-dimensional trajectory, and the variation of this trajectory at different points around the limit cycle reflects the ECG. Based on this dynamical model, filtering, segmentation, compression and classification of ECG signals have also been proposed [7, 8, 12, 25]. Another model, put forth by N. Jafarnia et al., modifies the Zeeman model to generate heart rate variability (HRV) signals and enhances the original Zeeman model using the Van der Pol-Liénard equation [8, 16]. An approximation-based mathematical approach using different kinds of base functions has also been presented [20]. In the same way, Ackora–Prah et al. developed a model in MATLAB that uses different built-in functions to generate artificial ECG signals; the temporal dynamics were captured using a computer-based model [1]. Similarly, other ECG generation models can be made more realistic by coupling the dynamics of the standard map. Quiroz–Juárez et al. proposed a set of three nonlinear oscillators to simulate the pacemakers (SA, AV, His-Purkinje) of the heart [23]. The oscillators are based on a discretized form of the Barrio–Varea–Aragon–Maini (BVAM) model, a generic reaction-diffusion system that can generate wavefronts, limit cycles and Turing patterns [4].

Fig. 1 Basic ECG signal

In the last decade, a number of research papers have reported different soft-computing techniques for the classification of ECG signals on the basis of real ECG features. In one approach, a fuzzy-hybrid neural network classification technique was proposed that uses auto-regressive model coefficients, higher-order cumulants and wavelet transform variances as features [11]. Big data analysis is essential for managing healthcare data and enhancing the quality of hospitals. Numerous optimization methods have also been proposed for ECG beat classification [10, 18, 19, 21], and some studies address classification using machine learning models [2, 17, 24]. Machine learning analysis offers a robust framework for predicting the targets. ECG signal datasets tend to be imbalanced because the availability of real ECG data from patients is governed by privacy issues [28]. There has therefore always been a need for more ECG data, especially for training machine learning models that perform automatic diagnosis and give better results on balanced datasets. Access to medical data is also highly restricted due to its sensitive nature, which prevents communities from using this data for research or clinical training. In this regard, this paper proposes a method that enables the extraction of a new dataset based on cubic spline construction, which requires only a few critical points on the ECG. The resulting dataset is then classified using machine learning algorithms, and the results obtained conform with the actual dataset. The rest of the paper is organized as follows. In Section 2, the sequence of steps involved in the proposed method is discussed by means of a flowchart, the methodology of cubic spline construction using critical point selection on the ECG signal is described, and the classification techniques used are briefly reviewed. Section 3 describes the results, followed by a discussion in Section 4. The last part is the conclusion.

2 Methodology and Proposed Sequence of Instructions

Fig. 2 Algorithm of proposed method

The main aim of the proposed system is to generate a real ECG signal without any noise and to predict any disturbance in the ECG signal early enough to help prevent loss of life. Figure 2 shows the procedure of the proposed system. It consists of two parts. The first part includes four steps: signal acquisition, control point detection, extraction of the slope at each control point, and spline model construction. The second part involves two steps: classification and prediction.

In this section, the details regarding the cubic spline curve and choice of control points are described. Further, classification techniques are briefed.

2.1 Parametric Cubic Spline Curve

The parametric cubic spline is an interpolation curve, also known as the Hermite curve. It connects two data endpoints by means of a cubic equation. The parametric equation of a cubic spline segment is given by [15, 27]

$$\begin{aligned} p(u)=\sum _{i=0}^{3}H_{i}u^{i}, 0\le u \le 1 \end{aligned}$$
(1)

where p(u) is the position vector of a point on the curve, \(\overrightarrow{p}(u)=[x\ y\ z]^{T}\). It is a function of the parameter u in parametric space and corresponds to a point in Cartesian space with components x, y and z; \(H_{i}\) are the polynomial coefficients. To determine the polynomial coefficients of a cubic Hermite curve, four values are required: the first and last points on the curve, i.e., the endpoints \(p_{0},p_{1}\), and the slopes at these points, i.e., \(p_{0}^{\prime },p_{1}^{\prime }\). The tangent vector at any point on the curve is obtained by differentiating (1) with respect to u. After applying the boundary conditions (\(p_{0},p_{0}^{\prime }\) at u=0 and \(p_{1},p_{1}^{\prime }\) at u=1), the final form of (1) for the cubic Hermite curve is

$$\begin{aligned} p(u)= & {} (2u^{3}-3u^{2}+1)p_{0}+(-2u^{3}+3u^{2})p_{1}+(u^{3}-2u^{2}+u)p_{0}^{\prime }\nonumber \\ {}{} & {} \quad +(u^{3}-u^{2})p_{1}^{\prime } \end{aligned}$$
(2)

where \(0\le u \le 1\), \(p_{0},p_{1},p_{0}^{\prime },p_{1}^{\prime }\) are geometric coefficients and the tangent vector becomes

$$\begin{aligned} p^{\prime }(u)= & {} (6u^{2}-6u)p_{0}+(-6u^{2}+6u)p_{1}+(3u^{2}-4u+1)p_{0}^{\prime }\nonumber \\ {}{} & {} \quad +(3u^{2}-2u)p_{1}^{\prime }, 0\le u \le 1 \end{aligned}$$
(3)
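Equations (1) to (3) can be sketched for a single coordinate as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the algebraic coefficients \(H_{0}\) to \(H_{3}\) are obtained here by collecting the powers of u in (2).

```python
def hermite_coeffs(p0, p1, d0, d1):
    """Algebraic coefficients (H0, H1, H2, H3) of Eq. (1).

    p0, p1 are the endpoint values and d0, d1 the endpoint slopes
    (the geometric coefficients of Eq. (2)) for one coordinate.
    """
    h0 = p0
    h1 = d0
    h2 = -3 * p0 + 3 * p1 - 2 * d0 - d1
    h3 = 2 * p0 - 2 * p1 + d0 + d1
    return h0, h1, h2, h3


def hermite_point(p0, p1, d0, d1, u):
    """Position on the segment at parameter u in [0, 1], Eq. (2)."""
    return ((2 * u**3 - 3 * u**2 + 1) * p0
            + (-2 * u**3 + 3 * u**2) * p1
            + (u**3 - 2 * u**2 + u) * d0
            + (u**3 - u**2) * d1)


def hermite_tangent(p0, p1, d0, d1, u):
    """Derivative of the segment with respect to u, Eq. (3)."""
    return ((6 * u**2 - 6 * u) * p0
            + (-6 * u**2 + 6 * u) * p1
            + (3 * u**2 - 4 * u + 1) * d0
            + (3 * u**2 - 2 * u) * d1)
```

For a planar ECG curve the same functions would be applied separately to the x (time) and y (amplitude) coordinates of each pair of adjacent control points.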

The expression given by (2) represents only one cubic spline segment, but it can be generalized to any two adjacent segments of a spline curve that is fitted to a given number of data points. Given a set of n data points and the two end tangent vectors, the points can be connected by a cubic spline curve, but the tangent vectors at the intermediate points are also needed. To eliminate the need to specify these vectors, continuity of the curvature at these points can be imposed. For two Hermite spline segments to join with second-order continuity, the second derivative at the end of the first segment must equal the second derivative at the beginning of the second segment. Thus, it can be written as:

$$\begin{aligned} p^{\prime \prime }_{(u_{1}=1)}=p^{\prime \prime }_{(u_{2}=0)} \end{aligned}$$
(4)

where \(p^{\prime \prime }\) represents the second derivative and the subscript of u denotes the segment number. Using this relation, the tangent vector at the end of the first segment, which is also the tangent vector at the beginning of the second segment, can be derived. Differentiating (3) and substituting the result into (4) gives the following expression.

$$\begin{aligned} p_{1}^{\prime }=-\frac{1}{4}(3p_{0}+p_{0}^{\prime }-3p_{2}+p_{2}^{\prime }) \end{aligned}$$
(5)

The unknown tangent vector or slope can be determined by (5). For more segments, the same procedure is repeated at every intermediate point, which yields a matrix equation. By solving this matrix equation, the intermediate tangent vectors can be determined [27].
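A minimal sketch of this matrix solution for one coordinate is given below. It assumes that the two end slopes are known and that every segment is parameterized over \(0\le u \le 1\). Under these assumptions, differentiating (3) and applying (4) at each intermediate point i gives \(p_{i-1}^{\prime }+4p_{i}^{\prime }+p_{i+1}^{\prime }=3(p_{i+1}-p_{i-1})\), a tridiagonal system. The function name and the choice of the Thomas algorithm are illustrative and are not taken from the paper.

```python
def spline_tangents(points, d_start, d_end):
    """Slopes at all points of a curvature-continuous Hermite spline.

    points  : values of one coordinate at the control points
    d_start : known slope at the first point
    d_end   : known slope at the last point
    Solves d[i-1] + 4*d[i] + d[i+1] = 3*(points[i+1] - points[i-1])
    for the interior slopes with the Thomas (tridiagonal) algorithm.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    m = n - 2  # number of unknown interior slopes
    if m == 0:
        return [d_start, d_end]

    rhs = [3.0 * (points[i + 1] - points[i - 1]) for i in range(1, n - 1)]
    rhs[0] -= d_start   # move the known end slopes to the right-hand side
    rhs[-1] -= d_end

    # Forward elimination (diagonal 4, off-diagonals 1).
    c = [0.0] * m
    g = [0.0] * m
    c[0] = 1.0 / 4.0
    g[0] = rhs[0] / 4.0
    for k in range(1, m):
        denom = 4.0 - c[k - 1]
        c[k] = 1.0 / denom
        g[k] = (rhs[k] - g[k - 1]) / denom

    # Back substitution.
    d = [0.0] * m
    d[-1] = g[-1]
    for k in range(m - 2, -1, -1):
        d[k] = g[k] - c[k] * d[k + 1]

    return [d_start] + d + [d_end]
```

As a quick check, points that lie on a straight line with unit end slopes, e.g. `spline_tangents([0, 1, 2, 3], 1, 1)`, should give a slope of 1 at every point.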

Table 1 A set of control points:\(C_{1}\) to \(C_{13}\)
Table 2 Cubic spline curve coefficients:\(H_{0}\) to \(H_{3}\)

To do this, a real ECG wave is taken from the PhysioNet database [13], 13 control points are chosen according to their diagnostic importance, and a parametric spline curve is constructed. These control points are clinically relevant morphological critical points, and their locations indicate important changes in cardiac activity that clinicians can use to determine the health state of patients. The control points chosen include all the peaks and valley points, which are of paramount significance in ECG analysis, together with some intermediate points. Control points directly reflect the cardiac activity. The control points and their values are shown in Table 1, and all 13 control points are shown graphically in Fig. 3 (PhysioNet database [13]). Table 2 lists the coefficients \(H_{0}\) to \(H_{3}\) for all 12 cubic spline curves between the 13 control points, with u varying between 0 and 1 in each case as time progresses.

Fig. 3 Control point locations on ECG wave

The spline model of the ECG with all 13 control points, showing the ECG intervals, segments and locations of the wave points, is given in Fig. 4, which is the spline approximation of Fig. 3. In this section, a database based on parametric cubic splines has been prepared using the following four steps:

  • Firstly, from a number of normal and abnormal ECG signals, normal sinus data, congestive heart failure data and atrial fibrillation data have been taken.

  • Secondly, 27 control points are considered between two consecutive cycles (R to R) of each ECG database.

  • Thirdly, the slope of the x component, p(x), and the slope of the y component, p(y), with respect to \(t:(0\le t \le 1)\) are calculated at the data points using (3), and the slope angle is obtained as

    $$\begin{aligned} \theta =\text {tan}^{-1}{\frac{\frac{\partial y}{\partial t}}{\frac{\partial x}{\partial t}}} \end{aligned}$$
    (6)
  • Finally, based on the extracted slopes, new datasets are obtained and classified with three classifier algorithms: SVM (support vector machines), CN2 rule induction and Tree.
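The third step can be sketched as follows. This is a minimal illustration with hypothetical function names: each segment is described by its geometric coefficients (start value, end value, start slope, end slope) for x and for y, and (3) is evaluated for both coordinates. One assumption to note is that `math.atan2` is used in place of a plain arctangent of the ratio in (6); the two agree whenever \(\partial x/\partial t>0\), and `atan2` avoids a division by zero.

```python
import math


def hermite_tangent(p0, p1, d0, d1, u):
    """Derivative of one Hermite segment coordinate with respect to u, Eq. (3)."""
    return ((6 * u**2 - 6 * u) * p0
            + (-6 * u**2 + 6 * u) * p1
            + (3 * u**2 - 4 * u + 1) * d0
            + (3 * u**2 - 2 * u) * d1)


def slope_angle(x_seg, y_seg, t):
    """Slope angle of Eq. (6), in radians, at parameter t in [0, 1].

    x_seg and y_seg are (p0, p1, d0, d1) tuples for the x and y
    coordinates of one spline segment.
    """
    dx_dt = hermite_tangent(*x_seg, t)
    dy_dt = hermite_tangent(*y_seg, t)
    return math.atan2(dy_dt, dx_dt)
```

Evaluating `slope_angle` at each control point of a beat would give one feature vector of slope angles per beat, which is the kind of input the classifiers in the last step operate on.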

Fig. 4 Modeled ECG wave

2.2 Model Classification

Classification divides the input dataset into categories or groups. A classification model learns from a part of the dataset known as the training dataset and then predicts the category of new data, called the test dataset. Early and accurate prediction of any type of disturbance in the ECG wave is important for detecting heart-related disease and choosing an appropriate treatment for a patient. In this paper, classification using three different machine learning algorithms is proposed. Orange is a powerful component-based visual programming software package for data mining, machine learning and data analysis. It is well suited to this work because it provides many machine learning algorithms and, most importantly, is open source [9, 26]. The analysis of the proposed method is done using Orange.

The three techniques used for classification are briefly described below. SVM and Tree are supervised classification techniques that offer low computational complexity, high classification accuracy and the ability to deal efficiently with large datasets that have redundant data and high dimensionality [3, 6]. The rule-based system CN2 is also considered because it is an expert system model that contains a set of rules, usually in the form of ’If-then.’ Rule-based system models are reliable in many tasks such as classification, regression, association and prediction [5]. CN2 also shares the essence of the popular decision tree learning algorithm.

2.2.1 Support Vector Machines (SVM)

SVM is a supervised machine learning technique that can be used for both regression and classification problems, although it is mostly used for classification. An SVM with a small number of support vectors can generalize well. In this technique, each data item is first plotted as a point in n-dimensional space, where n is the number of features in the dataset. Then, for classification, the hyperplane (or best-fit line) that best separates the two groups is found [6].

2.2.2 CN2 Rule Induction

CN2 is an algorithm for inducing propositional classification rules (Clark, 1989). This algorithm consists of two stages [5]:

  • The best single rule is found using a heuristic search procedure, and if the created rule by itself does not cover a positive example, all examples covered by this rule are deleted from the currently analyzed set of all positive examples.

  • A new rule is created for the remaining data set. Stages 1 and 2 are repeated until the rule set satisfies a quality threshold or the positive example set is empty.

2.2.3 Tree

The decision tree is a supervised learning method for classification. It can handle data in both numerical and categorical form. A decision tree may be a classification tree or a regression tree: in a classification tree the target value is categorical, and in a regression tree the target value is a continuous quantity [3].

2.2.4 Performance Metrics of Classifiers

To identify the best of the three classifiers, the following measures are used:

  • AUC (area under the curve): a measure of separability; for an excellent model, AUC should be close to 1.

  • CA (classification accuracy): the ratio of the number of correctly predicted instances to the total number of instances.

  • Precision and recall: these measures handle imbalanced datasets efficiently. Precision is preferred when avoiding false positives is more important than avoiding false negatives.

  • F1: the F1-score is the harmonic mean of precision and recall.
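The measures that depend only on a confusion matrix (CA, precision, recall and F1) can be computed as in the sketch below. Two assumptions are made: per-class values are combined by a support-weighted average, which may differ from the averaging used by a particular tool, and the function name is hypothetical. AUC is omitted because it requires the classifier scores rather than the confusion matrix alone.

```python
def classification_metrics(cm):
    """CA, precision, recall and F1 from a square confusion matrix.

    cm[i][j] is the number of instances of actual class i that were
    predicted as class j. Per-class precision, recall and F1 are
    combined by a support-weighted average (an assumption).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    per_class = []
    for i in range(n):
        tp = cm[i][i]
        actual = sum(cm[i])                          # row sum
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1, actual))

    def weighted(idx):
        return sum(m[idx] * m[3] for m in per_class) / total

    return {
        "CA": accuracy,
        "precision": weighted(0),
        "recall": weighted(1),
        "F1": weighted(2),
    }


# Hypothetical two-class matrix, for illustration only (not data from this paper).
example = classification_metrics([[9, 1], [2, 8]])
```

With support-weighted averaging, the averaged recall always equals CA, which is a convenient sanity check on any reported table of scores.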

3 Results

This section presents the experimental results based on a dataset of size 120. Three experiments using the machine learning methods SVM, CN2 rule induction and Tree were performed on the dataset. Test and Score reports the performance measures of the different classifiers in tabular form, as shown in Table 3. The confusion matrix makes it possible to calculate other performance measures related to the experiment. The confusion matrices showing the evaluation results of the Tree, CN2 and SVM classifiers are given in Tables 4, 6 and 5, respectively, where ’A’ represents Atrial Fibrillation, ’C’ represents Congestive Heart Failure, ’N’ represents Normal Sinus Rhythm, and ’\(\sum \)’ represents, along a column, the number of cases taken from each class and, along a row, the number of cases classified in each class.

Table 3 Test and score of all three classifiers
Table 4 Confusion matrix of tree classifier
Table 5 Confusion matrix of SVM classifier
Table 6 Confusion matrix of CN2 classifier

4 Discussion

In this paper, a two-part framework was developed. In the first part, an efficient parametric representation of ECG signals using cubic splines was proposed. The parametric cubic spline provides curvature continuity and is constructed from a minimum number of data points, which are treated as control points. In some domains, the control points represent characteristic points that indicate crucial changes reflected by the signal. Information about the exact locations of these crucial points provides deep insight for diagnosis, analysis and therapeutic purposes. The next step is database preparation. Conventionally, ECG databases depend on the features of the real ECG signal, but here database preparation based on the slope at each control point has been introduced. To accomplish this, three medical conditions have been considered: Normal Sinus Rhythm (NSR), Atrial Fibrillation (AF) and Congestive Heart Failure (CHF). The second part entailed a comparative study of machine learning algorithms for early diagnosis of the condition of the heart.

5 Conclusions

The main contribution of this paper is the construction of a novel parametric cubic spline model for three different medical conditions, the preparation of a dataset based on the slopes at crucial control points of the modeled ECG signals, and the application of machine learning classification techniques using the Orange data analysis software. The SVM, CN2 rule induction and Tree algorithms were tested in the Orange data mining tool, and the Tree algorithm gave the best performance: AUC 98.74\(\%\), CA 98.33\(\%\), F1-score 98.33\(\%\), precision 98.35\(\%\) and recall 98.33\(\%\). The results suggest that the Tree classifier is the most suitable for this type of dataset. The technique requires fewer data points, can generate a wide range of ECG signals, and does not require pre-processing of the ECG beats because it works well on raw data. Implementation of this model provides an efficient tool for medical education, research and testing purposes. The method is characterized by low computational complexity, is easy to design, and can perform classification using the modeled database, which can be useful for real-time applications. Different heart disorders have their own severity levels, and many heart conditions are yet to be explored; in the future, these can be taken into account to generalize the proposed model.