1 Introduction

A healthy heart is among the most important requirements for the normal functioning of the human body. The heart carries out mechanical and electrical activities to ensure that blood is pumped to all parts of the body, so any problem in its functioning can be devastating. Cardiovascular diseases (CVDs) are a very common cause of death: according to WHO surveys, approximately 33% of all deaths are related to CVDs. Early and accurate detection of abnormalities or diseases can therefore save the lives of countless individuals. Among the most popular modalities for monitoring the health of a functioning heart are the electrocardiogram (ECG), photoplethysmography (PPG) and the phonocardiogram (PCG) [12]. An ECG signal is a recording of the electrical activity of the heart; a PPG estimates the blood flow rate using light-based sensors; a PCG signal is an audio recording of the heart sounds and murmurs present in one cardiac cycle.

A PCG signal is obtained using a machine called a phonocardiograph, which uses a high-fidelity microphone to record the sounds and murmurs made by the heart. Every PCG signal contains two fundamental heart sounds, S1 and S2, produced by the closure of the atrioventricular and semilunar valves respectively; together they form the familiar ‘lub-dub’ sound of the heartbeat. The interval from S1 to S2 is called systole, and the interval from S2 to the next S1 is called diastole. A normal PCG signal contains only S1 and S2; abnormalities, however, give rise to additional sounds or murmurs, labeled S3, S4 and so on.

Traditionally, a doctor listens to the sounds produced by the heart using a stethoscope and tries to identify any abnormality in the rhythm or the sound. This is a difficult skill that requires years of exposure to master. Moreover, the limitations of the human ear, which worsen with age, can make the detection of pathological symptoms quite inaccurate.

In this paper, MFCCs [4, 14] are employed because PCG signals share many properties with speech signals. Twenty-six such coefficients are extracted from each frame. The extracted features are then passed to a 2-D convolutional neural network that classifies each audio signal into one of five classes: normal (N), aortic stenosis (AS), mitral regurgitation (MR), mitral stenosis (MS) and mitral valve prolapse (MVP).

The graphical representations depicted in Figs. 1, 2, 3, 4 and 5 show PCG signals from individuals with the N, MR, MS, MVP and AS conditions respectively.

Fig. 1. N type PCG signal

Fig. 2. MR type PCG signal

Fig. 3. MS type PCG signal

Fig. 4. MVP type PCG signal

Fig. 5. AS type PCG signal

2 Related Work

Chowdhury et al. [1] employ DWT to decompose the PCG signals into multiple sub-bands of different frequencies; the sub-bands that contain unnecessary noise are dropped. For feature extraction, MFCCs and Mel-scaled power spectrograms (Mel-Scale) are used. The extracted features are then fed through a 5-layer feed-forward DNN model trained with Keras. The model has an accuracy, specificity and sensitivity of 97.10%, 94.86% and 99.26% respectively.

K. Poudel et al. [2] encountered the problem of an unbalanced dataset and countered it with the pre-processing method SMOTE (Synthetic Minority Over-Sampling Technique). Mel-Scale and MFCC features are extracted from the PCG signals and passed to a 1-D CNN model with 4 hidden layers, implemented with the ReLU activation function and filter counts doubling from 128 to 1024. Each PCG signal is then classified into one of the classes in the database. The authors also used Shannon energy envelopes to develop a segmentation technique. The model has an accuracy of 93.20%, specificity of 94.20% and sensitivity of 89.20%.
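For readers unfamiliar with SMOTE, the following is a minimal illustrative sketch using the imbalanced-learn library; the feature matrix and label distribution below are placeholder assumptions, not the data of [2]:

```python
# Minimal SMOTE sketch with imbalanced-learn; the feature matrix and
# label distribution below are placeholders, not the data used in [2].
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(300, 26)            # placeholder feature vectors
y = np.array([0] * 250 + [1] * 50)     # artificially imbalanced labels

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y))                      # Counter({0: 250, 1: 50})
print(Counter(y_res))                  # Counter({0: 250, 1: 250})
```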

Alkhodhari et al. [3] used a combination of CNN and Bi-LSTM for the automatic extraction of features from the PCG signals. The VHD classes, namely AS, MR, MS and MVP, were preprocessed using MODWT and z-score normalization. The model was trained and tested using 10-fold cross validation, both with the combined CNN-BiLSTM network and with the CNN and Bi-LSTM individually. The model has an accuracy of 99.32%, specificity of 99.58% and sensitivity of 98.30%.

N. Baghela et al. [4] propose a machine learning model to automatically diagnose CVDs using PCG signals. The model combines 1-D CNN layers and dense layers. Extensive preprocessing, such as pitch correction and amplitude normalization, was performed, along with augmentation to increase the dataset size. The model was trained and evaluated using 10-fold cross validation, achieving an accuracy of 98.6%.

Shuvo et al. [5] perform automatic detection of CVDs under the classes N, AS, MR, MS and MVP using raw PCG signals, employing a CRNN architecture with representational and sequence residual learning phases. The time-invariant features of the PCG signal are extracted using an Adaptive Feature Extractor (AFE), a Frequency Feature Extractor (FFE) and a Pattern Extractor (PE), all part of the representational learning phase. The sequence residual learning phase includes bidirectional connections, which are used for the extraction of temporal features. Their model achieved 99.6% accuracy on the GitHub dataset and 86.57% on the PhysioNet dataset.

Oh et al. [6] proposed the WaveNet model, which consists of 6 residual blocks. 1000 PCG signals were collected from an open database covering 5 different classes. The signals were resampled at 8 kHz and normalized to the range −1 to 1. The model was cross-validated using 10 folds and trained for 3 epochs with the Adam optimization algorithm and a learning rate of 0.0005. The model has an average accuracy of 97%.

3 Proposed Methodology

2-D CNNs [13] are widely used in image recognition and object detection. For audio signals, 1-D convolutions are usually preferred, as the kernel is only expected to slide along the time axis. In this paper, we extract Mel Frequency Cepstral Coefficients from the audio signals. MFCCs are represented as 2-D data, with one axis representing the coefficient index and the other representing time; when visualized, the magnitude of each coefficient is represented by color. As a result, the MFCCs can be treated as a 2-D image. We extract 26 such coefficients, using a window of 2048 samples and a hop length of 512. The proposed methodology is depicted in Fig. 6.
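As an illustration, this extraction step can be sketched with the librosa library; the file name below is a hypothetical assumption, while the 26 coefficients, 2048-sample window and 512-sample hop follow the text:

```python
# MFCC extraction sketch using librosa; "pcg_sample.wav" is a
# hypothetical file name, and the recording's native rate is kept.
import librosa

signal, sr = librosa.load("pcg_sample.wav", sr=None)

mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=26,       # 26 coefficients per frame
    n_fft=2048,      # 2048 samples per window
    hop_length=512,  # hop length of 512 samples
)
# mfcc has shape (26, n_frames) -- (26, 44) for the 2 s signals used
# here -- and is treated as a single-channel 2-D image.
print(mfcc.shape)
```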

3.1 Block Diagram

Fig. 6. Block diagram

3.2 Architecture and Training

Table 1. Model architecture

The architecture, summarized in Table 1, is as follows:

  • Input layer: 32 filters of dimensions 3 × 3, with stride size set to 1, padding set to ‘same’ and activation function set to ReLU, resulting in an output dimension of (26, 44, 32).

  • Hidden Layer 1: 32 filters of dimensions 3 × 3, stride size set to 1, padding set to ‘same’ and activation function set to ReLU.

  • Hidden Layer 2: 64 filters of dimensions 3 × 3, stride size set to 1, padding set to ‘same’ and activation function set to ReLU.

  • Hidden Layer 3: 128 filters of dimensions 3 × 3, stride size set to 1, padding set to ‘same’ and activation function set to ReLU.

  • Hidden Layer 4: 64 filters of dimensions 3 × 3, stride size set to 1, padding set to ‘same’ and activation function set to ReLU. The output is flattened.

  • Hidden Layer 5: Dense layer comprising 512 units with ReLU activation.

  • Hidden Layer 6: Dense layer comprising 256 units with ReLU activation.

  • Output Layer: Dense layer comprising 5 units with softmax activation.

The model was trained for 15 epochs on a Tesla K80 GPU. The loss function used was categorical cross-entropy, and the optimizer was Adam with a learning rate of 0.001.
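The following is a minimal Keras sketch of the architecture in Table 1 and the training configuration above; the single input channel and the variable names are assumptions of this sketch:

```python
# 2-D CNN from Table 1, sketched in Keras; the (26, 44, 1) input shape
# assumes one channel per MFCC "image".
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(26, 44, 1)),
    layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu"),
    layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu"),
    layers.Conv2D(64, (3, 3), strides=1, padding="same", activation="relu"),
    layers.Conv2D(128, (3, 3), strides=1, padding="same", activation="relu"),
    layers.Conv2D(64, (3, 3), strides=1, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(5, activation="softmax"),   # N, AS, MR, MS, MVP
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, epochs=15, ...)  # 15 epochs, as in the text
```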

4 Results and Discussion

The dataset used in this study includes a total of 1000 PCG signals from patients (of both sexes and all age groups) covering the normal condition and 4 different valvular heart diseases (AS, MR, MS, MVP). The 1000 signals are divided into 5 classes of 200 signals each, and the duration of each signal is fixed at 2 s. To evaluate the performance metrics of the model, 10-fold cross validation has been used.
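This evaluation protocol can be sketched with scikit-learn as follows; build_model() is an assumed helper standing in for the network of Table 1, and the placeholder arrays merely mirror the dataset's shape (1000 signals, 5 balanced classes):

```python
# 10-fold cross-validation sketch; X, y and build_model() are assumptions
# standing in for the MFCC tensors, labels and the CNN of Table 1.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical

X = np.random.rand(1000, 26, 44, 1).astype("float32")  # placeholder MFCCs
y = np.repeat(np.arange(5), 200)                       # 5 classes x 200 signals

accuracies = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    model = build_model()  # assumed helper: fresh compiled Table 1 CNN per fold
    model.fit(X[train_idx], to_categorical(y[train_idx], 5),
              epochs=15, verbose=0)
    _, acc = model.evaluate(X[val_idx], to_categorical(y[val_idx], 5),
                            verbose=0)
    accuracies.append(acc)

print(f"mean validation accuracy: {np.mean(accuracies):.4f}")
```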

Table 2 shows the results of the cross validation with accuracy as the parameter.

Table 2. Training and validation accuracy of each fold

The lowest validation accuracy was 98.46% and the highest validation accuracy was 100%. The mean validation accuracy across all the folds was 99.64%.

Table 3 shows the performance of the model for each class on metrics such as precision, recall and F1-score across all 10 folds. These parameters are calculated as follows:

$$Precision = \frac{TP}{TP + FP}$$
(1)
$$Recall = \frac{TP}{TP + FN}$$
(2)
$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
(3)
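As an illustration, Eqs. (1)–(3) can be computed per class from the fold predictions with scikit-learn; the label arrays below are toy placeholders, not the paper's fold outputs:

```python
# Per-class precision, recall and F1 (Eqs. 1-3) from predictions;
# y_true and y_pred are toy placeholders, not the paper's fold outputs.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

labels = ["N", "AS", "MR", "MS", "MVP"]
y_true = ["N", "AS", "MR", "MS", "MVP", "MR"]   # placeholder ground truth
y_pred = ["N", "AS", "MVP", "MS", "MVP", "MR"]  # placeholder predictions

print(confusion_matrix(y_true, y_pred, labels=labels))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```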

Table 3 shows the parameter values for all the folds, while Table 4 compares the model presented in this paper with other models.

Table 3. Parameter values of each fold

The following figures present the confusion matrices for folds that do not have a validation accuracy of 100%.

Fig. 7. Confusion matrix for Fold 2

Fig. 8. Confusion matrix for Fold 3

Fig. 9. Confusion matrix for Fold 6

Fig. 10. Confusion matrix for Fold 8

From Figs. 7, 8, 9 and 10, it is evident that misclassifications occurred only in a few instances:

  • In the confusion matrix for fold 2, as shown in Fig. 7, 1 signal attributed to MVP has been misclassified as MR, resulting in an overall accuracy of 99.49%.

  • For fold 3, 1 MVP signal has been misclassified as AS and 1 MR signal has been misclassified as MVP, lowering the overall accuracy to 98.97%.

  • Fold 6, shown in Fig. 9, has the largest number of misclassifications and hence the lowest overall accuracy of 98.46%. 2 MR signals have been incorrectly classified as AS and MVP respectively. In addition, 1 MS signal has been misclassified as N.

  • In fold 8, 1 MS signal has been classified as MVP thereby resulting in an overall accuracy of 99.49%.

Overall, MR signals are incorrectly classified three times, while MVP and MS signals are each misclassified twice.

Even though MFCCs are not traditional two-dimensional images, the 2-D CNN model performed surprisingly well, matching and in some cases surpassing the performance of 1-D CNNs and LSTMs [15].

Table 4. Study comparison

5 Conclusion

Manual detection of heart abnormalities is a challenging and time-consuming task that requires specific expertise. This study proposes a computer-aided diagnosis (CAD) system using a 2-D CNN for the classification of cardiovascular diseases. 2-D CNNs are uncommon in the audio domain but continue to gain traction. The proposed method achieves an average 10-fold cross-validation accuracy of 99.64%, surpassing many state-of-the-art models on this dataset. The model does not require extensive pre-processing and is relatively lightweight. Its overall accuracy may be further improved by performing data augmentation.

The main limitation of the proposed work is the scarcity of multi-class PCG datasets: while multiple binary (normal vs. abnormal) PCG datasets exist, the same cannot be said for multi-class ones.