Introduction

Respiratory diseases are the third leading cause of death worldwide. With the rapid growth of respiratory diseases around the world, the medical research community has gained interest in audio signal analysis-based techniques. As in other application domains, audio signal analysis tools can help analyze respiratory sounds to detect problems in the respiratory tract. Audio analysis aids in timely diagnosis of respiratory ailments, making it easier to detect respiratory dysfunction at an early stage. Respiratory conditions are commonly diagnosed through spirometry and lung auscultation. Even though spirometry is one of the most widely available lung function tests, it depends heavily on the patient's cooperation and is therefore error prone. Auscultation is a technique that involves listening to internal body sounds with the aid of a stethoscope. Over the years, it has been an effective tool for analyzing lung disorders and abnormalities. However, the procedure is limited to trained physicians, and for various reasons (e.g., a faulty instrument) false positives can occur. This opens the door to computerized respiratory sound analysis tools and techniques, where automation is integral.

Lung sounds are difficult to analyze and distinguish because they are non-stationary and non-linear signals. Automated analysis was made possible with the use of the electronic stethoscope. In 2017, the largest publicly available respiratory sound database was compiled, encouraging the development of algorithms that can identify common abnormal breath sounds (wheezes and crackles) recorded in clinical and non-clinical settings. Respiratory sounds are generally classified as normal or adventitious. Adventitious sounds are sounds superimposed on normal respiratory sounds, and can be crackles or wheezes. Crackles are discontinuous, explosive, and non-musical sounds, typically shorter than 20 ms, that occur frequently in cardiorespiratory diseases associated with lung fibrosis (fine crackles) or chronic airway obstruction (coarse crackles). Wheezes are high pitched sounds that last more than 100 ms. They are common in patients with obstructive airway diseases and indicate obstructive airway conditions, such as asthma and COPD. The dataset contains respiratory cycles that were recorded and annotated by professionals as containing wheezes, crackles, both, or no abnormal sounds.

Rao et al. [19] discussed acoustic techniques for pulmonary analysis. They studied acoustic characteristics of different lung diseases, covering various types of sounds, both internal and external. Aykanat et al. [3] presented a convolutional network and a mel frequency cepstral coefficient (MFCC) plus support vector machine-based approach for lung sound classification. On a dataset of 17930 sounds from 1630 subjects, an accuracy of 86% (for healthy-pathological classification) was reported. Pramono et al. [18] classified normal respiratory sounds and wheezes on a dataset of 38 recordings. Of 425 events, 223 were wheezes and the rest were normal. They reported an AUC value of 0.8919 with MFCC-based features. Acharya et al. [1] presented a deep learning-based approach for lung sound classification. They reported an accuracy of 71.81% on the ICBHI17 dataset of 6800+ clips. Dokur [10] used machine learning approaches to distinguish respiratory sounds. In their experiments, nine different categories from 36 patients were used, and an accuracy of 92% was reported using a Multilayer Perceptron (MLP).

Melbye et al. [14] studied the classification of lung sounds by 12 observers. They worked with 1 clip each from 10 adults and 10 children and obtained Fleiss' kappa values of 0.62 and 0.59 for crackles and wheezes, respectively. Among the 20 cases, the observers concluded the presence of at least 1 adventitious sound in 17. Bahoura and Pelletier [4] used cepstral features to distinguish normal and wheezing sounds. They worked with 12 instances from each class and reported the highest true positive rate of 76.6% for wheezing sounds. They also reported 90.6% true positives for normal sounds with Fourier transform-based features. Ma et al. [13] developed a system to distinguish lung sounds using a ResNet-based approach. On the ICBHI17 dataset, an accuracy of 52.26% was reported. Emmanouilidou et al. [11] proposed a robust approach to identify lung sounds in the presence of noise. In their experiments with 1K+ volunteers (over 250 hours of data), an accuracy of 86.7% was reported.

To analyze lung sounds, Sen et al. [23] used Gaussian mixture model and support vector machine-based classifiers. Using 20 healthy and non-healthy subjects, they reported an accuracy of 85%. Demir et al. [9] used a CNN-based approach. On the ICBHI17 dataset, a highest accuracy of 83.2% was reported. Chen et al. [7] used an S-transform-based approach coupled with deep residual networks to classify lung sounds into crackle, wheeze, and normal. In their study, the reported accuracy was 98.79%. Kok et al. [12] employed multiple features, such as MFCC, DWT, and time domain metrics, to distinguish healthy and non-healthy sounds. They reported accuracy, specificity, and sensitivity values of 87.1%, 93.6%, and 86.8%, respectively, on the ICBHI17 dataset.

Chambers et al. [6] developed a tool to identify healthy/non-healthy patients using respiratory sounds. They used several spectral, rhythm, SFX, and tonal features coupled with decision tree-based classification, and reported an accuracy of 85% on a dataset of 920 records. Altan et al. [2] developed a deep learning-based approach to detect chronic obstructive pulmonary disease. Their tool used the Hilbert-Huang transform on multi-channel lung sounds. In their experiment, an accuracy of 93.67% was reported on a dataset of 600 sounds collected from 50 patients. Cohen and Landsberg [8] classified 7 different types of sounds using a linear predictive coefficient-based technique. In their experiments, 100 out of 105 instances were classified correctly.

Even though there exists a rich body of literature on lung sound analysis, existing approaches do not guarantee optimal performance. Moreover, non-healthy cases comprise several conditions, and distinguishing healthy sounds from non-healthy sounds is not trivial. Where computational resources are limited, handcrafted feature-based systems are preferred over deep learning-based systems. Furthermore, prior to deeper analysis of non-healthy sounds, it is essential to distinguish healthy from non-healthy sounds. A hierarchical approach can help reduce the workload of medical experts in resource-constrained regions: once it is established whether a person has a lung infection, the true positive cases can be taken up for further treatment/processing.

In this paper, we developed an automated tool that employs LPCC-based features. LPCC-based features were chosen due to their ability to model a variety of audio signals [15, 16]. In our experiments on the ICBHI17 dataset (6800+ clips), we achieved an accuracy of 99.22% using an MLP.

The remainder of the paper is organized as follows. “Dataset description” discusses the dataset. In “Proposed method: LPCC-based features and MLP”, we describe the proposed tool. Experimental results are provided in “Results and analysis”. We conclude the paper in “Conclusion”.

Dataset description

To develop a robust system, it is important to ensure that the dataset mimics real-world conditions. Our system was trained on a publicly available respiratory sound database [20], which is associated with the International Conference on Biomedical and Health Informatics (ICBHI). Most of the database consists of audio samples recorded by the School of Health Sciences, University of Aveiro (ESSUA) research team at the Respiratory Research and Rehabilitation Laboratory (Lab3R), ESSUA, and at Hospital Infante D. Pedro, Aveiro, Portugal. The second research team, from the Aristotle University of Thessaloniki (AUTH) and the University of Coimbra (UC), acquired respiratory sounds at the Papanikolaou General Hospital, Thessaloniki, and at the General Hospital of Imathia (Health Unit of Naousa), Greece.

To collect the data, disparate stethoscopes and microphones were used. The audio was recorded from the trachea and 6 other chest locations: left and right posterior, anterior, and lateral. It was collected in both clinical and non-clinical settings from adult participants of disparate ages. Participants encompassed patients with lower and upper respiratory tract infections, pneumonia, bronchiolitis, COPD, asthma, bronchiectasis, and cystic fibrosis.

The ICBHI database consists of 920 audio samples from 126 subjects. These are annotated by respiratory experts and used as a benchmark in the field. Each respiratory cycle in the dataset is annotated with one of 4 classes. The annotations cover 2 broad groups: healthy and non-healthy. The non-healthy category is further divided into wheeze and crackle, with some cycles having both issues. Among the 6898 cycles, totaling 5.5 hours, 1864 cycles contain crackles and 886 contain wheezes. There are 506 cycles that contain both wheezes and crackles.

While recording, the participants were seated. The acquisition of respiratory sounds was performed on adult and elderly patients. Many patients had COPD with comorbidities (e.g., heart failure, diabetes, and hypertension). Furthermore, noise is present, such as the rubbing sound of the stethoscope against the patient's clothing and background talking. Such variety in the data makes it challenging to identify problems in the respiratory sounds. One of the most challenging aspects of the audio clips is the presence of heartbeat sounds along with the breath sounds. No preprocessing was performed to remove the heartbeat sounds.

For better understanding, visual representations of 200 audio clips from the healthy and non-healthy classes are shown in Fig. 1. In Table 1, a summary of the complete dataset is provided.

Fig. 1

200 audio clips (original): healthy class (left) and non-healthy class (right)

Table 1 Respiratory sound database [20]

Proposed method: LPCC-based features and MLP

Respiratory sound representation: LPCC-based features

As an audio clip contains high deviations across its entire length, its analysis is not trivial. Therefore, each audio clip is broken down into smaller segments called frames to facilitate analysis. In our study, we divided each clip into frames of 256 sample points with a 100-point overlap between them. These parameters were chosen empirically. The same 200 audio clips (as in Fig. 1) are shown in Fig. 2 after framing. The number of overlapping frames Of of size Sz, with O overlapping points, for a signal of S points is given below:

$$ {O_{f}=\Bigl\lceil\frac{S-S_{z}}{O}+1\Bigr\rceil.} $$
(1)
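For illustration, a minimal sketch of this framing step is given below. It assumes NumPy and a mono clip already loaded as a 1-D array; the frame size of 256 and the step implied by Eq. (1) are our reading of the parameters above, so treat them as illustrative rather than the authors' exact implementation.

```python
import numpy as np

def frame_signal(signal, frame_size=256, step=100):
    """Split a 1-D signal into overlapping frames.

    Following Eq. (1), a new frame starts every `step` samples; the last
    frame is zero-padded so that all frames have `frame_size` samples.
    The exact overlap convention is our interpretation of the text.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = max(int(np.ceil((len(signal) - frame_size) / step + 1)), 1)
    padded = np.zeros((n_frames - 1) * step + frame_size)
    padded[: len(signal)] = signal
    return np.stack([padded[i * step: i * step + frame_size]
                     for i in range(n_frames)])

# Example: a clip of 10000 samples yields ceil((10000 - 256)/100 + 1) = 99 frames.
# frames = frame_signal(clip)            # shape: (n_frames, 256)
```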
Fig. 2

200 audio clips (as in Fig. 1) after framing: healthy class (left) and non-healthy class (right)

After framing the audio clips (into shorter segments), it was observed that in various instances the starting and ending points of a frame were not aligned. These discontinuities/jitters lead to smearing of power across the frequency spectrum, which manifests as spectral leakage during frequency domain analysis and produces additional frequency components. To tackle this, the frames were subjected to a window function. The Hamming window was selected for this purpose due to its efficacy as reported in [16]. The same frames (Fig. 2) are presented in Fig. 3 after windowing. The Hamming window is defined as

$$ {A(z)=0.54-0.46 \cos \Bigg(\frac{2 \pi z}{S_{z}-1}\Bigg ),} $$
(2)

where A(z) is the Hamming window function and z is a point within a frame.
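A corresponding sketch of the windowing step (Eq. (2)), applied to the frames produced above, could look as follows; `np.hamming` would give the same coefficients.

```python
import numpy as np

def apply_hamming(frames):
    """Multiply every frame by the Hamming window of Eq. (2)."""
    frame_size = frames.shape[1]
    z = np.arange(frame_size)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * z / (frame_size - 1))
    return frames * window  # broadcasting applies the same window to each frame

# windowed = apply_hamming(frame_signal(clip))
```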

Fig. 3

Representation of 200 audio clips (as in Fig. 1) after windowing: healthy class (left) and non-healthy class (right)

Thereafter, we performed Linear Predictive Coefficient (LPC)-based analysis [15] on each frame. The previous P samples are used to predict the rth sample of a signal s(·) as

$$ \begin{array}{@{}rcl@{}} s(r)&\approx& p_{1}s(r-1)+p_{2}s(r-2)+p_{3}s(r-3)\\&&+\dots+p_{P} s(r-P), \end{array} $$
(3)

where p1, p2,…, pP are the LPCs or predictors. The prediction error E(r), i.e., the difference between the actual and predicted samples s(r) and \(\hat {s}\)(r), can be expressed as

$$ E(r)=s(r)-\hat{s}(r)=s(r)-\sum\limits_{k=1}^{P}p_{k}s(r-k). $$
(4)

The sum of squared prediction errors (shown below) is minimized to obtain the unique predictors for the rth frame sr, where x indexes the samples within the frame:

$$ E_{r}=\underset{x}{\sum}\Bigl[s_{r}(x)-\sum\limits_{k=1}^{P}p_{k}s_{r}(x-k)\Bigr]^{2}. $$
(5)
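Minimizing Eq. (5) leads to the Yule-Walker (autocorrelation) normal equations. The sketch below solves them directly with NumPy for one windowed frame; a Levinson-Durbin recursion would be the more common and efficient choice, and the prediction order of 10 is an illustrative assumption.

```python
import numpy as np

def lpc(frame, order=10):
    """Estimate the LPC predictors p_1..p_P for one windowed frame.

    Minimizing the squared error of Eq. (5) gives the normal equations
    R a = r, where R is the Toeplitz autocorrelation matrix of the frame.
    A tiny ridge term keeps the solve stable for near-silent frames.
    """
    n = len(frame)
    full = np.correlate(frame, frame, mode="full")
    r = full[n - 1: n + order]                   # autocorrelation at lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-8 * np.eye(order), r[1:order + 1])
```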

Thereafter, a recursive technique is used to compute the Cepstral coefficients (C), which is expressed as

$$ \left.\begin{aligned} C_{0}&=\log_{e}P, \\ C_{r}&=p_{r}+\sum\limits_{q=1}^{r-1}\frac{q}{r}C_{q}p_{r-q}, \quad \text{for } 1<r\leq P, \text{ and} \\ C_{r}&=\sum\limits_{q=r-P}^{r-1}\frac{q}{r}C_{q}p_{r-q}, \quad \text{for } r>P \end{aligned}\right\}. $$
(6)
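The recursion of Eq. (6) can be sketched as follows (assuming the `lpc` helper above; whether C0 is kept in the final feature is not stated in the text, so it is dropped here).

```python
import numpy as np

def lpc_to_lpcc(p, n_coeffs=None):
    """Convert LPC predictors p_1..p_P into cepstral coefficients via Eq. (6)."""
    P = len(p)
    n_coeffs = n_coeffs or P
    c = np.zeros(n_coeffs + 1)
    c[0] = np.log(P)                              # C_0 = ln P
    for r in range(1, n_coeffs + 1):
        acc = p[r - 1] if r <= P else 0.0         # p_r term only while r <= P
        for q in range(max(1, r - P), r):         # q = max(1, r-P), ..., r-1
            acc += (q / r) * c[q] * p[r - q - 1]
        c[r] = acc
    return c[1:]                                  # C_1 ... C_{n_coeffs}

# lpccs = np.stack([lpc_to_lpcc(lpc(f)) for f in windowed])   # (n_frames, P)
```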

Since the clips in the dataset were of unequal length, the number of frames varied from clip to clip, and frame-level feature extraction therefore produced features of different dimensions. To handle this, we performed two operations: (a) grading and (b) standard deviation measurement.

  1. Firstly, the sum of LPCC coefficients in each of the frequency ranges (bands) across all the frames was computed. Based on the sum of these energy values, the bands were graded in ascending order. This sequence of band numbers was used as a feature that helped in identifying the dominance of different bands for clips from various categories.

  2. Secondly, the standard deviation was computed for every band. These two sets of values were stacked to form the final feature, which is independent of the clip length (a minimal sketch of this construction is given after the list). Features of 10, 20, 30, 40, and 50 dimensions were extracted for the 2 classes. The trend of the 30 dimensional feature values (best result) for the 2 classes is shown in Fig. 4.
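A minimal sketch of the two operations above, producing one fixed-length vector per clip, is shown below. The exact band definition is not fully specified in the text, so each cepstral coefficient index is treated as one band here; with 15 bands this yields the 30 dimensional feature.

```python
import numpy as np

def clip_feature(frame_lpccs):
    """Build a clip-length-independent feature from frame-level LPCCs.

    `frame_lpccs` has shape (n_frames, n_bands). Step 1 ("grading"):
    the per-band sums are sorted and the resulting sequence of band
    indices is used as features. Step 2: the per-band standard
    deviation is appended. Both steps are our reading of the text and
    should be treated as a sketch, not the authors' exact procedure.
    """
    band_sum = frame_lpccs.sum(axis=0)            # energy proxy per band
    grading = np.argsort(band_sum)                # band numbers in ascending order of energy
    band_std = frame_lpccs.std(axis=0)            # spread of each band over the frames
    return np.concatenate([grading.astype(float), band_std])

# With n_bands = 15 the resulting feature has 2 * 15 = 30 dimensions.
```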

Fig. 4

Representation of 30 dimensional features for the audio clips: healthy class (left); non-healthy class (right)

Classification: MLP

We employed an MLP classifier – a feed-forward artificial neural network – for the classification task [17]. Feedforward neural networks are made up of an input layer, an output layer, and hidden layers. The MLP is a supervised learning algorithm that learns a function \(f(\cdot): R^{n} \rightarrow R^{o}\) from a training dataset, where n and o represent the dimensions of the input and output. For a given set of features \(P = p_{1},p_{2},\dots ,p_{n}\) and a target x, a non-linear function is learned for classification. The difference between an MLP and logistic regression lies in the existence of one or more non-linear (hidden) layers between the input and the output layer. An MLP consists of three or more layers (input layer, output layer, and one or more hidden layers) of non-linearly activating neurons. The number of hidden layers can be increased according to the requirements of the task the model has to accomplish.

The initial layer is the input layer, which comprises a set of neurons \(\{p_{i} \mid p_{1}, p_{2},\dots ,p_{n}\}\) denoting the features. Each neuron of the hidden layer transforms the values from the previous layer using a weighted sum \(w_{1}p_{1}+w_{2}p_{2}+\dots+w_{n} p_{n}\).

The activation function that relates the input to the output of a neuron is of a non-linear nature. It makes the model flexible in capturing complex relationships. Typical activation functions can be expressed as

$$ y_{i}=\tanh (w_{i}) \text{ and } y_{i}=(1+e^{-w_{i}})^{-1}, $$
(7)

where yi and wi denote the output of the ith neuron and the weighted sum of its inputs, respectively. The values from the last hidden layer are passed to the output layer, which produces the output values. The layers of an MLP are fully connected, as each neuron in a layer is connected to all the neurons of the previous layer. The parameters of each neuron are independent of the remaining neurons of the layer, so each neuron possesses a unique set of weights. The initial momentum and learning rate were set to 0.2 and 0.3, respectively.
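As a rough illustration, the classification stage could be reproduced along the following lines with scikit-learn. Only the default momentum (0.2) and learning rate (0.3) come from the text; the hidden-layer size, iteration budget, and feature scaling are assumptions.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch of the MLP classifier. The original work appears to use a
# different MLP implementation, so treat everything except the momentum
# and learning rate as illustrative defaults.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(30,),     # assumption: one hidden layer
        activation="logistic",        # sigmoid units, cf. Eq. (7)
        solver="sgd",
        learning_rate_init=0.3,       # default learning rate from the text
        momentum=0.2,                 # default momentum from the text
        max_iter=500,
        random_state=0,
    ),
)
# X: (n_clips, 30) feature matrix from clip_feature(); y: 0 = healthy, 1 = non-healthy
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```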

Results and analysis

Evaluation metrics and protocol

Accuracy alone is not enough to measure the performance of a system; it is also important to analyze the different types of misclassification. Hence, to evaluate our tool, the following performance metrics are used: Accuracy, Precision, Sensitivity (Recall), Specificity, F1 score, and Area under the ROC curve (AUC). They are computed as

$$ \begin{array}{@{}rcl@{}} \text{Accuracy} &=&\frac{T_{P}+T_{N}}{T_{P}+T_{N}+F_{P}+F_{N}}, \\ \text{Precision}&=&\frac{T_{P}}{T_{P}+F_{P}}, \\ \text{Sensitivity (Recall)}&=&\frac{T_{P}}{T_{P}+F_{N}}, \\ \text{Specificity}&=&\frac{T_{N}}{T_{N}+F_{P}}, \text{ and} \\ \text{F1 score}&=&2\times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \end{array} $$
(8)

where TP, TN, FP, and FN refer to true positive, true negative, false positive, and false negative, respectively.

To avoid possible bias in evaluation, 5-fold cross validation was used.
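A sketch of this protocol, combining the metrics of Eq. (8) with 5-fold cross validation, is given below; the fold construction (stratified, shuffled) is an assumption, as the text only states that 5-fold cross validation was used.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold

def evaluate(clf, X, y, n_splits=5):
    """Average the metrics of Eq. (8) over cross-validation folds."""
    folds = []
    for train_idx, test_idx in StratifiedKFold(
            n_splits=n_splits, shuffle=True, random_state=0).split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], y_pred).ravel()
        folds.append({
            "accuracy": accuracy_score(y[test_idx], y_pred),
            "precision": precision_score(y[test_idx], y_pred),
            "sensitivity": recall_score(y[test_idx], y_pred),
            "specificity": tn / (tn + fp),
            "f1": f1_score(y[test_idx], y_pred),
            "auc": roc_auc_score(y[test_idx],
                                 clf.predict_proba(X[test_idx])[:, 1]),
        })
    return {name: float(np.mean([f[name] for f in folds])) for name in folds[0]}
```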

Our results

The performance of the different feature dimensions is provided in Table 2. The best result was obtained with the 30 dimensional features, and its corresponding confusion matrix is provided in Table 3.

Table 2 Performance of different feature dimensions using MLP
Table 3 Inter-class confusions for the 30 dimensional features (Best result) using MLP

Next, the momentum was varied from 0.1 to 0.5 with a step of 0.1, and the results are provided in Table 4. The best result was obtained for a momentum of 0.1, whose inter-class confusions are provided in Table 5. Compared to the default scenario, there were 4 more misclassifications for the healthy cases (and 9 fewer misclassifications for the non-healthy cases).

Table 4 Performance for different momentum values on 30 dimensional features with learning rate of 0.3
Table 5 Inter-class confusions for momentum value of 0.1 on 30 dimensional features

Finally, the learning rate was varied from 0.1 to 0.6 with a step of 0.1; the results are provided in Table 6. In our experiment, the highest performance was obtained with a learning rate of 0.5. We present the confusion matrix for this setup in Table 7. The number of misclassifications for both classes was reduced compared to the initial setup. The misclassified instances were analyzed, and it was found that many of them contained heartbeat sounds. Other unwanted artefacts, such as talking and movement of the probe, also contributed to the misclassifications.
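The tuning experiments of Tables 4 and 6 amount to simple one-dimensional parameter sweeps; a sketch for the learning-rate sweep, reusing the `clf` pipeline and `evaluate` helper from the earlier sketches (the grid and data variables X, y are assumptions of our reading), is:

```python
import numpy as np

# Sweep the learning rate from 0.1 to 0.6 at a fixed momentum of 0.2,
# mirroring the experiment behind Table 6.
for lr in np.linspace(0.1, 0.6, 6):
    clf.set_params(mlpclassifier__learning_rate_init=float(lr))
    print(f"learning rate {lr:.1f}:", evaluate(clf, X, y))
```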

Table 6 Performance for different learning rates with momentum of 0.2
Table 7 Interclass confusions for learning rate of 0.5 and momentum of 0.2 on 30 dimensional features

Overall, the number of misclassified instances was reduced by almost 15.63% compared to the original setup with default settings. Compared to the best result obtained after tuning the momentum, a further decrease of nearly 8.47% in misclassified instances was observed.

A deeper analysis of the misclassifications revealed that approximately 0.74% of the healthy cases were misclassified as non-healthy. In the case of non-healthy instances, approximately 0.83% of the clips were misclassified as healthy, which we refer to as false negatives.

The different performance metrics were computed for the default setup, the best momentum, and the best learning rate (overall highest). These results are provided in Table 8. The ROC curves for these scenarios are shown in Fig. 5.

Fig. 5

ROC curves: a default settings, b best momentum value (0.1), and c best learning rate (0.5, overall highest result)

Table 8 Performance metrics for default scenario, best results after tuning momentum value and best result after tuning learning rate

Comparative study

The performance of several other classifiers was compared in order to establish the efficacy of MLP. For comparison, the 30 dimensional feature set (best performance) was chosen. We experimented with BayesNet, SVM, RNN, Naive Bayes, RBF network, Decision Table, LibLINEAR, and Simple logistic. The results are provided in Table 9.

Table 9 Performance of different classifiers on the 30 dimensional features

We also compared the performance of our system with reported works by Kok et al. [12] and Chambers et al. [6]. The average accuracies for both the systems along with the proposed system are provided in Table 10.

Table 10 Comparison with reported works

Conclusion

In this paper, we developed a tool to detect respiratory sounds originating from patients carrying respiratory infections. We employed Linear Predictive Cepstral Coefficient (LPCC)-based features to characterize respiratory sounds. With a Multilayer Perceptron (MLP)-based classifier, in our experiments, we achieved the highest accuracy of 99.22% (AUC = 0.9993) on a publicly available dataset of 6800+ clips. Our results outperformed other popular machine learning classifiers as well as comparable works in the literature.

Not limiting ourselves to binary classification (healthy/non-healthy), our immediate plan is to classify disease types within the non-healthy category. This will help identify the nature and severity of infection. As we observed that COVID-19 could possibly be screened by analyzing respiratory sounds [5], we are now extending our experiments to COVID-19 [21, 22].