1 Introduction

Respiratory symptoms are associated with illnesses, infections, or allergies. For example, cough is the main symptom of asthma, and pneumonia (e.g., caused by COVID-19) is often accompanied by throat-clearing (t-c) and sniffle symptoms. Currently, patients commonly rely on subjective self-reporting when seeking medical care [6], which has been shown to be inefficient and inaccurate.

In recent years, many works have focused on detecting specific types of respiratory symptoms, such as cough [13], sneeze [1], and snore [15]. PulmoTrack-CC [14] combines sound recorded at the neck with a motion sensor placed on the chest to achieve a sensitivity of approximately 96% in counting cough events. However, all of these systems require the user to wear dedicated sensors, which limits their practicality. With the increasing power of smartphones, many studies have explored using smartphones to improve the quality of healthcare services [10, 11, 17, 18]. A cough detection system [19] uses local Hu moments and a k-nearest neighbor (KNN) algorithm to achieve 88.51% sensitivity (SE) and 99.72% specificity (SP). SymDetector [12] is a smartphone-based application that detects sound-related respiratory symptoms in office and home scenarios. SymListener [16] detects three types of respiratory symptoms in driving environments with strong in-cabin noise. However, neither SymDetector nor SymListener considers continuous symptoms. The popularity of earphones provides an opportunity to detect respiratory symptoms in multiple indoor environments: when a user wears earphones, their position relative to the body is fixed, so they receive the acoustic signals generated by the user more stably.

Driven by these circumstances, we propose an earphone-microphone based system, called SymRecorder, for detecting sound-related respiratory symptoms in a variety of indoor environments. SymRecorder uses the earphone microphone connected to a smart device to sense the environment and to detect and recognize sound-related respiratory symptoms, including cough, sneeze, t-c, and sniffle. To achieve this objective, we face the following challenges: (1) the indoor environment where the user is located may be noisy, which lowers the signal-to-noise ratio and makes it difficult to detect sound events; (2) the user may experience continuous respiratory symptoms, especially continuous cough, at very short intervals, so SymRecorder needs to accurately subdivide these continuous symptoms.

To address the above challenges, we design a sound event detection method, named RA-ABSE, that combines dual thresholds with Adaptive Band-partitioning Spectral Entropy (ABSE) [3] to detect sound-related events in different indoor environments. RA-ABSE uses dual thresholds to detect sound event endpoints in quiet environments, whereas in noisy environments it uses ABSE as a feature to detect the endpoints of sound-related events, combined with Berouti power spectrum subtraction to remove the effect of noise on sound events. With the help of RA-ABSE, the audio segments containing sound events are filtered out. After acquiring the sound event segments, we design a Hilbert Transform (HT) based method to subdivide possible continuous symptoms. We then use a combination of features based on Mel Frequency Cepstrum Coefficients (MFCC), Gammatone Frequency Cepstrum Coefficients (GFCC), and the spectrogram. SymRecorder adopts a Residual Network (ResNet) and a Multi-layer Perceptron (MLP) to classify the four types of respiratory symptoms. We also incorporate an attention mechanism into the ResNet to highlight the features shared by instances of the same respiratory symptom and to reduce the influence of different environments and populations.

To evaluate the performance of SymRecorder, we collect data from a total of 20 volunteers over 4 months using earphones to build the system model. We implement SymRecorder on the Android platform and comprehensively evaluate its performance. The experimental results show that SymRecorder is effective in four indoor environments: home, office, canteen, and shopping mall. Our contributions are summarized as follows:

  • We propose a detection system, called SymRecorder, to detect sound-related respiratory symptoms in different indoor environments. Through acoustic sensing, SymRecorder uses only a pair of earphones and a mobile device to detect and differentiate between cough, sneeze, t-c, and sniffle.

  • We design a dual-threshold and ABSE-based sound event detection method, called RA-ABSE, to detect sound events in different indoor environments, and use Berouti power spectrum subtraction to eliminate environmental noise. We also design an HT-based method to subdivide possible continuous respiratory symptoms.

  • We design a combination of features based on the spectrogram, MFCC, and GFCC, and use a deep learning model combining a ResNet, an attention mechanism, and an MLP for classification. The evaluation results show that SymRecorder achieves an average accuracy of 92.17% and an average precision of 90.04%.

The rest of this article is organized as follows: Sect. 2 describes the design of SymRecorder in detail. Experimental details and future work on SymRecorder are presented in Sect. 3. Section 4 discusses related work, and finally, we draw our conclusion in Sect. 5.

2 System Design

This section describes the system architecture of SymRecorder. As shown in Fig. 1, the whole system consists of six modules. First, the original microphone audio recording is split into frames, which are grouped into windows and sent to the sound event detection module. This module determines the current environment type and detects sound events using the RA-ABSE method. The detected sound events are then passed through the continuous symptom detection module to subdivide possible continuous symptoms. Next, features are extracted from each filtered sound event, and a deep learning network classifies the sound events. Finally, respiratory symptoms are recorded. The design of each module is detailed below.

Fig. 1. System overview.

2.1 Sampling and Pre-processing

Existing earphones are capable of sampling audio signals at a variety of sampling rates; we choose a sampling rate of 20 kHz. The sampled audio stream is then segmented into 10 ms non-overlapping frames, which are used to extract time-domain features. The VocalSound [7] dataset contains recordings of 3365 subjects performing six physiological activities: laugh, sigh, cough, t-c, sneeze, and sniffle. We count the distribution of all symptom durations. As seen in Fig. 2a, respiratory symptoms typically last for hundreds of milliseconds and thus cover multiple frames. Therefore, we group a fixed number of consecutive frames into a single window for processing. In addition, the user may experience continuous respiratory symptoms, especially continuous cough. To determine the window size, we also count the number of repetitions in continuous respiratory symptoms. As shown in Fig. 2b, continuous respiratory symptoms tend to repeat 1 to 3 times, and reference [5] states that within a bout of continuous symptoms, only the first symptom includes an inspiratory phase and the duration of each symptom does not exceed 0.5 s. Therefore, the window size is set to 2 s, which can cover any respiratory symptom. To avoid double counting, there is no overlap between windows. When a user experiences consecutive symptoms, they span at most two windows.
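For concreteness, the following sketch (our own illustrative code, not from the paper; it assumes the audio arrives as a mono NumPy float array) shows this framing and windowing:

```python
import numpy as np

FS = 20_000                 # sampling rate (Hz)
FRAME_LEN = FS // 100       # 10 ms frame -> 200 samples
WIN_FRAMES = 200            # 2 s window = 200 non-overlapping frames

def split_frames(audio: np.ndarray) -> np.ndarray:
    """Split a mono stream into 10 ms non-overlapping frames."""
    n = len(audio) // FRAME_LEN
    return audio[: n * FRAME_LEN].reshape(n, FRAME_LEN)

def split_windows(frames: np.ndarray) -> np.ndarray:
    """Group consecutive frames into 2 s non-overlapping windows."""
    n = len(frames) // WIN_FRAMES
    return frames[: n * WIN_FRAMES].reshape(n, WIN_FRAMES, FRAME_LEN)
```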

Fig. 2. The distribution of symptoms.

2.2 Sound Event Detection

We utilize the short-time energy (STE) to determine the user's environment. Specifically, SymRecorder buffers the current window and the past 4 windows, totaling 5 windows (i.e., 10 s), and computes the STE of the frames within each window. Only when 80% of the frames' STE in every window falls below the STE threshold (i.e., 10) is the current environment classified as quiet; otherwise, it is considered noisy. Based on this decision, our sound event detection method, RA-ABSE, employs dual-threshold time-domain features for sound event detection in quiet indoor environments, while in noisy indoor environments it uses ABSE as a feature to detect sound events.
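A minimal sketch of this environment classifier is shown below; we assume STE is computed as the sum of squared sample amplitudes per frame (the exact definition is not spelled out in the paper, and the threshold of 10 depends on amplitude scaling):

```python
import numpy as np

STE_THRESHOLD = 10.0   # paper's value; depends on amplitude scaling
QUIET_RATIO = 0.8      # 80% of frames must be below the threshold

def short_time_energy(frames: np.ndarray) -> np.ndarray:
    """STE per frame, here the sum of squared sample amplitudes
    (this exact definition is our assumption)."""
    return np.sum(frames ** 2, axis=-1)

def is_quiet(window_buffer: list) -> bool:
    """Environment decision over the current and past 4 windows (10 s).
    Each element is an array of shape (frames_per_window, frame_len)."""
    return all(
        np.mean(short_time_energy(win) < STE_THRESHOLD) >= QUIET_RATIO
        for win in window_buffer)
```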

Quiet Indoor Environment. In a quiet indoor environment, the energy of the audio signal received by the earphone microphone is typically low except during sound events. Figure 3a illustrates an earphone audio recording in an office scenario containing a subject's speech and several respiratory symptoms. It can be observed that the energy of the environmental noise is very low compared with the sound events. Besides discrete sound events, which may contain respiratory symptoms, the recording also includes continuous sound events (e.g., speech or music) that need to be filtered out. In the following, we introduce the employed time-domain features and explain how they are used to filter out continuous sound events.

Root Mean Square (RMS): Suppose l denotes a frame consisting of N samples, and x(l, n) denotes the amplitude of the n-th sample in l; then the RMS [8] of frame l is

$$\begin{aligned} rms\left( l \right) =\sqrt{\frac{1}{N}\sum _{n=1}^N{x\left( l,n \right) ^2}} \end{aligned}$$
(1)

The RMS measures the energy level of the current acoustic frame, so it can distinguish sound event frames from non-event frames.

Above \(\alpha \)-Mean Ratio (AMR): Assuming that w represents a window consisting of m frames, the AMR of the window w is calculated as

$$\begin{aligned} amr\left( \alpha ,w \right) =\frac{\sum _{i=1}^m{ind\left[ rms\left( l_i \right) >\alpha \cdot \overline{rms}\left( w \right) \right] }}{m} \end{aligned}$$
(2)

where \(\overline{rms}\left( w \right) \) is the mean RMS of all frames in window w, \(ind\left( \cdot \right) \) is the indicator function that returns 1 when the condition is true and 0 otherwise, and \(\alpha \) is a given parameter. AMR measures the ratio of high-energy frames in the window: the parameter \(\alpha \), together with the mean RMS of the window, distinguishes high-energy frames from low-energy ones. Given an appropriate value of \(\alpha \), windows containing discrete sound events, continuous sound events, and environmental noise return different AMR values, so this feature can be used to identify windows with discrete sound events. In SymRecorder, \(\alpha \) is set to 0.6.

RMS is first used to find the endpoints of sound events. As shown in Fig. 3b, sound events usually have higher energy, so the RMS of sound event frames is significantly larger than that of the surrounding frames. Specifically, when the RMS of three consecutive frames is above the RMS threshold \(\beta \) (i.e., 0.005), the beginning of the first frame is taken as the start point of the sound event; the end point is obtained when the RMS of three consecutive frames falls below the threshold. The AMR is then used to filter out continuous sound events, especially the user's speech. As shown in Fig. 3c, windows containing discrete sound events typically have lower AMR, because such windows contain few sound event frames, yet those frames carry much more energy than the environmental noise frames. The AMR of a speech window typically ranges from 0.3 to 0.5, since voiced frames occupy about 30% to 50% of fluent speech [9]. Therefore, when the AMR of the window containing the current sound event is less than 0.3 and the duration of the sound event is greater than 0.2 s, the sound event is considered valid; otherwise, it is discarded.

Finally, we consider the situation where the user experiences continuous symptoms. We observe that when continuous symptoms are distributed across two windows, the AMR of the window containing the larger part of the symptoms is slightly higher but still below the threshold of 0.3, so the continuous symptoms are preserved. However, when continuous symptoms are concentrated within a single window, the AMR of that window becomes similar to that of a continuous speech window, which means the continuous symptoms would be discarded. Therefore, if the AMR of the window containing a sound event is higher than 0.3 but the duration of that sound event is shorter than the window size (i.e., 2 s), the sound event is still preserved.
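Putting the two features together, a sketch of this quiet-environment detection logic might look as follows (function names and the frame bookkeeping are ours):

```python
import numpy as np

BETA = 0.005      # RMS threshold beta
ALPHA = 0.6       # AMR parameter alpha

def rms(frames: np.ndarray) -> np.ndarray:
    """Per-frame RMS, Eq. (1); frames has shape (n_frames, frame_len)."""
    return np.sqrt(np.mean(frames ** 2, axis=-1))

def amr(frame_rms: np.ndarray, alpha: float = ALPHA) -> float:
    """Above alpha-mean ratio of a window, Eq. (2)."""
    return float(np.mean(frame_rms > alpha * frame_rms.mean()))

def detect_events(frame_rms: np.ndarray) -> list:
    """Endpoint detection via the three-consecutive-frame rule."""
    events, start = [], None
    above = frame_rms > BETA
    for i in range(2, len(above)):
        if start is None and above[i - 2:i + 1].all():
            start = i - 2                      # start of the first frame
        elif start is not None and not above[i - 2:i + 1].any():
            events.append((start, i - 2))      # end before the quiet run
            start = None
    return events

def keep_event(event: tuple, window_amr: float,
               frame_dur: float = 0.01, win_dur: float = 2.0) -> bool:
    """Keep discrete events (AMR < 0.3, duration > 0.2 s) and continuous
    symptoms confined to one window (AMR >= 0.3, duration < window size)."""
    dur = (event[1] - event[0]) * frame_dur
    return dur > 0.2 if window_amr < 0.3 else dur < win_dur
```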

Noisy Indoor Environment. In a noisy indoor environment, the earphone microphone continuously receives audio signals with higher energy. Figure 4a shows an audio recording in a canteen scene; the environmental noise makes it challenging to accurately detect respiratory symptoms with time-domain features. Therefore, we employ ABSE as the feature for detecting sound events in noisy environments. ABSE divides the spectrum into multiple frequency bands and calculates the spectral entropy within each band, thus avoiding dependence on the variance of the entire spectral amplitude. The ABSE of the l-th frame is calculated as

$$\begin{aligned} H_b\left( l \right) =\sum _{m=1}^{N_b}{W\left( m,l \right) }\cdot H_b\left( m,l \right) \end{aligned}$$
(3)

where \(W(m,l)\) and \(H_b(m,l)\) are the weight and spectral entropy of the m-th sub-band, respectively. An adaptive signal threshold \(T_{s}\) is then set to classify each segment as an event segment or a noise-only segment, according to the mean \(\mu \) and standard deviation \(\sigma \) of the logarithmic ABSE values of previously detected noise-only segments. Formally, \(T_s=\mu +\gamma \cdot \sigma \), where \(\gamma \) (i.e., 0.005) is an experimental coefficient. The ABSE of the current frame is compared with this threshold; whenever the difference surpasses it, an event segment is detected. If a segment is detected as noise-only, the signal threshold is updated. Figure 4b illustrates the trend of the ABSE of Fig. 4a and of \(T_{s}\): \(T_{s}\) is continuously updated during pure noise and remains unchanged while a sound event is detected.
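A sketch of ABSE and the adaptive threshold follows; note that the band weights \(W(m,l)\) are approximated here by the bands' normalized energies, which is our assumption rather than the definition in [3]:

```python
import numpy as np

GAMMA = 0.005    # threshold coefficient gamma from the paper

def abse(frame: np.ndarray, n_bands: int = 8) -> float:
    """ABSE of one frame: weighted sum of per-band spectral entropies.
    Band weights are normalized band energies (an assumption; see [3])."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(power, n_bands)
    energies = np.array([b.sum() for b in bands])
    weights = energies / (energies.sum() + 1e-12)
    entropy = np.array([
        -np.sum((b / (b.sum() + 1e-12)) *
                np.log(b / (b.sum() + 1e-12) + 1e-12))
        for b in bands])
    return float(weights @ entropy)

class AdaptiveThreshold:
    """Maintains T_s = mu + gamma * sigma over the log-ABSE of
    noise-only segments; updated only while no event is detected."""
    def __init__(self):
        self.history = []

    def update(self, h: float) -> None:       # call for noise-only frames
        self.history.append(np.log(h + 1e-12))

    @property
    def t_s(self) -> float:
        v = np.asarray(self.history)
        return float(v.mean() + GAMMA * v.std())
```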

After the endpoints of the sound events are detected, it is necessary to separate the noise components from the sound events. We employ the Berouti spectral subtraction method to reduce the noise components. Suppose \(Y(e^{jw})\), \(S(e^{jw})\), and \(N(e^{jw})\) denote the Fourier Transform (FT) of the noisy signal, the clean signal, and the additive noise, respectively; then \(|Y\left( e^{jw} \right) |^2=|S\left( e^{jw} \right) |^2+|N\left( e^{jw} \right) |^2\). Since the additive noise cannot be obtained directly, we use the average power spectrum E of several beginning frames to approximate \(|N\left( e^{jw} \right) |^2\). Finally, \(|S(e^{jw})|\) is calculated as \(|S(e^{jw})| = \sqrt{|Y\left( e^{jw} \right) |^2 - E}\). Figure 4c illustrates the processed result: most of the environmental noise has been eliminated, and the sound events can be effectively extracted in the time domain.
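The subtraction reduces to a few lines. The sketch below implements the plain power-subtraction formula above and reuses the noisy phase for reconstruction; Berouti's full method additionally applies an over-subtraction factor and a spectral floor, which we omit for brevity:

```python
import numpy as np

def spectral_subtraction(noisy: np.ndarray, frame_len: int = 200,
                         noise_frames: int = 5) -> np.ndarray:
    """Power spectral subtraction per the formula above. The first few
    frames are assumed noise-only; their average power spectrum E stands
    in for |N|^2, and negative differences are floored at zero."""
    n = len(noisy) // frame_len
    frames = noisy[: n * frame_len].reshape(n, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    E = np.mean(np.abs(spectra[:noise_frames]) ** 2, axis=0)
    mag = np.sqrt(np.maximum(np.abs(spectra) ** 2 - E, 0.0))
    clean = np.fft.irfft(mag * np.exp(1j * np.angle(spectra)),
                         n=frame_len, axis=1)        # keep the noisy phase
    return clean.ravel()
```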

Fig. 3. Example of audio recording in office.

Fig. 4. Example of audio recording in canteen.

2.3 Subdivision of Continuous Symptom

Although continuous symptoms mainly refer to continuous cough, all detected sound events are sent to this module in order to cope with any other continuous symptoms that may occur. We design an HT-based algorithm to subdivide sound events that may contain continuous respiratory symptoms.

The algorithm's steps are shown in Fig. 5. First, the HT is applied to the sound events detected in the previous stage to extract the envelope, which represents the amplitude contour of the sound event; this smooths the signal and eliminates negative values [4].

The envelope is then passed through a Butterworth low-pass filter to obtain the fundamental frequency of the continuous respiratory symptoms. The cutoff frequency range of the low-pass filter is estimated from the duration of the current sound event. Assuming the duration is t, and given that the number of repetitions of consecutive symptoms ranges from 1 to 4 (Fig. 2b), the frequency interval of the current symptom over the time t is (1/t, 4/t) Hz. We sweep the cutoff frequency over this interval in increments of 0.1 Hz. When the cutoff frequency approaches the repetition frequency of the current symptoms, the number of peaks on the filtered envelope corresponds to the number of repetitions of the current symptom. Thus, whenever the criteria for the number of peaks are met, the variance of the peak values is recorded until the sweep concludes. The peak set with the minimum variance is then selected, and the sound event is subdivided according to the distances between the peaks. The algorithm incorporates additional conditional statements to handle the following specific situations:

  (a) Since the number of peaks on the filtered envelope tracks the number of symptom repetitions as the filter frequency sweeps, a sound event containing only one respiratory symptom yields a single peak throughout. If only one peak is still detected when the filter frequency reaches 4/t Hz, the current sound event is not processed by this module.

  (b) Some single symptoms exhibit two stages of energy burst: the first stage is sharper and carries higher energy, while the second is relatively flat with lower energy. Two peaks may therefore appear during the filter frequency sweep, suggesting a spurious subdivision of a single symptom. To distinguish this case from two consecutive symptoms, whose peaks on the filtered envelope are evenly matched, the two peak values are compared after the minimum-variance peak set is obtained. Let the first peak value be \(Peak_{1}\) and the second \(Peak_{2}\). If \(0.8 \cdot Peak_{1} < Peak_{2}\), the sound event is subdivided according to the distance between the peaks; otherwise, the current sound event is output directly.

  (c) Two stages of energy burst may also occur within continuous symptoms. The variance of the peak set helps filter out such cases: during the filter frequency sweep, the appearance of a smaller, spurious peak increases the variance of the peak set, so that peak set is not selected. Hence, if the selected peak set contains more than two peaks, the current sound event is subdivided directly according to the distances between the peaks.

Finally, we align each subdivided sound event to facilitate feature extraction. Specifically, let d denote the duration of the sound event: if \(d < 0.2\,\textrm{s}\), the sound event is discarded; if \(0.2\,\textrm{s}< d < 0.5\,\textrm{s}\), the sound event is zero-padded to 0.5 s; and if \(d>0.5\,\textrm{s}\), the middle 0.5 s of the sound event is kept, trimming \(\left( d-0.5 \right)/2\) from each end.
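A condensed sketch of the whole subdivision procedure, including the three special cases, is given below. The "at least two peaks" criterion and the midpoint-splitting rule are our reading of the description above, and the envelope downsampling is a practical detail added for numerical stability:

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, find_peaks

def subdivide(event: np.ndarray, fs: int = 20_000) -> list:
    """HT-based subdivision of a sound event into individual symptoms."""
    t = len(event) / fs
    envelope = np.abs(hilbert(event))           # amplitude contour
    step = fs // 100                            # downsample envelope to 100 Hz
    env, env_fs = envelope[::step], 100.0
    best_peaks, best_heights, best_var = None, None, np.inf
    fc = 1.0 / t
    while fc <= 4.0 / t:                        # sweep (1/t, 4/t) Hz, 0.1 Hz steps
        b, a = butter(2, fc / (env_fs / 2), btype="low")
        smooth = filtfilt(b, a, env)
        peaks, props = find_peaks(smooth, height=0)
        if len(peaks) >= 2:                     # candidate peak set
            var = np.var(props["peak_heights"])
            if var < best_var:
                best_var = var
                best_peaks = peaks
                best_heights = props["peak_heights"]
        fc += 0.1
    if best_peaks is None:                      # case (a): single symptom
        return [event]
    if len(best_peaks) == 2 and best_heights[1] <= 0.8 * best_heights[0]:
        return [event]                          # case (b): two-stage symptom
    # Cases (b)-satisfied and (c): split midway between adjacent peaks.
    cuts = ((best_peaks[:-1] + best_peaks[1:]) // 2) * step
    return np.split(event, cuts)
```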

Fig. 5. Subdivision algorithm.

2.4 Feature Extraction and Classified Model

Respiratory symptoms are abnormal manifestations of the respiratory system, typically emitted from the nasal cavity or throat and presented as specific acoustic signals. Many features exist for identifying specific types of audio signals, and one of the most commonly used is the MFCC. MFCC accounts for the non-linear frequency response of the human ear: it is obtained by applying a Mel-scale filter bank to the spectrum, taking the logarithm, and applying a frequency transformation to the logarithmic spectrum.

Although MFCC is widely used in audio signal processing, its performance is strongly influenced by the noise level. The Gammatone filter bank is more robust to noise than the Mel filter bank, so to make the acoustic features more robust, we also use GFCC as a feature.

In addition, SymRecorder requires a feature describing the local information of respiratory symptoms in both the frequency and time domains. The Short-Time Fourier Transform (STFT) splits the original signal into fixed-length time windows and applies the FT to each, capturing the short-time spectral features of the signal; we use the resulting spectrogram as the third feature.
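A sketch of the per-event feature extraction is shown below. librosa provides the MFCC and STFT; GFCC is taken from the third-party spafe package, whose use (and call signature) is our assumption, since the paper does not name its implementation. Coefficient counts and FFT sizes are illustrative choices:

```python
import numpy as np
import librosa
from spafe.features.gfcc import gfcc    # assumed third-party dependency

def extract_features(event: np.ndarray, fs: int = 20_000):
    """Return the MFCC matrix, GFCC matrix, and log spectrogram
    for one aligned 0.5 s sound event."""
    mfcc = librosa.feature.mfcc(y=event, sr=fs, n_mfcc=20,
                                n_fft=400, hop_length=200)   # 20 ms / 10 ms
    g = np.asarray(gfcc(event, fs=fs, num_ceps=20))          # Gammatone-based
    spec = librosa.amplitude_to_db(
        np.abs(librosa.stft(event, n_fft=400, hop_length=200)))
    return mfcc, g, spec
```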

SymRecorder uses a deep learning network to capture the distinctive representation of each respiratory symptom. The network architecture is shown in Fig. 6. The network uses a Convolutional Neural Network (CNN) and a ResNet as the backbone, with the MFCC matrix, GFCC matrix, and spectrogram as inputs. To enhance the differences between the features of different sound events, the lightweight Convolutional Block Attention Module (CBAM) is integrated into the ResNet. Finally, the fine-grained features extracted by the network are concatenated into a single feature vector and classified by the MLP.
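The following PyTorch sketch mirrors this architecture at a much-reduced scale: three parallel branches, each a small convolutional stem with one residual block and a CBAM, whose outputs are concatenated and classified by an MLP. Layer counts and sizes are illustrative assumptions; the actual block layout is given in Fig. 6:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Lightweight channel + spatial attention."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                 nn.Linear(ch // r, ch))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        att = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                            self.mlp(x.amax(dim=(2, 3))))
        x = x * att.view(b, c, 1, 1)                       # channel attention
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))          # spatial attention

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(x + self.net(x))                 # identity shortcut

class Branch(nn.Module):
    """One feature branch: conv stem -> residual block -> CBAM -> embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            ResBlock(32), CBAM(32),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, x):
        return self.net(x)

class SymptomNet(nn.Module):
    def __init__(self, n_classes=5):                       # 4 symptoms + "other"
        super().__init__()
        self.branches = nn.ModuleList(Branch() for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(3 * 128, 64), nn.ReLU(),
                                 nn.Linear(64, n_classes))

    def forward(self, mfcc, gfcc, spec):                   # each (B, 1, H, W)
        feats = [br(x) for br, x in zip(self.branches, (mfcc, gfcc, spec))]
        return self.mlp(torch.cat(feats, dim=1))
```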

Fig. 6. Finer feature extraction and symptom identification network structure.

3 Experimentation and Evaluation

In this section, we present the implementation details and evaluate the performance of SymRecorder based on the data collected from our experiments. We conclude with a discussion of future work.

3.1 Experimental Setup

The training set is derived from two datasets. The first is VocalSound [7], from which we obtain 2013 cough, 1310 sneeze, 1764 t-c, and 2341 sniffle samples. These samples are used to investigate the features of respiratory symptoms and to let the deep learning network learn the distinctions between different respiratory symptom characteristics.

The second dataset comes from 14 participants we recruit, consisting of 4 females and 10 males aged 12 to 58. Three participants are from the same family and spend much of their time at home; the remaining 11 are graduate students who frequent the office and canteen almost every day and spend one day per week shopping at the mall. Over a period of four months, we collect 2873 cough, 2008 sneeze, 2577 t-c, and 3135 sniffle samples under the four different environmental conditions. In addition, we gather non-symptomatic sound events (e.g., door closing), which are labeled as the “other” category.

To test the performance of SymRecorder, a prototype is developed and installed on Honor 10 and Xiaomi 12 Pro smartphones. Four volunteers who participated in data collection, joined by 6 additional volunteers, take part in the evaluation. The evaluation scenarios include home, office, canteen, and shopping mall. Over nearly three months of evaluation, we collect 1331 cough, 797 sneeze, 916 t-c, and 1054 sniffle samples. Table 1 presents detailed information about the datasets. We compare the performance of SymRecorder with the following methods, which also detect respiratory symptoms through acoustic sensing:

SymDetector [12]: This work classifies cough, sneeze, sniffle, and t-c using an SVM classifier with time-domain and frequency-domain features such as symptom duration, spectral centroid, and bandwidth.

SymListener [16]: This work uses MFCC and GFCC features to classify cough, sniffle, and sneeze with a Long Short-Term Memory (LSTM) network.

3.2 System Performance

Overall Performance. We first compare the overall performance of SymRecorder with the baseline methods, realized in an offline manner. Figure 7a shows the confusion matrix of SymRecorder, indicating that 93.18% of respiratory symptoms are correctly classified. Sniffle is sometimes classified as “other” but is rarely classified as cough; cough is occasionally classified as t-c, and sneeze as sniffle. Figure 7b compares the overall performance of SymRecorder with the two baseline methods. SymRecorder achieves the highest average recall and precision, 92.17% and 90.04%, respectively. Because SymDetector relies only on audio amplitude and RMS to detect sound-related events, it is less robust to noisy environments such as canteens and malls, and may miss sound events there. Although SymListener can adapt to strong driving noise, it does not consider continuous symptoms and treats them as individual occurrences. Furthermore, neither SymDetector nor SymListener differentiates the source of symptoms, so symptoms generated by other people also degrade their overall performance. For SymRecorder, the detection accuracy for cough and sneeze is relatively high, which can be attributed to their high energy density and long duration. In contrast, the detection accuracy for sniffle is relatively low due to its lower energy density and shorter duration.

Table 1. Setup of Datasets.
Fig. 7. The overall performance of SymRecorder.

Influence of Indoor Scenario. Figure 8a and Fig. 8b illustrate the recall and precision in different indoor scenarios. Here, “mall” refers to a comprehensive commercial complex whose environmental noise is more pronounced than in the other scenarios. SymRecorder performs best in the office environment, as offices are typically quiet. Across scenarios, the detection performance for cough and sneeze is consistently good. However, in the canteen and mall scenarios, the recall and precision for sniffle are relatively low, because these scenarios feature short, high-frequency sound events such as tray handling and buzzing, which can either mask sniffle sounds or be misclassified as sniffle. Additionally, the “other” category exhibits lower recall but higher precision, suggesting that other sound events tend to be misclassified as respiratory symptoms, while respiratory symptoms are rarely misclassified as “other”. This may be because certain sound events generate acoustic characteristics similar to respiratory symptoms.

Fig. 8. The performance of different scenarios.

3.3 Discussion and Future Work

Although we introduce a subdivision algorithm to handle continuous cough events, it cannot handle all of them: the algorithm can fail when the second burst stage of a cough resembles the first, and when multiple individuals cough simultaneously, the overlapping cough sounds may also cause errors. In future work, we will improve the subdivision algorithm and consider acquiring dual-channel signals from the earphones to distinguish between different users.

4 Related Work

Audio-based approaches have an excellent track record in respiratory health monitoring. PulmoTrack-CC [14] achieves 94% overall specificity and 96% overall sensitivity in detecting cough events. VitaloJAK [2] captures signal regions with high energy and high spectral mass to automatically count coughs in recordings. However, all of these works require the user to wear a recording device or acoustic sensor, which is inconvenient in daily use.

More recently, smartphones have been used to collect respiratory health information. iSleep [8] is a smartphone-based sleep monitoring system that detects the user's snoring, but it requires a quiet environment. SymDetector [12] is a smartphone-based application that detects sneeze, cough, sniffle, and t-c sounds in home or office environments, but it is also sensitive to ambient noise. SymListener [16] detects sneeze, cough, and sniffle sounds in the driving environment with high robustness to noise, but without considering the effects of continuous symptoms.

5 Conclusion

We propose SymRecorder, an earphone-microphone-based application that can unobtrusively detect the user's respiratory symptoms, including cough, sneeze, t-c, and sniffle, in various indoor environments. A method called RA-ABSE is designed to detect the endpoints of sound events, and Berouti power spectral subtraction is employed to remove potential environmental noise. We devise an HT-based algorithm to subdivide possible continuous symptoms, utilize MFCC, GFCC, and the spectrogram as features, and employ a ResNet with a stacked attention mechanism and an MLP for classification. Extensive experiments in different indoor environments demonstrate that SymRecorder can detect respiratory symptoms with high accuracy.