1 Introduction

Obstructive sleep apnea-hypopnea disorder (OSA) is characterized by repetitive narrowing or complete closure of the upper airway (UA) during sleep, leading to complete cessation (apnea) and/or partially (≥ 50%) reduction (hypopnea) of airflow that lasts at least 10 s and is associated with a minimum of 4% drop in oxygen saturation level of blood (SaO2) [1]. The apneic episodes often contribute to frequent arousals or awakenings in order to restore airway functioning. These frequent arousals reduce sleep continuity and quality [2]. Untreated OSA is associated with many deficits, including excessive daytime sleepiness, increased risk of motor vehicle accidents [1], and increased risk of cardiovascular disease [3].

OSA is a common health problem which affects all age groups. Between 9 and 38% of the general adult, the population is suffering from OSA, while it is higher in men compared to women and much higher in the elderly groups [4]. Since OSA is still underdiagnosed, these statistics are believed to underestimate the actual numbers [5]. The severity of OSA is currently measured by the number of apneic episodes per hour of sleep using the apnea/hypopnea index (AHI). AHI < 5 considered as non-OSA, and 5 ≤ AHI < 15, 15 ≤ AHI < 30, AHI ≥ 30 considered as mild, moderate, and severe OSA, respectively [6]. AHI is measured by polysomnography (PSG) assessment over the night sleep. PSG records various signals including electroencephalogram, electrocardiogram, electrooculogram, electromyography of chins and leg, body position, nasal airflow, SaO2, as well as abdominal and thoracic movements in order to provide a full assessment of sleep quality [7]. While full-nocturnal PSG is considered as the gold standard for sleep apnea assessment, it is time-consuming, laborious, expensive, and not easily accessible, particularly in small towns and remote areas. Thus, in emergency situations that the OSA status of a patient is needed, it might not be feasible to perform a quick assessment of OSA status using PSG approach, hence underdiagnose OSA. For example, around 80% of patients presenting for operation are undiagnosed at the time of surgery [5]. Inadequate preoperative assessment of OSA in these patients may increase their postoperative complications risks. Therefore, a quick and reliable screening OSA assessment during daytime without a full-nocturnal sleep study is highly desirable.

The current quick preoperative OSA screening method in hospitals is based on using subjective questionnaires, such as Berlin Questionnaire, Epworth sleepiness scale, STOP, and STOP-BANG [8], that collect anthropometric information (gender, neck circumference, age) [9]. These methods have shown to have a high sensitivity (90%) but at the cost of a very low specificity (< 40%) [8]. Consequently, they could potentially result in identifying a much higher number of participants as high-risk patients and reducing the cost-effectiveness of such assessment. On the other hand, our team and a few other researchers around the world have used tracheal breathing sounds (TBS) for OSA monitoring and screening during daytime to predict their OSA condition with better accuracy than the commonly used method based on questionnaires [10,11,12]. The premise of our technique to screen OSA during wakefulness is based on the fact that the TBS change as the UA structure varies in OSA individuals [13, 14]. It has been shown that these structural changes are present not only during sleep [15, 16] but also during daytime when patients are awake (wakefulness) [17,18,19,20].

In a study reported in [12], the formant frequencies of TBS and their variation from inspiration to expiration were investigated for 10 mild moderate (AHI < 30) and 13 severe (AHI ≥ 30) OSA patients during wakefulness before getting asleep. They used an LDA classification approach using formant features and some anthropometric information. They reported sensitivity (specificity) of 88.9% (84.6%) and accuracy of 86.4%. Since the features were extracted from the entire dataset (due to the small sample size), those results are considered as biased. Also, LDA is a generative approach that makes strong and sometimes unrealistic assumptions such as normality as well as equal variance structure of all extracted features for each class. In our team’s previous study [10], many features were extracted from the TBS’ power spectra of 105 randomly selected participants (56 non-OSA, AHI ≤ 5 and 49 OSA, AHI ≥ 10) out of a total of 130 participants. Next, using data of all participants and within an exhaustive leave-two-out routine (by leaving one subject from each of the two groups as testing and the remaining as training), six features with the best discriminative power were selected. Afterward, using a support vector machine (SVM) classifier with linear kernel, the classification accuracy of all combination of three-feature sets were calculated. This procedure was repeated until each participant’s data was used at least once as a test. Finally, from the most repeated 3 features, 2 of them with the lowest correlation were selected. Then, through another exhaustive leave-two-out cross-validation, the 2-class SVM classifier resulted in 83.83% and 83.92% accuracies for training and test datasets, respectively. The sensitivity (specificity) was 83.92% (85.20%) for training and 82.61% (85.22%) for test datasets [10]. Despite the high accuracy of the classification results, the proposed procedure in [10] is complicated and computationally expensive. Moreover, the random selection of training participates for feature extraction was done only once; thus, the reported results might have some bias.

In this paper, we propose to apply the logistic regression with least absolute shrinkage and selection operator (LASSO) or l1-penalty for classification of apneic individuals using the combination of the anthropometric information and TBS during daytime while the subjects are awake. LASSO is a powerful regularization method that performs the feature selection for statistical models [21] by shrinking and sometimes setting some of the coefficients of the regression variables to zero [21, 22]. Removing some of the coefficients can reduce the variance without a substantial increase of the bias, hence, increase the prediction accuracy. Moreover, by eliminating the irrelevant variables that are not associated with the response variable, the LASSO helps to increase the model interpretability and reduce the overfitting [21, 22]. This method has been widely used in variable selection and classification in many clinical fields [23,24,25]. We also use LASSO logistic regression for our classification problem, as it is a discriminant approach that does not require normality assumption or homogeneity of features in OSA classes. To the best of our knowledge, this is the first research where these techniques are used to assist in OSA diagnosis from TBS signals recorded during the daytime. We have validated our approach on a larger number of participants compared to that of previous studies. The physiological interpretation of the selected features in relation to the structural changes of UA due to OSA are also discussed.

2 Method

2.1 Data

Data of this study were collected from 199 participants suspected of OSA, prior to overnight PSG assessment at Sleep Disorder Lab at Misericordia Health Centre (Winnipeg, MB, Canada). The Biomedical Research Ethics Board of the University of Manitoba approved the study, and all participants signed an informed consent form prior to participating in the study. After our recording, participants proceeded to overnight PSG preparation and sleep assessment. We collected their anthropometric information and calculated their AHI (from PSG assessment) afterward to compare with our proposed method’s AHI prediction and calculate its accuracy. Our collected data included 74 (29 males) non-OSA with AHI < 5, and 90 (66 males) moderate/severe OSA with AHI ≥ 15. The remaining 35 (21 males) mild-OSA participants with 5 ≤ AHI < 15 were dealt separately. Table 1 presents the anthropometric information of the participants.

Table 1 Anthropometric information’s mean and their corresponding standard deviations (std) for the non-OSA, moderate/severe OSA, and mild-OSA groups

2.2 Recording procedure

TBS data were recorded during the daytime, while the participants were awake, and prior to their PSG assessment. Participants’ TBS were recorded by a miniature microphone (Sony ECM-77B) embedded in a chamber with a 2-mm cone-shaped space with skin placed over the suprasternal notch of trachea using a double-sided adhesive ring tape. The chamber was embedded in a soft neckband wrapped around the participants’ neck to prevent plausible movement and comfort of the participant. The recorded breathing sounds were amplified using a Biopac (DA100C) amplifier, band-pass filtered in the frequency range of (0.05–5000 Hz), and digitized at 10240 Hz sampling rate. In supine position with head resting on a pillow, the participants were instructed to breathe deeply in two maneuvers: first through their mouth with a nose clip in place, second through their nose with their mouth closed. and then we recorded 5 full breath cycles in each maneuver.

2.3 Pre-processing and signal analysis

All of our recordings were done in a hospital setting in a relatively noisy background. Thus, the recorded sounds contained various types of noises (e.g., vocal noises, air conditioner sounds) in addition to the desired breathing sounds. We applied different processes to remove unwanted noises from the collected TBS. As the first step, we examined the recorded sounds in the time-and-frequency domain by visual and auditory means using the spectrogram program of MatlabTM software; this step was done manually to ensure 100% accuracy in selecting noise-free breathing sounds segments. Consequently, all of the breathing cycles that included artifacts, vocal noises, swallowing, and low signal to noise ratio compared to the background noise were excluded from the analysis. Figure 1 shows a sample spectrogram that contains breathing sounds in addition to various type of noises including swallowing, coughing, vocal noises, and the michrophone movement sounds due to either fat or lose tissues of the neck touching the microphone.

Fig. 1
figure 1

Spectrogram of a sample recording with vocal noises, swallowing, coughing, and microphone movement sounds due to attachment of the microphone to the participant’s skin

As the second step of noise reduction, we used a 5th-order Butterworth band-pass filter in the range of (75–2500 Hz) to keep the main frequency components of the breathing phases of each selected sound and to suppress the effect of low and high-frequency noises (including heart sounds that are less than 75 Hz [26], neck muscle artefact, fundamental frequency of the power line (60 Hz in Canada), and background noises). The reason for choosing the 2500 Hz as the high-frequency bound is that most of the TBS energy is claimed to be below 2500 Hz [27]. Next, similar to our previous studies [10, 28], we normalized the filtered sounds first by their variance envelope (64-sample sequence moving average filter of the signal), and secondly, by their energy to compensate for probable different flow rates in different breathing cycles.

In this study, we did not record the respiratory flow of participants, however, to ensure the respiratory phases all recording procedures started at the inspiration phase and marked by the voice of the experimenter. Using that auditory marker, the inspiratory and expiratory phases were separated manually and analyzed independently. Then, we used the method introduced in [14] to determine the approximately stationary portion of each normalized sound. In brief, the logarithm of the sound’s variance (LogVar) was calculated, and the sounds segments corresponding to the middle part (50% duration around the maximum) was extracted as the stationary portion of each breathing phase. Finally, we estimated the power spectrum density (PSD) of the stationary portion of each TBS signal using Welch’s method [29] in windows of 205-point (~ 20 ms) with 50% overlap between the successive windows. The values of optimum window size and overlap for segmenting the TBS signal were selected based on the results of our previous study [28] and our various pilot studies that showed these values estimates the mostly differentiable power spectrum density (PSD) signals between the non-OSA and OSA subjects.

Following the above routine, for each participant, we estimated the PSD in both linear and logarithmic scale for TBS recorded in each mouth and nasal maneuvers. The PSD estimation was done for each inspiratory/expiratory phase separately and also for their combination (summation/subtraction). This routine resulted in 16 different PSD signals for each participant; they were then averaged over the breathing cycles of each participant’s data. Figure 2 outlines the abovementioned pre-processing and signal analysis.

Fig. 2
figure 2

Pre-processing and signal analysis framework. Tracheal breathing sounds were extracted during daytime while the participants were awake. After manual exclusion of noisy sounds, the breathing sounds were separated as inspiratory and expiratory phases. Each individual phase of breathing sounds were filtered and normalized to purify tracheal breathing sounds. Afterward, the power spectrum density were estimated for the approximately stationary portion of each normalized sound

2.4 Feature extraction

We used the 5-fold cross-validation routine to randomly split the set of 164 participants associated with the non-OSA and moderate/severe OSA groups into five non-overlapping groups (folds), each consisting of approximately 33 participants (20% non-OSA and 20% moderate/severe OSA). Each time, participants’ TBS data of one fold were left out and considered as blind-testing data, and data of the remaining four folds (131 participants) were considered as training data and used for feature extraction. The test error was estimated by averaging the five resulting error estimates. Data of the mild-OSA group (5 < AHI < 15, n = 35) were not used during feature selection, and they were dealt separately as described later in Section 2.5.

As mentioned in Section 2.3, within each training set, we estimated the PSD for all participants separately. In order to extract the characteristic features between the mild and moderate/severe OSA groups of this study, we separately averaged the PSD signals of mild and moderate groups within each training dataset and calculated their 95% confidence intervals (CI). The 95% CI for either mild or moderate/severe OSA groups were calculated according to the Eq. 1 as follows:

$$ \mathrm{CI}=\overline{\mathrm{PSD}}\pm \mathrm{Critical}\_\mathrm{value}\times \frac{\mathrm{std}\left(\mathrm{PSD}\right)}{\sqrt{n}} $$
(1)

where \( \overline{\mathrm{PSD}} \) is the average of PSD for each OSA group, std is the standard deviation, n is the number of participants in each OSA groups, and Critical _ value is the t-statistics (values of t-distributions) with n − 1 degree of freedom.

We considered the mean and slope of the regions with different slopes and the non-overlapped area between the average power spectra of two groups in association with their corresponding 95% CIs as characteristic features to be further selected for classification. As an example, Fig. 3 shows the average power spectra that calculated from the summation of inspiratory and expiratory mouth TBS signals in linear and logarithmic scale for the non-OSA and moderate/severe OSA groups in one of the training sets as well as their corresponding 95% CIs. TBS features were extracted from the regions between the solid, dotted, and dashed lines. Overall, we extracted 78 TBS features. Combining these TBS features with seven anthropometric features of each participant (sex, age, height, weight, body mass index (BMI), neck circumference (NC), and Mallampati score (MP)) resulted in a total of 85 features.

Fig. 3
figure 3

Average power spectra of the summation of mouth inspiratory and expiratory breathing sounds with their 95% confidence intervals (CI, shadows) in non-OSA (blue) and moderate/severe OSA (red) groups, using one of the training sets with 131 participants in a linear scale, b logarithmic scale with base of 10. The area between solid lines, dotted lines, and dashed lines shows the regions where the features were extracted

2.5 Feature reduction and classification

Due to the high number of extracted features, we used a feature selection/reduction approach to find the most characteristic features for classification of the two moderate/severe OSA and non-OSA groups. The first step of feature reduction was to apply an unpaired t test on the features of the participants in each training set and to select the statistically significantly different features (p < 0.01) between non-OSA and moderate/severe OSA groups within that training set. In order to meet the statistical significance of 99% for the overall test, the significance level of each test was considered as 1 − (1 − 0.01)1/85 ≅ 1.2 × 10−4 [30, 31]. Within each training set, on average 11 features with p value >1.2 × 10−4 were omitted, and the features with the discrimination potential between the two groups were kept for further analysis.

To classify the two non-OSA and moderate/severe OSA groups, we considered a linear logistic regression model for the severity level of participants:

$$ \log \left(\frac{\Pr \left(Y=1|X=x\right)}{1-\Pr \left(Y=1|X=x\right)}\right)={\beta}_0+{\beta}^Tx $$
(2)

where {0, 1} is the true severity label of participants, X is the matrix of (n × p) for n participants and p features, β is the coefficient vector, and Pr(Y = 1| X = x) represents the probability of assigning a participant to the moderate/severe group.

To further reduce the number of features and obtain a parsimonious model, we used the LASSO approach [21] as a powerful and stable feature selection algorithm to automatically select significant features by shrinking the coefficient of unimportant features to zero. This is done by estimating the parameters of the logistic regression via minimizing the negative log-likelihood with an l1- regularization defined as follows:

$$ {\arg \min}_{\beta}\left(-\frac{1}{n}\sum \limits_{i=1}^n\left\{{y}_i\left({\beta}_0+{\beta}^T{x}_i\right)-\log \left(1+{e}^{\beta_0+{\beta}^T{x}_i}\right)\right\}+\lambda \sum \limits_{j=1}^p\left|{\beta}_j\right|\right) $$
(3)

where λ is the penalty parameter that is selected in a way to minimize the out of sample error (generalized error) of the model. For each value of λ, the above optimization problem is solved and the optimal value of λ is tuned using the cross-validation [32]. As the objective of this study was to balance the accuracy and computational cost, we selected λ such that it provides the smallest number of coefficients with a reasonable accuracy. Therefore, the λ that occurred within one standard error of the optimal λ was selected as the optimum value. The LASSO penalized logistic regression model was implemented using the Glmnet package of R statistical software version 3.5.1.

We applied this method to the dataset of entire non-OSA and moderate/severe OSA participants (164 participants) and also to the data of different training sets to investigate the robustness of the procedure. From the selected features of each dataset, the features with a high correlation (> 50%) to other features were excluded, and the remaining were used for classification.

To classify each participant into one of the two groups of non-OSA or moderate/severe OSA, we used the features selected by the lasso feature selection method as input to another LASSO logistic regression method trained for the classification. Once β coefficients are determined, the probability of assigning a participant to either non-OSA or moderate/severe OSA is determined using the Eq. 2. The accuracy, specificity, and sensitivity were calculated using test and training data of each dataset as well as data of entire non-OSA and moderate/severe OSA participants. Data of the mild-OSA participants with 5 ≤ AHI < 15 were also classified at this stage with the same procedure using the models formed based on data of the entire non-OSA and moderate/severe OSA groups as well as data of each training dataset. Figure 4 shows the flowchart of this method for feature selection and classification.

Fig. 4
figure 4

The flowchart of feature selection and classification methods, using the LASSO logistic regression

The total computational cost of our proposed algorithm was also calculated; it is detailed in Appendix B.

3 Results

In this study, features were mainly extracted from the areas with different slopes and the non-overlapped regions of the average power spectra of non-OSA and moderate/severe OSA participants. As these regions were slightly changed among different training sets, for robustness, we chose the common area among the five-fold training datasets. Our feature selection procedure resulted in the following five features as described in Eq. 47; F represents a feature. Table 2 presents the features that were selected using data of each training set as follows:

$$ {\displaystyle \begin{array}{l}\mathrm{F}1:\mathrm{neck}\ \mathrm{circumference}\\ {}\mathrm{F}2:\mathrm{slope}\left(10\log 10\left({\mathrm{PSD}}_{\mathrm{mouth}-\mathrm{Ins}}+{\mathrm{PSD}}_{\mathrm{mouth}-\exp}\left(260:440\ \mathrm{Hz}\right)\right)\right)\ \end{array}} $$
(4)
$$ \mathrm{F}3:\mathrm{mean}\left(10\log 10\left({\mathrm{PSD}}_{\mathrm{mouth}-\mathrm{Ins}}+{\mathrm{PSD}}_{\mathrm{mouth}-\exp}\left(1100:1350\ \mathrm{Hz}\right)\right)\right) $$
(5)
$$ \mathrm{F}4:\mathrm{slope}\left(\left(\ {\mathrm{PSD}}_{\mathrm{mouth}-\exp}\left(250:350\ \mathrm{Hz}\right)\right)\right) $$
(6)
$$ \mathrm{F}5:\mathrm{mean}\left(10\log 10\left({\mathrm{PSD}}_{\mathrm{nose}-\mathrm{Ins}}+{\mathrm{PSD}}_{\mathrm{nose}-\exp}\left(1000:1300\ \mathrm{Hz}\right)\right)\right) $$
(7)
Table 2 Selected features for different training sets and the dataset containing entire non-OSA and moderate/severe OSA participants, captured using the regularized logistic regression with the LASSO penalty feature selection method

Table 3 presents the coefficients of each feature for different training sets and also the entire dataset of non-OSA and moderate/severe OSA participants. Figure 5 shows the correlation between the AHI and the selected features for separating non-OSA and moderate/severe OSA participants within one of the training sets as an example.

Table 3 Coefficients of selected features of different training sets and the dataset containing entire non-OSA and moderate/severe OSA participants, captured using the regularized logistic regression with the LASSO penalty feature selection method
Fig. 5
figure 5

Visualization of the correlation between AHI and the selected features of one of the training datasets, using LASSO feature selection method. On top, the values of the correlation between features plus their significance level as star (***p < 0.001). The overall distribution of features showed as histograms with a fitted curve. The selected features are neck circumference, the average of the power spectrum calculated from summation of mouth inspiratory and expiratory breathing sounds in the high-frequency range of (1100–1350 Hz), and the slope of the power spectrum calculated from summation of mouth inspiratory and expiratory breathing sounds in the frequency range of (260–440 Hz)

The selected features of different datasets were investigated using the unpaired t test to check whether they were statistically significantly different from each other. All the p values were found to be highly significant (p <7.9 × 10−6). The discriminative power of the selected features between non-OSA and moderate/severe OSA participants within different datasets were also investigated using other sets of unpaired t tests. Compared to moderate/severe OSA participants, the non-OSA participants showed to have lower NC, lower high-frequency power, and lower slope around the second formant. All the p values were found statistically significant (p < 4.6 × 10−6).

Different multivariate analysis of variance (MANOVA) tests were run over the selected features of the five different training datasets and also the entire non-OSA and moderate/severe OSA dataset to ensure the combination of the selected features in each feature set were (statistically) significantly different between the two groups; all were found to be highly significant (p < 6.5 × 10−7). Therefore, the selected features were used for classification of the two groups within each training sets as well as the entire dataset (all five-folds’ data).

Table 4 provides the classification specificity, sensitivity, and accuracy of applying the regularized logistic regression with LASSO penalty over data of different training datasets and their corresponding blind-testing sets, as well as data of the entire five-folds’ dataset. The classification of the entire 164 non-OSA and moderate/severe participants showed 81.1% accuracy with sensitivity and specificity of 84.4% and 77.0%, respectively. The average accuracies of the 5 training sets were found to be 82.3% ± 0.7 and 79.3% ± 6.1 for training and testing, respectively. The sensitivity (specificity) were 85.0% ±2.0 (79.1% ± 1.0%) for training and 82.2% ± 7.2 (75.8% ± 9.9%) for testing, respectively. Table 4 also reports the area under the curve (AUC) values associated with the average receiver operating characteristic (ROC) curves of the regularized logistic regression classifier for classification of non-OSA and moderate/severe OSA participants in different datasets, using their selected features. The AUC value associated with the ROC curve calculated using data of all non-OSA and moderate/severe OSA participants was shown to be 89.02. In addition, the average of the AUC values between the 5 training datasets was found to be 89.9 ± 1.7. Figure 6 shows the ROC curve calculated using the non-OSA and moderate/severe OSA participants for one of the training sets. Table 5 shows the number of misclassified participants in each fold of training and testing as well as the entire dataset. Table 6 details the anthropometric information of the misclassified participants when applying the LASSO logistic regression model to the entire dataset.

Table 4 Regularized logistic regression classification results for different training sets and their corresponding blind-testing sets, the average of all five datasets and their standard deviations (std), and also for the dataset containing entire non-OSA and moderate/severe OSA participants, using their corresponding features selected by the LASSO penalized logistic regression feature selection method.
Fig. 6
figure 6

ROC plot of the regularized logistic regression classification over selected features of one of the training datasets, found using the LASSO feature selection method. The selected features are neck circumference, the average of the power spectrum calculated from summation of mouth inspiratory and expiratory breathing sounds in the high-frequency range of (1100–1350 Hz), and the slope of the power spectrum calculated from summation of mouth inspiratory and expiratory breathing sounds in the frequency range of (260–440 Hz)

Table 5 Number of misclassified participants found using the regularized logistic regression classification for different training sets and their corresponding blind-testing sets, the average of all five datasets, and also for the dataset containing entire non-OSA and moderate/severe OSA participants
Table 6 Anthropometric information’s mean and their corresponding standard deviations (std) for the misclassified non-OSA and moderate/severe OSA participants of the dataset containing entire non-OSA and moderate/severe OSA participants, found using the regularized logistic regression with LASSO penalty classifier

The computational cost for each step of our proposed method is detailed in Appendix B. The overall computational cost of our proposed approach is shown in Eq. 8.

$$ 16n\times \mathrm{O}\left(\mathrm{Nlog}(L)\right)+p\ O(n)+O\left(\mathrm{pn}\right)+O(3n)\approx 16n\times O\left(\mathrm{Nlog}(L)\right)+O\left(\mathrm{pn}\right) $$
(8)

where n is the number of participants, N is the length of the TBS after normalization, L is the window length, and p is the number of features. The computational cost of our proposed method is linear in terms of the number of features and the number of participants; hence, the method is very fast.

4 Discussion

In this study, we proposed to use a regularized logistic modeling approach as a simple and fast method for analyzing TBS recorded during daytime when the subjects were awake (wakefulness) with the purpose of predicting their OSA severity by a few minutes breathing sound recording and data analysis. Our proposed approach is discriminative in nature and does not rely on strong model-based assumptions. We developed a robust feature selecting approach, and used it in classifying subjects with AHI < 5 and AHI ≥ 15. The first set of features considered for this study (p = 85) were a combination of the TBS features and the anthropometric information of participants. As there were two breathing maneuvers and we investigated respiratory phases both separately and in combination, the number of features was large. Thus, we used a feature selection approach based on regularized logistic regression methodology with a LASSO penalty to select the best subset of features.

In LASSO feature selection method, shrinking and removing some of the model’s coefficients lead to reducing the variance without a considerable increase in bias; hence, it provides desirable prediction accuracy [32]. This is useful when the number of participants is not much larger than the number of features, as it was the case in our study. The LASSO method is very fast and provides an interpretable model by weighting the selected features. Therefore, it is easy to explain which features are more important for predicting the true class labels; the coefficients of the selected features show their importance ranking (Table 3).

As the LASSO method uses a cross-validation procedure to automatically estimate the optimal value of λ (the penalty parameter), there is no need to perform another cross-validation procedure for obtaining the training and testing results. However, we generated five sets of randomly selected training and blind-testing data to ensure the unbiased test classification. Our findings (Tables 2 and 3) show that in different training sets the selected features and their corresponding coefficient values were different; however, in all of the training sets and also in the entire five-folds’ dataset three selected feature types were common. Those features were (1) the neck circumference, (2) the slope of the average PSD around the second formant, and (3) the average of the high-frequency portion of the average PSD. These features are congruent with the findings of our previous study on a separate dataset [28]. In all training and testing datasets of this study, the unpaired t test showed that the selected features were statistically significantly different from each other and also different between the two groups of non-OSA and moderate/severe OSA. These features also showed high correlations with AHI. In addition, the ROC curves of the different LASSO logistic regression classifiers over different training sets and also on the entire dataset showed high performance with high values of AUC (Table 4). These high AUCs indicate the robustness of the LASSO logistic regression classifiers with respect to our selected features. Appendix A reports the results of using a simple logistic regression for classification as opposed to the LASSO logistic regression method. The simple logistic regression classification of the entire 164 non-OSA and moderate/severe participants showed 80.5% accuracy with sensitivity (specificity) of 82.2% (78.4%), which are lower than those of LASSO logistic regression classification (Table 4). The reason that LASSO logistic regression classification resulted in better than the simple logistic regression is that regularization term of LASSO logistic regression classification method (λ) improves the generalization performance (performance on new data) by reducing the variance of parameter estimates (shrinks the coefficients that contribute most to the error).

It is also important to investigate the characteristics of misclassified participants. For the dataset containing all non-OSA and moderate/severe OSA participants, the overall accuracy of the LASSO logistic regression was 81.1%; out of 164 subjects, 31 were misclassified, out of which, 17 were from non-OSA and 14 were from moderate/severe OSA group (Table 5). Further investigation on the anthropometric information of misclassified subjects (Table 6) and comparing them with the non-OSA and moderate/severe participants of this study, revealed that overall, the non-OSA misclassified individuals are characterized with higher NC, BMI, height, weight, and being male, while the moderate/severe OSA misclassified individuals are characterized with lower NC, BMI, height, weight, and being female. This finding is not surprising, as it is known that these anthropometric parameters affect the breathing sounds while being correlated with AHI [33]. Thus, ideally, we should have the two non-OSA and moderate/severe OSA groups of the training and testing matched in these characteristics. However, that requires a much larger dataset.

In this study, similar to all other similar studies, we used the AHI values as the true label of each individual to be either in non-OSA or moderate/severe OSA group. Although AHI is one of the most important and commonly used indicators for diagnosis of OSA, there are many individuals with similar AHI but different daytime symptoms and different levels of OSA severity. The clinical diagnosis of OSA is not only based on AHI; it is based on a combination of frequency of arousals, daytime symptoms, lack of deep sleep, etc. [34]. We speculate that the TBS features are indeed good representatives of the OSA severity as they are affected by the pathophysiology of OSA. However, the UA pathology of two individuals with similar AHI (e.g., AHI of 14 and 16) may not be that much different to affect the TBS differently enough to be detected by sound analysis. For this reason, to avoid the misclassification in the borderline ranges with artificially crisp nature, we allowed a small gap (5 ≤ AHI < 15) in continuous AHI values of the non-OSA and moderate/severe OSA groups to form the two groups. Once the classifiers were trained, the participants with AHI in this gap range were classified using our proposed method. Using the model obtained based on data of 164 non-OSA and moderate/severe participants, 11 out of 35 individuals in the gap range were assigned to the non-OSA group and the remaining 24 individuals assigned to the moderate/severe group. Further investigation of these participants revealed that the individuals assigned to the moderate/severe group had higher NC, higher slope around the second formant, and higher average PSD in high-frequency ranges compared to those assigned to the non-OSA group, while there was not much difference in their AHI values (8.8 ± 0.6 and 8.4 ± 0.8, respectively). This finding confirms the shortcoming of using AHI as the only indicator for OSA severity. Finding a better OSA severity indicator still remains a challenge. Perhaps a better indicator would be a combination of several factors including AHI, duration of apneic events, and number of arousals, and also the breathing sounds features.

Last but not the least advantage of the proposed method is that the most repeated features among the selected features of all datasets were extracted from the combination of inspiratory and expiratory TBS. This finding implies that there may be no need to separate the breathing phases in the pre-processing step. Although there are various methods for automatic separation of breathing phases into inspiration and expiration [35, 36], however, none of them are 100% accurate. Therefore, there is always a need for manual investigation of breathing sounds. Eliminating this part from the pre-processing procedure will speed up the data analysis significantly; hence, providing a faster online screening tool for OSA.

4.1 Physiological interpretations of the selected features

It is worth mentioning that selected features using the regularized logistic regression approach are physiologically meaningful and highly correlated with AHI (Fig. 5).

Obesity is a risk factor in patients with sleep apnea [37, 38]. The relationship between obesity and OSA has been demonstrated to be totally explained by variation in neck size [39, 40]. It was suggested that the increases in neck circumference and fat deposited around the upper airway might narrow the upper airway [38]. Therefore, we can conclude that NC, a surrogate measure of neck fat, can be considered as a predictor of the presence of OSA.

A highly collapsible/compliant pharyngeal airway is known as the main anatomical abnormality in the UA of OSA patients [41]. A reason for increased collapsibility of OSA subjects compared to that of controls is their smaller cross-sectional area [42, 43]. According to the Helmholtz equation [44], an UA with a smaller cross-sectional area reflects a lower resonant frequency (f) as follows:

$$ f=\frac{c}{2\pi}\sqrt{\frac{A}{VL}} $$
(9)

where c is the speed of sound, A is cross-section area, V is the cavity volume, and L is the length of the bottleneck. The resonant frequencies of the UA are represented by formant frequencies; the peaks in the spectrum of sound signals. Hence, we expect to see lower formant frequencies for moderate/severe OSA participants. This is congruent with what is observed in Fig. 3, especially for the second formant (F2). In our study, the higher slope in PSD of the moderate/severe OSA participants’ around their F2 implied that their F2 occurred in lower frequencies compared to that of non-OSA participants.

On the other hand, stiffness is the property of a tissue that resists the change in shape in response to an applied force. It has been shown that during wakefulness, the stiffness of the tongue muscles and soft palate tissues of OSA patients is higher compared to those of controls [45]. The stiffening of the narrower portion of the UA of OSA individuals was also supported by showing a predominantly bottle shape for their narrowed UA [20]. As an increase in stiffness is presented by an increase in high-frequency components of the breathing sounds, we expected to see higher energy in the high-frequency portion of the average PSD of moderate/severe OSA participants compared to that of non-OSA; this is what is observed in Fig. 3.

5 Conclusion

In this study, we investigated the application of regularized logistic regression model using a LASSO penalty for prediction of OSA severity during wakefulness using TBS analysis and anthropometric information. The selected features by this regression modeling approach are congruent with those selected by other methods in our previous studies; they are physiologically meaningful and highly correlated with AHI. The unbiased blind-testing classification accuracy, sensitivity, and specificity over the non-OSA and moderate/severe OSA groups of five different folds were found to be 79.3% ± 6.1%, 82.2% ± 7.2%, and 75.8% ± 9.9%, respectively. The accuracy for the entire dataset was 81.1% with sensitivity and specificity of 84.4% and 77.0%, respectively. Although results of this study are similar but not necessarily superior to previous studies’ results but they are important from different aspects. The most important contributions of this work are (1) the method presented in this work does not need respiratory phase identification as the best features do not depend on that (2) the method is simple, quick, and computationally more effective than SVM and other previous methods for OSA screening during wakefulness; thus, suitable for online applications; (3) the method can be used for screening OSA individuals with severity as low as 15 with high sensitivity and a reasonable specificity. The main limitation of this study is the lack of a large sample size to have a better way to handle OSA severity prediction of individuals with 5 < AHI < 15.