Introduction

Childhood obstructive sleep apnea syndrome (OSAS) is a prevalent yet relatively under-diagnosed condition [1]. According to the American Academy of Pediatrics, OSAS affects 1 to 5% of children in the general pediatric population [2]. Untreated OSAS has been associated with adverse consequences affecting multiple organ systems in infants and young children, reducing overall health and quality of life while increasing health care use and associated costs [1, 2]. Thus, it is important to screen for the presence of the disease in children showing symptoms in order to perform early and timely diagnosis and treatment, an issue that has been recently re-emphasized by the American Academy of Pediatrics [2]. In-laboratory polysomnography (PSG) is the gold standard test for OSAS [2, 3]. However, the availability and accessibility of specialized sleep laboratories are too limited to match clinical needs [4, 5]. In addition, PSG is onerous, intrusive, and relatively inconvenient for children and caretakers, which hampers its widespread implementation [6].

The aforementioned drawbacks have led to an extensive search for screening tools over the last decade [1, 2, 7, 8]. Overnight oximetry has emerged as a potentially useful screening approach due to its simplicity and suitability in children [3, 9, 10, 11, 12, 13, 14]. In this context, the oxygen desaturation index (ODI) is the most widely used oximetry marker [5, 10, 12]. Nevertheless, besides the number of desaturations, oximetric recordings contain a great deal of additional information. In this regard, automated signal processing and pattern recognition techniques are able to optimize the diagnostic capability of oximetry, leading to a more reliable and effective approach [9, 14]. A recent multicenter study evaluating more than 4000 overnight oximetry recordings indicated that machine-learning approaches enable accurate identification of children with OSAS [15].

In the framework of automated pattern recognition, statistical classifiers model the characteristics of the problem under study, leading to automated tools able to assist physicians in their daily practice. Particularly, linear discriminant analysis (LDA) [9, 16,17,18] and logistic regression (LR) [10, 14, 19, 20] have been previously used with relative success for binary classification of children suspected of suffering from OSAS. Similarly, quadratic discriminant analysis (QDA) also provides a suitable alternative, but has been marginally assessed in this context [21]. Nevertheless, a thorough review reveals that no previous studies have extensively assessed the performance of these classification approaches when using different cutoff criteria for the diagnosis of childhood OSAS. We hypothesized that a comprehensive comparative analysis could provide essential information that would enable widespread use of automated processing of overnight oximetric recordings as a screening tool for pediatric OSAS. Therefore, the aim of this study was to design and assess several statistical binary classifiers based on overnight oximetry using different clinically used cutoffs for pediatric OSAS.

Materials and methods

Participants

A total of 176 consecutive otherwise healthy children (97 boys and 79 girls) ranging from 1 to 13 years of age composed our dataset. All children were referred by pediatricians to the Pediatric Sleep Unit at the University of Chicago Medicine Comer Children’s Hospital (Chicago, IL, USA) for evaluation of habitual snoring and suspected OSAS. Enrolment criteria included habitual snoring and/or witnessed breathing pauses during sleep as reported by their parents or caretakers. The Institution’s Ethical Review Committee approved the study protocol (#IRB14-1241) and informed consent was obtained from all caretakers prior to the enrolment.

Sleep studies

Children’s sleep was monitored using a digital polysomnography system (Polysmith; Nihon Kohden America Inc., CA, USA) [22]. Blood oxygen saturation (SpO2) recordings using pulse oximetry were acquired during overnight PSG (sampling frequency 25 Hz) and subsequently exported for offline processing.

All PSGs were manually scored by trained sleep technologists to derive the apnea-hypopnea index (AHI). The 2012 American Academy of Sleep Medicine scoring rules for children were used to quantify sleep and cardiorespiratory events [23]. In order to extensively assess the screening ability of the proposed classification models, we used several AHI cutoff values routinely used for establishing the diagnosis of OSAS [2, 8, 14], namely, 1, 3, and 5 events/h. Table 1 summarizes the main characteristics of the dataset.

Table 1 Demographic and clinical characteristics of the cohort using three different AHI cutoff values

Automated processing of oximetric recordings

Three signal processing stages were implemented: feature extraction, feature selection, and pattern recognition.

Feature extraction

The aim of this stage was to exhaustively characterize oximetry dynamics in every child. To conduct this critical initial step, several oximetric indices were computed:

  i. Desaturations due to apneic events modify the shape and thus the characteristics of the oximetric data distribution. Conventional first- to fourth-order statistical moments, i.e., mean (M1t), variance (M2t), skewness (M3t), and kurtosis (M4t), are suitable as well as simple measures able to parameterize these changes in the histogram of SpO2 amplitudes [14, 24].

  ii. The power spectral density (PSD) function is able to reflect the occurrence of (pseudo)periodic desaturations at each frequency [9, 14]. The PSD of each SpO2 signal was parameterized by means of first- to fourth-order statistical moments (M1f–M4f). In addition, the median frequency (MF) and spectral entropy (SE) were computed to quantify the degree of flatness of the power distribution [14]. Similarly, the total signal power (PT) as well as the peak amplitude (PA) and relative power (PR) in a common apnea frequency band for children (0.021–0.040 Hz) were computed to measure changes in the PSD function due to desaturations [14, 24, 25].

  iii. In the framework of long-term overnight oximetric recordings, non-linear measures have been widely shown to provide complementary information to conventional spectral techniques. Particularly, sample entropy (SampEn), central tendency measure (CTM), and Lempel-Ziv complexity (LZC) were computed due to their ability to quantify slight changes in irregularity, variability, and complexity of nocturnal SpO2 [14, 26].

  iv. Finally, the conventional 3% oxygen desaturation index (ODI3) was automatically scored and included in the initial feature space in order to quantify the number of drops per hour of recording greater than or equal to 3% from baseline [12].
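As an illustration of two of the feature families listed above, the following minimal Python sketch computes the four time-domain moments (M1t to M4t) and the sample entropy of a signal. It is not the implementation used in the study (which was written in MATLAB). The function names, the population (divide-by-n) form of the moments, the non-excess kurtosis convention, and the SampEn parameters (m = 1, tolerance r = 0.25 times the standard deviation) are assumptions made for this example only.

```python
import math
import random


def statistical_moments(x):
    """Return mean, variance, skewness and kurtosis of a sequence.

    Assumes a non-constant signal (variance > 0).
    """
    n = len(x)
    mean = sum(x) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3 for a Gaussian)
    return mean, m2, skewness, kurtosis


def sample_entropy(x, m=1, r_factor=0.25):
    """Sample entropy with Chebyshev distance; tolerance r = r_factor * SD.

    m and r_factor are illustrative defaults, not values taken from the paper.
    """
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    r = r_factor * sd

    def count_matches(length):
        # Count template pairs (i < j) whose pointwise distance stays <= r.
        # The same n - m templates are used for both lengths.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if all(abs(x[i + k] - x[j + k]) <= r for k in range(length)):
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches of length m + 1
    if a == 0 or b == 0:
        return float("inf")   # undefined when no matches are found
    return -math.log(a / b)


# Synthetic SpO2-like trace (illustrative values only, not patient data).
rng = random.Random(0)
spo2 = [97 + rng.gauss(0, 0.5) for _ in range(300)]
m1t, m2t, m3t, m4t = statistical_moments(spo2)
sampen = sample_entropy(spo2)
```

The quadratic-time SampEn loop is adequate for short examples. Full overnight recordings sampled at 25 Hz would call for a vectorized or windowed implementation.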

Feature selection

For each child, all 17 features derived from the oximetry signal were jointly used to characterize the presence of pediatric OSAS. This initial feature set gathers valuable information able to detect the disease. Nevertheless, an improved as well as reduced feature subset can be derived by applying a feature selection algorithm. Previous works have shown that feature selection and dimensionality reduction algorithms enhance the prediction ability of oximetric features in the context of OSAS diagnosis [24, 26, 27]. In this study, the fast correlation-based filter (FCBF) was applied. FCBF is a filter methodology for feature selection independent of the classifier used in the subsequent classification stage [28]. FCBF automatically inspects the original feature space to search for the most relevant as well as non-redundant variables. In order to guide this search, the symmetrical uncertainty (SU) is used [28]. SU is a measure of predictability between two variables based on the information shared between them; i.e., it quantifies the amount of knowledge we can infer about the first variable using the information present in the second one. The higher the SU is, the higher the predictability is, i.e., the information shared between variables. In the context of pediatric OSAS, we considered that a variable is more relevant for characterizing the disease if it shares as much information as possible with the AHI, which is the clinical index used by physicians to establish the presence of the disease and characterize its severity.
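For reference, the symmetrical uncertainty described above is usually written as follows, where H denotes the Shannon entropy. The notation is introduced here for illustration and does not appear in the original text. Continuous variables such as the oximetric features and the AHI would need to be discretized before the entropies can be estimated, and the discretization scheme is not specified here.

```latex
\[
  SU(X, Y) = 2\,\frac{H(X) - H(X \mid Y)}{H(X) + H(Y)},
  \qquad 0 \le SU(X, Y) \le 1,
\]
\[
  H(X) = -\sum_{x} p(x)\,\log_2 p(x), \qquad
  H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x \mid y)\,\log_2 p(x \mid y).
\]
```

The numerator is the information gain (mutual information) between X and Y. SU equals 0 when the variables are independent and 1 when either one fully determines the other.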

FCBF implements two consecutive filtering stages [28]: firstly, a relevance analysis and secondly, a redundancy-based variable selection. In order to measure the degree of relevance of every variable from the original feature space, the association between each oximetric feature and the severity of the disease was estimated: the SUi between each oximetric feature (Xi) and the actual AHI from standard PSG (Y) was computed. Subsequently, all the variables from the original feature space are ranked in decreasing order of SUi, i.e., in decreasing order of relevance for the problem under study. In this research, all the oximetric features were considered potentially relevant. Finally, the redundancy filtering stage is implemented as follows: (i) SUi,j is computed between each pair of features according to the previous relevance-based ranking so that feature i is ranked higher (more relevant) than feature j, and (ii) if SUi,j ≥ SUj, then feature j is removed because it is highly correlated with feature i and its valuable information linked with OSAS is smaller and can be derived from feature i. Therefore, the feature subset built by means of FCBF was composed of the most relevant and non-redundant variables from the proposed oximetric feature space.
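The two filtering stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. All variables are assumed to be already discretized. The function names and the dictionary-based interface are hypothetical. The redundancy test follows the usual statement of FCBF, in which a lower-ranked feature is dropped when its SU with an already retained feature is at least as large as its own SU with the target.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), bounded in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


def fcbf(features, target, threshold=0.0):
    """Return the names of relevant, non-redundant features.

    features: dict mapping a feature name to its discretized values.
    target: discretized target values (e.g., binned AHI).
    threshold: minimum relevance; 0.0 keeps every feature as a candidate.
    """
    # Stage 1: relevance analysis, ranking by SU with the target.
    relevance = {name: symmetrical_uncertainty(col, target)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] >= threshold),
                    key=lambda n: relevance[n], reverse=True)

    # Stage 2: redundancy analysis against the features already retained.
    selected = []
    for cand in ranked:
        redundant = any(
            symmetrical_uncertainty(features[kept], features[cand])
            >= relevance[cand]
            for kept in selected
        )
        if not redundant:
            selected.append(cand)
    return selected


# Toy example: "b" duplicates "a", whereas "c" carries independent information.
u = [0, 0, 1, 1, 0, 0, 1, 1]
v = [0, 1, 0, 1, 0, 1, 0, 1]
target = [2 * a + b for a, b in zip(u, v)]
selected = fcbf({"a": u, "b": list(u), "c": v}, target)
```

In the toy example, the duplicated feature is discarded by the redundancy stage, while the two features that contribute different information about the target are both retained.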

In order to increase the generalizability of our results, a bootstrap procedure was embedded within the feature selection stage [29]. Accordingly, several datasets were composed by means of resampling with replacement, so that the FCBF method was repeated 1000 times. At each iteration of the proposed bootstrap approach, the variables automatically selected by FCBF were saved. Finally, only those features selected markedly more often than the remaining ones (in > 90% of the bootstrap replicates) composed our optimum feature subset from oximetry.
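The bootstrap-embedded selection can be sketched as a simple counting loop. The following Python example is a minimal illustration, not the code used in the study. The `select` callable stands in for any feature selector (FCBF in this work), and its interface, the function name, and the toy selector are hypothetical.

```python
import random
from collections import Counter


def bootstrap_selection(n_samples, select, n_boot=1000, min_rate=0.9, seed=0):
    """Repeat a feature selector on bootstrap resamples and keep stable features.

    select: hypothetical callable that receives a list of row indices
            (sampled with replacement) and returns the selected feature names.
    Returns the features selected in more than min_rate of the replicates,
    together with the selection rate of every feature that was ever selected.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resampling with replacement: same size as the original dataset.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        counts.update(set(select(idx)))
    rates = {name: c / n_boot for name, c in counts.items()}
    stable = sorted(name for name, rate in rates.items() if rate > min_rate)
    return stable, rates


# Toy selector (illustrative only): "ODI3" is always chosen, whereas "MF"
# is chosen only when subject 0 happens to be part of the resample.
def toy_select(idx):
    picked = ["ODI3"]
    if 0 in idx:
        picked.append("MF")
    return picked


stable, rates = bootstrap_selection(176, toy_select)
```

A feature whose selection depends on a few individual subjects, like "MF" in the toy selector, falls well below the 90% threshold and is excluded from the final subset.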

Feature classification

In this comparative study, the performance of LDA, LR, and QDA binary classifiers is extensively assessed in the context of pediatric OSAS diagnosis. In order to decrease the model complexity and improve the performance, these widely known statistical classifiers assume that the probability density function of the classes under study (OSAS-negative vs. OSAS-positive) is known a priori [30]:

  i. LDA assumes both data normality (Gaussian or normal distribution) and homoscedasticity (equal variances) to model each class-conditional density function for an input feature pattern and each class. Under these assumptions, a linear decision boundary minimizes the classification error.

  ii. QDA does not presume homoscedasticity. Then, the Bayes classification rule that minimizes the classification error function establishes a quadratic decision boundary between classes in the feature space.

  iii. In a more general context, LR assumes a priori neither normality nor homoscedasticity of the probability distribution of the variables involved in the model. A binary LR classifier models the probability density function as a Bernoulli distribution.
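The three decision rules can be written in their usual textbook form as follows. The notation is introduced here for illustration and is not taken from the original text: x is the feature vector, k indexes the class, and the prior probability, mean vector, and covariance matrix of class k are denoted by pi_k, mu_k, and Sigma_k.

```latex
% QDA: class-specific covariance matrices give a quadratic discriminant.
\[
  \delta_k^{\mathrm{QDA}}(\mathbf{x}) =
    -\tfrac{1}{2}\ln\lvert\boldsymbol{\Sigma}_k\rvert
    -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\top}
      \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)
    + \ln \pi_k
\]
% LDA: a shared covariance matrix cancels the quadratic term in x.
\[
  \delta_k^{\mathrm{LDA}}(\mathbf{x}) =
    \mathbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k
    -\tfrac{1}{2}\boldsymbol{\mu}_k^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k
    + \ln \pi_k
\]
% LR: the posterior probability of the positive class is modeled directly.
\[
  P(y = 1 \mid \mathbf{x}) =
    \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x})\bigr)}
\]
```

For LDA and QDA, a pattern is assigned to the class with the largest discriminant value. For LR, a pattern is typically assigned to the positive class when the estimated probability exceeds 0.5, although the threshold actually used in this study is not stated in the text.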

Statistical analysis

MATLAB R2015a (The MathWorks Inc., Natick, MA, USA) was used to implement the proposed signal processing methods, as well as to perform both statistical and performance analyses. Descriptive analysis of variables was presented in terms of their median and interquartile range. The non-parametric Mann-Whitney U test was applied to search for statistical differences between groups, and p < 0.05 was considered statistically significant. A separate analysis was carried out to assess statistical differences between the three classification approaches under study. The Bonferroni correction was applied to manage multiple comparisons, and a p value < 0.05 was regarded as significant.

The common bootstrap 0.632 was applied for performance assessment [14, 19, 31]. Given an original dataset of size N, resampling with replacement is applied to build mi (1 ≤ i ≤ M) bootstrap replicates of size N. For each replicate, every oximetric pattern from the original dataset may be selected several times with equal probability (uniform distribution). Therefore, all bootstrap replicates mi will likely contain repeated instances, whereas a number of cases from the original dataset are not selected. According to bootstrap 0.632, the replicates mi are used for training, whereas instances not included in mi are used for validation. At each iteration i (1 ≤ i ≤ M), a performance metric is obtained as the weighted contribution of both the training replicate mi (0.368 times the metric in the training dataset) and its corresponding validation set (0.632 times the metric in the test dataset) [31]. Finally, each metric is estimated as the average of the M bootstrap values. The following diagnostic performance metrics were computed: sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR−), accuracy (Acc), and area under the receiver operating characteristics (ROC) curve (AUC). The AHI from in-lab PSG was the gold standard for evaluation. The 95% confidence interval (95% CI) was provided for each performance metric. In order to obtain a proper estimation of the 95% CI, the number of bootstrap replicates was set to M = 1000 [31].
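The validation scheme can be sketched as follows. This Python example is a minimal illustration under stated assumptions, not the MATLAB code used in the study. The function names, the `fit` and `predict` callables, and the toy threshold classifier are hypothetical. The weights follow the usual statement of the 0.632 estimator, in which the out-of-bag (test) metric is weighted by 0.632 and the training metric by 0.368. AUC and confidence intervals are omitted for brevity.

```python
import random


def diagnostic_metrics(y_true, y_pred):
    """Se, Sp, PPV, NPV, LR+, LR- and Acc for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    se, sp = ratio(tp, tp + fn), ratio(tn, tn + fp)
    return {
        "Se": se, "Sp": sp,
        "PPV": ratio(tp, tp + fp), "NPV": ratio(tn, tn + fn),
        "LR+": ratio(se, 1 - sp), "LR-": ratio(1 - se, sp),
        "Acc": ratio(tp + tn, len(pairs)),
    }


def bootstrap_632(X, y, fit, predict, n_boot=1000, w_test=0.632, seed=0):
    """Average over replicates of (1 - w_test) * train + w_test * out-of-bag."""
    rng = random.Random(seed)
    n = len(y)
    sums, used = {}, 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # training replicate
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]  # unseen instances
        if not oob:
            continue
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        train = diagnostic_metrics([y[i] for i in idx],
                                   predict(model, [X[i] for i in idx]))
        test = diagnostic_metrics([y[i] for i in oob],
                                  predict(model, [X[i] for i in oob]))
        for name in train:
            value = (1 - w_test) * train[name] + w_test * test[name]
            sums[name] = sums.get(name, 0.0) + value
        used += 1
    return {name: total / used for name, total in sums.items()}


# Hypothetical one-feature threshold classifier on synthetic data.
def fit(X, y):
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict(threshold, X):
    return [1 if x > threshold else 0 for x in X]


rng = random.Random(1)
y = [0] * 30 + [1] * 30
X = [rng.gauss(2.0 * t, 1.0) for t in y]
estimates = bootstrap_632(X, y, fit, predict, n_boot=200)
```

Because each replicate leaves out roughly 36.8% of the original instances on average, every iteration provides a validation set that the model has not seen during training.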

Results

Table 2 summarizes the main polysomnographic and oximetric characteristics of the cohort based on the proposed PSG-derived AHI cutoff values used for establishing the presence of OSAS. Table 3 shows the optimum features automatically selected from FCBF using the proposed bootstrap procedure. For each cutoff, LDA, QDA, and LR models were optimized in each training bootstrap replicate and subsequently assessed in the remaining validation instances according to bootstrap 0.632. Figure 1 shows the influence of each variable in every model for each diagnostic AHI cutoff. Axes represent the dimensions of the optimum feature space for each cutoff, whereas vertices of each polygon are proportional to the absolute value of each coefficient in the optimized models, i.e., the overall influence of each oximetric variable.

Table 2 Summary of the main polysomnographic and oximetric variables for the whole cohort using three defined AHI cutoffs for positive OSAS designation
Table 3 Optimum features automatically selected using FCBF and bootstrap as well as median values for each AHI cutoff point under study
Fig. 1 Optimum models for each AHI-derived diagnostic cutoff for OSAS: a AHI ≥ 1 event/h, b AHI ≥ 3 events/h, and c AHI ≥ 5 events/h. For each statistical multivariate model, polygon vertices are proportional to the influence (weight) of each oximetric variable. For the sake of better representation within the same figure, the magnitudes of the coefficients are proportional to their influence within the same model but not between models

Table 4 summarizes the performance metrics of the models involved in this comparative study for all the AHI cutoffs. Using an AHI ≥ 1 event/h for OSAS, LR achieved the highest diagnostic accuracy of the three models, with 84.3% Acc and a highly unbalanced sensitivity-specificity pair (93.9% Se, 37.8% Sp). Similarly, for an AHI cutoff of 3 events/h, the LR model achieved 77.7% Acc (74.6% Se, 81.7% Sp). Finally, using the AHI cutoff of 5 events/h, both LR and QDA reached similar high accuracy, with 82.7% Acc (70.0% Se, 91.4% Sp) and 82.1% Acc (62.3% Se, 95.5% Sp), respectively. It is noteworthy that QDA achieved significantly high PPV and LR+ values regardless of the cutoff point for OSAS. Regarding the performance of the conventional ODI3, our analyses showed a significant imbalance between sensitivity and specificity using fixed cutoffs for the ODI3 (≥ 1, 3, and 5 desaturations per hour of recording) for all the clinical thresholds under study. In this regard, the higher specificity commonly reported in the literature was also confirmed.

Table 4 Performance metrics of each optimum statistical model and the ODI3 from oximetry for each AHI cutoff value for positive OSAS

Figure 2 depicts the average ROC curves for all the models under study from the bootstrap approach. We can observe that differences between the approaches in this comparative study decrease as the cutoff for positive pediatric OSAS increases. It is important to note that the QDA model reached the highest AUC using the most restrictive clinical threshold for mild OSAS (AHI ≥ 1 event/h), achieving an area of 0.91. For a cutoff of 3 events/h, both LR and ODI3 achieved an AUC equal to 0.88. Considering a cutoff of 5 events/h for moderate-to-severe childhood OSAS, LR reached the highest AUC (0.89), similar to QDA (0.88) and ODI3 (0.87). A statistical analysis was carried out to assess differences between the ROC curves of the four classification approaches. Using a clinical threshold of 1 event/h for positive OSAS, statistically significant differences (p < 0.05) were found between LDA and all the remaining classifiers (QDA, LR, and ODI3), suggesting a real dependence on the classification model. In addition, visual differences observed between ROC curves of QDA and ODI3 led to slight differences (p < 0.05 and p = 0.21 before and after the Bonferroni correction, respectively). Regarding the cutoffs of 3 and 5 events/h, no significant differences were found either between the models or with the conventional ODI3.

Fig. 2 ROC curves of each optimum statistical model from oximetry and the ODI3 from the bootstrap procedure for each cutoff for positive OSAS: a AHI ≥ 1 event/h, b AHI ≥ 3 events/h, and c AHI ≥ 5 events/h

Discussion

This prospective study provides a comparative assessment of the most frequently employed analytical techniques, namely, LDA and LR, for determination of suitability of overnight oximetry in the context of OSAS screening in children. In addition, an alternative approach in the framework of statistical classifiers, the QDA binary classifier, was also evaluated due to its potential advantages when compared to LDA or LR. Our results provide additional and important support to the conceptual framework that automated overnight oximetry is a consistent, unbiased, and effective method as an abbreviated screening tool for pediatric OSAS [32], and further confirm the validity of a recent multicenter study that examined 4191 overnight oximetry recordings [15].

Conventional statistical classifiers are commonly proposed as automated tools to assist in the detection of both adult and pediatric OSAS. They allow clinicians to combine different sources of medical information, and they have demonstrated their usefulness to maximize the diagnostic ability of oximetry by merging several oximetric features. In the present study, a thorough assessment of the diagnostic performance of LDA, QDA, and LR has been performed. We found that differences among these classifiers decreased when the clinical threshold for positive childhood OSAS increased. LR showed a more stable behavior than LDA and QDA. Moreover, LR reached the highest diagnostic accuracy for all the thresholds under study. Nevertheless, QDA stands out for reaching PPV values greater than 90% regardless of the cutoff, which increases its usefulness as a screening test for pediatric OSAS in clinical practice. In fact, oximetry is commonly proposed as a simplified screening test for OSAS due to its simplicity and availability as a portable recording technique at home. Therefore, despite the higher overall performance of LR, QDA showed the best screening capability, particularly for a cutoff of AHI ≥ 1 event/h, where a PPV of 96.5% was reached while maintaining an AUC of 0.91 and a fair sensitivity-specificity pair. Regarding common indexes from oximetry, statistical binary classifiers outperformed the conventional ODI3 in terms of diagnostic accuracy. When using common cutoffs for classification (ODI3 ≥ 1, 3, and 5 events/h), the ODI3 reached a highly unbalanced sensitivity-specificity pair leading to poor accuracy. Furthermore, regarding the ROC curves, the ODI3 showed higher instability when varying the cutoff, which limits its usefulness in clinical practice.

Some methodological issues deserve comment. First, all the variables composing the various models were automatically selected using unbiased data processing and all of the measures were identified regardless of the classification technique, a feature that expands their generalizability. In the present comparative study, M1t, M4t, PA, PR, SampEn, and ODI3 demonstrated unique robustness in characterizing OSAS because they were selected as optimum features a number of times significantly higher (> 90% of bootstrap replicates) than the remaining variables (< 50% of bootstrap replicates). As shown in Fig. 1, the influence of each oximetric variable in every optimum model illustrates this important attribute of machine classification of the retained measures. When using them jointly, automated multivariate classifiers are able to maximize the diagnostic capability of oximetry. On the other hand, we can observe that M1t and M4t did not show a completely stable behavior; i.e., the sign of their weights (the model coefficients) changed as the clinical cutoff for the disease varied. The remaining optimum features (PA, PR, SampEn, and ODI3) showed different weight and sign for the different statistical approaches but a stable behavior as the cutoff changed. It is also important to note that PA and ODI3 showed the largest weights regardless of the model and the cutoff, which highlights their significance in the characterization of oximetry in the context of pediatric OSAS.

In regard to the physiological and clinical interpretation of the optimum features, it is noticeable that all signal processing approaches were included in the optimum feature subset, i.e., time and frequency domains as well as linear and non-linear methods. Similarly, advanced signal processing methods demonstrated their complementarity with ODI3. Conventional ODIs just account for the number of the desaturations along the overnight recording. Nevertheless, M1t, M4t, PA, PR, and SampEn provide additional information linked with the depth, duration, and repetition of the desaturations. M1t accounts for the overall influence of the desaturations in the average saturation level while M4t quantifies how SpO2 values are distributed in a narrow (higher M4t) or in a broader (lower M4t) range according to a higher severity and number of desaturations. Similarly, both PA and PR account for the pseudo-periodicity and depth of the desaturations along the overnight recording so that greater and repetitive desaturations (higher PA and PR) are linked with a higher severity of the disease. Finally, SampEn quantifies the entropy, i.e., the disorder, in the overnight saturation profile due to changes caused by the desaturations so that the higher the severity of the disease is, the higher the irregularity is (higher SampEn).

Changes in the dynamics of oximetric profiles due to apneic events are smaller in children than in adults, thereby making oximetry-based screening tools for pediatric OSAS particularly challenging. The best screening ability emerged when using the cutoff of 1 event/h for detection of mild-to-severe OSAS. QDA and LR models showed higher generalization capability than LDA. Common assumptions such as normal distribution, homoscedasticity, linearity, or stationarity are not always consistent in real-world pattern classification problems. Therefore, our results suggest that more general modeling approaches such as LR and QDA, which both assume less restrictive conditions in the data distribution and implement a non-linear combination of input-dependent features, perform better than conventional LDA in the context of detection of childhood OSAS from oximetry. Notice that QDA reached excellent AUC (> 0.90) for the lowest cutoff. Furthermore, a real dependence on the classification model was found.

Both dependence and intercorrelation of the input variables affect the way coefficients are fitted in statistical models. Intercorrelation may lead to significant coefficients with high values even though the models may not properly fit the problem under study. FCBF is a feature selection methodology able to minimize this issue since it removes redundant features before the classification stage, i.e., the variables with likely high intercorrelation. However, the selected non-redundant features could still show high intercorrelation. Regarding our optimum feature subset from oximetry, we observed that M1t, PA, SampEn, and ODI3 were significantly correlated (p < 0.05), whereas M4t and PR showed no significant correlations. This issue could affect the consistency and generalization of the models. Similarly, dependence of oximetric features should also be considered because they are all derived from the same signal and common statistical classifiers assume independent input variables.

Overfitting is also an important problem in the framework of automated pattern recognition, particularly when the size of the training dataset is small. In addition, some classifiers are more affected by overfitting than others. Particularly, QDA is more flexible in fitting the classes on the training data due to its quadratic characteristic, which increases the risk of overfitting. In this study, we compared the average accuracy computed in the training and in the test bootstrap replicates in order to assess whether the performance of the proposed statistical models was affected by overfitting. We observed that differences between training and test values were lower than 5% regardless of the model and the cutoff. This slight difference fits with the common expected decrease between training and test stages, suggesting that there is no overfitting affecting our results. In addition, bootstrap 0.632 accounts for both the training and the test contributions when computing every performance metric, which minimizes these differences.

It is important to assess whether statistical models provided additional value compared to the simple ODI3. According to our results, differences in performance between the automated statistical models and the conventional ODI3 were maximal for AHI ≥ 1 event/h. In addition, ROC curves of LDA, QDA, and LR were smooth while the curves for the ODI3 showed marked changes in the slope regardless of the clinical diagnostic threshold for the disease. This characteristic reveals higher irregularity of ODI3 when the cutoff for classification changes. Using standard classification cutoffs for the ODI, i.e., ODI3 ≥ 1, 3, and 5 desaturations per hour of recording, this conventional index reached a diagnostic accuracy significantly lower than the statistical models, as well as highly unbalanced sensitivity-specificity pairs (Se < 10%). On the other hand, QDA reached PPV > 90% and LR+ > 5 with a fair sensitivity and specificity balance (Se around 60%). Both characteristics are essential for screening purposes, which is probably the main advantage of statistical models over ODI3. Therefore, our results suggest that automated modeling of overnight oximetry by means of QDA provides further and relevant information on the dynamics of oximetry compared to the conventional ODI3, particularly when AHI ≥ 1 event/h is used as the clinical cutoff for the disease.

As previously mentioned, the QDA model reached notably high PPV and LR+ values regardless of the cutoff for positive OSAS. Particularly, 96.5% PPV and 8.77 LR+ were reached using a threshold of 1 event/h. This agrees with previous studies reporting that oximetry is able to achieve high PPV values for the detection of an AHI ≥ 1 event/h [14, 32]. Most sleep laboratories use a clinical threshold of 1 event per hour during interpretation of PSG [2], which increases the usefulness of these automated techniques for screening purposes. Inspecting false positive cases using QDA, we observed that two no-OSAS (actual AHI < 1 event/h) children were misclassified as OSAS-positive a number of times notably higher than the remaining patients throughout the 1000 iterations of the bootstrap algorithm. One of these children (actual AHI and ODI from PSG equal to 0 events/h) showed mean SpO2 significantly lower (92%) than the average for the OSAS-negative group (97%), a feature characteristic of severe OSAS. The other one (actual AHI and ODI from PSG equal to 0.4 and 0 events/h, respectively) showed no sleep time with a saturation in the range of 90–100%, which could suggest that other factors may be specifically detracting from the validity of the classifiers in these children, e.g., underlying parenchymal lung disease or skin color artifacts in oximetry readouts. Considering a cutoff of 3 events/h, three no-OSAS children (actual AHI < 3 events/h) were significantly more frequently misclassified as OSAS-positive by the oximetry-based QDA model throughout the bootstrap samples. Two of them (actual AHI from PSG equal to 1.3 and 1.8 events/h) showed an ODI greater than 13 events/h and lowest SpO2 significantly lower (74 and 56%, respectively) than the average for the OSAS-negative group (91%), while the other one was borderline (actual AHI equal to 2.98 events/h) and also showed significantly higher ODI (7.1 events/h) and lower minimum SpO2 (81%) than the average for the no-OSAS population. Similarly, three children were wrongly classified as suffering from the disease by the QDA model using the cutoff of 5 events/h for positive OSAS. Two of them were also misclassified using a clinical threshold of 3 events/h (subjects with actual AHI equal to 1.3 and 1.8 events/h, respectively), whereas the other one (actual AHI equal to 4.0 events/h) showed an ODI of 7.6 events/h and minimum SpO2 equal to 78%, both of which are values characteristic of OSAS-positive children.

Table 5 shows previous studies using these statistical classifiers in the context of childhood OSAS diagnosis using pulse oximetry. LDA [9, 16,17,18] and LR [10, 14, 19, 20] were predominantly used. Nevertheless, to the best of our knowledge, there are no proper and extensive comparisons among these a priori effective techniques in the framework of automated pediatric OSAS detection. In the study by Gil et al. [16], photoplethysmographic and pulse transit time recordings were parameterized and LDA was applied. An accuracy of 80% was reached using leave-one-out cross-validation. In a later similar study by the same group [17], the performance increased up to 86.7% using the pulse rate signal from oximetry to compute the pulse transit time. Garde et al. [9] analyzed SpO2 and pulse rate portable recordings. LDA achieved 84.9% accuracy using 4-fold cross-validation. In an epoch-based classification scheme, Cohen and De Chazal [18] built two LDA models using features from SpO2 and ECG. A maximum accuracy of 74.7% was reached using a leave-one-out cross-validation approach.

Table 5 Summary of the studies using statistical binary classifiers from overnight oximetry to assist in the detection of childhood OSAS

Regarding LR, Chang et al. [10] reported 71.6% accuracy combining questionnaire-based variables and conventional oximetric indices. The model was optimized and validated using the same population. Similarly, Wu et al. [20] built an LR model with clinical history, demographic, and other relevant variables, reaching 78.2% accuracy in an independent test set. In a recent study by our group [14], the accuracy of different LR models from unattended oximetry ranged from 82.8 to 85.5% in a bootstrapping validation scheme. Similarly, an LR model composed of spectral features from airflow portable recordings and the ODI3 achieved 86.3% accuracy [19]. In the present study, our analyses further validate the efficacy of oximetry-based classification models aimed at screening for OSAS in children and attest to their robustness and high predictive ability.

Despite the favorable screening capability of the proposed oximetry-based models, some limitations should be taken into account. The cohort under study should be expanded to improve the generalizability of our results, for example by including different types of oximeters, and to enable comparisons across diverse populations, such as children with craniofacial or genetic syndromes, as well as infants. However, when compared with similar previous studies in the context of pediatric OSAS, the cohort included herein was quite extensive, and an appropriate bootstrap approach was implemented for independent optimization and further validation of the models using several datasets. Similarly, a larger dataset would lead to more balanced OSAS-negative and OSAS-positive groups and thus to more accurate parameter estimation.

Our results suggest that oximetry is able to provide relevant and useful information in the context of pediatric OSAS. On the other hand, using the proposed oximetric features, differences among the statistical classifiers under study decreased as the cutoff for the disease increased. Therefore, additional features from signal processing methods as well as advanced pattern recognition techniques, such as ensemble learning, or more complex approaches for classification, such as deep neural networks, would be needed to maximize the screening ability of oximetry in moderate-to-severe cases. In addition, it is understood that the ultimate goal of simplified tools such as overnight oximetry is to carry out abbreviated screening tests at home. Therefore, the validity of the proposed models needs to be reproduced in an unattended setting. Finally, despite its usefulness for the screening of mild (96.5% PPV for AHI ≥ 1 event/h) and moderate-to-severe (90.8% PPV for AHI ≥ 5 events/h) childhood OSAS, there is still room for improvement regarding the effectiveness of oximetry to classify no-OSAS children (AHI < 1 event/h). Addressing such limitations will likely require more than a single-step modeling approach and the incorporation of additional screening stages, whereby a fail/pass no-OSAS decision by the proposed oximetry-based model undergoes a second scrutiny using an alternative classifier. If we want to screen for no-OSAS children to minimize the number of complete in-lab PSGs, an alternative classifier aimed at maximizing the negative predictive value should be implemented. In this regard, as oximetry is characterized by a high specificity, other sources of information would be needed, such as history and clinical data or additional biomedical recordings, e.g., airflow or pulse rate. Alternatively, additional useful information could be derived from the same sensor so that the simplicity of our proposal as an abbreviated screening test for pediatric OSAS is not compromised. Pulse oximetry devices are able to provide information from both oxygen saturation and heart rate. In the present study, we used just the SpO2 signal. Nevertheless, previous researchers demonstrated the usefulness of the pulse rate derived from the photoplethysmographic signal to derive relevant information about the presence of pediatric OSAS [9, 16, 17]. Therefore, features derived from different signals but acquired using a single device could be used to improve the performance of automated classifiers without increasing the intrusiveness of the proposal for children.
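
The predictive values and the two-stage screening idea discussed above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the record fields, and the two placeholder decision rules are hypothetical, and the counts in the usage lines are invented solely to show the arithmetic. The sketch assumes that a positive first-stage (oximetry-based) result is accepted directly, whereas a negative result is passed to a second classifier tuned for a high negative predictive value.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, float]


def predictive_values(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (PPV, NPV) computed from confusion-matrix counts."""
    # PPV: fraction of positive predictions that are true positives.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # NPV: fraction of negative predictions that are true negatives.
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


def two_stage_screen(
    record: Record,
    first_stage: Callable[[Record], bool],
    second_stage: Callable[[Record], bool],
) -> str:
    """Hypothetical two-stage decision: only first-stage negatives are re-examined."""
    if first_stage(record):
        # A high-PPV first stage: a positive result is accepted as is.
        return "OSAS"
    # A first-stage negative undergoes a second scrutiny.
    return "OSAS" if second_stage(record) else "no-OSAS"


# Usage with invented counts (not results from this study).
ppv, npv = predictive_values(tp=90, fp=10, tn=20, fn=30)

# Placeholder rules for illustration only; they are not clinical criteria.
oximetry_rule = lambda r: r["oximetry_score"] >= 0.5
second_rule = lambda r: r["clinical_score"] >= 0.5

decision = two_stage_screen(
    {"oximetry_score": 0.2, "clinical_score": 0.7}, oximetry_rule, second_rule
)
```

In this arrangement the second classifier is evaluated only for children the first stage labels as negative, so the simplicity of the first-stage oximetry test is preserved for clearly positive cases.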

In summary, automated analysis of SpO2 recordings from nocturnal oximetry emerges as a simple and reliable alternative to complete PSG in the screening of children with high pre-test probability of OSAS. Our results provide additional insights into the effectiveness and limitations of statistical modeling of overnight SpO2 recordings as an abbreviated screening tool for childhood OSAS.