Introduction

Obstructive sleep apnea (OSA) is a chronic sleep-related breathing disorder characterized by recurrent, transient apneas or hypopneas during sleep caused by intermittent narrowing or collapse of the upper airway. Patients with OSA have frequent sleep disruption resulting in unrefreshing sleep, daytime sleepiness, fatigue, and impaired concentration and memory [1]. OSA increases the risk of motor vehicle accidents [2], hypertension, ischemic heart disease, heart failure, arrhythmias, and stroke [3]. OSA has been associated with a two- to sixfold increase in the risk of all-cause mortality [4, 5]. The prevalence of OSA is increasing, currently affecting 13% of men and 6% of women between ages 30 and 70 years [6]. OSA remains highly underdiagnosed because of lack of awareness and limited access to testing [7].

Patients with symptoms suggestive of OSA are usually referred for a diagnostic sleep study and a clinical assessment by a qualified sleep specialist [8]. The reference standard test used to diagnose OSA is overnight polysomnography (PSG) conducted in a sleep laboratory, supervised by a qualified sleep technician [9, 10]. The PSG reports on several physiologic parameters captured through seven or more recording channels [11]. The main diagnostic parameter calculated based on PSG is the apnea-hypopnea index (AHI), i.e., the average number of apneas and hypopneas per hour of sleep [10]. The diagnosis of OSA is made if the AHI is ≥ 5 events/h for patients reporting symptoms (e.g., daytime sleepiness, snoring) or ≥ 15 events/h, regardless of symptoms [8, 9].
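The AHI definition and the two diagnostic thresholds described above can be illustrated with a minimal sketch. The function names and the patient values are hypothetical and serve only to show the arithmetic; in practice the diagnosis also depends on the clinical assessment.

```python
def apnea_hypopnea_index(n_apneas: int, n_hypopneas: int, sleep_hours: float) -> float:
    """Average number of apneas plus hypopneas per hour of sleep."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return (n_apneas + n_hypopneas) / sleep_hours


def meets_osa_threshold(ahi: float, has_symptoms: bool) -> bool:
    """AHI >= 15 events/h regardless of symptoms, or AHI >= 5 events/h with symptoms."""
    return ahi >= 15 or (ahi >= 5 and has_symptoms)


# Hypothetical patient: 30 apneas and 40 hypopneas over 7 hours of sleep.
ahi = apnea_hypopnea_index(30, 40, 7.0)  # 70 events / 7 h = 10 events/h
print(ahi, meets_osa_threshold(ahi, has_symptoms=True))
```

With an AHI of 10 events/h, this hypothetical patient meets the threshold only if symptoms are reported.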

Diagnostic sleep studies can be performed at home as well. The sleep monitors are classified as type I–IV where PSG is a type I device for in-laboratory testing, and type II–IV are portable sleep monitors for home sleep apnea testing (HSAT) [11]. According to the American Academy of Sleep Medicine (AASM) criteria, a type II portable monitor (PM) is a full unattended portable PSG (≥ 7 channels), type III monitors have four to seven channels, and type IV monitors have one to two channels with one of them being oximetry [11]. A major limitation of most PMs is their inability to distinguish between sleep and wake periods; as a result, they report the number of apneas and hypopneas per hour of recording time rather than sleep time, a parameter also known as the respiratory event index (REI) [12]. Since REI tends to underestimate the “true” AHI, the current AASM guideline recommends performing a confirmatory PSG in patients with a negative HSAT [12]. The guideline supports the use of PSG or HSAT with a “technically adequate device” in uncomplicated patients with moderate to high risk of OSA and only PSG for those with significant comorbidities [12].
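The reason REI tends to fall below AHI can be seen in a short worked example. The sketch below assumes, purely for illustration, that both tests detect the same number of respiratory events and differ only in the denominator; the event count and durations are hypothetical.

```python
def events_per_hour(n_events: int, hours: float) -> float:
    """Respiratory events divided by the chosen time denominator."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return n_events / hours


# Hypothetical night: 60 events, 8 h of recording, of which 6 h were spent asleep.
n_events = 60
total_sleep_time_h = 6.0
total_recording_time_h = 8.0

ahi = events_per_hour(n_events, total_sleep_time_h)      # denominator: sleep time
rei = events_per_hour(n_events, total_recording_time_h)  # denominator: recording time
print(ahi, rei)
```

Because recording time includes periods of wakefulness, the same event count is spread over a longer denominator, so REI cannot exceed AHI under this assumption.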

HSAT generally offers a more patient-centered approach by permitting simplified sleep testing in a more familiar and comfortable setting, at lower costs and with shorter wait times than PSG [13]. These factors may support broader testing for OSA, as a means of expanding the diagnosis of subclinical disease and addressing the population health burden of OSA. Evidence on diagnostic accuracy of HSAT continues to accumulate. Commissioned by the Agency for Healthcare Research and Quality (AHRQ), the Tufts Evidence-based Practice Center conducted the most comprehensive comparative effectiveness review of existing diagnostic and treatment modalities for OSA covering the period up to September 2010 [14]. The aim of this project was to update this systematic review with a specific focus on evaluating the diagnostic ability of type IV PMs compared to PSG in patients with suspected OSA.

Methods

The protocol of this systematic review was registered in the International Prospective Register of Systematic Reviews (PROSPERO) database on April 20, 2016 (registration number: CRD42016037470). The study reporting followed the Preferred Reporting Items in Systematic Reviews and Meta-Analyses (PRISMA) guideline [15].

Study selection criteria

Population

Our systematic review targeted studies that included patients who were at least 16 years old with symptoms suggestive of OSA. Studies where more than 20% of the study population had any of the following were excluded: a neuromuscular disease (e.g., multiple sclerosis, muscular dystrophy), Down syndrome, Prader-Willi syndrome, major congenital skeletal abnormalities, narcolepsy, narcotic addiction, Alzheimer’s disease, epilepsy, or had experienced a disabling stroke. Studies that included only general population or those with established sleep apnea or other sleep disorders were excluded.

Intervention and comparator

The interventions reviewed included type IV PMs applied at home or in a sleep laboratory for diagnosing OSA. The comparator of interest was overnight PSG conducted in a sleep laboratory. For consistency, we classified types of sleep monitors following the rules applied in the previous systematic review [14]. Based on the classification used in the latter publication, type III monitors have ≥ 4 channels, including at least two respiratory channels (two airflow or one airflow and one effort channel), but cannot differentiate between sleep and wake or measure arousals. Type IV PMs include devices that do not meet criteria for type III monitors. Studies with single-channel PMs that used heart rate, heart rate variability, or actigraphy, and those that used clinical features (e.g., neck circumference, body mass index) as additional predictive factors for diagnosis of OSA were excluded. Studies with type II or III monitors were also excluded.

Outcomes

We included studies that reported at least one of the following measures for diagnostic performance: sensitivity, specificity, area under the receiver operating characteristic (ROC) curve, and Bland-Altman analysis of concordance (mean of the differences (i.e., bias) and levels of agreement) [16] when comparing clinical diagnosis based on the sleep test, AHI, REI, or respiratory disturbance index (RDI). Since these parameters are not defined consistently in literature [14], for each evaluated study, we extracted and reported the definitions used by the authors.

Studies

We included cross-sectional and prospective studies that used experimental, quasi-experimental, or observational designs of any follow-up duration and excluded all other study types (e.g., case reports, case series, reviews, editorials or commentaries, clinical guidelines). We also excluded (1) animal studies, (2) non-English articles, (3) studies that had fewer than 10 study participants for each test, and (4) studies based on retrospective analysis of existing clinical databases.

Information sources

All eligible studies were identified through a systematic comprehensive search of the Ovid MEDLINE(R), Ovid MEDLINE(R) In-Process & Other Non-Indexed Citations, and Cochrane Library databases for the period from January 1, 2010 to May 10, 2016. The selection of this timeframe was justified by the availability of a prior systematic review that covered the time period up to September 2010 [14].

Search strategy

The search strategy was designed by an information specialist (JB), using an existing past review [14] as a guide. The search strategy included Medical Subject Headings terms and text-words in the following concept areas: sleep apnea, polysomnography and other diagnostic tests, general diagnostic accuracy terms, randomized controlled trials, and other specified designs (see Online Resource 1 for detailed search strategy). Duplicates were removed at the database level and at the citation manager level. In addition, we hand-searched the reference lists of full-text articles under review.

Selection of studies and data extraction

All stages of the review (review of titles and abstracts, review of full texts, data abstraction, and assessment of quality) were conducted independently by groups of two reviewers (LA, PP, SC, SMC, VR, YS) and compared. Disagreements were resolved by consensus. Reasons for exclusions of full-text articles were recorded. For the full-text articles included in the final review, we extracted the following information into an Excel database: study characteristics (e.g., country, design), participant characteristics (e.g., inclusion and exclusion criteria, age, gender, OSA severity), details on compared sleep monitors (e.g., name, number, and type of channels), and estimates of their diagnostic accuracy. When available, we extracted study specific criteria set by authors to qualify a sleep study (PM and/or PSG) as valid and appropriate for analysis.

Assessment of quality of studies

The quality of studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool [17]. The tool assesses study quality in four domains including patient selection, index and reference tests, and flow and timing for the risk of bias (ROB) and applicability.

Data synthesis

For the studies included in the final review, we descriptively presented the study, patient, and device characteristics as well as the results from the diagnostic accuracy testing. In cases where the authors published the Bland-Altman plots but did not provide the corresponding numerical values (mean of differences, limits of agreement), we used the free Plot Digitizer software program to extract them from the plot. In addition, for studies that calculated the bias and 95% limits of agreement estimates as “PSG AHI/RDI minus PM AHI/RDI,” we reversed the values to standardize reporting of all these estimates as “PM AHI/RDI minus PSG AHI/RDI”.
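The Bland-Altman quantities and the sign reversal described above can be sketched as follows. This is an illustration only: it uses the conventional definition of 95% limits of agreement (bias ± 1.96 standard deviations of the paired differences), which individual studies may have implemented differently, and the paired AHI values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(pm: list[float], psg: list[float]) -> tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired differences PM minus PSG."""
    if len(pm) != len(psg) or len(pm) < 2:
        raise ValueError("need at least two paired measurements of equal length")
    diffs = [a - b for a, b in zip(pm, psg)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - half_width, bias + half_width


def reverse_direction(bias: float, lower: float, upper: float) -> tuple[float, float, float]:
    """Convert 'PSG minus PM' estimates to 'PM minus PSG'.

    Every value changes sign, so the old upper limit becomes the new lower limit.
    """
    return -bias, -upper, -lower


# Hypothetical paired AHI values (events/h) for four patients.
pm_ahi = [4.0, 12.0, 20.0, 33.0]
psg_ahi = [6.0, 15.0, 18.0, 40.0]

bias, lower, upper = bland_altman(pm_ahi, psg_ahi)
print(bias, lower, upper)
```

A negative bias in the standardized direction means that the PM reported fewer events per hour than PSG on average.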

We did not apply indirect comparisons between different PMs considering variability in their structures (i.e., channels) and population studied. Instead, we conducted a separate meta-analysis for each PM versus PSG comparison to obtain summary estimates on sensitivity and specificity. For this purpose, we used bivariate random-effects models that consider both the within-study variability in sensitivity and specificity and the correlation between these two measures [18]. PMs were selected for meta-analysis if they had been tested in at least four studies [18] conducted in the same setting (in laboratory or at home) using similar AHI/RDI cutoffs and if the authors provided sufficient details to extract or calculate the number of patients with true positive, false positive, true negative, and false negative test results.
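The per-study inputs to such a meta-analysis are the four cells of the 2 × 2 table. The sketch below shows only how sensitivity and specificity follow from those counts; it does not implement the bivariate random-effects model itself, and the counts are hypothetical.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of PSG-positive patients whom the PM also classifies as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of PSG-negative patients whom the PM also classifies as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical study: 50 patients above and 50 below the AHI cutoff on PSG.
tp, fn, tn, fp = 44, 6, 32, 18

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(round(sens, 2), round(spec, 2))
```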

Results

Our search resulted in 6647 MEDLINE and 780 Cochrane records or 6054 total records after removing duplicates (Fig. 1). After screening titles and abstracts, 5939 records were excluded. The full texts of the remaining 115 abstracts were retrieved for more detailed evaluation. Two more potential full-text articles were identified at this stage, one through a review of reference lists and another after contacting an author for a study-related question. After review of full texts, we excluded 93 articles with the most common reasons being not containing an analysis of interest (n = 30), investigating a type III device (n = 26), or not having the population of interest (n = 16). The final review included 24 prospective studies.
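The record counts reported above can be cross-checked with simple arithmetic. The variable names are illustrative; all numbers are taken from the paragraph above.

```python
# Identification
medline_records = 6647
cochrane_records = 780
records_after_deduplication = 6054
duplicates_removed = medline_records + cochrane_records - records_after_deduplication

# Screening of titles and abstracts
excluded_at_screening = 5939
full_texts_from_search = records_after_deduplication - excluded_at_screening

# Full-text review, including two articles found outside the database search
additional_full_texts = 2
excluded_at_full_text = 93
included_studies = full_texts_from_search + additional_full_texts - excluded_at_full_text

print(duplicates_removed, full_texts_from_search, included_studies)
```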

Fig. 1 Flow diagram of selection of studies

Characteristics of included studies

The studies were conducted in 12 different countries including six in the USA; three each in Argentina and Australia; two each in Canada, China, and Japan; and one each in France, Germany, Ireland, Republic of Korea, Saudi Arabia, and Turkey (Table 1). All studies used a cross-sectional design to test PMs against PSG with three studies applying a random order when testing in-laboratory PSG against home PM [26, 36, 39]. Among the 24 studies, the mean age of participants varied from 40.9 to 64.6 years, the proportion of males from 24.0 to 88.4%, and the mean BMI from 25.5 to 36.3 kg/m². The type and number of patients with comorbidities were not reported in 12 studies, although five of them excluded patients with serious comorbidities. The remaining 12 studies reported several comorbidities in the patient population including hypertension (20.3 to 55% of the patient population), ischemic heart disease (7–50%), diabetes (5–30%), and asthma (6–16%). Patients had a high pre-test probability of OSA; the mean AHI ranged from 8 to 42.7 events per hour of sleep.

Table 1 Characteristics of included studies

Overall, the 24 studies evaluated 10 different type IV PMs including (i) single-channel devices such as BresoDx [19, 20], ApneaLink [22, 25, 29, 30, 32, 34], SD-101 [27, 38], Flow Wizard [35, 36], SleepMinder [41], and oximetry [21, 23, 33, 37]; (ii) two-channel devices such as ApneaLink Ox [31, 39] and SleepView [42]; and (iii) four-channel devices such as WatchPAT 100 [24] and WatchPAT 200 [26, 28, 40] (Table 1).

Quality of included studies

The results of the quality assessment of these studies using QUADAS-2 are presented in Fig. 2. In all four domains, the proportion of studies with unclear ROB was quite large (38 to 50%), reflecting poor reporting practices. About 17% of the studies were evaluated as high ROB for the “index test” domain that jointly evaluates if the test interpretation was done without knowing the results of the reference test (PSG) and if the thresholds for analyses were pre-specified. In terms of applicability, 17 studies (71%) were scored as high risk because they tested PMs only in a sleep laboratory setting.

Fig. 2 Quality of included studies based on QUADAS-2

Overall, blinding of PSG results when interpreting PM results (and vice versa) was applied only in 13 studies (54.2%) and not reported in the remainder (see Online Resource 2). Criteria for a good quality sleep study were defined in 14 studies (58.3%), and the proportion of patients excluded from the final analysis due to technical failure or other errors varied from zero to 25%.

Diagnostic accuracy of type IV monitors

Table 2 presents the results of the diagnostic accuracy assessments in the included studies. Out of 24 studies, six compared the performance of PM both in laboratory (simultaneously with PSG) and in home settings [20, 25, 26, 32, 34, 39]. One compared in-lab PSG with in-home PM [36], and the remaining 17 studies compared in-lab PM and PSG (simultaneously) [19, 21–24, 27–31, 33, 35, 37, 38, 40–42]. The sample size in these studies varied from 25 to 198 patients.

Table 2 Diagnostic accuracy of type IV PMs against PSG: evidence summary

One study did not report concordance analysis [38] and another did not report AHI/RDI values from the PM [33]. From the remaining 22 studies, 14 calculated the mean of differences as PM AHI/RDI minus PSG AHI/RDI and 8 as PSG AHI/RDI minus PM AHI/RDI (Table 2). After reversing the values from the latter 8 studies, the mean of the differences (bias) between the PM-measured AHI/RDI and PSG-measured AHI/RDI varied from −14.8 to 10.6 events/h with the lower and upper limits of agreement ranging from −66.0 to 78.8 events/h (Table 2). Among the five studies that tested the PM both at home and in the laboratory setting and reported bias estimates [25, 26, 32, 34, 39], the estimates did not differ substantially and the limits of agreement overlapped (Fig. 3).

Fig. 3 Mean difference between PM and PSG AHI/RDI (forest plot of Bland-Altman analysis). The plot shows the mean difference between PM AHI/RDI and PSG AHI/RDI (bias) and their 95% limits of agreement for each study. Asterisks (*) indicate the studies for which a digitizer program was used to extract values from published plots. Whenever studies reported AHI values both from manual and automatic scoring, only manual scoring results are shown here. AHI apnea hypopnea index, CI confidence interval, RDI respiratory disturbance index, PM portable monitor, PSG polysomnography

None of the studies compared clinical diagnosis of OSA informed by the PM or PSG. Most studies compared AHI/RDI measured as number of events during the total sleep time from PSG against the AHI/RDI measured as number of events over the total recording time from PM (Table 2). One study measured other indices [33], and another did not report a threshold analysis [28]. Most frequently, studies used AHI/RDI cutoffs of 5 and 15 events/h to report on diagnostic performance.

In studies that tested the performance of PMs in both settings (home and laboratory) [25, 26, 32, 34, 39], the sensitivity and specificity values were better from tests conducted in the sleep laboratory (simultaneously with PSG) than when conducted at home. Table 3 reports the sensitivity and specificity ranges for AHI/RDI cutoffs of 5 and 15 events/h for single-, two-, and four-channel PMs. The sensitivity at an AHI/RDI cutoff of 5 events/h ranged from 0.68 to 1.00 for single-channel PMs, from 0.77 to 0.93 for two-channel PMs, and from 0.96 to 1.00 for four-channel PMs. The sensitivity values decreased somewhat and the specificity values increased when moving the threshold from 5 to 15 events/h. For comparison purposes, Table 3 shows the results from the past systematic review [14].

Table 3 Sensitivity and specificity ranges of type IV PMs: current and past systematic review

Meta-analysis of PMs

Only the ApneaLink device was tested in ≥ 4 studies and qualified for the quantitative meta-analysis for summary diagnostic accuracy measures (Table 4). The mean estimates for sensitivity and specificity based on six studies of ApneaLink [22, 25, 29, 30, 32, 34] (all conducted in sleep laboratories) were 0.88 (95% confidence interval (CI) 0.82 to 0.92) and 0.64 (95% CI 0.52 to 0.75) for the AHI/RDI cutoff of 5 events/h and 0.82 (95% CI 0.69 to 0.90) and 0.88 (95% CI 0.83 to 0.91) for the AHI/RDI cutoff of 15 events/h. No heterogeneity was observed between the studies for either cutoff (I² = 0 and p value for the Q statistic > 0.05).
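For readers unfamiliar with the heterogeneity statistic, the sketch below shows the standard relationship between Cochran's Q and I², with negative values truncated at zero. How the statistic was computed in this review is not described, so this is an illustration of the general formula, and the Q values are hypothetical.

```python
def i_squared(q: float, n_studies: int) -> float:
    """I² as a percentage: share of total variability attributed to heterogeneity."""
    df = n_studies - 1  # degrees of freedom of the Q statistic
    if df <= 0:
        raise ValueError("at least two studies are required")
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical Q values for a meta-analysis of six studies (df = 5).
print(i_squared(4.2, 6))   # Q below its degrees of freedom, so I² is truncated to 0
print(i_squared(10.0, 6))  # Q above its degrees of freedom, so I² is positive
```

An I² of 0 therefore indicates that Q did not exceed its degrees of freedom, not that the studies were identical.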

Table 4 Meta-analysis of diagnostic accuracy of type IV PM ApneaLink (laboratory setting)

Discussion

The systematic review summarized evidence on diagnostic accuracy of type IV PMs for HSAT from English-language studies published from January 2010 to May 2016. In total, we found 24 studies evaluating 10 different types of portable devices against the current standard testing, PSG. The prior systematic review, which covered the time period up to September 2010, reported that a total of 23 unique type IV PMs had been evaluated against PSG [14]. Only one study was included in both the current and the past systematic review (because of the time overlap of the literature search) [34]. After completing our review, we can report that 28 unique type IV PMs have been compared against PSG so far.

The portable devices in the current review had one, two, or four channels, and their diagnostic accuracy varied by the type, number of channels, test setting, and AHI/RDI thresholds used for diagnosis. Only one third of studies tested PMs in the home setting. The mean difference between PM AHI/RDI and PSG AHI/RDI ranged from −14.8 to 10.6 events/h. At AHI ≥ 5 events/h, the sensitivity of type IV PMs ranged from 0.68 to 1 for single-, from 0.77 to 0.93 for two-, and from 0.96 to 1 for four-channel PMs, indicating some improvement in sensitivity as the number of channels increases. For the same threshold, the specificity of type IV PMs ranged from 0.43 to 0.97 for single-, from 0.83 to 0.92 for two-, and from 0.25 to 0.83 for four-channel PMs.

As expected, the prevalence of OSA was much higher in patients referred to sleep clinics for sleep studies than what has been reported in the general population [6]. Using an AHI cutoff of 5 events/h, the prevalence of OSA in the included studies ranged from 41.9 to 94.2% and from 16.0 to 83.3% when using a cutoff of 15 events/h. The studies were conducted in 12 different countries and included patient populations that varied in several characteristics such as age, gender, BMI, and comorbidity profiles. These variabilities could partially explain the wide ranges of diagnostic accuracy parameters in this systematic review, similar to what was observed in past reviews [14, 43]. The current AASM guideline does not recommend the use of PMs in patients with significant comorbid conditions because of lack of data supporting their diagnostic accuracy in these patients [12]. Half of the studies in this review included patients with comorbid conditions, in part addressing this evidence gap.

Following current recommendations for quality assessment of studies, we used the QUADAS-2 tool which assesses both the risk of bias and applicability in four major domains [17]. The percentage of studies with unclear ROB varied from 38% to 50% across the four ROB domains. Poor reporting quality may indicate poor methodological quality, limiting the strength of inferences that are possible from these data. The prior systematic review used a different tool to evaluate ROB [14]. Of the 24 studies in that review, 29% were graded as level A (good quality), 46% as level B (fair/moderate quality), and 25% as level C (poor quality) [14]. Using the same assessment tool, we graded 25% of the studies in our review as level A, 63% as level B, and 13% as level C (data not shown).

We assessed the concordance between PMs and PSG by reviewing the results from Bland-Altman plots. In most studies, the denominator for AHI/RDI calculations was based on total sleep time for PSG and total recording time for PMs. Since the total sleep time is usually shorter than the total recording time, PMs are more likely to underestimate than overestimate the severity of OSA. The analyses of concordance, however, showed that this was not always the case; the bias estimates had a wide range varying from −14.8 to 10.6 events/h with accompanying 95% limits of agreement (LOA) ranging from −66 to 78.8 events/h. This was similar to the previous systematic review that also reported a high level of discordance with bias estimates ranging from −17 to 12 events/h and 95% LOA estimates ranging from −49 to 61 events/h [14]. In practice, this means that some type IV PMs overestimate and some underestimate AHI/RDI values, potentially leading to misdiagnosis of OSA. For PMs tested at home and compared with sleep laboratory PSG, this could relate to either spontaneous night-to-night variability in sleep apnea severity or to changes in sleep apnea severity that relate to body position, alcohol and other substance use, sleep quality, or other variables. In particular, a significant night-to-night variability has been reported in studies in this review that tested the PM in the sleep laboratory and then at home [25, 26, 32, 34, 39] or tested the same PM at home over consecutive nights [36]. Considering the significant risk of OSA over- or underestimation with type IV PMs in our review, we further support the current AASM guideline, which recommends performing HSAT for at least one night and performing PSG for those with a negative HSAT [12].

A set of AHI/RDI thresholds have been used by the authors to grade OSA severity, most commonly including thresholds of ≥ 5, 10, 15, and 30 events/h of sleep. Expectedly, the estimates of sensitivity, specificity, and area under the ROC curve varied (in some cases, substantially) by the threshold level. Furthermore, due to differences in populations studied, type and number of channels, as well as sleep study setting, we observed wide variations in these estimates under fixed thresholds as well. The results of the current review are similar to those from the past systematic review. For example, for single-channel PMs when using an AHI/RDI cutoff ≥ 15 events/h, the sensitivity ranged from 0.65 to 1 in the current and from 0.43 to 1 in the past review, and the specificity varied from 0.58 to 0.98 in the current and from 0.42 to 1.00 in the past review [14]. The clinical implications of such variations could be quite significant, especially when translating these results to proportions of patients with false positive and false negative results. Another important issue is the setting where the PM testing is done. In our systematic review, only seven out of 24 studies tested PMs at home. The reported sensitivity and specificity estimates of PMs were generally better if they were done in a laboratory setting than at home; a finding that was in agreement with a prior systematic review and meta-analysis of type III devices [43].
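To make the link between accuracy estimates and misclassified patients concrete, the sketch below converts sensitivity and specificity into expected false negative and false positive counts. It uses the pooled ApneaLink estimates at the cutoff of 5 events/h reported in the Results (sensitivity 0.88, specificity 0.64); the cohort size of 1000 and the prevalence of 70% are hypothetical assumptions, the latter chosen from within the range observed in the included studies.

```python
def expected_misclassification(
    n_patients: int, prevalence: float, sensitivity: float, specificity: float
) -> tuple[float, float]:
    """Return the expected (false negatives, false positives) in a tested cohort."""
    with_osa = n_patients * prevalence
    without_osa = n_patients - with_osa
    false_negatives = with_osa * (1.0 - sensitivity)     # OSA missed by the PM
    false_positives = without_osa * (1.0 - specificity)  # OSA wrongly indicated
    return false_negatives, false_positives


# Assumed cohort: 1000 referred patients, 70% of whom have AHI >= 5 events/h on PSG.
fn, fp = expected_misclassification(1000, 0.70, sensitivity=0.88, specificity=0.64)
print(round(fn), round(fp))
```

Under these assumptions, roughly 84 patients with OSA would be missed and roughly 108 patients without OSA would test positive; different prevalence values shift the balance between the two error types.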

One major limitation in current studies was that most of them tested PMs only in a laboratory setting where the issues of technical failure are more easily identified and corrected. Past studies reported that the diagnostic accuracy of PMs is better in the sleep laboratory setting than at home [43]. Policy recommendations regarding specific PMs for HSAT should be supported by evidence gathered in the setting in which the test will be used. Another limitation was that integration with clinician judgment was not explored. OSA is a clinical diagnosis, informed by history, physical examination, and diagnostic test results [8]. The key question about diagnostic devices is whether and how they improve the overall accuracy of the diagnosis of sleep apnea, taking all relevant data into account. This requires understanding of how test results support and improve clinical judgment. None of the studies we evaluated addressed this question. Integration of test results with clinical judgment remains an important evidence gap.

Due to observed heterogeneity in patient populations and differences in the type and number of PM channels, we decided to meta-analyze each specific PM separately (i.e., refrain from indirect comparisons). In addition, we decided to separate studies by setting because a past systematic review of type III monitors showed that testing conducted in sleep laboratories (simultaneously with PSG) resulted in better diagnostic accuracy than sleep testing conducted at home (on a different night from PSG) [43]. Furthermore, bivariate random-effects models required having at least four studies of similar type to allow valid conclusions [18]. With all these considerations, we were only able to do a meta-analysis for the ApneaLink monitor tested in six studies conducted in sleep laboratories. We concluded that ApneaLink had a higher sensitivity and lower specificity under an AHI threshold of ≥ 5 events/h (lower disease severity) compared to the threshold of ≥ 15 events/h (higher disease severity), similar to what was reported in a past meta-analysis of type III PMs [43].

The limitations of this review warrant discussion. Due to logistical reasons, our search was limited to English language articles. Next, in defining what constitutes a type IV PM, we followed the prior AHRQ systematic review to generate a consistent body of evidence [14]. Both the AHRQ review and the recent AASM guideline defined type III studies as devices that include two respiratory parameters (breathing effort and airflow), oxygen saturation and an electrocardiography or heart rate recording [12, 14]. Following the prior AHRQ review, we considered PMs not meeting type III criteria as type IV, excluding single-channel PMs that use heart rate, heart rate variability, or actigraphy [14]. The AASM guideline defined type IV as devices with one to two parameters that record oxygen saturation, heart rate, and/or airflow [12]. The final AASM recommendation was supportive of HSAT for uncomplicated patients with technically adequate devices that, at minimum, record “nasal pressure, chest and abdominal respiratory inductance plethysmography, and oximetry; or else PAT with oximetry and actigraphy” [12]. Therefore, the guideline supports HSAT only with type III devices and the WatchPAT device (as per current review). We, however, were not able to calculate summary estimates for the diagnostic accuracy of WatchPAT due to the small number of identified studies testing this device in this review.

In conclusion, we found that the diagnostic accuracy of type IV PMs for HSAT varies depending on the number of channels, setting, and disease severity. While evidence is not very strong for their stand-alone use in routine clinical practice, in settings and populations where there is a high demand and a limited capacity in performing PSG or where OSA is highly underdiagnosed (e.g., patients with significant comorbidities), these monitors can help to expand access to early OSA identification and timely management. Future studies should consider testing the diagnostic accuracy of these devices in making a clinical diagnosis of OSA and test their performance both in sleep laboratories and at home. Policy recommendations regarding PM use should consider the health and societal implications of false positive and false negative diagnoses as well as the cost-effectiveness of PM use.