Introduction

Multiple epidemiological studies of obstructive sleep apnea (OSA) in the USA have shown that the public health burden of OSA is high [1]. Primary care providers, as gatekeepers of healthcare, determine whether patients are referred for sleep apnea evaluations. However, due to time, financial, and organizational constraints, primary care providers need rapid and simple screening tools to identify patients at risk for OSA. Several assessment tools have been used for this purpose; however, these tools vary widely in their predictive capabilities and need to be tested in community-based populations, which are more representative of patients in primary care physicians’ offices than laboratory-based populations are.

The STOP and STOP-Bang scoring tools, developed in 2008 and 2009 by Chung et al., are gaining much popularity. A high risk of OSA is defined as positive responses for ≥2 items on the STOP and ≥3 items on the STOP-Bang [2]. Other tools such as the Epworth Sleepiness Scale (ESS), Berlin Sleep Questionnaire, and 4-Variable screening tool have also been used and evaluated rigorously. Silva et al. reported that in a community-based population (Sleep Heart Health Study [SHHS]), the STOP-Bang questionnaire had the best sensitivity (SN) for predicting moderate-to-severe OSA (apnea-hypopnea index [AHI] > 15), whereas the 4-Variable tool had the best specificity (SP) [3]. The STOP and STOP-Bang have traditionally been scored as unweighted item counts compared against a threshold. Farney et al. fitted weighted models in a laboratory-based population to assess whether weighting would improve the predictive capability of the STOP-Bang questionnaire. Although a weighted model significantly improved the area under the receiver operating characteristic curve (AUC) and the coefficient of determination, it was determined to have no clinically significant advantage over a linear model [4]. That validation was performed in a laboratory-based population with a high prevalence of OSA; whether a weighted STOP-Bang measure would perform better in a community-based population is currently unknown.

The purpose of this study was to evaluate whether weighted responses to any of the items in the STOP-Bang scoring tool would improve the test characteristics for predicting OSA in a community-based population. Additionally, we wished to evaluate the sensitivity, specificity, likelihood ratio of a positive and negative test (LR+ and LR−), and AUC for the STOP-Bang in comparison to the weighted STOP-Bang in a community-based population.

Methods

Design and sample

This study evaluated 6441 participants who completed in-home polysomnograms (PSGs) in the baseline evaluation of the SHHS [5]. The SHHS is a prospective multicenter cohort study designed to investigate the relationship of sleep-disordered breathing (SDB) with the development of cardiovascular disease in the USA [5]. The study participants were recruited from parent cohort studies that were already in progress: Atherosclerosis Risk in Communities Study (1920 participants), Cardiovascular Health Study (1248 participants), Framingham Heart Study (1000 participants), Strong Heart Study (602 participants), New York Hypertension Cohorts (760 participants), and Tucson Epidemiologic Study of Obstructive Airways Disease and the Health and Environment Cohort (911 participants) [5]. Initial recruitment occurred between December 1995 and January 1998. After recruitment, participants completed questionnaires related to their sleep and health and underwent a 1-night home PSG. The SHHS was approved by an institutional review board for human studies; informed consent was obtained from all participants at the time of enrollment. The SHHS participants completed the Sleep Habits Questionnaire (SHQ) 1 to 2 weeks prior to their home PSGs [5]. These questionnaires were checked for completeness and collected by the team of two certified technicians who conducted the in-home PSGs [5].

STOP-Bang questionnaire

The STOP-Bang is a tool developed by Chung et al. (2008) that evaluates eight risk factors for OSA: snoring, tiredness, observed apneas, high blood pressure, body mass index (BMI) over 35 kg/m², age over 50 years, neck circumference over 40 cm, and male gender. An affirmative answer to an item in the tool is scored as 1 point, and a negative answer is scored as 0 points. The item scores are added to obtain the total score [2]. Although Chung et al. (2008) proposed a cut point of 35 kg/m² for Canadian preoperative patients, Ong et al. (2010) noted that a cut point of 30 kg/m² identified more Singapore sleep-clinic patients with a high risk for SDB [6]. For this study, a BMI cut point of 35 kg/m² was evaluated, as it would compare more directly to the original study of the scoring tool [7].

The variables in the SHHS were used to construct approximate answers to each item of the STOP section of the STOP-Bang [3]. Answers for snoring were deemed affirmative if participants noted loud snoring on the SHQ. Tiredness or sleepiness was deemed affirmative if the participant reported feeling unrested often or almost always regardless of the amount of sleep obtained and feeling tired all the time, most of the time, or a good bit of the time. Observed apnea answers were noted as affirmative if the participants answered yes to the question, “Based on what you have noticed or household members have told you, are there times when you stop breathing during your sleep?” Answers about high blood pressure were noted to be affirmative if participants answered yes to whether they were taking medication for high blood pressure. Affirmative answers were given a value of 1; negative answers were given a value of 0. For the Bang section of the STOP-Bang questionnaire, an affirmative answer was scored 1 and a negative answer was scored 0 for each of the following items: body mass index over 35 kg/m², age over 50 years, neck circumference over 40 cm, and male gender. Some subjects had missing data, and thus the total values for each variable may vary from the total sample.
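The scoring rule described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the `Participant` class and its field names are hypothetical stand-ins for the SHHS-derived answers, and the thresholds are the strict "over" cut points quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical field names; each STOP item is an already-derived yes/no answer.
    loud_snoring: bool
    tired_or_sleepy: bool
    observed_apnea: bool
    treated_hypertension: bool
    bmi: float       # kg/m^2
    age: float       # years
    neck_cm: float   # neck circumference in cm
    male: bool


def stop_bang_items(p: Participant) -> dict:
    """Return the eight dichotomous items, each scored 1 (affirmative) or 0."""
    return {
        "snoring": int(p.loud_snoring),
        "tired": int(p.tired_or_sleepy),
        "observed_apnea": int(p.observed_apnea),
        "pressure": int(p.treated_hypertension),
        "bmi": int(p.bmi > 35),        # BMI over 35 kg/m^2
        "age": int(p.age > 50),        # age over 50 years
        "neck": int(p.neck_cm > 40),   # neck circumference over 40 cm
        "gender": int(p.male),
    }


def stop_bang_score(p: Participant) -> int:
    """Conventional STOP-Bang: the unweighted sum of the eight items (0 to 8)."""
    return sum(stop_bang_items(p).values())


def is_high_risk(p: Participant, cutoff: int = 3) -> bool:
    """High risk when the total score reaches the cutoff (>= 3 by convention)."""
    return stop_bang_score(p) >= cutoff
```

Because every item carries the same weight, two participants with very different risk profiles can receive the same total, which is the motivation for the weighted variants examined in this study.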

In-home polysomnograms

The PSGs were completed using a Compumedics Portable PS-2 System (Abbottsville, Victoria, Australia) [8]. Participants were asked to schedule the arrival of the certified technicians approximately 2 h before their normal bedtimes and to make their sleep times and environments as close to their usual patterns as possible. The evening visit lasted between 1.5 and 2 h. The PSG montage included the following: right and left electrooculograms; bipolar submental electromyogram; thoracic and abdominal inductive plethysmographic bands; electrocardiogram; oximeter; and sensors for airflow, heart rate, body position, and ambient light [8]. Placement and calibration of all equipment and sensors were done by a team of two certified technicians during the evening visit [8].

Sleep parameters were scored according to the guidelines developed by Rechtschaffen and Kales (1968). Apneas were defined as a complete or nearly complete absence of airflow, as measured by the thermocouple sensor signal, for 10 s or more [9]. Hypopneas were defined as a decrease in amplitude of at least 30 % from the participant’s baseline airflow or volume that lasted at least 10 s. Only apneas and hypopneas associated with an oxygen desaturation of 4 % or more were used to determine the AHI, the average number of respiratory events per hour of sleep [10].
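As a rough illustration of the AHI definition above, the following sketch counts qualifying events and divides by hours of sleep. The event representation (a list of dictionaries with `duration_s` and `desaturation_pct` keys) is an assumption made for the example and is not the SHHS scoring software's data format.

```python
def apnea_hypopnea_index(events, total_sleep_minutes, min_desaturation_pct=4.0):
    """Average number of qualifying respiratory events per hour of sleep.

    `events` is a list of dicts, one per scored apnea or hypopnea, with
    hypothetical keys `duration_s` and `desaturation_pct`. An event counts
    only if it lasted at least 10 s and was associated with an oxygen
    desaturation of at least `min_desaturation_pct` percent.
    """
    if total_sleep_minutes <= 0:
        raise ValueError("total sleep time must be positive")
    qualifying = sum(
        1
        for e in events
        if e["duration_s"] >= 10 and e["desaturation_pct"] >= min_desaturation_pct
    )
    return qualifying / (total_sleep_minutes / 60.0)
```

Under this definition, a participant with 90 qualifying events over 6 h of sleep would have an AHI of 15, which is the threshold used for SDB in the analysis.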

Analysis

Data from the baseline SHHS home visit for subjects with complete PSGs were included in the present analysis. The study population was randomly divided into a derivation dataset (n = 1667) and a validation dataset (n = 4774), and the frequencies of each dichotomous variable in the STOP-Bang score were determined. The BMI, age, and neck circumference variables were used as dichotomous variables and, in a second model, as continuous variables. Differences in proportions between the derivation and validation datasets were assessed using chi-square tests for categorical variables. Differences in means for continuous variables were assessed using t tests. SDB was defined as an apnea-hypopnea index ≥15 per hour with a 4 % oxygen desaturation threshold. Using the derivation dataset and univariate logistic regression models, we determined the standardized beta coefficients (coefficients) that would allow us to weight the variables. In these models, each individual STOP-Bang variable was entered as the predictor variable and the presence or absence of SDB as the outcome variable. Using the coefficient for each variable, we constructed a new scoring model. Because standardized beta coefficients are all measured in standard deviations rather than in the units of the variables, they can be compared with one another to gauge the relative strength of the predictors. The sum of the weighted dichotomous variables yielded a weighted STOP-Bang score (wSTOP-Bang). Further regression models were constructed using BMI, age, and neck circumference as continuous variables as opposed to dichotomous variables; the coefficients for each variable were then used to construct a second scoring model, continuous STOP-Bang (cSTOP-Bang). The cSTOP-Bang tool used the aforementioned continuous variables in addition to the traditional dichotomous variables for snoring, tired or sleepy, hypertension, observed apnea, and gender.
The wSTOP-Bang, cSTOP-Bang, and the conventional STOP-Bang scores were then applied to the validation dataset, and the AUCs, sensitivity, specificity, and LR+ and LR− were compared.
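The two computational steps in this analysis, forming a coefficient-weighted score and comparing models by AUC, can be sketched as follows. This is an illustrative outline only: the coefficient values are placeholders invented for the example (they are not the values in Table 3), the variable names are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formulation rather than whatever statistical package the study used.

```python
# Placeholder weights for illustration only; NOT the coefficients from Table 3.
HYPOTHETICAL_COEFFICIENTS = {
    "snoring": -0.1,
    "tired": -0.2,
    "observed_apnea": 0.9,
    "pressure": 0.3,
    "bmi": 0.8,
    "age": 0.4,
    "neck": 0.5,
    "gender": 0.7,
}


def weighted_score(values, coefficients):
    """Sum of coefficient * value over all predictors.

    With 0/1 values this mirrors the wSTOP-Bang idea; passing raw BMI, age,
    and neck circumference for those three predictors mirrors cSTOP-Bang.
    """
    return sum(coefficients[name] * values[name] for name in coefficients)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen participant with SDB has a
    higher score than a randomly chosen participant without SDB; ties count
    as one half. Quadratic in sample size, which is acceptable for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In the design described above, the weights would be estimated on the derivation dataset only, and `auc` would then be evaluated on scores computed for the validation dataset, so that the comparison between models is not biased by fitting and testing on the same participants.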

Results

Demographic characteristics for the dichotomous variables are presented in Table 1. There were no differences between the derivation and the validation samples, except for gender: there were proportionally more men in the derivation dataset. This may also explain the proportionally higher number of subjects with neck circumferences greater than 40 cm. Table 2 shows the demographic characteristics for the continuous variables. There were no differences in means between the derivation and the validation samples, except for neck circumference, for which the mean was higher in the derivation sample, likely due to the higher proportion of men in that sample. The linear regression coefficients for the STOP-Bang dichotomous variables predicting SDB are shown in Table 3. Snoring and tired or sleepy had negative and non-significant coefficients predicting SDB, whereas observed apnea, BMI, and gender had the highest coefficients. After applying the derived scoring models to the validation data, we obtained the AUC for each. The AUC for the cSTOP-Bang was 0.738 with a standard error (SE) of 0.010 (95 % confidence interval [CI] 0.72, 0.76) and was greater than the AUC for the conventional STOP-Bang, which had an AUC of 0.71 with a SE of 0.01 (95 % CI 0.68, 0.73), and the wSTOP-Bang, which had an AUC of 0.69 with a SE of 0.01 (95 % CI 0.67, 0.71) (Table 4; Fig. 1).

Table 1 Sample characteristics for dichotomous variables
Table 2 Sample characteristics for continuous variables
Table 3 Individual linear regression models used to determine coefficients for the STOP-Bang dichotomous variables predicting AHI
Table 4 Areas under the curve for the STOP-Bang, cSTOP-Bang, and wSTOP-Bang
Fig. 1 Receiver operating curve for each scoring model compared to the reference line

Using the recommended cutoff point of 3 for the conventional STOP-Bang, the sensitivity was 93.2 % and the specificity was 23.2 %, with 35 % of subjects being correctly classified by the scoring tool. Increasing the cutoff point to 4 in order to increase specificity caused sensitivity to fall to 75.43 % (Table 5). Possible scores for the conventional STOP-Bang ranged from 0 to 8. Due to the nature of the wSTOP-Bang model, both fractions and whole numbers were possible scores. As some coefficients were negative, possible outcomes ranged from −0.517 to 3.474. A cutoff value of 0.594 yielded a sensitivity of 93.3 % and specificity of 23.6 %, with 35.8 % of subjects being correctly classified by the scoring tool (Table 6). This cutoff was selected as it yielded the maximal sensitivity for the screening tool. Compared with the conventional STOP-Bang, the wSTOP-Bang showed no improvement in specificity for any given sensitivity. Similar to the wSTOP-Bang, the cSTOP-Bang can take both whole-number and fractional values. Since BMI, age, and neck circumference were used as continuous variables as opposed to dichotomous variables, the number of possible outcomes increases substantially. However, by selecting a cutoff value that would produce a similar sensitivity to the conventional STOP-Bang, there was a notable increase in specificity. Outcomes ranged from 16.21 to 33.55. A cutoff value of 22.3 yielded a sensitivity of 93.2 % and specificity of 31.8 %, with 42.2 % of subjects being correctly classified. Furthermore, if the cutoff is reduced to 21.6, sensitivity can be increased to 95.7 % while specificity remains similar to that of the conventional STOP-Bang at 23.2 %, with 35.4 % of subjects being correctly classified (Table 7).

Table 5 STOP-Bang sensitivity, specificity, and likelihood ratios
Table 6 wSTOP-Bang sensitivity, specificity, and likelihood ratios (only selected values are shown)
Table 7 cSTOP-Bang sensitivity, specificity, and likelihood ratios (only selected values are shown)
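The test characteristics reported in Tables 5 to 7 follow from the standard 2 × 2 classification at a given cutoff. The sketch below shows one way to compute them; the function names are hypothetical, and the rule that a score at or above the cutoff counts as a positive screen is an assumption consistent with the "≥3" convention for the conventional STOP-Bang.

```python
import math


def likelihood_ratios(sensitivity, specificity):
    """LR+ = SN / (1 - SP) and LR- = (1 - SN) / SP, with SN and SP as fractions."""
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else math.inf
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else math.inf
    return lr_pos, lr_neg


def screening_metrics(scores, has_sdb, cutoff):
    """Sensitivity, specificity, LR+, LR-, and fraction correctly classified.

    A score >= cutoff is treated as a positive screen (assumption).
    """
    tp = fp = tn = fn = 0
    for score, diseased in zip(scores, has_sdb):
        positive = score >= cutoff
        if positive and diseased:
            tp += 1
        elif positive and not diseased:
            fp += 1
        elif not positive and diseased:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case with and one without SDB")
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    lr_pos, lr_neg = likelihood_ratios(sn, sp)
    return {
        "sensitivity": sn,
        "specificity": sp,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "correctly_classified": (tp + tn) / (tp + fp + tn + fn),
    }
```

Applying `likelihood_ratios` to the rounded values quoted in the text gives roughly LR+ ≈ 1.21 and LR− ≈ 0.29 for the conventional STOP-Bang at a cutoff of 3 (`likelihood_ratios(0.932, 0.232)`), and roughly LR+ ≈ 1.37 and LR− ≈ 0.21 for the cSTOP-Bang at a cutoff of 22.3 (`likelihood_ratios(0.932, 0.318)`). These figures are derived from rounded percentages and may differ slightly from the tabulated values.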

Discussion

Several tools have been used to estimate the pretest probability of OSA prior to polysomnography. The Epworth Sleepiness Scale (ESS), developed in 1991 by Johns et al., was the traditional screening method for determining the need for further OSA evaluation. While higher scores (ESS > 10) correlate with moderate-to-severe OSA, the ESS was developed to measure the likelihood of sleep onset rather than to determine OSA risk [11]. The Berlin Questionnaire (BQ) categorizes items known as OSA risk factors. Category 1 includes items on the presence of snoring, Category 2 includes items on daytime sleepiness, and Category 3 includes items on hypertension and obesity. Positive item responses in two of three categories identify patients at risk for OSA [12]. The simple 4-Variable screening tool consists of only four variables: gender, blood pressure, body mass index (BMI), and reported snoring. Values are assigned to each variable; blood pressure and BMI are assigned values based on predetermined ranges. The final score for the 4-Variable screening tool is determined by a linear regression formula. A final score of ≥14 indicates a high risk for OSA [13].

A systematic review in 2010 by Abrishami et al. reported that the BQ, overall, had the highest sensitivity (80 %) and specificity (76 %) for predicting OSA (AHI ≥ 5 events per hour) in persons without a history of sleep disorders [14]. Silva et al. compared the ESS, STOP, STOP-Bang, and 4-Variable screening tools using data from the SHHS, a community-based epidemiological study. Values were assigned to the items in the four tools by extrapolating the SHHS data [3]. They reported that for predicting moderate-to-severe OSA (AHI > 15), the ESS had a SN of 39 % and a SP of 71 %, the STOP had a SN of 62 % and a SP of 56 %, the STOP-Bang had a SN of 87 % and a SP of 43 %, and the 4-Variable tool had a SN of 24 % and a SP of 93 % [3]. Based on the SHHS data, the STOP-Bang was determined to be a simple, rapid, and sensitive assessment tool for moderate-to-severe OSA in the general population [3]. The STOP-Bang identifies persons as high risk if there are at least three affirmative responses to the eight items. Interestingly, a 51-year-old male with hypertension would be classified as high risk without any additional OSA risk factors, whereas a 40-year-old female who has a BMI over 35 kg/m² and witnessed apnea would be considered lower risk with only two affirmative answers.

Farney et al. noted that as STOP-Bang scores increased from 0 to 3, the probability of having any degree of sleep apnea increased. Also, as the scores increased >3, the probability of severe sleep apnea increased, while the probability for lesser degrees of sleep apnea decreased. In effect, scores <3 virtually excluded the possibility of OSA, scores between 3 and 5 were equivalent for determining the degree of sleep apnea, and scores 6–8 were highly predictive of severe OSA [4]. Notably, Farney et al. constructed three analytical models, including linear, curvilinear, and weighted. While a weighted model significantly improved the area under the receiver operating curve (AUC) and coefficient of determination, this model was determined to have no clinically significant advantage over a linear model [4].

The conventional STOP-Bang scoring tool is a simple and rapid screening tool for identifying those at risk for moderate-to-severe obstructive sleep apnea. However, the STOP-Bang questionnaire’s specificity for detecting SDB is low, and it thus carries a high false positive rate at the defined cutoff. Increasing the cutoff to improve specificity causes sensitivity to drop to unacceptable levels. When comparing the wSTOP-Bang to the traditional STOP-Bang, there was no improvement in specificity at cutoff levels with similar sensitivity. By weighting each variable and using BMI, age, and neck circumference as continuous variables, this study has shown that the STOP-Bang can be modified to maintain sensitivity while increasing specificity. The cSTOP-Bang correctly classified more subjects than did the STOP-Bang. One drawback to the cSTOP-Bang is that a calculator must be used to determine the score. The benefit of the STOP-Bang is that there are only nine possible scores (0 to 8), while there are innumerable possibilities with the cSTOP-Bang. However, the model can easily be built into a calculator application in which the value for each variable is entered manually and the final score is generated by the calculator. With the spread of electronic health record systems in the clinical setting, the calculation can be done automatically, with data entry done by staff. While there was a statistically significant improvement in specificity, it remains to be seen whether this improvement is clinically significant. Theoretically, by improving specificity while maintaining sensitivity, fewer false positives will occur. As in-lab overnight polysomnograms are costly and time-intensive studies, using a more robust scoring tool could yield cost savings.

Another improvement to the conventional STOP-Bang scoring tool could be the addition of more variables. Chung et al. (2013) showed that by adding a serum bicarbonate level cutoff of ≥28 mmol/L to the STOP-Bang, specificity for moderate-to-severe OSA at a score of ≥3 improved to 81.7 % [15]. That study was conducted using a cohort of perioperative patients. Further studies utilizing general population-based cohorts should be conducted to determine whether serum bicarbonate level adds utility to the STOP-Bang in those populations. Additional variables could also be investigated. Race, tobacco status, concomitant cardiopulmonary conditions, and Mallampati grade are all possible variables that may improve upon the STOP-Bang and cSTOP-Bang.