Introduction

Diagnosis of obstructive sleep apnea–hypopnea syndrome (OSAHS) combines an apnea–hypopnea index (AHI) of five or more events per hour of sleep with a symptom of sleep disorder, usually excessive daytime sleepiness [1–3]. Despite the importance of this symptom in OSAHS, there is considerable controversy regarding the reliability of the methods used to estimate sleepiness [4]. Measuring sleepiness is complicated by its multiple dimensions [5–7]. The Epworth Sleepiness Scale (ESS) [8] has become a widely used tool to aid OSAHS diagnosis [9–17], as well as to detect sleepiness in Parkinson’s disease [18], mental disorders [19], and insomnia [20].

Miletin and Hanly [21] expressed concern about the ESS being used to measure change over time, since its test–retest properties had not been properly assessed. A change in score may reflect recall bias rather than actual clinical change. Although Johns applied the ESS to medical students and repeated the scale 5 months later [22], Chervin [23] considered that test–retest reliability had not been adequately established in clinical samples. Psychometric inconsistencies in the factor analysis were reported by Smith et al. [24].

Rasch analysis, originally developed in education, has been employed to assess the psychometric properties of questionnaires used for medical diagnosis [25] in accordance with item response theory (IRT). Three properties are important in IRT: unidimensionality, invariance, and local independence. Unidimensionality means that the items of an additive scale are correlated and represent a single construct or concept. Invariance means that person factors, like age and gender, do not affect the way the instrument is answered; therefore, men and women, young and elderly, should have the same performance in the test. Local independence means that the probability of success on an item depends only on the item parameters and on the testee's ability [26].

Retest may modify the score of a scale [27, 28]. It is difficult to ascertain whether, in the clinical setting, the repetition of a test is useful. Many of the studies repeating the ESS involve small numbers of cases. Considering the widespread use of the ESS [29] and the doubts still being raised about its reproducibility [21, 23], we decided to verify, in a large number of patients with suspected sleep-disordered breathing (SDB), whether the score and the psychometric properties of the ESS change when subjects are tested before and retested after polysomnography (PSG).

Methods

This is a cross-sectional, prospective, descriptive study, in which the intervention was to duplicate the ESS administration, before and after the PSG. The study included patients attending a university-affiliated sleep clinic with suspected OSAHS who gave informed consent to the use of their questionnaires, in terms approved by the institutional ethics committee. The inclusion criteria were: age 18 years or over; a full-night, baseline, intervention-free polysomnography; and completely filled ESS forms both in the evening, on arrival at the sleep laboratory (ESS1), and in the morning, after the PSG (ESS2).

At both the first and second administrations of the ESS, patients were allowed to ask the technicians for clarification while filling in the questionnaires, as is inherent to sleep laboratory routine, thus avoiding experimental, non-generalizable situations. Patient–technician communication tends to be more concentrated in the evening, during electrode placement, than in the morning, when people fill in the forms rapidly, with few questions, before leaving the laboratory. Occasionally, patients were exposed to posters and handouts concerning sleep disorders, but this exposure occurred mostly at the reception, before they answered ESS1. Questions were divided into soporificity levels, based on the average score of each question in ESS1, as described by Johns [22]. The means of the scores of the three or two questions in each soporificity category (most, intermediate, and least soporific) were rounded so that the mean scores were integers from 0 to 3. We considered an answer to the ESS “inappropriate” when a patient responded to one of the most soporific situations with a score lower than the score given to intermediate or least soporific situations. The Stanford Sleepiness Scale (SSS) was administered before (SSS1) and after the PSG (SSS2); the difference SSS1–SSS2 (diffSSS) was calculated to estimate the reduction in sleepiness obtained by each patient after the night’s sleep. The reliable change index, a widely used tool to establish the significance of a change in a test administered on two occasions, was calculated as described before [30, 31].
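The “inappropriate answer” rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors’ procedure: the item grouping (questions 5, 2, and 4 as most soporific; 1, 3, and 7 as intermediate; 6 and 8 as least) follows the evening ranking reported in the Results, the comparison is made on the rounded category means, half-up rounding is assumed because the text only says that the means were rounded, and the function names are hypothetical.

```python
import math

# Item groups (1-based question numbers) from the evening ranking in the Results.
MOST = (5, 2, 4)
INTERMEDIATE = (1, 3, 7)
LEAST = (6, 8)

def rounded_mean(scores, items):
    """Mean of the chosen items, rounded half-up to an integer (assumed rule)."""
    mean = sum(scores[i - 1] for i in items) / len(items)
    return math.floor(mean + 0.5)

def is_inappropriate(scores):
    """True if the most soporific group scores below a less soporific group."""
    most = rounded_mean(scores, MOST)
    return (most < rounded_mean(scores, INTERMEDIATE)
            or most < rounded_mean(scores, LEAST))

# Hypothetical answers to questions 1..8, each scored 0 to 3.
print(is_inappropriate([2, 1, 3, 1, 0, 3, 2, 3]))  # most=1, intermediate=2, least=3
print(is_inappropriate([1, 2, 1, 2, 3, 0, 1, 0]))  # most=2, intermediate=1, least=0
```

The first hypothetical patient is flagged because the most soporific situations received a lower rounded mean than the less soporific ones; the second is not.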

Polysomnography

The PSG was carried out according to the standard method, as described before [32], with electroencephalogram (EEG; C3-A2, C4-A1), electrooculogram (left eye and right eye), submental and anterior tibial electromyogram, and electrocardiogram. Airflow was measured by a nasal cannula attached to a pressure transducer; respiratory effort was detected by respiratory inductance plethysmography (QRIP, Braebon, Kanata, Canada); and arterial oxygen saturation (SaO2) was measured by pulse oximeter (Xpod, Nonin, Plymouth, Minnesota, USA).

Apneas were defined as a reduction in airflow to 10% or less of baseline for 10 s or longer; hypopneas were defined as a reduction in airflow by 50% or more associated with a 3-s arousal or a fall in SaO2 of 3% or more. The AHI was calculated by dividing the total number of apneas and hypopneas by the number of hours of sleep.
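The AHI arithmetic is simple enough to show directly. The sketch below is illustrative only: the function names and the example night are hypothetical, and the severity bands at 5, 15, and 30 events per hour are the conventional cutoffs that match the thresholds used later in this report (AHI ≥ 5, ≥ 15, ≥ 30), assumed here rather than quoted from the Methods.

```python
def apnea_hypopnea_index(apneas, hypopneas, sleep_minutes):
    """Events per hour of sleep: (apneas + hypopneas) / hours of sleep."""
    if sleep_minutes <= 0:
        raise ValueError("sleep time must be positive")
    return (apneas + hypopneas) / (sleep_minutes / 60.0)

def severity(ahi):
    """Conventional bands at 5, 15, and 30 events/h (assumed cutoffs)."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"

# Hypothetical night: 30 apneas and 60 hypopneas over 6 h (360 min) of sleep.
ahi = apnea_hypopnea_index(30, 60, 360)
print(ahi, severity(ahi))  # 15.0 moderate
```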

Rasch analysis

The IRT psychometric properties analyzed were unidimensionality, invariance, and local independence. Because all items in the ESS are polytomous, they were checked to verify that their response categories were ordered. The boundaries between categories, called thresholds, were checked for “disordered thresholds”, which indicate the need to collapse adjacent categories. Following this, data were fitted to the model to determine overall fit and how well each item suited the model. Three overall statistics were considered to determine model fit. Two were item–person interaction statistics, distributed as a Z statistic, for which a mean of 0 and a standard deviation of 1 indicate perfect fit to the model. The third was an item–trait interaction statistic reported as chi-square (chi2), reflecting invariance across the trait, which is indicated by a non-significant chi2. In addition, individual item-fit statistics are presented as residuals (acceptable within the range ±2.5) and as a chi2 statistic (a non-significant chi2 is required).
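To make the notion of thresholds concrete, the following Python sketch computes category probabilities for one polytomous item under the partial credit form of the Rasch model and checks whether a set of thresholds is ordered. It is a didactic sketch, not the RUMM 2020 implementation: the function names and the threshold values are hypothetical, and the `collapse` helper only illustrates what merging two adjacent categories (recoding 0123 as 0112) means.

```python
import math

def category_probabilities(theta, thresholds):
    """Partial-credit probabilities for categories 0..m, given m thresholds."""
    # Cumulative sums of (theta - threshold); the empty sum for category 0 is 0.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def thresholds_ordered(thresholds):
    """True when each threshold is strictly higher than the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

def collapse(score, mapping=(0, 1, 1, 2)):
    """Recode 0123 as 0112, merging the two middle categories."""
    return mapping[score]

# Hypothetical person location and thresholds, in logits.
print(category_probabilities(0.0, [-1.0, 0.0, 1.0]))
print(thresholds_ordered([-1.0, 0.0, 1.0]))  # True
print(thresholds_ordered([-1.0, 1.0, 0.0]))  # False: disordered
```

With ordered thresholds, each response category is the most probable one over some interval of the trait; disordered thresholds mean that at least one category never is, which is why adjacent categories are collapsed.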

Differential item functioning (DIF) analysis tests the invariance of the ESS, that is, whether individuals with identical sleepiness scores have the same probability of giving a particular response to an item, irrespective of gender, age, etc. All items were checked for DIF by gender, age (split at the median of 40 years), educational level (with or without college education), and AHI (split at five events per hour) as person factors. The item logistic function of the DIF analysis, i.e., the proportion of individuals at the same sleepiness level with the same answer to a given item, was considered to violate the requirement for unidimensionality and invariance if items did not yield the same item response function for two or more ESS score groups. ANOVA was used to detect DIF in the person–item deviation residuals, with person factors (age, gender, etc.) and class intervals (e.g., groups along the ESS range) as factors. We distinguished uniform from non-uniform DIF: in the former, the difference between groups in the probability of affirming an item is constant across the trait; in the latter, the difference varies across the trait.

Lastly, we considered the items that had the highest correlation with the first principal component in both directions as the set most likely to breach the assumption of local independence. The person estimates of the total fitting item set against the person estimates of the item set for the positively loading items were compared with the person estimates of the total fitting item set against the person estimates of the item set for the negatively loading items. The result is an indication of unidimensionality: when the paired t test of this comparison was non-significant (p > 0.05), the items were considered unidimensional. Rasch analysis was performed using the RUMM 2020 package (Rumm Laboratory Pty Ltd, Perth, Australia).

Statistical analysis

Data were analyzed with SPSS v.16 (SPSS, Chicago, USA). The results are described as percentages, mean and standard deviation, or median and interquartile range (25–75%). The statistical tests used (Student’s t test, chi2, Cronbach’s alpha, Spearman’s rho, and Pearson’s correlation) are specified in the description of results. Factor analysis with varimax rotation was used with AHI > 5 as the selection variable. The area under the ROC curve was used to estimate the diagnostic accuracy of each question and of ESS1 and ESS2 to predict AHI ≥ 5, AHI ≥ 15, and AHI ≥ 30. Bivariate correlations and multiple regression analysis were used to assess the relationship between the ESS2–ESS1 difference (DiffESS) and the variables gender, age, educational level, diffSSS, and AHI. Because of the asymmetry of the data distribution, the natural logarithm transformation of AHI (lnAHI) was employed in regressions.

Results

We obtained data from 929 patients (Table 1). Educational level was elementary school in 19.1% of the cases, high school degree in 47.1%, and college degree in 31.4%; 2.4% were health professionals. OSAHS was mild in 24.7%, moderate in 21.8%, and severe in 36.3% of the patients. Relative to ESS1, the ESS2 score increased in 64.4%, remained the same in 29.8%, and diminished in 5.8% of cases (Fig. 1).

Table 1 Means, standard deviations, medians, and interquartile range of the anthropometric and polysomnographic characteristics of the sample
Fig. 1 Histogram of the difference in Epworth Sleepiness Scale (ESS) scores, before (ESS1) and after (ESS2) polysomnography

The mean scores for each of the ESS items and the total ESS score in the morning (ESS2) were significantly greater than those in the evening (ESS1; Table 2, Fig. 2). Disagreement between ESS1 and ESS2 is greater for items exploring highly soporific situations, as in questions 5 and 7.

Table 2 Means, standard deviations, medians, and interquartile range of the scores for each Epworth question before (ESS1) and after (ESS2) the polysomnography and for the difference between scores
Fig. 2 Bland–Altman plot of the difference and average values of the Epworth Sleepiness Scale (ESS) score, before (ESS1) and after (ESS2) polysomnography. Horizontal lines represent mean ± 1.96 standard deviations of the ESS2 to ESS1 difference. The largest variability is seen in average scores from 7 to 12, a range in which the diagnosis of sleepiness is more likely to change

The Bland–Altman plot in Fig. 2 depicts the reproducibility of the ESS. DiffESS was not significantly correlated with gender, diffSSS, educational level, or AHI; when all these regressors were included in a multiple regression model, the only significant variable was age.
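For readers unfamiliar with the Bland–Altman approach, the quantities behind such a plot can be computed as in the Python sketch below: the bias is the mean of the ESS2 minus ESS1 differences, and the limits of agreement are the bias ± 1.96 standard deviations of those differences. The scores are made-up values for illustration, not study data, and the function name is hypothetical.

```python
from statistics import mean, stdev

def bland_altman(first, second):
    """Return per-subject averages and differences, the bias, and the limits of agreement."""
    diffs = [b - a for a, b in zip(first, second)]
    averages = [(a + b) / 2 for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return averages, diffs, bias, (bias - spread, bias + spread)

# Hypothetical evening (ESS1) and morning (ESS2) scores for four patients.
ess1 = [8, 10, 12, 6]
ess2 = [10, 13, 12, 9]
averages, diffs, bias, limits = bland_altman(ess1, ess2)
print(bias, limits)
```

Plotting `diffs` against `averages`, with horizontal lines at the bias and at the two limits, gives the diagram.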

In Fig. 3, the percentage of ESS scores >10 increased from 37% to 49% (12 percentage points) in the group with AHI < 5 and by more than 20 percentage points in the groups with AHI ≥ 5. The average ESS score increased from 9.5 to 11.2 (1.7 points) in the group with AHI < 5 and by more than 2 points in the groups with AHI ≥ 5.

Fig. 3 Mean ESS score and percentage of cases with Epworth Sleepiness Scale (ESS) score >10 before (panel a; ESS1) and after the polysomnography (panel b; ESS2). Different letters mean significant difference in post hoc Tukey’s test

The diagnostic performance of ESS1 > 10 and ESS2 > 10 to predict AHI ≥ 5 is displayed in Table 3. Several properties, such as sensitivity, specificity, and accuracy, changed significantly from ESS1 to ESS2.
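As a reminder of how these properties are derived, the Python sketch below cross-tabulates a positive test (ESS > 10) against disease (AHI ≥ 5) and computes sensitivity, specificity, and accuracy. The patient values are invented for illustration, the function name is hypothetical, and the sketch assumes that the sample contains at least one diseased and one non-diseased case.

```python
def diagnostic_performance(ess_scores, ahi_values, ess_cutoff=10, ahi_cutoff=5):
    """Sensitivity, specificity, and accuracy of ESS > cutoff to predict AHI >= cutoff."""
    tp = fp = tn = fn = 0
    for ess, ahi in zip(ess_scores, ahi_values):
        test_positive = ess > ess_cutoff
        disease = ahi >= ahi_cutoff
        if test_positive and disease:
            tp += 1
        elif test_positive:
            fp += 1
        elif disease:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical ESS scores and AHI values for six patients.
ess = [12, 8, 11, 9, 15, 4]
ahi = [20, 3, 2, 10, 40, 1]
print(diagnostic_performance(ess, ahi))
```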

Table 3 Diagnostic performance of the Epworth Sleepiness Scale score > 10 to predict AHI > 5/h in the evening and in the morning

Performance of ESS1 and ESS2 to predict AHI ≥ 5, AHI ≥ 15, and AHI ≥ 30, assessed by the area under the ROC curve, is displayed in Table 4. Area under the ROC curve was unchanged or changed non-significantly from ESS1 to ESS2 for all questions. The best performance is seen in the prediction of AHI ≥ 5. In the prediction of AHI ≥ 30, questions 3, 5, and 8 displayed a small non-significant reduction in area.
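The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen patient with the condition has a higher score than a randomly chosen patient without it, with ties counted as one half. The Python sketch below computes it by comparing all such pairs. It is an illustration with invented data and a hypothetical function name, not a reproduction of the SPSS procedure used in this study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical ESS scores; label 1 means AHI >= 5, label 0 means AHI < 5.
scores = [12, 8, 11, 9, 15, 4]
labels = [1, 0, 0, 1, 1, 0]
print(roc_auc(scores, labels))
```

An area of 0.5 corresponds to chance-level discrimination and 1.0 to perfect discrimination.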

Table 4 Area under the ROC curve and significance for each question of ESS1 and ESS2 to predict AHI >5/h, >15, and >30 in the evening and in the morning

The test–retest reliability of the ESS, assessed using Cronbach’s alpha, was 0.92 (p = 0.00001), with a single-measures intraclass correlation coefficient of 0.85. Kappa coefficients for each question are displayed in Table 5. Question 5 has the lowest kappa.

Table 5 Kappa coefficients between evening and morning scores for each ESS question

In the evening, the three most soporific situations were those described in questions 5, 2, and 4; the three intermediate soporific situations were in questions 1, 3, and 7; and the two least soporific situations were in questions 6 and 8. Considering that, for one individual, scores in questions with different levels of soporificity are supposed to follow a logical order, i.e., more soporific situations receive higher scores than less soporific situations, the results were arranged in tables comparing soporificity levels: (1) most and intermediate soporific, and (2) most and least soporific, as displayed in Table 6. The lower the kappa coefficient for the agreement between two levels, the more likely it is that the logic of the scale is preserved. Thus, the significant reduction of the kappa coefficients in ESS2 indicates better performance in terms of the “appropriateness” of the whole set of answers, as can be seen from the reduction in the number of inappropriate answers from 70 to 62 and from 8 to 6.

Table 6 Four by four tables of the rounded mean scores in the three most soporific situations (questions 5, 2, and 4) by three intermediate soporific situations (questions 1, 3, and 7), and by the two least soporific situations (questions 6 and 8), in the evening (ESS1) and in the morning (ESS2)

Two components were identified in the rotated matrix of factor analysis for the eight items both in ESS1 and ESS2. In ESS1, component 1 included items 1, 2, 4, 5, 7 and component 2 included items 3, 6, 8. In ESS2, component 1 incorporated items 1, 3, 6, 8 and component 2 included items 2, 4, 5, 7. This could indicate that the ESS is bi-dimensional.

Table 7 shows significant correlation coefficients of ESS1 and ESS2 with OSAHS variables. The increases in r, from ESS1 to ESS2, are small and within the 95% confidence interval. The increase of the correlation coefficients between lnAHI and ESS, from 0.15 in ESS1 to 0.21 in ESS2, is marginally significant (p ∼ 0.06).

Table 7 Pearson’s product-moment correlation coefficients between several polysomnographic and clinical outcomes, including questions of the Berlin Questionnaire, and the first and second Epworth Sleepiness Scale administrations

The calculated reliable change index for ESS1 was 4.1. Thus, changes larger than ±4 points reflect actual variation in the appraisal of sleepiness. In our sample, only seven cases (0.75%) displayed a reduction of 5 points or more, and 157 cases (16.9%) an increase of 5 points or more. These patients’ PSG results are not significantly different from those of the remaining patients.
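The exact formula used for the reliable change index is given in the cited references and is not restated here. As an illustration only, the Python sketch below assumes the common Jacobson–Truax formulation, in which the standard error of measurement is SD × √(1 − r), the standard error of the difference is √2 times that value, and a change is considered reliable when it exceeds 1.96 standard errors of the difference. The input values in the first call are hypothetical; only the threshold of 4.1 in the later calls comes from this report.

```python
import math

def reliable_change_threshold(sd, reliability, z=1.96):
    """Smallest change exceeding measurement error (Jacobson-Truax form, assumed)."""
    sem = sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    s_diff = math.sqrt(2.0) * sem            # standard error of the difference
    return z * s_diff

def classify_change(diff, threshold):
    """Label a retest-minus-test difference against the threshold."""
    if diff > threshold:
        return "reliable increase"
    if diff < -threshold:
        return "reliable decrease"
    return "within measurement error"

# Hypothetical inputs: baseline SD of 5 points and test-retest reliability of 0.5.
print(reliable_change_threshold(5.0, 0.5))

# Applying the threshold of 4.1 reported in this study.
print(classify_change(5, 4.1))   # reliable increase
print(classify_change(-3, 4.1))  # within measurement error
```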

IRT psychometric properties were assessed in the Rasch model (Table 8). The analysis of response thresholds showed that the response categories 0, 1, 2, and 3 are significantly different in most questions. However, for question 5, levels 2 and 3 are marginally similar. In ESS1, question 5 does not fit the model, presenting a residual of 3.24, above the acceptable limit of 2.5. In ESS2, item 5 presented disordered response thresholds; to solve this, the codes 0123 were replaced with 0112, which changed the chi2 of the item–trait interaction from a highly significant value (p = 0.00002) to values near the optimum (p = 0.02). In both ESS1 and ESS2, a redundancy was noted in question 5, which was reduced but not eliminated on the second administration. For ESS1, the person fit residual was −0.280, the PSI was 0.84, and the paired t test 95% confidence interval was 0.03–0.06; for ESS2, the person fit residual was −0.310, the PSI was 0.83, and the paired t test 95% confidence interval was 0.05–0.07. In ESS1, there were statistically significant differences in scores, representing lack of invariance for gender (item 3; sleeping in public), age (items 1, 2, 4, and 8), and educational level (item 1; reading). In ESS2, the lack of invariance was present only for age, in the same questions. The Rasch analysis identified 13 cases in ESS1 and 12 in ESS2 that did not fit the model.

Table 8 Individual item fit in the Rasch model

Discussion

Based on an extensive literature review, this is the first study to assess, in a large sample, the effect of short-term ESS repetition on its diagnostic performance and on both its classical and IRT psychometric properties. The IRT psychometric properties show a small improvement from ESS1 to ESS2.

Validation studies show that the ESS score is relatively stable after 1 year [33] or after months [34, 35]. Over such long intervals, the learning effect may be attenuated. The present study did not aim to identify the effect of different time intervals on the ESS score. It may be useful to test the effect of ESS repetition in the office, within a shorter time interval of one or two hours.

Pien et al. considered a change in the ESS from 8.6 to 10.2 as evidence of increased sleepiness during pregnancy [36]; this is less than the 2.3-point overnight change in the present report. Chin et al. described as proof of altered perception of sleepiness an increase in the pre-treatment ESS, recalled after CPAP use, from 8.5 to 11.1 [37], which is indistinguishable from our findings. On the other hand, reductions of the ESS score after treatment of OSAHS [31, 38] could be greater than reported if the effect of repetition were controlled for.

The Bland–Altman diagram (Fig. 2) shows a wide range of individual variation in the ESS score, which bears directly on the utility of the ESS in clinical practice. In 50% of the cases, the score changed by between 0 and 4 points. The range of two standard deviations, from −10 to 12, indicates that, although the bias is small, there is substantial individual variability.

In Fig. 3, the average ESS and the percentage of ESS scores >10 increased more in the groups with AHI > 5 than in the negative group, with AHI < 5. This non-significant finding suggests that the increase in ESS score is more pronounced in the abnormal AHI bins. The group with AHI < 5 is not a healthy control group, since these subjects underwent PSG to investigate a sleep complaint. A normal AHI does not exclude sleepiness secondary, for instance, to upper airway resistance syndrome or sleep deprivation. Considering that all subjects felt that their sleep needed medical attention, it is more likely that a higher ESS score in the second administration correctly captured actual sleepiness, even in people with AHI < 5.

ESS questions on which the patients scored higher indicate a more soporific situation. One can consider it inappropriate to answer “would never doze” to the more soporific question 5 and “high chance of dozing” to the less soporific question 7. In our sample (Table 6), for ESS1, 8.9% of the testees had a higher score in question 3 than in question 5; in the repeated test, only 3% maintained this “inappropriate” pattern. Thus, the reduction in inappropriate answers indicates that the retest leads to a more judicious choice of options.

In the Rasch analysis, age, split at the median of 40 years, affects questions 1, 2, 4, and 8, violating the invariance assumption in both ESS1 and ESS2. As the two age groups were similarly sleepy, it is possible that, from middle age on, people are more aware of the sensation of sleepiness. This is consistent with the report of more sleepiness-related car accidents among younger drivers, who are more often unaware of their sleepiness [39]. The effect of age on the lack of invariance in ESS answers is similar in ESS1 and ESS2. The fact that items 1 and 3 started to display invariance for educational level and gender in ESS2 indicates an improvement of the psychometric properties assessed in the Rasch model.

The reliable change index defines the variation in a score that is necessary and sufficient for the change to reflect an actual alteration in the measured variable. In the present study, the calculated reliable change index is 4.1, meaning that a five-point change is a reliable change of the ESS score. This is similar to the index of six points reported by Smith and Sullivan [31]. Chin et al.’s [37] finding of a post-treatment reduction in ESS from 8.5 to 4.3 points is within the reliable change index. Several studies report changes in ESS without a control group for comparison. Using patients as their own controls in studies involving the ESS is a common but problematic method. For instance, an ESS reduction with weight loss, from 7.9 to 4.8 points, with the scale repeated four times over 24 months, was not controlled for the repetition effect [40]. Evidence based on such comparisons might need reassessment.

The correlation coefficients between clinical and PSG outcomes and ESS increase from ESS1 to ESS2 (Table 7). This improvement suggests that ESS2 is a better predictor of sleepiness-related polysomnographic outcomes.

Patient–technician communication during electrode placement may have amplified the morning ESS score. The possibility that sleep inertia in the morning made the patients more prone to respond with higher ESS scores is countered by the negligible change in the SSS. Although in the morning patients filled in the forms swiftly, with less time for deliberation before leaving the laboratory, the correlation between AHI and ESS is higher in the second administration.

Repeating the scale using results obtained from the patients’ bed partners is another approach to increasing the diagnostic accuracy of the ESS [33, 41]. The results in small samples (60 and 82 subjects) did not show clinical utility of duplicating the scale administration. The increases of 1.2 and 1.8 in the ESS by the bed partner are less than what is here reported.

Polysomnography is paramount for OSAHS diagnosis, but a full sleep study is usually requested when a symptom, most often sleepiness, is present and preferably confirmed by a valid instrument. An abnormal ESS score will prompt the physician to order PSG and motivate the patient to undergo the exam. A single ESS score below 10, however, should not be taken to exclude sleepiness before the physician verifies the ESS result by repeating the scale administration or even by questioning the bed partner.

For the individual patient, to whom treatment for OSAHS may be denied on the basis of a low ESS score, repeating the scale may be lifesaving, at a negligible cost. Repeating ESS with bed partners or with patients may be equally beneficial before dismissing patients as “asymptomatic”. As was demonstrated, in individual cases, ESS may be inappropriately understood by some patients, reducing their opportunity for diagnosis and treatment.

Is it adequate to repeat the ESS? In view of our results, in a sleep laboratory population, the answer is yes. Repetition does not change the psychometric properties of the scale, and it improves the diagnostic performance of the test and its correlation with OSAHS variables. Further research is required to complete the understanding of the cost-effectiveness of ESS repetition, measuring its effect on primary outcomes such as mortality reduction.