Introduction

The Epworth Sleepiness Scale (ESS) is a quantitative method for evaluating sleepiness, or the propensity to fall asleep, in eight everyday situations, using a Likert scale [1–4]. The instrument has been used in different contexts, populations, and clinical conditions, with particular interest in respiratory sleep disorders [5–7]. Several versions of the ESS have been validated for different languages and populations, including Portuguese (from Brazil), Chinese, French, German (from Switzerland), German (from Germany), Greek, Italian, Japanese, Korean, English (from New Zealand), Norwegian, English (from USA), Spanish (from Peru), Serbian, Spanish (from Spain), Thai, Turkish, and Spanish (from Mexico) [8–27]. Although most of these studies were conducted with classical measurement methods, the overall results indicate good measurement properties for previous versions of the ESS [28–30]. Some studies have also demonstrated the suitability of this instrument through confirmatory factor analysis [31] and Rasch analysis [32, 33].

In validation studies of health-related patient-reported outcome measures, assessing methodological quality has become increasingly important, mainly in terms of the instrument’s measurement properties, standards for design requirements, and preferred statistical methods, as set out in the COSMIN checklist [34]. Through this checklist, a large group of researchers [35] reached relative consensus on the inclusion of the following measurement properties: internal consistency, reliability, measurement error, content validity (including face validity), construct validity (including structural validity, hypotheses testing, and cross-cultural validity), criterion validity, responsiveness, and interpretability.

Despite the widespread clinical use of the ESS, there are no known studies assessing the psychometric properties of the Portuguese version of this scale in Portugal. Furthermore, the existing literature on the validation of the ESS is mainly based on classical measurement theory, which neglects the difficulty of scale items and the underlying characteristics of each individual. Thus, our primary goal is to validate a Portuguese version of the ESS in adults using the methods of classical measurement theory (CMT) and the Rasch model (RM), which is based on item-response theory.

Methods and materials

Design and procedures

This study is based on a one-shot design. The required sample size was calculated a priori using Cohen’s d effect size for t tests. A total sample of 220 participants was required to detect a medium effect size (d = .40; 1 − β = .80; one-tailed α = .05) with an expected allocation ratio of .30. All subjects were unpaid volunteers who gave their informed consent after being told the study objectives. The study was approved by the scientific and ethical committee of the clinical institutions where the subjects were diagnosed and treated for obstructive sleep apnea (OSA).
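The a priori calculation can be approximated by hand. The following is a minimal sketch, not the authors' procedure (the software used is not stated): it relies on the normal approximation to the t test and assumes that the allocation ratio of .30 means the smaller group is 30 % the size of the larger one. The function name `two_group_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def two_group_sample_size(d, power, alpha, allocation_ratio):
    """Approximate group sizes for a one-tailed two-sample t test.

    Uses the normal approximation, so exact noncentral-t software may
    return slightly larger values. allocation_ratio is ASSUMED to mean
    n_small / n_large.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)       # quantile for the target power
    # With n_small = k * n_large, Var(mean difference) = (1 + 1/k) / n_large
    n_large = (1 + 1 / allocation_ratio) * (z_alpha + z_beta) ** 2 / d ** 2
    n_small = allocation_ratio * n_large
    return math.ceil(n_large), math.ceil(n_small)


n_large, n_small = two_group_sample_size(
    d=0.40, power=0.80, alpha=0.05, allocation_ratio=0.30
)
total = n_large + n_small
print(n_large, n_small, total)
```

Under these assumptions the total comes out close to the 220 participants reported above; the small difference is expected because the approximation ignores the heavier tails of the t distribution.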

Participants

The sample consisted of 222 Portuguese adults (97 men and 125 women). Forty-six of them (35 men and 11 women) had untreated OSA diagnosed by type 1 polysomnography, with varying degrees of severity (measured by the Apnea-Hypopnea Index (AHI), not positional). Patients were recruited from three public hospitals. The remaining participants were healthy volunteers without a clinical diagnosis of OSA (62 men and 114 women), recruited by convenience sampling from a community-based sample (universities and companies).

Prior to enrollment, to check the inclusion criteria, the subjects completed a short-form sleeping habits questionnaire, in which controls reported not having any diagnosis or symptoms of sleep disturbances. None of the subjects assessed (healthy participants and OSA patients) was a shift worker, had a clinical history of neurological or psychiatric disorders, or was taking any type of psychotropic medication.

Table 1 shows the basic demographic characteristics of the sample, as well as the AHI for the clinical sample of OSA.

Table 1 Demographical data

The comparisons between healthy subjects and OSA patients showed statistically significant differences regarding age. Tukey HSD revealed statistically significant differences between healthy participants and OSA patients, but not between the groups of different OSA severity. As for the gender distribution, standardized residuals showed a difference in the gender distribution, particularly in subjects diagnosed with mild to moderate OSA.

Materials

Clinical history

As stated before, a short-form questionnaire was developed to assess demographic variables, sleeping habits, and any previously diagnosed sleep disorder.

Epworth Sleepiness Scale

The ESS consists of eight items, each rated on a scale of 0–3; the total score is the sum of the item responses. The total score is a measure of subjective daytime sleepiness (range 0–24; higher scores indicate a greater propensity to fall asleep).
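The scoring rule can be written out in a few lines. This is an illustrative sketch only; the function name `ess_total_score` and the example responses are hypothetical and are not data from the study.

```python
def ess_total_score(responses):
    """Sum eight ESS item responses (each 0-3) into a total score (0-24)."""
    if len(responses) != 8:
        raise ValueError("the ESS has exactly eight items")
    for value in responses:
        if value not in (0, 1, 2, 3):
            raise ValueError("each item must be rated 0, 1, 2, or 3")
    return sum(responses)


# Hypothetical respondent, for illustration only
example_total = ess_total_score([1, 2, 0, 3, 1, 0, 2, 0])
print(example_total)
```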

The English version of the ESS [3] was translated into Portuguese by two different specialists in sleep disorders, an independent neurologist and a neuropsychologist, and a native English speaker. The back-translation of the Portuguese version was done by two other independent experts in sleep disorders and a native English speaker. The final translation adopted the most consensual rendering of “dozing,” which was translated into Portuguese as passar pelas brasas. This expression was chosen because it is often used in Portuguese to describe a “nap” during the daytime.

Statistical procedures

Classical measurement theory

The psychometric properties (distribution measures, reliability, and validity) were estimated according to the assumptions of classical measurement theory. To study the data distribution, several descriptive measures were used: the mean, standard deviation, skewness, kurtosis, and missing values. Structural validity was tested with principal component analysis (varimax rotation), and criterion validity was tested with a Student’s t test for independent samples comparing patients and healthy participants. Inter-item correlations were also computed to study internal validity. Reliability was estimated with Cronbach’s alpha.

Rasch model

The Rating Scale Model (RSM) is an extension of the RM for polytomous items. These models transform the subjects’ ordinal response data onto an interval scale [36, 37] and are especially recommended for testing psychological assessment instruments [38] because of properties such as the conjoint measurement of persons and item responses (i.e., parameters for persons and items are expressed in the same units). Moreover, the patterns of subjects’ responses are adjusted to the model (i.e., the probability of an item response depends only on the person and item levels on the measured attribute) [39]. The RM is empirically useful for determining the quality of response categories, particularly in Likert-type scales [40].
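To make the model concrete, the sketch below computes the category probabilities of the RSM in its usual formulation, in which the probability of responding in category k depends on the person measure (theta), the item difficulty (delta), and a set of step thresholds shared by all items. The function name and the threshold values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def rsm_category_probabilities(theta, delta, thresholds):
    """Return P(X = k) for k = 0..m under the Rating Scale Model.

    theta: person measure in logits; delta: item difficulty in logits;
    thresholds: step thresholds tau_1..tau_m shared across items.
    """
    # Cumulative sum of (theta - delta - tau_j); category 0 is fixed at 0
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta - tau))
    weights = [math.exp(value) for value in cumulative]
    normaliser = sum(weights)
    return [weight / normaliser for weight in weights]


def expected_score(probabilities):
    """Expected item score given the category probabilities."""
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical thresholds for a four-category (0-3) item
taus = [-1.0, 0.0, 1.0]
probs_average = rsm_category_probabilities(theta=0.0, delta=0.0, thresholds=taus)
probs_sleepier = rsm_category_probabilities(theta=1.0, delta=0.0, thresholds=taus)
print(probs_average, expected_score(probs_average))
print(probs_sleepier, expected_score(probs_sleepier))
```

Because persons and items enter only through the difference theta − delta, both are expressed in the same logit units, which is the conjoint measurement property mentioned above.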

Results

The classical analyses of the ESS were performed using SPSS v.20 for Windows.

The Rasch analysis, based on item-response theory, was conducted by testing the Linacre guidelines [41] and was performed using Winsteps 3.80.1 [42].

Classical measurement theory

Descriptive and distribution analysis

Table 2 shows the descriptive statistics for items and the total ESS.

Table 2 Descriptives for the eight items and total ESS score

The descriptive statistics indicate that there are no notable deviations in skewness and kurtosis on the total score of the ESS. There are, however, non-normal distributions for some individual items of the scale, with particular attention to item 6 (i.e., short range of values) and item 8, both showing moderate positive skewness and kurtosis. Furthermore, an analysis of missing values has clarified these results (i.e., three participants failed to respond to item 8).

Structural validity

To study the structural validity of the ESS, a factor analysis with a principal component analysis method was performed using orthogonal varimax rotation on the eight items of the ESS.

The initial solution was satisfactory [KMO = .806, χ²(28) = 432.760, p < .001]. The communalities ranged from .45 to .64. From this initial solution, two factors were extracted with eigenvalues greater than 1, according to the Kaiser-Guttman rule (eigenvalues: 3.194 and 1.418), explaining 57.65 % of the scale variance (Fig. 1).
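The variance figures follow directly from the eigenvalues: in a principal component analysis of a correlation matrix, each of the eight standardized items contributes one unit of variance, so a component's share is its eigenvalue divided by 8. The short sketch below reproduces this arithmetic from the two reported eigenvalues; the function name is hypothetical.

```python
def variance_shares(eigenvalues, n_items):
    """Proportion of total variance explained by each component."""
    return [eigenvalue / n_items for eigenvalue in eigenvalues]


reported_eigenvalues = [3.194, 1.418]  # values reported in the text
n_items = 8

# Kaiser-Guttman rule: retain components with eigenvalue > 1
retained = [ev for ev in reported_eigenvalues if ev > 1.0]
shares = variance_shares(retained, n_items)
cumulative_percent = 100 * sum(shares)

print([round(100 * share, 2) for share in shares])
print(round(cumulative_percent, 2))
```

The two shares add up to the 57.65 % reported above, and the first component alone accounts for roughly 40 % of the variance.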

Fig. 1 ESS scree plot for variance decomposition

Table 3 displays the component matrix loadings for each item (>.30) after varimax rotation for a two-factor solution.

Table 3 Rotated component matrix

As shown in Table 3, three items of the ESS load on both factors. Furthermore, the data from the component transformation matrix also suggest an interrelation between the two extracted factors (see Table 4).

Table 4 Component transformation matrix

Given the issues raised by the two-factor solution, another exploratory principal component analysis was performed with a forced one-factor solution. With this single-factor model, the communalities ranged from .27 to .55, and the total variance explained was 39.92 %. As shown in Table 5, all items load above .50 on the single factor. This unidimensional solution is therefore suitable for our data, as it adequately describes the construct and the underlying factor structure.

Table 5 Component matrix—unidimensional solution

Internal validity

To study the internal validity of the ESS, item-total correlations were computed using Pearson’s r. The results show moderate to strong positive correlations (.46 ≤ r ≤ .73) between each individual item and the total scale (all p < .001). Item 6 and item 8 have the lowest correlations with the total scale.
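As an illustration of this procedure, the sketch below correlates each item with the total score using Pearson's r. It assumes uncorrected item-total correlations (each item is included in the total), which is how the text reads; the function names and the small three-item dataset are hypothetical and are not study data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def item_total_correlations(data):
    """Correlate each item column with the row totals (item included)."""
    totals = [sum(row) for row in data]
    n_items = len(data[0])
    return [
        pearson_r([row[j] for row in data], totals) for j in range(n_items)
    ]


# Hypothetical responses: rows are respondents, columns are items (0-3)
toy_data = [
    [0, 1, 0],
    [1, 1, 0],
    [2, 2, 1],
    [3, 2, 1],
    [3, 3, 2],
]
correlations = item_total_correlations(toy_data)
print([round(r, 2) for r in correlations])
```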

Criterion-related validity

The criterion-related validity of the ESS was tested with an independent-samples t test comparing the ESS scores of patients diagnosed with OSA and healthy participants. Table 6 shows the means and standard deviations of the ESS total score in OSA patients and healthy participants.

Table 6 Descriptives for ESS (global score)

The results presented in Table 6 indicate that subjects diagnosed with OSA have higher levels of sleepiness compared to healthy participants [t(220) = −2.234, p = .026].
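For reference, the statistic behind this comparison can be sketched as follows, assuming the pooled-variance (Student) form of the test named in the Methods. The function name and the two small groups are hypothetical; the p-value is omitted because it requires the t distribution, which is not in the Python standard library.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's t for two independent samples with pooled variance.

    Returns (t, degrees_of_freedom). t is negative when the first
    group has the lower mean.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical total scores, for illustration only
healthy = [4, 6, 5, 7, 3, 6]
patients = [8, 6, 9, 7, 10]
t_value, df = pooled_t_statistic(healthy, patients)
print(round(t_value, 3), df)
```

With the study's group sizes (176 healthy participants and 46 patients), the degrees of freedom are 176 + 46 − 2 = 220, matching the reported t(220).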

Reliability

Internal consistency was estimated with Cronbach’s alpha to study the reliability of the ESS in evaluating sleep disturbances. Cronbach’s alpha (.77) was acceptable for the ESS with the original eight items, even after examining whether alpha could be increased by removing items. The average inter-item correlation was r = .31 (.05 ≤ r ≤ .47), with item 6 and item 8 being the most problematic items according to this analysis.
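The coefficient can be sketched in two ways: from raw item responses (the usual variance-based formula) and, as a rough cross-check, from the number of items and the average inter-item correlation (the standardized form). The function names and the toy data are hypothetical; only the values 8 and .31 come from the text.

```python
from statistics import variance


def cronbach_alpha(data):
    """Cronbach's alpha from raw data (rows: respondents, columns: items)."""
    k = len(data[0])
    item_variances = [variance([row[j] for row in data]) for j in range(k)]
    total_variance = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def standardized_alpha(k, mean_inter_item_r):
    """Standardized alpha from k items and their mean inter-item correlation."""
    return k * mean_inter_item_r / (1 + (k - 1) * mean_inter_item_r)


# Toy data with perfectly consistent items (illustration only)
toy_alpha = cronbach_alpha([[0, 0], [1, 1], [2, 2]])

# Cross-check with the values reported in the text: 8 items, mean r = .31
approximate_alpha = standardized_alpha(8, 0.31)
print(toy_alpha, round(approximate_alpha, 2))
```

The standardized value computed from the reported average correlation is approximately .78, in line with the alpha of .77 reported above; the two forms differ slightly because the raw formula also depends on the item variances.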

Rasch model

In line with the recommendations of Linacre [41], the effectiveness of the rating scale categories was tested with a four-category system. Item polarity in the current scale ranges from .43 to .72, with all items aligned in the same direction on the latent variable (Table 7).

Table 7 Effectiveness of rating scale categories of the Epworth Sleepiness Scale

The category statistics are shown in Table 8. The results for the categories are consistent with the guidelines of Linacre [41], indicating the following: at least 10 observations for each category; a unimodal distribution (peak at category 0 and a negative skew to category 3); an average measure that advances monotonically with category, increasing by .89 to 1.55 logits between consecutive categories; Outfit mean-square values around 1.0 for the four categories, all below 2.0, suggesting a reasonably uniform level of randomness in the data; and step calibrations that advance by about .69 to .79 per step, suggesting that people with higher levels of sleepiness are more prone to choose higher categories, which were modal along the variable.

Table 8 Statistics of ESS

As shown in Fig. 2, each category has a real probability of being selected by the sample; the relationship between measure and ratings for each category is considered adequate; and the advances in step difficulty between consecutive categories range from about 1.2 to 1.6 logits. Overall, these results suggest that all the guidelines of Linacre have been satisfied.

Fig. 2 Probability curves of the categories of ESS

The principal component analysis of the residuals from the RM empirically corroborates the assumption of unidimensionality: the analysis of residuals explains no more than 10 % of the variance (9.2 %), and the variance explained by the RM is over 20 % (61 %).

The score statistics are presented in Table 8. Model fit is adequate: no item outfit is over 2 (severe misfit), the percentage of people with outfit over 2 is small, and the average outfit values for items and people are close to 1 (perfect fit). Furthermore, score reliability was considered high, as indicated by the Item Separation Reliability (.99), the Person Separation Reliability (.85), and Cronbach’s alpha (.78).
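Separation reliabilities are often read alongside the corresponding separation index, which expresses how many standard errors of spread the measures cover. The paper does not report separation indices, so the values below are only derived from the reported reliabilities using the standard relation R = G² / (1 + G²); the function name is hypothetical.

```python
import math


def separation_index(reliability):
    """Convert a separation reliability R into a separation index G.

    Inverts R = G**2 / (1 + G**2), i.e. G = sqrt(R / (1 - R)).
    """
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1 - reliability))


person_separation = separation_index(0.85)  # reported person reliability
item_separation = separation_index(0.99)    # reported item reliability
print(round(person_separation, 2), round(item_separation, 2))
```

On this reading, the item hierarchy is very well defined in the sample (index near 10), while the person index of roughly 2.4 indicates that the scale distinguishes a smaller number of sleepiness levels among respondents.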

The item-person map is depicted in Fig. 3, which shows the “ability” of the people and the “difficulty” of the items on the same pathway. A visual inspection of Fig. 3 suggests that item 6 (sitting and talking to someone), item 8 (in a car, while stopped for a few minutes in traffic), and item 3 (sitting inactive in a public place; e.g., a theater or meeting) are the most difficult items. On the other hand, item 2 (watching television) and item 5 (lying down to rest in the afternoon when circumstances allow it) are the least difficult items according to the RM.

Fig. 3 Item-person map

Discussion

To increase variability and prevent floor effects, the statistical analysis was performed on the total sample, including both healthy participants and patients with OSA. However, it is worth noting that the gender proportions differ between OSA patients and healthy participants. In this regard, one normative study using the ESS [43] does not show any effect of gender on the ESS total score, although there is evidence [44] for an impact of respiratory sleep disorders on daytime sleepiness, especially under the age of 40. Such an effect is, however, much less likely in the overall sample, on which the main statistical procedures were carried out.

Item analysis showed that item 6 and item 8 have relevant deviations from the normal distribution (with regard to kurtosis, skewness, and range of values, especially in item 6). In our view, this result may reflect a floor effect, which is often observed in healthy individuals [13], but it is also possible that these items are the ones most affected by social desirability. One possible explanation is that situations such as those described in item 6 (during a conversation with another person) and item 8 (driving a car) are difficult to acknowledge as situations in which one can fall asleep, compared with other contexts in which attention is not socially expected.

The structural validity, evaluated through a principal component factor analysis, initially showed two strongly interdependent factors. However, a forced one-factor analysis shows an interpretable unidimensional structure, with item loadings (.52 to .74) close to those found in a clinical sample with the original English version [28]. The analysis of the item-total correlations supports the internal validity of the scale, with the exception of item 6 and item 8. These results are also consistent with previous data, although they resemble the clinical samples of Bloch and colleagues [13] more than their control samples.

The comparison of the ESS total score between patients diagnosed with OSA and healthy participants reveals adequate criterion validity. It is important to consider that the scores found in our study are lower than those reported for the original version [28], both for patients with OSA and for healthy subjects, and are closer to those of other versions of the ESS found in the literature [9, 13, 27].

One important issue with previous research on this topic is that most studies use convenience samples of college students as controls, who are younger than most patients with OSA. In these versions [9, 13], the mean age of the control group is higher (by about 10 years) and more similar to that of our control sample. Another important confounder in samples of college students is that the need for sleep may vary with the time of the school year. During exam periods, students have fewer hours of sleep, which may increase the propensity to fall asleep in some of the situations described in these measures.

The internal consistency of the total scale, estimated through Cronbach’s alpha, suggests minimal adequacy of the scale according to Nunnally [45]. In addition, the Cronbach’s alpha coefficient found in our study is within the range of values found for clinical and control samples [9, 11, 13, 18, 20, 22, 27].

The RM analysis suggests that the standard version of the scale, with a four-category system, is consistent with the guidelines [41], and that the data fit the Rasch model fairly well. Consistent with previous reports [32, 33], no items with high misfit were found through the RM, which also supported the unidimensionality of the ESS.

Overall, the results from the CMT and RM analyses showed adequate structural, internal, and criterion-related validity of the ESS, as well as adequate reliability, suggesting that it is a useful and effective measure for assessing sleepiness in Portuguese adults.