Introduction

The Dutch Eating Behavior Questionnaire (DEBQ; van Strien et al. 1986) measures three common psychological dimensions of eating behavior: emotional, external and restrained eating. Emotional eating implies an inclination to eat in response to negative emotions such as depression, disappointment and feelings of loneliness. External eating means eating more in response to external food cues such as the sight, smell and taste of food. Restrained eating implies conscious determination and efforts to restrict food intake and calories in order to control body weight.

The DEBQ has demonstrated a stable three-factor structure among different groups (i.e., men, women, obese and non-obese individuals) with high test–retest reliability and internal consistency, as well as convergent and discriminant validity (see van Strien et al. 1986). The DEBQ is increasingly employed as an outcome measure in clinical and research settings (Anderson and Williamson 2002) and has been extensively used for assessing eating behaviors in normal-weight, overweight and obese individuals (Burton et al. 2007; van Strien et al. 2012). Furthermore, the DEBQ is sensitive to changes in eating behavior, as scores decreased after bariatric surgery and behavioral weight-loss treatment in obese individuals (e.g., Pepino et al. 2014).

The test–retest reliability and internal consistency of the DEBQ have been established in earlier work. Cronbach’s alpha coefficients were above the recommended cut-off value of 0.80 for adequate consistency (Tabachnick and Fidell 2007), and test–retest reliability was good (values over 0.80; Bozan et al. 2011; Dakanalis et al. 2013; Dutton and Dovey 2016; Vidal 2006). However, most of these studies employed non-clinical participants who were predominantly normal-weight; furthermore, only one study reported the long-term reliability of the DEBQ, over a 9-month interval (Dutton and Dovey 2016). In that study, sixty women completed the DEBQ 9 months after the initial assessment in order to verify its temporal stability. Scores on each subscale at time 1 and time 2 correlated above 0.90, suggesting good temporal stability; however, that study was also conducted on non-clinical participants. It should be noted that instruments are particularly useful if they can measure significant change in behavior indicators over time (e.g., pre- vs. post-intervention/treatment). Results from pre–post research designs are only credible to the extent that the measure used is not prone to error across administrations, that is, only to the extent that apparent effects are not due to score instability (Tabachnick and Fidell 2007). Although the DEBQ is a widely used questionnaire in research on eating behavior, to our knowledge, no previous studies have examined the long-term stability of the DEBQ in a clinical population. Thus, the goal of the present investigation was to test the reliability of the DEBQ after an interval of 12 months in a clinical sample and to compare it with that in a non-clinical sample. More specifically, our objective in this study was to assess the test–retest stability and internal consistency of the inventory in a clinical sample of pre-bariatric patients and in normal-weight controls.

An additional purpose of the study was to determine whether the three-factor structure would describe the data from both the normal-weight and pre-bariatric samples. Researchers typically assume that the instrument being used operates in the same way and measures the same construct across different groups. However, between-group differences can only be interpreted when measurement invariance holds, i.e., if the same construct is measured in every subgroup (Byrne 2008; Cheung and Rensvold 2002); thus, a further aim of our study was to examine the measurement invariance of the DEBQ across BMI-status. It was hypothesized that the three-factor structure of the DEBQ would be invariant across BMI groups. Finally, the study also aimed to assess the predictive validity of the DEBQ in differentiating between BMI groups (normal-weight and pre-bariatric) and to replicate the association between the DEBQ subscales and BMI-status reported in previous research (see Cebolla et al. 2014; Dakanalis et al. 2013; Wardle 1987).

Participants

The present study was conducted in 241 morbidly obese patients with BMI ≥ 35, who presented for the first time to one of two clinics for the treatment of their obesity, and in 253 age- and gender-matched healthy normal-weight subjects with BMI < 25. The first sample consisted of 241 obese patients (182 women) who were psychologically screened as part of their pre-bariatric surgery assessment. Pre-bariatric patients were enrolled from two German clinics that specialized in the treatment of eating disorders (Berlin and Hannover). All patients were assured that information provided for the present research study would not influence their candidacy for surgery. The mean BMI of this sample was 44.7 (SD = 3.75; range 35–49): 28.6% (N = 69) suffered from obesity grade 2 (BMI 35–39.9) and 71.4% (N = 172) from obesity grade 3 (BMI ≥ 40). The mean age of the sample was 33.9 years (SD = 7.1; range 24–41). The DEBQ was administered as part of a larger battery of diagnostic and clinical interviews by trained and monitored Ph.D. research clinicians. The second DEBQ interviews occurred before any treatment was offered to the pre-bariatric individuals.

The control group comprised 253 normal-weight volunteers (185 women) with BMI < 25 and similar characteristics (age, sex) to those of the patients. These individuals, ranging in age from 26 to 40 years (M = 35.1, SD = 5.18), were recruited to participate in an anonymous study on eating patterns. Their average BMI was 23.9 (SD = 2.08). Statistical analyses revealed that, consistent with the inclusion criteria, the two groups (pre-bariatric and control) did not differ significantly in sex or age (both p-values > 0.05) but differed in BMI (t = 26.57, p < 0.01).

Study inclusion criteria for both the patients and the control group were as follows: being over the age of 18 years and having no medical or psychiatric disease that would pose an obstacle to understanding the scales. We excluded potential subjects who had diabetes, smoked cigarettes, or had previous intestinal surgery, inflammatory intestinal disease, a history of chronic rhinitis, severe organ dysfunction, bulimia, or any substance or alcohol abuse (for example, individuals answered yes/no questions: “Have you ever received medical attention or treatment for a psychological/psychiatric illness?”; “Have you ever been dependent upon alcohol or any drugs?”). Additionally, only patients who were seeking a first bariatric surgical procedure were recruited. Participation in both studies was voluntary and no incentives were given for participation. All respondents taking part in the research provided informed consent after the nature of the study had been explained to them. The mean time between completion of the first and second phases was 344.8 days (SD = 23.59; range = 309–388). This study was approved by the University Human Participants Review Committee in accordance with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Dutch eating behavior questionnaire

The subjects completed the translated version of the DEBQ (van Strien et al. 1986; German version by Nagl et al. 2016), which consisted of 33 items, grouped in three subscales—emotional eating, which included 13 items (e.g., “Do you have a desire to eat when you are depressed or discouraged?”), external eating, including 10 items (e.g., “If you have something delicious to eat, do you eat it straightaway?”), and restrained eating, which included 10 items (e.g., “If you have put on weight, do you eat less than you usually do?”). As in the original version, responses were given on a 5-point Likert scale from “never” to “very often”.
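
As an illustration of how responses to the 33 items might be turned into subscale scores, the following Python sketch averages the item responses within each subscale. It is a minimal sketch under stated assumptions, not the scoring procedure used in the study: the item ordering (contiguous blocks per subscale), the 1–5 numeric coding of the Likert responses, and the use of the item mean as the subscale score are assumptions, and any reverse-keyed items are ignored.

```python
from statistics import mean

# Hypothetical item layout (an assumption, not taken from the questionnaire):
# responses are stored 0-indexed in contiguous blocks, one block per subscale.
SUBSCALES = {
    "emotional": range(0, 13),    # 13 items
    "external": range(13, 23),    # 10 items
    "restrained": range(23, 33),  # 10 items
}


def score_debq(responses):
    """Return the mean item score for each subscale.

    `responses` is a list of 33 integers coded 1 ("never") to 5 ("very often").
    """
    if len(responses) != 33:
        raise ValueError("expected 33 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")
    return {
        name: mean(responses[i] for i in items)
        for name, items in SUBSCALES.items()
    }


# Hypothetical respondent, for illustration only.
example = [3] * 13 + [4] * 10 + [2] * 10
scores = score_debq(example)
```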

Anthropometric measurement

Previous studies have shown that, compared with direct measurement, women tend to overestimate their height and underestimate their weight in self-reports (Engstrom et al. 2003). Thus, for both the pre-bariatric and the normal-weight groups, height and weight were measured directly, with weight taken on an electronic scale while participants were lightly dressed and without shoes.
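
For illustration, BMI can be derived from measured weight and height and mapped to the groups described in the Participants section. The sketch below assumes the standard definition of BMI (weight in kilograms divided by squared height in meters) and uses the cut-offs reported in this paper; the function names and the example measurements are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def bmi_group(value):
    """Map a BMI value to the study groups, using the cut-offs in the text."""
    if value >= 40:
        return "obesity grade 3"
    if value >= 35:
        return "obesity grade 2"
    if 18 < value < 25:
        return "normal weight"
    return "not eligible"  # outside both study groups


# Hypothetical measurements, for illustration only.
control = bmi_group(bmi(70.0, 1.75))
patient = bmi_group(bmi(130.0, 1.70))
```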

Results

Internal consistency reliability

Internal consistency reliability statistics (Cronbach’s α) were calculated for pre-bariatric participants, normal-weight individuals, and all respondents from both subsamples, for the overall DEBQ scores, and for each of the content scores: emotional, external and restrained eating scales. Results are displayed in Table 1.

Table 1 Alpha coefficients and correlations across test sessions 1 and 2

For the total sample (both pre-bariatric and normal-weight individuals included), internal consistency was excellent at the first measurement time (session 1). Reliability for the total DEBQ score was 0.93. Reliability was also high for the content components, ranging from 0.86 (external eating scale) to 0.90 (restrained eating scale). When data from normal-weight individuals were excluded (pre-bariatric participants, N = 241) and the coefficients were recalculated, internal consistency for the overall DEBQ (α = 0.87) and content score reliability (ranging from 0.81 for the restrained eating scale to 0.86 for the emotional eating scale) remained high. For subsample 2 (N = 253 normal-weight individuals; pre-bariatric participants excluded), overall internal consistency was high (α = 0.88). Among the content scores, the external eating scale yielded the best reliability (α = 0.94), and the emotional eating scale produced the lowest (α = 0.84).

Internal consistency was also calculated for the 494 participants who completed the DEBQ a second time 12 months after the original administration. Alpha coefficients for this second administration were high for the DEBQ overall (α = 0.93), as well as for the items measuring each scale (α from 0.89 to 0.92). Again, when the data from the normal-weight participants were excluded and the coefficients recalculated, internal consistency remained high for the DEBQ overall (α = 0.86), as well as for the three content scores, which ranged from 0.83 to 0.92. When the data from the pre-bariatric subsample were excluded, overall internal consistency was 0.91, and the coefficients for the content scores ranged from 0.82 to 0.87.
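
The coefficients reported above are Cronbach’s α values. As a minimal sketch of the computation, the Python function below applies the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score) to a respondents-by-items matrix. The data are hypothetical and the function name is illustrative; the software actually used in the study is not specified here.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical responses from five people to three items (illustration only).
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
    [3, 3, 4],
]
alpha = cronbach_alpha(responses)
```

With real data, the same function would be applied separately to the items of each subscale and to all 33 items, once per subsample and per session.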

Test–retest reliability

Of the original sample, all participants (N = 494) completed the DEBQ again after a delay of 12 months. Test–retest reliability was determined by calculating intra-class correlations (ICCs) between initial and follow-up scores obtained after 12 months for the overall score on the DEBQ, as well as for each scale separately (see Table 1 for results). A two-way mixed-effects model was used in the present test–retest reliability study because repeated measurements cannot be regarded as randomized samples. In addition, the absolute agreement definition should always be chosen for test–retest studies because measurements would be meaningless if there were no agreement between repeated measurements (McGraw and Wong 1996). The ICC is now widely used in conservative care medicine to evaluate inter-rater, test–retest, and intra-rater reliability (for a review, see Koo and Li 2016).
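
To make the ICC definition concrete, the sketch below computes a two-way, absolute-agreement ICC from the mean squares of a subjects-by-sessions table, using the single-measurement form ICC = (MSR − MSE) / (MSR + (k − 1)MSE + (k/n)(MSC − MSE)). Whether the study reported the single-measurement or the average-measurement form is not stated, so the choice of the single-measurement form is an assumption, and the scores are hypothetical.

```python
def icc_absolute_agreement(scores):
    """Two-way, absolute-agreement, single-measurement ICC.

    `scores` is a list of subjects, each a list of k repeated measurements.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(col) / n for col in zip(*scores)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores at session 1 and session 2 (illustration only):
# every subject gains 2 points, so rank order is perfectly preserved.
shifted = [[10, 12], [20, 22], [30, 32]]
icc = icc_absolute_agreement(shifted)
```

In this example the ICC falls slightly below 1 even though the two sessions are perfectly correlated, because the absolute-agreement definition penalizes the systematic shift between sessions.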

Temporal stability of the total DEBQ score was adequate for the whole sample (0.91). Test–retest coefficients for the component scores were generally adequate, ranging from 0.81 (restrained eating scale) to 0.88 (emotional eating scale). A second analysis was conducted on data from participants who were seeking bariatric surgery. For these participants, test–retest reliability was 0.89 for the overall DEBQ score, and content scores ranged from 0.79 for restrained eating to 0.86 for external eating. By comparison, test–retest reliability for the normal-weight individuals was 0.90 for the overall score, whereas the coefficients for the individual scales ranged from 0.84 (restrained eating) to 0.92 (external eating). Overall, the stability observed in the pre-bariatric population was comparable to that observed for the normal-weight individuals.

Factor structure and measurement invariance across BMI-status

Next, to study measurement invariance across BMI-status, successive multi-group CFAs were conducted (Byrne 2008; Cheung and Rensvold 2002), with ML estimation, to test for configural, factor loading, and intercept invariance. To this end, the sample was split by the variable of interest into two BMI groups (van Strien et al. 2009, 2012): normal weight (i.e., 18 < BMI < 25; n = 253) and pre-bariatric (i.e., BMI ≥ 35; n = 241).

The model is considered to have an acceptable fit if the GFI (goodness-of-fit index), TLI (Tucker–Lewis index), and CFI (comparative fit index) values are approximately 0.90 or above. The RMSEA (root mean square error of approximation) estimates the error of approximation in the population; a value of approximately 0.05 or less would indicate a close fit, and a value of up to 0.08 would represent a reasonable fit of the model. We note, however, that the choice of indices and cutoff values is a topic surrounded by considerable controversy (see, e.g., Mulaik 2007).
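
For readers who wish to see how these indices relate to the model χ2, the sketch below computes CFI, TLI and RMSEA from the χ2 and degrees of freedom of the tested model and of the baseline (independence) model, using the commonly cited single-group formulas. All numeric inputs are hypothetical and are not results of this study; SEM software may apply slightly different conventions (for example, N rather than N − 1 in the RMSEA, or a multi-group correction).

```python
from math import sqrt


def fit_indices(chi2_model, df_model, chi2_base, df_base, n):
    """Return CFI, TLI and RMSEA from chi-square statistics (single group)."""
    d_model = max(chi2_model - df_model, 0.0)   # model noncentrality
    d_base = max(chi2_base - df_base, 0.0)      # baseline noncentrality
    denom = max(d_model, d_base)
    cfi = 1.0 - d_model / denom if denom > 0 else 1.0
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    rmsea = sqrt(d_model / (df_model * (n - 1)))
    return {"CFI": cfi, "TLI": tli, "RMSEA": rmsea}


def rmsea_label(rmsea):
    """Classify RMSEA using the cut-offs given in the text."""
    if rmsea <= 0.05:
        return "close fit"
    if rmsea <= 0.08:
        return "reasonable fit"
    return "poor fit"


# Hypothetical chi-square values (illustration only). The degrees of freedom
# assume 33 indicators, three correlated factors and no correlated residuals.
indices = fit_indices(chi2_model=990.0, df_model=492,
                      chi2_base=6000.0, df_base=528, n=494)
label = rmsea_label(indices["RMSEA"])
```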

Next, we followed the suggestion of Vandenberg and Lance (2000) to test measurement invariance in progressively restrictive stages. In step 1, we conducted a configural invariance test to establish a baseline model across groups. When this type of invariance is met, it only suggests that the factor structure is similar, but not necessarily equivalent, across groups. In step 2, we tested the metric invariance model, in which the factor loadings are constrained to be equal across groups. Metric invariance is a weak form of invariance; when it holds, different groups responded to the indicators in the same way. In step 3, we tested the scalar invariance model, in which the factor loadings and indicator intercepts were constrained to be the same across groups. Scalar invariance is a strong form of invariance and must be established before latent means can be compared across groups. As the χ2-difference test is very sensitive to large sample sizes, metric and scalar invariance were evaluated by assessing the change in CFI (ΔCFI). According to Cheung and Rensvold (2002), metric or scalar invariance is not supported if the CFI decreases by more than 0.01 relative to the preceding, less constrained model.
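
The stepwise decision rule described above can be sketched as a short Python function that walks through the nested models and stops at the first step where the CFI drops by more than 0.01. This is an illustrative sketch only: it assumes the configural model already fits acceptably, the function name is hypothetical, and the CFI values are invented for the example rather than taken from Table 2.

```python
STEPS = ["configural", "metric", "scalar"]


def highest_invariance_level(cfi_by_step, threshold=0.01):
    """Return the most restrictive invariance level supported by the ΔCFI rule.

    `cfi_by_step` maps each step name to the CFI of the corresponding model.
    """
    supported = "configural"  # assumes the baseline model fits acceptably
    for previous, current in zip(STEPS, STEPS[1:]):
        drop = cfi_by_step[previous] - cfi_by_step[current]
        if drop > threshold:
            break  # the added constraints worsen fit too much
        supported = current
    return supported


# Hypothetical CFI values (illustration only).
invariant = highest_invariance_level(
    {"configural": 0.921, "metric": 0.918, "scalar": 0.912})
loadings_differ = highest_invariance_level(
    {"configural": 0.921, "metric": 0.905, "scalar": 0.903})
```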

First, the three-factor model had a good fit to the data from the whole sample. Second, the factor structures of the data for the normal-weight (n = 253) and pre-bariatric (n = 241) populations were examined separately with CFA. The analysis for the normal-weight subsample showed that the three-factor model had a good fit to the data according to the goodness-of-fit indicators. The CFA for the pre-bariatric participants showed that the three-factor model had an acceptable fit to the data.

Next, configural, weak and strong invariance across BMI-status were examined (see Table 2 for the results for measurement invariance of the DEBQ across BMI-status). The progressive invariance tests showed that the configural invariance model across BMI-status demonstrated acceptable model fit. The metric invariance test constrained the factor loadings to be equal across levels of BMI. This model fitted the data well; additionally, the change in CFI between the configural and metric invariance models was within the threshold of 0.01, supporting metric invariance across BMI-status. Finally, the scalar invariance test demonstrated that the indicators’ intercepts were invariant across levels of BMI, as the CFI change between the metric and scalar invariance models was not greater than 0.01; thus, configural, weak and strong invariance of the DEBQ across BMI-status was supported.

Table 2 Goodness-of-fit for the DEBQ three-factor model across different groups and measurement invariance across BMI

Group differences and correlational analysis

All analyses reported in the following were performed on data from the first measurement time. For the pre-bariatric participants there was a moderate positive correlation between the emotional and external eating factors (r = 0.53; p < 0.001). A significant relationship was found between restrained and emotional eating (r = 0.31; p < 0.001). Finally, restrained eating was significantly associated with external eating (r = 0.28; p < 0.001). A second analysis, conducted on data from the normal-weight participants, revealed comparable results. Significant positive correlations between subscales were found (restraint–external eating: r = 0.23; restraint–emotional eating: r = 0.38; emotional eating–external eating: r = 0.57; all p-values < 0.001). Next, after applying a Bonferroni correction for multiple comparisons, t-tests for group differences revealed that pre-bariatric individuals with obesity scored significantly higher on all the DEBQ subscales than normal-weight individuals (all p-values < 0.05).
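
As a brief illustration of the Bonferroni correction mentioned above, the sketch below multiplies each raw p-value by the number of comparisons (capping the result at 1) and compares the adjusted value with the significance level. The p-values and labels are hypothetical and do not correspond to the study’s results; the function name is illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Return {label: (adjusted_p, is_significant)} for a dict of raw p-values."""
    m = len(p_values)  # number of comparisons
    results = {}
    for label, p in p_values.items():
        adjusted = min(1.0, p * m)
        results[label] = (adjusted, adjusted < alpha)
    return results


# Hypothetical raw p-values from three t-tests (illustration only).
raw = {"test_a": 0.001, "test_b": 0.004, "test_c": 0.030}
corrected = bonferroni(raw)
```

In this invented example the third comparison would be significant before correction but not after it, which is the kind of case the correction is meant to guard against.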

Discussion

The goal of this investigation was to determine two aspects of the reliability of the DEBQ (van Strien et al. 1986) and to extend such reliability measures beyond non-clinical individuals. More specifically, our objective in this study was to assess the test–retest stability and internal consistency of the inventory in a clinical sample of pre-bariatric patients and in normal-weight controls after an interval of 12 months.

Internal consistency

Internal consistency is important because the validity of a screening measure is constrained by its reliability. The required reliability of the scores depends on their intended uses (Tabachnick and Fidell 2007). When scores are to be used for clinical assessment and individual scores will be interpreted, reliabilities over 0.80 or 0.90 are needed. Internal consistency values (i.e., Cronbach’s alpha) obtained for all subscales of the DEBQ were above the recommended cut-off value of 0.80 for adequate consistency (Furr and Bacharach 2008). The scores from the three scales thus satisfied the requirement for basic research based on summed scores, as well as for applications in which individual scores are interpreted. In the present study, in terms of internal consistency, the DEBQ performed well regardless of whether normal-weight individuals were included in, or excluded from, the analyses. Cronbach’s alpha coefficients for emotional, external and restrained eating all exceeded the recommended cut-off value of 0.80 for adequate internal consistency (Tabachnick and Fidell 2007); the lowest observed Cronbach’s alpha was 0.81. As a point of comparison, these results are at least comparable to those reported in previous studies, where internal consistency also exceeded 0.80 (e.g., Dakanalis et al. 2013; Vidal 2006).

Test–retest reliability

With regard to temporal stability, results indicated that the DEBQ subscales have high test–retest reliability over a 12-month period, and the results of the study are close to the test–retest reliability coefficients obtained in previous studies of non-clinical samples (Bozan et al. 2011; Dakanalis et al. 2013; Vidal 2006). The stability of eating behaviors is also fairly strong. The total DEBQ score (rtt = 0.91) fell well within the limits of acceptable reliability (Tabachnick and Fidell 2007). Additionally, stability was comparable for pre-bariatric individuals (from 0.79 to 0.86 for the DEBQ subscales) and for normal-weight individuals (from 0.84 to 0.92 for the DEBQ subscales). This finding is informative and important with respect to extending the utility of the DEBQ as a clinical instrument, supporting the utility and feasibility of using the DEBQ for the assessment of eating behaviors in clinical populations. For both subsamples (pre-bariatric and normal-weight), the lowest correlation was observed for the restrained eating scale and the highest correlation was observed for the external eating scale. The finding that the stability of the restraint subscale was lower than that of the other subscales may reflect the fact that items of this scale address more behavioral aspects of eating, such as avoidance of particular foods and adherence to specific dietary rules. These results may suggest that at least some aspects of eating behavior are not stable traits; however, it is acknowledged that further evidence is required before such a conclusion can be drawn.

Measurement invariance, group differences and correlational analysis

Although not a central question of the current research, the inter-factor correlations between the three eating scales of the DEBQ followed the expected pattern, given the results of previous studies (e.g., Baños et al. 2011; Dakanalis et al. 2013; Malesza 2019; Ricca et al. 2009). The most closely related dimensions were emotional and external eating. These results are comparable with previous findings and theoretical proposals about different eating patterns (Baños et al. 2011; Dakanalis et al. 2013; van Strien et al. 1986, 2012). The strong relationship between the emotional and external eating styles can be explained by Heatherton and Baumeister’s (1991) notion that negative emotions may cause individuals to increase their awareness of the immediate food environment and decrease their awareness of the self (Dakanalis et al. 2013).

This study also assessed the DEBQ’s predictive value in differentiating between normal-weight and obese pre-bariatric individuals. In agreement with prior research (Baños et al. 2011; Cebolla et al. 2014; Dakanalis et al. 2013; van Strien et al. 2012), obese individuals had higher external, emotional, and restrained eating scores than individuals with normal weight. In line with prior research (Baños et al. 2011; Caccialanza et al. 2004; van Strien et al. 1986, 2012), our study also found that restrained eating is more frequent in obese than in normal-weight individuals, which suggests that, paradoxically, this type of behavior may be a risk factor for overeating and weight gain when self-control is lacking (van Strien et al. 2012). This finding might indicate that the subjects are aware of their weight status and of the need to decrease their weight (Vidal 2006). Further, higher emotional and external eating scores in the obese group may lead to higher food intake (van Strien et al. 2000), thus maintaining excess weight, which in turn might trigger dieting behavior and ensuing higher restraint scores (Vidal 2006). These findings also give credence to the arguments of restraint theory (Herman and Polivy 1980), whereby persons intending to diet may hold two incompatible goals such that periods of restraint may be accompanied by periods of emotional eating (see Elfhag and Morey 2008). Thus, each eating style may be a risk factor for overeating and weight gain.

Finally, a CFA was performed to test the adequacy of the structure of this model. Factor analysis results and scale psychometrics supported a three-factor solution, reflecting the anticipated three types of eating behavior, as empirically established in previous studies (van Strien et al. 1986). Additionally, we found that the DEBQ had a three-factor structure for both normal-weight and pre-bariatric individuals. Furthermore, this research demonstrated that configural, metric and scalar invariance hold across BMI-status. Thus, BMI-status differences in DEBQ mean scores and relationships with other variables are not due to instrument artifacts (Byrne 2008; Cheung and Rensvold 2002).

Limitations and future research

The present research has a few limitations. First, the study focused exclusively on young adults aged 24 to 41 years; therefore, future research needs to determine whether the questionnaire is valid in other samples (e.g., older populations). Second, our sample size was relatively small; thus, the findings require replication in larger samples. Finally, in the current study, patients with pathological eating behavior such as bulimia were excluded from participation. Since the DEBQ is intended to assess eating behavior, it would be of particular interest to include such a clinical sample in future research. Bearing these limitations in mind, it can be concluded that the DEBQ is an instrument with adequate reliability that can assess adult eating behaviors in clinical and non-clinical populations.