Introduction

Body dissatisfaction is considered to be one of the major risk factors for eating disorders and is a condition present in a significant number of people of all ages, from childhood to adulthood; however, body perception changes in different stages of development [1], and there are critical moments for body dissatisfaction, for example, in adolescence [2].

The Body Shape Questionnaire (BSQ-34 [3]) is one of the most widely used self-report measures, both in clinical practice and in research; it was created to assess the phenomenological experience of feeling fat and included items to evaluate dissatisfaction with body size and shape. The original questionnaire consists of 34 items. In subsequent studies, eight different short versions have been proposed: four included eight items (BSQ-8A, BSQ-8B, BSQ-8C and BSQ-8D [4]), one included 10 items (BSQ-10 [5]), one included 14 items (BSQ-14 [6]) and two included 16 items (BSQ-16A and BSQ-16B [4]).

The BSQ has been translated or adapted in 13 countries: Germany, Brazil, Colombia, Spain, the United States, France, Japan, Korea, Malaysia, Mexico, Portugal, Switzerland and Turkey [7]. Its psychometric properties have been analyzed in different studies, which found excellent internal consistency for the BSQ-34 (α = .93–.98) in both clinical and nonclinical samples [4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Moreover, test-retest reliability was adequate at 2 weeks (rxx = .90 [12, 16]), 3 weeks (rxx = .88 [17]) and 1 month between applications (rxx = .81–.95 [15, 18]).

The factorial structure of the BSQ-34 is controversial because of the variety of structures identified using both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). One [4, 15], two [8, 9], three [18], four [11, 15] and up to five [14] factors have been identified using EFA. By contrast, only three studies have tested a one-factor model with CFA: one found good fit indices [5], and the other two did not obtain adequate goodness-of-fit indices [7, 19]. Likewise, one study showed that the BSQ-34 had factorial invariance in cross-cultural and cross-linguistic comparisons between women from two countries [7], and another study found no invariance across gender [20].

Analyses of the short versions of the BSQ have found that their internal consistency is also good (α = .83–.98); the version with 14 items and the two versions with 16 items showed greater internal consistency (α = .93–.97 [4, 5, 6, 7, 12, 13, 21, 22]) than did the versions with eight items (α = .83–.98 [4, 5, 7, 13, 16]). Likewise, the factorial structure of the short versions has been examined with CFA, and unidimensionality was confirmed only for some models: the BSQ-14 [12, 21], the BSQ-10 [5], the BSQ-8B [7, 13, 20] and the BSQ-8C [13, 16].

However, only four studies have tested the measurement invariance of the short versions: the BSQ-10 showed metric invariance in cross-cultural and cross-linguistic comparisons between women from two countries [5]; the BSQ-16A and BSQ-8B showed strong invariance when applied to independent samples of women [7]; the BSQ-8 (the authors did not specify which version) presented metric invariance between samples of Brazilian and Portuguese women [19]; and the BSQ-8B was not invariant between women and men [20]. Therefore, there are no solid and sufficient findings regarding the factor structure of the BSQ short versions or their behavior in populations that differ by age, sex or some other characteristic. Additionally, the short versions were also correlated with other instruments, such as the Eating Attitudes Test, the Bulimic Investigatory Test of Edinburgh [6], the Body Weight and Body Shape Concerns test [7] and the Body Dissatisfaction subscale of the Eating Disorders Inventory [12].

These findings showed that BSQ-34 scores are valid and reliable; however, the scores of some short versions show better psychometric properties than the original version. In spite of these findings, no studies have jointly analyzed which of the eight short versions shows better psychometric properties in women of different ages. In addition, studies have been conducted without considering the body perception changes that occur with age and the possibility that the scale is not equivalent or invariant among women of different age groups.

Therefore, the aim of the present study was to analyze the factor structure of the eight short versions of the BSQ and to analyze the factorial equivalence of the best model, its convergent validity and its internal consistency in Mexican women in three age ranges: 12–15 years, between 15 and 18 years, and between 18 and 25 years.

Methods

Participants

The total sample comprised 802 women with three different educational levels. “Sample A” consisted of middle-school students (n = 261), “sample B” consisted of high-school students (n = 245), and “sample C” consisted of university students (n = 296), with an average age of 13.49 years (SD = 1.03 years), 16.09 years (SD = 1.09 years) and 19.55 years (SD = 2.16 years), respectively. These women attended public institutions of Ciudad Guzmán, a small city located in the southern region of the state of Jalisco, Mexico. For this study, educational levels corresponded to the age ranges indicated in the objective.

Instruments

The Body Shape Questionnaire (BSQ-34 [3]) was adapted to Spanish (Spain) by Raich et al. [14]. This instrument evaluates dissatisfaction with body image due to body weight and shape. The BSQ consists of 34 direct items with six response options on a Likert scale ranging from never = 1 to always = 6. The study for the adaptation and validation of the BSQ-34 in Mexican women showed excellent reliability (α = .98 [8]). The eight short versions were analyzed in the present study: the two 16-item versions by Evans and Dolan [4], the BSQ-14 by Dowson and Henderson [6], the BSQ-10 by Warren et al. [5], and the four 8-item versions by Evans and Dolan [4].

The Eating Attitudes Test (EAT-40 [23]) evaluates abnormal eating behaviors associated with eating disorders. The EAT-40 is a 40-item questionnaire with six response options on a Likert scale from never to always, and each item is rated on a scale from 0 to 3. A total score that fluctuates between 0 and 120 is calculated, with a higher score indicating a greater presence of abnormal eating behaviors. The EAT-40 was validated in Mexico and showed an internal consistency of α = .90 in the clinical sample and α = .93 in the total sample. The cut-off point for abnormal eating behaviors is 28 [24].

The Questionnaire on Influences of the Aesthetic Body Model (CIMEC, its acronym in Spanish [25]) evaluates the influence of social agents and situations in which the aesthetic model of thinness is promoted. The CIMEC consists of 40 items with three response options. The analysis of the psychometric properties of the CIMEC in Mexican women showed adequate internal consistency (α = .94), and a four-factor structure was identified [26].

Procedure

The research was conducted in educational institutions, and the institutions’ directors allowed the research team to collect data during class time. The objective of the research was explained, and the women’s voluntary participation in the study was requested. Participants who provided written informed consent were asked to complete the questionnaires. For women in middle school and high school, their parents also needed to provide informed consent. Those who agreed to participate needed approximately 15 min to complete the questionnaires. At all times, one of the researchers in charge remained in the classroom to answer any questions and to ensure that participants did not share or discuss their responses. Participants did not receive any remuneration or other form of inducement for their participation. Rates of refusal were calculated. For the total sample, 4.74% (n = 38) of women refused to participate: 20 from middle school, 10 from high school and 8 from university. Reasons for refusal were not requested.

The present research was carried out with strict adherence to the Code of Ethics for Psychologists [27] and the ethical principles of the American Psychological Association [28]. The protocol was approved by the Bioethics Committee of the Centro Universitario del Sur of the Universidad de Guadalajara.

Statistical analysis

The analysis was performed in AMOS 21.0 (Analysis of Moment Structures). The internal structure of the scale was examined with confirmatory factor analysis (CFA), which was used to determine the following: (1) whether the original 34-item scale fits a one- or two-factor model; (2) how well the short versions of the BSQ-34 perform against the full version of the questionnaire; and (3) the factorial equivalence of the best version of the BSQ in the three age groups, to show that equal scores can lead to equal interpretations of results among women of different ages [29].

The CFAs were performed using a random subsample of 50% of the data from groups “A”, “B” and “C” and employed the maximum likelihood (ML) estimation method. A critical assumption of ML estimation is that the data are multivariate normal. As a prerequisite, we reviewed univariate kurtosis and its critical ratio; values ranged from −4.17 to 1.3. Kline (2005) considers absolute values equal to or greater than 7 to be indicative of departure from normality. Mardia’s coefficient, which evaluates multivariate normality, was equal to 4.76; thus, the data were considered multivariate normal. The ML method provides consistent, efficient, and unbiased parameter estimates as well as an omnibus test of model fit [30]. The Chi-square statistic (χ2) is presented to assess the fit of each model. Since this indicator is sensitive to sample size, other model fit indicators are used as well [31], such as the comparative fit index (CFI), the Tucker–Lewis index (TLI), and the root mean square error of approximation (RMSEA) and its corresponding 90% confidence interval (90% CI). Acceptable model fit is defined according to the following criteria: RMSEA < .08 (90% CI), CFI > .90, and TLI > .90; however, a good fit of the model is attained when RMSEA < .05, CFI > .95, and TLI > .95 [31, 32]. We also used the Akaike information criterion (AIC) in the comparison of models, with smaller values representing a better fit of the hypothesized model [33].
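The fit criteria above can be written as a small decision rule. The following is a minimal sketch, not the AMOS procedure used in the study: the function names `classify_fit` and `best_by_aic` and the example values are hypothetical, and only the cut-offs are taken from the text.

```python
def classify_fit(rmsea: float, cfi: float, tli: float) -> str:
    """Label a model 'good', 'acceptable' or 'poor' using the cut-offs in the text."""
    if rmsea < 0.05 and cfi > 0.95 and tli > 0.95:
        return "good"
    if rmsea < 0.08 and cfi > 0.90 and tli > 0.90:
        return "acceptable"
    return "poor"


def best_by_aic(aic_by_model: dict) -> str:
    """Return the name of the model with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)


# Hypothetical values for illustration only (not taken from Table 2).
example_fit = classify_fit(rmsea=0.06, cfi=0.93, tli=0.92)
example_best = best_by_aic({"model_1": 310.5, "model_2": 120.2})
```

A model is labeled "good" only when all three indices pass the stricter cut-offs, which mirrors the way the criteria are stated jointly in the text.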

Factorial invariance

The invariance or equivalence of a factorial model assesses the psychometric equivalence of a construct across groups. Testing invariance involves the evaluation, by multigroup CFA (MG-CFA), of three basic levels of invariance: (1) configural invariance, (2) measurement invariance and (3) structural invariance [34, 35]. Configural invariance holds when the factorial structure is similar across the different groups. Measurement invariance refers to the degree to which the measurement parameters of each item are similar across groups [36]; this analysis focuses on the invariance of the factorial loadings (metric or weak invariance), the invariance to which the intercepts are added (scalar or strong invariance) and the invariance to which the residuals are added (residual or strict invariance). Finally, structural invariance focuses on the equality of latent or unobserved variables, thus adding constraints to the matrix of factorial variances and covariances.

Testing for factorial equivalence encompasses a series of hierarchical steps that begins with the separate determination of a baseline model for each group. This model represents the one that best fits the data, considering both parsimony and substantive meaningfulness [37]. The importance of this model lies in the fact that it serves as a baseline model (unconstrained model) with which the other models are compared to evaluate equivalence. If the configural model (unconstrained model) has an adequate fit, it can be concluded that both the number of factors and the factorial loadings pattern are similar across the groups, and the evaluation of more restrictive invariance models is justified [38].

To evaluate weak invariance, the baseline model is taken, and the factorial loadings of all items are constrained (made equal) across the three groups. If the overall model fit is significantly worse in the weak invariance model than in the configural invariance model, at least one loading is not equivalent across the groups, and therefore each item loading should be tested. To evaluate strong invariance and strict invariance, global tests are also performed in which the intercepts and the residuals, respectively, are constrained, in addition to the factorial loadings. If the global tests do not meet the criteria of invariance, an analysis of the corresponding parameter by item must be performed; these procedures are also known as Differential Item Functioning (DIF) analysis. Finally, to evaluate structural invariance, the common factor variances are constrained to be equal.
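As a rough illustration of how a set of equality constraints changes the model, the sketch below counts degrees of freedom for a one-factor model fitted to several groups. It assumes a covariance-only model identified by fixing one loading per group to 1 (an assumption; the text does not state the identification used), and the function name `one_factor_df` is hypothetical. Under these assumptions, eight items in three groups give 60 degrees of freedom for the unconstrained model, which agrees with the df = 60 reported in the Results for the BSQ-8D baseline model.

```python
def one_factor_df(n_items: int, n_groups: int, equal_loadings: bool = False) -> int:
    """Degrees of freedom of a multigroup one-factor model (covariances only)."""
    # Unique variances and covariances observed in each group.
    moments = n_groups * n_items * (n_items + 1) // 2
    # One loading per group is fixed to 1 (assumed identification); the rest
    # are estimated per group, or once if constrained equal across groups.
    free_loadings = (n_items - 1) * (1 if equal_loadings else n_groups)
    factor_variances = n_groups
    residual_variances = n_items * n_groups
    return moments - (free_loadings + factor_variances + residual_variances)


configural_df = one_factor_df(8, 3)                  # unconstrained model
weak_df = one_factor_df(8, 3, equal_loadings=True)   # loadings made equal
delta_df = weak_df - configural_df                   # constraints added
```

The difference in degrees of freedom between two nested models equals the number of equality constraints added, which is the Δdf used when the models are compared.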

The likelihood ratio (LR) test and the change in CFI (ΔCFI) are used as equivalence criteria. For the LR test, the invariant (constrained) models are considered to be nested in the baseline (unconstrained) model, and it is, therefore, possible to test for a significant loss of fit in the restricted model. The LR test determines the difference in the Chi-square statistic (Δχ2) between the baseline model and the constrained model. This difference follows a Chi-square distribution, with degrees of freedom equal to the difference between the degrees of freedom of the models that are compared (Δdf), and if this value is significant, then the models are not equivalent. However, since χ2 is sensitive to sample size, the criterion proposed by Cheung and Rensvold [39] was used in addition to the LR test. This criterion refers to the change in the CFI between the models: the authors note that if ΔCFI > .01, then the model in which the parameters are restricted does not hold up. In this study, a lack of invariance was concluded only when both criteria were met.
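The two-part decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the output of AMOS: the function names `chi2_sf` and `lacks_invariance` are hypothetical, ΔCFI is taken as the baseline CFI minus the constrained CFI, and all example numbers are invented for illustration. The Chi-square tail probability is computed with a standard recurrence for integer degrees of freedom so that the block needs only the Python standard library.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, nu = math.exp(-half), 2               # closed form for df = 2
    else:
        q, nu = math.erfc(math.sqrt(half)), 1    # closed form for df = 1
    # Recurrence: Q(x; nu + 2) = Q(x; nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        q += math.exp((nu / 2.0) * math.log(half) - half - math.lgamma(nu / 2.0 + 1.0))
        nu += 2
    return q


def lacks_invariance(chi2_base, df_base, cfi_base,
                     chi2_constr, df_constr, cfi_constr,
                     alpha=0.05, cfi_cut=0.01):
    """Apply both criteria: a significant LR test AND a CFI change above cfi_cut."""
    p_value = chi2_sf(chi2_constr - chi2_base, df_constr - df_base)
    # Rounded to three decimals to avoid floating-point noise at the threshold.
    delta_cfi = round(cfi_base - cfi_constr, 3)
    return {
        "p": p_value,
        "delta_cfi": delta_cfi,
        "lacks_invariance": p_value < alpha and delta_cfi > cfi_cut,
    }


# Hypothetical models: the LR test is significant, but the CFI barely changes.
result = lacks_invariance(180.0, 60, 0.960, 204.31, 74, 0.957)
```

In this hypothetical case the LR test alone would reject equivalence, but because the change in CFI stays below .01 the joint rule does not conclude a lack of invariance.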

Reliability and validity evidence

To analyze reliability, Cronbach’s alpha coefficient (α) and the omega coefficient (ω) were calculated. Because the estimation of the alpha coefficient is affected by the number of response alternatives, the omega coefficient is a better estimator, given that it considers the factor loadings; thus, it is a more stable measure of reliability [40]. To compute omega, the factor loadings of the best-fitting model were used, according to the process described by Viladrich et al. [41]. The construct validity for the best-fitting model was analyzed, and Pearson’s correlation was calculated with the EAT-40 and the CIMEC. Furthermore, the scores for the women with and without abnormal eating behaviors were compared.
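The two reliability coefficients can be sketched as follows. This is an illustration under assumptions rather than the exact procedure of Viladrich et al. [41]: the function names are hypothetical, alpha is computed from raw item scores with sample variances, and omega assumes a one-factor model with standardized loadings and uncorrelated residuals, so that each residual variance equals one minus the squared loading.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def omega_from_loadings(loadings):
    """Omega for a one-factor model, from standardized factor loadings."""
    common = sum(loadings) ** 2                      # variance due to the factor
    unique = sum(1 - lam ** 2 for lam in loadings)   # summed residual variances
    return common / (common + unique)


# Hypothetical inputs for illustration only (not the study data).
alpha_example = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
omega_example = omega_from_loadings([0.7] * 8)
```

Because omega weights each item by its loading, it does not require the items to contribute equally to the total score, which is the property the text refers to.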

Results

Normative data

Table 1 shows the mean and standard deviation of the original BSQ and each of the short versions that were evaluated for each group and the total sample, as well as the statistical tests of differences between groups and their level of significance.

Table 1 Normative data for each version of the BSQ

CFA and comparison models

Table 2 shows the evaluation of the two-factor solution and the one-factor solution of the original BSQ-34 and of each short version of the BSQ-34 (the two versions with 16 items, the version with 14 items, the version with 10 items and the four 8-item versions) using CFA. Goodness-of-fit indices were analyzed for the nine models evaluated. In the original 34-item scale, none of the models showed a good fit, although the two-factor model performed better. In the 16-item versions, the BSQ-16B showed better fit indices, and version A did not have a good model fit. All other short versions of the BSQ showed good model fit indices, with better indices for the 8-item versions, except for the BSQ-8A. In the model comparison, the BSQ-8D showed better overall fit indicators and the lowest AIC; thus, it was retained as the final model.

Table 2 Values of goodness-of-fit indices for confirmatory factor analysis models

For the BSQ-8D version, the factorial weights for each item in each of the samples and in the total sample are shown in Table 3. All the standardized factorial weights were > .50 and ranged from .60 to .84 for sample A, .61 to .88 for sample B, .54 to .86 for sample C, and .61 to .84 for the total sample.

Table 3 Standardized factorial loadings (λ) of the BSQ-8D in each of the samples and in the overall sample

Analysis of the factorial invariance of the BSQ-8D by age

The MG-CFA was performed to test the validity of the BSQ-8D in the different age groups. Table 4 shows the Chi-square and degrees of freedom values for each model, as well as the LR test and the ΔCFI used to assess invariance. The fit of the unconstrained model examines the configural invariance of the BSQ-8D; current conventions for evaluating model fit consider the Chi-square test and two additional alternative fit indices. Although the Chi-square test does not show a good fit (χ2 = 175.69, df = 60, p < .05), the other indices contradict this conclusion: the Tucker–Lewis index (TLI = .96) and the root mean square error of approximation (RMSEA = .04) allow us to accept the base model of configural invariance. For the tests of (2) measurement invariance and (3) structural invariance, the models that do not meet the invariance criteria (that is, those with both a significant LR test and a ΔCFI > .01) are shown in italics. The global tests supported metric (weak) and scalar (strong) invariance: although the LR test was significant for (a) measurement loadings (p = .045, ΔCFI = .003) and (b) measurement intercepts (p = .001, ΔCFI = .009), ΔCFI did not exceed .01 in either case. Strict invariance was not supported for (c) measurement residuals (p = .001, ΔCFI = .02). Therefore, the residual of each item was tested; the results show that only items 2 and 30 are equivalent in the three age groups. Structural invariance holds because ΔCFI = .01 does not exceed the .01 threshold, which suggests that the variances of the latent variable are the same for the three groups.

Table 4 Invariance models test for three age groups (sample A: 12–15, sample B: 15–18 and sample C: 18–25 years old)

Reliability

The BSQ-8D showed adequate reliability indicators, as evaluated by Cronbach’s alpha (α = .89) and by the Omega coefficient (ω = .89).

Convergent validity

The correlations between the total scores of the BSQ-8D and the EAT-40 (r = .60, p < .001) and between those of the BSQ-8D and the CIMEC (r = .77, p < .001) were calculated.
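For readers who want to reproduce this kind of analysis, Pearson’s correlation can be computed as in the sketch below. The function name `pearson_r` and the toy data are hypothetical; only the two coefficients squared at the end come from the text. Squaring r gives the proportion of variance two scores share, under the usual assumption of a linear relationship.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data for illustration only (not the study data).
toy_r = pearson_r([1, 2, 3], [2, 4, 6])

# Shared variance implied by the coefficients reported in the text.
shared_variance_eat = 0.60 ** 2     # BSQ-8D with EAT-40
shared_variance_cimec = 0.77 ** 2   # BSQ-8D with CIMEC
```

On this reading, the BSQ-8D shares roughly a third of its variance with the EAT-40 and somewhat more than half with the CIMEC.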

Discriminant validity

Considering the cut-off point of the EAT-40, two groups were identified: women with abnormal eating behaviors (n = 94) and women without abnormal eating behaviors (n = 100). The total score of the BSQ (t = 15.99, p < .001, r = .75) and its eight items discriminated between both groups, with significantly higher scores in women with abnormal eating behaviors.
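The effect size r that accompanies a t statistic is commonly obtained as the square root of t² / (t² + df). The sketch below applies that conversion to the reported values. It assumes an independent-samples t-test with df = n1 + n2 − 2 and assumes that the reported r is this effect size; neither is stated in the text, and the function name `t_to_r` is hypothetical.

```python
import math


def t_to_r(t: float, df: int) -> float:
    """Convert a t statistic to the effect size r = sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t ** 2 / (t ** 2 + df))


# Reported t and group sizes; df assumes a pooled-variance t-test.
r_effect = t_to_r(15.99, 94 + 100 - 2)
```

Under these assumptions the conversion gives approximately .76, close to the reported r = .75; the small gap may reflect rounding or a different degrees-of-freedom calculation.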

Discussion

The objective of the present study was to analyze the factorial structure of the eight short versions of the BSQ and to analyze the factorial equivalence of the best model, its convergent validity and its internal consistency in Mexican women in three age ranges: 12–15 years, between 15 and 18 years, and between 18 and 25 years. First, evidence was generated that did not confirm the two-factor structure for the BSQ-34 previously generated with EFA in Mexican women from the general population and from a clinical sample [8]. In the present study, the two-factor model demonstrated better performance than the one-factor model; however, the goodness-of-fit indices were inadequate.

The models in the different short versions were analyzed. The model with the best fit was the BSQ-8D, and the unidimensionality of this scale was confirmed. This finding is in line with previous studies where unidimensionality of different short versions of the BSQ was analyzed [5, 7, 12, 16, 19, 20, 21].

Once the best model of the different short versions of the BSQ was identified, the MG-CFA was conducted. The main objective of generating evidence for the invariance of an instrument is the establishment of a multigroup reference model that acceptably fits the data. In other words, this reference model allows the researcher to determine whether the same elements are indicators of the same latent factor in each group [42]. The results showed configural, metric (weak) and scalar (strong) invariance, but residual (strict) invariance was not observed. Although residual invariance is a component of full measurement invariance, it is not a prerequisite for testing mean differences because the residuals are not part of the factor. For this reason, this test is often omitted in invariance research [38]. Finally, structural invariance was supported, which suggests that the variances of the latent variable are the same for the three groups.

In the validation study of the original BSQ and in subsequent validations, it has been shown that both the full version and the short version show evidence of convergent validity because their scores are correlated positively with those of other tests validated in the field of eating disorders [6, 7, 12]. In the present study, the score on the BSQ-8D was positively correlated with the score on the EAT-40 (r = .60), one of the most widely used instruments to assess specific characteristics of eating disorders. The BSQ-8D score was also correlated with the CIMEC score (r = .77). That is, the relationship between body dissatisfaction and the presence of abnormal eating behaviors was confirmed [43], as well as between body dissatisfaction and the influence of the aesthetic model of thinness.

As expected, women with abnormal eating behaviors showed significantly higher mean scores, both in the total BSQ and in the eight items, than those who did not show abnormal eating behaviors. These findings demonstrate that the BSQ-8D clearly discriminates between unhealthy and healthy samples, which is consistent with the findings of other studies for both the original scale [4, 5, 7, 13, 16] and for the short versions using clinical samples [3, 8, 14, 22], women considered to have probable cases of bulimia nervosa [3], women with a history of vomiting [6], women with different levels of body dissatisfaction [10], and obese women who diet [17].

The internal consistency of the BSQ-8D was considered to be excellent (α = .89) and was within the range obtained in previous studies for both the short versions (α = .83–.98 [4, 5, 7, 13, 16, 19, 20]) and the original version (α = .93–.98 [4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 19]). Furthermore, this value was confirmed by calculating the omega coefficient. This abbreviated version of the BSQ requires less than 25% of the application time without detectable losses in the metric quality of the instrument. Model comparison is a technique to identify which version of the instrument best measures the construct and, therefore, contains the most appropriate items for it.

One limitation of this study is that it did not include a clinical sample diagnosed with an eating disorder. Furthermore, on the original scale (BSQ-34), some items are not appropriate for men; an advantage of the BSQ-8D is that its items do not show gender bias. Thus, it is suggested that future research incorporate samples of males; thus far, only five studies have analyzed the validity of the BSQ scores in samples of males [10, 12, 16, 17, 22], and three had samples of fewer than 60 participants [12, 16, 17], which is considered an insufficient sample size for performing certain statistical analyses, such as factor analysis.

Given that the BSQ is frequently used in studies to evaluate body dissatisfaction in Mexican women, the main contribution of this study is to provide evidence that the scores of BSQ-8D show adequate psychometric properties. Specifically, the BSQ-8D was found to have a one-factor structure, and it was also found to be a multigroup reference model that acceptably fits the data. Likewise, the scores on this instrument correlate with those of other tests, discriminate between women with and without abnormal eating behaviors and present excellent internal consistency. Therefore, given the low number of items and their psychometric properties, the BSQ-8D is emerging as a useful tool for evaluating body dissatisfaction in different age groups. However, it is important to note that since reliability and validity are not characteristics of the instrument but rather of the instrument scores obtained in a particular sample, it is necessary to evaluate the psychometric properties of the BSQ-8D when it is used with other samples or in other contexts.

What is already known on this subject?

The Body Shape Questionnaire (BSQ) was designed to measure body shape concerns, and it has been demonstrated to be a psychometrically sound measure. However, there is evidence that its items could be redundant, and several short versions of the BSQ have been proposed. It is therefore necessary to evaluate which of the eight short versions shows better psychometric properties in women of different ages.

What does this study add?

The BSQ-8D short version showed good psychometric properties in the current study, and structural invariance was supported across women in three age ranges. This measure could serve as a useful tool to assess body concerns among Mexican women of different ages.