Introduction

Assessment of health-related quality of life (HRQOL) has become widespread in recent years, and among the most widely used questionnaires are the World Health Organization Quality of Life (WHOQOL) instruments [1]. Their appeal stems from their development through international collaboration: initially, 15 centers in 14 countries were involved in the development of the WHOQOL-100 [2, 3]. Thus, the WHOQOL can claim to possess broad cultural applicability [4, 5]. Its psychometric properties are acknowledged to be excellent [6].

As HRQOL measures are commonly taken in conjunction with other measures, participant burden is an important consideration. Therefore, the short version, the WHOQOL-BREF [5], has proven to be very popular [1]. Until now, WHOQOL research in New Zealand [7–11] has used the Australian version of the WHOQOL-BREF [12], as no New Zealand version has been available. However, with the recent establishment of the New Zealand WHOQOL Group [1], more research will be conducted in New Zealand using the WHOQOL-BREF. It is therefore timely to formally establish the suitability of this instrument for general use in New Zealand.

The purpose of the present study was to test the psychometric properties of the WHOQOL-BREF for use in New Zealand. Data were collected using a sample from the general population. Psychometric properties of the instrument were assessed, including tests of the four-domain factor structure using confirmatory factor analysis (CFA) and Rasch analysis.

Methods

Using the New Zealand national electoral roll, 3,000 questionnaires and self-addressed return envelopes were posted out randomly. A cover letter explained the purpose and procedure of the study. Four months after the mailout, 710 questionnaires had been returned (a response rate of 24 %). Since participants aged 18–30 were underrepresented (8.2 %) compared with national census figures (22.8 %) [13], additional data from this group were collected using purposive convenience sampling. Using community group networks, young adults were approached in Auckland (the largest city in New Zealand with a population of approximately 1,400,000) and Palmerston North (population of approximately 82,000). As a result, the total number of participants increased to 808, and the percentage of participants in the category 18–30 years increased to 18.4 %.

Participants completed the Australian version of the WHOQOL-BREF [12]. To control for order effects, half of the questionnaires presented items in a different order. The study was approved in advance by the ethics committee of the authors’ university. CFA was conducted with LISREL v. 8.80 [14] and Rasch analysis with RUMM2030 [15]. All remaining data analyses used SPSS v. 18.0. The CFA used diagonally weighted least squares with polychoric correlations, as the data were ordinal [16, 17]. The four-domain structure of the WHOQOL-BREF was tested by allowing correlations between domains but no correlations between error variances.

Each of the WHOQOL-BREF domains was tested against the polytomous partial credit Rasch model [18] to examine its internal construct validity [19]. Rasch analysis has been explained in detail elsewhere [20–23], so only the analytical concepts are briefly outlined in Table 1. Where problems with response dependency or differential item functioning (DIF) were found, testlets were created to test whether bias canceled out at the test level and whether the testlets removed the dependency in the data [24]. Bonferroni corrections were used to allow for multiple testing.
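As a rough illustration of the partial credit model, the following minimal sketch computes the response category probabilities for a single polytomous item. It is not the RUMM2030 implementation used in the study; the function name and the threshold values are hypothetical and chosen only for illustration.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    theta:      person location in logits.
    thresholds: threshold locations (logits) for the item; an item with
                m thresholds has m + 1 response categories.
    """
    # Category x has numerator exp(sum over k <= x of (theta - tau_k));
    # the lowest category has an empty sum, i.e. exp(0).
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]


# Hypothetical five-category item with ordered thresholds (not study data).
probs = pcm_category_probabilities(theta=0.0, thresholds=[-2.0, -1.0, 1.0, 2.0])
```

With ordered thresholds, each response category is the most probable one somewhere along the latent trait; a person located at the center of these symmetric thresholds is most likely to choose the middle category.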

Table 1 Fit to the Rasch model

Results

Of the 808 participants, 337 were male and 469 female (2 missing). The mean age was 49.69 years, with a standard deviation of 17.85. For their highest level of education, 50 % reported tertiary, 45 % secondary, and 3 % primary education. Fifty-eight percent were married, 18 % single, 8 % lived as married, 7 % divorced, 6 % widowed, and 2 % separated. To the question “Are you currently ill?” 140 participants answered “yes” and 656 answered “no.”

For the majority of WHOQOL-BREF items, kurtosis and skewness coefficients were within the acceptable range of −1.00 to 1.00. Several items were slightly outside this range, but still within −1.10 to 1.10. Item 23 (condition of living place) had a skewness and a kurtosis coefficient of −1.38 and 2.19, respectively, and a mean of 4.30 (SD = 0.84). Item 25 (transport) had a mean of 4.06 (SD = 0.93) and a kurtosis coefficient of 1.18.

Cronbach’s alpha values were 0.91 for the overall scale, 0.80 for the physical domain (seven items), 0.82 for the psychological domain (six items), 0.71 for the social domain (three items), and 0.81 for the environment domain (eight items). All values were above 0.70, and thus showed adequate internal consistency. Criterion-related validity was assessed by correlating item and domain scores with Items 1 (global quality of life) and 2 (global health). Items 1 and 2 were significantly correlated with all 24 remaining items (P < 0.01). All domain scores were significantly (P < 0.01) correlated with Item 1 (Pearson’s r ranged from 0.45 to 0.60), as well as Item 2 (0.31–0.64). Domain mean scores and standard deviations are shown in Table 2.
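For readers unfamiliar with the internal consistency coefficient reported above, the sketch below shows one common way to compute Cronbach's alpha from raw item scores. The study itself used SPSS; the function name and the toy responses are assumptions for illustration only.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(item_scores[0])                     # number of items
    columns = list(zip(*item_scores))           # one tuple of scores per item
    item_variances = [variance(col) for col in columns]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from five people to a three-item domain (not study data).
toy_responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 3],
    [4, 4, 5],
]
alpha_toy = cronbach_alpha(toy_responses)
```

Values above 0.70 would be read as adequate internal consistency under the criterion applied in this study.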

Table 2 Correlations between latent factors of the CFA

The suitability of the factor structure of the WHOQOL-BREF was evaluated using three goodness-of-fit values [25]. The root mean square error of approximation was 0.072, and thus did not meet the <0.060 criterion for a good fit. The comparative fit index was 0.966, and thus met the >0.95 criterion for a good fit. The final goodness-of-fit index, the standardized root mean square residual, was 0.067, indicating a good fit (criterion <0.080). Standardized factor loadings were all above 0.50, except for Items 3 (pain) and 4 (medication), which had factor loadings of 0.48 and 0.36, respectively. The correlations between the four factors ranged from 0.48 to 0.75 (Table 2).
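The comparison of the three reported fit indices with their cutoff criteria can be laid out explicitly, as in the short sketch below. The index values and cutoffs are those stated in the text; the variable names are illustrative.

```python
# Fit indices reported for the four-domain CFA.
fit_indices = {"RMSEA": 0.072, "CFI": 0.966, "SRMR": 0.067}

# Cutoff criteria for a good fit, as stated in the text.
criteria = {
    "RMSEA": lambda value: value < 0.060,
    "CFI": lambda value: value > 0.95,
    "SRMR": lambda value: value < 0.080,
}

# True where the reported value satisfies its criterion.
meets_criterion = {name: criteria[name](value) for name, value in fit_indices.items()}
```

Two of the three indices satisfy their criteria, while the root mean square error of approximation falls just outside its cutoff.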

A smaller sample was used to test the data against the Rasch model because large sample sizes can result in Type I errors, that is, falsely rejecting an item as not fitting the Rasch model [26]. Four hundred and twenty people were included in the analysis, a sample size large enough to have 99 % confidence that the estimated item difficulty is within ±½ logit of its stable value [27]. All participants who had self-identified as unwell (n = 140), as well as a random sample of 280 from the remaining participants were included. The findings are summarized in Table 1. For each of the domains, a small number of people fit the Rasch model better than expected, as indicated by the negative fit residuals smaller than −2.50. However, as they did not misfit the model, they were retained in the analysis. Ten items had disordered thresholds and were successfully re-scored prior to further analysis. Each domain was unidimensional but included items that displayed uniform DIF by a range of variables. In addition, each domain had a few items that were dependent on one another. The creation of testlets for DIF or dependent items overcame these problems with subsequent item fit and overall fit to the Rasch model. The domains had acceptable reliability at the group level, except the social domain, which had a low Person Separation Index of 0.54.
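The subsampling step described above (all participants who self-identified as unwell, plus a random draw of 280 from the rest) could be sketched as follows. The record layout, the `ill` field, and the seed are hypothetical, and participants with a missing answer to the illness question are simply left out of the toy data, which is an assumption rather than a documented detail of the study.

```python
import random


def draw_rasch_subsample(participants, n_well=280, seed=0):
    """Keep every unwell participant and add a random sample of the others."""
    unwell = [p for p in participants if p["ill"]]
    well = [p for p in participants if not p["ill"]]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return unwell + rng.sample(well, n_well)


# Hypothetical records mirroring the reported counts: 140 "yes" and 656 "no".
participants = [{"id": i, "ill": i < 140} for i in range(140 + 656)]
subsample = draw_rasch_subsample(participants)
```

This yields the 420 people used in the Rasch analysis while deliberately overrepresenting unwell respondents relative to the full sample.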

Discussion

The present study evaluated the psychometric properties of the WHOQOL-BREF for use in the general population in New Zealand. Reliability and criterion-related validity were very good. A CFA suggested that the generally acknowledged four-domain structure provided a good to very good fit, with goodness-of-fit values that were slightly better than those from similar analyses reported elsewhere [6, 28–30]. The overall conclusion of the Rasch analysis supported the CFA findings after dealing with problems of threshold ordering (in 10 items across three domains), local dependency (all domains), and DIF (all domains: 15/24 items).

Classical test theory assumes that thresholds are ordered, while in Rasch measurement theory this is specifically assessed [31]. Disordered thresholds are problematic and indicate that the response categories are not used consistently along the latent trait. In our analysis, we were able to account for disordered thresholds by collapsing some item response categories, which is a common approach [32]. In one other study, disordered thresholds were found, though to a lesser extent [33]. By contrast, one study did not identify this issue at all [34], and others did not report on this [35, 36].
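To make the idea of disordered thresholds and category collapsing concrete, the sketch below checks whether a set of threshold estimates is ordered and rescales responses by merging two adjacent categories. The threshold values and the rescoring map are hypothetical; the rescoring actually applied in this study is not reproduced here.

```python
def thresholds_are_ordered(thresholds):
    """True if each threshold lies above the previous one on the latent trait."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


def collapse_categories(responses, mapping):
    """Rescore raw responses according to a category-collapsing map."""
    return [mapping[r] for r in responses]


# Hypothetical threshold estimates: the second and third are reversed.
disordered = [-1.5, 0.4, -0.2, 1.8]
needs_rescoring = not thresholds_are_ordered(disordered)

# Hypothetical rescoring that merges original categories 2 and 3.
mapping = {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}
rescored = collapse_categories([1, 2, 3, 4, 5, 3], mapping)
```

Collapsing reduces the number of score points per item, so the maximum achievable domain total changes after rescoring.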

The assessment of local dependency has undergone changes recently. Current protocols are very strict, accepting a scale as locally independent if correlations between residuals are smaller than 0.30 [37]. Local dependency was only assessed in one other study [33] and was only an issue in the physical domain. The authors were able to resolve this by creating a testlet as was conducted here.
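The residual correlation criterion mentioned above can be illustrated with a small sketch that flags item pairs whose residuals correlate at or above 0.30. The function names and residual values are hypothetical, and the sketch is not the procedure implemented in RUMM2030.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def dependent_pairs(residuals, cutoff=0.30):
    """Item pairs whose residual correlation reaches the cutoff."""
    flagged = []
    for a, b in combinations(residuals, 2):
        if pearson(residuals[a], residuals[b]) >= cutoff:
            flagged.append((a, b))
    return flagged


# Hypothetical standardized residuals for three items (not study data).
residuals = {
    "item_a": [1.0, -1.0, 0.5, -0.5, 0.0],
    "item_b": [0.9, -1.1, 0.6, -0.4, 0.0],
    "item_c": [-0.5, 0.5, 1.0, -1.0, 0.0],
}
flagged = dependent_pairs(residuals)
```

Pairs flagged in this way are candidates for combining into a testlet, which is the remedy used in this study.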

The present study found significant problems with DIF in 15 of 24 items. DIF was more prevalent than in other studies that formally assessed this [33, 34, 36]. Rather than deleting these items [36], we created testlets containing items that showed DIF in opposite directions (either statistically or visually observed) [24]. This led to satisfactory fits to the Rasch model.

Our analysis differed from that by Wang et al. [36] who used a multi-dimensional Rasch model. In the multi-dimensional model, domains are simultaneously calibrated and the correlations between traits are taken into account [38]. The disadvantage of this approach is that the raw scores are no longer sufficient statistics for the Rasch derived person estimates since their estimates on one domain are dependent upon the other three domains as well. Our findings also differ from a Danish study, which established that the WHOQOL-BREF did not fit the Rasch model but that it did fit a two-parameter model [35]: This is an item response theory model that allows different discrimination of items in a scale. Problems with this approach have been widely documented and essentially concern the model’s inability to provide a separate estimation of item and person parameters, which highlights fundamental measurement problems [39].

The following limitations need to be acknowledged. Firstly, the New Zealand electoral roll was used to collect a sample of the general population, thus equating the general population with the population of people enrolled to vote. Typically, more than 90 % of the eligible voting population (which includes people with New Zealand citizenship or permanent residency) is enrolled to vote, although this proportion is lower for people aged 18–24. Secondly, self-selection bias is likely to have been substantial. The low response rate of younger participants prompted collection of further data using purposive sampling, but a gender bias still remained, with 58 % of respondents being female. No data on ethnicity were collected, and it therefore cannot be determined to what extent the sample represents the ethnic mix of the New Zealand population. The present study was conducted 1 year after the 2008 national elections, and people who had moved house since the elections would have received the questionnaire at their old address. This may explain why the response rate was particularly low for young people, who may be prone to moving house more frequently than others. Due to budget constraints and confidentiality requirements, no incentives, sweepstakes, or reminder letters could be offered, any of which might have increased the response rate [40].

Unlike the original validation studies of the WHOQOL-BREF [5], the present study did not purposively collect data from individuals with identified health issues, as it was intended to validate the WHOQOL-BREF for use in the general population. This intention was taken into consideration when interpreting the factor structure of the instrument. A frequently used cutoff criterion for factor analysis is that factor loadings need to be above 0.30, although 0.40 appears to be more frequently used. In our case, the factor loading of 0.36 for Item 4 (medication) was therefore marginal. However, since the WHOQOL-BREF is a well-established measure that has been used worldwide, we believed a laxer criterion to be justifiable. Additionally, the item fit the Rasch model and was thus retained in the analysis. Our own previous work [41] and that of others [42] found that this item, as well as Item 3 (pain), did not perform well in samples of young university students with a low proportion of self-reported health problems. The fact that the present study found similar (albeit less pronounced) problems with these items could reflect the fact that a large proportion of respondents in our sample also did not have health issues.

To summarize, the WHOQOL-BREF is suitable for use in New Zealand with samples from the general population. As a result of the collapsing of some of the response categories in our analysis, the total achievable domain scores have changed. Consequently, it is not possible to provide a straightforward ordinal-to-interval scores conversion table. Future work should either analyze WHOQOL-BREF domain scores using non-parametric statistics or data should be fitted to the Rasch model to derive interval person estimates.