1 Introduction

This article describes an instrumental study whose goal was to calibrate the GENCAT Scale (Verdugo et al. 2008a, b), a direct-observation quality of life questionnaire and outcomes-based measure of service quality for users of social and human services. It is widely accepted that the quality of life concept is important in social services for implementing person-centered programmes and practices, assessing and reporting personal outcomes, guiding quality improvement strategies, and improving the effectiveness of those practices and strategies through evidence-based or outcomes-based measurement (Schalock et al. 2008a; Verdugo et al. 2010). The concept has recently begun to be applied in social policy, since it has become a reference model for service provision, a basis for evidence-based practices, and a tool for developing quality improvement strategies (Schalock and Verdugo 2002, 2007).

The GENCAT Scale was developed on the basis of the eight-domain model proposed by Schalock and Verdugo (2002) and subsequent work on the model’s validation and cross-cultural use (Jenaro et al. 2005; Schalock et al. 2005, 2008b). According to this framework, quality of life is composed of eight domains, together with core indicators that operationally define each domain. Measuring the indicators yields personal outcomes that can be used both for reporting purposes and for guiding organizational improvements (Claes et al. 2009; Keith and Bonham 2005; Schalock and Bonham 2003; Schalock et al. 2008a, b). According to this model, individual quality of life is a multidimensional phenomenon composed of core domains that are influenced by personal characteristics and environmental variables; these core domains are the same for all people, although they may vary in relative value and importance, and they are assessed on the basis of culturally sensitive indicators (Schalock et al. 2010).

The current approach to the measurement of quality of life is characterized by: (1) its multidimensional nature, involving core domains and indicators; (2) methodological pluralism, which includes the use of subjective and objective measures; (3) the incorporation of a systems perspective that captures the multiple environments affecting people at the micro, meso, and macrosystem levels; and (4) the increased involvement of persons at risk of social exclusion in the design and implementation processes (Bonham et al. 2004; Verdugo et al. 2005). Moreover, depending on the purpose and perspective of the instrument used, quality of life related personal outcomes can be used to assess either the person’s perceived well-being on the item (self-report) or the person’s life experiences and circumstances (direct observation) (Claes et al. 2009; Schalock et al. 2008a). In this sense, we can speak of objective and subjective measures and measurement instruments, depending on their purpose, content, and respondent. If an evaluator wishes to assess personal outcomes and develop person-centered programs, subjective Likert-type scales answered by the client or user of the service should be applied (Bonham et al. 2004; Cummins 2005; Keith and Bonham 2005; Keith and Schalock 2000; Schalock and Felce 2004). By contrast, when the goal is program evaluation, service quality improvement, or the assessment of organizational changes, objective questionnaires based on direct observation of personal experiences and circumstances are recommended (Perry and Felce 2005; Schalock and Felce 2004; Schalock et al. 2007; Schalock and Verdugo 2002; Verdugo et al. 2007a; Walsh et al. 2006).

The purpose of this study was to calibrate the GENCAT Scale and evaluate its psychometric properties, with the goal of having a tool for carrying out program evaluations, making evidence-based service quality improvements, and assessing and guiding organizational changes. To this end, the psychometric properties of the measures provided by the Catalonian version of the GENCAT Scale were analyzed using the Rasch Rating Scale Model (Andrich 1978; Wright and Masters 1982; Rasch 1960, 1980, 1992). The Rasch model is often classified under Item Response Theory (IRT). It specifies how people, test items, probes, or other elements must interact statistically, through probabilistic measurement models, for linear measures to be constructed from ordinal observations. This model requires the investigation and quantification of accuracy, precision, reliability, construct validity, quality control fit statistics, statistical information, linearity, local dependency and unidimensionality (Linacre 2010). IRT, in turn, is a system of models that defines one way of establishing the correspondence between latent variables (e.g., quality of life) and their observed manifestations (e.g., test items). Persons and items are therefore located on the same continuum, and an item is considered useful only if it can differentiate among persons located at different points along that continuum.
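
As an illustration of how the Rating Scale Model places persons and items on the same logit continuum, the following minimal sketch computes the category probabilities for one person-item encounter using the standard Andrich formulation. The function names and the threshold values are hypothetical and chosen only for illustration; they are not estimates from the GENCAT calibration.

```python
import math


def rsm_category_probabilities(theta, delta, thresholds):
    """Category probabilities under the Andrich Rating Scale Model.

    theta      -- person measure in logits
    delta      -- item difficulty in logits
    thresholds -- step thresholds tau_1..tau_m, shared by all items
    Returns a list of m + 1 probabilities for categories 0..m.
    """
    # Category k has log-numerator sum_{j<=k} (theta - delta - tau_j);
    # category 0 corresponds to the empty sum, i.e. 0.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probabilities):
    """Model-expected rating, with categories scored 0..m."""
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical values: a four-category item with illustrative thresholds.
thresholds = [-1.0, 0.0, 1.0]
low = rsm_category_probabilities(theta=-1.0, delta=0.0, thresholds=thresholds)
high = rsm_category_probabilities(theta=1.0, delta=0.0, thresholds=thresholds)
print([round(p, 3) for p in low], round(expected_score(low), 2))
print([round(p, 3) for p in high], round(expected_score(high), 2))
```

A person located above the item on the continuum has a higher expected rating than a person located below it, which is the sense in which a useful item differentiates among persons at different points.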

The specific goals of the GENCAT calibration were as follows: (1) to verify the unidimensionality of each factor; (2) to obtain validity evidence for the instrument through the observed fit of the data to the model, regarding both items and persons; (3) to calculate the indices of reliability and separation; (4) to estimate the calibration of the items; (5) to verify the accuracy of the measurement; and (6) to determine whether there is Differential Item Functioning (DIF) by gender and type of group.

2 Methods

2.1 Participants

A total of 608 professionals working in 239 social services centres participated by completing the field-test version of the GENCAT Scale for 3,029 users from Catalonia (Spain). The mean number of persons evaluated per service was 12.67 (SD = 7.75), and the mean number of persons evaluated per professional was approximately five (M = 4.98). A probabilistic multistage sampling design was used to select the participants in order to ensure representativeness. The sample comprised two main groups: older people (sampling error of 2.43 with 95% confidence and p = q) and people at risk of social exclusion (sampling error of 2.91 with 95% confidence and p = q).
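
A sampling error quoted at 95% confidence with p = q conventionally refers to the maximum margin of error for an estimated proportion. The sketch below shows that textbook calculation under the simplifying assumption of simple random sampling, which is not the multistage design used here, so it is not expected to reproduce the reported figures. The function name and the subgroup size are hypothetical.

```python
import math


def max_sampling_error(n, z=1.96, population=None):
    """Maximum margin of error for a proportion, taking p = q = 0.5.

    n          -- sample size
    z          -- normal quantile (1.96 for 95% confidence)
    population -- optional finite population size; when given, the
                  finite population correction is applied
    Assumes simple random sampling (a simplification).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    error = z * math.sqrt(0.25 / n)
    if population is not None:
        if population <= 1 or n > population:
            raise ValueError("population must exceed 1 and be at least n")
        error *= math.sqrt((population - n) / (population - 1))
    return error


# Hypothetical subgroup size, expressed in percentage points.
print(round(100 * max_sampling_error(1000), 2))
```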

The requirements for professionals to participate in this study were: (1) to be working in some kind of social service for handicapped adults; and (2) to have been working directly with the client for at least 3 months. The only requirement for applying the scale to a social service user was that the user be older than 16. Regarding the main sociodemographic characteristics of the professionals, most were female (85%), had been working with the client for more than 2 years (55.74%), and had been working in social services for more than 5 years (52.80%); by profession, 23.01% were psychologists and 18.41% were social workers. Regarding the users of the social services evaluated, the sole criteria governing their inclusion in the research were: (1) to be users of some kind of social service attached to the ICASS (Catalan Welfare and Social Services Institute), and to have been so for at least 3 months; and (2) to be over the age of 18. Concerning their sociodemographic data, 55.7% were female. Their ages ranged between 16 and 105 (M = 64.72; SD = 21.34). More than half of the sample (57.57%; n = 1,711) was older than 60. In fact, the largest group (n = 791) was composed of people aged 81–90, and only 17.39% (n = 515) were younger than 41.

The largest group was composed of older people living in residential settings (44.70%), followed by people with intellectual disabilities (19.35%), physical disabilities (11.72%), mental health problems (10.33%), and older people in day centres (8.75%). The percentages of people with drug dependence and HIV/AIDS ranged from 2.48 to 2.67%.

2.2 Instrument

Although there is a Spanish version of the GENCAT Scale (Verdugo et al. 2009), the Catalonian version of the instrument (Verdugo et al. 2008a) was applied in this study. The GENCAT Scale is a self-administered questionnaire in which professionals answer objective and observable questions about the user’s quality of life, based on direct observation of the person’s life. It is composed of eight subscales, which correspond to the eight quality of life domains, and 69 items formulated as third-person declarative statements and randomly organized by domains. The answer format is a frequency scale with four options (‘never or hardly ever’, ‘sometimes’, ‘often’, ‘always or almost always’).

The GENCAT Scale was developed through a systematic and rigorous method (Verdugo et al. 2008b, 2010) and has served internationally as a model for developing other multidimensional quality of life scales focusing on the context (Verdugo et al. 2007b; van Loon et al. 2008).

2.3 Procedure

Once the participating services had been selected, a letter was sent, by post and by email, to explain the research goals and request their participation. The research team then phoned every service to confirm the postal address (since the scales were sent by a courier company) and to confirm whether they were willing to collaborate. Once their participation was confirmed, each service was sent the specific number of scales it had to complete plus 5 (to be sure of achieving a large enough sample), together with an evaluator’s guide. About 4,500 scales were sent.

The statistical software used to analyze the data was WINSTEPS 3.68.0 (Linacre 2008; Linacre and Wright 1999).

3 Results

3.1 Preparing the Data

Before describing the results obtained with the one-parameter logistic analysis, we performed the pertinent prior checks of the fit of the data to the model: (1) the point-biserial correlations were positive in all cases (between .13 and .82); (2) the response categories functioned suitably in all cases; in fact, each category had more than 1,000 observations, comfortably surpassing the recommended minimum of 10 observations. Furthermore, the mean measures for the categories increased progressively in all the dimensions except Material wellbeing and Rights, in which category 3 was especially noisy; and (3) the study of item misfit confirmed that the items lay within the recommended range (Linacre 2002), with the sole exception of three items (‘He/she is exposed to exploitation, violence or abuse’ in Rights, ‘The service he/she attends caters for their preferences’ in Self-determination and ‘He/she has a satisfactory sex life’ in Interpersonal relations), which returned values slightly higher than 2.

3.2 Unidimensionality

The principal components analysis of the residuals for the eight subscales showed that the modelled data explained between 36 and 58.7% of the variance. More specifically, the percentages for Interpersonal relations and Self-determination were slightly lower than the recommended value of 60%, whilst the majority of subscales exceeded the commonly used value of 40%. Only Physical wellbeing and Material wellbeing did not reach that value, although they came very close. These results, together with the first contrasts in each principal components analysis (all with eigenvalues below 3.0, the value above which a second dimension is considered to exist), led us to confirm the unidimensionality of the eight subscales of quality of life.

3.3 Fit of the Data to the Model

Secondly, we analysed the fit of the data to the model. On the one hand, regarding the fit for persons, it is noteworthy that Interpersonal relations is the only subscale without extreme data, whereas Rights and Material wellbeing have more than 600 extreme cases. In addition, the fit for persons is perfect in Self-determination and Social inclusion for both the infit and the outfit, and in Physical wellbeing for the infit. All the other values ranged between −.1 and .1. Finally, the values of MNSQ were very close to 1 in all cases (see Table 1). For these reasons, we can conclude that the overall fit for persons shows that the responses are consistent with the response patterns foreseen by the model. On the other hand, regarding the overall fit of the items, the MNSQ values confirmed the items’ fit to the RSM in all cases. Here, Emotional wellbeing and Social inclusion were the subscales with a perfect fit, and Material wellbeing exceeded the value |1.0|, albeit only slightly. Given that all the values fell within the range considered to be acceptable, we confirmed the overall fit of the items to the model. A more detailed analysis of the items’ fit furthermore revealed that the most accurate dimensions are Emotional wellbeing, Personal development, Self-determination and Social inclusion. By contrast, the least accurate items are in Material wellbeing, Physical wellbeing, and Rights.

Table 1 The GENCAT Scale calibration

No item presented dependence or determinism (i.e., no values were below 0.60). However, five of the 69 items revealed a lack of fit, noise or high random variability in the data (values substantially higher than 1.5). In brief, most of the item parameters in the GENCAT Scale behave acceptably according to the postulates of the Rasch model and have a suitable fit, with the exception of the final item (‘He/she is exposed to exploitation, violence or abuse’, in Rights), whose fit is highly debatable.
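
The mean-square values discussed above can be illustrated with a minimal sketch of the standard infit and outfit statistics for a single item: outfit is the unweighted mean of squared standardized residuals, and infit is the information-weighted version. The person measures, responses, item difficulty and thresholds below are hypothetical, and the code is not the WINSTEPS implementation.

```python
import math


def rsm_probs(theta, delta, thresholds):
    """Rating Scale Model probabilities for categories 0..m."""
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - delta - tau))
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def item_fit(responses, thetas, delta, thresholds):
    """Return (infit, outfit) mean squares for one item.

    responses -- observed ratings (0..m), one per person
    thetas    -- person measures in logits, in the same order
    """
    squared_residuals, variances = [], []
    for x, theta in zip(responses, thetas):
        probs = rsm_probs(theta, delta, thresholds)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residuals.append((x - expected) ** 2)
        variances.append(variance)
    # Outfit: mean of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical data: five persons answering one four-category item.
thetas = [-1.5, -0.5, 0.0, 0.8, 1.6]
responses = [0, 1, 3, 2, 3]
infit, outfit = item_fit(responses, thetas, delta=0.0, thresholds=[-1.0, 0.0, 1.0])
print(round(infit, 2), round(outfit, 2))
```

Values near 1 indicate responses about as variable as the model predicts; values well above 1 indicate noise, and values well below 1 indicate overly predictable (deterministic) responses.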

3.4 Item and Person Separation Reliability

The calculation of reliability involved the use of the item separation reliability index and the person separation reliability index (Fig. 1). Concerning the former, we obtained a separation reliability index for the items equal to 1 in all the subscales, so we can affirm that the items have the utmost reliability. However, the separation reliability indices for people were lower and varied greatly between the subscales: Self-determination was the only dimension that returned a value considered to be acceptable (.82), whereas Material wellbeing returned a reliability coefficient that was so low (.15) that its reliability as regards persons is highly questionable. All the other values ranged between .23 (Physical wellbeing) and .77 (Emotional wellbeing). The results for the separation indices for items and people confirmed the previous results, whereby the separation index for the items exceeds the value of 2.00 in all cases (in fact, it exceeded the value of 18 in all cases and reached a value of 39.19 in Interpersonal relations). However, the separation index for people only exceeded the value of 2.00 in the case of the subscale Self-determination (2.16). The lowest separation indices corresponded to Material wellbeing (0.41) and Physical wellbeing (0.55).

Fig. 1 Item and person separation reliability
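
The separation index and the separation reliability are two expressions of the same information, linked by the standard relationship G = sqrt(R / (1 − R)). The following sketch, with hypothetical function names, applies that relationship to some of the rounded person reliabilities quoted in this section.

```python
import math


def separation_from_reliability(reliability):
    """Separation index G implied by a separation reliability R."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation):
    """Separation reliability R implied by a separation index G."""
    if separation < 0.0:
        raise ValueError("separation must be non-negative")
    return separation ** 2 / (1.0 + separation ** 2)


# Rounded person reliabilities as quoted in the text.
reported = [
    ("Self-determination", 0.82),
    ("Physical wellbeing", 0.23),
    ("Material wellbeing", 0.15),
]
for subscale, reliability in reported:
    print(subscale, round(separation_from_reliability(reliability), 2))

# An item separation of 18 corresponds to a reliability that rounds to 1.
print(round(reliability_from_separation(18.0), 3))
```

The computed separations (about 2.13, 0.55 and 0.42) are close to the reported 2.16, 0.55 and 0.41; small differences are to be expected because the reliabilities in the text are rounded to two decimals. The last line shows why item separations above 18 are reported alongside reliabilities of 1.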

3.5 Calibration

The next step involved the calibration of the items (Table 1). Generally speaking, all the scales are balanced in terms of the number of difficult items (above 0 logits) and easy items (below 0 logits). Regarding the distribution of items along the continuum of the quality of life dimension they evaluate, Interpersonal relations, together with Social inclusion, has the greatest range and best distribution of items. On the other hand, the dimensions with the largest gaps between items are Material wellbeing, Personal development, Physical wellbeing, Self-determination and Rights. Finally, Emotional wellbeing is the subscale whose items are grouped into a narrower range, albeit evenly distributed without large gaps between them.

3.6 Items’ Level of Difficulty

An analysis of the suitability of the items’ level of difficulty for the sample confirms these results, highlighting the almost perfect match of the items in Interpersonal relations. For Emotional wellbeing, by contrast, it would be advisable to include not only easier items but also, and above all, more difficult ones. It is highly advisable to include more difficult items in Material wellbeing, Physical wellbeing and Rights, as most of those included are too easy for the participants. In Personal development and Self-determination it would be advisable above all to include more difficult items, although one or two items are also noticeably lacking at the easier levels. Finally, in Social inclusion it would be appropriate to include more difficult items, whilst one item (‘He/she is rejected or discriminated against by others’) is too easy for the participants.

3.7 Response Categories

Regarding the adaptation of the response categories to the sample, we observe that, although the extreme categories are the most probable ones in all cases, the four response options are suitable in all the subscales, except for Interpersonal relations, Material wellbeing and Rights, in which option 2 (‘sometimes’ or ‘often’ according to the item’s valence) is not very suitable. Furthermore, the information curves for the categories reveal that the extreme options provide the most information (‘never or hardly ever’ in first place, followed by ‘always or almost always’) in Interpersonal relations, Material wellbeing and Rights. Emotional wellbeing is the only subscale in which the intermediate options (‘sometimes’ or ‘often’) provide more information than the extreme ones. In Self-determination and Social inclusion, the category that provides the most information is ‘often’, followed by ‘never or hardly ever’. Finally, the category in Social inclusion and Physical wellbeing providing the most information is ‘never or hardly ever’, with the least being provided by ‘sometimes’.

3.8 Differential Item Functioning

Differential item functioning (DIF) was analysed by gender and by group or condition. On the one hand, the analysis of DIF between men and women confirmed differential functioning for only one item: i39 ‘He/she maintains good personal hygiene’ in Physical wellbeing, which was much easier for women than for men. However, considering its content, we cannot affirm that there is a bias in favour of one gender or the other. On the other hand, the analysis of DIF by group confirmed differential functioning for only 10 items: ‘He/she has problems of conduct’ in Emotional wellbeing, which was more difficult for people in a situation of social disadvantage; ‘Most of the people with whom he/she interacts are in a similar situation to their own’ in Interpersonal relations, which was more difficult for elderly people; ‘He/she has access to new technologies’ in Personal development, which was more difficult for elderly people; ‘Technical aids are available if he/she needs them’, ‘He/she has healthy eating habits’ and ‘He/she maintains good personal hygiene’, which were more difficult for people at social disadvantage; ‘He/she has personal targets, goals and interests’ and ‘He/she chooses who they live with’ in Self-determination, the former being more difficult for the elderly and the latter for people at social disadvantage; ‘He/she frequents communal areas’ in Social inclusion, which was much more difficult for the elderly; and, in Rights, ‘One or more of his/her legal rights has been impaired’, which was more difficult for people at social disadvantage. Physical wellbeing and Self-determination were therefore the most problematic subscales in this respect. We thus note that 6 of the 10 items are more difficult for people at risk of social exclusion, so these items could carry a certain bias affecting this group.
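
A DIF analysis of this kind compares the difficulty of the same item estimated separately in two groups. The sketch below computes the DIF contrast and an approximate t statistic from two difficulty estimates and their standard errors. The screening thresholds (a contrast of at least 0.5 logits and |t| of at least 2) are a common convention assumed here for illustration, not the criterion reported for this study, and all numeric values are hypothetical.

```python
import math


def dif_contrast(d_reference, se_reference, d_focal, se_focal):
    """Return (contrast, t) for one item calibrated in two groups.

    contrast -- focal-group difficulty minus reference-group difficulty (logits)
    t        -- contrast divided by the joint standard error
    """
    contrast = d_focal - d_reference
    joint_se = math.sqrt(se_reference ** 2 + se_focal ** 2)
    return contrast, contrast / joint_se


def flag_dif(contrast, t, min_contrast=0.5, min_t=2.0):
    """Flag an item when the contrast is both sizeable and significant.

    The default thresholds are an assumed screening convention.
    """
    return abs(contrast) >= min_contrast and abs(t) >= min_t


# Hypothetical estimates: the item is harder for the focal group.
contrast, t = dif_contrast(d_reference=-0.40, se_reference=0.05,
                           d_focal=0.30, se_focal=0.06)
print(round(contrast, 2), round(t, 2), flag_dif(contrast, t))
```

A positive contrast means the item is more difficult for the focal group. Whether a flagged item is actually biased remains a matter of content review, as in the qualitative judgement applied to the hygiene item above.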

4 Discussion

It is worth stressing what, in general terms, we believe to be the main contribution of the analyses made from the IRT perspective: not only do they corroborate the results obtained through Classical Test Theory (CTT), but they also complement them by shedding light on their possible interpretations or explanations. An overview of the more specific conclusions follows: (1) the data fitted the model, with the sole exception of item 10 in Rights, whose elimination is highly recommended; (2) the highest point-biserial correlations (whose interpretation is similar to that of the alpha coefficient in CTT) were recorded in Self-determination and Emotional wellbeing, and the lowest in Physical wellbeing; (3) the dimensions Rights and Material wellbeing recorded the most extreme scores (i.e., very high scores); (4) most of the items are considered to be very accurate: 59 (85.5%) have a measurement error of .02 to .03; (5) the most accurate items (i.e., those whose observed score is closest to the person’s true score in the construct evaluated) are in the dimensions Self-determination, Emotional wellbeing, Personal development and Social inclusion (in descending order), whereas the least accurate ones are (in ascending order) in Rights, Physical wellbeing, Material wellbeing and Interpersonal relations; (6) there is a major discrepancy between the excellent reliability or separation of the items and the moderate or low reliability of the persons, which means it would be advisable to include a number of more difficult items in these dimensions in order to raise their level of accuracy (not simply to increase the number of items, but to adjust the items’ difficulty to the level of competence); (7) the calibration of the items shows that the dimensions Interpersonal relations, Self-determination and Rights cover a greater range of personal results than all the others, a phenomenon that could be explained by the greater number of items included in them; (8) the dimension Interpersonal relations is the one with the greatest suitability in terms of the difficulty of the items for the sample; (9) the response categories are generally suitable; nevertheless, the extreme categories (‘never or hardly ever’ and ‘always or almost always’) are used most, compared with the intermediate ones. The option ‘often’ is used much more than ‘sometimes’ for the items of positive valence, and the opposite occurs for those of negative valence. The latter option, furthermore, does not seem to be appropriate for Interpersonal relations, Material wellbeing and Rights, given that there are few opportunities for it to be used in these dimensions; (10) as a whole, the items and dimensions provide the most information at intermediate ability levels, whilst the level of information drops at the lower levels; (11) there is no differential item functioning (DIF) by gender, with the exception of ‘He/she maintains good personal hygiene’ in Physical wellbeing which, a priori and strictly applying the assumptions of the Rating Scale Model, should be deleted due to a possible bias in favour of women; that is, it appears to measure men and women differently, so it might reflect content more closely associated with social aspects of men and women than with attributes of Physical wellbeing; and (12) there is differential functioning by condition or group in 10 of the 69 items (14.5%), affecting every dimension except Material wellbeing, where there do not appear to be any biases. We thus find that 6 of the 10 items are more difficult for people at risk of social exclusion, and the other four are easier for them than for elderly people. This points to a possible bias in favour of one group or the other. Nevertheless, a qualitative analysis of the items with differential functioning does not allow us to single out any explanation or characteristic that might be detrimental to one of the groups and that is not directly related to those people’s quality of life, so we conclude that they are not biased items.

As the main conclusion of this work, we should note that this study is a first and unprecedented approach to the evaluation of the quality of life of users of social services. We can affirm that there is sufficient evidence of the validity of the GENCAT Scale for evaluating the quality of life of users of social services in Catalonia. It stands as the only instrument so far available that is sensitive to intervention programmes designed to improve personal outcomes (and therefore the quality of life of users), and so it can be of considerable use when applied in services and in the development and assessment of programmes. Indeed, the results obtained comfortably outperform those provided by other instruments designed to evaluate quality of life. Nonetheless, we do not dismiss other kinds of complementary evaluation; on the contrary, we consider it highly advisable to apply the scale together with, for example, subjective evaluations of quality of life (e.g., INTEGRAL Scale; Verdugo et al. 2009).