Introduction

In recent years the concept of responsiveness has been promoted as a desirable measure to evaluate the performance of health systems. Responsiveness relates to a system’s ability to respond to the legitimate expectations of potential users about non-health enhancing aspects of care [1]. In broad terms, it can be defined as the way in which individuals are treated and the environment in which they are treated, and encompasses the notion of an individual’s experience of contact with the health system [2].

One of the most ambitious attempts to implement a cross-country comparative instrument aimed at measuring health system performance is the World Health Survey (WHS), which includes modules on the responsiveness of a system to user preferences. Respondents are asked to rate their experiences of health systems using a 5-point categorical scale (ranging from “very good” to “very bad”). A common problem with such data is that individuals, when faced with the instrument, are likely to interpret the meaning of the response categories in a way that differs systematically across populations or population sub-groups according to their preferences and norms (for example, see Salomon et al. [3]). Accordingly, the response categories will not be comparable across populations if they do not correspond to the same underlying level of the responsiveness construct. We refer to this phenomenon as “reporting heterogeneity”.

Recently, the use of anchoring vignettes has been promoted as a means for controlling for reporting heterogeneity across populations or population sub-groups. Vignettes represent hypothetical descriptions of a fixed level of a latent construct, such as responsiveness. Since these are fixed and predetermined, systematic variation across individuals in the rating of the vignettes can be attributed to differences in reporting behaviour [4]. The idea is to use information from the vignettes to adjust self-reported experiences of health system performance to increase cross population comparability by removing the influence of reporting heterogeneity.

In recent years anchoring vignettes have been utilised to address the issue of heterogeneous reporting behaviour in many studies regarding, for example, health and health-related behaviours (e.g. [3–7]), health system responsiveness [2, 8–10], happiness and job satisfaction [11, 12], national identity [13] and state effectiveness [14]. Despite the growing popularity of the vignette methodology to address the issue of reporting heterogeneity, the formal evaluation of the validity of the approach remains a topic of research [15–18]. Two critical assumptions need to hold in order for the method to be valid. The first, termed response consistency, implies that individuals classify the vignettes in a way that is consistent with the rating of their own experiences of health system responsiveness. This implies that the mapping used from the latent levels of responsiveness given by the vignettes to the response categories is the same as the mapping used to translate latent responsiveness of own experiences of contact with health services to the available response categories. The second assumption, termed vignette equivalence, implies that “the level of the variable represented by any one vignette is perceived by all respondents in the same way and on the same unidimensional scale” [7, p. 194]. This assumption implies that, conditional on the socio-economic characteristics that determine reporting behaviour, for each vignette there is an actual (unobserved) level of responsiveness that all individuals agree to, irrespective of their country of residence, their socio-demographic characteristics or the level of responsiveness they actually face.

In this paper, we focus attention on the assumption of vignette equivalence.Footnote 1 A limited number of other studies have tried to assess the validity of this assumption. These were focussed on self-reports of the ratings of work disability [5], mobility [19], visual acuity and political efficacy [7, 21], job satisfaction [11] and life satisfaction for income [21], largely making use of non-parametric methods using tests based on the global ordering of the vignettes. Our study explores the validity of the vignette equivalence assumption, making reference to the concept of responsiveness and using data from the WHS. Moreover, we adopt several strategies to assess the validity of the vignette equivalence assumption, using both non-parametric and parametric methods. The use of a two-step regression procedure to evaluate whether a vignette construct is perceived in the same way across respondents is novel in this context.

Data and methods

Data

To assess the validity of the vignette equivalence assumption we use data from the WHS. The WHS is an initiative launched by the WHO in 2001 aimed at strengthening national capacity to monitor critical health outputs and outcomes through the fielding of a valid, reliable and comparable household survey instrument (see Ustun et al. [22]). The basic survey mode was an in-person interview, consisting of either a 90-min in-household interview (53 countries), a 30-min face-to-face interview (13 countries) or a computer-assisted telephone interview (4 countries). In total, 70 countries participated in the WHS 2002–2003. All surveys were drawn from nationally representative frames with known probability resulting in sample sizes of between 600 and 10,000 respondents across the countries surveyed. Data collection was on a modular basis covering different aspects of health and health systems, including information on health state valuation, health system responsiveness and health system goals. Samples have undergone extensive quality assurance procedures, including the testing of the psychometric properties of the responsiveness instrument [23], and close attention has been paid to the issue of comparability [22].

The WHS responsiveness module gathers basic information on health care utilisation for both inpatient and outpatient services. In the analysis that follows we make reference only to inpatient services. The measurement of responsiveness was obtained by asking respondents to rate their most recent experience of contact with the health system within a set of eight domains by responding to set questions. The domains consist of “autonomy” (involved in decisions), “choice” (of health care provider), “clarity of communication” (of health care personnel), “confidentiality” (e.g. talk privately), “dignity” (respectful treatment and communication), “prompt attention” (e.g. waiting times), “quality of basic facilities” and “access to family and community support”.Footnote 2 The following five response categories were available to respondents when rating their experience of health systems: “very good”, “good”, “moderate”, “bad”, and “very bad”.

The WHS further contains information on respondent characteristics. We make use of age, gender, level of education and income. These variables have been extensively used in studies investigating differential reporting behaviour in self-reported measures of health [2, 4, 19] and health-related disabilities [5]. Level of education is a continuous variable measuring the number of years in education. Gender is a dummy variable coded 0 for women and 1 for men. Income is derived from a measure of permanent income based on information on the physical assets owned by households. The approach to its measurement, which relies on a variant of the hierarchical ordered probit model (HOPIT) to improve cross-country comparability, is provided by Ferguson et al. [24]. We construct dummy variables to indicate the tertiles of the within-country distribution of household permanent income to which individuals belong. For the analysis presented here, the first income tertile is considered as the base category.

The WHS contains a number of vignettes describing the experiences of hypothetical individuals within each of the eight domains of responsiveness. The vignettes have been divided into four sets (A–D) with each set containing five vignettes for each item present across two domains. For example, Set A contains five vignettes for each of the two items in the domain of “Dignity” and five vignettes for each of the two items in “Prompt Attention”. Due to constraints of interview length, each respondent in the survey rated the vignettes present in only one of the sets. Therefore, each vignette has been rated by approximately 25% of survey respondents. The response scale available to respondents answering the vignettes is the same as the scale available when reporting their own experiences of health system responsiveness. Examples of the WHS vignettes are provided in Table 1 for the domains “Confidentiality”, “Choice”, “Clarity of communication” and “Quality of basic amenities”.

Table 1 Examples of vignettes for the domain of confidentiality, choice, communication and quality of basic facilities

We attempt to take into consideration the different levels of socio-economic development of countries, and to assess whether this influences the perception of the vignettes, by making use of the Human Development Index (HDI) to stratify the countries into high, medium and low HDI groups. The HDI is a composite index of human development that combines indicators of life expectancy, educational attainment and income [25]. We also try to take into account the presence of different values and norms in different countries and evaluate whether those values and norms affect the way individuals perceive the vignettes. To do this, we stratify our sample on the basis of the Inglehart–Welzel Cultural Map of the World, represented in Fig. 1 (http://www.worldvaluessurvey.org) [26].Footnote 3 This map reflects the presence of a strong correlation between a large number of basic values common to several countries. If we focus on European countries only, according to the Inglehart–Welzel map it is possible to identify three sets of countries that share similar social norms and values: Catholic countries, Protestant countries and ex-communist countries. At a broader level, if we consider all countries across the world, the basic values can be represented across two major dimensions of cross-cultural variation: Traditional/Secular-rational and Survival/Self-expression values (http://www.worldvaluessurvey.org). The first dimension reflects the contrast between societies in which religion is considered as an important element of life and those in which it is not. The second dimension reflects the contrast between industrial and post-industrial societies. In the former societies emphasis is given to economic and physical security while in the latter societies there is an increasing emphasis on subjective well-being, self-expression and quality of life. We follow this stratification in the analysis that follows.Footnote 4

Fig. 1 Inglehart–Welzel cultural map of the world. Source: http://www.worldvaluessurvey.org/

Methods

Consistent and near-consistent ordering of vignettes

We assess the vignette equivalence assumption by first considering the global ordering of the vignettes. A minimal condition for the assumption of vignette equivalence to hold is that individual responses are consistent with the global ordering of vignettes. The global ordering for a domain can be obtained by pooling all the responses across countries and considering the average categorical response for each vignette [19]. Similar tests of the vignette equivalence assumption based on the global ordering of vignettes, but for health-related disabilities, job satisfaction and self reported measures of health, have been undertaken by [5, 9, 11, 21]. Due to the presence of stochastic measurement errors we cannot expect all individuals to order the vignettes in exactly the same way as each other. Adopting the approach of Murray et al. [19], we define a consistent ordering as “a set of categorical vignette ratings that could be consistent with the global ordering in the latent variable space, if ambiguities were resolved in favour of the global ordering” [p. 373].Footnote 5 Accordingly, for each domain and for each country we compute the percentage of respondents that gave an ordering of vignettes consistent with the global ordering, or had an ordering where only one vignette moved one or two ranks or two vignettes moved one rank each. Further, we compute the average percentage of respondents in each country that gave an ordering of vignettes consistent or near-consistent with the global ordering, where countries have been stratified by HDI groups and by the Inglehart–Welzel map groups.Footnote 6
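The consistency check described above can be illustrated with a minimal sketch. It assumes ratings are coded 1 (“very good”) to 5 (“very bad”), that each respondent rates the same five vignettes, and that ties are the only ambiguities to be resolved in favour of the global ordering. The function names and the sample ratings are hypothetical, and the near-consistent cases (one vignette moving one or two ranks, or two vignettes moving one rank each) are not implemented here.

```python
from statistics import mean

def global_ordering(ratings_by_respondent):
    """Order vignettes by their pooled mean rating (1 = very good ... 5 = very bad)."""
    n_vignettes = len(ratings_by_respondent[0])
    means = [mean(r[j] for r in ratings_by_respondent) for j in range(n_vignettes)]
    # Indices of vignettes sorted from best (lowest mean) to worst (highest mean).
    return sorted(range(n_vignettes), key=lambda j: means[j])

def is_consistent(ratings, ordering):
    """True if the ratings never strictly contradict the global ordering.

    Tied ratings are allowed: they are treated as ambiguities that are
    resolved in favour of the global ordering.
    """
    ordered = [ratings[j] for j in ordering]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))

# Hypothetical ratings: rows are respondents, columns are five vignettes.
sample = [
    [1, 2, 3, 4, 5],
    [1, 1, 3, 3, 5],  # ties only: still consistent
    [2, 1, 3, 4, 5],  # first two vignettes reversed: not consistent
    [1, 2, 2, 4, 4],
]
order = global_ordering(sample)
share = sum(is_consistent(r, order) for r in sample) / len(sample)
print(order, share)
```

In this illustrative sample three of the four respondents give an ordering consistent with the pooled ordering.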

Spearman rank order correlation coefficient

Individuals’ ordering of the vignettes might differ due either to measurement errors (caused, for example, by incorrect phrasing, translation or implementation of the vignette questions) or to problems of multidimensionality and variation in the cultural construct of a domain [19].Footnote 7 An analysis of the more common alternative patterns of vignette ordering can provide information about the relative importance of the problem of measurement error versus the problems of multidimensionality and variation in the cultural construct of a domain. Measurement error is generally associated with a large number of alternative orderings (due to chance). The prevalence of multidimensionality or cultural variation in a construct should, however, lead us to observe a limited number of alternative orderings, “reflecting some other weighting of the components of a multidimensional construct or alternative cultural constructs” [19, p. 376]. Multidimensionality of the responsiveness construct provides evidence of a violation of the vignette equivalence assumption. The Spearman rank order correlation coefficient (SROCC), which quantifies the extent to which an ordering is consistent with the global ordering of vignettes, has been suggested as a means to investigate the relative importance of the two sources of difference in ratings of vignettes [19].Footnote 8 For each domain we compute the SROCC between the vignette rankings of each respondent and the global ranking.

We calculate the frequency distribution, together with several descriptive statistics, of the SROCCs across all individuals in the WHS dataset for the eight domains considered.Footnote 9 First, for each domain, we compute the percentage of individuals who report an ordering of vignettes that is positive and the percentage of individuals for which the correlation coefficient between the individual and the global ordering of vignettes is larger than 0.5. Secondly, following Murray et al. [19], we report the number of different rank order correlation coefficients observed in each domain and the number that occur with a frequency greater than 1%. The greater the number of different rank order correlation coefficients reported in each domain together with a smaller number occurring with a large frequency, the higher the probability that alternative orderings are due to measurement errors rather than to multidimensionality or cultural variation. We also show the median SROCC for each domain and the average SROCC across domains for each country.Footnote 10
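As an illustration of these summary statistics, the sketch below computes a rank correlation for each respondent and then the share of positive coefficients, the share above 0.5 and the number of distinct coefficients. It assumes that tied ratings receive average ranks and that the coefficient is left undefined for a respondent who gives every vignette the same rating; neither convention is stated in the text, and the ratings shown are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when all ratings are identical
    return cov / (var_x * var_y) ** 0.5

# Hypothetical global ranking and respondent ratings for five vignettes.
global_rank = [1, 2, 3, 4, 5]
respondents = [
    [1, 2, 3, 4, 5],
    [1, 1, 3, 3, 5],
    [5, 4, 3, 2, 1],
    [2, 1, 3, 4, 5],
]
coeffs = [spearman(r, global_rank) for r in respondents]
valid = [c for c in coeffs if c is not None]
share_positive = sum(c > 0 for c in valid) / len(valid)
share_above_half = sum(c > 0.5 for c in valid) / len(valid)
n_distinct = len({round(c, 6) for c in valid})
print(coeffs, share_positive, share_above_half, n_distinct)
```

Applied to a full domain, the same counts would correspond to the quantities reported by domain: the percentage of positive coefficients, the percentage above 0.5 and the number of different coefficients observed.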

The HOPIT model

An alternative way to check the vignette equivalence assumption implies estimating a model for responsiveness that takes into account possible biases due to reporting heterogeneity. This approach, adopted by Kristensen and Johansson [11] when considering self-reported job satisfaction, consists of firstly estimating a model on a pool of countries. Secondly, the sample is split into groups of countries according to the values, social norms, economic development, etc. that characterise these countries. Models are then estimated on the sub-samples and the coefficients are compared to those obtained from the pooled sample. If the model is robust and the vignette equivalence assumption is not violated, then we would expect the coefficient to be similar in the two samples. However, if the differences in culture and values across the country groups lead individuals to interpret the meaning of vignettes differently (and thus to violate the vignette equivalence assumption), we should observe very different estimated coefficients across the country groups [11].

Since the data on responsiveness in the WHS are self-reported and categorical, we use the HOPIT model developed by Tandon et al. [27] (also see Terza [28]), to adjust for reporting behaviour. The model can be specified in two parts. The first part draws on the use of the anchoring vignettes to provide a source of information that enables the thresholds to be modelled as functions of relevant covariates (reporting behaviour equation). The second part maps the relevant covariates to underlying self-reported health system responsiveness while controlling for differences in reporting behaviour obtained through the first step (responsiveness equation). A more formal description of the two parts of the model is reported in “Appendix” (also see Rice et al. [9]). The use of vignettes to identify reporting heterogeneity relies on the assumptions of response consistency and vignette equivalence described in the Introduction.
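To make the role of the thresholds concrete, the following sketch shows how the same latent level of responsiveness can map into different reported categories when cut points vary with a respondent characteristic. It is a simplified illustration rather than the specification in the Appendix: it assumes unit-variance normal errors, five categories ordered from “very bad” to “very good”, and a single hypothetical coefficient `gamma` that shifts all cut points in parallel, whereas a full HOPIT allows each cut point its own coefficients. All numerical values are hypothetical.

```python
from math import erf, sqrt, inf

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def category_probs(latent_mean, cutpoints):
    """P(report category k) for an ordered probit with unit-variance errors."""
    bounds = [-inf] + list(cutpoints) + [inf]
    return [
        normal_cdf(hi - latent_mean) - normal_cdf(lo - latent_mean)
        for lo, hi in zip(bounds, bounds[1:])
    ]

def shifted_cutpoints(base, gamma, z):
    """Hypothetical reporting-behaviour equation: every cut point moves by gamma * z."""
    return [c + gamma * z for c in base]

base = [-1.5, -0.5, 0.5, 1.5]  # hypothetical cut points for five categories
latent = 0.0                   # the same latent responsiveness for both respondents

probs_a = category_probs(latent, shifted_cutpoints(base, gamma=0.5, z=0))
probs_b = category_probs(latent, shifted_cutpoints(base, gamma=0.5, z=1))
print([round(p, 3) for p in probs_a])
print([round(p, 3) for p in probs_b])
```

With higher cut points the second respondent is less likely to report the top category for an identical latent level, which is the kind of reporting heterogeneity that the vignette ratings are used to identify.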

As a preliminary analysis, we apply the HOPIT model across the pool of 27 European countries present in the WHS, using the domain “Dignity”. For the purposes of our model, we use the dummies for country of residence together with individual specific characteristics (age, gender, level of education and income) as relevant covariates in both the reporting behaviour and the responsiveness equation. Austria is taken as the baseline country. We then stratify the European countries into three groups according to the Inglehart–Welzel map to reflect similar cultures, social norms and values. We finally re-estimate the HOPIT model for each of the three groups of countries.

We further extend the analysis by considering all the countries present in the WHS.Footnote 11 Mexico, which has the largest sample size, is taken as the baseline country. Countries are stratified into four groups according to the Inglehart–Welzel map (“Self-Traditional”, “Self-Secular”, “Survival-Traditional”, “Survival-Secular”) and the HOPIT model is estimated separately for each of these groups of countries.

We also consider the possibility that differences in the level of socioeconomic development of countries might induce individuals to interpret the meaning of vignettes differently. Accordingly, we stratify the countries in the WHS according to their level of HDI and again apply the HOPIT model for each of these groups of countries.

Assessment of multidimensionality of the constructs represented by vignettes

An analysis of the characteristics of individuals described in the vignettes offers a further tool to check the vignette equivalence assumption. If the person described in a vignette is characterized by specific socio-demographic characteristics, it is possible that respondents are influenced by these characteristics, which may induce them to perceive the vignettes differently from other respondents. This would represent a violation of the vignette equivalence assumption. As an example, consider a vignette about “Autonomy” representing an elderly person. Some respondents may feel that elderly people are incapable of making appropriate decisions about treatments and may have lower expectations about the level of autonomy afforded to elderly individuals. Other respondents, however, could consider elderly people equally able to be involved in decisions about treatments as young people and hence would have the same expectations about the level of autonomy for elderly and young people. Specifying the age of the person described in the vignette may therefore induce some respondents to perceive the construct as representing “autonomy for elderly people” and others to perceive it as “autonomy” in general.

Information on the characteristics of the individual described in the vignette has been used to assess vignette equivalence in a study by Kapteyn et al. [5]. The authors use responses obtained from two internet surveys on work disability conducted in the Netherlands and in the US. Vignettes were presented to respondents by randomly using either a female or a male name (i.e. faced with the same vignette, some respondents rated the health conditions for a woman while others rated the same condition for a man). Variability across the ratings allowed the authors to model the reporting behaviour of respondents as a function of the gender of the individual described in the vignette by explicitly including this variable as a regressor in the HOPIT model.Footnote 12 They reported that “for a given vignette description, a male vignette person is seen as more work disabled than a female vignette person, by both male and female respondents” [5, p. 469]. In a similar vein, we evaluate whether individuals judge vignettes differently according to the gender of the person presented in a vignette and whether the person suffers from physical pain. We choose these individual characteristics for two reasons. First, on practical grounds, vignettes tend to represent “neutral” individuals, with little information on personal characteristics. Gender and pain are two of a very limited set of characteristics we can identify in the 20 vignettes considered. Secondly, while Kapteyn et al. [5] suggest that respondents tend to judge the vignettes differently according to whether the person in the vignette is female or male, Bago d’Uva [4] suggests that the elderly and the young interpret the construct of a vignette differently where the vignette describes a situation of physical pain.

For our analysis, we consider the pool of countries present in the WHS and, for illustration, make reference to the set of vignettes contained in the domains of “Dignity” and “Prompt attention”.Footnote 13 This set comprises 20 vignette questions answered by 858,570 individuals across all countries. Unfortunately, in the WHS there is no variability within a vignette in the gender of the individual described. The gender of the individual represented in each vignette is fixed and, accordingly, we are unable to adopt the methodology of Kapteyn et al. [5]. However, since within each domain of responsiveness in the WHS respondents are asked to evaluate a set of vignettes, we can exploit the variability in gender that is present across the vignettes within a given domain. To exploit this variability we perform a two-stage analysis using an estimated dependent variable regression model (EDV), as described by Lewis and Linzer [29]. In the first stage we model the reporting behaviour of respondents using a standard ordered probit model. We regress respondent ratings of the vignettes on the socio-demographic characteristics of the respondents and on a set of vignette-specific dummy variables [30, p. 61].Footnote 14 We then “store” the coefficients of the vignette-specific dummy variables.Footnote 15 In the second stage we regress the coefficients of the vignette-specific dummies on a dummy variable indicating if the person in the vignette is female, and on a dummy indicating if the person is in pain. Given the small sample size of the data we use in the second step regression, we correct for the potential presence of heteroskedasticity using the Efron robust standard error estimator [31], as suggested by Lewis and Linzer [29].
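The second stage of this procedure can be sketched as follows. The example takes the first-stage vignette-dummy coefficients as given and regresses them on a female dummy and a pain dummy by ordinary least squares. The coefficient values are made up for illustration (constructed from a known linear relationship so the output can be checked by hand), the helper `ols` is hypothetical, and the robust standard error correction used in the analysis is not reproduced here.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Illustrative (made-up) first-stage coefficients, one per vignette, built as
# 0.2 + 0.1 * female - 0.3 * pain so that the regression recovers known values.
female = [0, 1, 0, 1, 0, 1]
pain = [0, 0, 1, 1, 0, 0]
stage1_coefs = [0.2, 0.3, -0.1, 0.0, 0.2, 0.3]

# Second stage: constant, female dummy, pain dummy.
X = [[1.0, f, p] for f, p in zip(female, pain)]
b = ols(X, stage1_coefs)
print([round(v, 6) for v in b])
```

In the analysis itself the dependent variable is estimated with error, which is why a heteroskedasticity-robust standard error is applied in the second step; a test of vignette equivalence would then ask whether the coefficients on the two dummies differ significantly from zero.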

Results

Consistent and near-consistent ordering of vignettes

Using the data on health system responsiveness contained in the WHS, Table 2 reports the percentage of respondents for each domain in each country that gave an ordering of vignettes consistent with the global ordering, or had an ordering where only one vignette moved one or two ranks or two vignettes moved one rank each.Footnote 16 For each domain, there was no substantial variation across countries. For all countries (with few exceptions) more than 90% of respondents report consistent or near-consistent vignette orderings. For each domain, this percentage is equal to or greater than 95% in at least 52 countries. These preliminary results provide support for the assumption of vignette equivalence.

Table 2 Percent of consistent and near-consistent ordering by domain and country

Table 3 presents the average percentage of respondents in each country that gave an ordering of vignettes consistent or near consistent with the global ordering, where countries are stratified by HDI groups and by the Inglehart–Welzel map groups. Average percentages are reported for each domain. In general, the average percentages are slightly higher for High HDI countries compared to Medium and Low HDI countries, and for countries characterised by “Secular-Rational” values compared to “Traditional” ones. However, the variation across HDI groups and across the Inglehart–Welzel grouping of countries is very small. These results provide further evidence that individuals across different countries tend to interpret the vignettes in a consistent way.

Table 3 Average percent consistent and near-consistent ordering, by Human Development Index (HDI) groups and by Inglehart–Welzel map groups

Spearman rank order correlation coefficient

Table 4 provides frequency distributions for the SROCCs for an illustrative domain, “Clarity of Communication”, and Table 5 provides descriptive statistics across all domains. For each domain, the majority of individuals report an ordering of vignettes that is positively and highly correlated with the global ordering (the percentage of individuals whose SROCC is positive is between 87 and 95%, and the percentage of individuals with a SROCC larger than 0.5 is between 64 and 90%). The number of different rank order correlation coefficients reported in each domain appears to be high, and varies quite substantially (between 59 and 145) across domains. Accordingly, in some domains there is a large number of alternative orderings (e.g. “Prompt Attention” and “Quality of Facilities”), while for others the number of orderings is small (e.g. “Clarity of communication”, “Autonomy” and “Social Support”). The number of SROCCs that occur with a frequency greater than 1% does not appear to be particularly large (on average 19) and it varies across domains much less than the number of alternative orderings.Footnote 17 Overall, the results suggest that inconsistencies in vignette ordering are more likely to occur because of measurement errors than because of multidimensionality or cultural variation in the constructs of a domain. However, the possibility of some problem of multidimensionality appears to be higher in some domains (those presenting a smaller number of alternative orderings, e.g. “Autonomy”) than in others.

Table 4 Spearman’s rank order correlation coefficient between individual ordering of vignettes and the global ordering, for the domain “Clarity of Communication”
Table 5 Descriptive statistics about the Spearman rank order correlation coefficient, by domain

Figure 2 shows the median SROCC across the data for each domain.Footnote 18 For most domains the vignettes appear to work well, with the median correlation assuming values between 0.85 and 0.95. Only the domains “Confidentiality” and “Choice” appear to have a slightly worse performance, presenting a median correlation that varies between 0.75 and 0.80. Figure 3 shows the median value of the SROCC across domains in each country. This value ranges from very high levels observed for Bangladesh and Comoros Islands (1.00 each) to more moderate values for Cote d’Ivoire and Namibia (0.84 and 0.74, respectively). However, the coefficient is greater than 0.90 in the majority of countries. The high values presented by the average SROCCs imply that cultural differences in the interpretation of vignettes across countries may not be of great concern.Footnote 19

Fig. 2 Median Spearman rank order correlation coefficient (SROCC) across domains

Fig. 3 Median SROCC across countries

Table 6 provides the average SROCCs across all countries for individuals belonging to different socioeconomic groups. We perform this analysis following the suggestion of King et al. [7, p. 200], that “the key in detecting multidimensionality [of the vignette construct] is searching for inconsistencies that are systematically related to any measured variable”. In particular, Table 6 provides the SROCC between the ordering of vignettes defined at the global level and the median ordering given by individuals within different education groups. The same information is provided for individuals stratified according to their level of income and gender. The vignettes appear to be ordered in a similar way across the different socio-economic groups. The exception is individuals with a high level of education for the domain “Confidentiality”. For these individuals the ordering of the vignettes is less close to the global ordering, since the SROCC takes values below 0.8.

Table 6 Average Spearman rank order correlation coefficient (SROCC) across all surveys

The HOPIT model

Table 7 presents the results from the responsiveness and reporting behaviour equations of the HOPIT model estimated on the pool of the 27 European countries present in the WHS. For brevity, only the results related to the first cut point (the cut point separating the response category “very bad” from “bad”) are presented in the table. Results relating to other cut points are available on request. Belonging to the top income tertile, compared to the bottom, appears to be significantly related to experiencing a high level of responsiveness, while being a woman is negatively related to responsiveness (although this effect does not attain statistical significance). Elderly people and more educated people appear to face higher levels of responsiveness, but only for the former is the association statistically significant. On average, individuals in Eastern European countries appear to face lower levels of responsiveness than in Austria, while we cannot draw general conclusions for individuals in western European countries.

Table 7 European countries: coefficients and standard errors for the responsiveness equation and the reporting behaviour equation (first cut point) of the hierarchical ordered probit (HOPIT) model, for the domain “Dignity”, for the pool of countries and for countries stratified by the Inglehart–Welzel value map

We stratify the European countries into three groups, according to the Inglehart–Welzel map, to reflect similar cultures, social norms and values. When we estimate the HOPIT model for each of the three groups of European countries separately (Catholic, Protestant and ex-communist), the coefficients for the country dummy variables are very robust both in the responsiveness equation and in the reporting behaviour equation. The coefficients retain the same sign when compared to the coefficients for the model where all the European countries are pooled together. Further, few of them change substantially. These results lend further support to the assumption of vignette equivalence.

Table 8 presents the results of the HOPIT model estimated across the full pool of countries and on “Self-Traditional”, “Self-Secular”, “Survival-Traditional”, “Survival-Secular” countries separately. Again, the coefficients for the country dummy variables, both in the responsiveness and in the reporting behaviour equation, appear robust. Similar results, presented in Table 9, are obtained when the HOPIT model is estimated separately for countries stratified according to their level of HDI.Footnote 20 For both the responsiveness equation and the reporting behaviour equation, the coefficients for the country dummy variables again appear robust. These results provide further evidence in favour of the assumption of vignette equivalence.

Table 8 All countries: coefficients and standard errors for the responsiveness equation and the reporting behaviour equation (first cut point) of the HOPIT model, for the domain “Dignity”, for the pool of countries and for countries stratified by the Inglehart–Welzel value map
Table 9 All countries: coefficients and standard errors for the responsiveness equation and the reporting behaviour equation (first cut point) of the HOPIT model, for the domain “Dignity”, for the pool of countries and for countries stratified by HDI group

Test for multidimensionality of the constructs represented by vignettes

When we perform the two-stage analysis described in the section “Assessment of multidimensionality of the constructs represented by vignettes”, neither the regressors nor the constant term in the second-step regression is statistically significant at the 5% level (footnote 21). This result suggests that the gender of the person represented in the vignettes and his/her condition of pain do not influence the way respondents judge the vignettes (footnote 22). Again, these results provide support for the vignette equivalence assumption.

Conclusion and discussion

Despite the growing popularity of the vignette methodology to address the issue of systematic reporting heterogeneity in self-reported data, the formal evaluation of the validity of this methodology has remained a topic of research. Two critical assumptions, vignette equivalence and response consistency, need to hold in order for the method to be valid. This paper presents analyses to assess the validity of the assumption of vignette equivalence using data on health system responsiveness contained within the WHS.

We first performed non-parametric analyses based on the global ordering of the vignettes. Secondly, after estimating a HOPIT model for responsiveness on the pool of countries, we performed sensitivity analyses stratifying the countries in our sample on the basis of the Inglehart–Welzel map and HDI groupings. Thirdly, we adopted a two-step regression procedure to evaluate the possibility that an individual’s perception of the construct described by a vignette differs according to the characteristics of the person described in the vignette. The results derived from our analysis do not contradict the assumption of vignette equivalence. Accordingly, they lend support to the use of the vignette methodology to correct for the presence of reporting heterogeneity.
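As a rough illustration of what a check on the global ordering of vignettes involves, the sketch below ranks vignettes by their mean rating within each group of respondents and tests whether all groups produce the same ordering. It is not necessarily the procedure applied in this paper: the function names, the group and vignette labels and the ratings are hypothetical, and the ratings are assumed to be coded from 1 (“very good”) to 5 (“very bad”).

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def vignette_ordering(ratings: List[Tuple[str, int]]) -> List[str]:
    """Order vignettes from best to worst mean rating (ties broken by label)."""
    by_vignette: Dict[str, List[int]] = defaultdict(list)
    for vignette, rating in ratings:
        by_vignette[vignette].append(rating)
    return sorted(by_vignette, key=lambda v: (mean(by_vignette[v]), v))


def orderings_by_group(records: List[Tuple[str, str, int]]) -> Dict[str, List[str]]:
    """Compute the vignette ordering separately for each group of respondents."""
    by_group: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for group, vignette, rating in records:
        by_group[group].append((vignette, rating))
    return {group: vignette_ordering(rs) for group, rs in by_group.items()}


# Made-up records: (group, vignette, rating on the assumed 1-5 scale).
records = [
    ("X", "v1", 1), ("X", "v1", 2), ("X", "v2", 3),
    ("X", "v2", 3), ("X", "v3", 5), ("X", "v3", 4),
    ("Y", "v1", 2), ("Y", "v1", 2), ("Y", "v2", 4),
    ("Y", "v2", 3), ("Y", "v3", 5), ("Y", "v3", 5),
]

orderings = orderings_by_group(records)
same_ordering = len({tuple(o) for o in orderings.values()}) == 1
print(orderings)
print("all groups agree on the ordering:", same_ordering)
```

In this made-up example group Y rates every vignette more harshly than group X, yet both groups order the vignettes identically, which is the pattern one would expect if reporting styles differ while vignette equivalence holds.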

A potential limitation of our analysis is that, for brevity, only a limited set of domains of responsiveness were used. For the analysis in the section on “The HOPIT model” we considered only “Dignity”, while in “Test for multidimensionality of the constructs represented by vignettes”, we refer to “Dignity” and “Prompt Attention”. Some caution is, therefore, required in generalising our results to other domains of the responsiveness construct.

The results refer only to the assumption of vignette equivalence and do not consider response consistency. Recent literature has tried to assess the validity of the latter assumption [6, 17]. The majority of these studies test this assumption by comparing self-reported data to objective data (for example, comparing self-reported data on health to objectively measured levels of health). Unfortunately, the WHS does not contain objective measures of the level of responsiveness faced by respondents. Hence, we are currently unable to test this assumption in the WHS.

Our study provides an original contribution to the literature on anchoring vignettes by exploring the validity of the vignette equivalence assumption with reference to the concept of responsiveness. We adopt several strategies to assess the validity of this assumption, employing both non-parametric and parametric methods. Overall, our results provide no strong evidence against the assumption and, accordingly, support the use of the anchoring vignette approach to adjust self-reported data for systematic differences in reporting behaviour.