Introduction

Health-related quality of life (HRQOL) is gaining significant research attention in prevention, treatment and rehabilitation studies among different ethnic and cultural groups worldwide [1]. However, studies have shown that the HRQOL concept is strongly influenced by cultural values, traditions, and beliefs [2–5]. Consequently, questionnaires designed for HRQOL assessment are inherently sensitive to the language, dialect and community values of the local cultures [6]. Hence, in cross-cultural research, HRQOL instruments should not only be translated well linguistically but also be culturally adapted to preserve the content validity of the questionnaire across different cultures [3, 7]. As borrowed from cultural psychology, the primary requirement for the cross-cultural adaptation of a questionnaire is to ensure its cross-cultural measurement equivalence [8]. Establishing measurement equivalence is especially important for making valid comparisons of HRQOL scores across cultural groups. Measurement invariance evaluates whether the probability of responding to specific items within a measure is the same across the compared groups after controlling for the construct being measured [9]. For the quality of life (QOL) concept, this implies that the same theoretical construct is measured in the same way in two or more cultures [5]; and when differences in the QOL construct are found across different cultural or language groups, these differences should more likely be true differences between the groups rather than differences in the perception and interpretation of the items as a function of health status [10, 11].

To date, although many generic QOL questionnaires were developed for paediatric use, very few have documented cross-cultural measurement equivalence. For example, the KIDSCREEN is the first questionnaire with sound psychometric properties that was simultaneously developed across several European countries [12, 13]. Two recent studies provided evidence for the cross-cultural measurement equivalence of the KIDSCREEN using a short, self-report version, which was evaluated across multiple language versions [14, 15]. In addition, for the Pediatric Quality of Life Inventory Version 4.0™ (PedsQL™ 4.0) [16] self-report, the findings of a US study revealed that children and adolescents across four ethnic groups interpreted items and factors in a similar manner regardless of their ethnicity, which is indicative of cross-cultural measurement equivalence [11].

The KINDL [17], PedsQL™ 4.0 [16] and KIDSCREEN [12–14] are the most frequently used questionnaires in paediatric QOL studies. All three questionnaires are generic, multidimensional and originally designed to assess QOL in children and adolescents. However, a systematic review has shown that many paediatric instruments do not reflect the QOL definitions provided by the World Health Organization (WHO) [2]. By mapping the instruments to the International Classification of Functioning, Disability and Health for Children and Adolescents (ICF-CY), the researchers found that the same perspective of QOL cannot be obtained from each instrument [2]. The KIDSCREEN particularly measures HRQOL, the PedsQL™ 4.0 covers a wide definition of functioning, disability, and health (FDH), and the KINDL is an appropriate instrument to measure FDH with some HRQOL features. Therefore, researchers should be clear about whether they intend to measure HRQOL (which includes the expectations, standards, or concerns about health domains) or FDH (i.e., the performance, capacity, presence/absence, frequency and severity of psychosocial domains) [2].

The KINDL has previously been translated and culturally adapted for use in various languages and countries, and has shown acceptable psychometric properties [18–25]. Although translated forms of the KINDL may achieve linguistic equivalence (i.e., each translated word appropriately matches its translated counterpart), the additional aspect of equivalence (i.e., measurement equivalence) should be assessed to ensure that the KINDL operates in exactly the same way in different countries, especially in those with diverse cultural backgrounds.

To date, the measurement equivalence of the KINDL has only been evaluated between Iranian children and their parents [21]. That study examined whether children and their parents responded consistently to the KINDL items, and found that the KINDL failed to produce equivalent measurement between the two groups [21]. Although previous research has shown that the KIDSCREEN-27 can be used for cross-cultural comparison between Iranian and Serbian samples [15], no such evidence has been provided on whether the KINDL is an invariant measure for cross-cultural research. The aim of the present study was to evaluate the cross-cultural measurement equivalence of the KINDL using two samples of Iranian and Serbian children and adolescents, alongside their parents. Iranian and Serbian samples were selected for the following reasons. Iran and Serbia are both countries with developing economies, but the former is an Asian country and the latter a European one. Moreover, there are significant religious and cultural differences between them. While children, adolescents, and their parents in Iran are mostly Muslims, their peers in Serbia are mostly Orthodox Christians, which has led to variations in ways of life, beliefs, traditions and laws, and possibly to different health perceptions. Finally, we aimed to compare HRQOL scores across the two countries. Where substantive differential item functioning (DIF) was evident, we followed a removing and retaining strategy to determine whether HRQOL scores, and the resultant conclusions, would be altered in a meaningful way with and without DIF items.

Methods

Participants

The present study used data from children and adolescents aged eight to 16 from Serbia and Iran who had participated in two previous studies [21, 23, 24]. In brief, the Iranian sample included 1086 school children and adolescents (62.4 % boys, 37.6 % girls) and 1061 parents, while the Serbian sample consisted of 756 children and adolescents (50.1 % boys, 49.1 % girls) and 618 parents. The mean (±SD) age of the Iranian and Serbian samples was 13.32 ± 2.26 and 12.23 ± 1.85 years, respectively. In both countries the samples were selected from urban areas. The Iranian sample was selected from Shiraz, the largest city in southern Iran, and the Serbian sample was selected from Belgrade, the capital of Serbia. Moreover, almost all children in both cultures were healthy (more than 90 % of Iranians and 95 % of Serbians), and were recruited from public schools. In the present study, data from two separate studies [21, 23, 24], which were not originally designed for cross-cultural comparison, were merged to conduct cross-cultural research. Hence, other sociodemographic variables including parents’ income, education, age, and gender were not available in both samples.

While Iranian children, adolescents and their parents completed the questionnaire at home, almost all of the participants in the Serbian study completed the questionnaire in school. In both samples, children, adolescents and their parents signed the informed consent forms, and they were instructed in detail on how to complete the KINDL. The translation and linguistic validation process for the Iranian and Serbian versions of the KINDL, along with other relevant details, is fully described elsewhere [21, 23, 24].

Questionnaire

The KINDL [17] questionnaire includes a child self-report and a parent proxy-report. Both KINDL reports contain 24 items in six subscales: physical well-being, emotional well-being, self-esteem, family, friends and school. There is a version for eight- to 12-year-old children (Kid-KINDL) and one for 13- to 16-year-old adolescents (Kiddo-KINDL). In the present study, self- and proxy-reports of the Kid-KINDL were completed by 353 and 345 Iranian, and 356 and 282 Serbian children and their parents, respectively. Self- and proxy-reports of the Kiddo-KINDL were completed by 733 and 716 Iranian, and 400 and 336 Serbian adolescents and their parents, respectively. Both age versions had the same items and scoring format; the only difference was slightly different wording where an activity was more relevant to one age group than the other (e.g., item 17 in the child version is “played with friends”, while in the adolescent version the same item is “doing things with friends”). All participants responded to the items on a 5-point Likert scale (0 = never, 1 = seldom, 2 = sometimes, 3 = often and 4 = all the time). For ease of interpretation, rating scale categories were reversed so that higher categories indicated better QOL. Raw subscale scores were formed as the mean of the item values and were then transformed onto a 0–100 scale, with higher scores indicating better QOL.
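
The scoring steps described above can be illustrated with a minimal Python sketch. It assumes that the 0–100 transformation is a linear rescaling of the mean item value (mean / 4 × 100) and that reversal applies only to items whose wording runs against better QOL. The function names and the choice of which item positions are reversed are hypothetical and are not taken from the KINDL manual.

```python
from statistics import mean

MAX_CATEGORY = 4  # items are rated 0 (never) to 4 (all the time)


def reverse_item(response):
    """Reverse a 0-4 rating so that a higher category means better QOL."""
    return MAX_CATEGORY - response


def subscale_score(responses, reversed_positions=()):
    """Return a 0-100 subscale score from the item ratings of one subscale.

    responses: item ratings (0-4) of one respondent for one subscale.
    reversed_positions: zero-based positions of the items to reverse
        (hypothetical here; the actual positions depend on the instrument).
    """
    recoded = [
        reverse_item(r) if i in reversed_positions else r
        for i, r in enumerate(responses)
    ]
    raw = mean(recoded)              # raw subscale score = mean item value
    return 100 * raw / MAX_CATEGORY  # assumed linear rescaling to 0-100


# Toy respondent: the last two items are treated as reverse-scored.
print(subscale_score([4, 3, 0, 1], reversed_positions={2, 3}))
```

In this toy case the recoded ratings are 4, 3, 4 and 3, so the raw mean is 3.5 and the transformed score is 87.5. Missing responses are not handled in this sketch.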

Statistical Analysis

In the present study, ordinal logistic regression (OLR) as a proportional odds model described by Swaminathan and Rogers (hereinafter referred to as the SR criterion) was used to assess the measurement equivalence of the KINDL across Iranian and Serbian children and adolescents, alongside their parents [26, 27]. Each of the 24 items was treated as the response variable in an ordinal regression model. Testing for the presence of DIF (uniform and non-uniform) under the OLR model, adjusted for the child’s gender, is based on comparing three different models as follows:

$$\text{Model 1:}\quad \text{logit}\left[ P\left( Y \le k \right) \right] = \alpha_{k} + \beta_{1} \times \theta + \gamma \times \text{gender}$$
$$\text{Model 2:}\quad \text{logit}\left[ P\left( Y \le k \right) \right] = \alpha_{k} + \beta_{1} \times \theta + \beta_{2} \times \text{country} + \gamma \times \text{gender}$$
$$\text{Model 3:}\quad \text{logit}\left[ P\left( Y \le k \right) \right] = \alpha_{k} + \beta_{1} \times \theta + \beta_{2} \times \text{country} + \beta_{3} \times \theta \times \text{country} + \gamma \times \text{gender}$$

The term θ represents the trait measured by the KINDL instrument, operationalized as the observed sum score, and country is the grouping variable with two levels (Iran and Serbia). Under these models, uniform DIF is detected by comparing the log-likelihood values of Models 1 and 2, and non-uniform DIF by comparing those of Models 2 and 3. In both cases, twice the difference in log-likelihoods is compared to a chi-square distribution with one degree of freedom. Under the SR criterion, even a slight difference in log-likelihood between the models can be statistically significant, given a large enough sample, although such DIF may not be practically or clinically important. In response to this concern, we used the Zumbo and Gelin (ZG) and the Crane, van Belle, and Larson (CvBL) criteria to quantify the magnitude of DIF. According to the ZG criterion, the magnitude of uniform and non-uniform DIF is determined by the difference in pseudo-R² (ΔR²) between Models 1 and 2, and between Models 2 and 3, respectively. ΔR² values less than 0.035, between 0.035 and 0.070, and above 0.070 are classified as negligible, moderate, and large DIF, respectively. Moreover, according to the CvBL criterion, the absolute proportional change in the point estimate of β1 between Models 1 and 2, Δβ1 = |(β1 (Model 1) − β1 (Model 2))/β1 (Model 1)|, is used to identify items with uniform DIF. Based on simulation studies, a 10 % change in β1 is considered a practically meaningful effect. An advantage of the OLR method is that it provides these effect size measures, and including them reduces the false discovery rate in DIF detection [28].

In the present study, a removing and retaining strategy was used to assess the potential impact of DIF on the comparison of HRQOL scores across the two countries [28].
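
As an illustration of how the three criteria operate once the nested models have been fitted, the following Python sketch takes log-likelihoods, pseudo-R² differences and β1 estimates as inputs. It is a minimal sketch under stated assumptions, not the SAS code used in the study: the function names are hypothetical, the boundary values 0.035 and 0.070 are assigned to the moderate class, and a proportional change of exactly 0.10 is treated as meeting the CvBL threshold.

```python
import math


def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood-ratio test of two nested models differing by one parameter.

    Returns the statistic 2 * (logL_full - logL_reduced) and its p-value
    under a chi-square distribution with one degree of freedom, for which
    P(X > x) = erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def zg_magnitude(delta_r2):
    """Classify a pseudo-R-squared difference by the ZG cut-offs."""
    if delta_r2 < 0.035:
        return "negligible"
    if delta_r2 <= 0.070:  # boundary handling is an assumption
        return "moderate"
    return "large"


def cvbl_change(beta1_model1, beta1_model2):
    """Absolute proportional change in beta1 between Models 1 and 2."""
    return abs((beta1_model1 - beta1_model2) / beta1_model1)


def cvbl_uniform_dif(beta1_model1, beta1_model2, threshold=0.10):
    """Flag uniform DIF when beta1 changes by at least 10 %."""
    return cvbl_change(beta1_model1, beta1_model2) >= threshold


# Toy log-likelihoods for Models 1 and 2 of a single item (not study data).
stat, p = lr_test_1df(loglik_reduced=-1005.0, loglik_full=-1000.0)
print(round(stat, 2), zg_magnitude(0.044), cvbl_uniform_dif(1.0, 0.8))
```

The same `lr_test_1df` call serves for non-uniform DIF by passing the log-likelihoods of Models 2 and 3. Keeping the significance test separate from the two effect size rules mirrors the text: an item can be flagged by the SR criterion yet classified as negligible by the ZG criterion.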

Expected item score curves were also used to determine visually the direction and magnitude of DIF across Iranian and Serbian children and adolescents, alongside their parents. This curve is a function of θ and provides a better understanding of uniform and non-uniform DIF. It should be noted that for items with non-uniform DIF the direction of DIF differs along the subscale, so the effect of DIF tends to cancel out at the scale level [29]. Hence, expected item score curves are only depicted for items with uniform DIF. The OLR procedure was implemented in SAS 9.1.
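
An expected item score curve can be derived from the fitted proportional odds model, as the following Python sketch shows. It assumes the parameterization of Models 1 to 3, logit[P(Y ≤ k)] = α_k + β1 θ + β2 country + β3 θ country, with the gender term held at its reference level, and it uses the identity E[Y | θ] = Σ_k P(Y > k). All parameter values are invented for illustration and are not estimates from the study.

```python
import math


def _logistic(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def expected_item_score(theta, alphas, beta1, beta2=0.0, beta3=0.0, country=0):
    """Expected score of one item at trait level theta.

    alphas: increasing thresholds alpha_k for k = 0 .. K-2, where K is the
        number of response categories (four thresholds for a 0-4 item).
    country: 0 for the reference group, 1 for the comparison group.
    """
    linear = beta1 * theta + beta2 * country + beta3 * theta * country
    cumulative = [_logistic(a + linear) for a in alphas]  # P(Y <= k)
    # For scores 0 .. K-1, E[Y] equals the sum over k of P(Y > k).
    return sum(1.0 - c for c in cumulative)


# Hypothetical parameters. beta1 is negative because the model is written
# for P(Y <= k), so a higher theta lowers the chance of a low category.
alphas = [1.0, 2.0, 3.0, 4.0]
for theta in range(0, 101, 25):
    reference = expected_item_score(theta, alphas, beta1=-0.06)
    comparison = expected_item_score(theta, alphas, beta1=-0.06,
                                     beta2=0.5, country=1)
    print(theta, round(reference, 3), round(comparison, 3))
```

With β3 = 0 and β2 ≠ 0 the two curves never cross, which is the uniform DIF pattern. A non-zero β3 allows the curves to cross, which is why non-uniform DIF can change direction along θ.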

Results

DIF Results for Kid-KINDL

The results of the OLR DIF analysis for the child self-report and parent proxy-report of the Kid-KINDL across Iranian and Serbian samples are shown in Table 1. In the child self-reports of the Kid-KINDL, 14 out of 24 (58 %) items were flagged with DIF according to the SR criterion. Of these 14 items, nine (64 %) displayed uniform DIF and five (36 %) displayed non-uniform DIF. However, based on the ZG criterion, five items exhibited uniform DIF, where the change in R-squared (ΔR²) associated with these items ranged from 0.044 to 0.15. Moreover, according to the ZG criterion, item 1 in the school subscale exhibited non-uniform DIF (ΔR² = 0.187). Additionally, four items had uniform DIF according to the CvBL criterion, where the change in β1 coefficients (Δβ1) for these items ranged from 0.1 to 0.291. As shown in Fig. 1, item 3 in the physical subscale and item 2 in the emotional subscale had uniform DIF, and their effects could not be cancelled out at the domain level by uniform DIF items in the opposite direction. Moreover, for items 1, 3 and 4 with uniform DIF in the self-esteem subscale, expected item scores were in one direction and could not cancel one another out. In contrast, expected scores for items 2 and 4 in the family subscale, and items 1 and 2 in the friend subscale, were in opposite directions, indicating that the effect of uniform DIF could be cancelled out for these subscales.

Table 1 The results of the ordinal logistic regression DIF analysis on the Kid-KINDL across Iranian and Serbian children (8–12)

Fig. 1 Expected item score function of uniform DIF for Iranian (solid line) and Serbian (dashed line) children aged 8–12 (Kid-KINDL)

In the parent proxy-reports of the Kid-KINDL, 20 out of 24 (83 %) items were flagged with DIF according to the SR criterion; 13 (65 %) items displayed uniform DIF and seven (35 %) items displayed non-uniform DIF. However, based on the ZG criterion, six items exhibited uniform DIF, where the difference in R-squared (ΔR²) for these items varied from 0.036 to 0.173. According to the CvBL criterion, six items had uniform DIF, where changes in β1 coefficients associated with these items varied from 0.109 to 0.486. As shown in Fig. 2, of the three items with uniform DIF in the physical well-being subscale, the expected item scores for items 2 and 3 went in one direction, and item 4 went in the opposite direction, indicating that the effect of uniform DIF could not be cancelled out for this subscale across the two samples. A similar pattern can also be observed in the emotional, self-esteem, and friend subscales. In contrast, there was just one item with uniform DIF in the family subscale; consequently, its effect could not be cancelled out.

Fig. 2 Expected item score function of uniform DIF for Iranian (solid line) and Serbian (dashed line) parents of children aged 8–12 (Kid-KINDL)

DIF Results for Kiddo-KINDL

The results of the OLR DIF analysis for the child self-report and parent proxy-report of the Kiddo-KINDL are shown in Table 2. In the child reports, 20 out of 24 (83 %) items were identified with DIF according to the SR criterion. Of these 20 items, 13 (65 %) displayed uniform DIF and seven (35 %) displayed non-uniform DIF across Iranian and Serbian adolescents. According to the ZG criterion, items 1 and 2 in the self-esteem subscale displayed uniform DIF, where the changes in R-squared associated with these items were 0.051 and 0.079, respectively. Moreover, according to the CvBL criterion, item 1 in the self-esteem subscale and item 2 in the friend subscale had uniform DIF. The changes in β1 coefficients associated with these items were 0.211 and 0.114, respectively. As shown in Fig. 3, expected item scores for items 2 and 3 went in one direction, and item 1 went in the opposite direction, indicating that the effect of uniform DIF could not be cancelled out for this subscale across Iranian and Serbian adolescents. Moreover, for items 1 and 3 with uniform DIF in the friend subscale, expected scores were in one direction and could not cancel one another out. Of the four items with uniform DIF in the physical well-being subscale, item 2 showed DIF in one direction, whereas items 1, 3 and 4 showed DIF in the opposite direction; hence, they could not cancel each other out. In contrast, items 2 and 3 in the family subscale were in opposite directions, indicating that the effect of uniform DIF could be cancelled out for this subscale. Moreover, in the friend and school subscales, only items 1 and 3 had uniform DIF, respectively; hence, their effects could not be cancelled out at the scale level.

Table 2 The results of the ordinal logistic regression DIF analysis on the Kiddo-KINDL across Iranian and Serbian children (13–18)

Fig. 3 Expected item score function of uniform DIF for Iranian (solid line) and Serbian (dashed line) children aged 13–18 (Kiddo-KINDL)

In the parent proxy-reports, the SR criterion showed that 20 out of 24 (83 %) items were flagged with DIF between Iranian and Serbian parents. Of these items, eight (40 %) exhibited non-uniform DIF and 12 (60 %) exhibited uniform DIF. According to the ZG criterion, five items exhibited uniform DIF, where the difference in R-squared (ΔR²) for these items ranged from 0.042 to 0.155. Based on the CvBL criterion, five items had uniform DIF, where changes in β1 coefficients associated with these items varied from 0.106 to 0.545.

As shown in Fig. 4, in the physical subscale, item 3 showed DIF in one direction, whereas item 4 showed DIF in the opposite direction; hence, they cancelled each other out. In contrast, expected scores for items 1 and 2 in the emotional well-being subscale were in the same direction and could not cancel each other out. Moreover, of the four items with uniform DIF in the self-esteem subscale, items 1, 3, and 4 went in one direction, and item 2 went in the opposite direction, indicating that the effect of DIF could not be cancelled out at the scale level. Similarly, in the friend subscale, the expected item scores for items 2 and 3 went in one direction, and item 1 went in the opposite direction, indicating that the effect of uniform DIF could not be cancelled out for this subscale across Iranian and Serbian parents. There was only one item with uniform DIF in the school subscale; consequently, its effect could not be cancelled out at the scale level.

Fig. 4 Expected item score function of uniform DIF for Iranian (solid line) and Serbian (dashed line) parents of children aged 13–18 (Kiddo-KINDL)

As shown in Table 3, Iranian children and their parents rated the child’s QOL significantly lower than their Serbian counterparts in all domains, except for the physical and friend subscales in the child self-reports for 8–12 year-old children, as well as the school subscale in the parent proxy-reports for 8–12 year-old children. To assess the potential impact of items with DIF on the comparison of HRQOL scores across the two countries, items with practically important uniform DIF were removed from the subscales of the KINDL. Our findings revealed that ignoring or accounting for DIF items in the subscales had no considerable effect on group differences, except for the family subscale in the child self-report of the Kid-KINDL. After removing item 1 “I felt fine at home” from the family subscale in the child self-report of the Kid-KINDL, the difference in mean scores between the two countries was no longer statistically significant.

Table 3 Comparison of QOL subscale scores of the KINDL across Iranian and Serbian samples
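
The logic of the removing and retaining comparison can be sketched in Python as follows. The sketch only contrasts the group difference in mean subscale scores with all items retained against the difference after dropping a flagged item. The significance test applied in the study is not specified in this section and is therefore omitted. The function names are hypothetical, the 0–100 rescaling (mean / 4 × 100) is an assumption, and the responses are toy values, not study data.

```python
from statistics import mean


def mean_subscale_score(group, keep):
    """Mean 0-100 subscale score of a group using only the items in keep.

    group: list of respondents, each a list of recoded item ratings (0-4).
    keep: zero-based positions of the items retained in the subscale.
    """
    return mean(100 * mean(resp[i] for i in keep) / 4 for resp in group)


def group_difference(group_a, group_b, keep):
    """Difference in mean subscale scores between two groups."""
    return mean_subscale_score(group_a, keep) - mean_subscale_score(group_b, keep)


# Toy four-item subscale; the first item is treated as the flagged DIF item.
group_a = [[4, 3, 4, 3], [3, 3, 4, 4]]
group_b = [[2, 3, 4, 3], [1, 3, 4, 4]]

retained = group_difference(group_a, group_b, keep=[0, 1, 2, 3])
removed = group_difference(group_a, group_b, keep=[1, 2, 3])
print(round(retained, 2), round(removed, 2))
```

In this toy example the two groups differ only on the flagged item, so the difference of 12.5 points disappears once that item is removed. With short subscales, each removed item changes the content that the score represents, so the retained and reduced scores should be compared with that in mind.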

Discussion

The importance of cross-cultural QOL comparisons is that they provide insight into the effects of cultural disparity pertaining to QOL scores across different countries or languages [30, 31]. However, a prerequisite for valid cross-cultural QOL comparisons is a questionnaire that demonstrates measurement invariance, so that subjects across different cultures perceive and respond to the questionnaire’s items in the same or almost the same way [5].

The results of the present cross-cultural research showed that Iranian and Serbian children, as well as their parents, responded differently to the KINDL items. According to the SR criterion, the majority of items in both the self- and proxy-reports of the Kid-KINDL and Kiddo-KINDL versions exhibited DIF. However, applying effect size measures such as ΔR² and Δβ1 substantially decreased the detection rate of DIF. In addition, our findings revealed that the effect of DIF could not be cancelled out at the scale level for some subscales of the self- and proxy-reports. This finding implies that the direction of DIF for some items consistently favoured one group over another, resulting in significant scale-level bias. In the self-reports, the effect of DIF could not be cancelled out for the physical, emotional and self-esteem subscales of the Kid-KINDL, or for the physical, emotional, self-esteem, friend and school subscales of the Kiddo-KINDL. Moreover, in the parent proxy-reports, cancellation could not be obtained for the physical, emotional, self-esteem, family, and friend subscales of the Kid-KINDL, or for the emotional, self-esteem, friend, and school subscales of the Kiddo-KINDL. Nevertheless, it should be noted that full DIF cancellation at the subscale level rarely occurs in practice and should be interpreted with caution [32]. Therefore, we should be cautious about comparing HRQOL subscale scores across Iranian and Serbian samples even for subscales that fulfilled the DIF cancellation criteria. Considered together, these findings indicate that the KINDL self- and proxy-reports do not produce invariant QOL measurements across Iranian and Serbian samples and cannot be used for cross-cultural comparisons, at least according to the findings of the present study.

In this light, although Serbian children and their parents rated children’s QOL better than their Iranian counterparts across the domains, it is likely that the differences observed in QOL scores between the two countries reflect biased results [33]. It should be noted that the techniques used to examine the impact of DIF vary across DIF detection methods. In the present study, a removing and retaining strategy was used to determine how much mean group differences in subscale scores change with and without the inclusion of items with DIF. In QOL instruments such as the KINDL, where scales are often short, removing a number of items with DIF from the subscales may be a serious threat to the validity of the estimated subscale scores [28]. In response to this concern, we removed only items with practically important uniform DIF. Under this strategy, our findings did not change in a meaningful way whether items with DIF were ignored or accounted for.

An alternative way to test the cross-cultural equivalence of the KINDL across two countries is structural equation modeling (SEM). A special case of SEM for detecting DIF is multiple-group confirmatory factor analysis (MGCFA). With MGCFA, various types of measurement invariance hypotheses, including configural, metric, and scalar equivalence, can be tested. Configural invariance investigates whether an underlying construct of interest (in the present study, six factors) is measured by the same set of items across two groups [34]. If the configural test fails by showing poor fit indices, the measurement invariance hypothesis is rejected and consequently latent means cannot be compared across groups [34].

Although there are no clear data about which factors lead to different interpretations of QOL items, possibilities include specific factors, such as values and perceptions about health. Different perceptions of well-being and functioning characterize QOL items across cultures. For example, the KINDL physical well-being domain generally represents how one feels about physical abilities. It could be perceived as identifying whether one can physically function when not directly asked to do so, which signifies the physical functioning domain. Different values may be placed on QOL items across cultures. The KINDL family domain measures how a child functions within the family and is probably valued differently by a child from a Persian family that is more cohesive with more family members than, for example, by a child from a Western family with a single parent. More general factors may include intrapersonal characteristics (e.g., ability to cope or independence), interpersonal characteristics (e.g., social relatedness or social support), or extra-personal characteristics (e.g., living conditions or financial resources). All of these, and many other factors, deserve further exploration by studies addressing cross-cultural measurement equivalence.

Since this is the first cross-cultural study aimed at assessing the measurement equivalence of the KINDL self- and proxy-reports across two countries, there was no similar study in the literature for comparison. However, our findings differed from those of two previous studies, which reported substantial evidence for the cross-cultural equivalence of the KIDSCREEN-27 across 13 European countries [14, 15] and of the PedsQL™ 4.0 across four racial/ethnic subgroups in the United States [11]. Possible explanations for these differences include the fact that the KIDSCREEN was originally developed across various cultures from the initial phases of the questionnaire’s development, the different statistical methods employed for invariance testing, and the fact that only original language versions were employed. For example, in the US study, the four ethnic groups were compared within the same country and only children who had completed the PedsQL™ 4.0 in English were included in the study. Moreover, our findings differed from those of previous research [15], which revealed that the KIDSCREEN-27 was equivalent across Iranian and Serbian children, alongside their parents. However, the samples selected in the present research are different from those of the KIDSCREEN study. This indicates that DIF detection can vary substantially across different samples and from one measure to another.

One of the strengths of the present study is that the results of the DIF analysis were not affected by the child’s age within and between cultures. In the present research, the Kid-KINDL (for 8–12 year-olds) and the Kiddo-KINDL (for 13–16 year-olds) were not combined to obtain a larger sample size for DIF analysis. If the two forms of the KINDL were merged while they had not been equivalent within one or both culture/language groups, it might have distorted the interpretation of the observed DIF at the national/language group level. Moreover, the effect of child’s gender was controlled by taking it into account as a covariate in the OLR model, and the child’s age was automatically matched across two groups. Hence, child’s age and gender were both balanced between Iranian and Serbian samples.

However, the present study had a number of potential limitations that should be taken into consideration when interpreting the results. First, because almost all children in the Iranian and Serbian samples were healthy, the findings may not generalize to children with chronic conditions. Second, differences in the sample composition of the two countries with respect to important background variables such as parents’ income, education, age, and gender can complicate the interpretation of observed DIF across the two cultures. Because background variables (except gender and age) were not available in both samples, it was impossible to add them to the OLR model to explore whether they made a difference in the DIF analysis results. This is especially important when trying to differentiate between true cultural disparities and differences in socio-demographic composition. Finally, it was not clear whether uncontrolled variables such as parents’ and children’s reading skills, social desirability [35, 36] and acquiescence [37] contributed to the observed discrepancies across Iranian and Serbian samples.

Summary

The present study revealed that Iranian and Serbian children, as well as their parents, perceived and interpreted the meaning of almost all KINDL items differently. Although the directions of DIF showed that the effect of DIF could not be cancelled out at the scale level for some domains, further analysis revealed that removing or retaining the items with uniform DIF in the subscales did not change our findings significantly when comparing HRQOL scores between the two countries. However, these findings indicate that the self- and proxy-reports of the Kid-KINDL and Kiddo-KINDL have to be revised in order to be used for cross-cultural QOL comparisons. To gain better insight into the measurement non-invariance of the KINDL in cross-cultural comparisons, further research is needed to test whether the Iranian and Serbian versions of the KINDL are equivalent to the original version and other available versions.