Introduction

Participation is one of the most important outcomes of rehabilitation for children and adolescents with a disability. Successful participation can lead to increased emotional and psychological well-being and ultimately improved quality of life (QoL) [1–3]. Participation is essential for the development of skill competencies, socialisation, exploring personal interests and the enjoyment of life [4]. For example, without the opportunity to participate in leisure activities ‘people are unable to explore their social, intellectual, emotional, communicative and physical potential and are less able to grow as individuals’ [5]. Although several measures of participation are available for children and adolescents with a disability, there is no general consensus on the definition of ‘participation’. A clear definition of the construct is essential to determine the validity of questionnaires, and for meaningful selection and usage of these instruments [6].

Since the development of the International Classification of Functioning, Disability and Health for Children and Youth (ICF-CY), several instruments have been modelled on the World Health Organisation’s definition of participation [7]. The ICF-CY defines participation as ‘a person’s involvement in a life situation’. This indicates that participation represents the societal perspective of functioning and includes the concept of involvement, which is further defined by the ICF-CY as ‘taking part, being included or engaged in an area of life, being accepted or having access to needed resources’. The authors add that ‘the concept of involvement should also be distinguished from the subjective experience of involvement (a sense of belonging or satisfaction with the extent of one’s involvement)’. Participation is part of the ‘Activities and Participation’ category of the ICF-CY. Nine domains are included in its assessment: (1) ‘learning and applying knowledge’, (2) ‘general tasks and demands’, (3) ‘communication’, (4) ‘mobility’, (5) ‘self-care’, (6) ‘domestic life’, (7) ‘interpersonal interactions and relationships’, (8) ‘major life areas’, and (9) ‘community, social and civic life’. However, by combining the two terms, it is often unclear what constitutes ‘activity’ and what ‘participation’. Where participation refers to the involvement in a life situation, activity is regarded as ‘the execution of a task’ [7]. Researchers have tried to further clarify this distinction [8–15]. According to Whiteneck and Dijkers [8], ‘activities’ are tasks performed individually. Activities focus on the functional performance of an individual that can be done in solitude, while participation is social role performance as a member of society, with or for others. Activities tend to be straightforward and unambiguous, while participation tends to be more complex and will generally involve the performance of several activities. Generally agreeing with this operationalisation, Eyssen et al.
[9] defined participation as ‘performing roles in the domains of social functioning, family, home, financial, work/education, or in a general domain’. They distinguished activities from participation by stating that participation requires a social context, including both environmental factors and other people. However, McConachie et al. [10] stated that participation might be defined or operationalised differently when children are involved. They argue that the following life situations must be covered by an instrument that aims to measure participation in children and adolescents: ‘participation essential for survival’, ‘participation in relation to child development’, ‘discretionary participation’ and ‘educational participation’. The distinction that participation is more likely to involve other people and to be more environmentally dependent may not be beneficial for children. When performing activities, children tend to be in interaction with others (caregivers) or supported in their performance by environmental supports and modifications [5]. McConachie et al. [10] agree with this notion and add: ‘it may not be practical to place a clear boundary around the child when describing their participation; survey instruments should encompass the notion that for some purposes the child participates as part of a family rather than as an individual’. In accordance with previous research, a useful strategy seems to be to use the framework of the ICF-CY and divide the current classification into two mutually exclusive (non-overlapping) lists of activity domains and participation domains [8].

To contribute to evidence-based instrument selection, the measurement properties of questionnaires evaluating participation in children and adolescents with a disability have to be evaluated. Several reviews have been conducted [16–19]; most of these reviews have focused on instruments applied in specific populations (e.g. acquired brain injury, cerebral palsy). Morris et al. [16] demonstrated that all selected measures have sound psychometric properties; however, when selecting instruments, no distinction was made between the constructs of ‘activity’ and ‘participation’. Moreover, four of the included instruments purported to measure QoL or general health status. Sakzewski et al. [17] showed that most participation instruments included in their study had adequate reliability and validity, but limited data were available to determine the responsiveness of several questionnaires; they also stated that ‘a combination of assessments is required to capture participation of children in home, school and community environments’. In their review, Ziviani et al. [18] noted that there is a paucity of information on the psychometric properties of participation instruments; moreover, each systematic review used a different definition of participation for instrument selection.

Besides a lack of consensus on the definition of participation, previous reviews of participation measures have also lacked the application of an adequate tool to critically appraise the methodological quality of the included studies [16–19]. This is required in order to (a) draw valid and reliable conclusions about the methodology applied, (b) reliably infer quality assessments and (c) recommend specific participation questionnaires. Several authors have proposed guidelines for how the measurement properties of health status questionnaires should be assessed, and criteria for what constitutes good measurement properties [20–23]. The Scientific Advisory Committee (SAC) defined a set of eight attributes, together with criteria to perform instrument assessments (SACMOS) [20]. The attributes encompassed properties such as ‘conceptual and measurement model’, ‘reliability’, ‘validity’, ‘responsiveness’ and ‘cultural and language adaptations’. The criteria mainly focussed on information the author should provide in an article. In addition, a few criteria are offered on acceptable reliability coefficients and the standard error of measurement (SEM). Another tool for the standardised assessment of patient-reported outcome measures (PROs) is Evaluating the Measurement of Patient-Reported Outcomes (EMPRO) [21]. It also addresses eight attributes, e.g. ‘reliability’, ‘validity’, ‘responsiveness’, ‘interpretability’ and ‘administration burden’. Similar to the SACMOS, the criteria corresponding to these attributes mainly consist of information authors need to report and some suggestions on acceptable values for reliability measures. Andresen reported another standard, offering additional criteria to assign a grade to the quality of the measurement properties (A: high standard; B: adequate standard; C: low or inadequate standard) [22].
All these guidelines offer some insight into the standardised assessment and appraisal of measurement properties; however, they all lack comprehensive, detailed and consensus-based descriptions of what constitutes an adequate measurement property. Although some recommendations were made about acceptable values for Cronbach’s alpha, the intraclass correlation coefficient (ICC) or the SEM, no specific requirements were established for contributing factors such as sample size. To assess construct validity, hypothesis testing is often recommended by these standards. However, no information is provided about the number of hypotheses that need to be formulated or to what extent these hypotheses should be confirmed.
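For orientation, the two quantities most often given threshold values in these standards, Cronbach's alpha and the SEM, can be computed as follows (an illustrative sketch; the function names and the numpy implementation are ours, not part of any cited guideline):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance per item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the sum scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def sem(total_sd, reliability):
    """Standard error of measurement from a scale SD and a reliability estimate (e.g. an ICC)."""
    return total_sd * np.sqrt(1 - reliability)
```

For example, a scale with a standard deviation of 10 points and an ICC of 0.84 has an SEM of 4.0 points.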

The international ‘COnsensus-based Standards for the selection of health status Measurement Instruments (COSMIN) checklist’ fills this gap. It was developed by more than 50 international professionals and can be applied to evaluate the methodological quality of studies on measurement properties [23]. It offers a standardised assessment method to comprehensively appraise the validity, reliability, responsiveness, generalisability and interpretability and assign a score from poor to excellent to each measurement property. The checklist aims to improve evidence-based instrument selection.

This is the first study to (1) select measures of participation developed for children and adolescents (aged 0–18 years) with a disability according to a clearly pre-defined operationalisation and (2) critically appraise the measurement properties of these measures using the standardised approach of the COSMIN checklist.

The study could contribute to instrument selection, improving the assessment of participation in clinical practice.

Methods

Construct definition

We adhered to the following definition of participation: ‘Social role performance in the domains of interpersonal relations (e.g. with family or friends), education/employment, recreation and leisure and community life as a member of society in interaction with others’. This definition is in concordance with the method of Whiteneck and Dijkers [8], who also divided the Activity and Participation subscales of the ICF-CY into two mutually exclusive lists. By removing the ICF-CY items covered by ‘(d660) assisting others’ from domain (6) ‘domestic life’, domain six consists solely of activities. In order not to discard this subsection of items, it was added as a new ‘major life area’ to domain (8), entitled ‘care giving and assisting others’. The final three domains of the ICF-CY, (7) ‘interpersonal interactions and relationships’, (8) ‘major life areas’ and (9) ‘community, social and civic life’, are considered ‘participation’.
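This split can be summarised as a simple lookup (an illustrative sketch of the operationalisation above; the names and structure are ours, not part of the ICF-CY):

```python
# ICF-CY 'Activities and Participation' domains, divided into two
# mutually exclusive lists following the operationalisation above.
ACTIVITY_DOMAINS = {
    1: "learning and applying knowledge",
    2: "general tasks and demands",
    3: "communication",
    4: "mobility",
    5: "self-care",
    6: "domestic life",  # after removing the '(d660) assisting others' items
}
PARTICIPATION_DOMAINS = {
    7: "interpersonal interactions and relationships",
    8: "major life areas",  # now also holds the former d660 items as
                            # 'care giving and assisting others'
    9: "community, social and civic life",
}

def classify(domain_number):
    """Return 'activity' or 'participation' under this operationalisation."""
    if domain_number in ACTIVITY_DOMAINS:
        return "activity"
    if domain_number in PARTICIPATION_DOMAINS:
        return "participation"
    raise ValueError("unknown ICF-CY domain")
```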

Search strategy

The following computerised databases were searched to obtain a comprehensive result: Medline (1966–Oct 2013), EMbase (1974–Oct 2013) and PsycINFO (1806–Oct 2013). Index terms such as ‘participation’, ‘personal autonomy’ and ‘role functioning’ were combined with terms identifying minors, for example ‘infant’, ‘child’, ‘adolescent’ and ‘schoolchild’. To identify articles assessing measurement properties, an adaptation of the sensitive search filter for measurement properties was used, which included terms such as ‘reliability’, ‘validity’, ‘generalisability’ and ‘psychometric’ [24]. The full search strategy is available upon request. Reference lists of included articles were screened to identify additional relevant studies.

Selection criteria

Articles were selected for inclusion if they were full-text original articles (i.e. not reviews or manuals) aiming to develop a self-report or parent proxy-report measure of participation in children and adolescents (0–18 years old) with a disability, or to evaluate one or more of the measurement properties of such a measure. The measure had to meet the definition of participation stated under the heading ‘construct definition’. Disability was defined according to the definition of the WHO: ‘any restriction or lack (resulting from any impairment) of ability to perform an activity in the manner or within the range considered normal for a human being’ [7]. Articles had to be written in English (although the developed/evaluated questionnaires could be in a different language). Articles were excluded when the measurement instrument consisted predominantly of activity items, according to our preset definition. In addition, because QoL and ‘adaptive behaviour’ were not considered to measure the same underlying construct as participation, measures assessing these constructs were excluded. Two researchers (LR, RvN) independently screened the titles, abstracts and full-text articles. For the updated search strategy, one researcher (LR) and a research assistant independently screened the additional studies by title, abstract and full text. A third researcher (GvR) was consulted when consensus could not be reached through discussion.

Methodological quality assessment and quality criteria

The COSMIN checklist (with a four-point scale) was used to evaluate and calculate overall methodological quality scores per study on a measurement property [25]. It assesses measurement properties on three dimensions: reliability, validity and responsiveness. In addition, general requirements for studies that applied Item Response Theory (IRT) models and evaluated interpretability and generalisability were appraised when applicable. Items can be rated excellent, good, fair or poor. Two researchers (LR, CvdZ) independently rated the included articles per measurement property. An overall score was established by taking the lowest rating of the items in a box. A third researcher (RvN) helped reach consensus when necessary. Although the overall inter-rater agreement of the checklist is adequate, agreement on individual items is low; this is thought to be due to the subjective judgement required and raters being accustomed to different standards [26]. Therefore, decisions were made a priori regarding the appraisal of items. Quality criteria by Terwee et al. [27] were used to rate the quality of the evaluated measurement properties. The COSMIN definitions of the measurement properties and the quality criteria are described in Table 1.
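The ‘lowest rating counts’ rule amounts to taking the minimum over an ordered rating scale; a minimal sketch (the rating order follows the COSMIN four-point scale, but the code itself is ours):

```python
# Order of the COSMIN four-point scale, worst to best.
RATING_ORDER = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}

def box_score(item_ratings):
    """Overall methodological quality of a COSMIN box: the lowest rating
    among its items ('worst score counts')."""
    return min(item_ratings, key=lambda r: RATING_ORDER[r])
```

For instance, a box whose items are rated ‘excellent’, ‘good’ and ‘fair’ scores ‘fair’ overall.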

Table 1 COSMIN definitions and operationalisation of measurement properties (adapted from www.cosmin.com)

Synthesis of results

The evidence on the measurement properties of the questionnaires was synthesised by combining the results, taking into consideration (a) the number of studies, (b) the methodological quality of the studies and (c) the consistency of their results. The overall score can be rated ‘positive’, ‘negative’ or ‘indeterminate due to conflicting evidence’. Criteria developed by Terwee et al. were used to determine the overall score of the measurement properties per questionnaire [27].

Results

The search strategy identified 3,977 unique publications, of which 277 articles were selected after title and abstract screening. The full text of these 277 articles was assessed, which resulted in the exclusion of 260 articles. Reference checking identified five additional clinimetric studies. In total, 22 articles evaluating the measurement properties of eight different questionnaires were included in the present study (Fig. 1). No other articles evaluating the measurement properties of the eight included participation questionnaires were available. The search strategy identified another 53 measurement instruments that purported to measure participation or contained several items that could be considered participation items; these instruments were not included in the review, because they did not comprehensively evaluate participation according to the definition used in this review. However, because they do provide insight into our operationalisation of the construct ‘participation’, these questionnaires are presented in Appendix A of ESM: ‘Characteristics of the excluded measurement instruments: construct clarification’. The general characteristics of the included studies and participation measures are presented in Tables 2 and 3, respectively. The methodological quality of each study per measurement property is presented in Table 4. The synthesis of results for each questionnaire and the quality of its measurement properties is presented in Table 5 (+++/−−− = strong evidence of a positive/negative result). The results for each measure are discussed separately below.

Fig. 1

Flowchart search strategy and article selection

Table 2 Characteristics of the included studies
Table 3 General characteristics of the selected participation measures
Table 4 Methodological quality of each study per questionnaire
Table 5 Result synthesis of measurement properties per questionnaire

Assessment of Preschool Children’s Participation (APCP)

No studies were found examining the measurement error, content validity and structural validity of the APCP. The evaluation of the internal consistency and the responsiveness was of poor methodological quality. Hypothesis testing for the English version of the APCP showed significant differences in diversity and intensity scores for gender, age and income [29, 30]. Moderate positive correlations were found between baseline diversity and intensity scores on the APCP and the PEDI, the WeeFIM and the GMFM-66 (r = 0.51–0.78; r = 0.46–0.82; r = 0.51–0.77, respectively) [29, 30]. With regard to the interpretability of the English questionnaire, estimates were provided for the minimal detectable change (MDC) and minimal clinically important difference (MCID) [30]. The MDC95 values for diversity scales of the APCP were play (PA): 5.1 %, skill development (SD): 2.5 %, active physical recreation (AP): 7.8 %, social activities (SA): 16.7 %, and total score: 3.8 %. The MDC95 values for intensity scales were PA: 0.6, SD: 0.1, AP: 0.5, SA: 0.7, and total score: 0.2. The MCID was estimated based on anchor- and distribution-based approaches. The anchor-based MCID values for diversity scales were PA: 16.7 %, SD: 19.4 %, AP: 11.0 %, SA: 16.5 %, and total score: 16.3 %. The anchor-based MCID values for intensity scales were PA: 1.1, SD: 1.2, AP: 0.8, SA: 0.9, and total score: 1.0. The distribution-based MCID values for diversity scales were PA: 11.7 %, SD: 11.4 %, AP: 11.0 %, SA: 9.6 %, and total score: 10.1 %. The distribution-based MCID values for intensity scales were PA: 0.7, SD: 0.7, AP: 0.6, SA: 0.4, and total score: 0.6. The Dutch APCP was translated from English using one forward and one backward translation [31]. The Dutch version was not pretested. The test–retest reliability of the overall diversity and intensity scores for children with and without physical disabilities is acceptable (ICC = 0.83–0.91). Two out of five subscales had an ICC below 0.70 [31].
The ICC cannot be evaluated for the group of participants with a disability due to small sample size (N = 24). There is moderate positive evidence for hypothesis testing, showing significant differences in diversity and intensity scores for gender, disability and age [31]. Floor and ceiling effects were not reported.
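For reference, the MDC95 and a distribution-based MCID are conventionally derived from the SEM and the baseline SD, respectively; a sketch of the standard formulas (whether [30] used exactly these variants is not detailed here, so treat the constants as the usual conventions rather than the study's method):

```python
import math

def mdc95(sem_value):
    # Minimal detectable change at the 95% confidence level:
    # MDC95 = 1.96 * sqrt(2) * SEM (two measurements, 95% CI).
    return 1.96 * math.sqrt(2) * sem_value

def mcid_distribution(baseline_sd, fraction=0.5):
    # A common distribution-based MCID convention: a fraction (often
    # half) of the baseline SD; the default fraction is an assumption.
    return fraction * baseline_sd
```

With an SEM of 1.0 points, for example, this gives an MDC95 of about 2.77 points.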

Children’s Assessment of Participation and Enjoyment (CAPE)

No articles were found evaluating the structural validity and responsiveness of the CAPE. The Greek translation provided limited evidence of inadequate internal consistency, with Cronbach’s α for the five subscales ranging from 0.08 to 0.64 [32]. The internal consistency of the Swedish [33], Norwegian [34] and Spanish [35, 36] versions could not be evaluated due to assessments of poor methodological quality. The test–retest reliability of the five domains of the Dutch translation was shown to be adequate (ICC = 0.61–0.78) [37]. Limited positive evidence is provided for the intrarater reliability of the Dutch translation (ICC = 0.65–0.83) [37]. There is limited evidence of adequate reliability of the Norwegian translation for children with typical development (ICCTotal = 0.66) [34]. The sample of children with a disability was too small to draw any meaningful conclusions. The reliability of an English version of the CAPE specifically adapted for children and adolescents with high functioning autism (HFA) was also assessed, but the study was of poor quality due to small sample size (N = 14) [38]. The measurement error cannot be formally determined, as the three studies evaluating it do not provide an estimate of the minimal important change (MIC); without the MIC, no comparison with the smallest detectable change (SDC) or limits of agreement (LoA) can be made. Bult et al. [37] reported SDC values ranging from 0.89 to 1.91 for the inter-rater reliability and from 1.14 to 1.86 for test–retest reliability. Hypotheses testing showed a positive correlation between Dutch CAPE activity scores and scores on instruments measuring family environment (r = 0.26–0.34), adaptive behaviour (r = 0.31–0.51) and picture vocabulary (r = 0.24) [37].
In addition, positive correlations between English CAPE activity scores and measures of athletic competence (r = 0.29) and physical functioning (r = 0.15–0.42) were found [39, 40], as well as negative correlations between the CAPE and environmental factors (perceptions of barriers in the physical structural environment; r = −0.17) and financial constraints (r = −0.13 to −0.21) [39]. Children from families with a lower income participated less often in active physical activities. Activity scores on the Spanish version of the CAPE were both positively and negatively correlated with a QoL measure. For example, CAPE diversity scores were positively correlated with ‘social support and peers’ (r = 0.41) and negatively correlated with ‘self-perception’ (r = −0.29) [36]. This last result was considered to be due to cultural differences: when parents judged their children’s self-perception to be low, they would motivate them to participate in more activities, given the importance of family interaction and involvement. Differences in scores between subgroups (children vs. adolescents vs. adults; male vs. female; disability vs. no disability) have also been reported [32, 35–37]. No information was available on floor and ceiling effects.

Child and Adolescent Scale of Participation (CASP)

No methodological studies were found evaluating measurement error and responsiveness of the CASP. There is limited positive evidence for content validity of the Chinese adaptation [41]. The methodological quality of the translation process is fair. Exploratory factor analysis showed limited evidence for a two-factor solution on the parent-report version [41], but there is also limited evidence for a three-factor solution [42, 43]. For the youth version, exploratory factor analyses showed limited evidence for a three-factor solution [43]. Evidence of good internal consistency was demonstrated for the total score of the parent-report measure (Cronbach’s α = 0.96) and the youth-report measure (Cronbach’s α = 0.87) [41, 43]. The four subscales of the CASP parent report show acceptable internal consistency (Cronbach’s α = 0.88–0.90) [41]. One study assessed the internal consistency of the subscores on the three-factor solution model identified for both the youth and parent report, resulting in a Cronbach’s α of 0.67–0.90 [43]. Three studies performing Rasch analysis on the parent-report questionnaire showed moderate evidence for a unidimensional construct [41, 42, 44]. The average CASP item difficulty ranged from 1.46 to −1.51 logits and from 1.36 to −1.97 logits [42, 44]. All three studies identified items pertaining to community participation as most challenging, whereas items regarding skills learned at a younger age, such as mobility, communication and self-care, were identified as least challenging. Three items were identified as potential misfits deviating from the Rasch measurement model. Inter-rater reliability examining agreement between parent and youth report showed moderate agreement on the total score (ICC = 0.63) [43]. On the subscales of the three-factor solution, there is limited evidence that the reliability is inadequate, showing limited to moderate agreement (ICC = 0.51–0.70).
Hypotheses testing showed that the CASP has a positive correlation with an instrument measuring functional skills (r = 0.51–0.72) and a negative correlation with instruments measuring extent of impairment (r = −0.58 to −0.66) and environmental barriers (r = −0.43 to −0.57) [42, 44]. It was also found that children without a disability have significantly higher and less variable CASP scores than children with a disability (p < 0.001) [41]. Floor and ceiling effects have been reported [41–44]. No information was available on the MIC.

Children Participation Questionnaire (CPQ)

One methodological study evaluated the following measurement properties of the CPQ: internal consistency, reliability, content validity, and hypotheses testing [45]. No studies were available on the measurement error, structural validity and responsiveness of the questionnaire. There is limited positive evidence for content validity. The evaluation of the internal consistency was of poor quality (unidimensionality of the scale was not checked) and therefore yields little information. There is limited positive evidence for adequate test–retest reliability with the ICC for the total scores ranging from 0.84 to 0.89 and the ICC for subscores ranging from 0.72 to 1.00. Hypotheses testing showed that the questionnaire can distinguish between subgroups (age, socioeconomic status and disability vs. no disability). Floor and ceiling effects or the MIC have not been reported.

Assessment of Life Habits (LIFE-H)

There were no original methodological studies evaluating the internal consistency, measurement error, structural validity and responsiveness of the child version of the LIFE-H [46]. There is limited positive evidence for content validity [47, 48]. There is limited positive evidence for acceptable inter-rater reliability (ICC = 0.63–0.93) [48]. In addition, limited positive evidence is provided for satisfactory intrarater reliability for 10 of the 11 subscales of the short questionnaire (ICC > 0.78) [48]. Hypotheses testing showed positive correlations between the LIFE-H and questionnaires measuring functional capabilities (r = 0.70–0.94). The methodological quality of this assessment was rated ‘poor’, because the comparator instruments were not adequately described regarding construct and methodological properties. Differences in scores between subgroups (cerebral palsy, neuropathy, myelomeningocele) were noted [48]. Floor and ceiling effects or the MIC have not been reported.

PART

No methodological studies were found evaluating the internal consistency, measurement error, content validity, structural validity and responsiveness of the PART. There is limited positive evidence for reliability for the total score (ICC = 0.92) and for each of the subscales (ICC = 0.84–0.89) [49]. Hypothesis testing showed positive correlations between PART scores and scores on a measure of functional skills (r = 0.35–0.62) [46]. Differences in scores between subgroups (mobility limitations vs. no mobility limitations and environmental barriers vs. no experienced environmental barriers) have been reported [49]. No information was available on floor or ceiling effects or on the MIC.

Participation and environment measure for children and youth (PEM-CY)

No methodological studies were found assessing the measurement error, structural validity and the responsiveness of the PEM-CY. The assessment of internal consistency was of poor quality and will therefore not be reported. There is limited positive evidence of content validity [50]. Limited positive evidence is provided for reliability on the total scores for ‘participation frequency’ in each of the three settings (ICCschool = 0.58, ICChome = 0.84 and ICCcommunity = 0.79). The reliability for individual items within each setting varied within the same range (ICChome = 0.68–0.96, ICCschool = 0.73–0.91 and ICCcommunity = 0.73–0.93). Reliability estimates for the other sections across settings were all moderate to good (ICC = 0.66–0.96) [51]. Hypotheses testing showed a significant effect of age group for ‘participation involvement’ in both the home settings (df = 3,512–3,568; F = 7.17) and school settings (df = 3,485–3,506; F = 3.81). A significant negative correlation between the ‘desire for change’ score and the ‘environmental supportiveness’ total score was found for each setting (r home = −0.42, r school = −0.59 and r community = −0.53) [51]. No information was available on floor or ceiling effects or on the MIC.

Questionnaire of Young People’s Participation (QYPP)

No methodologically sound studies were found evaluating the structural validity of the QYPP. The measurement error and responsiveness of the questionnaire were not assessed. One methodological study evaluated the internal consistency, reliability, content validity and hypotheses testing [52]. There is limited positive evidence for content validity. Due to the small sample size in relation to the number of items in the questionnaire (N = 107), the assessment of internal consistency offered insufficient information (Cronbach’s α = 0.61–0.86). There is limited positive evidence for adequate test–retest reliability, with ICCs ranging from 0.83 to 0.98. Hypotheses testing showed that the questionnaire could distinguish between people with CP and the general population (Mann–Whitney U test, p < 0.01 for all domains). Floor effects have been reported. No information regarding the MIC was available.

Discussion

The methodological quality of studies evaluating the measurement properties of eight different questionnaires measuring participation in children and adolescents with disabilities was evaluated, using the COSMIN taxonomy. Overall, the CASP was evaluated most extensively, generally showing moderate positive results on the assessed measurement properties. Remarkably, very few studies evaluating the measurement properties of participation questionnaires were available. In addition, at least 50 % of the measurement properties per questionnaire were not (or only poorly) assessed. Therefore, no final conclusions can be drawn about the methodological quality of the questionnaires. For some questionnaires (i.e. QYPP, APCP), this is understandable, because the measures are relatively new. However, other participation measures could have been validated more comprehensively.

The content validity of several of the questionnaires was not assessed. Good content validity is a prerequisite for sound validity and reliability. The fact that there is no general consensus on the definition of the construct ‘participation’ highlights the importance of a well-substantiated framework, on which the questionnaire should be built. The finding that the items within this framework are often not appraised with regard to the relevance to the construct, study population and the purpose of the measurement instrument, gives cause for concern. This is not an attempt to imply that the content validity of the included questionnaires is of poor quality. It is merely an observation that emphasises the necessity of analysing the content validity of these questionnaires in future studies. The notion that each participation questionnaire is developed using a different take on the definition of the construct was highlighted in Appendix A of ESM. Several questionnaires purported to measure the construct of participation [i.e. Children Helping Out—Responsibilities, Expectations and Support (CHORES), Pediatric Community Participation Questionnaire (PCPQ), and Rotterdam Handicap Scale (RHS)], but upon inspection, the items of the measures were often single tasks performed alone [53–55]. The definition used in the present review combined generally accepted aspects of previously developed descriptions (i.e. role performance, interaction and community life). General agreement is needed in order to attribute meaningful interpretations to the outcomes of the questionnaires.

Another measurement property that is underreported is the responsiveness. This is unexpected, as questionnaires measuring participation are often used in rehabilitation settings, where increased participation is a main treatment outcome. To make meaningful comments about patients’ progress in participation, the responsiveness of the questionnaires needs to be evaluated in longitudinal studies of good methodological quality; this will have positive implications for both research and clinical practice. Angst has voiced his concern about the COSMIN rules used to examine responsiveness [56]. According to Angst, responsiveness aims to detect change over time in the construct of interest. He argues that this is not solely a question of longitudinal validity, but a matter of determining which instrument detects changes over time more accurately using a quantitative measure. According to the COSMIN taxonomy, these methods are considered inappropriate and deem a study to be of poor methodological quality. Whereas it is true that other methods can be (and often are) used to evaluate the responsiveness of the questionnaire (i.e. effect size), these methods provide less insight than previously thought. The COSMIN panel argued that the effect size can only be used as a measure of responsiveness if the effect of an intervention has been determined or assumed beforehand [56]. However, these methods need not be disregarded. Although the COSMIN taxonomy has been criticised for its adherence to optimal statistical methods, rather than generally accepted and commonly used methods, it remains a consensus-based tool, which offers a standardised way of assessing measurement properties. It is crucial to use one’s own methodological knowledge and insight and apply the COSMIN checklist as a framework, not as an absolute truth. 
For clinical practice, it is important that future research studies the responsiveness of these measures, to enable valid evaluation of clients’ progress and to allow (cost-)effectiveness studies of current treatment techniques. Despite this clinical relevance of responsiveness, it should be noted that, for an individual child, a higher score on a participation questionnaire does not necessarily equal meaningful improvement. It can indicate that a child is able to perform certain activities more easily, but these activities might not be considered important or relevant by the child or parents. A higher score can also indicate that the child has received more help, rather than performed the activities independently. Scores on a participation questionnaire therefore always need to be interpreted qualitatively as well as quantitatively. From the perspective of this qualitative applicability, responsiveness is a less important measurement property.

Cross-cultural validity was difficult to assess for the included questionnaires. The adaptation of questionnaires for other cultures and languages requires a rigorous, integral process of expert translation (including multiple forward and backward translations), item revision and pretesting in a similar study population [28]. The pretest in particular is often disregarded, but remains an essential part of the validation process. A simple translation of the items is not sufficient, as the meaning and significance of items may vary with culture, setting and circumstance. A poorly executed cross-cultural adaptation could result in suboptimal findings regarding the validity and reliability of the instrument. This could explain the inconsistent findings in the present study when comparing the reliability and validity of the original questionnaires with some of the translated versions.

The studies of poor methodological quality showed similar shortcomings: small sample sizes and failure to perform (confirmatory) factor analysis. Factor analysis is an important method for evaluating the internal consistency, structural validity and cross-cultural validity of an instrument. Without this analysis, no information is obtained about the (uni)dimensionality of the scales and the distribution of items. Several studies nevertheless proceeded to determine the internal consistency of a scale without examining unidimensionality, or even after unidimensionality had been disproved [31, 35, 42, 44]. Adhering to general statistical requirements when evaluating measurement properties will improve the methodological quality of these studies.
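As a sketch of the statistic at issue, Cronbach’s alpha for an n-respondents-by-k-items score matrix can be computed as follows (the function name and data layout are illustrative and not taken from the reviewed studies):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix.

    Note: a high alpha does not by itself establish unidimensionality;
    factor analysis should be performed first.
    """
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the sum scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example: three perfectly correlated items yield alpha = 1.0
scores = np.array([[1., 1., 1.], [2., 2., 2.], [3., 3., 3.], [4., 4., 4.]])
print(round(cronbach_alpha(scores), 2))  # → 1.0
```

The calculation itself is straightforward; the point made above is that it is only interpretable once factor analysis has supported a unidimensional scale.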

This is the first systematic review in which the measurement properties of eight participation questionnaires were analysed using a standardised, consensus-based taxonomy, preceded by a construct operationalisation process. Based on the results, it can be concluded that there is still a shortage of good-quality information on the psychometric properties of questionnaires measuring participation in children and adolescents with a disability. A recommendation for future research is therefore to assess the psychometric properties of the identified questionnaires in methodologically sound studies. The COSMIN checklist can be consulted when determining which statistical methods are required and preferred for assessing these properties. Item response theory (IRT) can be a valuable and useful tool for determining the quality of a measurement instrument [57–59].
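As an illustration of what an IRT analysis adds, the simplest such model, the one-parameter logistic (Rasch) model, expresses the probability that respondent $p$ endorses item $i$ as a function of the respondent’s latent trait level $\theta_p$ and the item difficulty $b_i$ (the notation here is a standard textbook formulation, not taken from the cited studies):

```latex
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
```

Fitting such a model yields item-level difficulty estimates and fit statistics, which can expose problems (e.g. misfitting or redundant items) that classical indices of internal consistency cannot.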

Future research should pay special attention to the content validity and responsiveness of the questionnaires. The development of new questionnaires measuring participation in a general population of children and adolescents is not considered a direct priority at this time, as several new questionnaires have recently been developed (e.g. APCP, PART, PEM-CY, and QYPP). Instead, it is recommended to evaluate existing questionnaires in studies of high methodological quality, preferably using IRT models, to support the practical application of the instruments and to enable accurate measurement of participation in children and adolescents with disabilities.