Introduction

The early to mid-1990s heralded an important shift in the national narrative about health-related outcomes as the child-centered perspective became increasingly integrated into the translational research agenda. Fundamental to this burgeoning perspective was the availability of conceptually sound and well-validated child/adolescent-specific measures [1,2,3,4,5]. Historically, the parent was often selected as the child’s “voice”; however, developers advocated that, whenever feasible, health outcomes should be elicited directly from the child [6,7,8,9].

Current research further promotes this premise by examining the specific age at which children can understand health and illness concepts and by identifying essential methods/technologies for doing so [10,11,12,13]. These efforts are fundamental to understanding the impact and potential trajectory of health beliefs across the life course and continue to inform enhancements to “legacy” measures that utilize classical test theory and newly released measures that utilize modern test theory such as the NIH-funded family relationships measures [14] and pediatric outcomes measures used in federal initiatives such as PEPR [15] and ECHO [16].

The Child Health Questionnaire (CHQ), a family of parent- and child-reported general health outcomes measures [3, 6, 7] now considered “legacy” tools, continues to be an important touchstone in furthering the advancement of patient-reported outcome (PRO) measurement among children and adolescents. The CHQ was constructed using classical test theory as its developmental cornerstone, although item response theory was also employed (e.g., Rasch analysis for the Physical Functioning Scale). The development of an experimental CHQ child-report short form was explored at the time the parent-reported CHQ was released [17]; however, it was determined that further work was warranted. To pre-empt a misleading message that the parent was the most reliable reporter for children, and to ensure that the child’s own perspective was included, the full-length 87-item self-report version (CHQ-CF87) was released with a long-term objective to develop a more practical short form [17].

Despite its length (87 items), which has been cited as a potential barrier, there are more than 100 peer-reviewed publications representing several dozen countries that provide evidence of the reliability and validity of the measure across an array of conditions and settings including randomized clinical trials, clinical practice, schools, academic research, and epidemiologic studies [18]. The CHQ-CF87 has been rigorously translated and linguistically validated [19] in 34 languages. In direct response to demand, work was initiated in 2015 to derive a shorter CHQ self-report. In early 2017, CHQ-CF87 U.S.-based norms were collected and data were used to evaluate the short form. The purpose of this paper is threefold: (1) report on the analytic methods used to derive the short form (CHQ-CF45); (2) detail feasibility findings from a school-based initiative; and (3) provide evidence of the measure’s reliability and validity. To further support use and data interpretation, two complementary initiatives—the estimation of U.S.-based norms and linguistic validation [19] of 34 translations—were completed as part of the CHQ-CF45 comprehensive scientific agenda.

Sample used to derive the short form

A Dutch dataset was used to empirically derive the short-form CHQ-CF45. Reliability and validity of the Dutch translation have been well established [20,21,22,23,24]. This dataset was selected for several compelling reasons: (1) it provided an international perspective of previously published work [25]; (2) it included randomization to either paper–pencil or Internet administration; (3) the psychometric properties of the CHQ-CF87 were equal between the modes of administration, which has been identified as a key equivalence guideline [26]; (4) the sample was sufficiently large (N = 933); (5) the sample was representative (87% participation rate across seven secondary schools); (6) the dataset included those with/without health conditions; and (7) feedback about the feasibility/acceptance of the CHQ-CF87 was collected.

The Medical Ethics Committee of the Erasmus University Rotterdam Medical Center approved the study protocol (reference number MEC-2007-163). Table 1 provides a brief demographic summary of the sample.

Table 1 Characteristics of the sample used to derive the CHQ-CF45 (n = 933)

The age range was 13–17 years (mean age 15). Fifty-four percent of the sample were girls and the majority were of Dutch ethnicity (76%). Fifty-eight percent of the sample were enrolled in lower secondary/vocational education; 22% upper secondary; and 19% intermediate secondary. The CHQ-CF87 performed well in the Dutch sample. The median Cronbach’s alpha coefficient was 0.83 (paper/pencil; range 0.69 physical functioning to 0.92 mental health) and 0.81 (internet; range 0.70 role/social-emotional/behavioral to 0.90 mental health). All scales demonstrated discriminant validity between adolescents with/without health conditions [25].

Analytic methods used to derive the short form

A multimethod approach was used to derive the short form: (1) stepwise regression; (2) exploratory factor analysis; (3) confirmatory multi-trait analysis; and (4) alpha reduction. The overall objective was to reduce each scale by 50%, but only if such a reduction was empirically supported.

The multimethod approach was applied to the full sample as well as 15 unique subgroups defined by variables in the dataset: (1) mode of administration (paper/pencil or Internet); (2) child gender (boys/girls); (3) age group (13–14, 15, 16–17; groupings based on data distribution); (4) smoking status (current/former smoker/never); (5) diagnostic status (none, 1–2 conditions, ≥ 3 conditions); and (6) problems with CHQ completion (no, yes).

The conceptual framework for the CHQ-CF87 is based on the WHO multi-dimensional model of health [27]. Scale–scale correlations (not shown but estimated as part of the short-form development) provided empirical support that each scale represented a unique construct. Thus, to maintain this conceptual/definitional structure, all scales/concepts were retained. Decision rules were comparable to those employed for previous published short-form work [28, 29].

Using the Likert-type format and a graduated response continuum that ranges from 4 to 6 levels, the CHQ measures global health; limitations with regard to daily physical activities; limitations in school and social activities due to emotional/behavioral issues and physical health issues; bodily pain; behavior; self-esteem; general health perceptions; limitations in family activities; and family cohesion. A 4-week recall is used to capture a more “reasonably stable” assessment of the average [30] rather than an acute (7-day) perspective common to more newly developed IRT measures [14].

Items are reverse scored (if appropriate) so that a higher score is better and a mean score is calculated to derive an overall scale score as is the standard for ordered categorical or Likert-type scales. For ease of interpretation, the raw scale scores are converted to a 0–100 continuum using a standard mathematical formula [31].
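The scoring steps described above can be illustrated with a minimal sketch. The exact formula in [31] is not reproduced in this paper, so the linear rescaling below is an assumption, as are the function name `score_scale`, the 1..n response coding, and the choice to simply skip missing items.

```python
def score_scale(responses, n_levels, reverse=()):
    """Score one Likert-type scale on a 0-100 continuum (illustrative only).

    responses: item responses coded 1..n_levels (None marks a missing item)
    n_levels:  number of response levels shared by the items of the scale
    reverse:   positions of items that must be reverse scored so that a
               higher value always means better health
    """
    recoded = []
    for position, value in enumerate(responses):
        if value is None:
            continue  # assumption: missing items are left out of the mean
        if position in reverse:
            value = n_levels + 1 - value  # e.g. 1 becomes 5 on a 5-level item
        recoded.append(value)
    if not recoded:
        return None  # nothing answered, no score
    raw_mean = sum(recoded) / len(recoded)
    # Assumed linear transform: lowest possible mean maps to 0, highest to 100.
    return (raw_mean - 1) / (n_levels - 1) * 100


# A respondent answering 3 on every item of a 5-level scale scores mid-range.
print(score_scale([3, 3, 3], n_levels=5))
```

Under these assumptions, a scale score of 100 corresponds to the best possible response on every answered item and 0 to the worst.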

For each scale, items that had consistent results across all analytic methods for the full sample and 15 subgroups were retained. For the regression analysis, an adjusted R² between 0.80 and 0.90 was considered acceptable. The dependent variable was the full-length scale; the corresponding items for that scale were the independent variables. Each model was analyzed separately, and items whose retention yielded little gain were noted. For the item scaling analyses, items had to have an observed loading of ≥ 0.40 with their respective scale [32]. Item reliability thresholds, as measured by Cronbach’s alpha [33], were set at ≥ 0.80.
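The regression criterion can be sketched as follows: the full-length scale score is regressed on a candidate subset of its own items, and items are added until the adjusted R² reaches the lower bound of the acceptable range. This is a simplified greedy forward selection written for illustration, not the stepwise procedure actually run in SPSS; the function names, the toy data, and the use of the item mean as the full-length score are all assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adjusted_r2(items, y):
    """Adjusted R^2 of an OLS regression (with intercept) of y on item columns."""
    n, k = len(y), len(items)
    cols = [[1.0] * n] + [[float(v) for v in col] for col in items]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def forward_select(items, target=0.80):
    """Add the most helpful item each round until adjusted R^2 >= target."""
    n = len(items[0])
    # Assumption: the full-length scale score is the mean of all its items.
    full = [sum(col[i] for col in items) / len(items) for i in range(n)]
    chosen, remaining, best = [], list(range(len(items))), 0.0
    while remaining and best < target:
        scores = {j: adjusted_r2([items[c] for c in chosen + [j]], full)
                  for j in remaining}
        pick = max(scores, key=scores.get)
        best = scores[pick]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen, best


# Hypothetical responses of 8 adolescents to a 4-item scale.
toy_items = [
    [1, 2, 3, 4, 5, 4, 3, 2],
    [1, 2, 2, 4, 5, 5, 3, 1],
    [2, 1, 3, 5, 4, 4, 2, 2],
    [1, 3, 3, 4, 4, 5, 3, 2],
]
print(forward_select(toy_items))
```

In the derivation described here, the regression result was only one of four methods, and an item was kept only when the methods agreed across the full sample and the subgroups.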

Table 2 identifies the concepts, the number of items for each scale in both lengths of the CHQ, and the item content that was omitted/retained in the shorter version.

Table 2 Concepts, number of items, and items omitted/retained in the CHQ-CF45

Samples used to evaluate the short form

Feasibility/ease of completion/acceptance

Since the resulting CHQ-CF45 was derived from, and is embedded within, the full-length CHQ-CF87, a small unfunded feasibility study to assess the short version as an independent measure was conducted at a private school in the western part of the U.S. Table 3 provides a brief description of the convenience sample.

Table 3 Feasibility sample descriptors of participants (n = 114)

The school was selected because of accessibility and consistency in class size and range. The CHQ could be self-administered over a few days within the usual school flow. The small class size (approximately 13/grade; range 12–19) and the grade range (1–8) afforded a compelling opportunity to evaluate acceptability through direct observation and open-ended discussions with participants. The same researcher (JL) introduced the CHQ, oversaw its self-completion, and was available to assist if issues arose. Understanding and ease of CHQ completion were obtained from each participant. Teachers were not present during the field-testing to allow for greater anonymity and freedom with regard to open-ended discussions about health, well-being, and the CHQ items/layout/format. The project was approved by the Executive Director, Director, and classroom teachers. Parent consent and child assent were obtained. One child declined participation although the parent consented. The field test was performed in April 2014.

The school utilizes a modular learning approach, i.e., some students within each grade switch to separate classrooms for particular subjects/activities. This was an important consideration in the school’s selection as it allowed for further protection of student confidentiality for those whom parents did not provide consent. A priori arrangements were made with each teacher so that non-participants could engage in a different activity in a separate room without calling attention to their absence during the CHQ administration.

Table 4 provides feasibility details. Invitations were sent to 153 parents with 114 providing consent (74% response rate). The average completion time was 11 min (range 4–35 min; high number reflects 1–3 outliers). There were negligible missing data (range 0.02–0.05%). In general, students did not have problems with CHQ content and did not feel items were difficult to complete.

Table 4 Completion times and ease of completion by grade (n = 114)

There were some initial issues with layout in Grade 2, but once the format was explained these younger children had no problems completing the CHQ. Grades 3 and 4 students reported that the CHQ was not difficult to complete. However, they did seem to have more challenges than other grades, but when items were clarified (e.g., stoop, tasks, body/looks, family), they were able to answer. It was later disclosed that learning and comprehension challenges were not unique to the CHQ for these grades. Interestingly, a 3rd grader commented, “It was hard not because I did not understand but because it was tough deciding how I really felt.” A 5th grader noted, “I am going through a few things so I would prefer not to answer some questions.” A 4th grader commented, “This is not what I expected. I thought you’d just ask about the usual stuff but it was not just about that, but my body, my family. That’s important.” Others commented “this was fun and easy” (Grade 4), “it’s cool” (Grade 3), and “I love this. It’s fun and makes me feel like a grown-up” (Grade 2). Grades 7 and 8 felt it complemented what they were learning in health/science. One of the most compelling comments came from a 3rd grader, who emphatically voiced, “It’s about time people started to ask US how we feel instead of our parents. We do know about these things, you know!”

Internal consistency reliability for the nine CHQ-CF45 multi-item scales was evaluated using Cronbach’s alpha coefficient [33]. Due to small class sizes, coefficients were estimated for the entire sample (n = 114). Change in Health and Family Cohesion were retained as independent single items in the CHQ-CF45, but test/retest was not possible; therefore, they were not included in the estimations. Table 5 provides a summary for the CHQ scales. All coefficients were ≥ 0.71; the median coefficient was 0.82 (range 0.71–0.87; Role/social-Physical and Mental Health, respectively), suggesting that the CHQ performed well in the feasibility sample.
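For readers less familiar with the coefficient, Cronbach’s alpha for a multi-item scale can be computed from the item variances and the variance of the summed score. The sketch below uses the standard formula with sample variances; the function name and the toy responses are illustrative and are not taken from the study data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for one scale.

    items: list of item columns; each column holds the scores of the same
           respondents, in the same order, with no missing values.
    """
    k = len(items)
    totals = [sum(values) for values in zip(*items)]  # summed score per respondent
    sum_item_variances = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


# Hypothetical two-item scale answered by four respondents.
print(cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]]))
```

Because the formula needs at least two items, it cannot be applied to single-item measures such as Change in Health and Family Cohesion.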

Table 5 CHQ-CF45 alpha coefficients for the feasibility sample (n = 114)

National U.S. sample

The CHQ-CF45 was validated in a third independent project using a U.S.-based representative sample [34, 35]. Responses from children aged 8–18 years were collected between November 2015 and January 2016 using an online research panel (GfK KnowledgePanel) representative of the U.S. population. All eligible respondents completed the CHQ-CF87.

The GfK recruitment protocol relies on probability-based sampling of all U.S. addresses from the latest Delivery Sequence File (DSF) of the U.S. Postal Service and a stratification plan to boost representation of subgroups with higher rates of attrition (e.g., Hispanic/Non-Hispanic, ages 18–29) and other hard-to-reach adults (e.g., no Internet). To further improve the demographic composition, GfK makes adjustments to account for oversampling and, every quarter, invites adults from randomly selected addresses to join the panel. Thus, the parent sample, from which children/teens were recruited, generally represents an equal probability selection method sample [34]. The merits/disadvantages of probability-based and convenience Internet panels have been addressed by others [36]. Both parental consent and child assent were obtained.

Table 6 provides a summary of the demographic characteristics of the U.S. representative sample. The sample (N = 1500) comprised an equal number of males/females (N = 750) overall and within each of the targeted age groups which were created based on development parameters and distribution within the GfK panel (8–9, 10–11, 12–13, 14–15, 16–18; n = 150, respectively). The mean sample age was 13 years. The least represented age was 18 (N = 73). The majority of the sample attended public school, with a little more than 8% attending private schools and almost 6% (N = 88) being home-schooled.

Table 6 Background descriptors for the validation sample (N = 1500)

More than half of the sample was White Non-Hispanic (59%) and 21% were Hispanic, with the remainder identifying as Black, Other non-Hispanic, and 2+ Races, non-Hispanic. Regional representation was varied (17–36%) with slightly more representation from the South and Midwest U.S. An overwhelming majority (85%) lived in attached/detached one-family homes with married working parents (80% and 73%, respectively; data not shown).

Few child conditions (e.g., asthma, ADHD) were reported by parents (mean 1.5; range 0–12; not shown). Table 7 presents the conditions with/without comorbidities. Due to small samples within discrete conditions, even with comorbidities, children were re-classified for analytic purposes as having no condition (51%), one condition (24%), two conditions (12%), or ≥ 3 conditions (12%).

Table 7 Prevalence of parent-reported conditions and comorbidities in the CHQ-CF45 U.S. normative sample: ages 8–18 (N = 1500)

Methods used to evaluate the short form

Multi-item scaling

Scaling criteria and confirmatory factor analysis [32] were used to evaluate the psychometrics and the a priori hypothesized structure of the CHQ-CF45. A subtype of structural equation modeling, MAPR is recognized as a hallmark classical test methodology for evaluating legacy tools that utilize a multi-item Likert-type approach to scale construction. Unlike traditional factor analysis, MAPR allows for the direct test of a priori structures (i.e., item “fit”) and can be used to estimate “error free” constructs. MAPR extends traditional tests of internal consistency by testing item discrimination across other item sets in the matrix (i.e., how well an item represents a given construct relative to other constructs). Correlations between items are corrected for indicator overlap so that estimates are not spurious [30, 32].
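The correction for indicator overlap can be made concrete with a short sketch: when an item is correlated with its own scale, the item is first removed from the scale total so that it is not correlated with itself. The code below is not the MAPR software; it is a plain-Python illustration with assumed function names and toy data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_scale_r(scale_items, idx):
    """Correlate item idx with the sum of the OTHER items in its own scale."""
    item = scale_items[idx]
    rest = [sum(values) - values[idx] for values in zip(*scale_items)]
    return pearson(item, rest)


def item_other_scale_r(item, other_scale_items):
    """Correlate an item with the total of a scale it does not belong to."""
    total = [sum(values) for values in zip(*other_scale_items)]
    return pearson(item, total)


# Hypothetical data: a 3-item scale and a competing 2-item scale, 5 respondents.
own_scale = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [1, 2, 3, 4, 5]]
other_scale = [[5, 4, 3, 2, 1], [5, 4, 3, 2, 1]]
own_r = corrected_item_scale_r(own_scale, 0)
other_r = item_other_scale_r(own_scale[0], other_scale)
print(own_r, other_r, own_r > other_r)
```

An item counts toward scaling success when its corrected correlation with its hypothesized scale is clearly higher than its correlation with each competing scale.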

Specific examination using MAPR included (1) item convergent validity: correlations between 0.30 and 0.40 for items within their hypothesized scale; (2) item discriminant validity: correlations one to two standard errors higher than correlations with other scales; (3) floor and ceiling effects: percentage of respondents at the lowest and highest possible scale score; and (4) internal consistency reliability as measured by Cronbach’s alpha coefficient [33, 37]. Strict analytic selection criteria were used; completion of at least half of the items for all nine scales was required. The final sample for analysis comprised 1468 respondents.
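Floor and ceiling effects, as defined in criterion (3), are simple proportions. A minimal sketch follows, assuming scale scores already sit on the 0–100 continuum and that unscored respondents are coded `None`; the function name is illustrative.

```python
def floor_ceiling(scores, lowest=0.0, highest=100.0):
    """Return (floor %, ceiling %) for one scale.

    scores: scale scores on the 0-100 continuum; None marks respondents
            for whom no score could be computed (they are excluded).
    """
    valid = [s for s in scores if s is not None]
    n = len(valid)
    floor_pct = 100.0 * sum(s == lowest for s in valid) / n
    ceiling_pct = 100.0 * sum(s == highest for s in valid) / n
    return floor_pct, ceiling_pct


# Four hypothetical respondents: one at the floor, two at the ceiling.
print(floor_ceiling([0, 50, 100, 100]))
```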

Relative precision in clinical subgroups

The validity of the short and full-length scales to detect differences among clinical subgroups was estimated by computing the ratio of pair-wise F-statistics [28, 29, 35, 38, 39]. The F-statistic represents the ratio of systematic variance (between groups) versus error variance (within group). The larger the F value, the better the concept distinguishes across the different classifications. Following published convention, the full-length CHQ-CF87 scales were considered the “standard” (i.e., denominator was set to 1.0) against which the shorter scales were compared. Precision estimates greater than 1.0 indicate that the short-form scale is superior in detecting differences relative to the corresponding full-length version. Estimates under 1.0 provide evidence that, although respondent burden is reduced, some loss of precision is observed. Parametric methods are routinely used with summated rating scales; however, for the precision assessment, the Kruskal–Wallis non-parametric test was also performed. No differences were found. Thus, per convention, data are reported using parametric methods (ANOVA). All analyses were performed using SPSS Version 23.
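The relative precision calculation can be sketched as the one-way ANOVA F-statistic for the short-form scale divided by the F-statistic for the full-length scale, both computed across the same condition groups. The function names and the toy scores below are assumptions for illustration; the study itself used SPSS.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def relative_precision(short_groups, full_groups):
    """Ratio of the short-form F to the full-length F (full-length = 1.0)."""
    return f_statistic(short_groups) / f_statistic(full_groups)


# Hypothetical scores for two condition groups on each version of a scale.
full_length = [[1, 2, 3], [4, 5, 6]]
short_form = [[1, 2, 3], [7, 8, 9]]
print(relative_precision(short_form, full_length))
```

A value above 1.0 means the short-form scale separated the groups more sharply than the full-length scale in that comparison; a value below 1.0 means some precision was lost.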

Results

Short-form scale evaluation

Multi-trait item scaling results are summarized in Table 8 for the full sample, and by gender, age group, and condition group (none, 1, 2, or ≥ 3 conditions). For ease of data presentation, item convergent and discriminant validity findings are summarized in the row “Scaling Success.” In the full sample and across all 10 subgroups, all item-scale correlations (not shown) were ≥ 0.40 with the exception of 3 items: global health (General Health scale; 0.34) in the 8–9-year-old group; headache (Mental Health) in the same age group and in those with 1 or 2 reported conditions (0.39 and 0.37, respectively); and trouble sleeping (Mental Health) in 8–9-year-olds. Item scaling success rates of 100% were observed for all nine scales across the full sample and 10 subgroups with the exception of Mental Health, for which rates, although not perfect, were still quite high (range 97%–99%). Given the reported good health of the sample, minimal floor effects were observed, indicating that few if any respondents scored at the lowest possible end of the response continuums across all items (range 0.0% for Mental Health and General Health to 0.6% for Role Physical). However, not surprisingly, ceiling effects (the percentage of respondents scoring at the highest/best end of the response options) were detected. High ceiling effects were observed for the Role Physical, Role Emotional/Behavioral, and Physical Functioning scales (91, 84, and 71%, respectively). Moderate ceiling effects were observed for the Family Activities (55%), Bodily Pain (44%), and Self-Esteem (26%) scales. Negligible/no ceiling effects were observed for the Behavior and General Health Perceptions scales. Ceiling effects were much smaller for those in the condition groups, especially those with ≥ 3 conditions. Overall, findings indicate that the CHQ-CF45 short-form scales exceeded item-level scaling criteria even for the youngest ages (8–9-year-olds).

Table 8 Item scaling success rates, floor/ceiling effects, and alpha coefficients for the CHQ-CF45 scales

Table 8 also presents alpha coefficients for the full sample and the subgroups. In the full sample, with the exception of General Health (0.76), alpha coefficients for the remaining eight scales were ≥ 0.84, with the highest reliabilities observed for Self-Esteem (SE; 0.93) and Family Activities (FA; 0.90), followed by Bodily Pain (0.89), Behavior (0.87), and Physical Functioning (0.86). Similar alpha coefficients were observed across all 10 subgroups. These findings suggest that the CHQ-CF45 is reliable even in the younger age groups and across children with/without reported health conditions. The reliability estimates are remarkable considering that several of the CHQ scales comprised only 2 or 3 items (REB, RP, BP, FA) with the remaining scales defined by ≤ 9 items.

Relative precision

Table 9 provides group means, p values, F-statistics, and relative precision (RP) estimates. As noted, the groups were created by classifying children based on the number of parent-reported conditions (0, 1, 2, ≥ 3). The two-item Bodily Pain scale and the two single items (Change in Health and Family Cohesion) were excluded from the RP estimates since they are identical in both lengths of the CHQ.

Table 9 Relative precision of the CHQ-CF45 for children with and without reported health conditions

Statistically significant differences (p < .001) were observed for all scales in the full-length CHQ-CF87 and the derived short-form CHQ-CF45. Overall, lower scores were observed for the more “psychosocial” scales (e.g., Mental Health, Behavior, Self-Esteem) relative to Physical Functioning and Role Physical. As might be expected, the lowest scores were observed for the ≥ 3 conditions group. Short-form scales with precision estimates that met/exceeded the standardized estimate for the corresponding full-length versions included Role/social-Emotional Behavioral (REB = 1.37 vs. RE; REB = 1.00 vs. RB) and Physical Functioning (PF = 1.09). Precision estimates for all remaining scales (except GH) were high: Role Physical and MH (0.95), SE (0.88), and FA (0.81). A lower, though still reasonable, estimate was observed for General Health (GH = 0.73), which is noteworthy given that the short-form scale was reduced by 42%. The relative precision estimates are remarkable given that overall the CHQ was reduced by 52%. The reduction for individual scales ranged from 3 to 9 items (41–66%), with the single exception being the Role/social limitations-Physical scale, which was reduced in length by only one item.

Discussion

Shorter, well-validated, child-centered PRO measures, irrespective of methodological development (i.e., legacy/classical or modern test theory), are needed to keep pace with emergent innovative technologies (e.g., geographic information systems, fit-for-purpose activity trackers/“wearables”) that are increasingly being used to assess clinical/biomedical markers in concert with PROs [15, 16]. The objective of the present work was to derive and psychometrically evaluate a more practical-length version of a comprehensive, multi-dimensional child-focused PRO “legacy” measure that has been rigorously translated and used across an array of conditions and countries/cultures for many years.

An important strength of the present investigation is that the CHQ short form was empirically derived from a large Dutch representative sample using a multi-dimensional strategy but “externally” validated in a second larger U.S. representative sample that included 8-year-olds. Even though Flesch–Kincaid readability statistics [40] place the reading level for the CHQ-CF45 at Grade 5, with a reading ease of 74%, both the feasibility sample and the U.S. sample provide evidence that the CHQ-CF45 can be used with confidence in younger primary grades (e.g., 2nd–4th).

Items in the CHQ-CF45 exceeded all scaling criteria in the full U.S. sample and across all 10 subgroups (gender, age, number of health conditions), including item convergent and discriminant validity. All nine scales demonstrated strong internal consistency (alphas ≥ 0.73) in the full sample and across all subgroups with the exception of GH in ages 8–9 (0.67). Relative precision estimates ranged from 0.73 to 1.37, providing strong evidence that, using a multimethod approach, it was possible to achieve the objective of a 50% reduction in items with only a small loss in precision (range 5–27%; Role Physical/MH and GH, respectively). Relative precision estimates for two short-form scales met or exceeded those of the full-length versions (1.09 for PF; 1.37 and 1.00 for REB vs. RE and RB, respectively). Overall, these findings are encouraging and support the structural integrity of the CHQ-CF45 and provide evidence that it is reliable and valid and can be used with confidence.

A few study limitations should be noted. First, the U.S. representative sample was largely healthy, with no condition reported for about 50%. Thus, not unexpectedly, high ceiling effects were observed for the “physical scales” (PF, RP, BP) but less so for “psychosocial” scales (MH, BE, SE). Although the sample was representative of U.S. children, it is possible that the parent panel from which the sample was drawn did not truly represent children with chronic conditions and this resulted in small numbers for common “exclusive” conditions (e.g., asthma, ADHD). Although systematically recruited, parents of children with more debilitating conditions may be less likely to participate in the panel because of time/personal limitations, as their children may be less independent and require more attention/care. It is important to note, however, that for roughly 24% of the sample parents did report ≥ 2 chronic conditions. Due to small samples for discrete conditions, it became necessary to classify children for the relative precision estimates based on number of conditions, which included physical, mental, and/or behavioral diagnoses. Further work is needed to extend the current findings across specific condition groups to better determine potential ceiling effects and tradeoffs in precision.

Second, although item scaling results were more than satisfactory for ages 8–9, findings in the feasibility study suggest that younger ages or those with learning challenges may require some guidance as to the CHQ-CF45 layout/format. This could just be an artifact of the paper–pencil methodology since students may be accustomed to a certain style in which questions and sub-items are posed in school-related work. Further study may be warranted.

Finally, it will be important to assess the performance of the CHQ-CF45 across other countries/cultures and across different data collection platforms (e.g., social networks, smartphones, OLED-based displays, Google Glass, and other AR-enabled wearables/devices). It is important to reiterate that data for the U.S. sample were Internet-based which allowed for privacy and flexibility of questionnaire completion. In both the feasibility and the U.S. sample, respondents were allowed to skip items but little missing data were observed which suggests that children felt comfortable answering the CHQ using either print or electronic administration.

The use of family-centered child-reported PROs across an array of integrated settings is not new. More than 20 years have passed since the release of well-validated measures. Currently, the pressing need is for shorter yet equally robust child self-report measures that capture the physical, emotional, and social well-being of the child and his/her family. The integration of PRO content (such as the CHQ-CF45) with emergent technologies being used in classrooms, clinical care, and academic settings presents as an exciting evolutionary “next” step. Central to these initiatives is the child’s own voice. It is essential that he/she be heard if we are to truly advance child-focused PROs as a viable and important source for meaningful and actionable data that can enhance communication and better inform decision-making.