Introduction

Anxiety and depression are highly prevalent in children and adolescents and among the leading causes of youth disability worldwide [1, 2]. Prevalence rates of child anxiety and depression are around 6.5% and 1.3%, respectively [1]. Long-lasting episodes of child anxiety and depression predict recurrence of these disorders [3] and the development of other psychosocial problems later in life, like substance abuse or dependence, suicidal behavior, and failure to complete secondary school [4,5,6]. To prevent deterioration, it is critical to assess, treat, and monitor anxiety and depression in children [7, 8].

To assess and monitor anxiety and depression in children, self-report questionnaires based on classical test theory (CTT) are often used [7, 9, 10] (e.g., the Revised Child Anxiety and Depression Scale [RCADS] [11, 12]). Although CTT questionnaires are valuable in showing the number and severity of symptoms, they assume that all symptoms contribute equally to severity ratings of a construct, while research has shown this is not the case [13, 14]. In addition, many CTT questionnaires are relatively long, which makes them time-consuming [9]. Furthermore, the qualitative meaning of scores is not always clear [15].

To advance the measurement of self-reported health, the Patient-Reported Outcomes Measurement Information System (PROMIS®) initiative in the United States of America (U.S.) developed multiple adult and pediatric item banks, which are sets of questions measuring the same construct (e.g., anxiety or depression) [16, 17]. The use of PROMIS item banks has several advantages over the use of CTT questionnaires. PROMIS item banks have the potential to measure with higher validity and reliability, due to careful item selection and adaptation and the application of Item Response Theory (IRT) [16, 18,19,20]. IRT is a psychometric method by which items and persons are ordered on the same scale in terms of severity of the construct. Due to this ordering, item banks can be administered through computerized adaptive testing (CAT). In CAT, items are automatically selected from an item bank, based on an individual’s response to a previously completed question. With CAT, fewer items are needed to obtain a reliable result than with CTT questionnaires, which need to be administered entirely [21]. When computers are unavailable, fixed-length short forms can be used, consisting of, for example, four to eight items. Furthermore, PROMIS item banks are generic in nature, which makes them universally applicable in clinical and general populations. Finally, PROMIS item banks are standardized on a universal T-score metric where a score of 50 represents the average of the U.S. reference population with a standard deviation (SD) of 10, which makes it possible to interpret results of different item banks alike.

The U.S. PROMIS pediatric item banks v1.0 Anxiety and Depressive Symptoms have been validated in a diverse set of children at public schools, hospital-based outpatient general pediatrics, and subspecialty clinics [22, 23]. These item banks were translated into, among others [24], Dutch-Flemish [25] and recently updated to version v2.0. This study aims to validate the Dutch-Flemish PROMIS pediatric item banks v2.0 Anxiety and Depressive Symptoms, the short forms 8a, and CATs in a large sample of children from the general Dutch population and to provide reference data; it adds to earlier research on cross-cultural validity.

Methods

Participants

Participants were children aged 8–18, who lived in the Netherlands and could read Dutch. They were recruited via their parents by two internet survey providers–Kantar Public and Panel Inzicht–from January to July 2018. Figure 1 describes the sampling procedures.

Fig. 1

Sampling procedures by two internet panel survey providers (i.e., Kantar Public and Panel Inzicht). a Two to six or more persons household. b Native, first and second generation of western immigrants (i.e., immigrants from Europe excluding Turkey, North America excluding Mexico, Oceania, Japan, and Indonesia), first and second generation of non-western immigrants (i.e., immigrants from Africa, Latin America, and Asia excluding Japan and Indonesia). c Five social classes. d Three biggest cities in the Netherlands (i.e., Amsterdam, Rotterdam, The Hague), their outskirts, region west without the three biggest cities and their outskirts, region north, region east, and region south. e Representative per age group 8–12 and 13–18 years old on the variables gender, age, household size, ethnicity (with the exception of native children aged 8–12), social class, and region compared to the general population in 2017

Kantar Public drew a representative sample of children from the general Dutch population and an additional sample of children with a migration background in the three biggest cities in the Netherlands, since it expected an under-representation of these participants. Representativeness was determined per age group 8–12 and 13–18 years on the variables gender, age, household size, ethnicity, social class, and region (deviation from gold standard < 2.5% [26]). Kantar Public expected a total response rate of 32% based on previous experiences. It offered participants a gift voucher of €1.50.

Panel Inzicht approached all parents of children aged 8–18 in their panel. It expected a response rate of 10 to 20% and offered participants €0.95.

Procedure

Participants completed an online questionnaire consisting of general questions about demographics; the PROMIS pediatric item banks v2.0 Anxiety and Depressive Symptoms [22, 23]; and the Revised Child Anxiety and Depression Scale short version (RCADS-22) [11, 12, 27]. We added one question at the end of the RCADS-22 with an opposite wording (i.e., “I feel happy”) to detect respondents who completed the questionnaire without paying attention to the formulation of the questions, and one question at the end of the total questionnaire to check whether respondents participated in both internet surveys. No questions could be skipped to avoid missing data.

Measures

PROMIS pediatric item banks v2.0 and short forms 8a Anxiety and Depressive Symptoms [22, 23]

The PROMIS pediatric item bank v2.0 Anxiety contains 15 items, the PROMIS pediatric item bank v2.0 Depressive Symptoms contains 14 items, and both short forms 8a contain a subset of eight items. All items use a seven-day recall period and are scored on a five-point Likert scale: 1 (never), 2 (almost never), 3 (sometimes), 4 (often), 5 (almost always). Level of severity is expressed as theta (θ), and a T-score is calculated by the formula (θ × 10) + 50, with higher scores representing higher levels of anxiety or depressive symptoms.
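The T-score conversion described above can be sketched in a few lines. This is only an illustration of the formula (θ × 10) + 50; the function names are hypothetical and are not part of any PROMIS scoring software.

```python
def theta_to_t(theta: float) -> float:
    """Convert a trait estimate (theta) to a T-score: (theta * 10) + 50."""
    return theta * 10.0 + 50.0


def se_theta_to_t(se_theta: float) -> float:
    """Rescale a standard error from the theta metric to the T-score metric.

    Only the multiplication by 10 applies; the offset of 50 shifts the
    score but not its standard error.
    """
    return se_theta * 10.0
```

Under this conversion, a θ of 0 corresponds to a T-score of 50, and an SE of 0.316 on the θ metric corresponds to an SE of 3.16 on the T-score metric.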

Revised Child Anxiety and Depression Scale short version (RCADS-22) [11, 12, 27]

The RCADS-22 contains 15 items measuring symptoms of anxiety and seven items measuring symptoms of depression in accordance with the DSM-IV [12, 27]. All items are scored on a four-point Likert scale: 0 (never), 1 (sometimes), 2 (often), and 3 (always). Total scores are calculated by adding all item scores per subscale, leading to a total score from 0 to 45 on the anxiety subscale and from 0 to 21 on the depression subscale. Higher scores represent a higher level of anxiety or depression. Previous studies have demonstrated strong psychometric properties of the anxiety subscale [12, 27] and the seven items version of the depression subscale [27]. In the present study, the anxiety and depression subscales showed a Cronbach’s alpha of 0.87 and 0.84, respectively.
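As a rough sketch of the scoring described above, the snippet below sums item scores per subscale and computes Cronbach’s alpha with the standard formula. The function names and the input layout (one list of item scores per respondent) are assumptions made for illustration; the analyses in this study were not run with this code.

```python
from statistics import variance


def subscale_total(item_scores, n_items):
    """Sum the 0-3 item scores of one subscale (15 anxiety or 7 depression items)."""
    if len(item_scores) != n_items:
        raise ValueError("unexpected number of items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("item scores must be 0, 1, 2, or 3")
    return sum(item_scores)


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    item_variances = [variance([r[i] for r in responses]) for i in range(k)]
    total_variance = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

With 15 anxiety items scored 0 to 3, the total ranges from 0 to 45; with seven depression items, from 0 to 21.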

Analyses

We examined whether participants gave identical answers to all RCADS-22 questions and the question with opposite wording, and whether they completed the questionnaire twice for both survey providers. Next, we examined differential item functioning (DIF) for the two samples to assess whether the data could be combined for psychometric analyses.

We performed the following analyses in accordance with the PROMIS analysis plan for psychometric evaluation of item banks [28]. First, the assumptions of the IRT model were examined: unidimensionality, local independence, and monotonicity. Unidimensionality was examined by confirmatory factor analyses (CFAs) using the R package Lavaan (version 0.6–3) [29]. One-factor model fit was examined using the polychoric correlation matrix with a diagonally weighted least squares estimator. Four fit indices were evaluated: the scaled comparative fit index (CFI), the scaled Tucker-Lewis index (TLI), the scaled root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR) [30]. Model fit was considered sufficient if the scaled CFI and TLI > 0.95, the scaled RMSEA < 0.06, and SRMR < 0.08 [28, 31].
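The fit criteria above amount to four threshold checks. A minimal sketch follows; the function name is hypothetical, and the example values are the Anxiety fit indices reported in the Results.

```python
def check_one_factor_fit(cfi, tli, rmsea, srmr):
    """Apply the cut-offs: scaled CFI and TLI > 0.95, scaled RMSEA < 0.06, SRMR < 0.08."""
    checks = {
        "cfi": cfi > 0.95,
        "tli": tli > 0.95,
        "rmsea": rmsea < 0.06,
        "srmr": srmr < 0.08,
    }
    # Fit is considered sufficient only if every criterion is met.
    checks["sufficient"] = all(checks.values())
    return checks


# Anxiety item bank: scaled CFI = 0.96, scaled TLI = 0.96, scaled RMSEA = 0.10, SRMR = 0.04
anxiety_fit = check_one_factor_fit(cfi=0.96, tli=0.96, rmsea=0.10, srmr=0.04)
```

For these values, only the RMSEA criterion fails, which is the situation in which the bi-factor analysis was added.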

In case of insufficient one-factor model fit, an exploratory bi-factor analysis was conducted using the R package psych (version 1.9.12.31) [32]. In a bi-factor model, it is assumed that covariance among item responses can be accounted for by a general factor representing shared variance among all items, and orthogonal group factors representing shared variance over and above the general factor among subsets of items [33, 34]. In case of a strong general factor, an item bank might be considered as unidimensional enough for IRT modeling [33, 34]. Unidimensionality was examined by the Omega-hierarchical (ωh) and the explained common variance (ECV). An ωh > 0.80 in combination with an ECV > 0.60 was regarded as an indicator of unidimensionality [35].
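To make the two bi-factor indices concrete, the sketch below computes ECV and ωh from standardized loadings, assuming orthogonal factors. It is a simplified illustration of the textbook formulas, not a reproduction of the psych package output; the function names and the example loadings are made up.

```python
def ecv(general, groups):
    """Explained common variance: share of common variance due to the general factor."""
    general_ss = sum(l * l for l in general)
    group_ss = sum(l * l for group in groups for l in group)
    return general_ss / (general_ss + group_ss)


def omega_h(general, groups):
    """Omega-hierarchical: share of total score variance due to the general factor.

    `groups` holds one loading vector per group factor, with 0.0 for items
    that do not load on that factor. Loadings are assumed to be standardized.
    """
    n_items = len(general)
    uniquenesses = [
        1.0 - general[i] ** 2 - sum(group[i] ** 2 for group in groups)
        for i in range(n_items)
    ]
    general_var = sum(general) ** 2
    group_var = sum(sum(group) ** 2 for group in groups)
    return general_var / (general_var + group_var + sum(uniquenesses))


# Hypothetical loadings for four items and two group factors
general_loadings = [0.8, 0.8, 0.8, 0.8]
group_loadings = [[0.3, 0.3, 0.0, 0.0], [0.0, 0.0, 0.3, 0.3]]
unidimensional_enough = (
    omega_h(general_loadings, group_loadings) > 0.80
    and ecv(general_loadings, group_loadings) > 0.60
)
```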

Local independence was examined by evaluating residual correlations after controlling for the dominant factor. Residual correlations > 0.20 were considered as indicators of local dependence [28]. Since residual correlations < 0.20 can still lead to model misfit, in addition, we permitted residual correlations with the highest modification indices (MI) and examined improvement of model fit. A change of 0.01 for the scaled CFI and 0.015 for the scaled RMSEA was considered as improved model fit [36].

Monotonicity was examined by a non-parametric IRT model fit with Mokken scaling using the R package Mokken (version 2.8.11) [37]. Model fit was examined by the scalability coefficient H. Coefficient H ≥ 0.30 per item and ≥ 0.50 for the total scale were considered as indicators of an acceptable monotonicity [38].

Second, IRT Graded Response Model (GRM) fit was examined using the R package Mirt (version 1.30) [39]. GRM is an IRT model for ordinal data in which discrimination and threshold parameters are estimated per item using marginal maximum likelihood. The sizes of residuals between observed and expected response frequencies were examined with generalized Orlando and Thissen’s S-X2 statistics for polytomous data; S-X2 p value > 0.001 was considered as an indicator of sufficient item fit [40, 41].
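For readers less familiar with the GRM, the following sketch shows how one discrimination parameter and four ordered threshold parameters yield the probabilities of the five response categories. The function name and the parameter values are illustrative assumptions; they are not the estimates from this study.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Category probabilities under Samejima's graded response model.

    The cumulative probability of responding in category k or higher is
    1 / (1 + exp(-a * (theta - b_k))); each category probability is the
    difference between adjacent cumulative probabilities. Thresholds must
    be in increasing order.
    """
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-discrimination * (theta - b))))
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item: a = 2.0, thresholds b1..b4 on the theta metric
probs = grm_category_probs(theta=-0.1, discrimination=2.0, thresholds=[-0.1, 1.0, 2.0, 3.0])
```

At a trait level equal to the first threshold, the probability of the lowest category (“never”) is exactly 0.5, which is how threshold parameters can be read.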

Third, DIF was examined for gender, age group (i.e., aged 8–12 and 13–18), region, ethnicity, and language (i.e., Dutch and English) using the R package Lordif (version 0.3–3) [42]. DIF for language was examined by comparing item responses in our dataset to the dataset PROMIS 1 Pediatric Supplement downloaded at HealthMeasures Dataverse (N = 1,525, mean age (SD) = 12.1 (2.6), girls = 52.1%) [43]. Uniform and non-uniform DIF were examined by ordinal logistic regression models, in which the probability of giving a certain response to an item was modeled as a function of the trait, the group variable, and the interaction of the trait and the group variable. McFadden pseudo R2 > 0.02 was considered as an indication of DIF [42].
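The logic of the nested-model comparison can be sketched as follows. Three ordinal logistic models are compared: trait only, trait plus group, and trait plus group plus their interaction. The sketch assumes that the 0.02 criterion is applied to the change in McFadden pseudo R2 between successive models; the function names and the log-likelihood values are hypothetical.

```python
def mcfadden_r2(loglik_model, loglik_null):
    """McFadden pseudo R2: 1 - (log-likelihood of model / log-likelihood of null model)."""
    return 1.0 - loglik_model / loglik_null


def flag_dif(ll_null, ll_trait, ll_trait_group, ll_interaction, threshold=0.02):
    """Flag uniform and non-uniform DIF from changes in pseudo R2 (assumed criterion)."""
    r2_trait = mcfadden_r2(ll_trait, ll_null)
    r2_group = mcfadden_r2(ll_trait_group, ll_null)
    r2_interaction = mcfadden_r2(ll_interaction, ll_null)
    return {
        # Uniform DIF: adding the group variable improves the trait-only model.
        "uniform": (r2_group - r2_trait) > threshold,
        # Non-uniform DIF: adding the trait-by-group interaction improves it further.
        "non_uniform": (r2_interaction - r2_group) > threshold,
    }


# Hypothetical log-likelihoods for one item
result = flag_dif(ll_null=-1000.0, ll_trait=-800.0, ll_trait_group=-770.0, ll_interaction=-765.0)
```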

In addition to the PROMIS analysis plan, we examined reliability, which is conceptualized as “information” in IRT. Information (I) is inversely related to standard errors (SEs) and can differ across levels of the measured trait (θ) as indicated by the formula: \(SE(\theta) = \frac{1}{\sqrt{I(\theta)}}\). We calculated SEs of the short forms and CAT simulations, which were performed using the R package catR (version 3.16) [44]. The CATs started with an item on trait level 0 (which corresponds to a T-score of 50) and used the stopping rule of a minimum of four items administered, a maximum of 12 items administered, or an SE < 0.316 (which corresponds to a reliability of 0.90).
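The relation between information, SE, and reliability, and the stopping rule of the CATs, can be sketched as below. The conversion reliability = 1 − SE² assumes a trait metric with variance 1, and the stopping logic is one reading of the rule described above (stop at 12 items, or once at least four items are administered and SE < 0.316). The function names are hypothetical and this is not the catR implementation.

```python
import math


def se_from_information(information):
    """SE(theta) = 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(information)


def reliability_from_se(se_theta):
    """Reliability on a theta metric with unit variance: 1 - SE^2 (assumption)."""
    return 1.0 - se_theta ** 2


def cat_should_stop(n_items, se_theta, min_items=4, max_items=12, se_target=0.316):
    """Return True when the CAT should stop administering items."""
    if n_items >= max_items:
        return True
    return n_items >= min_items and se_theta < se_target
```

Under these assumptions, an information value of 10 gives an SE of about 0.316 and a reliability of about 0.90, while a reliability of 0.80 corresponds to an SE of about 0.447 (4.47 on the T-score metric).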

We calculated SEs by using two sets of parameters. The first set of parameters was retrieved from the GRM item fit. We used these Dutch parameters to compare the SEs of the short forms and CATs to the SEs of the GRM fitted RCADS-22 items per subscale. All SEs were plotted on a Dutch metric with a mean T-score of 50 and a SD of 10 in the Dutch sample. The second set of parameters was the official set of U.S. item parameters in the U.S. calibration sample, obtained from HealthMeasures. We used these U.S. parameters to standardize the SEs on the official PROMIS T-score metric with a mean T-score of 50 and a SD of 10 in the U.S. reference sample.

Next, we examined construct validity of the short forms and CATs using SPSS Statistics version 21. We tested predefined hypotheses (following the internationally consensus-based COSMIN checklist [45]). We expected positive correlations ≥ 0.70 between T-scores on the Dutch metric and the corresponding RCADS-22 total subscale scores. We expected lower positive correlations between T-scores and the non-corresponding RCADS-22 total subscale scores.

Finally, we calculated Dutch reference scores on the universal U.S. PROMIS T-score metric per age group (i.e., aged 8–12 and 13–18) and gender in the representative Kantar Public sample and in the total sample. We determined severity cut-offs based on percentiles in the Kantar Public sample [46]: minimal (< 75th percentile), moderate (75–95th percentile), and severe (≥ 95th percentile).
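The percentile-based cut-offs can be illustrated with the sketch below. The percentile method (inclusive linear interpolation from the Python standard library) is an assumption and may differ from the method used in this study; the function names are hypothetical. The example cut-offs of 50.77 and 61.49 are the Anxiety values reported in the Results.

```python
from statistics import quantiles


def severity_cutoffs(t_scores):
    """Return the 75th and 95th percentiles of a sample of T-scores."""
    percentiles = quantiles(t_scores, n=100, method="inclusive")
    # quantiles() returns 99 cut points; index 74 is the 75th, index 94 the 95th.
    return percentiles[74], percentiles[94]


def classify_severity(t_score, p75, p95):
    """Minimal (< 75th percentile), moderate (75th to < 95th), severe (>= 95th)."""
    if t_score < p75:
        return "minimal"
    if t_score < p95:
        return "moderate"
    return "severe"


# Anxiety cut-offs reported in the Results
label = classify_severity(43.8, p75=50.77, p95=61.49)
```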

Results

Sample characteristics

Kantar Public and Panel Inzicht had a response rate of 51 and 39%, respectively. Of 2,933 respondents, 40 were deleted because of conflicting responses on the reverse coded question (Fig. 1). No children participated in both surveys.

The Kantar Public sample was representative per age group 8–12 and 13–18 years: all deviations from the gold standard [26] were < 2.5%, except for native children aged 8–12 (the deviation was 2.9%) (Table 1). No items were flagged for DIF for panel, and the two samples were combined for psychometric analysis (N = 2,893).

Table 1 Demographic characteristics of the various study samples

Anxiety

The IRT assumptions were considered to be met. Initially, unidimensionality was partly shown (scaled CFI = 0.96; scaled TLI = 0.96; scaled RMSEA = 0.10; SRMR = 0.04; factor loadings varied from 0.71 to 0.91). Since the scaled RMSEA was > 0.06, an additional exploratory bi-factor analysis was conducted, which yielded high factor loadings on a general factor (0.60 to 0.81). The ωh was 0.83, and the ECV was 0.79, indicating the item bank could be considered as unidimensional enough.

No local dependence was found. Permitting a residual correlation between the two items with the highest MI (i.e., 482.785), 2230R1r “I got scared really easy” and 227bR1r “I felt afraid”, improved model fit (scaled CFI = 0.97, scaled RMSEA = 0.09). Additionally permitting a residual correlation between the two items with the second highest MI did not further improve model fit.

Monotonicity was considered sufficient; Mokken scalability coefficients of the items ranged from 0.53 to 0.65, and H of the full length item bank was 0.61.

All 15 Anxiety items showed sufficient GRM model fit (Table 2). Discrimination parameters ranged from 1.79 to 3.45. Threshold parameters ranged from − 0.10 to 3.52.

Table 2 IRT item characteristics of the PROMIS pediatric Anxiety and Depressive Symptoms item banks in a general Dutch population (N = 2,893)

No items were flagged for DIF for gender, age group, region, social class, ethnicity, or language.

Figure 2a shows the SEs of the full length item bank, short form, CATs, and RCADS-22 anxiety subscale along the T-scores scale, calculated with Dutch parameters. The short form showed a SE < 3.16 for 51% of the participants, the CATs for 59% of the participants. The CATs used an average of 8.3 items. Item 5044R1r “I felt worried” had the highest discriminating value at T = 50 and was therefore administered first in the CATs. The short form and CATs showed a higher reliability over a broader range of T-scores than the RCADS-22 anxiety subscale with a smaller (average) number of items.

Fig. 2

a Standard error of measurement over the range of T-scores for the full length Dutch-Flemish PROMIS pediatric item bank v2.0 Anxiety, short form 8a, and CATs, based on Dutch parameters, compared to the RCADS-22 anxiety subscale; b Standard error of measurement over the range of T-scores for the full length Dutch-Flemish PROMIS pediatric item bank v2.0 Anxiety, short form 8a, and CATs, based on official U.S. parameters; c Standard error of measurement over the range of T-scores for the Dutch-Flemish PROMIS pediatric item bank v2.0 Depressive Symptoms, short form 8a, and CATs, based on Dutch parameters, compared to the RCADS-22 depression subscale; d Standard error of measurement over the range of T-scores for the Dutch-Flemish PROMIS pediatric item bank v2.0 Depressive Symptoms, short form 8a, and CATs, based on official U.S. parameters. CAT computerized adaptive test; RCADS Revised Child Anxiety and Depression Scale; M mean; SD standard deviation

Figure 2b shows the SEs of the full length item bank, short form, and CATs along the official U.S. T-score metric. The short form showed a SE < 3.16 for 2% of the participants, the CATs for 26% of the participants. Especially participants with T-scores < 43 were unreliably estimated (i.e., reliability < 0.80). The CATs used an average of 11.5 items. Item 227bR1r “I felt scared” had the highest discriminating value at T = 50 and was therefore administered first in the CATs.

Both hypotheses to examine construct validity were confirmed. Pearson’s r between the short form and CATs and the RCADS-22 anxiety subscale was 0.75 and 0.74, respectively. The correlations were lower with the RCADS-22 depression subscale: r = 0.70 and r = 0.70, respectively.

Table 3 shows mean T-scores and SDs per age group and gender in the representative Kantar Public sample and in the total sample on the official U.S. T-score metric. The mean (SD) T-score of the representative sample was 43.8 (9.9) and varied from 41.1 to 45.5 across subgroups. T-scores < 50.77 indicated minimal symptoms, 50.77 ≤ T-scores < 61.49 indicated moderate symptoms, and T-scores ≥ 61.49 indicated severe symptoms. The mean (SD) T-score of the total sample was 44.0 (10.5) and varied from 41.5 to 46.2 across subgroups.

Table 3 Mean T-scores and standard deviations for each PROMIS pediatric item bank, age group, and gender in a representative general Dutch sample and in the total sample

Depressive Symptoms

The IRT assumptions were considered to be met. Initially, unidimensionality was partly shown (scaled CFI = 0.99; scaled TLI = 0.99; scaled RMSEA = 0.07; SRMR = 0.02; factor loadings varied from 0.72 to 0.94). Since the scaled RMSEA was > 0.06, an additional exploratory bi-factor analysis was conducted, which yielded high factor loadings on a general factor (0.60–0.90). The ωh was 0.95, and the ECV was 0.93, indicating the item bank could be considered as unidimensional enough.

No local dependence was found. Permitting residual correlations between two items with the highest MI did not improve model fit.

Monotonicity was considered sufficient; Mokken scalability coefficients of the items ranged from 0.57 to 0.75, and H of the full length item bank was 0.69.

Three out of 14 items did not show sufficient GRM item fit: 2697R1r “I wanted to be by myself”, 7010 “I felt sad for no reason”, and 9001r “I felt too sad to eat” (Table 2). Item discrimination parameters ranged from 1.82 to 4.86. Threshold parameters ranged from − 0.30 to 3.78.

No items were flagged for DIF for gender, age group, region, social class, and ethnicity, but two items were flagged for uniform DIF for language: 2697R1r “I wanted to be by myself” (R2 = 0.030), and 488R1r “I could not stop feeling sad” (R2 = 0.031).

Figure 2c shows the SEs of the full length item bank, short form, CATs, and RCADS-22 depression subscale along the T-scores scale, calculated with Dutch parameters. The short form showed a SE < 3.16 for 54% of the participants, the CATs for 65% of the participants. The CATs used an average of 6.8 items. Item 461R1r “I felt lonely” had the highest discriminating value at T = 50 and was therefore administered first in the CATs. The short form and CATs showed a higher reliability over a broader range of T-scores than the RCADS-22 depression subscale.

Figure 2d shows the SEs of the full length item bank, short form, and CATs along the official U.S. T-score metric. The short form showed a SE < 3.16 for 34% of the participants, the CATs for 41% of the participants. Especially participants with T-scores < 42 were unreliably estimated (i.e., reliability < 0.80). The CATs used an average of 9.8 items. Item 5035R1r “I felt like I couldn’t do anything right” had the highest discriminating value at T = 50 and was therefore administered first in the CATs.

Both hypotheses to examine construct validity were confirmed. Pearson’s r between the short form and CATs and the RCADS-22 depression subscale was 0.78 and 0.76, respectively. The correlations were lower with the RCADS-22 anxiety subscale: r = 0.69 and r = 0.67, respectively.

Table 3 shows mean T-scores and SDs per age group and gender in the representative Kantar Public sample and in the total sample on the official U.S. T-score metric. The mean (SD) T-score of the representative sample was 44.7 (10.6) and varied from 42.9 to 47.5 across subgroups. T-scores < 52.78 indicated minimal symptoms, 52.78 ≤ T-scores < 62.69 indicated moderate symptoms, and T-scores ≥ 62.69 indicated severe symptoms. The mean (SD) T-score of the total sample was 45.0 (11.2) and varied from 43.2 to 47.9 across subgroups.

Discussion

We evaluated the psychometric properties of the PROMIS pediatric item banks v2.0 Anxiety and Depressive Symptoms, the short forms 8a, and CATs in a general Dutch population. The results support the unidimensionality, local independence, and monotonicity of both item banks and suggest sufficient GRM item fit, except for three Depressive Symptoms items. Neither item bank showed DIF for gender, age group, region, social class, or ethnicity, but two Depressive Symptoms items showed DIF for language. With short forms and CATs, reliable scores (reliability > 0.80) were obtained for children with moderate and severe levels of anxiety and depression. Construct validity of both short forms and CATs was considered sufficient. Mean T-scores for Anxiety and Depressive Symptoms were 43.8 and 44.7, respectively, in a representative sample.

Permitting a residual correlation between the two Anxiety items with the highest MI (i.e., 2230R1r “I got scared really easy” and 227bR1r “I felt afraid”) improved model fit, but did not distort parameter estimates. When deleting the item with the lowest discrimination parameter (i.e., 227bR1r “I felt afraid”), discrimination parameters did not change meaningfully (differences ranged from 0.00 to 0.12, except for item 2230R1r “I got scared really easy”, for which the difference was 0.37).

Three Depressive Symptoms items showed poor GRM item fit: 2697R1r “I wanted to be by myself”, 7010 “I felt sad for no reason”, and 9001r “I felt too sad to eat”. These items are not included in the short form but were used in 18% to 63% of the CATs, despite the fact that 7010 “I felt sad for no reason” and 9001r “I felt too sad to eat” had low response curves and therefore a low probability of being selected. A possible explanation is that the Depressive Symptoms item bank consists of only 14 items, and that more informative items measuring similar trait levels are lacking. Also, U.S. discrimination parameters are low, and therefore, almost all items needed to be administered to get a reliable result.

Reliability of the short forms and CATs seemed higher when based on Dutch parameters than when based on U.S. parameters. An explanation might be that more items show DIF for language than presented by Lordif (i.e., 2697R1r “I wanted to be by myself”, and 488R1r “I could not stop feeling sad”) [47]. Therefore, we additionally examined DIF for language for both item banks using IRT PRO. According to IRT PRO, all Anxiety and Depressive Symptoms items showed DIF for language, except for the Anxiety item 5044R1r “I felt worried”. Another explanation might be that calibration samples differed. First, the U.S. calibration sample consisted of a combined general population subset and clinical sample, while the Dutch calibration sample consisted of a general population sample only. This may explain the lower T-scores in the Dutch sample (means were 44.0 and 45.0 in the total sample) as compared to the centered average score of 50 in the U.S. calibration sample. The fact that the Dutch calibration sample was a general population sample led to a skewed distribution in scores, which may have led to inflated discrimination parameters and overestimation of reliability in the Dutch sample [47]. Second, U.S. participants were on average younger than Dutch participants (57.7% versus 47.5% of children aged 8–12 in the U.S. and Dutch sample, respectively). Third, U.S. participants were recruited in person, while Dutch participants were recruited via the internet.

Construct validity of both short forms and CATs was considered sufficient, although differences in correlations between corresponding and non-corresponding constructs were small. These results could be expected given that RCADS short version subscales correlate highly [27, 48].

Strengths of this study are its large sample size and state-of-the-art analyses. A limitation is the use of internet survey providers for recruitment of participants, which hampers replication of research procedures; however, it enabled taking a representative sample. Furthermore, the skewed distribution of our data might have caused problems in item parameter estimation. Given the differences between the Dutch and U.S. calibration samples, it is currently not possible to conclude which item parameter set is most appropriate for use in the Dutch population.

The results of this study support the use of both item banks in the Netherlands. Both short forms and CATs showed a reliability > 0.80 for most T-scores ≥ 43. The reliability is higher over a broader range in level of anxiety or depression and with fewer items than the reliability of a CTT questionnaire like RCADS-22 [27]. Therefore, both item banks seem useful for assessing and monitoring anxiety and depression in a general population. For now, we recommend the use of U.S. item parameters according to PROMIS convention.

Future research could compare country specific to universal U.S. parameters by collecting additional data in Dutch and U.S. samples using equal inclusion criteria. Furthermore, future research could examine whether more items can be developed, given that both item banks consist of a limited number of items, and three Depressive Symptoms items showed poor GRM item fit.

To conclude, the Dutch-Flemish PROMIS pediatric item banks v2.0 Anxiety and Depressive Symptoms showed sufficient unidimensionality, local independence, monotonicity, and GRM item fit (except for three Depressive Symptoms items) in a general Dutch population. DIF for language results were mixed. The short forms 8a and CATs showed sufficient reliability in children with moderate and severe levels of anxiety and depression and sufficient construct validity. More research is needed to examine whether Dutch or U.S. item parameters are optimal for use in the Dutch population.