The consistent measurement of a construct is critical for evaluating change in treatment outcome studies (i.e., the anxiety construct measured at baseline is the same anxiety construct measured at posttreatment). Furthermore, longitudinal research with youth occurs across their development, be it in the short- or long-term. Without demonstrations that the measurement is consistent across time, it is unclear whether changes reflect true changes, or changes in measurement properties across time. Longitudinal measurement invariance (LMI) is a statistical approach to test this assumption by examining equality of measurement properties over time [1]. For measures of anxiety in youth this assumption is infrequently checked, particularly in the context of treatment.

A review of the literature found three studies that examined LMI in measures of youth anxiety: the Revised Child Anxiety and Depression Scale (RCADS) [2], the Social Anxiety Scale for Adolescents (SAS-A) [3], and the Screen for Child Anxiety Related Disorders (SCARED) [4]. All levels of invariance (i.e., configural, metric, and scalar) were supported for the RCADS and the SAS-A, both self-report measures, indicating that the same construct is measured over time [2, 3]; however, support for the different levels of invariance was inconsistent across subscales of the SCARED parent- and youth self-reports [4]. All three studies used community samples and naturalistic follow-ups ranging from 2.5 to 4 years. Studies have yet to examine LMI in treatment studies of youth anxiety. Evaluating LMI in treatment studies is integral to determining whether changes over time result from the treatment or are potentially influenced by changes in the measurement properties of the measures used (e.g., if items relate differently to the latent anxiety construct before and after treatment, perhaps due to increased psychoeducation or a change in severity, this may contribute to the observed change in scores).

There are a small number of commonly used measures of anxiety symptomatology in clinical trials. The Pediatric Anxiety Rating Scale (PARS) [5], the Multidimensional Anxiety Scale for Children (MASC) [6], and the Screen for Child Anxiety Related Disorders (SCARED) [7, 8] are frequently used measures in treatment outcome studies (e.g., PARS: [9,10,11]; MASC: [12,13,14]; SCARED: [15,16,17]). The PARS is an Independent Evaluator-rated measure of anxiety severity and impairment based on interviewing both youth and parents, and the MASC and SCARED are measures of anxiety symptoms with both a child- and parent-report. The PARS has a single-factor structure [5], the MASC a four-factor structure [6], and the SCARED a five-factor structure [7]. Despite their use in treatment outcome studies, previous psychometric evaluations are largely cross-sectional or have focused on their ability to detect change in anxiety. However, these analyses assume that the pre- and post-intervention assessments are equivalent and that the change detected is actually a change in the construct measured. The question of whether these measures assess the same anxiety construct consistently throughout treatment (e.g., baseline and posttreatment) has yet to be evaluated.

The present study examined longitudinal measurement invariance of five measures of anxiety severity and symptoms (i.e., total and/or subscale scores for the PARS, MASC parent-report, MASC child-report, SCARED parent-report, SCARED child-report) in a large sample of anxious youth at baseline and posttreatment. A series of models with increasing levels of invariance was estimated. Due to prior findings (e.g., Olino and colleagues [4]), we hypothesized that thresholds or intercepts may change across time and, thus, scalar invariance would not consistently be found (e.g., for the SCARED). This may reflect that youth have changes in thresholds of experienced anxiety needed to endorse higher level severity options.

Methods

Sample

The study included 488 youth, aged 7–17 years (M = 10.69, SD = 2.80), who participated in the Child-Adolescent Anxiety Multimodal Study (CAMS). The sample was 49.6% female, and 25.4% of the sample was characterized as low socioeconomic status. Of the sample, 78.9% was White, 9.0% was Black, 2.5% was Asian, and 9.6% identified as a different racial group. All participants met diagnostic criteria for a principal diagnosis of generalized anxiety disorder (GAD), social anxiety disorder (Soc), and/or separation anxiety disorder (Sep). Of the participants, 35.9% (n = 175) met criteria for all three diagnoses, 27.7% (n = 135) met criteria for Soc and GAD, 8.0% (n = 39) met criteria for Sep and GAD, 6.8% (n = 33) met criteria for Soc and Sep, 6.8% (n = 33) met criteria for GAD, 11.5% (n = 56) met criteria for Soc, and 3.3% (n = 16) met criteria for Sep. Comorbid diagnoses of lesser severity included attention-deficit hyperactivity disorder (11.9%; n = 58), oppositional defiant disorder (9.4%; n = 46), obsessive compulsive disorder (8.6%; n = 42), and other internalizing disorders (43.6%; n = 213). For more details about participants, see Kendall and colleagues [18].

Measures

Pediatric Anxiety Rating Scale (PARS)

The PARS is an Independent Evaluator (IE)-rated assessment of youth anxiety severity and impairment [5]. In the treatment study where this sample originated [19], a 6-item PARS total score was computed by summing six items assessing anxiety severity, frequency, distress, avoidance, and interference during the previous week. PARS item 1 (number of symptoms) is typically not included in the total score. Items were rated on a six-point scale from 0 to 5, with higher scores indicating greater impairment and severity. Historically, a 5-item PARS total score that further excludes the physical symptoms item has also been examined [5]. The 5-item PARS total score demonstrated r = .97 inter-rater reliability, r = .55 4-week retest reliability, and strong convergent and divergent construct validity [5]. In the present sample, the 6-item PARS demonstrated α = 0.72 internal consistency at baseline and α = 0.89 internal consistency at posttreatment.
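The internal consistency values reported for each measure are Cronbach's α coefficients. As an illustration only, the standard formula can be sketched in Python as below. This is not the code used in the study (analyses were run in R); the function name `cronbach_alpha` and the toy ratings are assumptions introduced for the example.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score)),
    using sample variances (n - 1 in the denominator).
    """
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item (column) across respondents.
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the summed total score across respondents.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data (hypothetical): four respondents, two items.
toy_scores = [[2, 3], [4, 4], [3, 5], [5, 4]]
toy_alpha = cronbach_alpha(toy_scores)
```

With the toy data, the item variances sum to 7/3 and the total-score variance is 3, so α = 2 × (1 − 7/9) = 4/9. Real item-level data would be passed in the same respondents-by-items layout.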

Multidimensional Anxiety Scale for Children (MASC)

The MASC is a 39-item child- (MASC-C) and parent-report (MASC-P) measure of anxiety symptoms in the prior two weeks [6]. Items were rated on a four-point scale from 0 to 3, with higher scores indicating greater anxiety symptoms. The MASC consists of four subscales: physical symptoms (12 items), social anxiety (9 items), harm avoidance (9 items), and separation anxiety (9 items). The MASC demonstrated good convergent and divergent validity, retest reliability, and diagnostic accuracy [6, 20,21,22,23]. In the present sample, internal consistency (α) of the MASC-C subscales and total score ranged from 0.69 (separation anxiety subscale) to 0.91 (total score) at baseline and 0.72 (separation anxiety subscale) to 0.93 (total score) at posttreatment, and internal consistency of the MASC-P subscales and total score ranged from 0.67 (harm avoidance subscale) to 0.88 (total score) at baseline and 0.73 (separation anxiety subscale) to 0.93 (social anxiety subscale) at posttreatment.

Screen for Child Anxiety Related Disorders (SCARED)

The SCARED is a 41-item child- (SCARED-C) and parent-report (SCARED-P) measure of anxiety symptoms [7, 8]. Items were rated on a three-point scale from 0 to 2, with higher scores indicating a greater presence of anxiety symptoms. Respondents are typically instructed to consider the past three months; however, in the present study, the time frame was reduced to the prior two weeks due to repeated administration. The SCARED consists of five subscales: panic/somatic (13 items), general anxiety (9 items), separation anxiety (8 items), social phobia (7 items), and school phobia (4 items). As CAMS excluded youth who refused to attend school due to anxiety, the school phobia subscale was not examined in the present study. The SCARED demonstrated good retest reliability as well as strong convergent and divergent validity [7, 8, 24]. In the present sample, internal consistency (α) of the SCARED-C subscales and total score ranged from 0.83 (separation anxiety subscale) to 0.94 (total score) at baseline and 0.83 (separation anxiety subscale) to 0.95 (total score) at posttreatment, and internal consistency of the SCARED-P subscales and total score ranged from 0.83 (general anxiety subscale) to 0.90 (total score and social phobia subscale) at baseline and 0.83 (separation anxiety subscale) to 0.94 (social phobia subscale) at posttreatment.

Procedures

Institutional review board approval and participant informed consent and assent were obtained. Treatment spanned a 12-week period with assessments completed by the child and parent as well as interviews conducted by IEs at multiple timepoints. Only data from assessments conducted at baseline and posttreatment (i.e., 12 weeks following the start of treatment) were used in the present study. Cognitive-behavioral therapy (CBT; Coping Cat) consisted of 14 sessions over 12 weeks with two parent/guardian-only sessions occurring on the same day as the child session. Medication (sertraline) was administered at a dose up to 200 mg per day. For a more complete description of the design, see Compton and colleagues [25].

Data Analysis

All analyses were conducted in R [26], version 3.5.2, using the lavaan package [27] and irr package [28]. For measures with total and subscale scores, single factor models for the full set of items as well as models for individual subscales were estimated to reflect the different ways the measures are used. Models examining the MASC and SCARED were estimated using the weighted least squares mean and variance adjusted (WLSMV) estimator because item responses were categorical (4-point and 3-point scales, respectively). As PARS items contain six response options, items were treated as continuous, and models were estimated using the robust maximum likelihood (MLR) estimator. Thus, thresholds were modeled for the MASC and SCARED, and intercepts were modeled for the PARS.

Acceptable model fit was indicated by a Comparative Fit Index (CFI) greater than 0.95 and a Root Mean-Square Error of Approximation (RMSEA) less than 0.08; good model fit was indicated by a CFI greater than 0.97 and an RMSEA less than 0.05 [29]. A non-significant χ2 test also indicates good model fit; however, this index is known to be overly sensitive in large samples. When acceptable model fit was not indicated, model residuals and modification indices were examined to determine whether adding any covariances between variables would improve model fit. This process was repeated until the “revised” model reached adequate fit. For tests of LMI, equivalent “reconciled” models were used, in which covariances added to the baseline model were also included in the posttreatment model and vice versa.
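The fit cutoffs above can be expressed as a simple classification rule. The sketch below is illustrative only and is not the study's code; the function name `classify_fit` is hypothetical, and it assumes that both the CFI and RMSEA criteria must be met for a model to reach a given level, which the text does not state explicitly.

```python
def classify_fit(cfi: float, rmsea: float) -> str:
    """Label model fit using the CFI and RMSEA cutoffs described in the text.

    Assumption: a level is reached only when BOTH indices meet its cutoff.
    """
    if cfi > 0.97 and rmsea < 0.05:
        return "good"
    if cfi > 0.95 and rmsea < 0.08:
        return "acceptable"
    return "poor"


# Values reported in the Results for the baseline SCARED-C second-order model.
scared_c_baseline = classify_fit(cfi=0.977, rmsea=0.050)
```

Under this reading, the baseline SCARED-C second-order model (CFI = 0.977, RMSEA = 0.050) is labeled acceptable rather than good, because the reported RMSEA is not below 0.05.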

Subsequently, a series of models with increasing levels of LMI were estimated for all measures. Residual covariances between the same item across time were permitted in each model. Models were specified freely estimating all factor loadings, intercepts, and thresholds and fixing the variance of the latent variable to 1. The following sequence of models was tested: configural invariance, metric (or weak) invariance, and scalar (or strong) invariance. The configural invariance model assigns the same factor structure to both the baseline and posttreatment latent factors (i.e., the same five PARS items are assigned as indicators in both models) with minimal other constraints (i.e., only the first item’s intercept and factor loading are constrained to be equal). Next, the metric invariance model examines differences in factor loadings by timepoint by placing equality constraints on the loadings for each observed indicator. Finally, the scalar invariance model examines differences in intercepts (for PARS) or thresholds (for MASC and SCARED) by timepoint by placing equality constraints on the intercepts for each item or thresholds between response options for each item. Put simply, scalar invariance indicates that mean-level comparisons can be conducted. Measurement invariance was indicated by a change in the CFI ≤ 0.01 and in the RMSEA ≤ 0.015 [30]. A metric of effect size (dMACS) [31] is also provided for each item from each measure/subscale to aid in interpretation of the degree of invariance. dMACS integrates both factor loadings and intercepts/thresholds into a single effect size metric. Values were interpreted as small (0.4), medium (0.6), and large (0.8) effects in accordance with guidelines for practical importance by Nye and colleagues [32].
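To make concrete how dMACS combines loadings and intercepts into one number, the sketch below computes it for a continuous indicator with a linear item response (the PARS case). It is a hedged illustration, not the study's implementation: the function name `dmacs_linear` and the example parameter values are hypothetical, the baseline is treated as the reference and posttreatment as the focal occasion by assumption, and the ordinal MASC and SCARED items would instead require expected scores based on thresholds, which this sketch does not cover.

```python
import math


def dmacs_linear(
    loading_ref: float,
    intercept_ref: float,
    loading_foc: float,
    intercept_foc: float,
    latent_mean_foc: float,
    latent_var_foc: float,
    sd_pooled: float,
) -> float:
    """dMACS for one continuous item with a linear expected score.

    The expected item score is intercept + loading * eta. The squared
    difference between reference and focal expected scores is averaged over
    a normal latent distribution for the focal occasion, then scaled by the
    pooled item standard deviation. For a + b * eta with eta ~ N(mu, var):
    E[(a + b * eta)^2] = a^2 + 2ab * mu + b^2 * (mu^2 + var).
    """
    d_intercept = intercept_ref - intercept_foc
    d_loading = loading_ref - loading_foc
    expected_sq_diff = (
        d_intercept ** 2
        + 2 * d_intercept * d_loading * latent_mean_foc
        + d_loading ** 2 * (latent_mean_foc ** 2 + latent_var_foc)
    )
    return math.sqrt(expected_sq_diff) / sd_pooled


# Hypothetical parameters: equal loadings, intercepts differing by 0.5.
example_dmacs = dmacs_linear(1.0, 0.5, 1.0, 0.0, 0.0, 1.0, 1.0)
```

In the hypothetical example only the intercepts differ, so the effect size equals the intercept difference divided by the pooled standard deviation (0.5). When loadings also differ, the latent mean and variance of the focal occasion determine how much the loading difference contributes.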

When model fit substantively diminished (i.e., decrease in CFI > 0.01 or increase in RMSEA > 0.015), partial invariance was assessed. Non-invariant items were identified by examining differences in estimates of model parameters. Equality constraints were lifted starting with the parameter with the largest difference and continued until the model achieved adequate fit. When equality constraints on thresholds required lifting, specified thresholds were lifted individually one at a time. Unconstrained parameters remained unconstrained for subsequent measurement invariance models. When lifting equality constraints did not substantially impact model fit, it was deemed that partial invariance was not found.
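The model comparison rule and the stepwise search for partial invariance can be sketched as follows. This is an illustrative outline rather than the procedure as implemented in the study: the names `invariance_supported` and `partial_invariance_search`, the `refit` callback (which stands in for re-estimating the model with a given set of constraints lifted), and the toy numbers are all assumptions. The sketch also assumes that the stopping rule is the same ΔCFI/ΔRMSEA criterion used for the nested comparisons.

```python
from typing import Callable, Dict, List, Tuple

EPS = 1e-9  # tolerance so values at the cutoff are not lost to float rounding


def invariance_supported(
    cfi_less: float, rmsea_less: float, cfi_more: float, rmsea_more: float
) -> bool:
    """True if adding constraints costs at most 0.01 CFI and 0.015 RMSEA."""
    cfi_drop = cfi_less - cfi_more
    rmsea_rise = rmsea_more - rmsea_less
    return cfi_drop <= 0.01 + EPS and rmsea_rise <= 0.015 + EPS


def partial_invariance_search(
    param_diffs: Dict[str, float],
    refit: Callable[[List[str]], Tuple[float, float]],
    reference_fit: Tuple[float, float],
) -> Tuple[List[str], bool]:
    """Free constrained parameters one at a time, largest difference first.

    param_diffs maps parameter labels to the difference in their estimates
    across timepoints; refit(freed) returns (CFI, RMSEA) for the model with
    those constraints lifted; reference_fit is the less constrained model.
    Returns the freed parameters and whether partial invariance was reached.
    """
    ref_cfi, ref_rmsea = reference_fit
    ordered = sorted(param_diffs, key=lambda p: abs(param_diffs[p]), reverse=True)
    freed: List[str] = []
    for label in ordered:
        freed.append(label)
        cfi, rmsea = refit(freed)
        if invariance_supported(ref_cfi, ref_rmsea, cfi, rmsea):
            return freed, True
    return freed, False


# Toy stand-in for model re-estimation: fit improves with each freed threshold.
def toy_refit(freed: List[str]) -> Tuple[float, float]:
    return 0.95 + 0.01 * len(freed), 0.06 - 0.005 * len(freed)


toy_diffs = {"item2_t1": 0.30, "item9_t1": 0.50, "item28_t2": 0.10, "item28_t1": 0.40}
freed_params, reached = partial_invariance_search(toy_diffs, toy_refit, (0.985, 0.040))
```

In the toy run, three thresholds are freed in descending order of their estimated difference before the change in fit falls within limits. If the loop exhausts all parameters without meeting the criterion, the function reports that partial invariance was not reached.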

Finally, as a means of estimating the substantive impact on the item parameters with and without residual covariances, a sensitivity analysis was conducted by estimating intraclass correlations (ICC). ICCs were calculated based on absolute agreement, 2-way mixed effects models and compared factor loadings, thresholds/intercepts, and residual variances. ICCs greater than 0.5, 0.75, and 0.9 indicate moderate, good, and excellent agreement, respectively [33]. Greater agreement would indicate that the addition of residual covariances did not substantively impact model parameters.
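For illustration, an absolute-agreement ICC from a two-way layout can be computed from the usual ANOVA mean squares, as sketched below. This is not the study's code (the study used the irr package in R); the function names `icc_absolute_agreement` and `agreement_label` are hypothetical, and the single-measure form of the coefficient is assumed because the text does not specify single versus average measures.

```python
from typing import List, Sequence


def icc_absolute_agreement(ratings: Sequence[Sequence[float]]) -> float:
    """Single-measure, absolute-agreement ICC for an n-by-k table.

    Rows are the parameters being compared (e.g., factor loadings) and
    columns are the model versions (with vs. without residual covariances).
    ICC = (MSR - MSE) / (MSR + (k - 1) * MSE + (k / n) * (MSC - MSE)).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means: List[float] = [sum(row) / k for row in ratings]
    col_means: List[float] = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


def agreement_label(icc: float) -> str:
    """Apply the 0.5 / 0.75 / 0.9 cutoffs described in the text."""
    if icc > 0.9:
        return "excellent"
    if icc > 0.75:
        return "good"
    if icc > 0.5:
        return "moderate"
    return "below moderate"


# Toy loadings (hypothetical): the second model shifts every estimate by 1.
shifted = icc_absolute_agreement([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
```

The toy example shows why absolute agreement is the stricter choice for a sensitivity analysis: the two columns are perfectly correlated, yet the constant shift lowers the coefficient to 2/3, whereas identical columns would give 1.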

Results

Data from the MASC-C, MASC-P, and the SCARED-C were present for all 488 participants at baseline. Baseline data were missing from only one participant on the PARS and from only three participants on the SCARED-P. Due to attrition, data at posttreatment were available for 439 (90.0%) participants on the PARS and the SCARED-C, for 436 (89.3%) participants on the MASC-C and the MASC-P, and for 435 (89.1%) participants on the SCARED-P. Attrition rates differed by treatment condition, with significantly lower rates for participants in the CBT condition (4.3%) than in the medication (17.3%) or placebo (19.7%) groups. The combination group (9.3%) did not differ from any other treatment condition [19]. Results of tests of unidimensionality and LMI are reported separately for the individual measures.
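The posttreatment availability percentages can be checked against the counts with a few lines of arithmetic. The sketch below is illustrative only; the variable and function names are assumptions, and the denominator is taken to be the full sample of 488 for every measure.

```python
BASELINE_N = 488  # full sample size reported in the Sample section

# Posttreatment counts as reported in the text.
posttreatment_n = {
    "PARS": 439,
    "SCARED-C": 439,
    "MASC-C": 436,
    "MASC-P": 436,
    "SCARED-P": 435,
}


def percent_available(n_available: int, n_total: int = BASELINE_N) -> float:
    """Percentage of the full sample with data, rounded to one decimal."""
    return round(100 * n_available / n_total, 1)


availability = {measure: percent_available(n) for measure, n in posttreatment_n.items()}
```

The computed values (90.0, 89.3, and 89.1) match the percentages reported above.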

PARS

Tests of Unidimensionality

The initial baseline model was an excellent fit to the data; however, the initial posttreatment model was not. A revised posttreatment model, including two residual covariance paths, was a good fit to the data. Specific added covariances can be found in Supplemental Materials. Model fit for final reconciled models with equivalent residual covariance paths can be found in Table 1. As noted in the Methods section, a 5-item PARS total score has also been used historically. For the 5-item score, a similar pattern of fit was found for the models used in tests of unidimensionality and LMI; these results can be found in the Supplemental Materials.

Table 1 Fit statistics for PARS baseline, posttreatment, and longitudinal measurement invariance models

Tests of LMI

The configural invariance model was an excellent fit to the data and, subsequently, the metric invariance and scalar invariance models were a good fit to the data (see Table 1 for fit statistics). Changes in the CFI and RMSEA between models were within acceptable limits indicating that model fit did not deteriorate with the inclusion of constraints. Thus, it is possible to conclude that the PARS total score has scalar invariance. All PARS items showed a large effect size difference (dMACS > 0.8).

MASC and SCARED Total Scores

An attempt was made to fit single factor models for the full set of items at baseline and posttreatment; however, all models were a poor fit to the data (see Supplemental Materials). Acceptable fit was not attained despite attempts to add residual covariances. Likewise, an attempt was made to fit models with a second-order latent factor structure where the anxiety measure was specified as a second-order latent factor indicated by its subscales, which in turn were indicated by the items comprising the subscale. Though the baseline SCARED-C model demonstrated acceptable fit [CFI = 0.977; RMSEA = 0.050 (90% CI = 0.046–0.053)], all other models either failed to converge or were a poor fit to the data and did not improve following attempts to add residual covariances.

MASC-C Subscales

Tests of Unidimensionality

For the physical symptoms subscale and the separation anxiety subscale, initial baseline and posttreatment models were an excellent fit to the data. However, for the social anxiety and harm avoidance subscales initial models were a poor fit to the data (see Table 2). Revised social anxiety subscale models, including five residual covariance paths in the baseline model and one residual covariance path in the posttreatment model, were an acceptable fit to the data. Similarly, revised harm avoidance subscale models, each including one residual covariance path, were an acceptable fit to the data. Specific added covariances can be found in Supplemental Materials. Model fit for final reconciled models with equivalent residual covariance paths can be found in Table 2.

Table 2 Fit statistics for MASC-C baseline and posttreatment models

Tests of LMI

The configural invariance model and, subsequently, the metric invariance model for all four subscales had good fit (see Table 3 for all fit statistics). Changes in the CFI and RMSEA between these models were within acceptable limits, indicating that model fit did not deteriorate with the inclusion of constraints. The scalar invariance models for the physical symptoms subscale and the separation anxiety subscale both had excellent fit, and changes were within acceptable limits. Good fit was also found for the scalar invariance model for the harm avoidance subscale; however, changes in both the CFI and RMSEA were in excess of acceptable limits. Partial scalar invariance was attained after freeing five thresholds (i.e., the first threshold for items 2, 9, 28, and 36 and the second threshold for item 28). Furthermore, the scalar invariance model for the social anxiety subscale demonstrated poor fit. Attempts to free equality constraints on thresholds did not yield a change in fit. Thus, it is possible to conclude that the MASC-C physical symptoms and separation anxiety subscales have scalar invariance, the MASC-C harm avoidance subscale has partial scalar invariance, and the MASC-C social anxiety subscale has metric invariance.

Table 3 Fit statistics for MASC-C longitudinal measurement invariance models

For the physical symptoms subscale, small effect size differences (i.e., dMACS < 0.4) were found for 16.7% of items and moderate effect size differences (i.e., 0.5 < dMACS < 0.7) were found for 75.0% of items. For the social anxiety subscale, moderate effect size differences were found for 66.7% of items and no small effect size differences were found. For the harm avoidance subscale, small effect size differences were found for 22.2% of items and moderate effect size differences were found for 44.4% of items. For the separation anxiety subscale, small effect size differences were found for 33.3% of items and moderate effect size differences were found for 33.3% of items. No large effects (i.e., dMACS > 0.8) were found for any item.

MASC-P Subscales

Tests of Unidimensionality

All MASC-P subscales required the addition of residual covariance paths for at least one timepoint. For the physical symptoms subscale, the initial posttreatment model was an acceptable fit to the data; however, the initial baseline model was not. A revised baseline model, including two residual covariance paths, was an acceptable fit to the data (specific added covariances can be found in Supplemental Materials). For the remaining subscales, all initial models were a poor fit to the data (see Table 4). Revised social anxiety subscale models, each including six residual covariance paths, were an acceptable fit to the data. Similarly, revised harm avoidance subscale models, including two residual covariance paths in the baseline model and one residual covariance path in the posttreatment model, were an acceptable fit to the data. Finally, revised separation anxiety subscale models, including three residual covariance paths in the baseline model and one residual covariance path in the posttreatment model, were an acceptable fit to the data. Model fit for all final reconciled models with equivalent residual covariance paths can be found in Table 4.

Table 4 Fit statistics for MASC-P baseline and posttreatment models

Tests of LMI

Configural invariance was not found for the MASC-P social anxiety subscale (see Table 5 for all fit statistics). For the remaining three subscales, the configural invariance model and, subsequently, the metric invariance model had good fit. Changes in the CFI and RMSEA between these models were within acceptable limits, indicating that model fit did not deteriorate with the inclusion of constraints. The scalar invariance model for the separation anxiety subscale had good fit, and changes were within acceptable limits.

Table 5 Fit statistics for MASC-P longitudinal measurement invariance models

For the physical symptoms subscale, no participants endorsed the highest option for item 18 at posttreatment, so only two thresholds were specified for that item in the scalar invariance model. Good fit was found for the scalar invariance model; however, changes in the CFI were in excess of the acceptable limit. Partial scalar invariance was attained after freeing seven thresholds (i.e., all three thresholds for item 1, the second and third thresholds for item 31, and the first threshold for items 27 and 20). Finally, the scalar invariance model for the harm avoidance subscale demonstrated poor fit. Partial scalar invariance was attained after freeing eight thresholds (i.e., all three thresholds for item 9 and the first threshold for items 2, 25, 13, 26, and 21). Thus, it is possible to conclude that the MASC-P separation anxiety subscale has scalar invariance and the MASC-P harm avoidance and physical symptoms subscales have partial scalar invariance; however, the MASC-P social anxiety subscale did not have even configural invariance.

For the physical symptoms subscale (see Footnote 1), moderate effect size differences (i.e., 0.5 < dMACS < 0.7) were found for 16.7% of items, large effect size differences (i.e., dMACS > 0.8) were found in 41.7% of items, and no small effect size differences (i.e., dMACS < 0.4) were found. For the social anxiety subscale, moderate effect size differences were found for 11.1% of items, large effect size differences were found for 88.9% of items, and no small effect size differences were found. For the harm avoidance subscale, small effect size differences were found for 11.1% of items, moderate effect size differences were found for 22.2% of items, and large effect size differences were found for 22.2% of items. For the separation anxiety subscale, small effect size differences were found for 11.1% of items, moderate effect size differences were found for 55.6% of items, and large effect size differences were found for 33.3% of items.

SCARED-C Subscales

Tests of Unidimensionality

For the panic/somatic subscale, initial baseline and posttreatment models were an excellent fit to the data. For the general anxiety subscale, the initial posttreatment model was an adequate fit to the data; however, the initial baseline model was not. A revised baseline model, including two residual covariance paths, was an acceptable fit to the data (specific added covariances can be found in Supplemental Materials). For the remaining subscales, all initial models were a poor fit to the data (see Table 6). Revised separation anxiety subscale models, including two residual covariance paths in the baseline model and one residual covariance path in the posttreatment model, were an acceptable fit to the data. Similarly, revised social phobia subscale models, including one residual covariance path in the baseline model and three residual covariance paths in the posttreatment model, were an acceptable fit to the data. Model fit for all final reconciled models for the above subscales with equivalent residual covariance paths can be found in Table 6.

Table 6 Fit statistics for SCARED-C baseline and posttreatment models

Tests of LMI

The configural invariance model and, subsequently, the metric invariance model for all subscales had good fit (see Table 7 for all fit statistics). Changes in the CFI and RMSEA between these models were within acceptable limits, indicating that model fit did not deteriorate with the inclusion of constraints. The scalar invariance models for the panic/somatic subscale, the separation anxiety subscale, and the social phobia subscale each had excellent fit, and changes were within acceptable limits. However, the scalar invariance model for the general anxiety subscale demonstrated poor fit. Attempts to free equality constraints on thresholds did not yield a change in fit. Thus, it is possible to conclude that the SCARED-C panic/somatic, separation anxiety, and social phobia subscales have scalar invariance and the SCARED-C general anxiety subscale has metric invariance.

Table 7 Fit statistics for SCARED-C longitudinal measurement invariance models

For the panic/somatic subscale, small effect size differences (i.e., dMACS < 0.4) were found for 15.4% of items, moderate effect size differences (i.e., 0.5 < dMACS < 0.7) were found for 38.5% of items, and no large effect size differences (i.e., dMACS > 0.8) were found. For the general anxiety subscale, moderate effect size differences were found for 33.3% of items, large effect size differences were found for 11.1% of items, and no small effect size differences were found. For the separation anxiety subscale, moderate effect size differences were found for 50.0% of items and no small or large effect size differences were found. For the social phobia subscale, moderate effect size differences were found for all items.

SCARED-P Subscales

Tests of Unidimensionality

For the panic/somatic subscale, initial baseline and posttreatment models were an acceptable fit to the data. For the remaining models, all initial models were a poor fit to the data (see Table 8). Revised general anxiety subscale models, including four residual covariance paths in the baseline model and one residual covariance path in the posttreatment model, were an acceptable fit to the data. Similarly, revised separation anxiety subscale models, including four residual covariance paths in the baseline model and three residual covariance paths in the posttreatment model, were an acceptable fit to the data. For the social phobia subscale, revised models, including three residual covariance paths in the baseline model and five residual covariance paths in the posttreatment model, were an acceptable fit to the data. Model fit for all final reconciled models for the above subscales with equivalent residual covariance paths can be found in Table 8.

Table 8 Fit statistics for SCARED-P baseline and posttreatment models

Tests of LMI

The configural invariance model for all subscales had good fit (see Table 9 for all fit statistics). For the panic/somatic subscale and the separation anxiety subscale, the metric invariance model and, subsequently, the scalar invariance model each had excellent fit, and changes were within acceptable limits. Good fit was also found for the metric invariance model for the general anxiety subscale; however, the scalar invariance model for the general anxiety subscale demonstrated poor fit. Attempts to free equality constraints on thresholds did not yield a change in fit. Finally, good fit was found for the metric invariance model for the social phobia subscale; however, changes in the RMSEA were in excess of acceptable limits. Partial metric invariance was attained after freeing the factor loading for item 39. The resulting partial scalar invariance model demonstrated adequate fit. Thus, it is possible to conclude that the SCARED-P panic/somatic and separation anxiety subscales have scalar invariance, the SCARED-P social phobia subscale has partial scalar invariance, and the SCARED-P general anxiety subscale has metric invariance.

Table 9 Fit statistics for SCARED-P longitudinal measurement invariance models

For the panic/somatic subscale, small effect size differences (i.e., dMACS < 0.4) were found for 15.4% of items, moderate effect size differences (i.e., 0.5 < dMACS < 0.7) were found for 53.8% of items, and large effect size differences (i.e., dMACS > 0.8) were found for 15.4% of items. For the general anxiety subscale, moderate effect size differences were found for 11.1% of items, large effect size differences were found for 88.9% of items, and no small effect size differences were found. For the separation anxiety subscale, moderate effect size differences were found for 12.5% of items, large effect size differences were found for 50.0% of items, and no small effect size differences were found. For the social phobia subscale, moderate effect size differences were found for 25.0% of items, large effect size differences were found for 50.0% of items, and no small effect size differences were found.

Sensitivity Analysis

As covariances were added to nearly all baseline and posttreatment models, intraclass correlations were calculated for all models comparing the factor loadings, intercepts/thresholds, and residual variances between models with and without covariances. All ICCs were greater than 0.955, indicating excellent agreement between parameters in models with and without covariances. Specific ICCs and model fit statistics for the models without added residual covariances can be found in Supplemental Materials.

Invariance Across Treatment Condition

As the present sample consists of multiple treatment conditions, we explored invariance across treatment condition to ensure that treatment condition did not confound LMI conclusions. Results support measurement invariance across treatment condition at baseline and posttreatment. Fit statistics for scalar invariance models for all measures at each timepoint can be found in Supplemental Materials.

Discussion

The present study examined longitudinal measurement invariance of five measures of anxiety (i.e., PARS, MASC and SCARED parent- and child-reports). Models were assessed with increasing levels of invariance, and the results present a mixed picture. Scalar invariance, which indicates that valid mean-level comparisons can be conducted [34], was found for the PARS total score and many, but not all, MASC and SCARED subscales (total score models for both the MASC and SCARED were a poor fit to the data and LMI would have had limited validity). Thus, conclusions from prior studies using the PARS are not contaminated by changes in measurement. Most MASC and SCARED subscales are similarly acceptable (e.g., MASC separation anxiety subscale, SCARED panic/somatic and separation anxiety subscales); however, caution is advised for conclusions drawn from longitudinal analyses based on the MASC social anxiety subscale and the SCARED general anxiety subscale. Likewise, caution is advised for longitudinal analyses of the MASC and SCARED total scores until it can be determined whether the total scores are invariant over time.

Results for the SCARED differ slightly from those found in a previous examination. That study found scalar invariance only in the parent-report general anxiety subscale (we found only metric invariance), partial scalar invariance in the child-report panic/somatic and social anxiety subscales (we found full scalar invariance), and partial metric invariance for the parent-report separation anxiety subscale (we found full scalar invariance) [4]. Similar results of metric invariance for the child-report general anxiety subscale and partial scalar invariance in the parent-report social anxiety subscale were found in both studies. The previous report elected not to examine LMI when initial fit at one timepoint was poor (e.g., for the child-report separation anxiety subscale). Had the same approach been used in the present study, LMI would have been tested only for both reports of the panic/somatic subscale, as all remaining subscales required the addition of residual covariances due to poor fit. As was concluded in the previous study and replicated here, changes in SCARED scores over time may reflect changes in measurement properties rather than solely changes due to an intervention [4].

For the MASC, all subscales other than the social anxiety subscale showed full or partial scalar invariance. For the child-report of the MASC social anxiety subscale, metric invariance was found, which indicates equality of factor loadings but not of thresholds. Given this level of invariance, tests of relative standing (e.g., correlations, regression) for this construct would be valid. Tests of mean-level change, however, would not be valid, given the lack of support for scalar invariance. For the parent-report of the MASC social anxiety subscale, even configural invariance was not supported. This indicates that the factor structures of the parent-report MASC social anxiety subscale at baseline and posttreatment are not equivalent, which poses challenges in many modeling contexts.
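The levels of invariance referred to above can be written compactly. The following is a sketch in generic notation for ordered-categorical items; the symbols $y^{*}$, $\lambda$, $\tau$, and $\eta$ are illustrative assumptions and are not taken from the original analyses.

```latex
% Item i, timepoint t (1 = baseline, 2 = posttreatment), response category c.
% y* is the continuous latent response underlying the observed category y.
\begin{align*}
y^{*}_{it} &= \lambda_{it}\,\eta_{t} + \varepsilon_{it}, \\
y_{it} &= c \quad \text{if } \tau_{it,c} < y^{*}_{it} \le \tau_{it,c+1}, \\[4pt]
\text{Configural:} &\quad \text{same pattern of free and fixed loadings at } t = 1, 2, \\
\text{Metric:} &\quad \lambda_{i1} = \lambda_{i2} \ \text{for all } i, \\
\text{Scalar:} &\quad \lambda_{i1} = \lambda_{i2} \ \text{and} \ \tau_{i1,c} = \tau_{i2,c} \ \text{for all } i, c.
\end{align*}
```

Under metric invariance alone, a one-unit difference in the latent variable has the same meaning at both timepoints, which is what supports correlations and regressions. Comparing latent means additionally requires equal thresholds, because otherwise a shift in observed responses could arise from the thresholds rather than from $\eta_{t}$.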

It is notable that the PARS, which had scalar invariance, is an Independent Evaluator-report based on interviewing both youth and parents, while the MASC and SCARED, which did not consistently demonstrate scalar invariance, are child- or parent-report measures. Research on measurement invariance by informant in youth has largely been conducted comparing child- and parent-reports (e.g., Olino and colleagues) [4], and no studies were found that included data from a therapist- or Independent Evaluator-report group. This situation may be a byproduct of the dearth of measures that have both a therapist-report and either a child- or parent-report. Furthermore, Independent Evaluator assessment is typically available only in randomized controlled trials or other stringent research settings. In the present study, the Independent Evaluators (IEs) may have had the strongest basis for evaluating the criterion items, so the support for measurement invariance may reflect that, with training, measures may be more stably assessed. Nevertheless, further examination of the present study’s discrepant findings is warranted, particularly in the context of the strong body of literature on informant differences and the benefits of a multi-informant approach [35].

Failure to find unidimensionality in the majority of models merits discussion. Although it is not uncommon to include residual covariances in LMI models of psychological constructs [36], both the number of baseline and posttreatment models that required their inclusion (i.e., 27 out of 38 models) and the number of residual covariances that had to be added in certain models (i.e., up to six) to reach adequate fit are noteworthy. That said, there was excellent agreement between models with and without residual covariances, indicating that the addition of residual covariances did not substantively impact findings. Failure to find unidimensionality has also been reported in measures of depression [37]. A lack of unidimensionality indicates that the scale/subscale totals comprise multiple factors rather than one factor and, therefore, may be indicative of multiple constructs rather than the one intended construct. Concerns about differentiating between diagnostic criteria for certain anxiety and related disorders in the DSM-5 have been raised [38], and this difficulty may contribute to the lack of unidimensionality. In the present study, 78.5% of participants met criteria for at least two of the target anxiety diagnoses (i.e., separation anxiety, social anxiety, and generalized anxiety disorders) and 35.9% met criteria for all three diagnoses [18]. Measures of anxiety likely reflect this overlap and contain items that fit criteria for or represent symptoms of multiple disorders rather than a single disorder. For example, the MASC item “The idea of going to camp scares me” that loads onto the separation anxiety subscale could also comprise an element of social anxiety if the fear is related to evaluation at camp rather than (or in addition to) a fear of being away from a loved one. Furthermore, unidimensionality is not required to attain a high value of Cronbach’s alpha [39], a commonly used measure of reliability (i.e., internal consistency) in validation studies. Thus, existing measures that have high values of Cronbach’s alpha and are believed to be unidimensional may not be unidimensional. As measures are developed and assessed, an added focus on unidimensionality rather than simply internal consistency is warranted to ensure that measures do not simply contain items that relate to one another but actually represent the same construct. Future research should also examine this in existing measures.
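The point that a high Cronbach’s alpha does not imply unidimensionality can be illustrated numerically. The following is a minimal sketch, not an analysis of the study data: it builds a hypothetical population correlation matrix for a 10-item scale with two correlated factors (the loading of 0.8, the factor correlation of 0.4, and all function names are assumptions chosen for illustration) and computes alpha directly from that matrix.

```python
import numpy as np


def cronbach_alpha_from_cov(cov):
    """Cronbach's alpha computed from an item covariance (or correlation) matrix."""
    k = cov.shape[0]
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
    return (k / (k - 1)) * (1.0 - np.trace(cov) / cov.sum())


def two_factor_correlation(items_per_factor=5, loading=0.8, factor_corr=0.4):
    """Population correlation matrix for items loading on two correlated factors.

    All parameter values are hypothetical and chosen only for illustration.
    """
    k = 2 * items_per_factor
    lam = np.zeros((k, 2))
    lam[:items_per_factor, 0] = loading   # first half loads on factor 1 only
    lam[items_per_factor:, 1] = loading   # second half loads on factor 2 only
    phi = np.array([[1.0, factor_corr], [factor_corr, 1.0]])
    common = lam @ phi @ lam.T            # variance explained by the factors
    unique = np.diag(1.0 - np.diag(common))  # uniqueness so each item has variance 1
    return common + unique


corr = two_factor_correlation()
alpha = cronbach_alpha_from_cov(corr)
eigenvalues = np.linalg.eigvalsh(corr)
n_large = int((eigenvalues > 1.0).sum())  # rough count of substantial dimensions

print(f"Cronbach's alpha: {alpha:.3f}")
print(f"Eigenvalues greater than 1: {n_large}")
```

With these assumed values, alpha is roughly 0.88 even though the items were generated from two distinct factors and two eigenvalues of the correlation matrix exceed 1. Internal consistency at this level would typically be described as good, yet a one-factor model would be misspecified for such a scale.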

A further examination of the included residual covariances revealed a potential theme across measures: redundant items. For example, for the MASC social anxiety subscale, multiple residual covariances were added between the item “I feel shy” and other items that are indicative of shyness (i.e., nervous about performing, difficulty asking others to play, worrying about being called on in class). For the SCARED, the generalized anxiety subscale included a residual covariance between the items “I am nervous” and “I am a worrier,” which youth may treat as synonyms. Likewise, the SCARED social phobia subscale includes the items “I don’t like to be with people I don’t know well” and “I feel nervous with people I don’t know well,” which may address the same content. Clinicians and researchers should be mindful of this apparent redundancy when selecting measures to assess anxiety in youth. The present findings should also be viewed in light of a recent content analysis of youth anxiety measures, which found low overlap such that only 23% of the 42 symptom categories were included in four or more of the seven examined measures (which included the MASC and the SCARED) [40]. As new measures are created and others are refined and updated, reducing this redundancy is worth considering as an avenue to address the lack of overlap and to capture differing anxiety presentations more broadly.

As the present sample is derived from a treatment outcome study that examined multiple treatments, a possible explanation for these findings (as well as a potential study limitation) is that one of the treatments may have differentially impacted the measurement of anxiety. For instance, it is possible that psychoeducation, an element of the two psychological treatment conditions, may alter one’s perception and understanding of anxiety differently than the non-therapy treatment conditions. Another consideration is that medication side effects not present at baseline may have impacted the model at posttreatment. Though minimally present overall, significant differences in some adverse events between the medication and CBT conditions were found (e.g., fatigue, restlessness, insomnia) [19]. It is also worth noting that the PARS has an item on physical symptoms and the MASC and SCARED have physical and somatic symptom subscales, respectively. Furthermore, it is possible that treatment dropout may have heightened these effects. Although little dropout occurred, participants in the cognitive-behavioral therapy condition (i.e., Coping Cat; 4.3%) were significantly less likely to drop out of treatment than those in the medication (i.e., sertraline; 17.3%) or placebo (19.7%) groups (the combination group, 9.3%, did not significantly differ from the other groups with respect to dropout) [19]. If medication side effects impacted the endorsement of items, differential participant dropout by treatment condition may have exacerbated this effect. Future research is required to examine this further, as well as measurement invariance across treatment conditions more generally.

The study findings should also be considered in light of limitations. First, as with all examinations of LMI, findings may be specific to the characteristics of the present sample (e.g., age range, time interval). Measurement invariance found in the present sample should therefore not be taken for granted in other samples. Next, as noted above, to pursue measurement invariance models, the majority of models had to be modified with the inclusion of residual covariances. Although statistically supported, it is possible that the added covariances may not be appropriate or may be providing a “crutch” to the model. Likewise, covariances were added following examination of modification indices, which is not a theory-driven approach. Results should be interpreted within this context, and future research should replicate these findings utilizing a theory-driven approach. Additionally, MASC and SCARED analyses used WLSMV estimation, which relies on pairwise data and thus is less adept at handling missing data than MLR or imputation. Finally, the CAMS sample was predominantly white, potentially limiting the generalizability of these findings to more diverse samples. It is possible that when replicated in more diverse samples, conclusions on longitudinal measurement invariance may differ from the present study’s findings, particularly if the measures are not invariant across demographic characteristics. Future studies should assess longitudinal measurement invariance in more diverse samples as well as assess measurement invariance across race and other demographic factors.

Summary

The present findings, combined with the existing literature, illuminate a complicated picture of whether anxiety measures consistently assess the same construct across treatment. Greater attention to longitudinal measurement invariance is needed when measures are designed and initially validated, as researchers need confidence that changes over time are due to treatment effects and not to changes in measurement properties. Presently, clinicians and researchers utilizing the MASC or the SCARED to monitor changes in anxiety symptoms may want to use an alternative measure, such as the RCADS, which, in addition to demonstrating measurement of the same construct over time [2], has also been recommended as a consensus measure of youth anxiety [41].