Oppositional Defiant Disorder (ODD) is one of the most commonly-occurring disorders in young children (Egger and Angold 2006; Lavigne et al. 2009). Relationships between early ODD and later developing conduct disorder (Burke and Loeber 2010), depression (Burke et al. 2010; Burke et al. 2005; Copeland et al. 2009; Lavigne et al. 2001) and anxiety (Drabick and Kendall 2010; Lavigne et al. 2001) are well established. However, less is known about whether ODD comprises a single construct or consists of multiple dimensions that differentially predict later externalizing or internalizing disorders. Recent studies have provided support for a multi-dimensional structure to ODD in which particular dimensions differentially predict later psychopathology. For example, in a sample of boys, Burke and Loeber (2010) found that ODD consisted of two dimensions, an affective dimension and a behavioral dimension, with the affective dimension associated with later depression. In a sample of girls, a model of ODD with three dimensions was identified; the negative affect dimension was associated with depression, while oppositional and antagonistic behavior were associated with subsequent conduct disorder (Burke et al. 2010). Similarly, other studies have found that an irritability dimension of ODD predicted emotional problems (Stringaris and Goodman 2009a, b) or anxiety (Rowe et al. 2010), while a headstrong dimension predicted later hyperactivity (Stringaris and Goodman 2009a, b) or depression (Rowe et al. 2010). In order for research on the homotypic and heterotypic continuity of ODD (longitudinal relationships with both later ODD/CD and internalizing disorders, respectively) to be meaningful, a critical first step is to identify the best dimensional structure of ODD itself.

Models of the Dimensions of Oppositional Defiant Disorder

To date, six different models (Figs. 1 and 2) of the dimensions of ODD symptoms have been identified. These include: (a) the single-factor DSM-IV model; (b) a two-factor model (oppositional behavior, negative affect) identified by Burke and colleagues (Burke and Loeber 2010; Burke et al. 2005) with a male sample at the University of Pittsburgh (Pitt-2 model), in which the symptoms “blames others” and “annoys others” did not load onto either factor in the factor analysis used to develop the model and were therefore not included (Burke and Loeber 2010); (c) a two-factor model (irritable, headstrong/spiteful) identified by Rowe et al. (2010) with the Great Smoky Mountains dataset (GSMS model); (d) a three-factor model (oppositional behavior, negative affect, antagonistic behavior) identified by Burke et al. (2010) with a female sample (Pitt-3 model); (e) a three-factor model (irritable, hurtful, headstrong) developed by Stringaris and Goodman (2009a, b) in the United Kingdom (UK/DSM-5 model), which has been adapted for use in DSM-5 with the factor labels changed (in DSM-5 the dimension labels are now angry/irritable, vindictiveness, and argumentative/defiant, respectively); and (f) a three-factor model (irritable, headstrong, hurtful) developed by Aebi et al. (2010) with a European/Middle Eastern clinical sample (EUR model). The three factors identified by Aebi et al. were highly correlated with one another (irritability with headstrong, 0.89; irritability with hurtful, 0.70; headstrong with hurtful, 0.63). The two factors identified by Rowe et al. were correlated 0.55. Correlations between factors were not reported for the other models. Other studies have examined the structure of a broader category of disruptive behavior disorders (Wakschlag et al. 2012) or a specific group of ODD symptoms associated with anger and irritability (Drabick and Gadow 2012) but not ODD per se, and are not considered further herein.

Fig. 1

a DSM-IV single-factor model. b Pitt two-factor model 1 (Burke et al. 2005). c GSMS two-factor model (Rowe et al. 2010). d Pitt three-factor model 2 (Burke et al. 2010)

Fig. 2

a UK/DSM-5 three-factor model (Stringaris and Goodman 2009a, b). b EUR three-factor model (Aebi et al. 2010)

Researchers who identified these six models were primarily concerned with the homotypic or heterotypic continuity of their model with other disorders, and relatively little attention was paid to the validity of the factor structures. Several studies (Aebi et al. 2010; Burke et al. 2005, 2010; Rowe et al. 2010) used exploratory factor analyses (EFAs) to identify the models, but none of these studies replicated the factor structure of their models in subsequent samples using a confirmatory factor analysis (CFA) approach. Stringaris and Goodman (2009a, b) identified the three-factor UK/DSM-5 model on an a priori basis rather than using EFAs or CFAs to support the validity of their model. Although more than one study has identified either two- or three-factor models, the items loading on these factors differed; thus, none of the models identified were replications of one another.

In addition, only two studies have conducted comparisons of model fit across multiple models. Ezpeleta et al. (2012) compared several models (single-factor DSM-IV model, UK/DSM-5 model, Rowe’s GSMS model, Pitt-3 model) using CFAs. Two fit indices were used, with a CFI value > 0.90 and an RMSEA value < 0.06 considered good and a CFI value > 0.85 and an RMSEA value < 0.10 considered moderately good. For parent-reported questionnaire data, only the Pitt-3 model showed good model fit on both indices, while the DSM-IV, UK/DSM-5, and GSMS models showed good fit on the CFI and moderately good fit on the RMSEA. For teacher-reported data, all four models showed good model fit on the CFI and moderately good fit on the RMSEA. The authors concluded there was “no compelling reason” (p. 8) to prefer one model to another. In a sample of parent-reported ODD symptoms among Brazilian children ages 6–12 years, Krieger et al. (2013) compared goodness of fit of four models (single-factor DSM-IV model, UK/DSM-5 model, Pitt-3 model, and GSMS model). Krieger et al. considered CFI and TLI values of > 0.95 as preferred and > 0.90 as acceptable, and RMSEA values of < 0.05 as preferred and “up to” 0.08 as acceptable. Krieger et al. did not describe the rules used to combine the findings for the different fit indices. However, for three of the four models (DSM-IV, GSMS, and Pitt-3), at least one of the three fit indices was not acceptable. Only for the UK/DSM-5 model were all three fit indices good or acceptable. They concluded that the three-factor UK/DSM-5 model best fit the data. Thus, the findings of the two comparative studies were quite different, with one finding all models reasonably acceptable and the other finding only one to be acceptable. Furthermore, neither the Ezpeleta et al. nor the Krieger et al. study included the Pitt-2 or EUR models, and these could have been important omissions.

One major limitation to the existing studies of the dimensions of ODD is the lack of attention to the factorial invariance of the proposed models. Factorial invariance concerns the degree to which the items used to measure a construct have the same meaning and measure the construct in the same way across different groups of respondents (Brown 2006; Saban et al. 2010). When invariance is not present, there is the possibility of construct bias in which the meaning of a construct differs across those groups or longitudinally (Kline 2011). If invariance is not present, it is impossible to determine how to interpret differences in risk factors or correlated features of symptoms, prognosis, or treatment outcomes that may be associated with defining group differences, such as gender (Burns et al. 2006) or age.

Presently, only one study (Burns et al. 2006) has examined the measurement and structural invariance of ODD symptoms across genders. Using parent reports of ODD in American (ages 3–16 years) and Malaysian (school-age) samples, Burns et al. found support for measurement and scalar invariance across genders in both samples for the single-factor DSM-IV model. However, they did not examine age invariance for this model. None of the existing studies have examined gender or age differences among the models positing specific dimensions of ODD symptoms. While Burns et al. found gender invariance for the single-dimension DSM-IV model, there are indications that the specific dimensions in two- and three-factor models may not be gender- or age-invariant. Specifically, the studies by Burke and colleagues showing a different number of factors for boys and girls suggest that the structure of ODD is not invariant across gender. Furthermore, Pardini et al. (2010) note that the DSM-IV field trials for ODD included relatively few girls, making it difficult to determine if the ODD construct is the same across genders. Pardini et al. also note that inadequate attention has been paid to longitudinal aspects of the development of ODD, while Burke and Loeber (2010) suggest that longitudinal changes in the prevalence of ODD may be accompanied by corresponding developmental shifts in its factor structure. Because ODD can occur in young children and lead to later internalizing and externalizing disorders even in the early school years, it is particularly important to understand the factor structure of ODD in young children, for whom ODD is the most common psychiatric disorder (Egger and Angold 2006; Lavigne et al. 2009).

Using multi-group CFA methods, it is possible to examine several important aspects of the invariance of a model, including the pattern of factor loadings (configural invariance), the magnitude of factor loadings (metric invariance), and the magnitude of intercepts (scalar invariance). If a model is not invariant, then the heterotypic continuity between ODD dimensions and other behavior problems may differ across ages and genders, so determining the invariance of the best fitting models for the dimensions of ODD has important implications and should be considered before examining heterotypic continuity.

The Present Study

Although DSM-5 adopted a three-factor model of ODD, that model was developed a priori and without adequate attention to the model’s gender and longitudinal invariance. Identifying the best-fitting model and establishing its invariance is important to guide future studies of the predictive ability of the model’s dimensions with other disorders. For that reason, the first aim of the present study was to determine which of the six existing measurement models provides the most accurate representation of the ODD construct. These six models are: (a) a single-factor model of oppositional behavior represented in DSM-IV (DSM-IV model); (b) a two-factor model (oppositional behavior, negative affect) of dimensions of ODD developed by Burke and colleagues (Burke et al. 2005) with a male sample at the University of Pittsburgh (Pitt-2 model); (c) a two-factor model developed by Rowe et al. (2010) with the Great Smoky Mountains dataset (GSMS model); (d) a three-factor model (oppositional behavior, negative affect, antagonistic behavior) of dimensions of ODD developed by Burke and colleagues (Burke et al. 2010) with a female sample (Pitt-3 model); (e) a three-factor model (irritable, hurtful, headstrong) developed by Stringaris and Goodman (2009a, b) in the United Kingdom (UK/DSM-5 model); and (f) a three-factor model developed by Aebi et al. (2010) with a European/Middle Eastern clinical sample (EUR model). After establishing which model provides the best fit, the second aim was to test: (a) cross-sectional hypotheses about this model’s measurement and structural invariance with respect to gender; and (b) longitudinal hypotheses about its measurement and structural invariance with respect to age.

Method

Participants

Data were collected as part of a longitudinal study of risk factors for the development of psychopathology across an important developmental period, ages 4 (preschool), 5 (kindergarten, transition to school), and 6–7 (early school-age). To obtain a diverse sample, 796 4-year-old children and their families were recruited from 23 primary care pediatric clinics throughout Cook County, Illinois and 13 Chicago Public School preschool programs. At the time of the initial interview, eligible children: (a) were 4 years of age; (b) had lived with the parent who participated in the study for at least 6 months; (c) spoke English or Spanish; (d) did not meet criteria for an Autism Spectrum Disorder; and (e) obtained a standard score ≥ 70 on the Peabody Picture Vocabulary Test (Dunn and Dunn 1997), were not enrolled in a special education class for the intellectually disabled, and did not have a school IQ test score below 70.

The initial sample of 796 4-year-olds (mean age = 4.44) included 391 (49.1 %) boys and 405 (50.9 %) girls. Parent-reported racial/ethnic group membership included: 433 (54.4 %) White, non-Hispanic; 133 (16.7 %) African American; 162 (20.4 %) Hispanic; 19 (2.4 %) Asian; and 35 (4.4 %) multi-racial or “Other.” Race/ethnicity was not reported by 14 (1.8 %) parents. All social classes (Hollingshead 1975) were included, with 303 (38.1 %) children in Class I (highest), 290 (36.4 %) in Class II, 79 (9.9 %) in Class III, 63 (7.9 %) in Class IV, and 61 (7.7 %) in Class V. Other details about the age-4 sample are available (Lavigne et al. 2009).

Of the initial sample, 626 children and families (78.6 %) participated in all three waves of data collection. The families and children who completed all three waves differed from those who did not with respect to: (a) race, with a greater proportion of minority participants dropping out, χ²(5, N = 796) = 77.7, p = 0.001; (b) SES, with a greater proportion of lower SES groups dropping out, χ²(4, N = 796) = 69.61, p = 0.001; and (c) age, with those who dropped out being on average 25 days older at study entry, t(773) = 2.41, p = 0.02. Because imputation is generally preferable to listwise deletion (Graham 2009), missing data were imputed using single imputation with the SPSS V15.0 missing data program. That program uses maximum likelihood procedures utilizing all study variables (child age, sex, race/ethnicity, SES, all ODD symptom items for all 3 age groups) to estimate values. With the imputation, the final sample N was 796.

Measures

Demographics

Parents completed a demographic questionnaire to obtain information about child’s age, race/ethnicity, and parental education and occupation that was coded for socioeconomic status using the Hollingshead Four-Factor Index of Social Status (Hollingshead 1975).

Peabody Picture Vocabulary Test: 3rd Edition (PPVT-III)

Children completed a measure of receptive vocabulary, the PPVT-III (Dunn and Dunn 1997), to assess language skills needed for certain tasks used in the larger study (but relevant to the present report only in terms of exclusion criteria). The PPVT has been shown to have good to excellent concurrent validity (rs = 0.63–0.92) with tests of verbal intelligence (Dunn and Dunn 1997).

ODD dimensions

Measures of the eight DSM-IV symptoms of ODD were derived from two DSM-IV-coded instruments. The early childhood form of the Child Symptom Inventory (CSI) (Gadow and Sprafkin 1997, 2000) is a parent-completed checklist for which child symptoms are rated on a four-point rating scale ranging from 0 (never) to 3 (very often). The CSI has been used in prior studies of ODD dimensions (Burke et al. 2010) and the nosology of externalizing problems in girls (Keenan et al. 2010). Internal consistency is good (alpha = 0.70).

The Diagnostic Interview Schedule for Children-Parent Scale-Young Child (DISC-YC) version (Fisher and Lucas 2006) is a developmentally-appropriate, structured parent interview that includes items measuring the DSM-IV symptoms of ODD. High levels of agreement are obtained for concrete, observable symptoms, and test-retest reliabilities for the DISC-YC are high. DISC-YC interviewers were clinical psychology graduate students trained to criterion by DISC-YC trainers. Overall reliability of the ODD symptom scale is high; test-retest reliability is 0.88 (C. Lucas, personal communication, September, 2006).

Several alternative approaches were taken to measuring individual ODD symptoms. The CSI and DISC-YC each included an item for each of the eight ODD symptoms (for the CSI, 0 = never, 1 = sometimes, 2 = often, 3 = very often; for the DISC-YC, 0 = symptom not present, 1 = symptom present). Initially, a measurement model was tested in which each CSI and DISC-YC item served as an indicator for the relevant ODD item (e.g., the CSI and DISC-YC “temper tantrum” items were separate indicators for a latent DSM “temper tantrum” item). This approach resulted in an inadmissible solution when the CSI and DISC-YC items were freely estimated, when they were fixed to have equivalent factor loadings, and when errors were allowed to correlate. Subsequently, the one CSI item and the one DISC-YC item corresponding to each particular ODD DSM-5 symptom were each converted to a standard score, and the two comparable items (e.g., the CSI and DISC-YC items for temper tantrums) were summed to create the measure of each ODD symptom (Nunnally and Bernstein 1994). Because this approach resulted in an admissible solution, it was used in subsequent analyses. Items associated with each of the factors in the tested models are illustrated in Figs. 1 and 2.
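
As an illustration of the composite scoring described above, the following minimal Python sketch standardizes one CSI item and one DISC-YC item and sums the two standard scores for each child. The ratings, the function names, and the use of the sample standard deviation are assumptions made for illustration; the computational details are not reported in the text.

```python
from statistics import mean, stdev

def z_scores(values):
    """Convert raw scores to standard (z) scores using the sample SD (an assumption)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

def composite_symptom(csi_item, disc_item):
    """Sum the standardized CSI and DISC-YC scores for one ODD symptom, child by child."""
    if len(csi_item) != len(disc_item):
        raise ValueError("both items must be scored for the same children")
    return [c + d for c, d in zip(z_scores(csi_item), z_scores(disc_item))]

# Hypothetical ratings for five children on the "temper tantrum" items.
csi_temper = [0, 1, 3, 2, 1]    # CSI: 0 = never ... 3 = very often
disc_temper = [0, 0, 1, 1, 0]   # DISC-YC: 0 = not present, 1 = present
temper_composite = composite_symptom(csi_temper, disc_temper)
```

Because each item is standardized before summing, the four-point CSI rating and the binary DISC-YC rating contribute on a common scale to the composite.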

Procedure

Research assistants approached parents at preschools and pediatric offices and informed them about the study. Subsequently, questionnaires, including the CSI, were mailed to interested parents. At the initial home visit, the PPVT was administered, with children scoring < 70 excluded from the study. The use of this exclusion criterion was necessary for completion of other study measures not included in the present study but described elsewhere (Lavigne et al. 2012). The DISC-YC was administered at this visit as well. Graduate research assistants also observed the parent and child interacting in the home environment for approximately 2 h after which they completed a scale noting any symptoms of autistic spectrum disorder that were observed and obtained information on special education programs the child was attending in order to screen for autism and complete study measures not pertaining to this report. Parents were then re-contacted approximately 1 year and 2 years after the initial visit for follow-up visits in which the CSI and DISC-YC were re-administered. Written consent to participate was obtained each year. This study was approved by the appropriate Institutional Review Boards.

Data Analysis

Comparing alternative models

To assess the appropriateness of each of the six models of the dimensions of ODD, we conducted separate CFAs within each of the three age groups (ages 4, 5, and 6) using the data for each gender separately, as well as the pooled data of boys and girls, using LISREL 8.8 (Joreskog and Sorbom 2006) to analyze covariance matrices via maximum-likelihood (ML) estimation. To assess goodness-of-fit, we employed: (a) two indices of absolute fit, the standardized root mean square residual (SRMR) and the root mean square error of approximation (RMSEA), the latter of which adjusts for model complexity; (b) two relative fit indices (non-normed fit index, NNFI; comparative fit index, CFI); and (c) an index of relative information loss that corrects for sample size and model complexity in comparing measurement models (the Akaike Information Criterion, AIC). There is no universal agreement on verbal descriptors for different fit indices. Marsh et al. (2004), for example, suggest that an RMSEA of less than 0.05 is a “close” fit (p. 321) and up to 0.08 is “reasonable” (p. 321), while others apply different descriptors and standards (a fuller discussion of the standards and verbal descriptors of fit indices is available on-line). Because of these differences, Table 2 provides information on the number of fit indices for which each model met the criteria under both RMSEA < 0.05 and RMSEA ≤ 0.08. Fit based on both of these standards is described as reasonable. For other fit indices, the criteria for a reasonable fit were: NNFI ≥ 0.95; CFI ≥ 0.95 (Browne and Cudeck 1993; Hu and Bentler 1999); and SRMR < 0.08 (Brown 2006). Because the models were non-nested, we used AIC to compare goodness-of-fit of competing models, with smaller values representing better fit. As recommended by Brown (2006), we reported the Satorra-Bentler scaled chi-square (SB χ²) (Satorra and Bentler 1994) but did not use it to assess overall fit because that measure is inflated by large sample sizes (Bollen 1989). To obtain scaled ML chi-square values, we followed Bryant and Satorra’s (2012) guidelines.
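
The decision rules above can be sketched in a few lines of Python. The sketch below counts how many of the four fit indices (RMSEA, SRMR, NNFI, CFI) meet the stated criteria under either RMSEA standard and picks the model with the smallest AIC. The function names and all numeric values are hypothetical and are not taken from Table 2.

```python
def indices_meeting_criteria(fit, close_rmsea=False):
    """Return the names of the fit indices that meet the 'reasonable fit' criteria.

    close_rmsea=False applies RMSEA <= 0.08; close_rmsea=True applies RMSEA < 0.05.
    """
    rmsea_ok = fit["RMSEA"] < 0.05 if close_rmsea else fit["RMSEA"] <= 0.08
    checks = {
        "RMSEA": rmsea_ok,
        "SRMR": fit["SRMR"] < 0.08,
        "NNFI": fit["NNFI"] >= 0.95,
        "CFI": fit["CFI"] >= 0.95,
    }
    return [name for name, ok in checks.items() if ok]

def best_by_aic(models):
    """Return the name of the model with the smallest AIC (smaller = better fit)."""
    return min(models, key=lambda name: models[name]["AIC"])

# Hypothetical fit statistics for two competing non-nested models.
models = {
    "model_A": {"RMSEA": 0.06, "SRMR": 0.03, "NNFI": 0.96, "CFI": 0.98, "AIC": 70.2},
    "model_B": {"RMSEA": 0.11, "SRMR": 0.06, "NNFI": 0.90, "CFI": 0.93, "AIC": 150.7},
}
met_lenient = indices_meeting_criteria(models["model_A"])
met_strict = indices_meeting_criteria(models["model_A"], close_rmsea=True)
preferred = best_by_aic(models)
```

In this made-up example, model_A meets all four criteria under RMSEA ≤ 0.08 but only three under RMSEA < 0.05, which is the kind of distinction the two RMSEA standards are meant to expose.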

Measurement invariance

We adopted Vandenberg and Lance’s (2000) recommended sequence for conducting tests of measurement invariance. To identify which of the six alternative models showed the best fit across all groups, three types of measurement invariance were examined. Configural invariance is present if the same number of factors and patterns of factor loadings are appropriate for each group (Meredith 1993). Testing configural invariance involves examining the model’s goodness-of-fit across groups or time rather than formal statistical null-hypothesis testing. Metric or “weak” invariance (Meredith 1993) exists if a one-unit change in the underlying factor is associated with a comparable change in measurement units for the same given item in each group. Scalar invariance (Meredith 1993) exists if the measurement origins for the items (i.e., item intercepts) are the same across groups in predicting item scores from the latent factors. If, for example, a boy and a girl with the same underlying level of ODD do not obtain the same score on a given observed item, then the item shows “uniform” differential functioning (Teresi 2006) and is biased to produce higher scores for one of the genders even at the same level of the latent trait. Tests of metric and scalar invariance involve assessing the statistical significance of differences in goodness-of-fit chi-square values across nested cross-sectional or longitudinal models. Strong invariance (Meredith 1993) is present if a model shows configural, metric, and scalar invariance. We chose not to assess invariance in item unique error variances and in factor variances-covariances because such analyses: (a) were not essential for meaningful between- and within-group comparisons of levels of ODD (Bontempo and Hofer 2007; Saban et al. 2010); and (b) would have unnecessarily increased the number of statistical tests we conducted, thereby requiring an even stricter Bonferroni adjustment that would predispose the results of null-hypothesis inferential testing toward invariance.

To evaluate the statistical significance of differences in goodness-of-fit between nested CFA models in invariance testing, we used a modified version of the SB χ² (Satorra and Bentler 2001) that yields a more accurate scaled difference test for LISREL (Bryant and Satorra 2012). If the SB χ² difference is statistically significant, the parameters in question are not invariant; if nonsignificant, the parameters are invariant. To assess effect size in testing measurement invariance, two indices were used: (a) the difference in CFI values (ΔCFI) between nested models, with ΔCFI ≤ 0.01 considered evidence of measurement invariance (Cheung and Rensvold 2002); and (b) the effect size for each probability-based test of invariance expressed in terms of w², the ratio of chi-square to N (Cohen 1988), which is analogous to R² (i.e., the proportion of explained variance) in multiple regression. A w² ≤ 0.01 is small; w² = 0.09, medium; w² ≥ 0.25, large (Cohen 1988).
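
The two effect-size indices can be computed directly from quantities the text defines. The short Python sketch below is illustrative only: the function names are hypothetical, the CFI values are made up, and the chi-square and N passed to `w_squared` are example inputs.

```python
def w_squared(chi_square, n):
    """Effect size w^2: the chi-square statistic divided by the sample size N."""
    return chi_square / n

def delta_cfi_supports_invariance(cfi_less_constrained, cfi_more_constrained, cutoff=0.01):
    """True if the drop in CFI after adding equality constraints is at most the cutoff."""
    return (cfi_less_constrained - cfi_more_constrained) <= cutoff

# Example inputs: a scaled difference chi-square of 1.91 with N = 796.
effect = w_squared(1.91, 796)

# Hypothetical CFI values for a pair of nested models.
small_drop = delta_cfi_supports_invariance(0.9800, 0.9794)
large_drop = delta_cfi_supports_invariance(0.98, 0.95)
```

With these inputs `effect` is roughly 0.0024, below the 0.01 benchmark for a small effect.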

Because perfectly invariant factors can obscure noninvariant factors and make multivariate global tests of invariance misleading (Bontempo and Hofer 2007), we tested the cross-sectional and longitudinal equivalence of factor loadings and item intercepts separately for each ODD factor. To further reduce the likelihood of capitalizing on chance, we corrected the Type I error rate for probability-based tests of invariance (Cribbie 2007) by imposing a sequentially-rejective Bonferroni adjustment to the generalized p value for each statistical test. Specifically, we used a Sidak step-down adjustment procedure (Holland and Copenhaver 1987; Sidak 1967) to ensure an experiment-wise Type I error rate of p < 0.05, correcting for the total number of statistical comparisons made (i.e., 20 = 6 tests of gender metric invariance, 6 tests of gender scalar invariance, 4 tests of age metric invariance, 4 tests of age scalar invariance).
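
For readers unfamiliar with step-down adjustments, the following Python sketch shows one common formulation of a Sidak step-down (sequentially rejective) adjustment: the smallest p value is adjusted for all m tests, the next smallest for m − 1, and so on, with adjusted values forced to be non-decreasing. It is a generic illustration under that assumption, with hypothetical p values; the exact computation used for the reported analyses may differ in detail.

```python
def sidak_step_down(p_values):
    """Return step-down Sidak adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p to largest
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        remaining = m - rank                      # tests not yet stepped past
        adj = 1.0 - (1.0 - p_values[idx]) ** remaining
        running_max = max(running_max, adj)       # keep adjusted values monotone
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Hypothetical unadjusted p values for three tests.
adjusted = sidak_step_down([0.01, 0.04, 0.03])
```

A test is declared significant at the experiment-wise 0.05 level when its adjusted p value falls below 0.05; in this example only the first test does.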

Results

Preliminary Results

Table 1 includes means and standard deviations for each ODD symptom item for each age and gender group. The minimum number of distinct scores for each item was 8, which exceeds the 5 category levels that can be used for ordinal data (Newsom, n.d.).

Table 1 Means and standard deviations for items for each year and gender: standard scores

Model Comparisons

Goodness-of-fit and configural invariance for alternative ODD models across age and gender

Figure 1 illustrates conceptual diagrams for the DSM-IV one-factor model (Fig. 1a), the Pitt-2 two-factor model (Fig. 1b), the GSMS two-factor model (Fig. 1c), and the Pitt-3 three-factor model (Fig. 1d) of ODD symptoms. Figure 2 illustrates the conceptual diagrams for the UK/DSM-5 three-factor model (Fig. 2a) and the EUR three-factor model (Fig. 2b). To retain the three-factor structure of the UK/DSM-5 model while retaining a single spiteful/vindictive item as specified in the DSM, a single manifest indicator was included for the hurtful factor, with the error variance fixed at zero. All other models were exactly as specified by their developers. Table 2 presents goodness-of-fit statistics for each of these measurement models.

Table 2 Goodness-of-fit statistics for alternative CFA models of ODD (combined items)

DSM-IV one-factor model

For both boys and girls separately at each age, and when both genders were pooled together, the DSM-IV model did not show reasonable fit on any of the fit indices.

Pitt two-factor model (Burke and Loeber 2010; Burke et al. 2005)

When the RMSEA criterion was ≤ 0.08, the Pitt-2 model met the criteria for reasonable fit on all four fit indices for seven of the nine age x sex groups and on three of four fit indices for the two remaining groups. When the RMSEA criterion was < 0.05, the Pitt-2 model met criteria on all four fit indices for one group and on three of four fit indices for the remaining eight groups. In no group did the Pitt-2 model meet criteria on ≤ 2 fit indices.

Pitt three-factor model (Burke et al. 2010)

When the RMSEA criterion was ≤ 0.08, the Pitt-3 model met criteria on all four fit indices for three groups and on three of the four fit indices for one other group. For five groups, the Pitt-3 model met criteria on ≤ 2 fit indices. When the RMSEA criterion was < 0.05, the Pitt-3 model did not meet criteria on all four fit indices for any of the groups, but did meet criteria on three of four fit indices for one group. For the remaining eight groups, the Pitt-3 model met criteria on ≤ 2 of the fit indices.

UK/DSM-5 three-factor model (Stringaris and Goodman 2009a, b)

The UK/DSM-5 model did not meet criteria on any of the four fit indices for any of the nine age x sex groups.

GSMS two-factor model (Rowe et al. 2010)

When the RMSEA criterion was ≤ 0.08, the GSMS model did not meet criteria on all four fit indices for any of the groups. The GSMS model met criteria on three of the four fit indices for one group, and on ≤ 2 fit indices for the remaining eight groups. The results were the same when the RMSEA criterion was < 0.05.

EUR three-factor model (Aebi et al. 2010)

When the RMSEA criterion was ≤ 0.08, the EUR three-factor model met criteria on all four fit indices for one group, but met criteria on ≤ 2 fit indices for the remaining eight age x sex groups. When the RMSEA criterion was < 0.05, the EUR model did not meet criteria on all four fit indices for any age x sex group, met criteria on three of four fit indices for one group, and met criteria on ≤ 2 fit indices for the remaining eight groups.

Model comparisons

Of the six models, only the Pitt-2 model met criteria on at least three of four fit indices for all nine age x sex groups, both when the RMSEA criterion was ≤ 0.08 and when it was < 0.05. In no group did the Pitt-2 model meet criteria on ≤ 2 fit indices. In comparison, the Pitt-3 model was closest to the Pitt-2 model in the number of fit indices meeting the “reasonable” criteria, but that model met criteria on ≤ 2 fit indices for seven of nine groups when the RMSEA criterion was ≤ 0.08 and for eight of nine groups when the RMSEA criterion was < 0.05.

When AICs are used to compare models, lower values indicate better fit. For each age x sex group, the AIC for the Pitt-2 model was lower than that of all alternative models. Thus, the Pitt-2 model is preferred in comparison to each of the other models.

Correlations between Pitt-2 dimensions

If the ODDB and ODDNA factors of the Pitt-2 model are very highly correlated, that would suggest the two factors are conceptually redundant, so the strength of the correlation between ODDB and ODDNA at each age is of conceptual interest. Squaring the within-age factor correlations (factor correlations: age 4 boys, 0.82; age 5 boys, 0.56; age 6 boys, 0.67; age 4 girls, 0.67; age 5 girls, 0.59; age 6 girls, 0.69; age 4 combined sexes, 0.75; age 5 combined sexes, 0.59; age 6 combined sexes, 0.64) reveals that the two ODD factors share the following percentages of their variance at each age: for boys: age 4 (67.2 %), age 5 (31.4 %), age 6 (44.9 %); for girls: age 4 (44.9 %), age 5 (34.8 %), age 6 (47.6 %); for both sexes combined: age 4 (56.2 %), age 5 (34.7 %), and age 6 (40.1 %). These results indicate that the two ODD dimensions are not so highly related as to be conceptually redundant, supporting the discriminant validity of the factors in the Pitt-2 model (see table in supplemental material, available on-line, for the correlations among Pitt-2 factors).
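
The shared-variance figures follow from squaring each factor correlation. A minimal Python sketch of that arithmetic, using the boys' and girls' correlations quoted above, is given below; the function and variable names are hypothetical. Percentages recomputed from correlations rounded to two decimals can differ in the last digit from values computed from unrounded estimates.

```python
def shared_variance_percent(r):
    """Percentage of variance shared by two factors correlated at r (r squared x 100)."""
    return round(r * r * 100, 1)

# Factor correlations between ODDB and ODDNA as quoted in the text, keyed by (group, age).
factor_correlations = {
    ("boys", 4): 0.82, ("boys", 5): 0.56, ("boys", 6): 0.67,
    ("girls", 4): 0.67, ("girls", 5): 0.59, ("girls", 6): 0.69,
}
shared = {key: shared_variance_percent(r) for key, r in factor_correlations.items()}
```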

Areas of local ill fit for the Pitt-2 model

Goodness-of-fit statistics provide a global index of model fit. While the global fit indices may be acceptable, it is possible that there are specific areas of ill fit or strain within each model (Brown 2006). In this study, there were 9 individual models of the best-fitting Pitt-2 model to consider. In examining the 153 standardized residuals (SRs) in the nine models, we considered residuals greater than the absolute value of 2.58, which corresponds to a p value of 0.01, to be significant because of the large sample size (Brown 2006); Bonferroni corrections were not applied in this supplemental analysis. Across the nine Pitt-2 models, there were a total of 27 areas of ill fit based on the standardized residuals. While there were areas of local ill fit in each model, no combination of factors showed significant residuals in all 9 groups. The most common problems involved the covariance of get even with defies (8 of 9 models), angry/argues (6 of 9 models), and defies/temper (5 of 9 models). A more extensive discussion of specific areas of ill fit is included on-line.

Examining modification indices may provide clues about ways in which the measurement models could be improved. Modification indices (MIs) can provide suggestions about specific estimated parameters that might be added to a model to improve model fit. MIs > 3.84 could possibly improve a model at a statistically significant level (p < 0.05). Modification indices, however, are sensitive to sample size—it is possible that estimating the parameter associated with a significant MI could result in a factor loading that is very small and of little value. For these reasons, examining MIs to gain insight into areas of poor model fit should also include examination of completely standardized expected parameter change (EPC) scores (See on-line supplementary tables for these values).
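
The two cutoffs used in this section (|SR| > 2.58 and MI > 3.84) correspond to conventional significance levels, which can be checked with the Python standard library. The sketch below assumes that a standardized residual is referred to a standard normal distribution and that a modification index is referred to a chi-square distribution with 1 degree of freedom; the function names are hypothetical.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p value for a standard normal deviate z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def p_from_chi_square_1df(x):
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

p_residual_cutoff = two_sided_p_from_z(2.58)     # close to 0.01
p_mi_cutoff = p_from_chi_square_1df(3.84)        # close to 0.05
```

Both helpers rely on the identity that a 1-df chi-square variate is the square of a standard normal deviate, so no statistics package is needed for these two checks.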

Overall, there were 6 significant MIs for the 3 models for boys, 11 significant MIs for girls, and 10 significant MIs for the combined sex groups when considering item cross-loadings. These results suggest that adding the factor loadings associated with these MIs would improve model fit overall for all or most of those models. However, low or moderate factor loadings would be eliminated because they were far below the desired standard for factor loadings of 0.70. As a result, only one large factor loading (age 4 boys, ODDNA → temper) might be retained, but doing so would have the disadvantage of eliminating configural invariance for boys with the Pitt-2 model. For these reasons, adding that cross-loading would not be desirable.

Gender Invariance of the Pitt-2 Model

Testing gender invariance

After establishing configural invariance for the Pitt-2 model (i.e., the same pattern of factor loadings for gender x age group), we examined measurement invariance for that model. For all multiple-group CFA models, we defined the variance units of the latent variable by fixing an unstandardized factor loading of one item at 1.0 for each factor. Because using a referent item that functions differently across groups can either mask or exacerbate nonequivalence in other items (Johnson et al. 2009), we selected referent items for which loadings were as comparable as possible across the single-group solutions. These referent items were “temper tantrums” for ODDB and “touchy” for ODDNA.

Metric invariance: gender within age groups

Table 3 presents the results of tests of metric invariance for the Pitt-2 ODD model with respect to gender within age groups. The loadings of both ODD factors in the Pitt-2 model were invariant with respect to gender within each age group: (1) Age 4: gender invariant loadings for ODDB, SB χ²(2) = 1.91, Bonferroni p = 0.9996, ΔCFI = 0.0006, ω² = 0.002; gender invariant loadings for ODDNA, SB χ²(2) = 1.47, Bonferroni p = 0.9999, ΔCFI = 0.0003, ω² = 0.002; (2) Age 5: gender invariant loadings for ODDB, SB χ²(2) = 4.42, Bonferroni p = 0.9655, ΔCFI = 0.0018, ω² = 0.006; gender invariant loadings for ODDNA, SB χ²(2) = 2.72, Bonferroni p = 0.9983, ΔCFI = 0.0007, ω² = 0.003; and (3) Age 6: gender invariant loadings for ODDB, SB χ²(2) = 0.319, Bonferroni p = 0.9999, ΔCFI = 0.0003, ω² = 0.001; gender invariant loadings for ODDNA, SB χ²(2) = 2.61, Bonferroni p = 0.9983, ΔCFI = 0.0003, ω² = 0.003. Therefore, we concluded that ODDB and ODDNA have the same meaning for 4-, 5-, and 6-year-old boys and girls.

Table 3 Testing metric invariance for the Pitt-2 ODD model with respect to gender within age groups

Scalar invariance

Table 4 presents the results of tests of scalar invariance for the Pitt-2 ODD model with respect to gender within age groups. The item intercepts of both ODD factors in the Pitt-2 model (behavior and negative affect) were invariant with respect to gender within each of the three age groups, as follows: (a) Age 4: gender invariant intercepts for ODDB, SB χ²(2) = 2.765, Bonferroni p = 0.99, ΔCFI = 0.0005, ω² = 0.004; gender invariant intercepts for ODDNA, SB χ²(2) = 0.87, Bonferroni p = 0.9999, ΔCFI = 0.0003, ω² = 0.001; (b) Age 5: gender invariant intercepts for ODDB, SB χ²(2) = 3.94, Bonferroni p = 0.9983, ΔCFI = 0.0011, ω² = 0.005; gender invariant intercepts for ODDNA factor, SB χ²(2) = 0.35, Bonferroni p = 0.9999, ΔCFI = 0.0021, ω² = 0.0004; and (c) Age 6: gender invariant intercepts for ODDB, SB χ²(2) = 7.74, Bonferroni p = 0.5315, ΔCFI = 0.0076, ω² = 0.0097; gender invariant intercepts for ODDNA, SB χ²(2) = 1.78, Bonferroni p = 0.9996, ΔCFI = 0.0001, ω² = 0.002. Thus, we concluded that the behavior and negative affect items function equivalently in assessing ODD for 4-, 5-, and 6-year-old boys and girls. Considered together, these findings indicate that the Pitt-2 model shows strong gender invariance (Meredith 1993) within all three age groups.

Table 4 Testing scalar invariance for the Pitt-2 ODD model with respect to gender within age groups

Age Invariance of the Pitt-2 Model

Metric invariance: age within gender

Having established configural, metric, and scalar invariance for the Pitt-2 model across gender at ages 4, 5, and 6, we next examined the measurement invariance of the Pitt-2 model with respect to age longitudinally within each gender. To test age invariance in ODD for boys and girls, we estimated longitudinal CFA models in which we specified the two Pitt-2 factors at ages 4, 5, and 6 as six correlated latent variables separately for each gender. We defined the variance units of the latent variables at each wave by fixing at 1.0 the factor loadings of the referent items for ODDB and ODDNA. Following common practice in longitudinal measurement modeling (Brown 2006), all three-wave CFA models included autocorrelated measurement errors, reflecting temporally stable indicator-specific variance, i.e., method effects (Brown 2006), through which the unique variance in each of the six ODD items at each wave was allowed to correlate with the unique variance in the same item at the other two waves. We also allowed all ODD factors to correlate with one another both within and across waves. This six-factor model provided an acceptable fit to the longitudinal ODD data of both boys, SB ML χ²(102, N = 391) = 175.41, RMSEA = 0.041, CFI = 0.99, NNFI = 0.99, AIC = 305.685, and girls, SB ML χ²(102, N = 405) = 149.13, RMSEA = 0.031, CFI = 0.99, NNFI = 0.99, AIC = 280.689.
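
The degrees of freedom of this six-factor model can be counted by hand from the description above. The sketch below is a hedged reconstruction, not the authors' code: it assumes that only the covariance structure is modeled (no item means), that one referent loading per factor is fixed at 1.0, and that the free parameters are exactly those named in the paragraph. Under those assumptions the count comes to 102, which agrees with the df reported for both boys and girls.

```python
def longitudinal_cfa_df(items_per_wave: int = 6,
                        factors_per_wave: int = 2,
                        waves: int = 3) -> int:
    """Degrees of freedom for a longitudinal CFA with autocorrelated errors.

    Assumes a covariance-structure-only model with one fixed referent
    loading per factor and all factors free to correlate.
    """
    indicators = items_per_wave * waves              # observed items across waves
    factors = factors_per_wave * waves               # latent variables across waves

    # Unique variances and covariances among the observed indicators.
    moments = indicators * (indicators + 1) // 2

    free_loadings = indicators - factors             # one referent fixed per factor
    error_variances = indicators
    factor_variances = factors
    factor_covariances = factors * (factors - 1) // 2
    # Each item's error correlates with the same item's error at the other waves.
    autocorrelated_errors = items_per_wave * (waves * (waves - 1) // 2)

    free_parameters = (free_loadings + error_variances + factor_variances
                       + factor_covariances + autocorrelated_errors)
    return moments - free_parameters


print(longitudinal_cfa_df())
```

Counting the parameters this way is a quick consistency check that a reported model matches its verbal description.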

Table 5 presents the results of tests of metric invariance for the Pitt-2 ODD model with respect to age within both genders. The loadings of the two ODD factors in the Pitt-2 model were invariant with respect to age for both boys and girls: (a) Boys: age invariant loadings for ODDB, SB χ²(4) = 2.90, Bonferroni p = 0.9999, ΔCFI = 0.0001, ω² = 0.007; age invariant loadings for ODDNA, SB χ²(4) = 1.73, Bonferroni p = 0.9999, ΔCFI = 0.0001, ω² = 0.004; (b) Girls: age invariant loadings for ODDB, SB χ²(4) = 3.04, Bonferroni p = 0.9999, ΔCFI = 0.0001, ω² = 0.007; age invariant loadings for ODDNA, SB χ²(4) = 7.54, Bonferroni p = 0.9655, ΔCFI = 0.0005, ω² = 0.019. Thus, we concluded that oppositional behavior and negative affect have the same meaning across ages 4, 5, and 6 for both boys and girls.
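
The degrees of freedom of these factor-by-factor metric invariance tests follow from how many loadings are constrained. The sketch below assumes that each Pitt-2 factor has three indicators (six items split evenly across two factors, which is an assumption here) and that one referent loading per factor is already fixed at 1.0. Under those assumptions, equating a factor's loadings across two gender groups costs 2 df and across three ages costs 4 df, consistent with the chi-square tests with 2 and 4 df reported for the metric invariance comparisons. The function name is illustrative.

```python
def invariance_test_df(items_per_factor: int = 3, groups: int = 2) -> int:
    """df gained by constraining one factor's free loadings equal across groups or waves.

    One item per factor is the referent (loading fixed at 1.0), so only
    items_per_factor - 1 loadings are free to differ between groups.
    """
    if items_per_factor < 2 or groups < 2:
        raise ValueError("need at least two items per factor and two groups")
    return (items_per_factor - 1) * (groups - 1)


print(invariance_test_df(groups=2))  # boys vs. girls within one age group
print(invariance_test_df(groups=3))  # ages 4, 5, and 6 within one gender
```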

Table 5 Testing metric invariance for the Pitt-2 ODD model with respect to age within boys and girls

Scalar invariance

Table 6 presents the results of tests of scalar invariance for the Pitt-2 ODD model with respect to age within both genders. The item intercepts of both ODD factors in the Pitt-2 model were invariant with respect to age for both boys and girls: (a) Boys: age invariant intercepts for ODDB, SB χ²(4) = 2.81, Bonferroni p = 0.9999, ΔCFI = 0.0002, ω² = 0.007; age invariant intercepts for ODDNA, SB χ²(4) = 1.92, Bonferroni p = 0.9999, ΔCFI = 0.0003, ω² = 0.005; and (b) Girls: age invariant intercepts for ODDB, SB χ²(4) = 3.27, Bonferroni p = 0.9983, ΔCFI = 0.0001, ω² = 0.008; age invariant intercepts for ODDNA, SB χ²(12) = 1.56, Bonferroni p = 0.9983, ΔCFI = 0.0003, ω² = 0.004. Thus, we concluded that the behavior and negative affect items function equivalently in assessing ODD for both boys and girls at ages 4, 5, and 6 years. Considered together, these findings indicate that the Pitt-2 model shows strong age invariance (Meredith 1993), i.e., configural, metric, and scalar invariance across ages 4, 5, and 6, for both boys and girls.

Table 6 Testing scalar invariance for the Pitt-2 ODD model with respect to age within boys and girls

Table 7 presents the gender- and age-invariant CFA factor loadings, squared multiple correlations, and Cronbach’s alphas for the Pitt-2 CFA model. Factor loadings were gender and age invariant, and squared multiple correlations were highly comparable across gender for both ODDB (boys: median = 0.53; girls: median = 0.51) and ODDNA (boys: median = 0.54; girls: median = 0.56). The ODDB subscale had reasonable internal consistency reliability at each age for both boys (age 4: α = 0.76; age 5: α = 0.74; age 6: α = 0.77) and girls (age 4: α = 0.73; age 5: α = 0.76; age 6: α = 0.78). The ODDNA subscale had reasonable internal consistency reliability at age 4 (boys: α = 0.77; girls: α = 0.73) and age 6 (boys: α = 0.74; girls: α = 0.71), but Cronbach’s alphas were lower for this subscale at age 5 for both boys (α = 0.68) and girls (α = 0.66). While a cutoff of 0.70 is often used for assessing the adequacy of alpha, lower values may be acceptable when the measure has other desirable measurement properties (Schmitt 1996), as in the present model.
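
For readers less familiar with the reliability index reported here, Cronbach's alpha for a k-item subscale is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below implements that standard formula with the Python standard library. The function name and the item scores are hypothetical (three items rated 0 to 3 by four respondents); they are not data from the study.

```python
from statistics import variance


def cronbach_alpha(item_scores: list) -> float:
    """Cronbach's alpha from a list of per-item score lists (one value per respondent)."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    if len({len(item) for item in item_scores}) != 1:
        raise ValueError("every item needs a score from every respondent")

    # Total score per respondent, summing across items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    total_variance = variance(totals)
    return (k / (k - 1)) * (1.0 - item_variance_sum / total_variance)


# Hypothetical ratings for a three-item subscale, for illustration only.
ratings = [
    [0, 1, 2, 3],  # item 1
    [1, 0, 3, 2],  # item 2
    [0, 2, 1, 3],  # item 3
]
print(round(cronbach_alpha(ratings), 2))
```

With only three items per subscale, alpha is constrained by scale length as well as by inter-item correlation, which is one reason values slightly below 0.70 need not indicate a poor measure.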

Table 7 Gender and age invariant CFA factor loadings, squared multiple correlations, and Cronbach’s alphas for the Pitt-2 ODD model for boys (N = 391) and girls (N = 405) at ages 4, 5, and 6–7

Discussion

Results of analyses comparing model fit of the 6 different models proposed to date showed that the two-factor ODD model (Pitt-2) identified by Burke et al. (2005) best fit the data, for both genders separately and when genders were combined, for all three age groups (4, 5, and 6). In addition, the results indicated: (a) there is configural invariance (Brown 2006) for both boys and girls across ages for the Pitt-2 model because the two-factor structure showed the best fit to the data for each age x gender group; (b) there is metric invariance with respect to age and gender, i.e., the factor loading of each measured indicator on its underlying ODD dimension was equivalent across age and gender groups; and (c) there is scalar invariance, with the ODD items producing equivalent scores for children with the same underlying level of ODD, regardless of gender or age. Thus, studies of homotypic and heterotypic continuity of the Pitt-2 ODD factors with other disorders in this age range can be conducted with clear evidence that the dimensions of ODD do not show developmental differences in structural form, factor loadings, or value of scale items. The results support Burke et al.’s conclusion that ODD is best characterized as being composed of separate processes of behavioral and affective dysregulation, rather than being a single distinct disorder.

These results have important implications for the structure of ODD in the recently released DSM-5 (American Psychiatric Association 2013). First, because neither DSM-IV nor DSM-5 proposed gender or developmental differences in the structure of the symptoms of ODD, there is an implicit assumption that the ODD dimensions are invariant for gender and age. With the exception of the Pitt-2 model, age and gender invariance were not present for any of the other models, including the UK/DSM-5 model that forms the basis of the DSM-5 dimensions. These results support the use of the Pitt-2 model, rather than the three-factor model adopted in the DSM-5, as a tool for understanding and diagnosing ODD in children ages 4–7.

The results of the present study differ somewhat from those of prior comparisons, chiefly because neither the Krieger et al. (2013) nor the Ezpeleta et al. (2012) study included the Pitt-2 model, which, in the present study, showed the best fit overall in each of the separate and combined gender groups at all three ages. Furthermore, while multiple studies identified three-factor structures such as the one adopted by DSM-5, none of these studies were replications of one another because the factor loadings of items differed across models.

One characteristic of the Pitt-2 model is that the ODD symptoms of “annoys” and “blames others” did not load on either factor in the EFA conducted by Burke and Loeber (2010). The factor loadings of these two items differ across the other multidimensional models of ODD. Both items load on the same factor in the GSMS model (headstrong/spiteful), the Pitt three-factor model (antagonistic), and the UK/DSM-5 model (headstrong), while they load on different factors in the EUR model. Given these inconsistencies, these items could be described as “other” ODD symptoms and possibly eliminated as critical to diagnosis in the future. The decision to either retain or eliminate those items would depend on future research on the structural invariance of ODD with older children and the ability of the different ODD dimensions to predict heterotypic comorbidity of those dimensions with other disorders. In a separate report (2014), the Pitt-2 factors without the items “annoys” and “blames others” were found to be associated with subsequent depression.

Further research will also be needed to address possible limitations of the Pitt-2 model. While the model fits better than the alternatives, there is room to improve overall global fit as well as specific areas of local strain in the young child age group, and it remains to be seen whether the model fit and specific areas of strain are problems in older children. If the Pitt-2 model continues to show the best, but imperfect, fit across age and gender groups, further improvement in measuring ODD and the Pitt-2 model may require the development of measures that retain the same core indicators of ODD, but include multiple measures of each item to improve model fit, or allow for more fine-grained responses than the four-point scales often used on behavior problem checklists. Such changes, however, increase the number of parameters in the model and may also require cross-loadings between factors.

The few existing studies of ODD dimensions have been conducted with a diverse set of participant samples, including both clinical and community samples. Because high levels of comorbidity that could affect the internal structure of ODD symptoms are likely to be present in clinical samples (Caron and Rutter 1991), examining the structure of ODD in community samples is particularly important.

Outside of the U.S., community samples have been utilized in the United Kingdom and in Barcelona, Spain. The UK sample was highly representative of the population, but as noted earlier, no CFA of the three-factor model was conducted with that sample. In the Barcelona sample, 89.5 % of participants were white, and 78.7 % were high or middle SES. No information was provided on how representative this sample was of Barcelona or Spain. Presently, there are four studies that have examined dimensions of ODD in the U.S. One of these studies (Burke et al. 2005) was conducted with a referred sample of boys only. Another was of a community sample of low-income girls from Pittsburgh that was 45 % African American and 50 % Caucasian. Clearly, these samples were not representative of their geographical areas even based on gender or race/ethnicity, and they included few Hispanics. The GSMS sample (Rowe et al. 2010) included 25 % Native Americans in the initial wave. The sample in the Rowe et al. (2010) report was 8 % African American and < 1 % Hispanic. Neither the number of Native Americans nor SES information for the final sample was reported. Thus, none of the existing studies are truly representative of the U.S. population. Lacking national registries, it is likely that a series of studies of different community samples in the U.S. will be needed to understand the most representative version of the structure of ODD. It is also important to compare models within a variety of different samples, as was done in the present study, to make direct comparisons among the competing models.

The current study has several limitations. First, findings are limited by the use of parent report of symptoms. Although parent report is the most common way in which symptom reports are obtained in young children, studies comparing teacher and parent reports of ODD symptoms suggest that they are source-specific (Drabick et al. 2011; Lavigne et al. 2014). Thus, it will be important to determine whether the two-factor structure of oppositional behavior and negative affect is invariant when measures from other sources (e.g., teachers, observers) are used. In addition, the findings of this study are clearly limited to the developmental period between preschool and formal school entry, and may differ for older children and adolescents. It is possible, as well, that these relationships are different in a clinical rather than in a community sample.

Nevertheless, this study has important research and clinical implications for understanding and treating ODD in children. By clearly establishing the best model for understanding ODD in young children, it provides a framework for moving forward with research on the relationships between early occurring ODD, the most common early childhood disorder, and later externalizing and internalizing disorders in children. Clinically, this provides significant information about how to treat early childhood ODD. Presently, for example, parent management training is the most effective treatment for ODD in preschoolers (Webster-Stratton et al. 2004), but we also know that approximately 30 % of children do not benefit from this treatment. The different dimensions of ODD may be important moderators of the effectiveness of parent management training for ODD. In addition, this study has implications for the ODD diagnostic criteria adopted for use in the DSM-5. The clinical results of this study suggest that the structure of ODD adopted for use in DSM-5 does not show invariance over gender and age in preschool and early school-age children, while an alternative two-factor model does. Since it is largely the DSM-5 that will drive future conceptualizations of ODD for both research and clinical purposes, recognizing that these dimensions might not best represent ODD symptoms in children is critical to future work on ODD. This disorder is highly prevalent in young children and has implications for the development of psychopathology over time. Understanding the dimensional aspects of ODD, especially in the context of homotypic and heterotypic continuity over time, is critical to developing the best possible early interventions for this disorder.