Externalizing behaviors are among the most common mental health problems in elementary school children (Lahey et al. 1999). Externalizing behavior problems are typically divided into two classes: (1) inattentive, impulsive, overactive (IO) behaviors that are the primary characteristics of Attention Deficit/Hyperactivity Disorder, and (2) oppositional, defiant, rule breaking (OD) behaviors that are the primary characteristics of oppositional defiant disorder and conduct disorder (American Psychiatric Association 2000). Numerous studies demonstrate that IO and OD behaviors are distinct but highly associated constructs, whether they are measured continuously or categorically (Hinshaw 1987; Waschbusch 2002), and that both sets of behaviors are significantly associated with academic difficulties, impairment in peer relationships, and conflicts with parents and teachers (Quay and Hogan 1999).

Empirically supported assessment is vital to understanding and treating externalizing problems in children, and standardized behavior rating scales are an essential component of empirically supported assessment (McMahon and Frick 2005; Pelham et al. 2005a, b). One commonly used measure is the Inattention/Overactivity with Aggression (IOWA) Rating Scale. The IOWA was first developed by Loney and Milich (1982) by selecting items (see Table 1) from the abbreviated Conners Teacher Rating Scale (Conners 1969) that best differentiated IO behaviors from OD behaviors in children. The result of this work was a ten item measure that included a five item inattention-impulsivity-overactivity (IO) scale and a five item oppositional-defiance (OD) scale. The reliability and validity of the IOWA scales were supported in the initial study (Loney and Milich 1982), and subsequent work has continued to find high reliability and validity of the IOWA when completed by teachers (Atkins et al. 1988, 1989; Johnston and Pelham 1986; Milich and Fitzgerald 1985; Milich and Landau 1988; Nolan and Gadow 1994). In addition, normative data have been published for teacher ratings on the IOWA (Pelham et al. 1989). In the years since this work, the IOWA has become one of the most commonly used measures of IO and OD in children. Indeed, the publication outlining normative data for teacher ratings on the IOWA (Pelham et al. 1989) has been cited in more than 100 studies, including recent treatment studies (e.g., Remschmidt et al. 2005), laboratory studies (e.g., Oosterlan et al. 2005), family studies (e.g., Brent et al. 2004), and assessment studies (e.g., Gadow et al. 2004; Halperin et al. 2003).

Table 1 Polychoric correlations for mother and teacher-reported IOWA Conners items

The continued popularity of the IOWA may seem surprising given that many other empirically supported rating scales have been published in the nearly 25 years since the IOWA was first developed (see McMahon and Frick 2005; Pelham et al. 2005a, b for a review). The continued use of the IOWA is particularly surprising considering that many of the more recent rating scales for assessing externalizing behavior in children have the advantage of being indexed to the current version of the Diagnostic and Statistical Manual of Mental Disorders (American Psychiatric Association 2000) whereas the IOWA does not. Despite this important disadvantage, the IOWA continues to be widely used because it provides a number of relatively unique advantages not offered by many other rating scales.

First, the IOWA can be freely used by anyone with appropriate qualifications, whereas many other rating scales for evaluating IO and OD behaviors in children require a fee for each administration. Second, the IOWA is exceptionally brief. This makes it especially useful in situations when assessments are being administered repeatedly within short periods of time (e.g., in treatment studies), when conducting screening assessments on a large number of children, or when a large number of constructs are being measured at once. Third, the IOWA has been widely and consistently used in the same form over a number of years and across a wide variety of studies, providing the ability to compare results of current research and clinical efforts with past efforts in a relatively direct and meaningful way.

At the same time, there are at least three disadvantages associated with the IOWA rating scale. First, in contrast to the considerable body of evidence supporting the psychometric properties of the IOWA when completed by teachers, little or no research has examined the psychometric properties of the scale when completed by mothers. This is a serious limitation because many children are referred for treatment by their mothers and because mother ratings are often used to evaluate treatment response (e.g., Elgar et al. 2003; Pelham et al. 2005a, b; Remschmidt et al. 2005). Second, the cross-informant equivalence of mother and teacher use of the IOWA is unknown. In general, the correlation between parent and teacher ratings of externalizing is low (average r = 0.21, as per Achenbach et al. 1987), despite the fact that both provide valid information about externalizing behavior in children (e.g., Hart et al. 1994; Loeber et al. 1989). An improved understanding of cross-informant agreement would better inform questions about whether discrepancies between parent and teacher reports of child behavior using the IOWA reflect true behavior differences that occur across home and school contexts or instead reflect differential functioning of the scale across informants. Third, there is growing consensus that hyperactive/impulsive (HY) and inattentive (IN) behaviors are conceptually and empirically distinct (e.g., Barkley 1997; Lahey et al. 1988; Milich et al. 2001; Pillow et al. 1998), but the IO scale of the IOWA Conners conflates these two dimensions of behavior.

The purpose of this study was to examine the factor structure of the IOWA when employed by mothers and teachers and to examine the psychometric properties and performance of the factors. More specifically, five hypotheses were examined. First, it was hypothesized that a three factor model of the IOWA (consisting of hyperactive/impulsive, inattentive, and oppositional defiant factors) would provide a better fit to the data than the more commonly used two factor model (consisting of an inattentive-overactive-impulsive (IO) and oppositional defiant (OD) factors). Second, it was hypothesized that the factor structure of the IOWA would be equivalent across mothers and teachers. Third, it was hypothesized that both the three factor and two factor conceptualizations of the IOWA would demonstrate adequate reliability (internal consistency and test-retest) for both mothers and teachers. Fourth, it was hypothesized that factors would differ as a function of both grade and gender, as has been found in previous research examining teacher ratings on the IOWA (Pelham et al. 1989). Fifth, it was hypothesized that screening cutoffs based on the 90th percentile of the distribution would be useful for identifying children meeting diagnostic criteria for ADHD and ODD, as per the full set of DSM-IV symptoms and corresponding impairment criteria.

Method

Participants

Participants were mothers and/or teachers of 1,554 children who were students in one of seven public schools in a single school district in northern Nova Scotia, Canada. The participating school district included 58 elementary schools which served approximately 13,000 elementary students (Nova Scotia Department of Education 2003). The students ranged in age from 5 to 12 (mean = 8.13; SD = 1.94) and consisted of 809 boys (52.1%) and 745 girls (47.9%). Ethnic and racial information of participants was not collected (at the request of the school board), but the schools served communities that were over 95% Caucasian (Nova Scotia Department of Finance 2003). Of the 1,554 students, complete teacher data on the IOWA were available for 1,517 students, complete mother data on the IOWA were available for 728 students, and complete teacher and mother data on the IOWA were available for 711 students.

Measures

IOWA Conners Rating Scale

The IOWA Conners Rating Scale (Loney and Milich 1982; Pelham et al. 1989) consists of ten items (see Table 1) each of which is evaluated using a four point Likert scale with the following anchors: not at all (0); just a little (1); pretty much (2); and very much (3). The first five items on the IOWA are designed to measure inattentive-impulsive-overactive (IO) behaviors and the second five items are designed to measure oppositional-defiant (OD) behaviors. Included in the IO scale are three items (items 1, 2 and 3) that measure hyperactive/impulsive behaviors and two items (items 4 and 5) that measure inattentive behaviors. The IOWA was completed by mothers and teachers.

Assessment of Disruptive Symptoms DSM-IV (ADS-IV)

The ADS-IV is a rating scale designed to measure ADHD and ODD (as defined in the DSM-IV) in elementary school children (Waschbusch et al. 2003). The majority of items on the ADS-IV are symptoms of ADHD and ODD taken directly from the DSM-IV, with minor wording changes to make them more concise and appropriate for a rating scale format. Specifically, the word “often” was removed from all symptoms, as others have done (Burns et al. 1997), and some items were simplified (e.g., “Often blurts out answers before questions have been completed” was changed to “Blurts out answers before questions have been finished”). Each symptom is rated on a 0 (“much less than other children”) to 4 (“much more than other children”) Likert scale to indicate the extent to which the child expresses the symptom relative to others of the same age and gender. Symptoms were counted as present if they were rated 3 (“more than other children”) or 4 (“much more than other children”).

Following the nine ADHD-inattention symptoms, impairment resulting from symptoms is assessed by asking teachers and parents to rate the degree to which inattention symptoms cause the child problems, with possible responses ranging from (0) “no problems” to (4) “very severe problems.” Teachers rate the extent to which inattention caused problems at school, whereas parents rate the extent to which inattention caused problems at home, school, or other places (e.g., playground). Parents also estimate the age of onset of inattention problems. These questions were also asked following the nine ADHD-hyperactive/impulsive symptoms and following the eight ODD symptoms. The ADS-IV was requested from mothers and teachers. Complete teacher ratings on the ADS-IV were available for 1,482 children (97.7% of the children with complete teacher ratings on the IOWA), and complete mother ratings on the ADS-IV were available for 708 children (97.3% of the children with complete mother ratings on the IOWA).

The ADS-IV was used to form ADHD and ODD groups by applying DSM-IV symptom count and impairment criteria. Children were assigned to the ADHD group if: (a) they were rated as having at least six ADHD-inattention symptoms and were rated as having severe or very severe impairment from the inattention symptoms; (b) they were rated as having at least six ADHD-hyperactive/impulsive symptoms and were rated as having severe or very severe impairment from hyperactive/impulsive symptoms; or (c) they met both a and b criteria. Similarly, children were assigned to the ODD group if they were rated as having at least four ODD symptoms and were rated as having severe or very severe impairment from ODD symptoms. These categorical scores have been shown to be significantly associated with similar scores derived from other rating scales (the Disruptive Behavior Disorders Rating Scale; Pelham et al. 1992) and from structured diagnostic interviews (i.e., the Diagnostic Interview Scale for Children; NIMH-DISC Editorial Board 1999), as reported elsewhere (Waschbusch et al. 2007, 2003, 2006).
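The grouping rules above can be sketched in code. The following Python sketch is illustrative only and is not the authors' scoring program: the function and key names are hypothetical, and it assumes that impairment ratings of 3 and 4 correspond to "severe" and "very severe," since the text gives only the 0 and 4 anchors of the impairment scale.

```python
# Sketch of the ADS-IV grouping rules described in the text.
# Assumption: impairment ratings of 3 and 4 mean "severe" and "very severe";
# the source states only the 0 ("no problems") and 4 ("very severe") anchors.
SYMPTOM_PRESENT_MIN = 3     # symptom ratings of 3 or 4 count as present
SEVERE_IMPAIRMENT_MIN = 3   # assumed coding of "severe" impairment


def count_present(ratings):
    """Number of symptoms rated at or above the presence threshold."""
    return sum(1 for r in ratings if r >= SYMPTOM_PRESENT_MIN)


def meets_criteria(ratings, impairment, min_symptoms):
    """Symptom-count criterion and impairment criterion must both hold."""
    return (count_present(ratings) >= min_symptoms
            and impairment >= SEVERE_IMPAIRMENT_MIN)


def classify(inattention, in_impair, hyperactive, hy_impair, odd, odd_impair):
    """Return ADHD/ODD group flags for one child from one informant."""
    adhd_in = meets_criteria(inattention, in_impair, 6)   # 9 symptoms rated
    adhd_hy = meets_criteria(hyperactive, hy_impair, 6)   # 9 symptoms rated
    return {
        "ADHD": adhd_in or adhd_hy,
        "ADHD_inattentive_or_combined": adhd_in,
        "ADHD_hyperactive_or_combined": adhd_hy,
        "ODD": meets_criteria(odd, odd_impair, 4),        # 8 symptoms rated
    }
```

For example, a hypothetical child with six inattention symptoms rated 3 or higher and an inattention impairment rating of 3 would be placed in the ADHD group under these assumptions, whereas three ODD symptoms would not be enough for the ODD group regardless of impairment.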

Procedure

Data were collected as part of the pre-intervention evaluation of the Behavior Education Support and Treatment (BEST) school intervention program. The BEST project was designed to prevent and treat disruptive behavior in elementary school settings using behavioral strategies delivered at universal, targeted, and clinical levels (see Waschbusch et al. 2005 for details). All procedures and measures were approved by a university Human Ethics Review Board and by the participating schools and school district and informed consent was obtained from parents and teachers. Schools were recruited by contacting principals and providing them information about the project through presentations and written materials. Principals then met with their staff and subsequently contacted the project coordinator if their school wished to participate. Seven schools volunteered to participate. Three of these seven schools were randomly assigned to the intervention condition and the remaining four participated as controls.

Approximately 6 weeks after the start of the school year, but before the start of the intervention, teachers and parents of students in participating schools were asked to complete a packet of rating scales, including those used in this study. All raters were told that completing the ratings was voluntary and that their responses would be confidential. Homeroom teachers were given the option of taking an in-service day to complete ratings on students in their homeroom classrooms, and all teachers (n = 66) elected to do so. Therefore, ratings by homeroom teachers were completed for nearly all students in participating schools, although teachers of 37 students (approximately 2.4% of the sample of 1,554) returned packets without complete data on the primary measure of interest in this study (the IOWA), and these students were therefore excluded from the analyses involving teacher ratings. Mothers of 728 children (46.8% of the eligible sample) returned packets with complete IOWA data. Of the 728 children with complete mother ratings, 711 (97.7%) also had complete teacher ratings. The remaining 826 mothers (53.2%) did not return complete IOWA ratings and were excluded from analyses of mother ratings. This rate of parental participation is consistent with other school-based assessments (e.g., Airaksinen et al. 2004; Frissell et al. 2004; Gottfredson and Gottfredson 2001; McGrew and Gilman 1991; Wolraich et al. 2004). Comparison of students with and without complete mother ratings showed that groups did not differ on age, sex, or teacher ratings of behavior.

These procedures were repeated at the end of the school year, approximately seven months later. The ratings collected at the end of the school year (time 2) were used to estimate test-retest reliability. Only students in control schools were used in test-retest analyses to remove any confounding effects of the intervention. There were 706 students in control schools with teacher IOWA data at time 1. Of these, 640 (90.7%) also had teacher IOWA data at time 2. There were 357 students in control schools with mother IOWA data at time 1. Of these, 182 (51.0%) also had mother IOWA ratings at time 2.

Data Analysis

This study was organized around five research questions that correspond with the hypotheses described earlier. The first question investigated the dimensionality of the IOWA Conners items. Specifically, the relative fit of two- and three-factor models was tested, both of which were specified a priori. The second question investigated whether the psychometric properties of the IOWA Conners were equivalent across informants. The third question investigated the test-retest reliability of the IOWA Conners scale scores. The fourth question provided preliminary normative data for scale scores by child gender and grade, separately for mother and teacher informants, including 90th percentile cutoff scores for purposes of screening. The fifth question evaluated the proposed screening cutoffs by comparing them to parent and teacher ratings of DSM-IV symptoms.

The first two questions were addressed using confirmatory factor analyses (CFAs), while the remaining three questions were addressed using univariate statistics, including bivariate correlations, as well as computation of recommended diagnostic statistics (Kessel and Zimmerman 1993). All CFA models were fit in Mplus version 4.00 with robust weighted least squares (WLSMV) estimation to accommodate the fact that IOWA items were rated using a four-point Likert scale, with the majority of respondents relying on the two lowest categories (Muthén et al. 1997). A recent simulation study indicated that the WLSMV estimation outperformed standard WLS estimation for the types of data, models, and sample sizes that are considered here (Flora and Curran 2004). Given the known dependence between sample size and chi-square model fit statistics, the comparative fit index (CFI), root mean squared error of approximation (RMSEA), and the weighted root mean square residual (WRMR) were used as indices of absolute model fit. Values of CFI ≥ 0.95, RMSEA ≤ 0.05, and WRMR ≤ 1.0 are indicative of good model fit, whereas values of CFI between 0.90 and 0.95, RMSEA between 0.06 and 0.07, and WRMR 1.0–1.5 indicate only moderately good fit, for the models and sample sizes that are considered here (Hutchinson and Olmos 1998; Yu 2003). More central to this research was the relative fit of two versus three factor models. Questions of relative fit were addressed using scaled chi-square difference tests (see Appendix 5 in Muthén and Muthén 2004; Satorra and Bentler 1999). The reader should appreciate that the degrees of freedom for WLSMV models are estimated (see Muthén et al. 1997 for elaboration).

Results

Factor Structure

Bivariate Correlations

Polychoric correlations for the 10 IOWA Conners items are summarized in Table 1, with mother-reported items below the diagonal and teacher-reported items above the diagonal.

Mother Ratings

A synopsis of fit for all CFA models is provided in Table 2. In terms of absolute fit, the two- and three-factor models fit the data moderately well. In terms of relative fit, the three-factor model provided a better fit to the data than did the two-factor model, χ²(2) = 57.6, p < 0.0001.

Table 2 Synopsis of model fit for confirmatory factor analyses

Teacher Ratings

A parallel set of CFA models was estimated for teachers. In terms of absolute fit, the two-factor model fit the data rather poorly while the three-factor model fit the data moderately well. In terms of relative fit, the three-factor model provided a better fit to the data than did the two-factor model, χ²(2) = 113.5, p < 0.0001.

Simultaneous Mother and Teacher Ratings

A final set of two- and three-factor CFA models were simultaneously estimated for the 711 children who had complete mother and teacher data on the IOWA (i.e., a total of four factors, two for each informant, versus a total of six factors, three for each informant, were simultaneously estimated). In terms of absolute fit, the two-factor model fit the data moderately well while the three-factor model now provided a good fit to the data, as indexed by all three fit indices. In terms of relative fit, the WLSMV-scaled chi-square difference tests once again confirmed that the three-factor model fit better than did the two-factor model, χ²(6) = 82.1, p < 0.0001.

Informant Equivalence

Having established that a three-factor model provided superior fit to mother- and teacher-reported data, both alone and in combination, the next test examined whether individual IOWA items were related to their respective latent factors in an identical manner across informants. This was accomplished by re-estimating the three-factor model simultaneously for both informants with equality constraints imposed on all factor loadings. Whereas establishing that a similar number of factors underlie both mother- and teacher-reported data is known as configural invariance, the test of equivalence of factor loadings across informants is known as a test of weak invariance (Meredith 1993). Although this model appeared to fit the data reasonably well (see Table 2), a scaled chi-square difference test indicated that the constrained model fit the data significantly worse than did the model in which factor loadings were free to vary across informants, χ²(6) = 28.1, p = 0.0001. Inspection of individual factor loadings across informants did not indicate any items that were noticeably different across informants. This suggested that the significance of the scaled chi-square difference tests was a function of the large sample size (and hence high power). Thus, a final three-factor model was tested, simultaneously for mother and teacher informants, and only required that a single item for each factor (item 2: Hums and makes other odd noises; item 5: Fails to finish things; item 8: temper outbursts) take on equal factor loadings. This is known as partial measurement invariance and is the minimally sufficient condition for establishing that a scale works equivalently across groups (Byrne et al. 1989). The fit of this model was once again consistent with the previous two models (see Table 2). More importantly, this restricted model fit equally well as the model without constraints on factor loadings (i.e., partial measurement invariance was supported), χ²(3) = 1.7, p = 0.63.

As a result of establishing partial measurement invariance across raters, cross-informant correspondence for the hyperactive-impulsive (HY), inattentive (IN), and oppositional defiant (OD) scores was examined using latent correlations, which were not attenuated by measurement error. As is summarized in Table 3, the latent and observed correlations between factors within informant and across informants were significantly different from zero. The relative improvement in cross-informant agreement using latent versus observed correlations is also evident by comparing corresponding values above and below the diagonal.

Table 3 Cross informant correlations

Computing Scale Scores

The preceding results demonstrated that the IOWA Conners items are best conceptualized as resulting from three factors (IN, HY, OD). This differs from current practice where IN and HY scores are aggregated to form a single inattentive-overactive scale (IO). Given the widespread use of the IOWA IO scale, it was included in all of the remaining analyses, along with the IN, HY, and OD scales that were supported by CFA models. Moreover, because a majority of the people who use the IOWA Conners will do so using the observed scores, the remaining analyses are based exclusively on scale scores that represent the sum of the observed items (not latent variables). More specifically, based on the a priori model and accepted use of the IOWA, an IO score was computed by summing together items 1 through 5 and an OD score was computed by summing together items 6 through 10 (see Table 1 for items and item numbers). In addition, based on CFA results, the IO score was further divided into a hyperactive/impulsive (HY) score by summing items 1, 2 and 3, and into an inattention (IN) score by summing items 4 and 5. Means and standard deviations for these scales are summarized in Table 4.
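The scoring rules in the preceding paragraph can be sketched as follows. This is a minimal illustration rather than an official scoring program; the function name and the dictionary keys are hypothetical, and the input is assumed to be the ten item ratings (0 to 3) in the item order given in Table 1.

```python
def iowa_scale_scores(items):
    """Sum observed item ratings into the IOWA scale scores.

    items: the ten ratings (each 0-3) in item order 1..10.
    """
    if len(items) != 10 or any(r not in (0, 1, 2, 3) for r in items):
        raise ValueError("expected ten ratings, each coded 0, 1, 2, or 3")
    hy = sum(items[0:3])     # items 1-3: hyperactive/impulsive
    inatt = sum(items[3:5])  # items 4-5: inattentive
    od = sum(items[5:10])    # items 6-10: oppositional-defiant
    return {"HY": hy, "IN": inatt, "IO": hy + inatt, "OD": od}


# Hypothetical ratings for one child, for illustration only.
print(iowa_scale_scores([1, 2, 0, 3, 1, 0, 0, 1, 2, 0]))
```

Because IO is the sum of HY and IN, the traditional two-scale scoring can always be recovered from the three-scale scoring, which preserves comparability with earlier studies.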

Table 4 Means, standard deviations (SD), and reliability coefficients for IOWA scales as a function of informant

Reliability

Mother Ratings

Coefficient alpha was used to evaluate the internal consistency and Pearson correlations were used to evaluate test–retest reliability of the IOWA scales as completed by mothers. Results are summarized in Table 4.

Teacher Ratings

The same reliability procedures were used to evaluate reliability of IOWA scales as completed by teachers. Results are summarized in Table 4.
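As a rough sketch of the two reliability statistics named above, coefficient alpha and a Pearson test-retest correlation can be computed as shown below. The function names are hypothetical and the code uses only the Python standard library; it illustrates the standard formulas and is not the software used in the study.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Coefficient alpha for one scale.

    rows: one list of item ratings per child; every list has the same
    length k (k >= 2), and the scale total must vary across children.
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson_r(x, y):
    """Pearson correlation, e.g., time 1 versus time 2 scale scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

Note that alpha for the two-item IN scale depends only on the correlation between items 4 and 5 and their variances, so it is expected to be lower than alpha for longer scales with comparable inter-item correlations.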

Gender and Grade Differences

Mother Ratings

Data from 728 mothers were used to examine grade and gender differences using a series of 2 (gender) × 3 (grade: K/1 vs 2/3 vs 4/5/6) ANOVAs. As summarized in Table 5, there was a significant main effect of gender for each scale, but no main effects or interactions involving grade. Examination of means and standard deviations for the gender main effect (see Table 5) showed that boys had higher scores than girls on every scale, with differences characterized by small to medium effect sizes (Cohen’s d: IO = 0.29; OD = 0.20; HY = 0.24; IN = 0.31).

Table 5 ANOVA results, means, standard deviations (SD), and cutoffs for identifying clinically elevated scores on the IOWA Conners

Teacher Ratings

Grade × Gender ANOVAs were also used to examine data from the 1,517 teacher ratings. As summarized in Table 5, there were significant main effects of gender for each scale, and a significant main effect of grade for HY and IN. Examination of means and standard deviations for the gender main effect (see Table 5) showed that boys had higher scores than girls on each scale, with differences characterized by small to medium effect sizes (Cohen’s d: IO = 0.38; OD = 0.26; HY = 0.36; IN = 0.35). Tukey honestly significant difference (HSD) post hoc tests of the main effect of grade for HY showed that children in grades K/1 had significantly higher HY scores than children in other grades, but other groups did not differ (Grades K/1: mean = 2.10, SD = 2.25; Grades 2/3: mean = 1.67, SD = 2.20; Grades 4/5/6: mean = 1.75, SD = 2.00). These differences were characterized by small effect sizes (Cohen’s d: K/1 vs 2/3 = 0.20; K/1 vs 4/5/6 = 0.17; 2/3 vs 4/5/6 = −0.05). Tukey HSD tests of the main effect of grade for IN showed that children in grades 4/5/6 had significantly higher IN scores than children in grades 2/3, but other groups did not differ (Grades K/1: mean = 1.43, SD = 1.63; Grades 2/3: mean = 1.40, SD = 1.73; Grades 4/5/6: mean = 1.67, SD = 1.74). These differences were characterized by small effect sizes (Cohen’s d: K/1 vs 2/3 = 0.02; K/1 vs 4/5/6 = −0.14; 2/3 vs 4/5/6 = −0.16).
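For readers who wish to check effect sizes of this kind, the sketch below computes Cohen's d from group means and standard deviations. The function name is hypothetical. Because the group sizes are not given in the text, the default branch assumes equal group sizes, so its results only approximate the reported values (which were presumably computed from unrounded statistics and the actual group sizes).

```python
from math import sqrt


def cohens_d(mean1, sd1, mean2, sd2, n1=None, n2=None):
    """Standardized mean difference using a pooled standard deviation.

    If group sizes are omitted, equal group sizes are assumed (an
    assumption of this sketch, not something stated in the text).
    """
    if n1 is None or n2 is None:
        pooled_var = (sd1 ** 2 + sd2 ** 2) / 2
    else:
        pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)


# Teacher-rated HY, grades K/1 versus grades 2/3, using the rounded
# means and SDs reported in the text.
d = cohens_d(2.10, 2.25, 1.67, 2.20)
print(round(d, 2))
```

With the rounded descriptive statistics and the equal-group-size assumption, this comparison comes out close to the reported value of 0.20.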

Screening Cutoffs

The 90th percentile was selected as the criterion for identifying children with elevated scores. In other words, children in the upper 10% of the distribution were identified as having significantly elevated scores. The 90th percentile was selected to provide a reasonable estimate of the prevalence of externalizing behavior problems while minimizing the chance of making false negative errors (i.e., failing to identify children who should be identified), which are arguably the most serious type of error when conducting screening assessments. Cut scores were computed separately for each scale and separately for boys and girls. The proposed scale cutoffs, defined as the minimal score to identify children at or above the 90th percentile, are summarized in Table 5 for mother and teacher ratings.
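One way to operationalize such a cutoff is sketched below. The text does not spell out how ties in the discrete score distribution were handled, so this sketch assumes the cutoff is the smallest observed score that at least 90% of children fall strictly below; the function names and the data layout are hypothetical.

```python
def percentile_cutoff(scores, percentile=0.90):
    """Smallest observed score that at least `percentile` of scores fall below.

    Returns None if no observed score satisfies the rule (an assumption
    of this sketch; the source does not describe tie handling).
    """
    n = len(scores)
    for candidate in sorted(set(scores)):
        below = sum(1 for s in scores if s < candidate)
        if below / n >= percentile:
            return candidate
    return None


def cutoffs_by_sex(records, percentile=0.90):
    """records: iterable of (sex, score) pairs for one scale and informant."""
    by_sex = {}
    for sex, score in records:
        by_sex.setdefault(sex, []).append(score)
    return {sex: percentile_cutoff(vals, percentile)
            for sex, vals in by_sex.items()}
```

Because scale scores are integers with many ties, the proportion of children at or above a cutoff chosen this way will usually be somewhat below 10% rather than exactly 10%.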

Evaluation of Screening Cutoffs

Overview

The proposed cutoffs for the IOWA scales were examined using a series of 2 × 2 chi-square analyses to compare the IOWA groups (IO, OD, HY, IN) to comparable ADS-IV groups. A variety of recommended diagnostic evaluation statistics were also computed from these analyses, including: sensitivity, specificity, positive predictive power, negative predictive power, overall percent correct, and kappa (Kessel and Zimmerman 1993). Only children with complete data on both the IOWA and on the ADS-IV were used for these analyses (mother ratings n = 708, teacher ratings n = 1,482).
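The diagnostic statistics listed above all derive from the four cells of a 2 × 2 table crossing IOWA group (elevated vs not) with ADS-IV group (present vs absent). The sketch below shows the standard formulas; the function name, argument names, and the example counts are hypothetical and are not taken from Tables 6 or 7.

```python
def diagnostic_stats(tp, fp, fn, tn):
    """Diagnostic statistics from a 2 x 2 screening table.

    tp: elevated on IOWA and in the ADS-IV group (true positives)
    fp: elevated on IOWA but not in the ADS-IV group (false positives)
    fn: not elevated on IOWA but in the ADS-IV group (false negatives)
    tn: not elevated on IOWA and not in the ADS-IV group (true negatives)
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column margins.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_power": tp / (tp + fp),
        "negative_predictive_power": tn / (tn + fn),
        "percent_correct": observed,
        "kappa": (observed - expected) / (1 - expected),
    }


# Hypothetical counts, for illustration only.
stats = diagnostic_stats(tp=30, fp=20, fn=10, tn=140)
print({name: round(value, 3) for name, value in stats.items()})
```

When the criterion group is rare, as diagnostic groups typically are in community samples, negative predictive power and overall percent correct tend to be high almost by default, which is why kappa, a chance-corrected index, is reported alongside them.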

Mother Ratings

The 2 × 2 chi-square analyses showed: (a) IOWA IO (no vs yes) was significantly related to ADS-IV ADHD-any subtype (no vs yes), χ²(1) = 247.59, p < 0.001; (b) IOWA OD (no vs yes) was significantly related to ADS-IV ODD (no vs yes), χ²(1) = 223.26, p < 0.001; (c) IOWA HY (no vs yes) was significantly related to ADS-IV ADHD hyperactive/impulsive/combined types (no vs yes), χ²(1) = 136.12, p < 0.001; (d) IOWA IN (no vs yes) was significantly related to ADS-IV ADHD inattentive/combined types (no vs yes), χ²(1) = 160.45, p < 0.001. Diagnostic statistics (see Table 6) showed that the specificity (true negatives) was higher than the sensitivity (true positives) and the negative predictive power was higher than the positive predictive power for each IOWA scale. These results indicate that the selected cutoffs are more accurate at ruling out children who do not have IO, OD, HY or IN problems than they are at correctly identifying children who do have such problems. This is a good characteristic of a screening device.

Table 6 Number (percent) of children in IOWA groups and ADS-IV groups as rated by mothers

Teacher Ratings

The 2 × 2 chi-square analyses showed: (a) IOWA IO (no vs yes) was significantly related to ADS-IV ADHD-any subtype (no vs yes), χ²(1) = 471.37, p < 0.001; (b) IOWA OD (no vs yes) was significantly related to ADS-IV ODD (no vs yes), χ²(1) = 450.81, p < 0.001; (c) IOWA HY (no vs yes) was significantly related to ADS-IV ADHD hyperactive/impulsive/combined types (no vs yes), χ²(1) = 396.91, p < 0.001; (d) IOWA IN (no vs yes) was significantly related to ADS-IV ADHD inattentive/combined types (no vs yes), χ²(1) = 529.65, p < 0.001. Diagnostic statistics (see Table 7) showed that the specificity (true negatives) was higher than the sensitivity (true positives) and the negative predictive power was higher than the positive predictive power for each IOWA scale except the OD scale, which had high sensitivity and specificity. These results indicate that the selected cutoffs are more accurate at ruling out children who do not have IO, OD, HY or IN problems than they are at correctly identifying children who do have such problems. This is a good characteristic of a screening device.

Table 7 Number (percent) of children in IOWA groups and ADS-IV groups as rated by teachers

Discussion

This study examined the factor structure and psychometric properties of mother and teacher ratings on the IOWA Conners Rating Scale. Based on the a priori model of the IOWA and on previous research and theory, a two factor model was tested, which consisted of an inattentive-impulsive-overactive (IO) factor and an oppositional-defiant (OD) factor, and a three factor model was tested in which the IO factor was further divided into a hyperactive/impulsive (HY) factor and an inattentive (IN) factor. Confirmatory factor analysis was used to test the hypothesized models and found that the three factor model provided the best absolute fit to the observed data and showed a significantly better fit relative to the two-factor model. These results are consistent with current diagnostic formulations of ADHD and ODD and with a large body of research suggesting that oppositional defiant, hyperactive/impulsive, and inattentive behaviors are independent but correlated behaviors (Hinshaw 1987; Waschbusch 2002).

Confirmatory factor analyses also demonstrated that the IOWA functions equivalently when completed by mothers and by teachers. This has commonly been assumed to be true, but this assumption has rarely been tested. For both informants the three factor model was the most well supported, and the items appear to relate to the factors in an equivalent way. Further, the IOWA scales were highly reliable for both mothers and teachers.

Analyses of grade and gender differences showed that scores on each of the IOWA scales showed small to moderate differences between boys and girls but showed small to nonsignificant differences between grades. This pattern was found for both mother ratings and teacher ratings. It is not surprising that boys had higher scores on the IOWA than did girls as this is consistent with a large body of other research (e.g., Gaub and Carlson 1997; Keenan et al. 1999; Waschbusch et al. 2006). The finding that neither IO nor OD ratings differed as a function of grade, and that the grade differences for HY and IN were very small, is consistent with other studies that suggest IO and OD behaviors tend to be relatively stable over the course of elementary school years (Cohen et al. 1993; DuPaul et al. 1997; Pelham et al. 1992). Alternatively, this pattern may indicate that the informants adjust their thresholds for using the Likert response scales as a function of the age of the child that they are rating (e.g., parents and teachers may use different criteria for determining what constitutes “pretty much” fidgeting for first versus fourth graders).

Based on these findings, cutoff scores for identifying children with elevated IOWA scores were proposed, with separate cutoffs computed for boys and girls. Analyses to evaluate the proposed cutoffs showed that they performed reasonably well, in that they were significantly related to diagnoses based on parent and teacher ratings of the full range of DSM-IV symptoms for ADHD and ODD, including evidence of symptom-related impairment. However, the proposed cutoffs come from a largely Caucasian population of students who resided in non-urban settings. The generalizability of these cutoffs to other racial groups and/or geographic regions is uncertain.

Overall, this study makes a number of contributions that have considerable practical significance. First, the results show that the IOWA Conners has good psychometric properties when used by mothers and teachers. This is the first study to evaluate the psychometric properties of the instrument when used by mothers. Second, contrary to common use, this study indicated that the items on the IOWA IO scale are better represented by two separate scales, one measuring inattentive behavior and a second measuring hyperactive/impulsive behavior. This method of scoring the IOWA is consistent with recent factor analytic work and with the current diagnostic formulation of ADHD. Third, the proposed cutoff scores hold considerable promise for using the IOWA Conners as an initial screening measure for identifying children with externalizing behavior problems, although further research on their validity is certainly needed.

There were several limitations of this study. First, although racial information was not collected on individual children, the sample consisted largely of teachers and mothers of middle-class, rural, Caucasian children. This reflected the communities the schools served, but generalization to other populations should be done cautiously. Second, it is possible that children of mothers who chose not to participate differ systematically from children of mothers who did participate in the study. For instance, it may be that some mothers elected not to participate because their children have especially high levels of disruptive behavior to which they did not wish to call attention. Alternatively, it may be that mothers elected not to participate because they did not feel the rating scale was relevant to their child. As described earlier, comparison of students with and without complete mother ratings suggested this is not the case, as they did not differ on demographic measures (age, sex) or on teacher ratings of behavior. Even so, previous research suggests that mothers who participate in school-based research may differ from those who do not (Anderman et al. 1995), but how these findings apply to the present study is unclear. Third, it may be that schools that chose not to participate differed systematically from those that did participate. While the schools included in the study appeared to be similar to other schools in the district, data to evaluate this empirically are not available. Fourth, although the analyses provided good support for distinguishing OD, HY and IN behaviors, both within and across informants, as well as strong indication of good internal consistency and retest reliability, absolute model fit was modest in some cases (see Table 2). These findings were likely due in part to large sample sizes (yielding high power to identify misfit), but the modest fit may also reflect imperfections in the construction of the IOWA. For example, the IOWA was designed to yield an overall inattentive/impulsive/overactive (IO) scale even though more recent theory and research suggest that inattentive behaviors should be distinguished from hyperactive/impulsive behaviors (e.g., Lahey et al. 1988; Milich et al. 2001; Pillow et al. 1998). Although the results supported distinguishing inattentive and hyperactive/impulsive behaviors as indexed by the IOWA Conners, the ability of this study to do so was limited by the low number of items used to measure these behaviors (3 HY and 2 IN). Moreover, the phrasing of some of the OD items (e.g., acts ‘smart’, quarrelsome) may not be interpreted in the same way across informants. Nonetheless, the practical advantages of a free and brief screening measure for disruptive behavior outweigh many of these concerns.

In sum, these results indicate that the IOWA Conners is a psychometrically sound measure of externalizing behavior when completed by either mothers or teachers. Consistent with modern conceptualizations of the disruptive behavior disorders, a model that distinguished inattentive, hyperactive/impulsive, and oppositional behavior best represented the observed data. Nonetheless, there was modest support for the current practice of using IO and OD scales. Users should appreciate that this ‘traditional’ scoring method confounds two distinct dimensions of behavior. This may be an important limitation in some situations (e.g., in studies interested in identifying children with different subtypes of ADHD) but may not be important in others (e.g., in medication trials involving children who are already diagnosed with combined type ADHD, or for screening purposes where the goal is to differentiate pure ADHD from ADHD with co-occurring conduct problems). More generally, the results of this study, in combination with the fact that the IOWA Conners is an exceptionally brief measure that is in the public domain, provide a strong basis for its continued use in both clinical and research settings.
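
The difference between the two scoring methods can be illustrated with a minimal sketch. It is not the authors' scoring procedure: the item positions assigned to each scale and the 0 to 3 response coding are assumptions made for illustration (only the split of 3 HY, 2 IN, and 5 OD items comes from the text), and the function name `score_iowa` is hypothetical. The actual item assignments should be taken from Table 1.

```python
from typing import Dict, List, Sequence

# ASSUMPTION: hypothetical item positions (0-based) for each scale.
# Only the item counts (3 HY, 2 IN, 5 OD) come from the text; which
# items belong to which scale must be checked against Table 1.
SCALE_KEY: Dict[str, List[int]] = {
    "HY": [0, 1, 2],
    "IN": [3, 4],
    "OD": [5, 6, 7, 8, 9],
}


def score_iowa(ratings: Sequence[int]) -> Dict[str, int]:
    """Sum item ratings into HY, IN, and OD scores, plus the traditional IO score.

    ASSUMPTION: each of the 10 items is coded 0-3 on the Likert response scale.
    """
    if len(ratings) != 10:
        raise ValueError("expected 10 item ratings")
    if any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("each rating must be an integer from 0 to 3")

    scores = {
        scale: sum(ratings[i] for i in items)
        for scale, items in SCALE_KEY.items()
    }
    # Traditional scoring collapses HY and IN into a single IO total.
    scores["IO"] = scores["HY"] + scores["IN"]
    return scores


# Two hypothetical children with identical IO totals but opposite profiles.
mostly_hyperactive = score_iowa([3, 3, 0, 0, 0, 0, 0, 0, 0, 0])
mostly_inattentive = score_iowa([0, 0, 0, 3, 3, 0, 0, 0, 0, 0])
```

Under these assumptions both children receive the same IO score, although one is elevated only on HY and the other only on IN, which is the confound the traditional scoring method cannot reveal.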

Additional research further evaluating the IOWA would benefit its many users. Of particular importance is research using longitudinal designs to examine age changes in the IOWA, as well as research examining the validity of the IOWA scales and of the proposed screening cutoffs. Research aimed at developing an updated version of the IOWA also seems warranted. Much progress has been made in conceptualizing disruptive behavior in children in the more than 20 years since the IOWA was first developed. There is now consensus that inattentive behaviors are conceptually and empirically distinct from overactive-impulsive behaviors, as discussed earlier, but this distinction was not part of the a priori model of the IOWA. While the results of this study show that the distinction between inattentive and overactive-impulsive behavior can be made using the current version of the IOWA, it is likely that a clearer distinction between the constructs could be found if the items comprising the IOWA were updated with this purpose in mind. This would greatly benefit researchers and clinicians seeking to conduct brief evaluations of externalizing behavior in children.