Introduction

Autism spectrum disorders (ASD) are pervasive neurodevelopmental disorders characterized by deficits in social communication and the presence of restricted and repetitive behaviors or interests. Intellectual disability (ID) is characterized by significant impairments in cognitive functioning, including reasoning, problem solving, and abstract thinking, as well as deficits in adaptive behavior, including conceptual, social, and practical skills (APA 2000, 2013). ASD and ID can co-occur; recent prevalence estimates suggest that approximately half of individuals with ASD have an IQ in the average or above average range (Elsabbagh et al. 2012). Individuals diagnosed with ASD or ID often experience co-occurring emotional and behavioral problems. This includes symptoms of co-occurring psychiatric disorders, such as anxiety or mood disorders and attention-deficit hyperactivity disorder (ADHD), as well as other behavioral problems including irritability and aggression (e.g., Einfeld et al. 2006; Lecavalier 2006). Individuals with ASD or ID also often show deficits across a wide range of social skills, including difficulty interpreting or responding to social cues, avoiding eye contact, difficulty engaging in back-and-forth conversation, limited use of non-verbal behaviors including facial expression and gestures, difficulties with turn-taking or sharing, and poor conflict resolution skills (e.g., de Bildt et al. 2005).

Multi-Informant Agreement

When assessing psychological functioning, which includes emotional and behavioral problems and social skills, the use of multiple informants is critical to obtain an accurate and comprehensive picture of the individual. In fact, this is considered a “gold standard” in the assessment of psychopathology in children and adolescents (e.g., Mash and Hunsley 2005). The importance of using multiple informants lies in the fact that certain behaviors or symptoms may be absent or present depending on the environmental context, thus limiting the ability of a single informant to accurately report on these behaviors and symptoms (Achenbach et al. 1987; De Los Reyes 2011). Additionally, reports are influenced by informant biases, attributions, expectations, and standards. Finally, informants may differ in terms of how often they interact with or observe the child, and how their presence impacts the child’s behavior, all of which could contribute to discrepancies in information provided by different informants (De Los Reyes 2011; Hoyt 2000).

Agreement among informants has been widely studied in typically developing (TD) youth. Achenbach et al.’s (1987) seminal meta-analysis of 119 studies on informant agreement of behavioral and emotional problems showed that pairs of similar informants, such as two parents, demonstrated higher agreement on the Achenbach System of Empirically Based Assessment (ASEBA) rating scales (\( \bar{r} \) = .60) than pairs of different informants, such as a parent and teacher (\( \bar{r} \) = .28), or than the child him or herself with another informant (\( \bar{r} \) = .22). Across all pairs of raters, agreement was stronger for externalizing problems (\( \bar{r} \) = .41) compared to internalizing problems (\( \bar{r} \) = .32). Additionally, agreement among informants was significantly higher when assessing children aged six to eleven (\( \bar{r} \) = .51) than when assessing adolescents (\( \bar{r} \) = .41). Child gender and clinical status as well as the gender of the parent informant did not impact the level of agreement.

A more recent meta-analysis (Duhig et al. 2000) provided similar results regarding maternal and paternal ratings of internalizing and externalizing problems in TD children and adolescents. Based on the results of 60 studies, parents showed stronger agreement for externalizing problems (\( \bar{r} \) = .66) than internalizing problems (\( \bar{r} \) = .46). For both internalizing and externalizing problems, parental agreement was greater in adolescence (internalizing \( \bar{r} \) = .45; externalizing \( \bar{r} \) = .63) than in early (internalizing \( \bar{r} \) = .12; externalizing \( \bar{r} \) = .47) or middle childhood (internalizing \( \bar{r} \) = .28; externalizing \( \bar{r} \) = .55), which contrasts with the findings of Achenbach et al. (1987).

Lastly, Renk and Phares’ (2004) meta-analysis of 74 studies of TD youth showed that agreement on social competence among pairs of different informants (\( \bar{r} \) ranging from .21 to .39 across rater pairs) was equivalent to that of similar informants (\( \bar{r} \) ranging from .36 to .48), which contrasts with Achenbach et al.’s (1987) results showing higher agreement among similar informants. Agreement between parent- and child-report was greatest during middle childhood, whereas agreement between peer- and child-report as well as between teacher- and peer-report was greatest during adolescence.

The Current Study

Rating scales are frequently used when assessing emotional and behavioral problems and social functioning in youth with ASD or ID. There is currently limited information regarding informant agreement on these scales for youth with ASD or ID. Thus, the current study focuses on informant agreement on behavioral and emotional problems and social skills in youth with ASD or ID using a meta-analytic strategy. It is the first such study in the field of developmental disabilities. As compared to TD youth, agreement among parents and teachers was hypothesized to be higher for youth with ASD or ID due to language and cognitive deficits that would lead informants to rely more on observable behaviors. However, it was hypothesized that agreement between self-report and other informants would be lower than for TD youth because these same language and cognitive deficits may impair the ability of individuals with ASD or ID to accurately report on their own functioning. Despite the limited published research focusing exclusively on informant agreement in ASD or ID, this information is often included in the context of other studies. Following a comprehensive literature search, meta-analytic methods were used to determine the average agreement among pairs of informants, such as parent and teacher or parent and child, as well as across similar (e.g., parent and parent) and different (e.g., parent and teacher) rater pairs. Moderators of the level of agreement, including the youth’s diagnosis (ASD vs. ID), age, and IQ, were also investigated.

Methods

Literature Search

The PsycInfo Database was searched for relevant articles. We used a total of 34 search terms. Examples of search terms included “Agreement,” “Concordance,” “Interrater,” “Informant” as well as the name of popular rating scales (e.g., Aberrant Behavior Checklist, Child Behavior Checklist) and authors known to have published in this area. Studies were considered for inclusion provided that they were: (a) Published in an academic journal between 2000 and April 2014, (b) Written in English, (c) Focused on emotional or behavioral problems or social skills, (d) Used rating scales completed by multiple informants, (e) Reported a statistic reflecting within-subjects agreement, and (f) Had samples consisting of children with ASD or ID. Any subset of ages, through age 22, was considered for inclusion. In terms of diagnosis, ASD diagnoses included Autistic Disorder, Asperger’s Disorder and Pervasive Developmental Disorder Not Otherwise Specified per DSM-IV-TR criteria. Samples that included both children with and without ASD or children with and without ID were considered for inclusion provided that demographic information and effect sizes were reported separately so that only information pertinent to the subsample of youth with ASD or ID could be included in the meta-analysis.

A total of 4,979 abstracts were generated with these searches. These abstracts were reviewed for the six inclusionary criteria listed above. The majority of abstracts excluded at this point of the literature search had samples of adults or youth with other diagnoses, did not use rating scales, or only utilized one informant. If it was not clear from an abstract whether the study met inclusionary criteria (e.g., not specifying who completed rating scales or what measures were used), the article was retrieved for further review. A total of 310 of the articles were retrieved based on appearing to meet criteria for inclusion, with 49 being eligible for inclusion in the meta-analysis. As seen in Fig. 1, the most common reasons for exclusion were lack of necessary statistical information (e.g., only reporting means and standard deviations or only reporting significant correlations), using only one informant or a different type of informant (e.g., clinicians), having a sample that was not comprised entirely of children with ASD or ID, using an assessment tool other than rating scales (e.g., interviews, observations), or not reporting on emotional or behavioral problems or social skills. Other reasons for exclusion included ratings that were collected at different time points (e.g., parent ratings collected 2 years after teacher ratings) and missing information, such as the relationship of informants to the child.

Fig. 1 Literature search. ASD autism spectrum disorder, ID intellectual disability

Calculation of Effect Sizes

Using the 49 selected articles, a total of 107 effect sizes were identified. The authors independently reviewed the measures used in the 49 selected articles and classified the measures’ subscales as externalizing problems, internalizing problems, or social skills based on the content of the measures and subscales. The authors only disagreed on the classification of a minority of the subscales (approximately 10 %), and discussed these disagreements to reach consensus. The majority of the disagreements were on subscales assessing peer relationships (e.g., social problems on the ASEBA scales). Given that the content of these subscales could reflect social skills, externalizing behavior, or both, the authors ultimately decided not to include these subscales in the meta-analysis. The classification of these measures and subscales can be seen in Table 1, along with the demographic information for each sample and the calculated effect sizes. Consistent with other published meta-analyses (Achenbach et al. 1987; Duhig et al. 2000; Renk and Phares 2004), effect sizes for each cross-informant pair and behavior category were treated independently. However, while some studies reported only one effect size within a behavior category for each informant pair, several studies reported multiple effect sizes within a behavior category for one informant pair (e.g., reporting parent–teacher agreement separately for ADHD and oppositional defiant disorder, both externalizing problems). Including multiple effect sizes in the same behavior category for the same rater pair from the same study would violate the independence assumption, thus possibly inflating the sample size of the statistical tests and effect sizes beyond what is actually included in the meta-analysis (Wolf 1986). 
Therefore, when studies reported agreement among the same pair of informants for multiple behaviors on the same rating scale that would fall within one behavior category (externalizing problems, internalizing problems, or social skills), the effect sizes were averaged. When studies included multiple effect sizes from different rating scales that would fall within one category, the effect size from the more widely used measure was selected. For example, Ozsivadjian et al. (2013) reported correlations for SCAS Total Anxiety as well as for the total score on the Children’s Depression Inventory (CDI); as the CDI was not used in any other studies included in the meta-analysis and the SCAS was used in another study (Farrugia and Hudson 2006), it was the correlation for the SCAS that was included in the meta-analysis. Lastly, three studies reported parent–child and parent–teacher correlations separately for mothers and fathers (Baker et al. 2007; Kalyva 2010; van Steensel et al. 2013). To be consistent with other studies included in the meta-analysis, the correlations using mothers were used for the meta-analysis because parent respondents in other studies were 80–90 % mothers.
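The averaging rule for multiple effect sizes from one study can be illustrated with a minimal Python sketch. The study labels, rater pairs, and correlation values below are hypothetical and are not taken from Table 1. Averaging the raw correlations (rather than their z-transformed values) is also an assumption, as the scale on which the averaging was done is not specified above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (study, rater pair, behavior category, correlation).
records = [
    ("Study A", "parent-teacher", "externalizing", 0.30),  # e.g., an ADHD subscale
    ("Study A", "parent-teacher", "externalizing", 0.40),  # e.g., an ODD subscale
    ("Study A", "parent-teacher", "internalizing", 0.20),
    ("Study B", "parent-child", "internalizing", 0.35),
]


def collapse(records):
    """Average correlations that share the same study, rater pair, and category.

    Each study then contributes at most one effect size per rater pair and
    behavior category, which keeps the pooled effect sizes independent.
    """
    groups = defaultdict(list)
    for study, pair, category, r in records:
        groups[(study, pair, category)].append(r)
    return {key: mean(values) for key, values in groups.items()}


collapsed = collapse(records)
for key, r in collapsed.items():
    print(key, round(r, 3))
```

In this sketch, the two externalizing correlations from the hypothetical Study A are reduced to a single value, so that study is counted once for that rater pair and category.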

Table 1 Studies included in meta-analysis

The most commonly reported statistics were Pearson correlations and intra-class correlations (ICC). Paired sample t-tests were reported in two studies; these statistics were converted to Pearson correlations using the formula suggested by Rosenthal (1991). One study reported ANOVA results; this F-value was converted to a Pearson correlation using the formula suggested by Rosenthal (1991). To reduce the skew of the distribution of correlations, Pearson correlations and ICC were converted to a z score using Fisher’s z transformation (Fisher 1938), and standard errors were calculated for these effect size estimates. While some have argued that this transformation of the ICC can be biased in terms of the probability estimates, this bias is significantly reduced when the ICC represents the correlation between two groups rather than several groups (McGraw and Wong 1996). As there is currently no other identified way to convert ICC to a common effect size metric and all ICC included in the meta-analysis are of only two groups, Fisher’s z transformation was used in order to include these studies in the meta-analysis.
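As a rough illustration of these conversions, the following Python sketch applies commonly cited forms of the t-to-r and F-to-r conversions together with Fisher’s z transformation. The specific formulas, the function names, and the standard error of 1/sqrt(n − 3) are assumptions based on standard practice for Pearson correlations. They are not confirmed as the exact computations used in this meta-analysis, and the standard error for a transformed ICC may differ.

```python
import math


def r_from_t(t, df):
    """Assumed conversion: r = sqrt(t^2 / (t^2 + df)); returns magnitude only."""
    return math.sqrt(t**2 / (t**2 + df))


def r_from_f(f, df_error):
    """Assumed conversion for an F with one numerator df: r = sqrt(F / (F + df_error))."""
    return math.sqrt(f / (f + df_error))


def fisher_z(r):
    """Fisher's z transformation: z = 0.5 * ln((1 + r) / (1 - r)) = atanh(r)."""
    return math.atanh(r)


def fisher_z_se(n):
    """Large-sample standard error of z for a Pearson r from n pairs (assumed form)."""
    return 1.0 / math.sqrt(n - 3)


def z_to_r(z):
    """Back-transform a z value to the correlation metric."""
    return math.tanh(z)


# Hypothetical example: a correlation of .36 from a sample of 28 rater pairs.
z = fisher_z(0.36)
se = fisher_z_se(28)
print(round(z, 4), round(se, 4), round(z_to_r(z), 2))
```

Working on the z scale and back-transforming at the end keeps averages from being distorted by the skewed sampling distribution of large correlations.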

Planned Analyses

Similar to the meta-analyses conducted by Achenbach et al. (1987), Duhig et al. (2000), and Renk and Phares (2004), findings from the included studies were combined to determine the degree of correspondence within each cross-informant pair for the three behavior categories (externalizing problems, internalizing problems, social skills). As in Achenbach et al. (1987), additional average effect sizes were also calculated for all behaviors, pairs of similar raters (e.g., parent–parent), pairs of different raters (e.g., parent–teacher, parent–child), and all raters. To control for sample size, effect sizes were weighted using the inverse of the squared standard error (as suggested by Rosenthal 1991), and weighted mean effect sizes were calculated in addition to unweighted mean effect sizes. To determine the degree of heterogeneity of effect sizes contributing to a composite effect size, Hedges’ test for homogeneity (Hedges’ Q test) was used (Hedges and Olkin 1985); a statistically significant Q indicates significant heterogeneity among the effect sizes contributing to the mean weighted effect size.
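The weighting and homogeneity test described above can be sketched in a few lines of Python. This is a minimal fixed-effect illustration under stated assumptions: the function name and input values are hypothetical, and the weight n − 3 follows from assuming a standard error of 1/sqrt(n − 3) for each z-transformed correlation.

```python
import math


def weighted_mean_and_q(correlations, sample_sizes):
    """Return (weighted mean r, Q statistic, degrees of freedom).

    Each correlation is converted to Fisher's z and weighted by the inverse
    of its squared standard error, assumed here to be n - 3.
    """
    zs = [math.atanh(r) for r in correlations]
    weights = [n - 3 for n in sample_sizes]  # 1 / SE^2 with SE = 1 / sqrt(n - 3)
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    # Q sums the weighted squared deviations of each z from the weighted mean.
    q = sum(w * (z - mean_z) ** 2 for w, z in zip(weights, zs))
    return math.tanh(mean_z), q, len(zs) - 1


# Hypothetical effect sizes from three studies.
mean_r, q, df = weighted_mean_and_q([0.25, 0.40, 0.55], [40, 80, 120])
print(round(mean_r, 3), round(q, 2), df)
```

Under the null hypothesis of homogeneity, Q is compared against a chi-square distribution with k − 1 degrees of freedom, where k is the number of contributing effect sizes.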

Potential moderators were considered to explain heterogeneity of the effect sizes. For categorical moderators (diagnosis, and age and IQ ranges), and to compare effect sizes across raters, behavior categories, and children with ASD versus ID, analyses based on the principle of ANOVA were used, as suggested by Hedges (1982) and Lipsey and Wilson (2001). As suggested by Card (2012), for continuous moderators (average age and IQ), regression analysis was used with the effect size (as a z score) as the outcome variable and the potential moderator as a predictor. Weighted least squares regression was used to give studies with larger sample sizes more weight in the regression model, with the inverse of the squared standard error serving as the weight for each study.
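Both moderator approaches can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the exact procedure of the cited sources: the function names and all input values are hypothetical, the between-group Q is computed from weighted group means, and the regression returns an unstandardized intercept and slope rather than a standardized β.

```python
import math


def q_between(groups):
    """Between-group Q from a dict mapping group name to (z, weight) pairs."""
    summaries = []
    for pairs in groups.values():
        w_sum = sum(w for _, w in pairs)
        summaries.append((sum(w * z for z, w in pairs) / w_sum, w_sum))
    grand = sum(m * w for m, w in summaries) / sum(w for _, w in summaries)
    q = sum(w * (m - grand) ** 2 for m, w in summaries)
    return q, len(groups) - 1


def p_value_df1(q):
    """Upper-tail chi-square probability for 1 degree of freedom (two groups)."""
    return math.erfc(math.sqrt(q / 2))


def wls_fit(xs, zs, ws):
    """Weighted least squares fit of z on a single moderator x.

    Returns (intercept, slope); weights are assumed to be 1 / SE^2.
    """
    w_total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_total
    z_bar = sum(w * z for w, z in zip(ws, zs)) / w_total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxz = sum(w * (x - x_bar) * (z - z_bar) for w, x, z in zip(ws, xs, zs))
    slope = sxz / sxx
    return z_bar - slope * x_bar, slope


# Hypothetical comparison of similar versus different rater pairs.
q, df = q_between({
    "similar": [(0.80, 60), (0.70, 40)],
    "different": [(0.35, 90), (0.30, 110)],
})
print(round(q, 2), df, p_value_df1(q) < 0.05)

# Hypothetical regression of z on the average age of each sample.
intercept, slope = wls_fit([6, 9, 12, 15], [0.30, 0.34, 0.41, 0.45], [50, 70, 40, 90])
print(round(intercept, 3), round(slope, 3))
```

With only two groups the between-group Q has one degree of freedom, which is why the closed-form p-value above suffices; comparisons among three or more groups would need a general chi-square routine.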

Results

Average Effect Sizes

Table 2 presents the mean unweighted and weighted correlations across the five rater pairs and three behavior categories. All mean weighted correlations were significantly greater than zero, and most demonstrated significant heterogeneity in the contributing effect sizes. Using Hedges’ Q test for between-group homogeneity, there were several significant differences among rater pairs for externalizing behavior: teacher–teacher agreement was significantly higher than parent–teacher (Q = 10.32, p < .001), parent–child (Q = 4.99, p = .03), and teacher–child agreement (Q = 6.77, p = .009); and parent–parent agreement was significantly higher than parent–teacher (Q = 35.99, p < .001), parent–child (Q = 19.16, p < .001), and teacher–child agreement (Q = 17.43, p < .001). Several significant differences across rater pairs were also found for internalizing behavior: parent–child agreement was significantly higher than parent–teacher agreement (Q = 20.83, p < .001); parent–parent agreement was significantly higher than parent–teacher (Q = 69.32, p < .001), parent–child (Q = 28.96, p < .001), and teacher–child agreement (Q = 26.69, p < .001); and teacher–teacher agreement was significantly higher than parent–teacher (Q = 7.22, p = .007) and teacher–child agreement (Q = 4.93, p = .03). Finally, for social skills, parent–parent agreement was significantly higher than parent–teacher agreement (Q = 4.89, p = .03).

Table 2 Average correlations across rater pairs and behavior categories for youth with ASD or ID

Table 3 presents the mean unweighted and weighted effect sizes across similar rater pairs (parent–parent and teacher–teacher), different rater pairs (parent–teacher, parent–child, and teacher–child), and all raters across the three behavior categories and the aggregate of all behaviors. All mean weighted correlations were significantly different from zero, and most of these correlations demonstrated significant heterogeneity in the contributing effect sizes. Using Hedges’ Q test for between group homogeneity, similar rater pairs showed higher agreement than different rater pairs for the aggregate of all behaviors (Q = 104.85, p < .001) and each of the three behavior categories (externalizing Q = 41.65, p < .001; internalizing Q = 59.40, p < .001; social skills Q = 5.30, p = .02). For all raters, agreement on externalizing problems was greater than internalizing problems (Q = 10.77, p = .001) and social skills (Q = 24.11, p < .001), and agreement on internalizing problems was higher than social skills (Q = 4.07, p = .04). For similar rater pairs, agreement on social skills was lower than either externalizing problems (Q = 7.79, p = .005) or internalizing problems (Q = 6.76, p = .009). For different rater pairs, agreement on externalizing problems was greater than agreement on internalizing problems (Q = 11.40, p < .001) or social skills (Q = 15.23, p < .001).

Table 3 Average correlations across similar, different and all raters for youth with ASD or ID

Moderators of Informant Agreement

Given the significant heterogeneity seen for the majority of the mean weighted correlations, several moderators were considered. These analyses are quite limited by the information (and lack thereof) reported in the published studies (as seen in Table 1), which limits the number of studies that can be included in the moderator analyses. Diagnosis was considered as a categorical moderator, comparing samples of ASD youth to samples of ID youth. There were no significant differences across youth with ASD and youth with ID for externalizing behavior or social skills. However, agreement on internalizing behavior was significantly higher for youth with ASD than youth with ID for all raters (ASD \( \bar{r} \) = .35; ID \( \bar{r} \) = .34; Q = 3.91, p < .05), similar rater pairs (ASD \( \bar{r} \) = .75; ID \( \bar{r} \) = .62; Q = 3.87, p < .05), and different rater pairs (ASD \( \bar{r} \) = .32; ID \( \bar{r} \) = .29; Q = 4.31, p = .04). For the aggregate of all behaviors, agreement among different rater pairs was higher for ASD youth (ASD \( \bar{r} \) = .334; ID \( \bar{r} \) = .328; Q = 3.87, p < .05) while agreement among all raters was significantly higher for ID youth (ASD \( \bar{r} \) = .35; ID \( \bar{r} \) = .38; Q = 5.74, p = .02).

Participant age was considered as both a categorical and a continuous moderator. As a categorical moderator, the age range of the sample was classified as preschool (age 5 and under), school-aged (age 5–12), or adolescent (age 12–21). The boundaries for these categories are arbitrary and we allowed the age range of a sample to fall 2 years outside the window (e.g., if the age range was 4–9 years, it was classified as school-age and if the age range was 10–16 years, it was classified as adolescent). As seen in Table 1, many studies assessed a broad range of ages (e.g., 3–21, 6–18) or did not report the age range of the sample and therefore could not be used in this moderator analysis. A total of five studies reported on preschool-aged samples, nine on school-aged samples, and 11 on adolescent samples. Given how few studies could be used in this analysis, it was not feasible to consider this moderator separately for similar and different rater pairs. For the aggregate of all behaviors and externalizing problems, there were no significant differences between the age categories. Informants showed higher agreement for internalizing problems in adolescents (\( \bar{r} \) = .36) than school-aged children (\( \bar{r} \) = .19; Q = 6.30, p = .01). Agreement among informants was greater for social skills in school-aged children (\( \bar{r} \) = .40) than adolescents (\( \bar{r} \) = .21; Q = 6.26, p = .01). As a continuous moderator, the average age of the sample was entered into the regression analysis. As seen in Table 1, six studies did not report the average age of the sample and could not be used in these analyses. Average age emerged as a significant moderator of informant agreement for pairs of different raters assessing internalizing problems (β = .38, p < .001), for pairs of similar raters assessing internalizing problems (β = −.65, p = .02), and for pairs of similar raters assessing the aggregate of all behaviors (β = −.43, p = .02).

Participant IQ was also considered as both a categorical and a continuous moderator. As a categorical moderator, the range of IQ for each study was categorized as falling in the ID range (below 70) or the non-ID range (above 70). The cutoff of 70 represents a rough boundary that varied by as much as 10 points (e.g., a sample with IQ all <75 was categorized in the ID range while a sample with IQ ranging from 66 to 133 was categorized in the non-ID range). Several studies included participants across the full range of IQ or did not report an IQ range, and thus could not be used in this moderator analysis. A total of five studies reported on samples in the ID range and fourteen studies reported on samples in the non-ID range. Given how few studies could be used in this analysis, it was not feasible to consider this moderator separately for similar and different raters. IQ did not emerge as a significant categorical moderator for the aggregate of all behaviors or for any of the three behavior categories. As a continuous moderator, the average IQ of the sample was entered into the regression analysis. As seen in Table 1, 23 studies reported an average IQ; thus, these are the only studies included in the moderator analyses. Average IQ emerged as a significant moderator for all raters assessing internalizing problems (β = −.33, p = .005) and for pairs of similar raters assessing all behaviors (β = −.83, p < .001).

Discussion

This study was the first to report on informant agreement on emotional and behavior problems and social skills in youth with ASD or ID using meta-analytic methods. The mean weighted effect size across all raters and all behaviors was .36, reflecting moderate agreement. However, consistent with meta-analyses investigating this in TD youth (Achenbach et al. 1987; Duhig et al. 2000; Renk and Phares 2004), pairs of informants demonstrated differing levels of agreement, and this also varied across externalizing problems, internalizing problems, and social skills. The mean weighted effect sizes across informant pairs ranged from .34 to .71 for externalizing problems, from .25 to .69 for internalizing problems, and from .27 to .47 for social skills. Pairs of similar raters (e.g., parent–parent) showed significantly higher agreement on externalizing problems, internalizing problems, social skills, and the aggregate of all behaviors when compared to pairs of different raters (e.g., parent–teacher, teacher–child), which is likely due to the fact that similar raters observe the child in similar contexts, thus reducing the likelihood of context-dependent differences in child behavior. With all rater pairs combined, agreement was significantly higher for externalizing problems (\( \bar{r} \) = .42) than either internalizing problems (\( \bar{r} \) = .35) or social skills (\( \bar{r} \) = .30), and agreement on internalizing problems was significantly higher than agreement on social skills.

Comparison to Informant Agreement for TD Youth

Presented in Table 4 are the mean weighted effect sizes reported by Achenbach et al. (1987), Duhig et al. (2000), and Renk and Phares (2004), as well as those found in the current study. Given that these meta-analyses of TD youth did not report confidence intervals for the mean weighted correlations, it is not possible to make direct statistical comparisons with the current meta-analysis. However, some discrepancies are noteworthy. The greatest discrepancy is seen for parent–parent agreement on internalizing problems, which showed a mean weighted correlation of .69 in this meta-analysis, representing a difference of .24 when compared to Duhig et al.’s (2000) results and a difference of .10 when compared to Achenbach et al.’s (1987) results. The current meta-analysis found a mean weighted correlation for parent–child agreement on social skills that was .15 higher than that reported by Renk and Phares (2004). Conversely, it found a mean weighted correlation for teacher–teacher agreement on externalizing problems that was .12 lower than that reported by Achenbach et al. (1987). Lastly, the current meta-analysis yielded a mean weighted correlation for parent–teacher agreement on social skills that was .11 lower and a mean weighted correlation for parent–parent agreement on social skills that was .11 higher than that reported by Renk and Phares (2004). These discrepancies may result from youth with ASD or ID relying more heavily on caregivers for social and emotional support, which may lead to greater caregiver awareness of their emotional and behavioral problems. It is also possible that agreement differs due to differences in the nature of these problems across TD and ASD or ID populations. For instance, because emotional and behavioral problems and social skills deficits are more prevalent in youth with ASD or ID, caregivers may focus more on these concerns.
Additionally, youth with ASD or ID may show a greater behavioral expression of internalizing problems, particularly anxiety, making these concerns more readily observable by caregivers (Ozsivadjian et al. 2013).

Table 4 Comparison with meta-analyses on cross-informant agreement in TD youth

While some discrepancies were observed, informant agreement in youth with ASD or ID is generally comparable to that reported in TD youth. Indeed, for youth with ASD or ID as well as TD youth, agreement among pairs of similar informants is greater than that of pairs of different informants for externalizing and internalizing problems. While this same pattern was observed in this meta-analysis for social skills in youth with ASD or ID, Renk and Phares (2004) did not find greater agreement among pairs of similar raters for social skills in TD youth. While Renk and Phares (2004) hypothesized that emotional and behavioral problems are much more salient than social skills, thus leading to greater informant agreement, it may be the case that social skills are more salient for youth with ASD or ID, thus leading to the higher agreement seen among similar rater pairs in the current meta-analysis.

Moderators of Informant Agreement

When considering diagnosis as a categorical moderator, few differences existed across youth with ASD and youth with ID. Agreement on internalizing behavior was significantly higher for youth with ASD across all raters as well as within similar raters and different raters. Additionally, for the aggregate of all behaviors, agreement among different rater pairs was higher for youth with ASD and agreement among all rater pairs was higher for youth with ID. However, the magnitude of these discrepancies was small, suggesting no meaningful practical difference. Additionally, the difference for the aggregate of all behaviors as assessed by all raters likely arose because 31 % of the contributing effect sizes for youth with ID were from similar raters while only 6 % were from similar raters for youth with ASD. As similar raters show higher agreement than different raters, this would lead to a higher mean weighted effect size for youth with ID. Overall, this suggests that agreement among pairs of informants is similar for youth with ASD and youth with ID, indicating that the use of multiple informants is equally important in each population in order to obtain a comprehensive description of psychological functioning.

When the average age of the sample was considered as a continuous moderator, agreement among similar raters on internalizing problems and the aggregate of all behaviors decreased as average age increased. It is possible that these behaviors may be more observable or more cross-situationally consistent for younger children. In contrast, as average age increased, agreement among different raters on internalizing problems also increased. It is possible that this increase in agreement among pairs of different raters reflects the fact that youth are able to more accurately complete self-report measures with increasing age. Slightly different patterns emerged when the age range of the sample was considered as a categorical moderator. Agreement among all pairs of informants was significantly higher for school-aged children as compared to adolescents when it came to social skills. Additionally, agreement on internalizing problems was significantly higher for adolescents as compared to school-aged children.

When the IQ range of the sample was considered as a categorical moderator, no significant differences emerged, indicating that agreement among informants was consistent for those in the ID range and those in the non-ID range. However, when the average IQ of the sample was considered as a continuous moderator, two significant relationships emerged: with increasing IQ, agreement among similar raters on the aggregate of all behaviors decreased, as did agreement among all raters on internalizing problems. For youth with borderline or below average IQ, emotional and behavioral problems may be more salient and more likely to be a topic of discussion among caregivers, which could lead to higher agreement. It is also possible that youth with lower IQ have less variability in their behavior across environments, which would lead to increased agreement among caregivers.

Moderator analyses yielded slightly different results when treating age and IQ as categorical or continuous variables. Considerably more studies were included when considering age and IQ as continuous moderators, thus increasing the power of these analyses. Additionally, analyses may be less precise when age and IQ are considered as categorical moderators because of the variability in the ranges used. Lastly, due to the limited number of studies that could be used in the categorical analyses, this relationship was only considered among all raters, rather than among similar and different raters separately.

Importance of Informant Agreement

The use of multiple informants in psychological assessment is critical in order to obtain a comprehensive picture of the individual’s functioning across environments. Parent–child and teacher–child agreement was similar across youth with ASD or ID, suggesting that youth contribute information that is different from that contributed by parents in the assessment of their own emotional, behavioral, and social functioning. Individuals with ASD or ID often recognize their difficulties in these areas of functioning and can contribute valid information (e.g., Douma et al. 2006; Emerson 2005; Knott et al. 2006; Lopata et al. 2010; van Steensel et al. 2013). In fact, given that the magnitude of informant agreement is similar to that observed in TD youth, we see no reason why the difficulties associated with the use of self-report by youth with ASD or ID would be different from those of their TD counterparts. The use of multiple informants may be even more important for this population, particularly youth with ASD, due to difficulties generalizing skills across contexts and settings.

Informant discrepancies in TD youth have been shown to map onto variations in behavior observed in the laboratory (De Los Reyes et al. 2009). Other studies have linked informant discrepancies to meaningful differences in behavior across contexts, including increased parent–teacher agreement on aggressive behavior as similarities in the social experience across home and school environments increase (Hartley et al. 2011). Informant discrepancies are additionally predictive of long-term outcomes. In TD youth, greater discrepancies between parent and child report of psychopathology have been shown to be predictive of poorer treatment outcomes 16 weeks–4 years later (Ferdinand et al. 2006; Panichelli-Mindel et al. 2005), poorer young adult outcomes 4 years later (Ferdinand et al. 2004), and lower parental involvement in therapy over the course of 14 months (Israel et al. 2007). Across studies, informant discrepancies are most predictive of long-term outcomes when the discrepancies are larger. Finally, the use of different informants in outcome studies can lead to different conclusions across studies. Examining these outcome patterns across studies can help to identify hypotheses regarding treatment effects (De Los Reyes 2011).

Limitations

The main limitation of this study, as with most meta-analyses, lies in the fact that data were taken from previously published literature. This introduces the “file drawer problem,” which suggests that there may be a publication bias in that unpublished studies not included in the meta-analysis might show different results than the published studies (Rosenthal 1991). However, in this case, funnel plots of the included studies were symmetrical, suggesting that there is limited publication bias for this meta-analysis. The possible analyses used in the meta-analysis are also limited by the published information. This impacted the calculation of mean weighted effect sizes when few studies reported on agreement between a particular pair of raters for one of the behavior categories (e.g., only one study reported on teacher–teacher agreement on social skills), thus limiting the interpretability of these results. This also affected the moderator analyses. For example, only 23 studies reported average participant IQ and thus could be included in this moderator analysis; the results may differ if all included studies had reported IQ. Due to the limitations of the published data, moderator analyses could not be considered separately for youth with ASD and youth with ID. This is especially relevant when considering the IQ analyses, which may be biased due to lower IQ in youth with ID as compared to youth with ASD. Because of this, the results of the IQ moderator analyses should be interpreted with caution. Additionally, there was variability in the magnitude of informant agreement across the various rating scales included in the meta-analysis. While we investigated this, there were not enough effect sizes for each measure to conduct separate analyses to investigate this variability in a meaningful way.
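To illustrate why a mean weighted effect size based on very few studies is difficult to interpret, the following is a minimal sketch of one common way to pool correlations: each r is Fisher z-transformed, weighted by n − 3, averaged, and back-transformed. This fixed-effect approach is an assumption made for illustration; the function name and all values are hypothetical and do not reproduce the calculations or data of this meta-analysis.

```python
import math

def weighted_mean_r(studies):
    """Pool correlations with inverse-variance weights on the Fisher z scale.

    studies: list of (r, n) tuples, one per study.
    Returns (mean_r, ci_low, ci_high) with an approximate 95% confidence interval.
    """
    total_w = 0.0
    weighted_sum = 0.0
    for r, n in studies:
        w = n - 3                      # inverse of the sampling variance of Fisher z
        weighted_sum += w * math.atanh(r)
        total_w += w
    mean_z = weighted_sum / total_w
    se = 1.0 / math.sqrt(total_w)      # standard error of the pooled z
    low = math.tanh(mean_z - 1.96 * se)
    high = math.tanh(mean_z + 1.96 * se)
    return math.tanh(mean_z), low, high

# Hypothetical example: one study versus five studies reporting the same agreement.
one_study = weighted_mean_r([(0.50, 40)])
five_studies = weighted_mean_r([(0.50, 40)] * 5)
print(one_study)
print(five_studies)
```

In this sketch, the pooled estimate is the same in both cases, but the confidence interval based on a single hypothetical study is considerably wider than the interval based on five, which mirrors the limited interpretability of rater pairs represented by only one study.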

Finally, highly heterogeneous variables, such as agreement among various pairs of informants on different measures for samples with varying demographic characteristics, are combined into one mean effect size, and meaningful information may be lost in the process (Rosenthal 1991). It is possible that the included studies did not utilize samples of youth with only ASD or ID, which may increase the heterogeneity of effect sizes. For example, some youth included in ID samples may have also had ASD or other co-occurring diagnoses. Moderators were investigated to explain this heterogeneity, but it is possible that other important moderators were not considered in these analyses. For example, in TD youth, informant agreement is impacted by factors such as the child’s social desirability, parental psychopathology, parental stress, and parental acceptance of the child (for review, see De Los Reyes and Kazdin 2005). However, information about these potential moderators was not reported in studies included in this meta-analysis; thus they could not be considered in these analyses. Additionally, due to the limitations of the published data, it was not possible to consider all moderators in one model; thus the potential interaction of these moderators could not be evaluated. For example, it is possible that there are diagnostic or IQ differences for specific age groups that may not have been identified when considering these as separate moderators.

Conclusions

This meta-analysis suggests that agreement among informants on behavioral and emotional problems and social skills in youth with ASD or ID is similar to that observed in TD youth. Overall, agreement falls in the moderate range, with higher agreement seen in pairs of similar raters than different raters. Agreement on externalizing problems is greater than agreement on internalizing problems or social skills. Several factors appear to moderate the level of agreement among informants, including the youth’s diagnosis, age, and IQ. These results highlight the need to use multiple informants when assessing psychological functioning in youth with ASD or ID. Each informant provides different but important information, and this is critical to obtain a comprehensive picture of the individual.

Future research should examine the extent to which informant discrepancies map onto observed behavior variations, similar to what has been examined in TD samples. Additionally, potential moderators of informant agreement should be investigated further. In addition to the moderators examined here, communication skills, ASD symptom severity, and adaptive behavior may be of particular importance to individuals with developmental disabilities. A further look into patterns in ratings from different informants, such as whether mothers provide higher ratings of behavior problems than teachers or vice versa, might also be particularly useful. Importantly, there is a need to evaluate the utility of informant discrepancies in the developmental disabilities population, including whether considering these discrepancies leads to the development of more meaningful treatment goals and whether they are predictive of treatment or other long-term outcomes. There is also a need to examine the role informant discrepancies play in the assessment of other domains of psychological functioning, such as adaptive behavior.